Job Description
Role: Site Reliability Engineer
Location: Hyderabad
Notice Period: Immediate to 20 Days
Employment Type: Full Time
Experience
- 7–12 years of experience in site reliability, cloud-based data infrastructure, data pipeline observability, automation, and high-availability engineering within EdTech platforms (2U)
Primary Skills (Must-Have)
- AWS, CI/CD, Jenkins, Infrastructure as Code (IaC), Terraform, Kubernetes
Secondary Skills (Good-to-Have)
- AWS systems; Dataiku data platform updates and patching
Tools & Platforms
- Data Warehousing & Processing: Snowflake, Redshift, Apache Airflow, dbt
- CI/CD & Deployment: Jenkins, GitHub Actions, AWS CodePipeline, Terraform
- Cloud & Event Processing: AWS Lambda, API Gateway, SNS/SQS, Kafka, Step Functions
- Monitoring & Logging: Datadog, AWS CloudWatch, Prometheus, Splunk
- Incident Management: PagerDuty, Opsgenie, AWS Health Dashboard
- Collaboration & Code Review: GitHub, Jira, Confluence
Key Responsibilities
Data Pipeline Reliability & Observability:
- Maintain and optimize highly available, fault-tolerant infrastructure for data pipelines, ETL jobs, and real-time data processing
- Implement end-to-end monitoring of Airflow DAGs, Snowflake queries, and AWS-based data workflows
- Automate data pipeline health checks, error handling, and auto-remediation strategies
Infrastructure & Cloud Automation:
- Deploy and manage AWS-based data infrastructure using Terraform and CloudFormation
- Optimize Kubernetes (EKS) clusters for processing large-scale datasets and real-time analytics
- Ensure high availability and cost-efficient scaling for Redshift, Snowflake, and data storage solutions
Performance, Monitoring & Incident Response:
- Implement real-time monitoring, logging, and alerting using Datadog, AWS CloudWatch, and Prometheus
- Define and track SLOs, SLIs, and error budgets to improve data reliability and uptime
- Conduct Root Cause Analysis (RCA), security audits, and post-mortems for incidents
Security & Compliance:
- Ensure GDPR, CCPA, and SOC 2 compliance for data storage, access controls, and retention policies
- Implement AWS security best practices (IAM, KMS, Shield, WAF) to secure data access and encryption
- Secure API gateways, authentication mechanisms, and data lake permissions to prevent unauthorized access
Collaboration & Leadership:
- Work closely with data engineers, analytics teams, and DevOps engineers to enhance data platform reliability
- Participate in incident response drills, disaster recovery planning, and security compliance reviews
- Advocate for best practices in automation, cost optimization, and cloud-native data solutions