Job Description
Core Responsibilities
- Automate infrastructure provisioning, configuration, and maintenance using Terraform, Ansible, and Python.
- Build, enhance, and maintain CI/CD pipelines using Jenkins, GitHub Actions, or AWS CodePipeline for continuous delivery and consistency across environments.
- Implement and optimize monitoring solutions using Datadog, Prometheus, Grafana, and ELK/EFK stacks to ensure high service reliability.
- Develop alerting strategies and escalation paths aligned to service-level objectives (SLOs) and key performance indicators (KPIs).
- Build custom scripts and automation for patching, validation, and system health checks.
- Partner with U.S. SREs and Engineering teams on environment management, change control, and incident response improvements.
- Analyze logs and performance metrics to identify stability issues, optimize cloud costs, and drive continuous improvement.
- Maintain detailed runbooks, SOPs, and documentation supporting operational readiness and knowledge transfer.
- Contribute to open-source or internal tooling that enhances automation, monitoring, or observability capabilities.
- Conduct periodic reliability reviews, performance tests, and failover simulations to validate readiness.
- Support adoption of infrastructure-as-code, immutable environments, and container orchestration (Docker/Kubernetes).
- Promote DevOps and SRE best practices across the engineering organization.

Tools & Technologies
AWS (EC2, S3, Lambda, CloudWatch, IAM, RDS, ECS/EKS), Terraform, Ansible, Python, Bash, Jenkins, GitHub Actions, Docker, Kubernetes, Prometheus, Grafana, ELK/EFK, Loki, Jira, Confluence.

Qualifications
- 5–7 years in SRE, DevOps, or Infrastructure Engineering.
- Bachelor’s degree in computer science or related field of study preferred, or equivalent experience.
- Experience supporting U.S. healthcare or other regulated SaaS systems (HIPAA, SOC2, ISO27001).
- Strong scripting and automation (Ansible, Jenkins, Python, Bash, Terraform, CloudFormation).
- Understanding of CI/CD, networking, and secure cloud architecture.
- Proven collaboration with U.S. teams across time zones; clear written and spoken English.
- Familiarity with EHR, HL7/FHIR, or state/federal public health systems preferred.
- Knowledge of data privacy frameworks (HIPAA, HITRUST, GDPR) and ITIL-based change/incident management.

Work Model
- Aligns with U.S. Eastern hours for daily collaboration, stand-ups, and sprint planning.
- Documents work thoroughly to ensure audit readiness and operational transparency.
- Works closely with U.S. SRE leadership on automation priorities, sprint goals, and production readiness activities.

Soft Skills
- Analytical problem-solver with attention to detail.
- Self-driven, collaborative, and process-oriented.
- Excellent communication and time management across distributed teams.
- Passionate about automation, reliability, and continuous improvement.

Example Contributions
- Automated patching pipeline for pre-production validation of security updates.
- Designed Grafana dashboards reducing alert noise by 40%.
- Built Python scripts automating AWS cleanup, saving 15% cloud spend.
- Implemented environment consistency checks improving deployment success rates.
- Introduced CI/CD optimizations reducing release time by 25%.