Job Description
Role Overview:
You own day-2 service & runtime operations (availability, latency, incident response, release health, capacity, cost & compliance optimisation) for Litmus & Sentinel atop a managed EKS + IaC foundation. You turn operational signals (latency, error budgets, drift, saturation) into continuous improvement. Partner closely with the platform (EKS / Terraform) team, security, and data science to ensure resiliency and regulated data handling while reducing toil and configuration drift.
Job Responsibilities:
- Design & own the service observability usage model: ensure all service metrics, logs, and traces flow into Elastic Cloud (authoritative); maintain dashboards & SLOs; evaluate pragmatic use of CloudWatch and AWS Managed Prometheus / Grafana for supplemental or fallback views.
- Build proactive, noise-reduced alerting and incident response playbooks; drive post-incident RCA & remediation tracking (closure SLA).
- Optimize service performance (pro...
Company: FPT Asia Pacific Pte Ltd