Job Description

Role Overview:

You own day-2 service & runtime operations (availability, latency, incident response, release health, capacity, cost & compliance optimisation) for Litmus & Sentinel atop a managed EKS + IaC foundation. You turn operational signals (latency, error budgets, drift, saturation) into continuous improvement. You partner closely with the platform (EKS / Terraform) team, security, and data science to ensure resilience and compliant handling of regulated data, while reducing toil and configuration drift.

Job Responsibilities:

  • Design & own the service observability usage model: ensure all service metrics, logs, and traces flow into Elastic Cloud (the authoritative store); maintain dashboards & SLOs; evaluate pragmatic use of CloudWatch and AWS Managed Prometheus / Grafana for supplemental or fallback views.
  • Build proactive, noise-reduced alerting and incident response playbooks; drive post-incident RCA & remediation tracking (closure SLA).
  • Optimize service performance (pro...

Apply for this Position

Ready to join FPT Asia Pacific Pte Ltd? Submit your application to be considered for this role.