Job Description
Senior DevOps Engineer / Platform Reliability Lead
Experience: 10-12+ years
Location: Kolkata
Role Overview
We are seeking a Senior DevOps Engineer / Platform Reliability Lead who can take an end-to-end view of our systems, identify improvement areas across architecture, infrastructure, deployment pipelines, and reliability, and guide the platform toward higher scalability, stability, and operational maturity.
This role requires strong systems thinking, sound architectural judgment, and the ability to clearly call out risks and improvements.
Key Responsibilities
- Review the complete backend ecosystem (Node.js, Golang services, cloud infrastructure, CI/CD).
- Identify architectural, scalability, reliability, and security gaps following the in-house migration.
- Recommend and prioritise short-term fixes and long-term platform improvements.
- Own containerized infrastructure using Docker and Kubernetes in production.
- Design and maintain robust CI/CD pipelines with safe deployment and rollback strategies.
- Implement and improve monitoring, logging, alerting, and incident response practices.
- Define and track meaningful SLIs, SLOs, and error budgets.
- Prepare systems for OTT traffic spikes during releases and live events.
- Improve caching, queuing, and backend performance in collaboration with backend teams.
- Drive secure access, secrets management, and cloud cost optimisation.
- Act as a technical partner to backend, product, and leadership teams.
Required Technical Skills
Cloud & Infrastructure
- Strong experience with AWS (EC2, EKS/ECS, S3, RDS/DynamoDB, IAM)
- Docker and Kubernetes (production environments)
- Infrastructure as Code (Terraform preferred)
CI/CD & Operations
- GitHub Actions / GitLab CI / Jenkins
- Blue-green / canary deployments and rollback strategies
Backend Awareness
- Node.js (working understanding of Express / NestJS)
- Golang (microservices, concurrency, profiling basics)
Observability
- Prometheus, Grafana
- Centralised logging (ELK / OpenSearch / Loki)
- Distributed tracing (Jaeger / OpenTelemetry)
Data, Cache & Messaging
- Redis (cache and/or queues)
- Kafka / SQS / RabbitMQ (deep experience with at least one)
- MongoDB (understanding of NoSQL databases; experience with MongoDB Atlas is a bonus)
Security & Reliability
- Secrets management (Vault / AWS Secrets Manager)
- IAM and least-privilege access design
- Production incident handling experience
Personality & Mindset
- Strong ownership and accountability for platform reliability.
- Comfortable identifying what is wrong and explaining how to fix it.
- Calm and structured during incidents and high-pressure situations.
- Clear communication with engineers and non-technical stakeholders.
- Systems thinker who understands end-to-end impact, not just isolated components.
- Pragmatic, data-driven, and collaborative.
Reach out to: [email protected] / [email protected]