Job Description

About the Role:
We are seeking an experienced Site Reliability Engineer / Platform Engineer to join our team and help build and maintain a resilient, scalable infrastructure supporting our applications across multiple cloud providers. In this role, you will design and implement infrastructure solutions, automate operational processes, and work closely with development teams to ensure reliable, efficient systems that scale with our business.
What You'll Do:
- Design, build, and maintain infrastructure across AWS, GCP, and Azure using Infrastructure as Code (IaC) principles.
- Implement and optimize CI/CD pipelines using tools like Argo and CircleCI to enable rapid, reliable deployments.
- Manage and scale Kubernetes clusters in production environments, ensuring high availability and optimal resource utilization.
- Administer and optimize cloud databases including MongoDB, Redis, RDS, and other data stores for performance and reliability.
- Develop monitoring, alerting, and observability solutions to identify and resolve issues before they impact users.
- Automate routine operational tasks to reduce manual toil and improve system reliability.
- Conduct incident response and post-mortem analysis to drive continuous improvement.
- Collaborate with development teams to design systems with reliability, scalability, and operational excellence in mind.
- Document infrastructure architecture, runbooks, and operational procedures.
- Evaluate and implement new tools and technologies to improve platform capabilities.
What You'll Bring:
- 3+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
- Strong hands-on experience with at least two major cloud providers (AWS, GCP, Azure).
- Proficiency with Kubernetes for container orchestration and management.
- Demonstrated expertise with IaC tools (Terraform, CloudFormation, Pulumi, or similar).
- Experience with CI/CD platforms, particularly Argo and/or CircleCI.
- Solid understanding of database technologies including MongoDB, Redis, and relational databases (e.g., Amazon RDS).
- Proficiency in at least one programming or scripting language (Python, Go, Bash, TypeScript, etc.).
- Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK, CloudWatch).
- Experience implementing and managing OpenTelemetry (OTel) for distributed tracing, metrics, and logging.
- Strong understanding of networking, security, and infrastructure best practices.
Nice to Have:
- Experience managing multi-cloud or hybrid cloud environments.
- Familiarity with service mesh technologies (Istio, Linkerd).
- Knowledge of security hardening and compliance in cloud environments.
- Experience with cost optimization in cloud infrastructure.
- Contributions to open-source infrastructure or DevOps projects.
- Certifications from major cloud providers.
