Job Description
Requirements
- Production experience in SRE, infrastructure, or operations for large-scale systems
- Strong programming/scripting skills (Python, Go, Java, or equivalent)
- Deep experience with containerization (Docker) and orchestration (Kubernetes or similar)
- Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
- Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
- Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
- Solid experience in capacity planning, performance tuning, scaling, and incident response
- Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
- Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
- Excellent communication, documentation, and cross-t...
Apply for this Position
Ready to join Tekshapers? Click the button below to submit your application.