Job Description

To apply, complete your profile so that our Matching Team can review it. Take the assessment, and if you’re the right fit, we’ll reach out to schedule a conversation. Incomplete profiles are less likely to be matched.

About Torc

Torc is a career-first platform that connects top tech talent with meaningful job opportunities. More than just job matching, Torc is a community — offering events, livestreams, and exclusive content to help professionals grow, learn, and stay connected. When you join Torc, you’re not just finding a job — you’re joining a network that supports your career development every step of the way.

About the Role

We’re looking for a Senior DevOps Engineer with deep, hands‑on experience in Datadog to own and evolve observability across a complex, cloud‑native environment. In this role, you’ll focus on infrastructure, application, and cost observability — supporting large‑scale Kubernetes workloads, data platforms, and cloud services.

This role is ideal for an engineer who enjoys improving system reliability, performance visibility, and cost efficiency, and who thrives working across infrastructure, data, and application teams.

What you’ll do:

  • Design, build, and optimize Datadog dashboards, monitors, and alerting across Kubernetes (EKS), EC2, and cloud services.
  • Improve observability for large‑scale EKS clusters and high‑throughput workloads.
  • Implement monitoring and alerting for data platforms such as Databricks and Snowflake, tracking job performance, latency, and resource consumption.
  • Use Datadog to identify cost anomalies, orphaned resources, and usage spikes across cloud environments.
  • Improve signal‑to‑noise ratio in alerts to enable reliable incident detection and response.
  • Collaborate with frontend engineers to support end‑to‑end observability, including Datadog RUM.
  • Partner with platform, data, and infrastructure teams to establish best practices in observability and reliability engineering.

What we’re looking for:

  • Deep, hands‑on expertise with Datadog, including APM, Infrastructure Monitoring, Log Management, RUM, and Cloud Cost Management.
  • Strong experience with AWS, particularly EKS, EC2, and managed data services.
  • Experience monitoring Snowflake and Databricks environments.
  • Ability to read and understand frontend code (e.g., React) to trace end‑to‑end performance issues.
  • 5+ years of experience in DevOps, SRE, Platform Engineering, or related roles.
  • Strong analytical mindset with a focus on performance optimization, reliability, and cost efficiency.
  • Excellent communication skills and the ability to collaborate across multiple engineering disciplines.

Nice‑to‑Have:

  • Experience with multi‑cloud observability (AWS and Azure).
  • Familiarity with Kubernetes cost optimization and capacity planning.
  • Exposure to incident management, on‑call rotations, or SRE practices.

While we may not be able to respond to every applicant, your profile will remain in the Torc Talent Community, giving you access to future opportunities, events, and resources to grow your career.
