Job Description

Realize your potential by joining the leading performance-driven advertising company!


As a Site Reliability Engineer on the IT Production team in our TLV office, you’ll play a vital role in building robust services and solving infrastructure challenges through automation, working with cutting-edge technologies and pushing them to their limits on our mostly on-prem, cloud-like infrastructure.


To thrive in this role, you’ll need:

  • 7 years of experience as an SRE, DevOps Engineer or System Administrator in a large distributed environment, with a focus on Linux operating systems.

  • Experience supporting, troubleshooting and scaling large distributed systems in production.

  • Deep understanding of the HTTP protocol, including HTTP/1.1, HTTP/2, caching semantics, TLS and gRPC delivery.

  • Experience configuring and operating CDN services (e.g., Akamai, Fastly, Cloudflare, AWS CloudFront).

  • Deep understanding of Linux system internals and system performance tuning.

  • Experience with configuration management and IaC tools (Puppet, Ansible, Chef, Terraform).

  • Experience programming in at least one of the following languages (Python, Golang, Rust, Ruby, C++, Java).

  • Experience with monitoring and metrics collection systems (Prometheus, Grafana, ELK).

  • Experience with cloud providers and platforms (AWS, Azure, GCP, Alibaba).

  • Experience with containerization technologies (Kubernetes, Docker).

  • Deep understanding of networking principles (TCP/IP, DNS, load balancing).

How you’ll make an impact:

As a Site Reliability Engineer, you’ll:

  • Ensure Reliability & Scalability: Design, implement and manage highly reliable and scalable distributed systems across our on-premises, cloud and AI/ML environments. Proactively optimize performance, efficiency, resource utilization and cloud costs.

  • Drive Automation: Automate repetitive tasks, infrastructure provisioning, configuration and deployments using IaC and scripting languages (e.g., Python, Go, Rust).

  • Develop Observability & Capacity: Implement comprehensive monitoring and alerting systems to ensure system health. Collaborate on capacity planning to meet future growth.

  • Maintain Security & Compliance: Integrate security best practices and ensure compliance with industry standards.

  • Lead Incident Management: Participate in on-call rotations, lead incident responses and conduct root cause analysis to minimize downtime.

  • Foster Collaboration & Improvement: Work closely with development, operations and security teams to drive shared responsibility and continuous improvement in SRE practices.

Our Tech Stack:

Linux, Kubernetes, nginx, Istio, AWS, GCP, Azure, Alicloud, Fastly, Terraform, Consul, Prometheus, Loki, Grafana, Airflow, Redis, Kafka, Vector, Hadoop, Cassandra, Vertica, MySQL, HDFS, ELK.
