Job Description

Realize your potential by joining the leading performance-driven advertising company!

As a Site Reliability Engineer on the IT Production team in our Tel Aviv office, you’ll play a vital role in building robust services and solving infrastructure challenges through automation, working with cutting-edge technologies and pushing them to their limits on our mostly on-prem, cloud-like infrastructure.


To thrive in this role, you’ll need:

  • 4+ years of experience in software development with a proven track record of designing and developing internal tools, automation frameworks and platform components in large-scale distributed production environments, with a focus on Linux operating systems.

  • Deep, demonstrable expertise in one of the following programming languages: Golang, C, Rust, Python or Java.

  • Experience in observability tooling development, specifically implementing custom metrics, tracing and logging within application code.

  • Practical understanding of the HTTP protocol (including HTTP methods, status codes and headers). Proven ability to design, implement and instrument robust internal APIs (e.g., using REST or gRPC).

  • Understanding of Linux operating system internals: kernel configuration, system calls, process management, memory and I/O.

  • Proven ability to troubleshoot and optimize performance bottlenecks under heavy load using advanced monitoring and profiling tools for high-throughput and low-latency applications.

Bonus points if you have:

  • Experience as an SRE, DevOps Engineer or System Administrator in a large distributed environment with a focus on Linux operating systems.

As a Site Reliability Engineer, you’ll:

  • Ensure Reliability & Scalability: Design, implement and manage highly reliable and scalable distributed systems across our on-premise, cloud and AI/ML environments. Proactively optimize performance, efficiency, resource utilization and cloud cost.

  • Drive Automation: Automate repetitive tasks, infrastructure provisioning, configuration and deployments using IaC and scripting languages (e.g., Python, Go, Rust).

  • Develop Observability & Capacity: Implement comprehensive monitoring and alerting systems to ensure system health. Collaborate on capacity planning to meet future growth.

  • Maintain Security & Compliance: Integrate security best practices and ensure compliance with industry standards.

  • Lead Incident Management: Participate in on-call rotations, lead incident responses and conduct root cause analysis to minimize downtime.

  • Foster Collaboration & Improvement: Work closely with development, operations and security teams to drive shared responsibility and continuous improvement in SRE practices.

Our Tech Stack:


    Linux, Kubernetes, nginx, Istio, AWS, GCP, Azure, Alicloud, Fastly, Terraform, Consul, Prometheus, Loki, Grafana, Airflow, Redis, Kafka, Vector, Hadoop, Cassandra, Vertica, MySQL, HDFS, ELK.
