Description:
Essential Responsibilities:
Take ownership of system performance monitoring, identify inefficiencies, and lead initiatives to improve the overall availability and reliability of digital platforms and applications.
Lead and manage the response to complex, high-priority incidents, ensuring prompt resolution and a thorough root cause analysis to prevent future occurrences.
Design and implement advanced automation frameworks to improve operational efficiency, streamline processes, and reduce human error.
Lead reliability-focused initiatives, ensuring systems are highly available, resilient, and scalable, and promote best practices across engineering teams.
Enhance the monitoring infrastructure by identifying key metrics, optimizing alerting, and improving system observability to ensure the reliability of large-scale systems.
Forecast resource requirements and lead capacity planning activities to ensure systems can scale effectively to meet growing user demand.
Ensure robust disaster recovery strategies are in place and conduct regular testing to ensure systems can recover quickly from failures.
Partner with engineering and product teams to identify opportunities for improving system architecture, focusing on scalability, reliability, and fault tolerance.
Provide mentorship and technical guidance to junior site reliability engineers, fostering skill development and knowledge sharing.
Drive continuous improvement across operational workflows, identifying areas for optimization, cost reduction, and performance enhancement.
Expected Qualifications:
3+ years of relevant experience and a Bachelor’s degree, OR any equivalent combination of education and experience.
Additional Responsibilities & Preferred Qualifications:
Key Responsibilities
Site Resiliency & Infrastructure Management
Proactively identify and address vulnerabilities in cloud (AWS, GCP, Azure) and on-premises infrastructure
Review Infrastructure as Code changes for reliability risks as part of the change approval process
Identify architectural anti-patterns in Kubernetes deployments and cloud migrations
Conduct regular disaster recovery drills and readiness tests before major events (Thanksgiving, Cyber 5, peak shopping seasons)
Participate in situation room activities for new product rollouts
Drive site resilience projects to enhance system reliability and uptime
Implement automated monitoring solutions to detect single points of failure
Lead new datacenter and CDN certification initiatives
Incident Management & Response
Act as incident commander with final decision authority: directing engineering teams, authorizing rollbacks, and commanding regional failovers
Direct application and infrastructure teams during incidents by making work assignments and prioritizing troubleshooting paths
Rapidly assess incidents by reading Infrastructure as Code (Terraform, CloudFormation), Kubernetes manifests, and CI/CD configurations
Give final authorization for critical actions including production rollbacks, regional failovers, and emergency changes
Interface with executive leadership during critical incidents and post-mortems to provide technical guidance and impact assessments
Identify when incidents stem from teams deviating from established cloud-native patterns
Command cross-functional teams during high-severity incidents affecting PayPal core and brand platforms (Venmo, Xoom, Zettle, Braintree)
Lead blameless postmortem sessions and contribute to Root Cause Analysis (RCA) processes
Drive continuous improvement initiatives based on incident learnings
Serve as the primary technical escalation point during critical incidents
Accelerate incident response times through standardized playbooks and automated workflows
Coordinate cross-functional teams during high-severity incidents affecting PayPal core and brand platforms (Venmo, Xoom, Zettle, Braintree)
Manage multiple concurrent incidents during peak periods with efficiency and precision
Change Management & Risk Mitigation
Serve as final approver for emergency changes and provide expert guidance on all production changes
Act as advisor and technical authority during change approval processes, identifying potential reliability risks
Provide training and guidance to engineering teams on change management best practices
Maintain change audit documentation and compliance requirements
Review and approve changes to production systems, ensuring comprehensive risk assessment
Automate change validation and rollback procedures to minimize service disruptions
Streamline change management processes to reduce manual errors and bottlenecks
Cloud Expertise & Technical Leadership
Leverage deep expertise in cloud platforms (AWS, GCP, Azure) to drive incident resolution
Support Braintree and Venmo cloud infrastructure operations
Guide teams toward solutions by providing architectural direction during incidents
Stay current with emerging cloud technologies and best practices
Mentor team members on cloud technologies and incident management techniques
Implement automation, dashboards, and tooling to enhance the team's incident response capabilities
Build runbooks and playbooks for cloud-native incident scenarios
Develop internal tools and scripts to improve TDO operational efficiency
Drive projects that advance the Command Center's operational capabilities
Qualifications
Technical Skills
Significant hands-on experience with at least one major cloud provider (AWS or GCP required; multi-cloud experience preferred)
Strong proficiency with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi, or equivalent), including the ability to read, review, and troubleshoot IaC configurations during incidents
Significant hands-on experience with Kubernetes and CNCF ecosystem tools, including troubleshooting K8s deployments, manifests, and cluster issues
Ability to quickly read and review code across multiple languages (Python, Go, Bash) and configuration formats (YAML, HCL, JSON), essential for effective incident troubleshooting
Proven experience managing critical incidents in Infrastructure-as-Code driven environments, including troubleshooting IaC state issues, GitOps failures, and cloud-native deployment problems
Professional-level certification in at least one major cloud platform (AWS Solutions Architect Professional, Google Cloud Professional Cloud Architect, or equivalent)
Experience with monitoring and observability tools (Splunk, Datadog, Prometheus, Grafana)
Strong knowledge of networking, load balancing, CDN technologies, and DNS management
Proficiency in scripting for operational automation (Python, Bash, PowerShell)
5+ years of experience in site reliability engineering, infrastructure operations, or similar technical operations roles
Strong expertise in cloud platforms (AWS, GCP, and/or Azure)
Proficiency in infrastructure automation tools (Terraform, Ansible, CloudFormation, etc.)
Deep understanding of distributed systems, microservices architecture, and containerization (Docker, Kubernetes)
Proficiency in scripting languages (Python, Bash, PowerShell, etc.)
Soft Skills
Exceptional communication skills with ability to articulate complex technical issues to both technical and non-technical stakeholders
Executive presence and ability to effectively communicate with senior leadership during high-pressure incidents and post-mortems
Strong analytical and problem-solving abilities with a systematic approach to troubleshooting
Ability to remain calm under pressure and make critical decisions during incidents
Excellent collaboration skills with experience working across global, cross-functional teams
Strong documentation skills and attention to detail
Preferred Skills
Experience with multiple cloud providers (AWS + GCP + Azure)
Broader toolset expertise across multiple IaC tools, CI/CD platforms, or GitOps solutions
Experience with payment processing systems or fintech platforms
ITIL Foundation or higher certification
Background in security operations or compliance (PCI-DSS, SOC 2, etc.)
Experience mentoring or leading technical teams
Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience)
Certifications in cloud platforms (AWS Solutions Architect, Google Cloud Professional, Azure Administrator, etc.)
Experience with infrastructure as code and GitOps practices
Subsidiary:
PayPal
Travel Percent:
0
PayPal does not charge candidates any fees for courses, applications, resume reviews, interviews, background checks, or onboarding. Any such request is a red flag and likely part of a scam. To learn more about how to identify and avoid recruitment fraud please visit .
For the majority of employees, PayPal's balanced hybrid work model offers 3 days in the office for effective in-person collaboration and 2 days at your choice of either the PayPal office or your home workspace, ensuring that you equally have the benefits and conveniences of both locations.
Our Benefits:
At PayPal, we’re committed to building an equitable and inclusive global economy. And we can’t do this without our most important asset—you. That’s why we offer benefits to help you thrive in every stage of life. We champion your financial, physical, and mental health by offering valuable benefits and resources to help you care for the whole you.
We have great benefits including a flexible work environment, employee share options, health and life insurance, and more. To learn more about our benefits please visit .
Who We Are:
to learn more about our culture and community.
Commitment to Diversity and Inclusion
PayPal provides equal employment opportunity (EEO) to all persons regardless of age, color, national origin, citizenship status, physical or mental disability, race, religion, creed, gender, sex, pregnancy, sexual orientation, gender identity and/or expression, genetic information, marital status, status with regard to public assistance, veteran status, or any other characteristic protected by federal, state, or local law. In addition, PayPal will provide reasonable accommodations for qualified individuals with disabilities. If you are unable to submit an application because of incompatible assistive technology or a disability, please contact us at .
Belonging at PayPal:
Our employees are central to advancing our mission, and we strive to create an environment where everyone can do their best work with a sense of purpose and belonging. Belonging at PayPal means creating a workplace with a sense of acceptance and security where all employees feel included and valued. We are proud to have a diverse workforce reflective of the merchants, consumers, and communities that we serve, and we continue to take tangible actions to cultivate inclusivity and belonging at PayPal.
Any general requests for consideration of your skills, please .
We know the confidence gap and imposter syndrome can get in the way of meeting spectacular candidates. Please don’t hesitate to apply.