Senior Manager, Site Reliability Engineering (SRE)

Contract-to-hire position with a bank

Role Overview

We are seeking a hands-on and strategic Senior Manager to lead our Site Reliability Engineering (SRE), Service Delivery, and Infrastructure Patching teams supporting the Digital Banking Platform. This role is crucial to our mission of providing always-on, secure, and high-performing banking services for millions of customers.

Key Responsibilities

Technical Leadership & Incident Management

Act as the senior technical escalation point for on-call teams, diagnosing and resolving complex infrastructure, cloud, and application issues.
Lead major incident response efforts, ensuring rapid restoration and comprehensive root cause analysis (RCA).
Collaborate across engineering, platform, and security to troubleshoot issues spanning full-stack environments (cloud, container, and legacy platforms).
Maintain high availability and performance of digital banking applications (running primarily on AWS, OpenShift, and Linux, with some legacy WebSphere).
Champion proactive monitoring, observability, and alerting (e.g., Dynatrace, OpenSearch, Prometheus, Grafana).

SRE & Reliability Engineering

Define and implement best practices for reliability, scalability, and availability tailored to large-scale digital banking.
Continuously improve CI/CD pipelines, release automation, and deployment practices.
Drive rigorous, blameless postmortem analysis and a culture of continuous improvement.
Design for scalability, redundancy, and resilience to minimize customer impact from incidents.

Infrastructure & Patching

Oversee patching and maintenance for cloud and on-prem environments (AWS, OpenShift, Red Hat VMs, some WebSphere).
Implement zero-downtime patching strategies and automation to reduce operational risk and remediate security vulnerabilities promptly.
Partner with security teams to enforce compliance, harden platforms, and remediate vulnerabilities.

Team Leadership & Process Improvement

Lead, mentor, and grow a high-performing team of 8–10 SREs and service engineers.
Drive a culture of ownership, operational excellence, and continuous learning.
Establish and enforce best practices for incident management, operational documentation, and process automation.
Collaborate with development, infrastructure, and product teams to enhance observability, deployment, and proactive issue detection.

Required Skills

Exceptional hands-on troubleshooting skills in complex, distributed, or high-availability technical environments.
Experience in observability, monitoring, and incident management for critical platforms.
Demonstrated leadership in technical settings—may include leading projects, initiatives, or mentoring teams, even if not previously a formal people manager.
Excellent communication skills, with the ability to translate technical detail for both engineers and executives.

Bachelor’s degree in a technical field.
