Sr. Site Reliability Engineer - Incident Response

Cox Enterprises • Irvine, California, US
Job type: Full-time

Job Description

The Site Reliability Engineer - Incident Response is a critical enterprise-level role responsible for accelerating incident resolution and enhancing the overall incident management process. This individual partners with engineering teams during active incidents to troubleshoot issues using monitoring and logging tools and, post-incident, delivers executive-level summaries that clearly communicate impact, root cause, and resolution. The SRE - Incident Response also plays a key role in analyzing incident response effectiveness and identifying opportunities for systemic improvements.

Core Competencies and Qualifications:

  • Bachelor’s degree in a related discipline and 4 years’ experience in a related field. The right candidate could also have a different combination, such as a master’s degree and 2 years’ experience; a Ph.D. and up to 1 year of experience; or 16 years’ experience in a related field.

Applicants must currently be authorized to work in the United States for any employer without current or future sponsorship. No OPT, CPT, STEM / OPT or visa sponsorship now or in future.

Engineering/Tooling: Demonstrates the ability to design, build, and maintain engineering solutions and tools that enhance reliability, automate incident response, and reduce operational toil.

Incident Troubleshooting: Skilled in interpreting logs, metrics, and traces to assist in identifying root causes during live incidents.

Monitoring & Observability: Proficient in tools such as Datadog, Splunk, New Relic, or similar platforms.

Strong programming background in Python, Java, or C#, with experience building, maintaining, and troubleshooting production-grade services and automation tools.

Proven ability to design and implement reliable, scalable, and highly available systems, leveraging software engineering best practices to improve system resilience and operational efficiency.

Experience developing automation and tooling to reduce toil, improve incident response, and support continuous improvement across monitoring, deployment, and recovery processes.

Ability to collaborate closely with software engineering teams to influence architecture and operational readiness, ensuring reliability is built into the system from design through production.

AI-Centric Engineering: Effectively leverages artificial intelligence (AI) and machine learning (ML) tools to automate, optimize, and enhance daily engineering and incident response tasks.

Analytical Rigor: Strong attention to detail in validating incident data and identifying trends or gaps in response.

DevOps & Architecture Knowledge: Understanding of full-stack systems, CI/CD pipelines, caching, scaling, and cloud-native infrastructure.

Metrics & Reporting: Capable of calculating and interpreting key metrics like MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve).
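
As an illustration of the Metrics & Reporting competency, the following is a minimal sketch of how MTTA and MTTR might be computed from incident timestamps. The incident records, the field names (`opened`, `acknowledged`, `resolved`), and the helper `mean_minutes` are hypothetical and are not taken from any Cox tooling. The sketch also assumes both metrics are measured from the time the incident was opened; some teams measure MTTR from acknowledgement instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are illustrative assumptions.
incidents = [
    {
        "opened": datetime(2024, 1, 1, 10, 0),
        "acknowledged": datetime(2024, 1, 1, 10, 5),
        "resolved": datetime(2024, 1, 1, 11, 0),
    },
    {
        "opened": datetime(2024, 1, 2, 14, 0),
        "acknowledged": datetime(2024, 1, 2, 14, 15),
        "resolved": datetime(2024, 1, 2, 16, 0),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed time in minutes between two timestamps per record."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# MTTA: mean time from open to acknowledgement.
mtta = mean_minutes(incidents, "opened", "acknowledged")
# MTTR: mean time from open to resolution (assumed definition).
mttr = mean_minutes(incidents, "opened", "resolved")

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

In practice the timestamps would come from an incident management or observability platform rather than a hard-coded list.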

Here are the responsibilities of this role when not tied to active on-call:

Post-Incident Review Development

Draft and deliver executive summaries post-incident.

Develop and coach teams on blameless postmortems.

Create templates, train facilitators, and help guide root cause analysis (e.g., 5 Whys, fishbone diagrams).

Maintain a central library of learnings and cross-cutting themes.

Incident Process Improvement

Actively support engineering teams during incidents by helping diagnose and resolve issues quickly.

Navigate and analyze data from observability platforms to make informed inferences about root causes.

Analyze the effectiveness of incident response to identify systemic reliability gaps.

Standardize incident response workflows (incident roles, comms, escalation paths).

Create or refine runbooks, incident command frameworks, and severity classification guides.

Metrics and Insights

Build dashboards around incident frequency, MTTR, MTTA, and recurrence rates.

Use incident data to drive reliability OKRs or engineering investments.

Tooling & AI Solutions

Partner with engineering teams to identify repetitive or high-impact tasks suitable for automation.

Develop, implement, and continuously improve custom scripts, bots, and AI-driven workflows for monitoring, alerting, and incident triage.

Evaluate and integrate emerging AI/ML technologies to optimize detection, root cause analysis, and reporting.

Ensure all tools and automations are secure, maintainable, and aligned with organizational standards and SRE best practices.

Document and socialize new tools and AI solutions, enabling adoption and knowledge sharing across teams.

Cross-Team Collaboration

Collaborate with Engineering Managers and Incident Commanders to gather and validate incident data.

Partner with product teams, infra, and leadership to socialize reliability best practices.

Act as a reliability “consultant” to squads that have impactful incidents.

Recommend enhancements to monitoring, alerting, and response processes to reduce future incident impact.

Drug Testing

To be employed in this role, you’ll need to clear a pre-employment drug test. Cox Automotive does not currently administer a pre-employment drug test for marijuana for this position. However, we are a drug-free workplace, so the possession of, use of, or being under the influence of drugs illegal under federal or state law during work hours, on company property, and/or in company vehicles is prohibited.

Benefits

The Company offers eligible employees the flexibility to take as much vacation with pay as they deem consistent with their duties, the company’s needs, and its obligations; seven paid holidays throughout the calendar year; and up to 160 hours of paid wellness annually for their own wellness or that of family members. Employees are also eligible for additional paid time off in the form of bereavement leave, time off to vote, jury duty leave, volunteer time off, military leave, and parental leave.

About Us

Through groundbreaking technology and a commitment to stellar experiences for drivers and dealers alike, Cox Automotive employees are transforming the way the world buys, owns, sells – or simply uses – cars. Cox Automotive employees get to work on iconic consumer brands like Autotrader and Kelley Blue Book and industry-leading dealer-facing companies like vAuto and Manheim, all while enjoying the people-centered atmosphere that is central to our life at Cox.

Benefits of working at Cox may include health care insurance (medical, dental, vision), retirement planning (401(k)), and paid days off (sick leave, parental leave, flexible vacation/wellness days, and/or PTO). For more details on what benefits you may be offered, visit our benefits page.

Cox is an Equal Employment Opportunity employer. All qualified applicants/employees will receive consideration for employment without regard to that individual’s age, race, color, religion or creed, national origin or ancestry, sex (including pregnancy), sexual orientation, gender, gender identity, physical or mental disability, veteran status, genetic information, ethnicity, citizenship, or any other characteristic protected by law. Cox provides reasonable accommodations when requested by a qualified applicant or employee with a disability, unless such accommodations would cause an undue hardship.
