Talent.com
Sr. Site Reliability Engineer - Incident Response

Cox Automotive • Peachtree Corners, GA, United States
Posted: 30 days ago
Job type:
  • Full-time
Job description:

The Site Reliability Engineer - Incident Response is a critical enterprise-level role responsible for accelerating incident resolution and enhancing the overall incident management process. This individual partners with engineering teams during active incidents to troubleshoot issues using monitoring and logging tools, and post-incident, delivers executive-level summaries that clearly communicate impact, root cause, and resolution. The SRE - Incident Response also plays a key role in analyzing incident response effectiveness and identifying opportunities for systemic improvements.

Core Competencies and Qualifications:
  • Bachelor's degree in a related discipline and 4 years' experience in a related field. The right candidate could also have a different combination, such as a master's degree and 2 years' experience; a Ph.D. and up to 1 year of experience; or 16 years' experience in a related field.
  • Applicants must currently be authorized to work in the United States for any employer without current or future sponsorship. No OPT, CPT, STEM/OPT or visa sponsorship now or in the future.
  • Engineering/Tooling: Demonstrates the ability to design, build, and maintain engineering solutions and tools that enhance reliability, automate incident response, and reduce operational toil.
  • Incident Troubleshooting: Skilled in interpreting logs, metrics, and traces to assist in identifying root causes during live incidents.
  • Monitoring & Observability: Proficient in tools such as Datadog, Splunk, New Relic, or similar platforms.
  • Strong programming background in Python, Java, or C#, with experience building, maintaining, and troubleshooting production-grade services and automation tools.
  • Proven ability to design and implement reliable, scalable, and highly available systems, leveraging software engineering best practices to improve system resilience and operational efficiency.
  • Experience developing automation and tooling to reduce toil, improve incident response, and support continuous improvement across monitoring, deployment, and recovery processes.
  • Ability to collaborate closely with software engineering teams to influence architecture and operational readiness, ensuring reliability is built into the system from design through production.
  • AI Centric Engineering: Effectively leverages artificial intelligence (AI) and machine learning (ML) tools to automate, optimize, and enhance daily engineering and incident response tasks.
  • Analytical Rigor: Strong attention to detail in validating incident data and identifying trends or gaps in response.
  • DevOps & Architecture Knowledge: Understanding of full-stack systems, CI/CD pipelines, caching, scaling, and cloud-native infrastructure.
  • Metrics & Reporting: Capable of calculating and interpreting key metrics like MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve).

Here are the responsibilities of this role when not tied to active on-call:

Post-Incident Review Development
  • Draft and deliver executive summaries post-incident.
  • Develop and coach teams on blameless postmortems.
  • Create templates, train facilitators, and help guide root cause analysis (e.g., 5 Whys, fishbone diagrams).
  • Maintain a central library of learnings and cross-cutting themes.

Incident Process Improvement
  • Actively support engineering teams during incidents by helping diagnose and resolve issues quickly.
  • Navigate and analyze data from observability platforms to make informed inferences about root causes.
  • Analyze the effectiveness of incident response to identify systemic reliability gaps.
  • Standardize incident response workflows (incident roles, comms, escalation paths).
  • Create or refine runbooks, incident command frameworks, and severity classification guides.

Metrics and Insights
  • Build dashboards around incident frequency, MTTR, MTTA, and recurrence rates.
  • Use incident data to drive reliability OKRs or engineering investments.

Tooling & AI Solutions
  • Partner with engineering teams to identify repetitive or high-impact tasks suitable for automation.
  • Develop, implement, and continuously improve custom scripts, bots, and AI-driven workflows for monitoring, alerting, and incident triage.
  • Evaluate and integrate emerging AI/ML technologies to optimize detection, root cause analysis, and reporting.
  • Ensure all tools and automations are secure, maintainable, and aligned with organizational standards and SRE best practices.
  • Document and socialize new tools and AI solutions, enabling adoption and knowledge sharing across teams.

Cross-Team Collaboration
  • Collaborate with Engineering Managers and Incident Commanders to gather and validate incident data.
  • Partner with product teams, infra, and leadership to socialize reliability best practices.
  • Act as a reliability "consultant" to squads that have impactful incidents.
  • Recommend enhancements to monitoring, alerting, and response processes to reduce future incident impact.

USD 101,500.00 - 169,100.00 per year

Compensation: Compensation includes a base salary of $101,500.00 - $169,100.00. The base salary may vary within the anticipated base pay range based on factors such as the ultimate location of the position and the selected candidate's knowledge, skills, and abilities. Position may be eligible for additional compensation that may include an incentive program.

Benefits: The Company offers eligible employees the flexibility to take as much vacation with pay as they deem consistent with their duties, the company's needs, and its obligations; seven paid holidays throughout the calendar year; and up to 160 hours of paid wellness annually for their own wellness or that of family members. Employees are also eligible for additional paid time off in the form of bereavement leave, time off to vote, jury duty leave, volunteer time off, military leave, and parental leave.

