Talent.com
Sr. Site Reliability Engineer - Incident Response
Cox Automotive • Dunwoody, GA, United States
Job type: Full-time

The Site Reliability Engineer - Incident Response is a critical enterprise-level role responsible for accelerating incident resolution and enhancing the overall incident management process. This individual partners with engineering teams during active incidents to troubleshoot issues using monitoring and logging tools and, post-incident, delivers executive-level summaries that clearly communicate impact, root cause, and resolution. The SRE - Incident Response also plays a key role in analyzing incident response effectiveness and identifying opportunities for systemic improvements.

Core Competencies and Qualifications:
  • Bachelor's degree in a related discipline and 4 years' experience in a related field. The right candidate could also have a different combination, such as a master's degree and 2 years' experience; a Ph.D. and up to 1 year of experience; or 16 years' experience in a related field.
  • Applicants must currently be authorized to work in the United States for any employer without current or future sponsorship. No OPT, CPT, STEM/OPT or visa sponsorship now or in future.
  • Engineering/Tooling: Demonstrates the ability to design, build, and maintain engineering solutions and tools that enhance reliability, automate incident response, and reduce operational toil.
  • Incident Troubleshooting: Skilled in interpreting logs, metrics, and traces to assist in identifying root causes during live incidents.
  • Monitoring & Observability: Proficient in tools such as Datadog, Splunk, New Relic, or similar platforms.
  • Strong programming background in Python, Java, or C#, with experience building, maintaining, and troubleshooting production-grade services and automation tools.
  • Proven ability to design and implement reliable, scalable, and highly available systems, leveraging software engineering best practices to improve system resilience and operational efficiency.
  • Experience developing automation and tooling to reduce toil, improve incident response, and support continuous improvement across monitoring, deployment, and recovery processes.
  • Ability to collaborate closely with software engineering teams to influence architecture and operational readiness, ensuring reliability is built into the system from design through production.
  • AI Centric Engineering: Effectively leverages artificial intelligence (AI) and machine learning (ML) tools to automate, optimize, and enhance daily engineering and incident response tasks.
  • Analytical Rigor: Strong attention to detail in validating incident data and identifying trends or gaps in response.
  • DevOps & Architecture Knowledge: Understanding of full-stack systems, CI/CD pipelines, caching, scaling, and cloud-native infrastructure.
  • Metrics & Reporting: Capable of calculating and interpreting key metrics like MTTA (Mean Time to Acknowledge) and MTTR (Mean Time to Resolve).

Here are the responsibilities of this role when not tied to active on-call:

Post-Incident Review Development
  • Draft and deliver executive summaries post-incident.
  • Develop and coach teams on blameless postmortems.
  • Create templates, train facilitators, and help guide root cause analysis (e.g., 5 Whys, fishbone diagrams).
  • Maintain a central library of learnings and cross-cutting themes.

Incident Process Improvement
  • Actively support engineering teams during incidents by helping diagnose and resolve issues quickly.
  • Navigate and analyze data from observability platforms to make informed inferences about root causes.
  • Analyze the effectiveness of incident response to identify systemic reliability gaps.
  • Standardize incident response workflows (incident roles, comms, escalation paths).
  • Create or refine runbooks, incident command frameworks, and severity classification guides.

Metrics and Insights
  • Build dashboards around incident frequency, MTTR, MTTA, and recurrence rates.
  • Use incident data to drive reliability OKRs or engineering investments.

Tooling & AI Solutions
  • Partner with engineering teams to identify repetitive or high-impact tasks suitable for automation.
  • Develop, implement, and continuously improve custom scripts, bots, and AI-driven workflows for monitoring, alerting, and incident triage.
  • Evaluate and integrate emerging AI/ML technologies to optimize detection, root cause analysis, and reporting.
  • Ensure all tools and automations are secure, maintainable, and aligned with organizational standards and SRE best practices.
  • Document and socialize new tools and AI solutions, enabling adoption and knowledge sharing across teams.

Cross-Team Collaboration
  • Collaborate with Engineering Managers and Incident Commanders to gather and validate incident data.
  • Partner with product teams, infra, and leadership to socialize reliability best practices.
  • Act as a reliability "consultant" to squads that have impactful incidents.
  • Recommend enhancements to monitoring, alerting, and response processes to reduce future incident impact.

USD 101,500.00 - 169,100.00 per year

Compensation: Compensation includes a base salary of $101,500.00 - $169,100.00. The base salary may vary within the anticipated base pay range based on factors such as the ultimate location of the position and the selected candidate's knowledge, skills, and abilities. Position may be eligible for additional compensation that may include an incentive program.

Benefits: The Company offers eligible employees the flexibility to take as much vacation with pay as they deem consistent with their duties, the company's needs, and its obligations; seven paid holidays throughout the calendar year; and up to 160 hours of paid wellness annually for their own wellness or that of family members. Employees are also eligible for additional paid time off in the form of bereavement leave, time off to vote, jury duty leave, volunteer time off, military leave, and parental leave.

Similar jobs
Lead Site Reliability Engineer
Intellum, Inc. • Atlanta, Georgia, US
Full-time
About us Intellum is the leader in corporate education technology and powers the largest, most successful customer, partner, and employee learning programs in the wo...

Site Reliability Engineer
AutoRABIT Holding Inc. • Atlanta, GA, US
Permanent
AutoRABIT is looking for a Site Reliability/DevSecOps Engineer to help develop, scale and operate our cloud services. In this role you will be an experienced business professional able to implement ...

Sr Systems Engineer
Primetals Technologies • Alpharetta, GA, US
Full-time
Position: Senior Systems Engineer (Primetals Technologies USA LLC; Alpharetta, GA) Duties: Engineering, specification, design, testing, and commissioning of advanced automation control systems. These...

Restoration Technician Lead
1800 Water Damage • Alpharetta, GA, United States
Full-time
WATER DAMAGE is a trusted property restoration company serving across the nation. Our team is fully vetted, IICRC-. We handle a range of restoration projects including emergency mitigation, water dam...

Site Reliability Engineer (SRE)
Rivka Development • Atlanta, Georgia, US
Full-time
Role Description: SRE will work within the Video Network division to design, build, operate our next generation Video Cloud platform, driving efficiency, reliability...

Sr. Engineer, Regulatory Compliance
Oglethorpe Power Corporation • Tucker, Georgia, US
Full-time
The primary role of this position is to provide engineering and compliance expertise in connection with OPC's ERO Compliance Program, including CIP Standards. This wi...

Systems & Infrastructure Engineer (Level I - V)
Oglethorpe Power • Tucker, GA, US
Full-time
This position leverages expertise in system administration to maintain systems critical to GSOC's system operations function. As a member of the Systems and Infrastructure department, this position ...

DevOps - Site Reliability Engineer (SRE)
Resource Informatics Group Inc • Atlanta, Georgia, US
Full-time
Role: Site Reliability Engineer Location: Atlanta, GA Duration: 12 months Rate: $market All Inclusive Job Description: * This Software Engineer will be part of the S...

Site Reliability Engineer (SRE)
Bright Vision Technologies • Atlanta, GA, US
Full-time
Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize t...

System Reliability Engineer, IV or V
Georgia Transmission Corporation • Tucker, GA, USA
Full-time
Performs System Reliability functions to improve and enhance the reliability performance of the transmission system to meet the needs of Members and corporate goals. Identifies ways to best utilize ...

Construction Risk Control Engineer
Berkshire Hathaway Specialty Insurance • Atlanta, Georgia, US
Full-time
Who are we? A strategic and trusted insurance partner, Berkshire Hathaway Specialty Insurance (BHSI), provides a broad range of commercial property, casualty and spe...

Building Inspector (Statewide)
State of Georgia • Atlanta, GA, United States
Full-time
Under supervision, the Building Inspector (Special Hazards) is responsible for inspecting facilities and construction sites that involve hazardous materials, high-risk operations, or specialized sy...

Vulnerability Engineer (US Remote)
First Advantage • Atlanta, GA, US
Remote
Full-time
FA), people are at the heart of everything we do, from our customers and partners to our greatest advantage: our team members. Operating with empathy and compassion, First Advantage fosters a global,...

Sr. Security Engineer
TradeStation • Atlanta, GA, US
Full-time
WeAreTradeStation Remote Position - must reside in Florida, Texas, Illinois, New York, New Jersey, Colorado, Idaho, Massachusetts, Michigan, Minnesota, Missouri, North Carolina, South Carolina, Uta...

CMT Special Inspector
NOVA Engineering and Environmental, LLC • Norcross, Georgia, US
Full-time
NOVA Engineering is currently seeking an ICC-certified CMT Special Inspector (or Engineer) for our Norcross, GA or Kennesaw, GA office to work on projects in the Atl...

Sediment Remediation Specialist (Engineer, Scientist or Geologist)
CDM Smith • Atlanta, GA, United States
Full-time
CDM Smith's Environmental Services Group is seeking an Engineer, Scientist or Geologist with a.CAD) in remediation projects combined with basic geotechnical engineering experience (significant acad...

Security Operations Engineer (Level I - V)
Oglethorpe Power • Tucker, GA, US
Full-time
This Engineer role, part of GSOC's Security Operations department, is responsible for protecting the cyber assets that support GSOC and GTC's digital operations. The position focuses on conducting c...

Entry Level Automotive Technician/Express Lane Technician/Lube Technician
Shottenkirk Chrysler Dodge Jeep Ram • Canton, GA, US
Full-time
Shottenkirk Chrysler Dodge Jeep Ram in Canton, GA is looking for Express Lane Lube Technicians to join their busy service d...