The Site Reliability Engineer (SRE) is responsible for improving system reliability and resilience. This role focuses on building automation to reduce manual effort and prevent service-impacting incidents.
The SRE combines software and systems engineering to build and support large-scale, distributed, fault-tolerant systems. This role ensures that critical platforms are available, reliable and able to support a fast rate of improvement. The role relies on monitoring platforms and continually takes a holistic view of system health and performance.
The SRE enhances and supports cloud-based transformations, with a focus on pushing capabilities forward, staying ahead of customer needs and innovating for continuous improvement. The SRE provides operational support and engineering for multiple large-scale distributed software applications.
JOB DUTIES
• Gathers and analyzes metrics from monitoring platforms to assist in performance tuning and fault tolerance.
• Partners with development teams to improve services through testing and release procedures.
• Participates in system design, platform management and capacity planning.
• Balances feature development speed and reliability using service-level objectives.
• Works closely with the incident response team to restore service to normal operation.
• Understands debugging and applies troubleshooting skills.
• Investigates, blocks and rate-limits unwanted traffic.
• Utilizes monitoring systems and dashboards for proactive changes and alerting.
• Establishes continuous process improvement cycles where the process, performance, and supporting technologies are reviewed and enhanced where applicable.
• Performs other duties as assigned.
EDUCATION & EXPERIENCE
Typically requires a bachelor's degree and five (5) to seven (7) years of experience in a technology and/or software engineering role, or an equivalent combination of education and experience.
KNOWLEDGE, SKILLS, ABILITIES
• Understanding of Kubernetes, containers, clusters and elastic scalability.
• Expertise in SRE principles.
• Mindset of continually finding ways to drive scalability, stability and performance.
• Cloud Services experience with Google Cloud Platform (GCP).
• Experience with API, service-based or microservice-based architecture.
• Proficiency in infrastructure, network, database, operating systems or security troubleshooting and remediation.
• Architecture-level knowledge of Windows, Linux and infrastructure systems.
• Experience with production deployment, monitoring and operational support for enterprise-class applications (Dynatrace a plus).
• Experience working with Continuous Integration/Continuous Deployment (CI/CD) tools.
• Experience in performance diagnostics, capacity planning, performance architecture design, performance tuning and performance monitoring.
• A strong mix of software engineering and operational support skills.
• Knowledge of web technologies such as HTTP, proxies and Java.
• Experience with Azure DevOps (ADO), Dynatrace, Prometheus, Terraform and Grafana.
COMPANY INFORMATION: Motion offers an excellent benefits package, which includes options for healthcare coverage, 401(k), tuition reimbursement, and vacation, sick and holiday pay.
Site Reliability Engineer • Birmingham, Alabama