Job Title: Senior DevOps Engineer (Production Support)
Location: Chicago, IL (Hybrid, 3 days per week)
Duration: 12+ Months
Role Summary:
We are seeking a highly experienced Senior DevOps Engineer (Production Support) with deep expertise in AWS, Kubernetes, CI/CD, and cloud-native platforms. This role will focus on operating, stabilizing, and continuously improving production environments, ensuring high availability, performance, and scalability of mission-critical applications.
The ideal candidate is a hands-on DevOps/SRE professional who thrives in fast-paced production environments and can automate, troubleshoot, and optimize distributed systems at scale.
You will work extensively with AWS, Kubernetes (Rancher), Jenkins, GitHub, Terraform, Kafka, Harness, and Python while partnering with engineering, platform, and product teams.
Key Responsibilities:
Production Operations & Reliability
- Provide L2/L3 production support for cloud-native applications running on AWS and Kubernetes.
- Own incident triage, root cause analysis (RCA), and resolution for high-severity production issues.
- Participate in on-call rotations and drive post-incident improvements.
- Improve system reliability, resilience, and observability using SRE best practices.
AWS & Cloud Infrastructure
- Design and operate scalable AWS environments using:
  - EC2, EKS, VPC, ALB/NLB
  - S3, RDS, DynamoDB
  - IAM, CloudWatch, EventBridge
- Optimize cloud cost, performance, and security posture.
- Implement multi-account, multi-region architectures.
Kubernetes & Container Platforms
- Manage and operate Kubernetes clusters (Rancher-managed or EKS).
- Troubleshoot:
  - Pod failures
  - Resource constraints
  - Networking issues (CNI, ingress)
  - Stateful workloads
- Improve:
  - Autoscaling strategies
  - Cluster resilience
  - Deployment reliability
CI/CD & Developer Enablement
- Design and maintain CI/CD pipelines using:
  - Jenkins
  - GitHub Actions
  - Harness (preferred)
- Implement:
  - Blue/green and canary deployments
  - GitOps workflows
  - Automated rollbacks
- Enable developer self-service deployment platforms.
Infrastructure as Code & Automation
- Build and maintain infrastructure using:
  - Terraform (primary)
  - Python automation
- Develop reusable:
  - IaC modules
  - Platform templates
  - Deployment accelerators
- Automate provisioning, scaling, and recovery workflows.
Kafka & Streaming Platforms
- Design and manage Kafka infrastructure, including:
  - Clusters, topics, brokers
  - Producers/consumers
  - Schema evolution
- Ensure:
  - High availability
  - Throughput optimization
  - Secure connectivity
- Integrate Kafka with AWS and Kubernetes ecosystems.
Observability & Platform Health
- Implement monitoring and alerting using:
  - CloudWatch / Splunk Observability
- Define:
  - SLIs/SLOs
  - Alerting thresholds
  - Runbooks
- Proactively identify bottlenecks and prevent outages.
Security & Compliance
- Implement DevSecOps best practices:
  - Secrets management
  - IAM least privilege
  - Container scanning
  - Supply chain security
- Ensure infrastructure adheres to security and compliance standards.
Collaboration & Continuous Improvement
- Partner with development teams to:
  - Improve deployment maturity
  - Reduce operational toil
  - Increase automation coverage
- Drive:
  - Platform standardization
  - Developer experience improvements
  - Operational excellence initiatives
Qualifications
Experience
- 4 to 10 years of experience in DevOps, SRE, or Production Support roles
- Strong experience managing production-grade cloud environments
- Proven track record of managing live production incidents
Technical Skills
Must Have
- AWS (deep hands-on)
- Kubernetes (EKS/Rancher)
- Splunk
- Terraform
- Jenkins / GitHub
- Kafka
- Python or Shell scripting
- Linux systems expertise
Good to Have
- Harness CI/CD
- GitOps (ArgoCD/Flux)
- Service mesh (Istio/Linkerd)
- Observability tools (New Relic, Datadog, Prometheus)
- Platform engineering mindset
Soft Skills
- Strong troubleshooting and debugging mindset
- Excellent communication during incidents
- Ability to work in high-pressure production environments
- Ownership-driven and automation-first approach
Mandatory: Overall DevOps experience, AWS, Kubernetes/Helm, Terraform/Ansible, Jenkins/Harness, Python/Groovy scripting, Linux, Splunk, Production Support
Secondary: Claude Code, Rancher, DataOps, Consul, Kafka, DevSecOps