Description and Requirements
About Our Team
Lenovo is building Quantum, a next‑generation hybrid AI platform that spans Windows, Android, and cloud. As part of this vision, we are expanding the reliability engineering organization powering Qira, Lenovo’s cross‑device Personal AI that operates seamlessly across Lenovo and Motorola products.
We are hiring a Senior Manager, AI Reliability Engineering, to lead the engineering teams responsible for Qira’s foundational reliability capabilities, including system‑level observability, telemetry, performance engineering, resiliency architecture, and the reliability of Qira’s hybrid edge/cloud AI service.
This is a high‑impact leadership role shaping how we measure, operate, and improve reliability across one of Lenovo’s most ambitious AI initiatives.
Location: Open to remote work in the US. The preferred work location is Chicago, IL.
What You’ll Do
Engineering Leadership
Lead and grow multiple engineering teams focused on reliability, observability, and system performance across Qira’s hybrid AI ecosystem.
Define strategy, roadmaps, and priorities to improve reliability, insight, and operational readiness across device, edge, and cloud systems.
Champion reliability as an engineering discipline through design patterns, best practices, and a culture of continuous improvement.
Observability & Telemetry
Own the systems that deliver metrics, logs, distributed traces, AI‑specific signals, dashboards, and alerting.
Drive the adoption of unified telemetry standards and instrumentation across all Qira components.
Ensure engineers have actionable insight into performance, reliability, cost, and AI behavior.
Service Reliability & Performance Engineering
Lead engineering efforts to improve the reliability, performance, and scalability of Qira’s service architecture — including inference, retrieval, data pipelines, and hybrid edge/cloud workflows.
Drive the design and adoption of resilience patterns such as graceful degradation, fallback paths, bulkheads, and rate‑limiting strategies.
Oversee capacity planning, cost optimization, and performance tuning for high‑throughput AI systems.
System Design & Architectural Influence
Work with cross‑functional engineering teams to embed reliability early in the design process (“shift left”).
Guide architectural decisions to ensure Qira’s engineering foundations remain stable, observable, and predictable at scale.
Set service readiness standards for new components entering production.
Cross‑Functional Collaboration
Partner with Applied AI/ML Engineering, Platform Engineering, Firmware, Product, and Security to align reliability goals with Qira’s broader roadmap.
Collaborate closely with the incident management and operations teams to ensure strong signal quality, runbook depth, and operational tooling.
Represent reliability engineering in executive and engineering leadership forums.
Team & Talent Development
Hire and develop world‑class engineers across observability, reliability, and performance domains.
Provide coaching, mentorship, and clear technical and leadership career paths.
Foster a culture of ownership, operational craftsmanship, and data‑driven engineering.
Basic Qualifications
12+ years of experience in Site Reliability Engineering, Observability Engineering, Platform Engineering, or large‑scale distributed systems, including 5+ years leading engineering teams.
Bachelor’s Degree in Computer Science, Engineering, or a related technical field.
Engineering experience in several of the following:
Observability systems (OpenTelemetry, metrics/logs/traces)
Distributed systems reliability and performance
Cloud infrastructure (Azure preferred)
Kubernetes and containerized environments
CI/CD pipelines and deployment workflows
Infrastructure-as-Code (Terraform, Bicep, etc.)
Deep understanding of Linux systems, networking, scalability, and system performance fundamentals.
Proven ability to lead engineering teams and drive cross‑organizational initiatives.
Preferred Qualifications
Experience building or operating large‑scale telemetry and observability platforms.
Hands‑on experience with Grafana, Prometheus, Loki, Tempo, or similar tooling.
Experience supporting AI/ML inference systems, vector databases, or GPU‑accelerated compute.
Background in hybrid systems spanning device, edge, and cloud.
Experience implementing resilience patterns and reliability frameworks.
Experience with SLOs, SLIs, error budgets, and reliability governance.
Passion for building scalable reliability engineering teams and systems.
Why This Role Matters
Qira’s reliability is mission‑critical to delivering a safe, fast, and trustworthy AI experience to millions of users.
In this role, you will:
Build the telemetry and reliability insights that power Qira
Architect the service‑level reliability patterns that keep Qira stable at scale
Lead the engineering teams that ensure Qira performs predictably across devices, edge, and cloud
Shape how reliability engineering is practiced across Lenovo’s AI ecosystem
This is a rare opportunity to define the engineering foundation of a next‑generation global AI platform.
The budgeted base salary range for this position is $190K - $230K. Individuals may also be considered for bonus and/or commission.
Lenovo’s various benefits can be found on