Job Description
Observability Engineer
Department: Information Technology
Location: On-site
About the role
We’re hiring an Observability Engineer in our Information Technology department to own and evolve our production monitoring and telemetry platform. You’ll partner with Dev and IT teams to instrument applications, build meaningful dashboards and monitors, maintain alerting and on-call flows, and consult on the best observability tools and patterns for the organization. This role combines hands-on implementation with cross-team collaboration and a focus on keeping our metrics, dashboards, and alerts accurate, actionable, and relevant. We're looking for a self-starter who is highly motivated and comfortable driving work with minimal supervision.
What you’ll do (core responsibilities)
- Build, maintain, and improve dashboards and monitors across the stack so teams can quickly understand system health and reliability.
- Partner with business, development and IT teams to design, instrument, and roll out metrics, logs, and traces that track service health and user impact.
- Keep metrics, dashboards, synthetic checks, and alert rules up-to-date, relevant, and aligned with business priorities.
- Own the alert lifecycle: define thresholds, reduce noise and alert fatigue, route and escalate alerts, and refine alerting through post-incident analysis.
- Run and tune synthetic monitoring and uptime checks (Elastic Synthetics and Site24x7) to verify customer workflows and critical endpoints.
- Improve cloud monitoring with AWS CloudWatch: create and maintain CloudWatch dashboards and alarms, tune metrics and alarms for cloud resources, and recommend alternative cloud-native or third-party monitoring tools where appropriate.
- Integrate and operate error monitoring and ensure actionable error tracking for engineering teams.
- Operate and scale metric ingestion and storage backends (Prometheus, Grafana, Elastic) and monitoring systems (Zabbix) as needed.
- Work with platform teams to automate observability configuration and deployments.
- Use and extend in-house observability products; mentor teams on their use.
- Participate in incident response, triage, and postmortems; translate learnings into monitoring improvements.
Required qualifications
- Experience working with production observability, monitoring, or SRE / DevOps teams.
- Hands-on experience creating dashboards, alerts, and instrumentation for services.
- Practical knowledge of alert routing and incident tools, and of integrating alert pipelines.
- Experience with error monitoring tools such as Sentry (or equivalent).
- Strong understanding of metrics, logs, and traces, and of alerting best practices.
- Experience with automation and configuration-as-code.
- Proficiency with PromQL for queries, alerts, and dashboards.
- Excellent communicator, able to consult with project managers, developers, and IT, and to translate reliability goals into concrete monitoring actions.
- Self-starter and highly motivated, able to drive projects, unblock teams, and proactively improve observability practices.
Preferred / nice-to-have
- Familiarity with any of the following: Prometheus, Grafana, Elastic, Zabbix, Site24x7, CloudWatch.
- Experience operating or building monitoring / storage backends at scale (Prometheus federation, long-term metrics storage).
- Experience working with custom / in-house telemetry tooling or building observability plugins.
- Experience with container orchestration (Kubernetes) and monitoring in containerized environments.
Tools & technologies you’ll use
Prometheus, Alertmanager, Graphite, Elastic (ELK), Elastic Synthetics, Site24x7, Grafana, Zabbix, CloudWatch, Sentry.io, Opsgenie, SUSE Observability, plus our in-house tools.
Why you’ll enjoy this role
You’ll have broad impact, shaping how teams detect, respond to, and prevent incidents, and you’ll help move observability from reactive alerts to proactive service reliability. You’ll work across engineering and IT, influence platform direction, and own meaningful reliability outcomes. The role rewards initiative: if you’re a self-starter who enjoys finding gaps and independently delivering improvements, you’ll thrive here.