Observability & Operations

Status: 🟢 Active | Owner: SRE Team | Last Reviewed: 2025-Q4

Introduction

You cannot operate what you cannot observe. The ability to understand the internal state of a running system from the data it produces — without needing to attach a debugger or log into a production server — is what separates systems that can be operated reliably at scale from systems that require constant, heroic manual intervention.

Observability is often confused with monitoring, but the distinction matters. Monitoring tells you whether known-bad things are happening. Observability tells you why things are happening, including things you did not anticipate when you instrumented the system. A properly observable system allows an on-call engineer to diagnose a novel production issue in minutes rather than hours, using the telemetry the system emits rather than needing to modify and redeploy it.

This section documents the standards for the three pillars of observability (logs, metrics, and traces), along with the alerting, SLO, and incident management practices that make those pillars operational.

The OpenTelemetry Mandate

OpenTelemetry (OTel) is the single, non-negotiable instrumentation standard for all services across the organisation. No proprietary observability SDK — Datadog Agent, New Relic APM, Dynatrace OneAgent, or any other vendor-specific instrumentation library — may be embedded in application code.

This is a deliberate architectural decision with three consequences:

1. Backend independence. Because all telemetry flows through the OTel Collector using the vendor-neutral OTLP protocol, the organisation can route data to any backend — Grafana Tempo, Jaeger, Datadog, Honeycomb — or multiple backends simultaneously, by changing Collector configuration. No service needs to be redeployed to change where its telemetry goes.

2. Consistent correlation across pillars. OTel provides a unified context model. A trace ID generated when a request arrives is propagated with that request, so it can be attached to every log line emitted while handling it and to the metric exemplars recorded for it. Logs, metrics, and traces from the same request can therefore be linked with a single identifier, regardless of which service emitted them.

3. Future-proofing. The observability tooling market continues to evolve. Locking instrumentation to a vendor SDK locks the organisation into that vendor's release cadence, pricing, and data model. OTel ensures the instrumentation investment belongs to the organisation, not to a vendor.
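
To make the correlation point concrete, here is a minimal sketch of the W3C `traceparent` header that carries the trace ID between services, and of a log line that reuses that ID. In practice the OTel SDK handles this for you; the helper names and the JSON field names (`trace_id`, `span_id`) are hypothetical and chosen only for illustration.

```python
import json
import re
import secrets
from typing import Optional

# W3C Trace Context "traceparent" header, version 00:
#   version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Create a traceparent value for a request that arrived without one."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes  -> 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header's parts, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero trace or span IDs are invalid under the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest bit is the "sampled" flag
    return {"trace_id": trace_id, "span_id": span_id, "sampled": sampled}


def log_line(message: str, ctx: dict) -> str:
    """Render a JSON log line that carries the request's trace and span IDs."""
    return json.dumps(
        {"message": message, "trace_id": ctx["trace_id"], "span_id": ctx["span_id"]}
    )


if __name__ == "__main__":
    ctx = parse_traceparent(new_traceparent())
    print(log_line("payment authorised", ctx))
```

Because every service parses and forwards the same header, a query on one `trace_id` value can join the logs, spans, and exemplars of a single request.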

The OTel Collector as the Central Hub

Every service exports telemetry exclusively to the OTel Collector. The Collector is responsible for receiving, processing, filtering, transforming, and exporting telemetry to configured backends. Services never export directly to backends.

Service (OTel SDK)
      │ OTLP (gRPC or HTTP)
      ▼
OTel Collector
      ├── Grafana Tempo    (traces)
      ├── Grafana Loki     (logs)
      ├── Prometheus       (metrics)
      └── [additional backends as configured]

This architecture means adding a new observability backend — or removing one — requires a Collector configuration change, not a code change in any service.
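
The fan-out idea can be sketched with a toy model. This is not the Collector's real API or configuration format; `CollectorModel` and its methods are hypothetical names used only to show that adding a backend changes the hub, not the service that emits telemetry.

```python
from typing import Callable, Dict, List

Exporter = Callable[[dict], None]


class CollectorModel:
    """Toy stand-in for the Collector: one pipeline per signal, many exporters each."""

    def __init__(self) -> None:
        self._exporters: Dict[str, List[Exporter]] = {}

    def add_exporter(self, signal: str, exporter: Exporter) -> None:
        """Register a backend for a signal type ("traces", "logs", "metrics")."""
        self._exporters.setdefault(signal, []).append(exporter)

    def receive(self, signal: str, record: dict) -> int:
        """Fan a record out to every exporter for its signal; return the exporter count."""
        exporters = self._exporters.get(signal, [])
        for export in exporters:
            export(record)
        return len(exporters)


if __name__ == "__main__":
    collector = CollectorModel()
    tempo: List[dict] = []       # stands in for a trace backend
    second: List[dict] = []      # stands in for an additional backend

    collector.add_exporter("traces", tempo.append)
    collector.receive("traces", {"span": "checkout"})

    # Adding a backend is a change to the hub only; the emitting call is unchanged.
    collector.add_exporter("traces", second.append)
    collector.receive("traces", {"span": "refund"})
    print(len(tempo), len(second))
```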

Dynamic Telemetry Verbosity via Feature Flags

Observability data has a cost. Collecting every trace at full detail in production continuously is expensive in storage, processing, and network bandwidth. Yet when diagnosing a production issue, sparse telemetry is worse than useless: it leaves gaps precisely where you need insight.

The solution is dynamic telemetry verbosity control: the ability to switch any service, or the entire fleet, between coarse (production-normal) and detailed (incident/debug) collection modes at runtime, without a deployment.

The Standard: LaunchDarkly-Driven Verbosity

LaunchDarkly is the approved mechanism for controlling telemetry verbosity. The OTel Collector reads flag values from LaunchDarkly at startup and on a polling interval, and adjusts its processor and sampling pipeline configuration accordingly. Application services do not need to be aware of the current verbosity level — they always emit full telemetry to the Collector; the Collector decides what to retain and forward.

This keeps the control plane (LaunchDarkly) cleanly separated from the instrumentation (OTel SDK in services) and the data plane (Collector → backends).

Verbosity Levels

| Level | Flag Value | When Used | Collector Behaviour |
|---|---|---|---|
| Coarse | `observability-verbosity: coarse` | Normal production operations | Head-based sampling at 5%; DEBUG logs dropped; metric cardinality capped; trace attributes trimmed to required set |
| Standard | `observability-verbosity: standard` | Default; most of the time | Head-based sampling at 20%; INFO+ logs forwarded; full metric set; full trace attribute set |
| Detailed | `observability-verbosity: detailed` | Active incident investigation; canary deployments | 100% trace sampling; DEBUG logs forwarded; all metric cardinality; all span events captured |
| Debug | `observability-verbosity: debug` | Local development and QA only; never production unless explicitly approved by SRE Lead | 100% sampling; all logs including TRACE level; full attribute capture including request/response bodies |
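
As a rough illustration of how these levels could translate into keep-or-drop decisions, the sketch below encodes the sampling rates and log thresholds from the table. The names (`VerbosityPolicy`, `keep_trace`, `forward_log`) are hypothetical, and the hash-based sampling is an assumption: the table does not say how the Collector implements its sampling, and hashing the trace ID is simply one common way to get a consistent per-trace decision. Treating "DEBUG logs dropped" as a minimum level of INFO is also an assumption.

```python
import hashlib
from dataclasses import dataclass

LOG_LEVELS = {"TRACE": 0, "DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4}


@dataclass(frozen=True)
class VerbosityPolicy:
    trace_sample_rate: float  # fraction of traces kept, 0.0 to 1.0
    min_log_level: str        # lowest log level that is forwarded


# Values taken from the verbosity table above.
POLICIES = {
    "coarse": VerbosityPolicy(0.05, "INFO"),
    "standard": VerbosityPolicy(0.20, "INFO"),
    "detailed": VerbosityPolicy(1.00, "DEBUG"),
    "debug": VerbosityPolicy(1.00, "TRACE"),
}


def keep_trace(trace_id: str, policy: VerbosityPolicy) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    value = int.from_bytes(digest[:8], "big")          # uniform in [0, 2**64)
    threshold = int(policy.trace_sample_rate * 2**64)  # integer compare avoids rounding at 100%
    return value < threshold


def forward_log(level: str, policy: VerbosityPolicy) -> bool:
    """Forward a log record only if its level meets the policy's minimum."""
    return LOG_LEVELS[level] >= LOG_LEVELS[policy.min_log_level]


if __name__ == "__main__":
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    for name, policy in POLICIES.items():
        print(name, keep_trace(trace_id, policy), forward_log("DEBUG", policy))
```

One property of this design is that a trace kept at a lower rate is also kept at every higher rate, so raising verbosity only adds data and never swaps one sample for another.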

Scoping Verbosity Changes

Flag targeting rules allow verbosity to be changed at multiple scopes without affecting the entire fleet:

Scope examples:
  service=payment-service                    → one service only
  service=payment-service AND env=production → one service in production
  env=staging                                → all services in staging
  team=payments                              → all services owned by a team
  [no targeting rules]                       → all services globally
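
The scope examples above can be modelled as an ordered rule list. This sketch assumes rules are evaluated top to bottom with the first match winning and a fallthrough default of "standard"; the function name `resolve_verbosity`, the rule representation, and the example services other than `payment-service` are hypothetical. It does not call the LaunchDarkly SDK.

```python
from typing import Dict, List, Tuple

# A rule is (attributes that must all match, flag value served on a match).
Rule = Tuple[Dict[str, str], str]


def resolve_verbosity(
    context: Dict[str, str], rules: List[Rule], default: str = "standard"
) -> str:
    """Return the value of the first rule whose clauses all match, else the default."""
    for clauses, value in rules:
        if all(context.get(key) == expected for key, expected in clauses.items()):
            return value
    return default


if __name__ == "__main__":
    rules: List[Rule] = [
        ({"service": "payment-service", "env": "production"}, "detailed"),
        ({"env": "staging"}, "detailed"),
    ]
    payment = {"service": "payment-service", "env": "production", "team": "payments"}
    other = {"service": "catalogue-service", "env": "production", "team": "browse"}
    print(resolve_verbosity(payment, rules))  # matches the first rule
    print(resolve_verbosity(other, rules))    # no rule matches, default is served
```

With an empty rule list every service receives the default, which corresponds to the "no targeting rules" global scope.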

The SRE on-call engineer has the authority to change verbosity for any scope in response to an active incident. Changes to the global production scope require SRE Lead approval.

Verbosity Change Runbook

During an incident, increasing telemetry verbosity is a standard first-response action:

1. Log into LaunchDarkly → Observability project → verbosity flags
2. Target the affected service(s): service=<name> AND env=production
3. Set value to "detailed"
4. Record the change in the incident Slack channel with timestamp
5. Confirm in Grafana that detailed traces are appearing (allow 60s for Collector to repoll)
6. After the incident is resolved, revert verbosity to its pre-incident level (normally "standard") within 4 hours
   (cost controls: "detailed" on production services for >4 hours requires SRE Lead approval)
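
The cost-control rule in step 6 lends itself to an automated reminder. The following is a minimal sketch under the assumption that the time a "detailed" override was enabled is recorded somewhere (for example, from the Slack note in step 4); the function name and parameters are hypothetical, and the clock here starts when the override was enabled.

```python
from datetime import datetime, timedelta, timezone

MAX_DETAILED = timedelta(hours=4)  # window from the runbook's cost-control rule


def detailed_overdue(
    enabled_at: datetime, now: datetime, approved_by_sre_lead: bool = False
) -> bool:
    """True if a production "detailed" override has exceeded 4 hours without approval."""
    return not approved_by_sre_lead and (now - enabled_at) > MAX_DETAILED


if __name__ == "__main__":
    enabled_at = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    now = datetime(2025, 1, 1, 17, 30, tzinfo=timezone.utc)
    if detailed_overdue(enabled_at, now):
        print("Revert verbosity or request SRE Lead approval")
```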

Intent

Consistent Telemetry Across the Fleet

The value of observability data is proportional to its consistency. When every service emits logs in the same structured JSON format with the same correlation ID field names, and propagates the same W3C trace context headers, a single query can trace a request end-to-end across dozens of services. When every service exposes the same baseline metrics using the RED method, fleet-wide dashboards work for new services on day one without any customisation.

The standards in this section define that consistent baseline. Teams are encouraged to add service-specific instrumentation on top of it — the baseline is a floor, not a ceiling.
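
For readers new to the RED method (Rate, Errors, Duration), the sketch below computes the three baseline numbers from a window of request records. It is illustrative only: in a real service these values would come from OTel metric instruments and the metrics backend, and the `Request` type, the choice of 5xx as "error", and the p95 duration are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    duration_ms: float
    status_code: int


def red_summary(requests: List[Request], window_seconds: float) -> dict:
    """Compute Rate, Errors, and Duration (p95, nearest-rank) for one window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": None}
    count = len(requests)
    errors = sum(1 for r in requests if r.status_code >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: ceil(0.95 * count), done in integers to avoid float rounding.
    rank = max(1, (95 * count + 99) // 100)
    return {
        "rate_per_s": count / window_seconds,
        "error_ratio": errors / count,
        "p95_ms": durations[rank - 1],
    }


if __name__ == "__main__":
    sample = [Request(12.0, 200), Request(15.0, 200), Request(480.0, 503)]
    print(red_summary(sample, window_seconds=60))
```

When every service reports these same three numbers under the same names, one dashboard template can serve the whole fleet.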

SLOs Over Symptom Alerting

Traditional threshold-based alerting accumulates noise over time: each static threshold fires on a symptom, whether or not users are affected. SLOs instead define what "good" looks like from the user's perspective and alert on error budget burn rate, the rate at which the organisation is spending its allowed failure budget. This approach reduces alert fatigue while keeping pages grounded in real user impact.
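
Burn rate is easiest to understand as arithmetic. The sketch below uses the common definition (observed error ratio divided by the allowed error ratio); the function names and the 30-day window are assumptions for illustration, not values set by this standard.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than "exactly on budget" the error budget is being spent."""
    budget = 1.0 - slo_target  # allowed failure ratio, e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole window's budget is spent if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x, budget gone in {hours_to_exhaustion(rate):.0f} h")
```

For a 99.9% SLO, a sustained 1.44% error ratio gives a burn rate of roughly 14.4, which would exhaust a 30-day budget in about 50 hours. A burn rate of 1 means the budget lasts exactly the length of the window.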

Incident Management as a Learning System

Every incident is an opportunity to understand something about the system that was previously unknown. The blameless post-mortem process converts that understanding into durable improvement. Post-mortems that blame individuals produce defensiveness and silence future early warnings. Post-mortems that interrogate systemic conditions produce actionable changes and a culture where problems surface quickly.

What You Will Find Here

| Page | Intent |
|---|---|
| Logging Standards | OTel log data model, structured JSON format, required fields, log levels, verbosity flag integration |
| Metrics & Monitoring | OTel metrics SDK, RED method, naming conventions, cardinality management, dashboard standards |
| Distributed Tracing | OTel tracing SDK, instrumentation requirements, sampling strategy, verbosity flag integration |
| Alerting & On-Call | Alert design principles, severity levels, PagerDuty routing, on-call rotation standards |
| SLOs, SLIs & Error Budgets | Defining, measuring, and acting on Service Level Objectives |
| Incident Management & Post-Mortems | Severity classification, response playbooks, blameless post-mortem process |
