Observability platform migrations are rarely simple. They’re not just “switch monitoring vendors” projects—they’re multi-quarter changes that touch production traffic, incident response, developer workflows, and reliability culture. If you rush, you break on-call. If you over-scope, you never finish.
This post lays out a pragmatic sequencing strategy—centered on Prometheus, OpenTelemetry (OTel), and Fluent Bit—that lets teams migrate in phases, reduce instrumentation debt, and prove correctness with acceptance tests before flipping the final switches. It’s inspired by and aligned with The New Stack’s migration guide on Prometheus, OpenTelemetry, and Fluent Bit, but framed as a maintenance-first blueprint rather than a single tool swap. (Primary reference: The New Stack, “Observability platform migration guide: Prometheus, OpenTelemetry, and Fluent Bit” https://thenewstack.io/observability-platform-migration-guide/)
Context: why migrations fail (and what “success” actually means)

Most legacy monitoring/logging stacks weren’t designed to be replaced in one go. They grew organically: exporters, bespoke dashboards, handwritten alert rules, and a zoo of agents. Over time, teams accumulate instrumentation debt: duplicated metrics, high-cardinality labels, inconsistent log formats, and traces that don’t connect to anything.
Engineering leaders feel the pain most when:
- Cloud and Kubernetes modernization is blocked by old agents or node-level assumptions.
- On-call confidence erodes due to noisy alerts and missing signals.
- Data costs rise because everything is scraped/forwarded with no governance.
- Teams can’t standardize across services (or across business units).
A successful observability migration is not “we installed OpenTelemetry.” Success is:
- On-call stayed stable throughout.
- Teams can detect and diagnose incidents at least as well as before.
- The new pipelines are operable (owned, documented, tested).
- The old stack can be retired without leaving blind spots.
Strategy first: don’t treat this as a tool swap
The New Stack guide emphasizes a key point: migration is a sequencing and risk-management problem. Prometheus, OpenTelemetry, and Fluent Bit are often at the center because they map cleanly to the three signals:
- Prometheus for metrics (scrape model, alerting rules, ecosystem).
- OpenTelemetry for vendor-neutral traces/metrics/logs pipelines and semantic conventions.
- Fluent Bit for lightweight, production-proven log forwarding and enrichment.
But the bigger idea is “phased migration with parallel run,” not “rip and replace.” The goal is to decouple collection from backend and instrumentation from export, so you can shift components independently.
A phased path that protects on-call
Phase 0: inventory, ownership, and blast-radius controls
Before you move any traffic:
- Inventory telemetry producers: Prometheus exporters, app metrics endpoints, node agents, sidecars, log shippers, tracing libraries.
- Map critical journeys: the alerts and dashboards that the on-call rotation actually uses.
- Define SLO/SLI dependencies: which metrics feed paging alerts vs. “nice-to-have” graphs.
- Set guardrails: sampling policies, label allow/deny lists, log redaction rules, and a max-cost posture.
Maintenance-first tip: treat this like a dependency upgrade. Identify “must-not-break” interfaces (alert names, label keys, dashboard queries) and freeze changes until you have a compatibility plan.
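One way to make that freeze concrete is a small, reviewable contract file checked in next to your alert rules. The file name, keys, and entries below are entirely illustrative (there is no standard format for this), but the act of writing them down forces the team to agree on what "must not break" actually means:

```yaml
# compatibility-contract.yaml (hypothetical format, for illustration only)
# Interfaces that must keep working until the cutover plan says otherwise.
frozen_alert_names:
  - HighErrorRate        # paging alert; name is referenced in runbooks
  - APILatencyP99High
frozen_label_keys:
  - service
  - env
  - cluster
frozen_dashboard_queries:
  - 'sum(rate(http_requests_total{env="prod"}[5m])) by (service)'
```

A CI check that diffs alert rule files against this contract can then block accidental renames during the migration.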
Phase 1: build the new data plane in parallel (no cutover yet)
Start by standing up the new pipelines without changing alerting behavior.
Metrics path (Prometheus → OTel Collector):
- Keep Prometheus scraping as-is.
- Introduce an OTel Collector as a receiver/processor/exporter layer where appropriate (for example, to fan out to multiple backends or normalize metadata).
- Alternatively, adopt OTel Collector to receive OTLP from instrumented apps while Prometheus continues scraping legacy endpoints.
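As a sketch of the fan-out idea, an OTel Collector can scrape existing Prometheus endpoints and export the same metrics to two backends at once. Endpoint addresses and backend names below are placeholders; verify receiver and exporter availability against the Collector distribution and version you actually deploy:

```yaml
# otel-collector.yaml (sketch; targets and endpoints are placeholders)
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: legacy-app
          static_configs:
            - targets: ["app.example.internal:9090"]
processors:
  batch: {}                      # batch before export to reduce request volume
exporters:
  prometheusremotewrite/old:     # keep feeding the existing backend
    endpoint: https://old-backend.example.internal/api/v1/write
  prometheusremotewrite/new:     # and the new one, in parallel
    endpoint: https://new-backend.example.internal/api/v1/write
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite/old, prometheusremotewrite/new]
```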
Logs path (Fluent Bit as the forwarder):
- Deploy Fluent Bit as a DaemonSet (Kubernetes) or host agent.
- Standardize parsing and enrichment (cluster, namespace, pod, service, environment).
- Forward logs to both old and new destinations (dual-output) if feasible.
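The dual-output pattern maps directly onto Fluent Bit's config model: one input, shared filters, and two output sections matching the same tag. Hostnames and ports below are placeholders; the plugin names (tail, kubernetes, forward, http) are standard Fluent Bit plugins:

```ini
# fluent-bit.conf (sketch; hosts are placeholders)
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Tag     kube.*

[FILTER]
    Name    kubernetes
    Match   kube.*

# Existing destination stays in place during the parallel run
[OUTPUT]
    Name    forward
    Match   kube.*
    Host    old-aggregator.example.internal
    Port    24224

# New destination receives the same stream
[OUTPUT]
    Name    http
    Match   kube.*
    Host    new-backend.example.internal
    Port    443
    tls     On
```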
Traces path (OTel SDKs → OTel Collector):
- Add OTel Collector gateways for OTLP ingest.
- Start with one “canary” service or a low-risk environment.
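A trace gateway is typically just a Collector with OTLP receivers on the standard ports and a single exporter toward the tracing backend. The backend endpoint below is a placeholder:

```yaml
# otel-gateway.yaml (sketch; backend endpoint is a placeholder)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch: {}
exporters:
  otlp:
    endpoint: tracing-backend.example.internal:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Pointing the canary service's OTLP exporter at this gateway keeps per-service config trivial while routing and sampling policy live in one place.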
Key principle: parallel run. You want a period where the new platform receives the same (or equivalent) telemetry as the old one, so you can validate it without waking up the on-call.
Phase 2: migrate collection incrementally (agents → pipelines)
Once the parallel plane is stable, move producers in small batches.
Metrics: from Prometheus agents/exporters to standardized pipelines
- Keep Prometheus where it fits. Many teams retain Prometheus for scraping and rule evaluation while using OTel for normalization and export. Migration doesn’t require “delete Prometheus.”
- Normalize labels and metadata early. This is where instrumentation debt hides: inconsistent service, app, env, and cluster labels. OTel processors (and Prometheus relabeling) can help.
- Watch cardinality like a hawk. Set limits and drop/aggregate problematic labels before exporting.
Incremental cutover idea: move one class of exporters at a time—node metrics, then Kubernetes state metrics, then app metrics—validating each step.
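On the Prometheus side, cardinality control usually means metric_relabel_configs applied at scrape time, before series ever reach storage or export. The job, target, and label names below are examples:

```yaml
# Prometheus scrape_config snippet (sketch; job and label names are examples)
scrape_configs:
  - job_name: app
    static_configs:
      - targets: ["app.example.internal:9090"]
    metric_relabel_configs:
      # Drop a label that explodes cardinality (e.g., one value per request)
      - action: labeldrop
        regex: request_id
      # Drop whole series you never chart or alert on
      - action: drop
        source_labels: [__name__]
        regex: "debug_.*"
```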
Logs: consolidate with Fluent Bit, then standardize
Fluent Bit shines when you need a small, fast, predictable log forwarder.
- Start by forwarding “raw” logs in parallel.
- Then standardize formats (JSON where possible), add correlation fields (trace/span IDs), and implement redaction.
- Prefer deterministic parsing rules over per-team ad hoc regex.
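In Fluent Bit terms, "deterministic parsing plus redaction" is a named JSON parser applied via the parser filter, followed by a modify filter that strips sensitive keys. The parser name, time format, and field names below are examples, not a standard:

```ini
# parsers.conf (sketch)
[PARSER]
    Name        app_json
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S%z

# fluent-bit.conf filter sections (sketch; field names are examples)
[FILTER]
    Name        parser
    Match       kube.*
    Key_Name    log
    Parser      app_json

# Redact sensitive fields before they leave the node
[FILTER]
    Name        modify
    Match       kube.*
    Remove      password
    Remove      authorization
```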
Traces: pay down instrumentation debt service-by-service
Tracing migrations succeed when you:
- Define a minimal propagation standard (W3C Trace Context) and enforce it.
- Start with a single critical path (e.g., API → worker → database).
- Use consistent service naming and semantic conventions.
This is also a good time to reduce library sprawl: fewer tracing SDKs, fewer custom interceptors, more shared instrumentation wrappers.
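The W3C Trace Context standard mentioned above boils down to a single traceparent header with four hex fields (version, trace ID, span ID, flags). A stdlib-only validator like the sketch below (a hypothetical helper, not part of any SDK) is handy for enforcing the propagation standard in integration tests:

```python
import re

# traceparent format per W3C Trace Context: version-traceid-spanid-flags
# e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str):
    """Return the four fields of a traceparent header, or None if invalid."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    fields = m.groupdict()
    # All-zero trace or span IDs are explicitly invalid per the spec.
    if fields["trace_id"] == "0" * 32 or fields["span_id"] == "0" * 16:
        return None
    return fields

print(parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
))
```

Wiring a check like this into a smoke test for each newly instrumented service catches broken propagation long before it shows up as disconnected traces.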
Phase 3: move “reads” last (dashboards, alerts, and on-call workflows)
The highest-risk step is changing what people see during incidents.
Keep on-call stable by:
- Cloning dashboards in the new system and running them side-by-side.
- Mirroring alerts to a non-paging channel first (or a shadow route).
- Comparing incident timelines: does the new system show the same spikes, errors, saturation, and deploy markers?
Only after you trust the new views should you migrate paging alerts.
A useful rule: write paths first, read paths last.
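The "mirror alerts to a non-paging channel" step can be expressed as an Alertmanager routing rule. The receiver names and the source label below are placeholders, and the matchers syntax assumes a reasonably recent Alertmanager:

```yaml
# alertmanager.yml routing snippet (sketch; receivers are placeholders)
route:
  receiver: pagerduty-oncall        # existing paging path, unchanged
  routes:
    # Shadow route: alerts evaluated by the new stack go to a non-paging
    # channel so firing behavior can be compared before paging is migrated.
    - matchers:
        - source="new-platform"
      receiver: slack-shadow
receivers:
  - name: pagerduty-oncall
  - name: slack-shadow
```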
“Done” criteria: acceptance tests for telemetry correctness
Modernization projects fail when “done” is vague. Define explicit acceptance tests for each signal.
Metrics “done” checklist
- Coverage: For each tier-1 service, the key golden signals exist (latency, traffic, errors, saturation) with agreed names.
- Parity: New queries match old queries within a defined tolerance (e.g., p95 latency within ±5% over 24 hours).
- Cardinality controls: Documented label allowlist and enforced limits; no runaway series growth.
- Alert parity: Shadow alerts match firing behavior over at least one release cycle.
- Operational SLO: Collector/Prometheus ingestion has its own dashboards and alerts (drop rate, queue length, scrape failures).
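The parity criterion above can be automated. The sketch below is a hypothetical helper; in practice the two series would come from the old and new backends' query APIs over the same time window:

```python
def within_tolerance(old: list[float], new: list[float], rel_tol: float = 0.05) -> bool:
    """True if every paired sample differs by at most rel_tol (0.05 = ±5%)."""
    if len(old) != len(new):
        return False
    for a, b in zip(old, new):
        if a == 0.0:
            if abs(b) > 1e-9:   # old value is zero: new must be (near) zero too
                return False
        elif abs(a - b) / abs(a) > rel_tol:
            return False
    return True

# Example: hourly p95 latencies (ms) from old vs. new backend
old_p95 = [120.0, 131.5, 118.2, 125.0]
new_p95 = [122.0, 130.0, 119.0, 126.1]
print(within_tolerance(old_p95, new_p95))  # all differences are under 5%
```

Running a check like this nightly during the parallel run turns "the numbers look close" into a pass/fail signal.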
Logs “done” checklist
- Completeness: For selected namespaces/services, log volume and event counts match expected baselines.
- Parse correctness: A sampled set of logs parses into expected fields; failures are measured.
- Redaction & compliance: PII/secret patterns are blocked; retention and access policies are enforced.
- Correlation: Trace/span IDs (when present) are preserved end-to-end.
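"Failures are measured" deserves a concrete mechanism. A sampled-batch checker like the hypothetical sketch below (the required field names are examples, not a standard) turns parse correctness into a number you can gate on:

```python
import json

def parse_failure_rate(lines: list[str], required_fields=("service", "level")) -> float:
    """Fraction of sampled log lines that fail to parse as JSON objects
    carrying the required fields. Field names here are illustrative."""
    if not lines:
        return 0.0
    failures = 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            failures += 1
            continue
        if not isinstance(record, dict) or not all(f in record for f in required_fields):
            failures += 1
    return failures / len(lines)

sample = [
    '{"service": "api", "level": "info", "msg": "ok"}',
    '{"service": "api", "msg": "missing level"}',
    'plain text line',
]
print(parse_failure_rate(sample))  # 2 of 3 sampled lines fail the check
```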
Traces “done” checklist
- Propagation: Trace context flows across service boundaries for critical paths.
- Sampling policy: Documented and validated; high-value endpoints are sampled appropriately.
- Service map sanity: No duplicate service names; stable resource attributes (env, version, cluster).
- Latency parity: Trace-derived latency aligns with metrics-derived latency for the same endpoints.
Cutover “done” checklist (the human part)
- Runbooks updated: On-call runbooks point to the new dashboards/log search/traces.
- Training completed: Engineers know where to look during incidents.
- Rollback plan tested: You can revert routing/export within minutes if needed.
- Decommission plan approved: Old agents, exporters, and sinks have an owner and a removal schedule.
These criteria turn migration into an engineering change with testable outcomes—similar to how you’d manage a major runtime upgrade.
Practical implications for engineering teams (and CTOs)
Treat observability like a product, not plumbing
Assign an owner, define SLAs for the telemetry pipeline, and ship changes through versioned config and review. This is classic software maintenance discipline applied to your ops stack.
Reduce “instrumentation debt” as part of modernization
As you refactor services for Kubernetes, cloud managed databases, or new frameworks, bake in standards:
- consistent service naming
- semantic conventions
- shared libraries/wrappers
- linting/CI checks for label cardinality and required attributes
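A CI lint for required attributes and forbidden labels can be very small. The policy sets and helper below are hypothetical, meant only to show the shape of such a check:

```python
# Hypothetical CI check: flag metric declarations that are missing required
# resource labels or that carry a label known to be unbounded in cardinality.
REQUIRED_LABELS = {"service", "env"}          # example policy, not a standard
FORBIDDEN_LABELS = {"request_id", "user_id"}  # unbounded-cardinality examples

def lint_metric(name: str, labels: set[str]) -> list[str]:
    """Return a list of policy violations for one metric declaration."""
    problems = []
    missing = REQUIRED_LABELS - labels
    if missing:
        problems.append(f"{name}: missing required labels {sorted(missing)}")
    unbounded = labels & FORBIDDEN_LABELS
    if unbounded:
        problems.append(f"{name}: high-cardinality labels {sorted(unbounded)}")
    return problems

print(lint_metric("http_requests_total", {"service", "env", "status"}))  # []
print(lint_metric("http_requests_total", {"service", "request_id"}))
```

Failing the build on a non-empty result keeps new services from reintroducing the debt the migration just paid down.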
This keeps the new platform from inheriting the old platform’s mess.
Use parallel run to unlock safe speed
Parallel run costs money temporarily, but it buys correctness and trust. It also enables incremental modernization: you can migrate one cluster, one namespace, or one business-critical workflow at a time.
Apply “data pipeline thinking” from adjacent domains
Large-scale ingestion and correctness challenges aren’t unique to observability. InfoQ’s coverage of Pinterest’s CDC-powered ingestion improvements is a reminder that the big wins come from designing reliable pipelines, validating end-to-end latency, and instrumenting the ingestion itself—not just swapping a component. The same mindset applies here: build a measurable pipeline, then move workloads progressively.
Conclusion: migrate in phases, finish with confidence
An observability migration is a high-stakes modernization project because it rewires the safety net while you’re still using it. The safest path is phased: stand up new pipelines in parallel, migrate collection incrementally (Prometheus/OTel/Fluent Bit), and only then move dashboards and paging. Most importantly, define “done” with acceptance tests for telemetry correctness, so you can cut over—and decommission—with confidence.
If you treat the migration as a sequencing problem rather than a tool swap, you can modernize your platform without breaking on-call—and end up with a stack that’s easier to operate, cheaper to run, and ready for the next wave of cloud and Kubernetes upgrades.
