👁️ Universal Observability Model (UOM) — v2 (0–4 scale)

A vendor-neutral baseline describing how platforms collect, normalize, store, explore, alert, correlate, and govern observability data (metrics, logs, traces, events, profiles).

Last updated: 2025-10-12T12:46:23Z


What changed in v2

  • Maturity scale normalized to 0–4 with explicit acceptance gates per phase.
  • Evidence discipline: To claim Level ≥ 3, platforms must provide shareable evidence (permalinks/exports) that reproduce the charts/queries used.
  • Cross-signal gate: Without cross-signal drill-downs (e.g., metrics → logs/traces on the same entity/time window), overall level is capped at 2 (Unified).
  • Governance gate: Level 4 requires RBAC, PII controls, cost guardrails, and ingest/query SLOs.
  • Atlas alignment: token→numeric mapping for the Atlas 👁️ Observability capability.

⚙️ Eight Observability Phases

| # | Phase | Definition | Typical Data | Expected Capability |
|---|-------|------------|--------------|---------------------|
| 1 | Instrumentation & Ingest | Capture signals from apps/infra safely | Metrics, logs, traces, K8s events, profiles | SDKs/agents/collectors; scrape/push; secure endpoints; backpressure |
| 2 | Normalization & Enrichment | Apply schemas; add rich context | OTel semantic conv., env/service/version/team | Tagging; parsing; K8s enrichment; time sync |
| 3 | Storage, Sampling & Indexing | Persist/index efficiently by signal | TSDB/log store/trace store; retention tiers | Tiered retention; sampling; cardinality control |
| 4 | Visualization & Exploration | Explore across signals & entities | Dashboards; *QL (Prom/Log/Trace); entity pages | Cross-signal drill-downs; templating |
| 5 | Alerting, SLOs & Detection | Notify on symptoms/outcomes | Rules, SLI/SLO burn, anomalies | Multi-signal alerts; dedupe; escalation; runbooks |
| 6 | Topology & Context | Understand relationships & changes | Service maps; K8s; CMDB/ownership | Dependency graphs; change awareness |
| 7 | Incident Evidence & Handoffs | Package findings for action | Links, snapshots, exports | Evidence packs; permalinks; ticket/chat exports |
| 8 | Governance, Cost & Reliability | Keep it safe, fast, affordable | RBAC, PII, quotas, cost; SLOs | Quotas, rate limits, usage & cost reports; ingest/query SLOs |

🧩 UOM Maturity Scale (0–4)

| Level | Label | Acceptance (must satisfy all lower levels) |
|-------|-------|--------------------------------------------|
| 0 | None | No systematic signals; ad-hoc SSH/logs only. |
| 1 | Single-signal | One signal (e.g., metrics or logs) with manual dashboards; minimal schema. |
| 2 | Unified | At least two signals (metrics + logs and/or traces) in one UX; basic enrichment and search; exportable queries. |
| 3 | Contextual | Cross-signal drill-downs; SLOs; entity/topology pages; evidence packs for incidents. |
| 4 | Governed/Optimized | Anomaly detection or tail-based sampling; dedupe/correlation; RBAC/PII; ingest/query SLOs; cost guardrails; tenancy. |

Gating rules

  • Evidence gate: No shareable permalinks/exports → cap at 2.
  • Cross-signal gate: No metrics↔logs/traces drill-down on entities → cap at 2.
  • Governance gate: No RBAC, PII controls, cost guardrails, or ingest/query SLOs → cap at 3.
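
These caps compose mechanically with the per-phase scores. A minimal sketch of the scoring logic, assuming the overall level is the minimum per-phase score (one reading of "must satisfy all lower levels"); the names here are illustrative, not part of the model:

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    phase_scores: dict[str, int]      # phase name -> 0..4
    has_shareable_evidence: bool      # evidence gate
    has_cross_signal_drilldown: bool  # cross-signal gate
    has_governance_controls: bool     # governance gate (RBAC/PII/cost + SLOs)

def overall_uom_level(a: Assessment) -> int:
    """Overall level is the weakest phase score, then capped by the gates."""
    level = min(a.phase_scores.values())
    if not a.has_shareable_evidence:
        level = min(level, 2)   # evidence gate: cap at 2
    if not a.has_cross_signal_drilldown:
        level = min(level, 2)   # cross-signal gate: cap at 2
    if not a.has_governance_controls:
        level = min(level, 3)   # governance gate: cap at 3
    return level
```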

🔎 Per‑phase expectations by level (condensed)

1) Instrumentation & Ingest

  • L0: No collectors/SDKs.
  • L1: One signal via agent/scrape; best‑effort endpoints.
  • L2: Two signals; authenticated endpoints; basic backpressure.
  • L3: Multi‑signal with HA collectors; buffering; drop/loss metrics exposed; ingest lag tracked.
  • L4: Fleet management; auto‑instrumentation; SLOs on ingest lag/loss.
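
A minimal sketch of the L3 "ingest lag tracked" expectation, assuming each record carries both its event timestamp and the time the pipeline accepted it (function and argument names are illustrative):

```python
import math
from datetime import datetime

def ingest_lag_p95(event_ts: list[datetime], ingest_ts: list[datetime]) -> float:
    """P95 ingest lag in seconds: time from signal creation to pipeline acceptance."""
    lags = sorted((i - e).total_seconds() for e, i in zip(event_ts, ingest_ts))
    rank = math.ceil(0.95 * len(lags))  # nearest-rank percentile
    return lags[rank - 1]
```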

2) Normalization & Enrichment

  • L0: Raw payloads.
  • L1: Minimal labels; ad‑hoc parsing.
  • L2: OTel semantic conventions; env/service/version/team labels; K8s enrichment.
  • L3: Consistent parsing rules; time sync; owner mapping.
  • L4: Policy‑validated schemas; schema evolution; PII redaction on ingest.

3) Storage, Sampling & Indexing

  • L0: Ephemeral storage.
  • L1: Single‑tier retention; no sampling/cardinality control.
  • L2: Retention per signal; basic sampling; index strategy.
  • L3: Tiered retention; tail‑based trace sampling; hot/cold moves; compaction.
  • L4: Adaptive sampling/cardinality budgets; cost/SLO‑aware retention.

4) Visualization & Exploration

  • L0: None.
  • L1: Basic dashboards per signal.
  • L2: Unified UI; saved queries; ad‑hoc *QL.
  • L3: Cross‑signal drill‑downs; entity pages; panel templating.
  • L4: Time travel/compare; guided queries; shareable permalinks with provenance.

5) Alerting, SLOs & Detection

  • L0: None.
  • L1: Static threshold alerts per signal.
  • L2: Multi‑signal alerts; maintenance windows; runbook links.
  • L3: SLI/SLO burn‑rates; dedupe; escalation & routing.
  • L4: Anomaly detection with quality tracking; noise budgets; change‑aware hints.
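
A burn rate above 1.0 means the error budget is being consumed faster than the SLO window allows. A minimal sketch of the standard ratio-SLI arithmetic (illustrative; vendors differ in windowing details):

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Burn rate = observed error rate / allowed error rate (1 - SLO target).

    Example: 420 failures out of 100_000 requests against a 99.8% SLO
    gives 0.0042 / 0.002 = 2.1x burn: the budget drains 2.1x too fast.
    """
    error_rate = bad_events / total_events
    error_budget = 1.0 - slo_target
    return error_rate / error_budget
```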

6) Topology & Context

  • L0: None.
  • L1: Manual lists.
  • L2: Basic service map/K8s topology.
  • L3: Ownership links; change awareness; blast‑radius.
  • L4: Versioned topology with time travel; CMDB/graph joins.

7) Incident Evidence & Handoffs

  • L0: None.
  • L1: Screenshots only.
  • L2: Query permalinks; export PNG/CSV/Markdown.
  • L3: One‑click evidence packs (query links, snapshots, context); ticket/chat exporters.
  • L4: Immutable evidence with provenance; retention guarantees.

8) Governance, Cost & Reliability

  • L0: None.
  • L1: Shared admin; best‑effort availability.
  • L2: Basic RBAC; usage reports.
  • L3: Quotas/rate limiters; PII redaction; tenancy.
  • L4: Ingest/query SLOs; cost guardrails & budgets; per‑tenant named‑graph or index isolation.
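
Quotas and rate limiters at L3 are often realized as per-tenant token buckets; a minimal, generic sketch (not any specific platform's implementation):

```python
import time

class TokenBucket:
    """Per-tenant ingest rate limiter: refills at `rate` tokens/s up to `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over quota: shed or buffer the record
```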

🧠 Neutral Observability Signal (JSON example)

```json
{
  "observability_signal": {
    "id": "uuid",
    "ts": "2025-10-11T12:00:00Z",
    "signal_type": "metric",
    "resource": {
      "service.name": "api-gateway",
      "service.version": "1.42.0",
      "k8s.namespace": "payments",
      "k8s.pod": "api-gw-7b4d7",
      "cloud.region": "eu-west-1"
    },
    "attributes": {
      "http.method": "GET",
      "http.route": "/checkout",
      "deployment.sha": "0f3a1c2"
    },
    "metric": {
      "name": "http.server.duration",
      "type": "histogram",
      "unit": "ms",
      "p95": 560
    },
    "links": {
      "trace_id": "6f1c2…",
      "commit": "0f3a1c2",
      "runbook": "kb://checkout-latency"
    },
    "slo_context": {
      "sli": "latency_p95",
      "slo_target": 300,
      "window": "1h",
      "burn_rate": 2.1
    }
  }
}
```
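
A minimal conformance check for the signal shape above; the required field set is an assumption for illustration, not part of the spec:

```python
REQUIRED_RESOURCE_KEYS = {"service.name"}  # assumed minimum; extend per policy

def validate_signal(doc: dict) -> list[str]:
    """Return a list of conformance problems; an empty list means the signal passes."""
    problems = []
    sig = doc.get("observability_signal", {})
    for field in ("id", "ts", "signal_type", "resource"):
        if field not in sig:
            problems.append(f"missing field: {field}")
    missing = REQUIRED_RESOURCE_KEYS - set(sig.get("resource", {}))
    problems += [f"missing resource attribute: {k}" for k in sorted(missing)]
    if sig.get("signal_type") not in {"metric", "log", "trace", "event", "profile"}:
        problems.append("signal_type must be one of metric/log/trace/event/profile")
    return problems
```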

✅ Feature Requirements (FR) & Acceptance Criteria (AC)

F1. Instrumentation & Ingest — SDKs/collectors; secure endpoints; buffering/backpressure.
AC: Ingest health exposes lag/loss; auth on endpoints; HA collector config exists.

F2. Normalization & Enrichment — OTel conventions; env/service/version/team; parsers.
AC: Sample records show normalized labels and K8s enrichment; parsing rules under version control.

F3. Storage, Sampling & Indexing — tiered retention; sampling; cardinality control.
AC: Policy file shows per‑signal retention; trace sampling configured; label cardinality dashboard available.

F4. Visualization & Exploration — cross‑signal drills; entity pages.
AC: From an alert panel, user can hop metrics→logs/traces on the same entity/time; saved permalinks exist.
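
The metrics→logs/traces hop in F4 amounts to carrying the same entity labels and time window across stores. A minimal sketch of building such drill-down targets; the URL shapes are hypothetical placeholders, not real product routes:

```python
from urllib.parse import urlencode

def drilldown_links(service: str, start: str, end: str) -> dict[str, str]:
    """Build logs/traces links scoped to the same entity and time window."""
    window = {"service": service, "from": start, "to": end}
    return {
        # Hypothetical endpoints: substitute your log/trace stores' real URLs.
        "logs": f"https://logs.example/explore?{urlencode(window)}",
        "traces": f"https://traces.example/search?{urlencode(window)}",
    }

# e.g. drilldown_links("api-gateway", "2025-10-11T11:55:00Z", "2025-10-11T12:05:00Z")
```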

F5. Alerting, SLOs & Detection — multi‑signal alerts; SLO burn; dedupe.
AC: Alert definition references SLI/SLO; dedupe/noise metrics are visible; maintenance windows defined.

F6. Topology & Context — service map; ownership; change awareness.
AC: Entity page shows owner/team and recent deploys/config changes.

F7. Evidence & Handoffs — exports & packs.
AC: Evidence pack export (Markdown/JSON) contains query links, snapshots, and context; ticket/chat exporters configured.

F8. Governance, Cost & Reliability — RBAC; PII; quotas; SLOs.
AC: RBAC policy applied; PII rules; usage & cost reports; ingest/query SLO dashboards.


📈 Quality KPIs & Target Bands (guide)

  • Ingest lag (P95): ≤ 6 s (L3), ≤ 2 s (L4).
  • Query latency — dashboard (P95): ≤ 3 s (L3), ≤ 1.5 s (L4).
  • Query latency — ad-hoc (P95): ≤ 7 s (L3), ≤ 3 s (L4).
  • SLO coverage: ≥ 50% of services (L3), ≥ 80% (L4).
  • Alert dedupe ratio: ≥ 30% (L3), ≥ 50% (L4).
  • Tail trace sampling coverage: ≥ 30% (L3), ≥ 60% (L4).
  • Evidence coverage: ≥ 90% of incidents have evidence packs (L3–L4).

Tune bands per stack/scale; use the same dataset across vendors.
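
A minimal sketch of scoring measured KPIs against these bands; the band values mirror the table above, and the key names are illustrative:

```python
# Target bands: (L3 threshold, L4 threshold, higher_is_better)
KPI_BANDS = {
    "ingest_lag_p95_s":       (6.0, 2.0, False),
    "query_dash_p95_s":       (3.0, 1.5, False),
    "query_adhoc_p95_s":      (7.0, 3.0, False),
    "slo_coverage_pct":       (50.0, 80.0, True),
    "alert_dedupe_ratio_pct": (30.0, 50.0, True),
    "tail_sampling_pct":      (30.0, 60.0, True),
}

def kpi_level(name: str, value: float) -> int | None:
    """Return 4 or 3 if the value meets that band, else None (below the L3 band)."""
    l3, l4, higher_is_better = KPI_BANDS[name]
    meets = (lambda t: value >= t) if higher_is_better else (lambda t: value <= t)
    if meets(l4):
        return 4
    if meets(l3):
        return 3
    return None
```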


🔁 Atlas alignment (👁️ Observability token → UOM level)

| Atlas token | UOM level |
|-------------|-----------|
| N/L | 0 |
| P/L | 1 |
| P/M | 2 |
| Y/M | 3 |
| Y/H | 4 |
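
The mapping is a straight lookup; a minimal sketch that rejects unknown tokens rather than guessing:

```python
ATLAS_TO_UOM = {"N/L": 0, "P/L": 1, "P/M": 2, "Y/M": 3, "Y/H": 4}

def atlas_to_uom(token: str) -> int:
    """Map an Atlas Observability token to its UOM level; reject unknown tokens."""
    try:
        return ATLAS_TO_UOM[token]
    except KeyError:
        raise ValueError(f"unknown Atlas token: {token!r}") from None
```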

🔍 Comparison Template (vs. UOM)

| Platform | Ingest (0–4) | Normalize (0–4) | Storage/Index (0–4) | Visualization (0–4) | Alerting/SLO (0–4) | Topology (0–4) | Evidence/Handoff (0–4) | Gov/Cost/Rel (0–4) | Overall UOM (0–4) | IngestLag_P50_ms | IngestLag_P95_ms | QueryDash_P95_ms | QueryAdhoc_P95_ms | SLO_Coverage_% | Alert_Dedupe_Ratio | TailTraceSampling_% | Evidence_Export (Y/N) | CrossSignal (Y/N) | Retention_Metrics_d | Retention_Logs_d | Retention_Traces_d | CostPerGB_USD | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Your Stack | | | | | | | | | | | | | | | | | | | | | | | |

📝 Conformance Checklist

  • Instrumentation: OTel SDK/auto‑inst; K8s & infra collectors; HA + backpressure.
  • Enrichment: Consistent service.*, version, env, team tags; parsing rules; K8s resource attrs.
  • Storage: Per‑signal retention policy; sampling/tiering; index compaction; cardinality dashboards.
  • Queries: PromQL/LogQL/TraceQL; entity pages; cross‑signal hops; shareable permalinks.
  • Alerting/SLO: SLO burn‑rates; dedupe; maintenance windows; runbook links.
  • Topology: Service map & K8s topology; ownership; change awareness.
  • Evidence: One‑click export to Markdown/JSON; ticket/chat exporters.
  • Gov/Cost/Rel: RBAC, PII masking, quotas; ingest/query SLOs; cost/usage reports.

📦 Appendix — Example Evidence Pack (Markdown)

```markdown
### Incident Evidence Pack — API Latency
- **Panel (PromQL)**: https://grafana/…
- **Logs (Loki)**: https://loki/…
- **Trace (Tempo)**: https://tempo/…
- **Entity page**: https://observability/services/api-gateway
- **SLO burn**: 2.1× / 1h window
- **Topology snapshot**: attached PNG
```
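
A minimal sketch of rendering such a pack from structured inputs so packs stay uniform across incidents; field names are illustrative:

```python
def render_evidence_pack(title: str, links: dict[str, str],
                         slo_burn: str, notes: str = "") -> str:
    """Render a Markdown evidence pack from incident links and SLO context."""
    lines = [f"### Incident Evidence Pack — {title}", ""]
    lines += [f"- **{label}**: {url}" for label, url in links.items()]
    lines.append(f"- **SLO burn**: {slo_burn}")
    if notes:
        lines += ["", notes]
    return "\n".join(lines)
```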

Ratings (UOM v2)

| Platform | Overall UOM Level (0–4) |
|----------|-------------------------|
| Atlassian – Rovo Dev | 1 — Single‑signal |
| AWS – Strands SDK (Agent Development Framework) | 2 — Unified |
| Cisco – Splunk AI Agents (Agentic Observability) | 4 — Governed/Optimized |
| Databricks – “Agent Bricks” (Lakehouse AI Agents) | 2 — Unified |
| Datadog – “Bits” AI Agents (SRE & DevOps) | 4 — Governed/Optimized |
| Dataiku – AI Agents (DSS Platform) | 2 — Unified |
| DuploCloud – AI CloudOps Help Desk | 3 — Contextual |
| Dynatrace – Davis AI (Autonomous Observability) | 4 — Governed/Optimized |
| Elastic – AI Assistant for Observability | 3 — Contextual |
| GitHub – Copilot (Coding & DevOps Assistant) | 0 — None |
| Google – Vertex AI Conversational Agent (Gen App Builder) | 1 — Single‑signal |
| IBM – AskIAM (Identity Assistant) | 2 — Unified |
| JFrog – “Fly” CI/CD Agent | 2 — Unified |
| Solo.io – Kagent (Kubernetes Assistant) | 3 — Contextual |

Platform Notes (evidence‑mapped to the 8 phases)

(Platform notes are unchanged from the source document and are not reproduced here.)


Summary & Guidance

  • Leaders (Level 4): Cisco/Splunk, Datadog, Dynatrace — meet evidence, cross-signal, and governance gates with mature anomaly/correlation and exportable incident evidence.
  • Strong Context (Level 3): Elastic, DuploCloud, Solo.io — solid cross-signal pivots and entity/topology context; add ingest/query SLOs and stronger auto-correlation to approach L4.
  • Unified but Limited (Level 2): AWS Strands, Databricks, Dataiku, IBM AskIAM, JFrog — can unify or analyze specific domains but miss cross-signal UX and/or evidence exports.
  • Basic/None (≤1): Atlassian Rovo, Google Vertex, GitHub Copilot — not observability platforms; treat them as adjunct assistants, not signal sources.

Next steps for bake-off: use a fixed dataset and attach permalinks/exports for all queries, cross-signal pivots, alert definitions, and evidence packs to validate the assigned levels against the UOM gates.

