# 👁️ Universal Observability Model (UOM) — v2 (0–4 scale)
A vendor-neutral baseline describing how platforms collect, normalize, store, explore, alert, correlate, and govern observability data (metrics, logs, traces, events, profiles).
Last updated: 2025-10-12T12:46:23Z
## What changed in v2
- Maturity scale normalized to 0–4 with explicit acceptance gates per phase.
- Evidence discipline: To claim Level ≥ 3, platforms must provide shareable evidence (permalinks/exports) that reproduce the charts/queries used.
- Cross-signal gate: Without cross-signal drill-downs (e.g., metrics → logs/traces on the same entity/time window), overall level is capped at 2 (Unified).
- Governance gate: Level 4 requires RBAC, PII controls, cost guardrails, and ingest/query SLOs.
- Atlas alignment: token→numeric mapping for the Atlas 👁️ Observability capability.
## ⚙️ Eight Observability Phases
| # | Phase | Definition | Typical Data | Expected Capability |
|---|---|---|---|---|
| 1 | Instrumentation & Ingest | Capture signals from apps/infra safely | Metrics, logs, traces, K8s events, profiles | SDKs/agents/collectors; scrape/push; secure endpoints; backpressure |
| 2 | Normalization & Enrichment | Apply schemas; add rich context | OTel semantic conv., env/service/version/team | Tagging; parsing; K8s enrichment; time sync |
| 3 | Storage, Sampling & Indexing | Persist/index efficiently by signal | TSDB/log store/trace store; retention tiers | Tiered retention; sampling; cardinality control |
| 4 | Visualization & Exploration | Explore across signals & entities | Dashboards; *QL (Prom/Log/Trace); entity pages | Cross-signal drill-downs; templating |
| 5 | Alerting, SLOs & Detection | Notify on symptoms/outcomes | Rules, SLI/SLO burn, anomalies | Multi-signal alerts; dedupe; escalation; runbooks |
| 6 | Topology & Context | Understand relationships & changes | Service maps; K8s; CMDB/ownership | Dependency graphs; change awareness |
| 7 | Incident Evidence & Handoffs | Package findings for action | Links, snapshots, exports | Evidence packs; permalinks; ticket/chat exports |
| 8 | Governance, Cost & Reliability | Keep it safe, fast, affordable | RBAC, PII, quotas, cost; SLOs | Quotas, rate limits, usage & cost reports; ingest/query SLOs |
## 🧩 UOM Maturity Scale (0–4)
| Level | Label | Acceptance (must satisfy all lower levels) |
|---|---|---|
| 0 | None | No systematic signals; ad-hoc SSH/logs only. |
| 1 | Single-signal | One signal (e.g., metrics or logs) with manual dashboards; minimal schema. |
| 2 | Unified | At least two signals (metrics + logs and/or traces) in one UX; basic enrichment and search; exportable queries. |
| 3 | Contextual | Cross-signal drill-downs; SLOs; entity/topology pages; evidence packs for incidents. |
| 4 | Governed/Optimized | Anomaly detection or tail-based sampling; dedupe/correlation; RBAC/PII controls; ingest/query SLOs; cost guardrails; tenancy. |
### Gating rules
- Evidence gate: no shareable permalinks/exports → cap at Level 2.
- Cross-signal gate: no metrics↔logs/traces drill-down on entities → cap at Level 2.
- Governance gate: no RBAC, PII controls, cost guardrails, or ingest/query SLOs → cap at Level 3.
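The gating rules above can be sketched as a scoring helper. This is an illustrative reading of the spec, not part of it; in particular, taking the overall level as the minimum of the eight phase scores is an assumption consistent with "must satisfy all lower levels".

```python
# Sketch: applying the UOM v2 gating rules to per-phase scores (0-4 each).
# Assumption: overall level = min of the eight phase scores, then capped by gates.

def overall_uom_level(phase_scores, has_evidence, has_cross_signal, has_governance):
    """Return the overall UOM level after applying the three gates."""
    level = min(phase_scores)
    if not has_evidence:
        level = min(level, 2)   # Evidence gate: no shareable permalinks/exports
    if not has_cross_signal:
        level = min(level, 2)   # Cross-signal gate: no metrics<->logs/traces drill-down
    if not has_governance:
        level = min(level, 3)   # Governance gate: no RBAC/PII/cost/SLOs
    return level

# A platform scoring 4 in every phase but lacking governance is capped at 3:
print(overall_uom_level([4] * 8, True, True, False))  # 3
```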
## 🔎 Per‑phase expectations by level (condensed)
### 1) Instrumentation & Ingest
- L0: No collectors/SDKs.
- L1: One signal via agent/scrape; best‑effort endpoints.
- L2: Two signals; authenticated endpoints; basic backpressure.
- L3: Multi‑signal with HA collectors; buffering; drop/loss metrics and ingest lag tracked.
- L4: Fleet management; auto‑instrumentation; SLOs on ingest lag/loss.
### 2) Normalization & Enrichment
- L0: Raw payloads.
- L1: Minimal labels; ad‑hoc parsing.
- L2: OTel semantic conventions; env/service/version/team labels; K8s enrichment.
- L3: Consistent parsing rules; time sync; owner mapping.
- L4: Policy‑validated schemas; schema evolution; PII redaction on ingest.
### 3) Storage, Sampling & Indexing
- L0: Ephemeral storage.
- L1: Single‑tier retention; no sampling/cardinality control.
- L2: Retention per signal; basic sampling; index strategy.
- L3: Tiered retention; tail‑based trace sampling; hot/cold moves; compaction.
- L4: Adaptive sampling/cardinality budgets; cost/SLO‑aware retention.
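The L3 expectation of tail-based trace sampling can be illustrated with a minimal keep/drop decision: retain every error or slow trace, and sample the healthy tail at a low baseline rate. Thresholds, field names, and rates here are illustrative assumptions, not spec requirements.

```python
# Sketch of a tail-based trace sampling decision (phase 3, L3 expectation).
# Assumed trace shape: {"error": bool, "duration_ms": number}.
import random

def keep_trace(trace, latency_threshold_ms=500, baseline_rate=0.05):
    """Keep all error/slow traces; sample the healthy tail at a low rate."""
    if trace.get("error"):
        return True                              # always keep failed requests
    if trace.get("duration_ms", 0) >= latency_threshold_ms:
        return True                              # always keep slow requests
    return random.random() < baseline_rate       # probabilistic tail keep

print(keep_trace({"error": True, "duration_ms": 120}))  # True
```

In a real collector this decision runs after the whole trace is assembled, which is what distinguishes tail-based from head-based sampling.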
### 4) Visualization & Exploration
- L0: None.
- L1: Basic dashboards per signal.
- L2: Unified UI; saved queries; ad‑hoc *QL.
- L3: Cross‑signal drill‑downs; entity pages; panel templating.
- L4: Time travel/compare; guided queries; shareable permalinks with provenance.
### 5) Alerting, SLOs & Detection
- L0: None.
- L1: Static threshold alerts per signal.
- L2: Multi‑signal alerts; maintenance windows; runbook links.
- L3: SLI/SLO burn‑rates; dedupe; escalation & routing.
- L4: Anomaly detection with quality tracking; noise budgets; change‑aware hints.
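The SLI/SLO burn-rate expectation at L3 can be sketched as follows. The 14.4× fast-burn threshold and the long/short window pairing follow common multiwindow alerting practice; they are assumptions, not part of the UOM spec.

```python
# Sketch: SLO burn-rate computation and a multiwindow paging check
# (phase 5, L3 expectation). Thresholds are illustrative.

def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being consumed relative to plan."""
    error_budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return error_ratio / error_budget

def should_page(err_long, err_short, slo_target=0.999, threshold=14.4):
    """Page only if both the long and short windows burn fast (cuts noise)."""
    return (burn_rate(err_long, slo_target) >= threshold and
            burn_rate(err_short, slo_target) >= threshold)

# 0.2% errors against a 99.9% SLO consumes budget at twice the allowed rate:
print(round(burn_rate(0.002, 0.999), 2))  # 2.0
```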
### 6) Topology & Context
- L0: None.
- L1: Manual lists.
- L2: Basic service map/K8s topology.
- L3: Ownership links; change awareness; blast‑radius.
- L4: Versioned topology with time travel; CMDB/graph joins.
### 7) Incident Evidence & Handoffs
- L0: None.
- L1: Screenshots only.
- L2: Query permalinks; export PNG/CSV/Markdown.
- L3: One‑click evidence packs (query links, snapshots, context); ticket/chat exporters.
- L4: Immutable evidence with provenance; retention guarantees.
### 8) Governance, Cost & Reliability
- L0: None.
- L1: Shared admin; best‑effort availability.
- L2: Basic RBAC; usage reports.
- L3: Quotas/rate limiters; PII redaction; tenancy.
- L4: Ingest/query SLOs; cost guardrails & budgets; per‑tenant named‑graph or index isolation.
## 🧠 Neutral Observability Signal (JSON example)

```json
{
  "observability_signal": {
    "id": "uuid",
    "ts": "2025-10-11T12:00:00Z",
    "signal_type": "metric",
    "resource": {
      "service.name": "api-gateway",
      "service.version": "1.42.0",
      "k8s.namespace": "payments",
      "k8s.pod": "api-gw-7b4d7",
      "cloud.region": "eu-west-1"
    },
    "attributes": {
      "http.method": "GET",
      "http.route": "/checkout",
      "deployment.sha": "0f3a1c2"
    },
    "metric": {
      "name": "http.server.duration",
      "type": "histogram",
      "unit": "ms",
      "p95": 560
    },
    "links": {
      "trace_id": "6f1c2…",
      "commit": "0f3a1c2",
      "runbook": "kb://checkout-latency"
    },
    "slo_context": {
      "sli": "latency_p95",
      "slo_target": 300,
      "window": "1h",
      "burn_rate": 2.1
    }
  }
}
```
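A consumer of the neutral signal might read it like this. The field names follow the JSON example; the breach rule (p95 above the SLO target) is an illustrative interpretation of the `slo_context` block.

```python
# Minimal sketch: parsing the neutral observability signal and flagging
# an SLO breach. Only a subset of fields is reproduced here.
import json

signal = json.loads("""{
  "observability_signal": {
    "signal_type": "metric",
    "metric": {"name": "http.server.duration", "unit": "ms", "p95": 560},
    "slo_context": {"sli": "latency_p95", "slo_target": 300,
                    "window": "1h", "burn_rate": 2.1}
  }
}""")["observability_signal"]

slo = signal["slo_context"]
breached = signal["metric"]["p95"] > slo["slo_target"]  # 560 ms vs 300 ms target
print(breached, slo["burn_rate"])  # True 2.1
```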
## ✅ Feature Requirements (FR) & Acceptance Criteria (AC)
- **F1. Instrumentation & Ingest** — SDKs/collectors; secure endpoints; buffering/backpressure.
  - AC: Ingest health exposes lag/loss; auth on endpoints; HA collector config exists.
- **F2. Normalization & Enrichment** — OTel conventions; env/service/version/team; parsers.
  - AC: Sample records show normalized labels and K8s enrichment; parsing rules under version control.
- **F3. Storage, Sampling & Indexing** — tiered retention; sampling; cardinality control.
  - AC: Policy file shows per‑signal retention; trace sampling configured; label cardinality dashboard available.
- **F4. Visualization & Exploration** — cross‑signal drill‑downs; entity pages.
  - AC: From an alert panel, a user can hop metrics→logs/traces on the same entity/time window; saved permalinks exist.
- **F5. Alerting, SLOs & Detection** — multi‑signal alerts; SLO burn rates; dedupe.
  - AC: Alert definitions reference SLIs/SLOs; dedupe/noise metrics are visible; maintenance windows defined.
- **F6. Topology & Context** — service map; ownership; change awareness.
  - AC: Entity page shows owner/team and recent deploys/config changes.
- **F7. Evidence & Handoffs** — exports & evidence packs.
  - AC: Evidence pack export (Markdown/JSON) contains query links, snapshots, and context; ticket/chat exporters configured.
- **F8. Governance, Cost & Reliability** — RBAC; PII controls; quotas; SLOs.
  - AC: RBAC policy applied; PII rules enforced; usage & cost reports available; ingest/query SLO dashboards exist.
## 📈 Quality KPIs & Target Bands (guide)
- Ingest lag (P95): ≤ 6 s (L3), ≤ 2 s (L4).
- Query latency — dashboard (P95): ≤ 3 s (L3), ≤ 1.5 s (L4).
- Query latency — ad-hoc (P95): ≤ 7 s (L3), ≤ 3 s (L4).
- SLO coverage: ≥ 50% of services (L3), ≥ 80% (L4).
- Alert dedupe ratio: ≥ 30% (L3), ≥ 50% (L4).
- Tail trace sampling coverage: ≥ 30% (L3), ≥ 60% (L4).
- Evidence coverage: ≥ 90% of incidents have evidence packs (L3–L4).
Tune bands per stack/scale; use the same dataset across vendors.
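For bake-offs, the bands above can be checked mechanically. Band values are taken from the table; the table/function names and the pass/fail logic are illustrative, and (per the note above) the bands themselves should be tuned per stack.

```python
# Sketch: checking measured KPIs against the L3/L4 target bands above.
# "max" = lower is better (latencies); "min" = higher is better (coverage/ratios).

BANDS = {  # kpi: (L3 bound, L4 bound, direction)
    "ingest_lag_p95_s":  (6.0, 2.0, "max"),
    "query_dash_p95_s":  (3.0, 1.5, "max"),
    "query_adhoc_p95_s": (7.0, 3.0, "max"),
    "slo_coverage_pct":  (50, 80, "min"),
    "alert_dedupe_pct":  (30, 50, "min"),
    "tail_sampling_pct": (30, 60, "min"),
}

def kpi_level(kpi, value):
    """Return the highest band (4, 3, or None) a measured KPI satisfies."""
    l3, l4, direction = BANDS[kpi]
    ok = (lambda b: value <= b) if direction == "max" else (lambda b: value >= b)
    if ok(l4):
        return 4
    if ok(l3):
        return 3
    return None

print(kpi_level("ingest_lag_p95_s", 4.0))  # 3 (within 6 s, but not within 2 s)
```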
## 🔁 Atlas alignment (👁️ Observability token → UOM level)
| Atlas token | UOM level |
|---|---|
| N/L | 0 |
| P/L | 1 |
| P/M | 2 |
| Y/M | 3 |
| Y/H | 4 |
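The token→level mapping is a straight lookup; a minimal sketch of the table above:

```python
# Sketch: Atlas observability token -> UOM level, exactly as tabled above.
ATLAS_TO_UOM = {"N/L": 0, "P/L": 1, "P/M": 2, "Y/M": 3, "Y/H": 4}

def uom_from_atlas(token):
    """Translate an Atlas token; raises KeyError for unknown tokens."""
    return ATLAS_TO_UOM[token]

print(uom_from_atlas("Y/M"))  # 3
```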
## 🔍 Comparison Template (vs. UOM)
| Platform | Ingest (0–4) | Normalize (0–4) | Storage/Index (0–4) | Visualization (0–4) | Alerting/SLO (0–4) | Topology (0–4) | Evidence/Handoff (0–4) | Gov/Cost/Rel (0–4) | Overall UOM (0–4) | IngestLag_P50_ms | IngestLag_P95_ms | QueryDash_P95_ms | QueryAdhoc_P95_ms | SLO_Coverage_% | Alert_Dedupe_Ratio | TailTraceSampling_% | Evidence_Export (Y/N) | CrossSignal (Y/N) | Retention_Metrics_d | Retention_Logs_d | Retention_Traces_d | CostPerGB_USD | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Your Stack |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
## 📝 Conformance Checklist
- Instrumentation: OTel SDK/auto‑inst; K8s & infra collectors; HA + backpressure.
- Enrichment: Consistent `service.*`, version, env, team tags; parsing rules; K8s resource attrs.
- Storage: Per‑signal retention policy; sampling/tiering; index compaction; cardinality dashboards.
- Queries: PromQL/LogQL/TraceQL; entity pages; cross‑signal hops; shareable permalinks.
- Alerting/SLO: SLO burn‑rates; dedupe; maintenance windows; runbook links.
- Topology: Service map & K8s topology; ownership; change awareness.
- Evidence: One‑click export to Markdown/JSON; ticket/chat exporters.
- Gov/Cost/Rel: RBAC, PII masking, quotas; ingest/query SLOs; cost/usage reports.
## 📦 Appendix — Example Evidence Pack (Markdown)
### Incident Evidence Pack — API Latency
- **Panel (PromQL)**: https://grafana/…
- **Logs (Loki)**: https://loki/…
- **Trace (Tempo)**: https://tempo/…
- **Entity page**: https://observability/services/api-gateway
- **SLO burn**: 2.1× / 1h window
- **Topology snapshot**: attached PNG
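A pack like the one above is mechanical to emit. This generator is a hypothetical sketch: the function name, link labels, and URLs are placeholders, and a real exporter would also attach snapshots and provenance.

```python
# Sketch: rendering an incident evidence pack (phase 7) as Markdown.
# All link values below are placeholders, not real endpoints.

def render_evidence_pack(title, links, slo_burn, window):
    """Build a Markdown evidence pack from query permalinks and SLO context."""
    lines = [f"### Incident Evidence Pack — {title}"]
    for label, url in links.items():
        lines.append(f"- **{label}**: {url}")
    lines.append(f"- **SLO burn**: {slo_burn}× / {window} window")
    return "\n".join(lines)

pack = render_evidence_pack(
    "API Latency",
    {"Panel (PromQL)": "https://grafana/...", "Logs (Loki)": "https://loki/..."},
    2.1, "1h",
)
print(pack.splitlines()[0])  # ### Incident Evidence Pack — API Latency
```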
## Ratings (UOM v2)
| Platform | Overall UOM Level (0–4) |
|---|---|
| Atlassian – Rovo Dev | 1 — Single‑signal |
| AWS – Strands SDK (Agent Development Framework) | 2 — Unified |
| Cisco – Splunk AI Agents (Agentic Observability) | 4 — Governed/Optimized |
| Databricks – “Agent Bricks” (Lakehouse AI Agents) | 2 — Unified |
| Datadog – “Bits” AI Agents (SRE & DevOps) | 4 — Governed/Optimized |
| Dataiku – AI Agents (DSS Platform) | 2 — Unified |
| DuploCloud – AI CloudOps Help Desk | 3 — Contextual |
| Dynatrace – Davis AI (Autonomous Observability) | 4 — Governed/Optimized |
| Elastic – AI Assistant for Observability | 3 — Contextual |
| GitHub – Copilot (Coding & DevOps Assistant) | 0 — None |
| Google – Vertex AI Conversational Agent (Gen App Builder) | 1 — Single‑signal |
| IBM – AskIAM (Identity Assistant) | 2 — Unified |
| JFrog – “Fly” CI/CD Agent | 2 — Unified |
| Solo.io – Kagent (Kubernetes Assistant) | 3 — Contextual |
## Platform Notes (evidence‑mapped to the 8 phases)
## Summary & Guidance
- Leaders (Level 4): Cisco/Splunk, Datadog, Dynatrace — meet evidence, cross-signal, and governance gates with mature anomaly/correlation and exportable incident evidence.
- Strong Context (Level 3): Elastic, DuploCloud, Solo.io — solid cross-signal pivots and entity/topology context; add ingest/query SLOs and stronger auto-correlation to approach L4.
- Unified but Limited (Level 2): AWS Strands, Databricks, Dataiku, IBM AskIAM, JFrog — can unify or analyze specific domains but miss cross-signal UX and/or evidence exports.
- Basic/None (≤1): Atlassian Rovo, Google Vertex, GitHub Copilot — not observability platforms; treat them as adjunct assistants, not signal sources.
Next steps for bake-off: use a fixed dataset and attach permalinks/exports for all queries, cross-signal pivots, alert definitions, and evidence packs to validate the assigned levels against the UOM gates.