🧭 Universal Diagnostic Model (UDM) — v2 (0–4 scale)

A vendor-neutral baseline describing how intelligent Ops/Observability systems detect, enrich, correlate, explain (RCA), and recommend/remediate incidents — with verifiable evidence.

Last updated: 2025-10-12T12:28:58Z


What changed in v2

  • Maturity scale normalized to 0–4 (sources previously mixed 0–4 and 0–5 scales), with clear acceptance gates per level.
  • Precise phase names: Signal Detection · Context Enrichment · Event Correlation & Classification · Root Cause Analysis · Recommendation/Remediation.
  • Atlas alignment: Token→numeric mapping for the Atlas 🔍 Diagnostics capability.
  • Operational metrics: TTFC/TTRC, Verified-RCA rate, false-positive rate added as non-functional targets.
  • Exports: Scorecard CSV and machine-readable rubric JSON.
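
To make the rate-style metrics above concrete, here is a minimal sketch of how Verified-RCA rate and false-positive rate might be computed from incident records. The record fields and both denominators are assumptions made for illustration (verified RCAs over RCAs produced, and false-positive detections over all detections); the model itself does not prescribe these formulas.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentRecord:
    """Hypothetical per-detection record; field names are illustrative only."""
    rca_produced: bool    # an RCA hypothesis was emitted for this detection
    rca_verified: bool    # the hypothesis was later confirmed
    false_positive: bool  # the detection turned out not to be a real incident


def verified_rca_rate(records: List[IncidentRecord]) -> float:
    """Assumed definition: verified RCAs divided by RCAs produced."""
    produced = [r for r in records if r.rca_produced]
    if not produced:
        return 0.0
    return sum(r.rca_verified for r in produced) / len(produced)


def false_positive_rate(records: List[IncidentRecord]) -> float:
    """Assumed definition: false-positive detections divided by all detections."""
    if not records:
        return 0.0
    return sum(r.false_positive for r in records) / len(records)


sample = [
    IncidentRecord(rca_produced=True, rca_verified=True, false_positive=False),
    IncidentRecord(rca_produced=True, rca_verified=False, false_positive=False),
    IncidentRecord(rca_produced=False, rca_verified=False, false_positive=True),
    IncidentRecord(rca_produced=True, rca_verified=True, false_positive=False),
]
```

With this sample, two of the three produced RCAs are verified and one of the four detections is a false positive.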

⚙️ Five Diagnostic Phases

| # | Phase | Definition | Typical Data | Expected Capability |
|---|-------|------------|--------------|---------------------|
| 1 | Signal Detection | Identify anomalies or deviations from expected behavior | Metrics (CPU/latency/errors), logs, traces, alerts | Thresholds, anomaly detectors, drift-aware baselines; capture detection reason & thresholds |
| 2 | Context Enrichment | Link signals with entities, ownership, and deploy/change context | Service maps, k8s/CMDB, deploy metadata | Stable IDs, dependency graph, change/owner joins, SLO context |
| 3 | Event Correlation & Classification | Group related signals and classify the probable domain/cause family | Multi-signal events across time windows | Correlation windows, clustering/causal hints, change-aware grouping |
| 4 | Root Cause Analysis (RCA) | Produce a testable, evidence-backed hypothesis explaining why the incident occurred | Enriched telemetry + historical baselines + change diffs | Structured hypothesis + verification plan; confidence; negative evidence considered |
| 5 | Recommendation / Remediation | Propose (or execute under guardrails) a mitigation with verification | Runbooks, IaC diffs, workflows | Risk-aware plan, preflight checks, approvals; rollback & post-verify steps |
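
As an illustration of the phase 4 expectation (a structured hypothesis with a verification plan, a confidence value, and negative evidence considered), the sketch below shows one possible record shape. The class name, field names, and the `is_testable` check are hypothetical and are not part of the model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RcaHypothesis:
    """Hypothetical structure for an RCA output; all names are illustrative."""
    statement: str     # the claimed cause, phrased so it can be falsified
    confidence: float  # assumed to be a score in [0, 1]
    supporting_evidence: List[str] = field(default_factory=list)  # queries/permalinks
    negative_evidence: List[str] = field(default_factory=list)    # causes ruled out
    verification_steps: List[str] = field(default_factory=list)   # counter-tests

    def is_testable(self) -> bool:
        """Assumed check: needs evidence, a verification plan, and a valid confidence."""
        return (
            0.0 <= self.confidence <= 1.0
            and bool(self.supporting_evidence)
            and bool(self.verification_steps)
        )


hypothesis = RcaHypothesis(
    statement="Latency regression caused by the most recent deploy",
    confidence=0.8,
    supporting_evidence=["query: p99 latency before/after deploy"],
    negative_evidence=["no CPU saturation on the affected hosts"],
    verification_steps=["roll back the deploy on one canary and re-measure p99"],
)
```

A hypothesis with no verification steps would fail `is_testable()`, which mirrors the idea that an explanation without a verification plan is not yet a testable one.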

🧩 UDM Maturity Scale (0–4)

| Level | Label | Acceptance (must satisfy this level and all lower levels) |
|-------|-------|-----------------------------------------------------------|
| 0 | None | No diagnostics beyond raw alerts/logs; no context; no evidence export. |
| 1 | Reactive | Detection via manual/threshold alerts; minimal labeling; ad-hoc triage notes. |
| 2 | Correlated | Adds multi-signal correlation or rule-based classification; entity/service mapping; links to evidence (queries/logs/traces). |
| 3 | Intelligent | Adds structured RCA with verification steps (counter-tests), confidence scoring, change awareness, and explainable evidence (permalinks/queries included). |
| 4 | Autonomous | Adds causal reasoning/graphs, automated counter-tests, early-finalize on high confidence, and guardrailed remediation (approvals/rollback), with tracked quality metrics (Verified-RCA rate, FP rate). |

Gating rule: A product’s UDM level is the highest level whose acceptance gates (and all below) are met. Any missing gate caps the level.
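
The gating rule can be sketched as a small function: walk the levels upward and stop at the first one with a missing gate. The gate identifiers below are hypothetical shorthand for the acceptance criteria in the maturity scale; the model does not define machine-readable gate names here.

```python
from typing import Dict, List, Set

# Hypothetical gate identifiers, paraphrasing the acceptance column per level.
GATES: Dict[int, List[str]] = {
    1: ["threshold_alerts", "minimal_labeling"],
    2: ["multi_signal_correlation", "entity_mapping", "evidence_links"],
    3: ["structured_rca", "confidence_scoring", "change_awareness"],
    4: ["causal_graphs", "automated_counter_tests",
        "guardrailed_remediation", "quality_metrics"],
}


def udm_level(gates_met: Set[str]) -> int:
    """Return the highest level whose gates, and all lower levels' gates, are met."""
    level = 0
    for candidate in sorted(GATES):
        if all(gate in gates_met for gate in GATES[candidate]):
            level = candidate
        else:
            break  # any missing gate caps the level here
    return level


# A product with every level 4 gate but no confidence scoring is capped at level 2.
all_gates = {gate for gates in GATES.values() for gate in gates}
capped = udm_level(all_gates - {"confidence_scoring"})
```

Stopping at the first failed level, rather than taking the maximum satisfied level, is what enforces the "and all below" part of the rule.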



