DevOps & SRE AI Platforms Atlas 2025

A vendor-neutral, community-driven wiki – Last updated: January 10, 2026

What This Is

This atlas provides comprehensive, evidence-based analysis of AI agent platforms across the DevOps/SRE/ITOps ecosystem.

Key Features:

Vendor-neutral analysis – No hype, just capabilities
Community-driven – Contributions welcome
Structured data – Machine-readable platform information
Actionable insights – Real implementation guidance

Platform Archetypes – Strengths/weaknesses by category
Platform Overview – One-page summary of all platforms
Platform Directory – Deep dives with capability scoring
Capability Framework – How we score vendors
Community Resources – Contribute, discuss, roadmap

Executive Summary – DevOps & SRE AI Platforms (Oct 11, 2025)

The headline: AI agents for DevOps/SRE have progressed from “chat over dashboards” to actionable co‑workers spanning triage → RCA → guided remediation → post‑incident learning. No single vendor covers the full lifecycle; the 2025 winning pattern is composable adoption: anchor on an observability‑first triage brain, add a workflow orchestrator for approvals/audit, and layer in guardrailed provisioning for safe apply/rollback.

Market map at a glance

Archetype	Best fit	Representative platforms	Core strength	Typical limits
Observability‑First	Fast triage, RCA, incident comms	Dynatrace Davis AI; Cisco (Splunk) AI Agents; Datadog Bits AI & Agents; Elastic AI Assistant; New Relic AI	Multi‑signal correlation, causal/hypothesis reasoning, narrative updates	Usually no direct apply beyond playbooks; relies on external approvals
Provisioning‑Focused	Safe, repeatable infra changes	Azure Copilot (Agent Mode); DuploCloud; Qovery; Kuberns	Generate/apply IaC with approvals and rollback	Lighter AIOps correlation; observability via integrations
Developer‑Centric & Frameworks	Code/PR changes; build your own agents	AWS Strands SDK; Atlassian Rovo Dev; GitHub Copilot Coding Agent; Zencoder; JFrog Fly; Azure AI Agent Service (Foundry)	Planning/tool orchestration, CI/CD fixes, AgentOps	Not a runtime ops console; direct infra apply limited
Enterprise Orchestrators	Cross‑team workflows, audit, CMDB	ServiceNow AI Agent Orchestrator; Salesforce Agentforce (OpsAI); PagerDuty AIOps	Ticket/change graph, approvals, runbooks	Deep telemetry depends on observability tools
Data & MLOps	AI quality, data pipelines	Databricks Agent Bricks; Snowflake Cortex Agents; Dataiku AI Agents	Evaluations/guardrails, lineage, model operations	Infra ops out of scope
Specialized Domain	Deep expertise for a niche	IBM AskIAM (IAM); Solo.io Kagent (K8s)	Accuracy within narrow scope	Limited breadth by design

Top 6 takeaways for 2025

From chat to action: leaders pair explanations with verifiable evidence (queries/logs/traces) and suggest the next safe step (runbook, PR, or controlled apply).
Human‑in‑the‑loop is default: approvals, RBAC and audit trails are table stakes for production change.
AgentOps matters: tracing, evaluations and policy hooks (data boundaries, PII controls, prompt safety) separate pilots from production.
Event ontology wins RCA: platforms with typed entities + change context consistently outperform generic LLMs on root cause.
IaC is the safety rail: even where agents can apply changes, diffs/Bicep/Terraform + approvals remain the safest path.
Compose the stack: the best outcomes pair one observability brain + one orchestrator + optional provisioning.

Quick picks by job‑to‑be‑done

Rapid triage & RCA with evidence: Dynatrace; Cisco (Splunk); Datadog Bits AI — complement with PagerDuty AIOps for comms/response.
End‑to‑end incident coordination & comms: PagerDuty AIOps; ServiceNow; Salesforce — complement with your observability suite.
Direct, approval‑gated infra changes: Azure Copilot (Agent Mode); DuploCloud — complement with Qovery for migration IaC.
Build governed, bespoke ops agents: Azure AI Agent Service (Foundry); AWS Strands — complement with Dataiku/Databricks for evaluations.
Repo/CI‑centric fixes (PRs/tests/docs): GitHub Copilot Coding Agent; Atlassian Rovo Dev; Zencoder — complement with JFrog Fly for release policy.
Kubernetes deep‑dive fixes: Solo.io Kagent — complement with APM/logs.
Identity requests & compliance: IBM AskIAM.
Data & MLOps guardrails/evals: Databricks Agent Bricks; Dataiku; Snowflake Cortex Agents.

Capability trends vs. last edition

Diagnostics: more change‑aware correlation and hypothesis testing; observability tools add post‑mortem drafting and Slack/Teams updates by default.
Activities & history: richer incident timelines and action replays (e.g., “apply last good remediation”), strongest in orchestrators.
Provisioning: clear split—some stay advisory; others (Azure Copilot, DuploCloud) execute with diffs, approvals, rollbacks.
Event ontology: convergence on OTel semantic conventions and CMDB/entity graphs; better joins across alerts, services and changes.
Agent observability: first‑class traces, evaluations, safety scores (Foundry, Dataiku, Databricks, Salesforce Command Center).

Reference architecture (what “good” looks like)

Telemetry backbone: OTel + your observability suite (Dynatrace/Splunk/Datadog/Elastic/New Relic).
Orchestration & audit: ServiceNow or Salesforce; add PagerDuty AIOps for real‑time response.
Provisioning lane: Azure Copilot (Agent Mode) or DuploCloud for controlled applies; Qovery for migration IaC.
Developer loop: GitHub Copilot Coding Agent / Atlassian Rovo Dev / Zencoder; JFrog Fly for release policy.
AgentOps layer: Azure AI Agent Service (Foundry), Dataiku, or Databricks for tracing, evaluations and safety.
Guardrails: RBAC, approval workflows, drift checks, read‑only dry‑runs by default and immutable audit logs.
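The guardrails line above can be made concrete with a small sketch. This is illustrative only and not any vendor's API: `ChangeRequest`, `ApprovalError` and `execute` are hypothetical names, the two-approver threshold is an assumption taken from the "two-person approval" guidance elsewhere in this atlas, and a plain Python list stands in for what would be an immutable audit store in production.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A proposed infrastructure change (hypothetical model, not a vendor API)."""
    diff: str                      # IaC diff the agent wants to apply
    requested_by: str              # identity of the agent or user proposing it
    approvers: set[str] = field(default_factory=set)


class ApprovalError(Exception):
    """Raised when a change lacks the required independent approvals."""


def execute(change: ChangeRequest, audit_log: list[str],
            dry_run: bool = True, required_approvals: int = 2) -> str:
    """Dry-run by default; apply only when explicitly requested and approved."""
    if dry_run:
        audit_log.append(f"DRY-RUN by {change.requested_by}: {change.diff}")
        return "dry-run"

    # The requester's own approval does not count.
    independent = change.approvers - {change.requested_by}
    if len(independent) < required_approvals:
        audit_log.append(
            f"DENIED for {change.requested_by}: {len(independent)} approval(s)"
        )
        raise ApprovalError(
            f"need {required_approvals} approvals, have {len(independent)}"
        )

    # A real system would call the provisioning tool here; this sketch only logs.
    audit_log.append(
        f"APPLIED by {change.requested_by}, approved by {sorted(independent)}"
    )
    return "applied"
```

Every path, including denials, writes to the audit log, so a reviewer can reconstruct what an agent attempted as well as what it changed.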

Risks & guardrails to enforce

False confidence / silent failures: require evidence links for every diagnosis; block summary‑only outputs.
Unsafe changes: enforce two‑person approval and automatic rollback plans; prefer IaC diffs over ad‑hoc commands.
Vendor lock‑in: favour platforms that support open agent protocols (MCP, A2A), exportable traces and open schemas.
Cost surprises: track agent run counts, tool invocations, LLM usage; set SLOs & budgets for agents.
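Two of the guardrails above, blocking summary-only diagnoses and budgeting agent usage, are simple enough to sketch. The names `Diagnosis`, `accept_diagnosis` and `AgentBudget` are hypothetical, and the limits are placeholders; a real deployment would pull usage from the platform's own metering rather than count it by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Diagnosis:
    """An agent's RCA output (hypothetical shape)."""
    summary: str
    evidence_links: list[str] = field(default_factory=list)


def accept_diagnosis(d: Diagnosis) -> bool:
    """Reject summary-only outputs: at least one non-empty evidence link is required."""
    return any(link.strip() for link in d.evidence_links)


@dataclass
class AgentBudget:
    """Tracks agent runs, tool invocations and LLM tokens against fixed caps."""
    max_runs: int
    max_tool_calls: int
    max_llm_tokens: int
    runs: int = 0
    tool_calls: int = 0
    llm_tokens: int = 0

    def record(self, tool_calls: int, llm_tokens: int) -> None:
        """Record one agent run and the resources it consumed."""
        self.runs += 1
        self.tool_calls += tool_calls
        self.llm_tokens += llm_tokens

    def exceeded(self) -> list[str]:
        """Return the names of every limit that has been crossed."""
        limits = {
            "runs": (self.runs, self.max_runs),
            "tool_calls": (self.tool_calls, self.max_tool_calls),
            "llm_tokens": (self.llm_tokens, self.max_llm_tokens),
        }
        return [name for name, (used, cap) in limits.items() if used > cap]
```

Checking `exceeded()` after each run gives a cheap hook for alerting or for pausing an agent before costs drift further.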

Bake‑off checklist (pair with Atlas KPIs)

TTFC/TTRC: measure on a fixed incident set, with linked evidence and a verification plan per RCA.
Actionability rate: percentage of cases with a safe next step (rollback/fix or runbook).
Closed‑loop rate: percentage of incidents where the platform both proposed a remediation and, once approved, executed it.
AgentOps quality: presence of traces/evals/guardrails, red‑team tests, data‑boundary controls.
Interoperability: MCP/OTel support; OTel‑CMDB mapping if you use an orchestrator.
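The two rate metrics in the checklist reduce to simple counting over a fixed incident set. The sketch below assumes a hypothetical `IncidentResult` record filled in by whoever scores the bake-off; the field names are illustrative and do not correspond to any platform's export format.

```python
from dataclasses import dataclass


@dataclass
class IncidentResult:
    """Scored outcome of one incident in the bake-off set (hypothetical shape)."""
    incident_id: str
    has_safe_next_step: bool      # rollback/fix or runbook was suggested
    remediation_proposed: bool
    remediation_executed: bool
    approved: bool                # execution went through an approval step


def _percentage(hits: int, total: int) -> float:
    """Percentage of hits, or 0.0 for an empty incident set."""
    return 100.0 * hits / total if total else 0.0


def actionability_rate(results: list[IncidentResult]) -> float:
    """Share of incidents where the platform offered a safe next step."""
    return _percentage(sum(r.has_safe_next_step for r in results), len(results))


def closed_loop_rate(results: list[IncidentResult]) -> float:
    """Share of incidents with a remediation proposed, approved and executed."""
    closed = sum(
        r.remediation_proposed and r.remediation_executed and r.approved
        for r in results
    )
    return _percentage(closed, len(results))
```

Running every candidate platform against the same incident set and the same scoring records keeps the resulting percentages comparable.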

Join the Community

This is a living document! Help us keep it current:

Contribute updates via GitHub
Join conversations via GitHub Issues (GitHub Discussions is not enabled)
Report bugs or request features via GitHub Issues as well
Meet the team & community at our Bucharest XDevOps Meetup

This project is open source and maintained by the xDevOps community.