AI-Agent Command Center for Observability Platform

May 6
4 min read

Next-generation AI operations: unified issue detection, investigation, and action in a single, inspectable UI.

My Role

As the only designer, I led the Agent Command Center's entire design process.

Owned product design across interaction model, system architecture, and UI
Conducted product discovery and defined core concepts such as Issues, Sessions, and the evidence model
Shaped product requirements and scope in close collaboration with engineering and product
Built a fully functional prototype directly in the production frontend codebase

Background

The product relied on a chatbot interface to drive AI-assisted workflows. In practice, this broke down in a high-stakes environment.

Chat mixed recommendations with data, making outputs hard to verify
Actions were suggested without clear evidence or impact
No persistent view of what the agent had done or decided
Reconstructing its behavior meant scrolling through verbose chat sessions
Investigations required jumping between chat, dashboards, and pipeline configs
Ownership of telemetry was fragmented across teams

The deeper problem wasn't the interface but a lack of context and accountability for taking action.

DevOps lacked service-level context and avoided changes that could risk data loss
Developers had little incentive to reduce or modify telemetry
Resolving issues required back-and-forth between teams

AI alone doesn’t solve this. Without visibility into decisions, clear safeguards, and defined ownership, it can’t be trusted to act.

Objective

Design a command center where DevOps defines guardrails and context, and an AI agent can take action within those constraints — handling notification and escalation as needed — while keeping humans in control and decisions inspectable.

UI-First Agent Model

The core insight behind the new approach is that AI should not replace the user interface but work through it. Instead of relying on chat-based AI interactions that mix intent with unstructured text, the system separates intent from evidence. This means:

Chat is used primarily to express intent and drive back-and-forth, agent-assisted troubleshooting, not to finalize decisions
All data and AI outputs live in structured UI components with clear, inspectable evidence
Actions are previewed before execution, letting users understand their impact on real data

This separates what the user asks from what the system does, making decisions inspectable and reducing risk.

System Model

The system organizes telemetry operations into a single, consistent model:

Issues: Problems detected by the AI agent that require attention
Sessions: Contextual investigations tied to specific issues
Tasks: Scheduled or recurring actions executed by the agent
Evidence Panel: Displays system-generated data and previews supporting decisions
Context: Editable business rules, conventions, and safeguards defined by users or learned over time

The agent runs continuously, updating issues and taking action within defined constraints. Users remain in control, but no longer need to manually stitch together context across tools.
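To make the model concrete, here is a minimal sketch of how these five concepts might be represented as data. It is an illustration only, not the product's actual schema: every class name, field, and sample value below is an assumption chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class Evidence:
    """System-generated data or a preview that supports a decision."""
    summary: str
    data: dict


@dataclass
class Issue:
    """A problem detected by the agent that requires attention."""
    id: str
    title: str
    severity: Severity
    services: list[str]
    owners: list[str]
    evidence: list[Evidence] = field(default_factory=list)


@dataclass
class Session:
    """An investigation tied to one specific issue."""
    issue_id: str
    # Chat carries intent; structured evidence is stored separately.
    messages: list[str] = field(default_factory=list)
    evidence: list[Evidence] = field(default_factory=list)


@dataclass
class Task:
    """A scheduled or recurring action executed by the agent."""
    name: str
    schedule: str  # hypothetical free-form schedule, e.g. "hourly"


@dataclass
class Context:
    """Editable business rules, conventions, and safeguards."""
    rules: dict[str, str] = field(default_factory=dict)
    allowed_actions: set[str] = field(default_factory=set)


# Hypothetical sample data for illustration.
issue = Issue(
    id="issue-1",
    title="Cardinality spike",
    severity=Severity.CRITICAL,
    services=["checkout"],
    owners=["team-payments"],
)
session = Session(issue_id=issue.id)
session.messages.append("Why did cardinality jump?")
session.evidence.append(Evidence("Series count by label", {"user_id": 48000}))
```

Keeping `messages` and `evidence` as separate fields on a session mirrors the split between what the user asks and what the system shows.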

Key Design Decisions

Separate intent from evidence

Chat is isolated from system output. Data, previews, and results live in structured UI components that act as the source of truth.

Issues as the primary interface

Replaces fragmented alerts and logs with a consistent model for status, severity, scope, and action. Works for both manual workflows and automation.

Simulation before execution

All changes are previewed on real data before being applied. Reduces risk in production environments.
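One way to picture such a preview is sketched below, assuming the change is "drop a label" and that telemetry can be sampled as a list of label sets. The function names and the sample data are hypothetical; a real pipeline would sample live series rather than a hard-coded list.

```python
def cardinality(series: list[dict[str, str]]) -> int:
    """Count distinct label sets, i.e. the number of unique time series."""
    return len({frozenset(labels.items()) for labels in series})


def simulate_drop_label(series: list[dict[str, str]], label: str) -> dict:
    """Preview the effect of dropping `label` without mutating the input."""
    projected = [
        {k: v for k, v in labels.items() if k != label} for labels in series
    ]
    before = cardinality(series)
    after = cardinality(projected)
    return {
        "action": f"drop label {label!r}",
        "before": before,
        "after": after,
        "reduction": before - after,
    }


# Hypothetical sample of label sets taken from one metric.
sample = [
    {"service": "checkout", "route": "/pay", "user_id": "u1"},
    {"service": "checkout", "route": "/pay", "user_id": "u2"},
    {"service": "checkout", "route": "/cart", "user_id": "u3"},
]

preview = simulate_drop_label(sample, "user_id")
print(preview)
```

The preview is computed on a copy, so the sampled data is left untouched until someone chooses to apply the change.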

Example

A service introduces a high-cardinality label, causing a rapid increase in metric cardinality.

The system detects the issue in real time, flags it as critical, and notifies the relevant owners. At the same time, it can simulate a mitigation strategy, such as aggregating or dropping the offending label.

If the issue poses immediate risk and no action is taken, the agent can apply a predefined safeguard within configured guardrails. All actions are visible, reversible, and tied to the affected services and owners.
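The escalation logic in this example could be expressed roughly as follows. This is a sketch under stated assumptions, not the agent's real policy: the `Guardrails` fields, the action names, and the grace-period rule are all invented here to show how "act only within configured guardrails" might be checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    """Hypothetical guardrail configuration defined by DevOps."""
    auto_apply_actions: frozenset[str]  # safeguards the agent may apply alone
    grace_period_s: int                 # how long owners have to respond


def decide(
    severity: str,
    seconds_unacknowledged: int,
    proposed_action: str,
    guardrails: Guardrails,
) -> str:
    """Return the agent's next step for an open issue."""
    if severity != "critical":
        return "notify"  # non-critical issues are left to humans
    if seconds_unacknowledged < guardrails.grace_period_s:
        return "notify"  # owners still have time to act
    if proposed_action not in guardrails.auto_apply_actions:
        return "escalate"  # outside the guardrails: never act alone
    return "apply_safeguard"


guardrails = Guardrails(
    auto_apply_actions=frozenset({"aggregate_label"}),
    grace_period_s=300,
)

print(decide("critical", 60, "aggregate_label", guardrails))
print(decide("critical", 600, "aggregate_label", guardrails))
print(decide("critical", 600, "drop_label", guardrails))
```

In this sketch, an action that is not explicitly allowed results in escalation rather than execution, which keeps humans in control of anything outside the configured safeguards.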

Interaction Flow

From detection to resolution in a single system

UI Screens

Agent Overview

Central entry point showing system status, active issues, and ongoing investigations

Issue Details

Structured view combining context, evidence, and available actions

Investigate Issue

Chat with the agent to investigate issues, with supporting evidence in structured output

Simulation

Preview of changes on real telemetry before applying them

Prototype

Building the prototype directly in a branch of the production frontend codebase meant the demo went well beyond what a typical design tool prototype could achieve. Rather than simulating interactions, it used the actual component library and routing, with mock data standing in for live telemetry. The result was a fully interactive, stateful experience where stakeholders could act on issues, work through investigations, and follow the full cycle from detection to resolution.

https://video.wixstatic.com/video/6ace89_d28a9f8651484bfb8eed501989fef5f8/1080p/mp4/file.mp4

Goals & Expected Impact

A control center for reliable, inspectable AI operations, designed to:

Consolidate detection, investigation, and action into one system
Make agent decisions visible and auditable
Enable automation within defined guardrails
Establish clear ownership and execution boundaries
Reduce back-and-forth between DevOps and developers
Validate actions before applying them to production

Next Steps

Define interaction patterns for new issue types, especially around data quality
Design the integration touchpoints that allow issues to be resolved at the data source, i.e. in downstream tools and source control, rather than in the pipeline
Improve how cross-system dependencies and downstream impact are surfaced in the UI
Explore AI-generated UI for presenting evidence, moving beyond rigid templates to surface insights in ways we haven't had to explicitly design for
