AI-Agent Command Center for Observability Platform

  • May 6
  • 4 min read

Next-generation AI operations: unified issue detection, investigation, and action in a single, inspectable UI.




My Role


As the only designer, I led the Agent Command Center's entire design process.

  • Owned product design across interaction model, system architecture, and UI

  • Conducted product discovery and defined core concepts such as Issues, Sessions, and the evidence model

  • Shaped product requirements and scope in close collaboration with engineering and product

  • Built a fully functional prototype directly in the production frontend codebase


Background


The product relied on a chatbot interface to drive AI-assisted workflows. In practice, this broke down in a high-stakes environment.

  • Chat mixed recommendations with data, making outputs hard to verify

  • Actions were suggested without clear evidence or impact

  • No persistent view of what the agent had done or decided

  • Reconstructing its behavior meant scrolling through verbose chat sessions

  • Investigations required jumping between chat, dashboards, and pipeline configs

  • Ownership of telemetry was fragmented across teams


The deeper problem wasn't the interface but a lack of context and accountability for taking action.

  • DevOps lacked service-level context and avoided changes that could risk data loss

  • Developers had little incentive to reduce or modify telemetry

  • Resolving issues required back-and-forth between teams


AI alone doesn’t solve this. Without visibility into decisions, clear safeguards, and defined ownership, it can’t be trusted to act.




Objective


Design a command center where DevOps defines guardrails and context, and an AI agent can take action within those constraints — handling notification and escalation as needed — while keeping humans in control and decisions inspectable.


UI-First Agent Model


The core insight behind the new approach is that AI should not replace the user interface but work through it. Instead of relying on chat-based interactions, where intent and system output blend into unstructured text, the system separates intent from evidence. This means:


  • Chat is used primarily to express intent and drive back-and-forth, agent-assisted troubleshooting, not to finalize decisions

  • All data and AI outputs live in structured UI components with clear, inspectable evidence

  • Actions are previewed before execution, letting users understand their impact on real data

This separates what the user asks from what the system does, making decisions inspectable and reducing risk.


System Model


The system organizes telemetry operations into a single, consistent model:


  • Issues: Problems detected by the AI agent that require attention

  • Sessions: Contextual investigations tied to specific issues

  • Tasks: Scheduled or recurring actions executed by the agent

  • Evidence Panel: Displays system-generated data and previews supporting decisions

  • Context: Editable business rules, conventions, and safeguards defined by users or learned over time


The agent runs continuously, updating issues and taking action within defined constraints. Users remain in control, but no longer need to manually stitch together context across tools.
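
To make the relationships between these concepts concrete, here is a minimal sketch of how they could be typed. It is an illustration only, not the product's actual schema: every class name, field, and value below is an assumption chosen for readability.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Evidence:
    """System-generated data shown in the Evidence Panel (hypothetical shape)."""
    title: str
    data: dict


@dataclass
class Session:
    """An investigation tied to one specific issue."""
    issue_id: str
    messages: list[str] = field(default_factory=list)       # chat: user intent
    evidence: list[Evidence] = field(default_factory=list)  # structured output


@dataclass
class Task:
    """A scheduled or recurring action executed by the agent."""
    name: str
    schedule: str  # schedule format is an assumption, e.g. "hourly"


@dataclass
class Context:
    """Editable business rules, conventions, and safeguards."""
    rules: list[str] = field(default_factory=list)


@dataclass
class Issue:
    """A problem detected by the agent that requires attention."""
    id: str
    summary: str
    severity: Severity
    affected_services: list[str]
    owners: list[str]
    status: str = "open"
    sessions: list[Session] = field(default_factory=list)

    def open_session(self) -> Session:
        """Start an investigation that stays linked to this issue."""
        session = Session(issue_id=self.id)
        self.sessions.append(session)
        return session


# Illustrative values only.
issue = Issue(
    id="issue-1",
    summary="Cardinality spike on a request metric",
    severity=Severity.CRITICAL,
    affected_services=["checkout"],
    owners=["team-payments"],
)
session = issue.open_session()
session.messages.append("Why did cardinality jump?")
session.evidence.append(Evidence("Offending label", {"label": "user_id"}))
```

Note how the session keeps chat messages and evidence in separate fields: the conversation carries intent, while the evidence list is what the structured UI would render.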


Key Design Decisions


Separate intent from evidence

Chat is isolated from system output. Data, previews, and results live in structured UI components that act as the source of truth.


Issues as the primary interface

Replaces fragmented alerts and logs with a consistent model for status, severity, scope, and action. Works for both manual workflows and automation.


Simulation before execution

All changes are previewed on real data before being applied. Reduces risk in production environments.
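
One way to picture a preview is as a dry run: the change is applied to a copy of sampled data and only the before/after effect is reported. The sketch below assumes a change that drops a label from a metric and measures impact as the number of distinct series. The function names and the sample data are hypothetical, not the product's implementation.

```python
def count_series(samples: list[dict[str, str]]) -> int:
    """Count distinct label sets, i.e. the cardinality of the metric."""
    return len({frozenset(labels.items()) for labels in samples})


def drop_label(samples: list[dict[str, str]], label: str) -> list[dict[str, str]]:
    """Return new samples without `label`; the input is left untouched."""
    return [{k: v for k, v in labels.items() if k != label} for labels in samples]


def simulate_drop(samples: list[dict[str, str]], label: str) -> dict:
    """Preview the effect of dropping `label` without applying it."""
    return {
        "label": label,
        "series_before": count_series(samples),
        "series_after": count_series(drop_label(samples, label)),
    }


# Illustrative sample: `user_id` multiplies the number of series.
samples = [
    {"service": "checkout", "route": "/pay", "user_id": "u1"},
    {"service": "checkout", "route": "/pay", "user_id": "u2"},
    {"service": "checkout", "route": "/cart", "user_id": "u3"},
    {"service": "checkout", "route": "/cart", "user_id": "u1"},
]
preview = simulate_drop(samples, "user_id")
print(preview)
```

Because the preview is computed on a copy, the user can compare the two counts in the UI and still decide not to apply the change.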


Example

A service introduces a high-cardinality label, causing a rapid increase in metric cardinality.

The system detects the issue in real time, flags it as critical, and notifies the relevant owners. At the same time, it can simulate a mitigation strategy, such as aggregating or dropping the offending label.

If the issue poses immediate risk and no action is taken, the agent can apply a predefined safeguard within configured guardrails. All actions are visible, reversible, and tied to the affected services and owners.
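
The escalation logic in this example can be sketched as a small decision function. This is a simplified illustration under assumed rules: the guardrail fields, action names, and thresholds are hypothetical and not taken from the product.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    """Hypothetical guardrail configured by DevOps."""
    allowed_actions: set[str]  # safeguards the agent may apply on its own
    grace_period_s: int        # how long owners have to respond first
    max_impact_pct: float      # upper bound on simulated impact


def decide(action: str, critical: bool, owner_responded: bool,
           seconds_since_notify: int, impact_pct: float,
           guardrail: Guardrail) -> str:
    """Return 'wait', 'apply', or 'escalate' for a proposed safeguard."""
    if owner_responded or not critical:
        return "wait"  # a human is handling it, or there is no immediate risk
    if seconds_since_notify < guardrail.grace_period_s:
        return "wait"  # owners still have time to act
    within_limits = (action in guardrail.allowed_actions
                     and impact_pct <= guardrail.max_impact_pct)
    return "apply" if within_limits else "escalate"


# Illustrative configuration and inputs.
guardrail = Guardrail(allowed_actions={"aggregate_label"},
                      grace_period_s=300, max_impact_pct=50.0)

print(decide("aggregate_label", True, False, 600, 20.0, guardrail))
print(decide("drop_metric", True, False, 600, 20.0, guardrail))
print(decide("aggregate_label", True, False, 60, 20.0, guardrail))
```

The agent only acts on its own when the issue is critical, the owners have not responded within the grace period, and the action falls inside the configured limits. Anything else waits or goes back to a human.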


Interaction Flow


From detection to resolution in a single system



UI Screens


Agent Overview

Central entry point showing system status, active issues, and ongoing investigations


Issue Details

Structured view combining context, evidence, and available actions


Investigate Issue

Chat with the agent to investigate issues, with supporting evidence shown as structured output


Simulation

Preview of changes on real telemetry before applying them


Prototype


Building the prototype directly in a branch of the production frontend codebase meant the demo went well beyond what a typical design tool prototype could achieve. Rather than simulating interactions, it used the actual component library and routing, with mock data standing in for live telemetry. The result was a fully interactive, stateful experience where stakeholders could act on issues, work through investigations, and follow the full cycle from detection to resolution.



Goals & Expected Impact


A control center for reliable, inspectable AI operations, designed to:

  • Consolidate detection, investigation, and action into one system

  • Make agent decisions visible and auditable

  • Enable automation within defined guardrails

  • Establish clear ownership and execution boundaries

  • Reduce back-and-forth between DevOps and developers

  • Validate actions before applying them to production


Next Steps


  • Define interaction patterns for new issue types, especially around data quality

  • Design the integration touchpoints that allow issues to be resolved at the data source, i.e. in downstream tools and source control, rather than in the pipeline

  • Improve how cross-system dependencies and downstream impact are surfaced in the UI

  • Explore AI-generated UI for presenting evidence, moving beyond rigid templates to surface insights in ways we haven't had to explicitly design for


 
 

© 2024 by Eric Wienke
