# Concepts: Core
This section explains the core ideas behind Redis SRE Agent and how pieces fit together.
## How the Agent Uses LLMs
The Redis SRE Agent is powered by Large Language Models (LLMs) — AI systems that understand natural language and can reason about complex problems. If you're new to LLMs, here's what you need to know:
### What is an LLM?
An LLM is an AI model trained on vast amounts of text that can:

- Understand questions in plain English
- Reason about technical problems
- Decide what actions to take
- Generate helpful responses
The agent uses OpenAI's GPT models, but the architecture supports other providers.
### The Agent Loop
When you ask the agent a question, it doesn't just generate text — it thinks, acts, and observes in a loop:
```mermaid
flowchart LR
    Q[Your Question] --> LLM[LLM Thinks]
    LLM -->|"Need more info?"| Tools[Call Tools]
    Tools -->|Results| LLM
    LLM -->|"Ready to answer"| Response[Final Response]
```
- **Think**: The LLM reads your question and decides what information it needs
- **Act**: It calls tools (Prometheus queries, Redis commands, log searches, etc.)
- **Observe**: It receives tool results and incorporates them into its reasoning
- **Repeat**: If more information is needed, it calls more tools
- **Respond**: Once it has enough context, it generates a comprehensive answer
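Here is a minimal sketch of that loop in Python using the OpenAI SDK's tool-calling interface. The model name, the `TOOLS` dispatch table, and the `agent_loop` helper are illustrative stand-ins, not the agent's actual implementation:

```python
import json

from openai import OpenAI

client = OpenAI()

# Illustrative dispatch table; the real agent wires in Prometheus, Redis CLI, etc.
TOOLS = {"slowlog_get": lambda **kwargs: '{"entries": []}'}

def agent_loop(question: str, tool_schemas: list[dict]) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        # Think: the model decides whether it needs a tool or can answer
        reply = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tool_schemas
        ).choices[0].message
        if not reply.tool_calls:
            return reply.content  # Respond: enough context has been gathered
        messages.append(reply)
        for call in reply.tool_calls:
            # Act: run the requested tool; Observe: feed its output back in
            result = TOOLS[call.function.name](**json.loads(call.function.arguments))
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": result}
            )
```

The real agent adds validation, logging, and error handling around each step (see Tool Calling below).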
### Model Tiers
The agent uses different model sizes for different tasks to balance speed and capability:
| Model Tier | Use Case | Example Tasks |
|---|---|---|
| Main Model | Complex reasoning | Triage analysis, recommendations, synthesis |
| Mini Model | Knowledge tasks | Searching docs, summarizing results |
| Nano Model | Simple classification | Query routing, yes/no decisions |
This multi-tier approach keeps the agent fast for simple queries while preserving deep reasoning for complex investigations.
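As a rough illustration, tier selection can be as simple as a lookup table. The model IDs and the `pick_model` helper below are assumptions for the sketch, not the agent's actual configuration:

```python
# Illustrative tier map; actual model IDs are configured per deployment
MODEL_TIERS = {
    "main": "gpt-4o",        # complex reasoning: triage, synthesis
    "mini": "gpt-4o-mini",   # knowledge tasks: doc search, summarization
    "nano": "gpt-4.1-nano",  # simple classification: routing, yes/no
}

def pick_model(task_kind: str) -> str:
    """Hypothetical helper: route a task to the cheapest capable tier."""
    tier = {"triage": "main", "summarize": "mini", "route": "nano"}.get(task_kind, "main")
    return MODEL_TIERS[tier]
```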
### Tool Calling
The LLM doesn't execute commands directly. Instead, it requests "tool calls" that the agent executes safely:
```mermaid
sequenceDiagram
    participant User
    participant LLM as LLM (Brain)
    participant Agent as Agent (Executor)
    participant Tools as Tools (Prometheus, Redis, etc.)
    User->>Agent: "Why is Redis slow?"
    Agent->>LLM: User question + available tools
    LLM->>Agent: "Call slowlog_get tool"
    Agent->>Tools: Execute slowlog_get
    Tools-->>Agent: Slowlog results
    Agent->>LLM: Tool results
    LLM->>Agent: "Call prometheus_query for latency"
    Agent->>Tools: Execute prometheus_query
    Tools-->>Agent: Latency metrics
    Agent->>LLM: Tool results
    LLM->>Agent: Final analysis + recommendations
    Agent->>User: Comprehensive response
```
This separation ensures:

- **Safety**: The LLM proposes actions; the agent validates and executes them
- **Auditability**: Every tool call is logged and traceable
- **Extensibility**: New tools can be added without retraining the model
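A minimal sketch of what the executor side of this contract might look like; the allowlist, registry, and logger wiring are illustrative, not the agent's actual code:

```python
import json
import logging

logger = logging.getLogger("sre_agent.tools")

# Illustrative allowlist: only registered tools may run
ALLOWED_TOOLS = {"slowlog_get", "prometheus_query"}

def execute_tool_call(name: str, raw_args: str, registry: dict) -> str:
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"LLM requested an unregistered tool: {name}")
    args = json.loads(raw_args)  # reject arguments that do not parse cleanly
    logger.info("tool_call name=%s args=%s", name, args)  # audit trail in
    result = registry[name](**args)  # the agent, not the LLM, executes
    logger.info("tool_result name=%s", name)  # audit trail out
    return result
```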
## Agent Architecture
The system uses three specialized agents selected automatically based on your query and context:
| Agent | When Used | Tools Available | Use Case |
|---|---|---|---|
| Knowledge Agent | No Redis instance linked | Knowledge base search only | General Redis questions, best practices, documentation lookup |
| Chat Agent | Instance linked (default) | All tools (Redis CLI, Prometheus, Loki, MCP servers, etc.) | Most queries: health checks, diagnostics, troubleshooting, status checks |
| Deep Triage Agent | Instance linked + explicit "deep" request | All tools + parallel multi-topic research | Deep investigation with trigger phrases: "deep triage", "deep research", "go deep", "deep dive" |
### Automatic Routing
The router (`redis_sre_agent/agent/router.py`) uses a fast LLM (the nano model) to categorize queries:
```text
Query received
│
├── No instance_id? ──────────────────► Knowledge Agent
│
└── Has instance_id?
    │
    ├── Deep keywords (deep triage, go deep, deep dive)? ──► Deep Triage Agent
    │
    └── Everything else (default) ────────────────────────► Chat Agent
```
**Note:** The Chat Agent has access to all the same tools as the Deep Triage Agent. Use the Chat Agent for most queries; only request "deep triage" when you need exhaustive multi-topic analysis.
You can override routing via the CLI (`--agent triage|chat|knowledge`) or the API (`preferred_agent` in user preferences).
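Stripped of the nano-model classification step, the routing rules reduce to a small decision function. This sketch uses hypothetical names and return values for illustration:

```python
DEEP_PHRASES = ("deep triage", "deep research", "go deep", "deep dive")

def route(query: str, instance_id: str | None, preferred_agent: str | None = None) -> str:
    """Illustrative reduction of the routing table to plain code."""
    if preferred_agent:  # explicit CLI/API override wins
        return preferred_agent
    if instance_id is None:
        return "knowledge"  # no live system to inspect
    if any(phrase in query.lower() for phrase in DEEP_PHRASES):
        return "triage"  # exhaustive multi-topic research
    return "chat"  # default for most queries
```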
### Agent Details
**Knowledge Agent** (`knowledge_agent.py`)

- Searches the vector knowledge base (runbooks, docs, KB articles)
- No live system access, so it's safe for general questions
- Fast responses, single-turn conversation
**Chat Agent** (`chat_agent.py`)

- Lightweight LangGraph workflow optimized for quick Q&A
- Full tool access but a simpler execution path
- Good for: "What's the current memory usage?", "Show me the slowlog", "How many connections?"
**Triage Agent** (`langgraph_agent.py`)
- Deep-research agent with parallel investigation tracks
- Breaks complex problems into multiple research topics
- Each topic runs its own tool-calling loop
- Synthesizes findings into comprehensive analysis with recommendations
- Good for: "Run a full health check", "I need comprehensive diagnostics", "Audit this instance"
```mermaid
flowchart LR
    User[User/Caller]
    Router[Router]
    KA[Knowledge Agent]
    CA[Chat Agent]
    TA[Triage Agent]
    KB[(Knowledge Base)]
    Prov[Providers<br>Prometheus/Loki/MCP]
    Redis[(Target Redis)]
    User --> Router
    Router -->|No instance| KA
    Router -->|Quick question| CA
    Router -->|Full triage| TA
    KA --> KB
    CA --> KB
    CA --> Prov
    CA --> Redis
    TA --> KB
    TA --> Prov
    TA --> Redis
```
## MCP (Model Context Protocol)
The agent supports MCP in two directions:
### 1. Agent as MCP Server (Expose to Claude/other AI)
Run the agent as an MCP server so Claude Desktop or other MCP clients can use it:
```bash
# HTTP mode (recommended for remote/Docker)
redis-sre-agent mcp serve --transport http --port 8081

# Stdio mode (for local Claude Desktop config)
redis-sre-agent mcp serve --transport stdio
```
Available MCP tools exposed:

- `redis_sre_deep_triage` - Start a comprehensive triage session
- `redis_sre_general_chat` - Quick Q&A with a Redis instance
- `redis_sre_database_chat` - Chat about a specific database
- `redis_sre_knowledge_query` - Query the knowledge base
- `redis_sre_knowledge_search` - Search documentation
- `redis_sre_list_instances` - List configured Redis instances
- `redis_sre_create_instance` - Register a new Redis instance
- `redis_sre_get_task_status` - Check task completion status
- `redis_sre_get_thread` - Get full results from a triage
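If you want to verify what the server exposes, the official MCP Python SDK can list its tools. A sketch assuming the HTTP transport from above; the `/mcp` endpoint path is an assumption, so adjust it to your deployment:

```python
import asyncio

# Requires the official MCP Python SDK: pip install mcp
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main() -> None:
    # Endpoint path is an assumption; point this at the agent's HTTP MCP server
    async with streamablehttp_client("http://localhost:8081/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # expect the redis_sre_* tools
asyncio.run(main())
```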
### 2. External MCP Servers as Tool Providers
Add external MCP servers to give the agent additional capabilities:
```yaml
# config.yaml
mcp_servers:
  # Memory server for long-term agent memory
  redis-memory-server:
    command: uv
    args: ["tool", "run", "--from", "agent-memory-server", "agent-memory", "mcp"]
    env:
      REDIS_URL: redis://localhost:6399

  # GitHub MCP server
  github:
    url: "https://api.githubcopilot.com/mcp/"
    headers:
      Authorization: "Bearer ${GITHUB_PERSONAL_ACCESS_TOKEN}"
```
The agent discovers tools from configured MCP servers at startup and makes them available to the LLM during triage.
See `docs/how-to/tool-providers.md` for more on the tool system.
## Tasks vs. Threads
- **Task**: How you interact with the agent. Create a task to run a query or triage. Each task has a `task_id` and tracks execution status (queued, running, completed, failed).
- **Thread**: What happened during execution. Contains the conversation history, messages, tool calls, and results. Each thread has a `thread_id`.
When you create a task, the API creates or reuses a thread to store the execution history. You can:
- Poll the task for status: `GET /api/v1/tasks/{task_id}`
- Read the thread for results: `GET /api/v1/threads/{thread_id}`
- Stream updates via WebSocket: `ws://localhost:8080/api/v1/ws/tasks/{thread_id}` (Docker Compose) or port 8000 (local)
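A minimal polling client tying these endpoints together; the request payload shape is an assumption, since only `task_id`, `thread_id`, and the status values are documented above:

```python
import time

import requests

BASE = "http://localhost:8080/api/v1"  # Docker Compose port; use 8000 for local runs

# Payload fields are illustrative assumptions
task = requests.post(f"{BASE}/tasks", json={"message": "Why is Redis slow?"}).json()
task_id, thread_id = task["task_id"], task["thread_id"]

# Poll the task until the worker reports a terminal status
while requests.get(f"{BASE}/tasks/{task_id}").json()["status"] not in ("completed", "failed"):
    time.sleep(2)

# The thread holds the conversation history, tool calls, and results
print(requests.get(f"{BASE}/threads/{thread_id}").json())
```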
```mermaid
sequenceDiagram
    participant Client
    participant API
    participant Worker
    Client->>API: POST /api/v1/tasks (message + context)
    API-->>Client: task_id, thread_id
    API->>Worker: enqueue task
    Worker->>Providers: query metrics/logs
    Worker->>Redis: check instance
    Worker-->>API: stream updates to thread
    Client->>API: GET /api/v1/tasks/{task_id}
    API-->>Client: status/result
```
## Schedules
Schedules define recurring health checks that run automatically:
- Each schedule specifies an interval (minutes, hours, days, weeks) and optionally a Redis instance
- When a schedule triggers, it creates a new Task and streams results to a Thread
- Manage schedules via CLI (`redis-sre-agent schedule`) or API (`/api/v1/schedules`)
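For example, creating a daily check through the API might look like the sketch below; the JSON field names are assumptions, since only the endpoint, the interval units, and the optional instance are specified here:

```python
import requests

# Field names are illustrative assumptions; check the schedules API for the real schema
requests.post(
    "http://localhost:8080/api/v1/schedules",
    json={
        "message": "Run a health check",
        "interval": {"unit": "days", "value": 1},  # minutes/hours/days/weeks
        "instance_id": "prod-cache",               # optional target instance
    },
)
```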
## Instances and Context
- Create instance records with `instance create` (CLI) or `POST /api/v1/instances` (API)
- Provide `instance_id` in your query to trigger live triage with tools
- Instance metadata (environment, usage, description) helps the agent understand context
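A hedged sketch of registering an instance via the API; the metadata keys mirror the fields mentioned above, while `name` and `connection_url` are assumed field names:

```python
import requests

# "name" and "connection_url" are assumed field names for this sketch
instance = requests.post(
    "http://localhost:8080/api/v1/instances",
    json={
        "name": "prod-cache",
        "connection_url": "redis://prod-cache:6379",
        "environment": "production",   # metadata the agent uses for context
        "usage": "session cache",
        "description": "Primary session store for the web tier",
    },
).json()

# Include the returned instance_id with queries to enable live triage tools
print(instance)
```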
## Providers (Integrations)
Pluggable integrations for metrics (Prometheus), logs (Loki), tickets (GitHub/Jira), clouds, and more.
Configure them via environment variables. See `docs/how-to/tool-providers.md`.
## Security and Secrets
Use a 32-byte master key for envelope encryption of secrets at rest.
See `docs/how-to/configuration/encryption.md`.
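A 32-byte key is easy to generate from Python's standard library. How the key should be encoded and where it is stored (environment variable, secret manager) is deployment-specific, so treat this as a sketch and check the linked guide:

```python
import base64
import secrets

# Generate a random 32-byte master key, printed base64-encoded for storage
print(base64.urlsafe_b64encode(secrets.token_bytes(32)).decode())
```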