AI Architect Series: How to Build an Efficient AI Ecosystem
Introduction: The Transformation Has Already Begun
If you are a DBA, Cloud Architect, or IT operations professional, you have likely noticed a seismic shift in how technology teams operate. The runbooks you maintained, the monitoring dashboards you built, the incident response playbooks you refined over years—all of these are being reimagined through the lens of AI.
This is not about replacing your expertise. It is about amplifying it.
The emergence of AI coding assistants and agentic AI systems has created a new discipline: AI Architecture—the practice of designing, orchestrating, and governing an ecosystem of AI tools that work together to enhance your operational capabilities. Just as Cloud Architecture transformed how we think about infrastructure, AI Architecture is transforming how we think about knowledge work itself.
This post—the first in the AI Architect series—draws insights from Anthropic’s The Complete Guide to Building Skills for Claude and maps those concepts to the realities of enterprise IT. Whether you manage Oracle databases, orchestrate cloud infrastructure, or run an operations center, this guide will help you understand how to build a personal AI ecosystem that makes you dramatically more effective.
From Prompt Engineering to Execution Design
The Old Way: Prompt Engineering
Most IT professionals start their AI journey the same way: typing ad-hoc questions into ChatGPT or Claude. “How do I tune the PGA in Oracle 19c?” or “Write me a Terraform module for an OCI VCN.” This is prompt engineering—crafting one-off requests and hoping for a useful response.
It works. But it does not scale.
Every time you start a new conversation, you lose context. You re-explain your environment, your naming conventions, your compliance requirements. You copy-paste responses and manually verify them. This is the equivalent of SSH-ing into a server to run ad-hoc commands—functional, but not operational.
The New Way: Execution Design
Anthropic’s skills guide introduces a paradigm shift: moving from prompt engineering to execution design. Instead of asking an AI to do something, you design a system that enables the AI to do it consistently, correctly, and autonomously.
The core insight is this:
MCP (Model Context Protocol) gives the AI the kitchen. Skills give it the recipe.
In IT terms: MCP is like giving your AI access to APIs, CLIs, and databases. Skills are like giving it your runbooks, SOPs, and tribal knowledge—packaged in a way the AI can actually use.
This distinction is critical for IT professionals. You already have the recipes. Your runbooks, your monitoring strategies, your incident response procedures—these are the institutional knowledge that makes your team effective. The question is: how do you encode that knowledge so an AI agent can leverage it?
Understanding the AI Skills Framework
Anthropic’s guide defines a skill as a structured set of instructions, packaged in a folder, that teaches an AI how to perform specific tasks consistently. Think of it as an Infrastructure-as-Code template, but for AI behavior.
Skill Anatomy: A SKILL.md File
At the heart of every skill is a SKILL.md file—a Markdown document with YAML frontmatter that tells the AI:
- When to activate (what triggers this skill?)
- How to execute (what are the step-by-step instructions?)
- What to reference (what scripts, templates, or documentation should it consult?)
```markdown
---
name: oracle-awr-analysis
description: Analyze Oracle AWR reports to identify performance bottlenecks
  and generate recommendations for database tuning.
---

## Instructions

1. Read the uploaded AWR report (HTML or text format)
2. Identify the Top 5 Wait Events by DB Time percentage
3. Cross-reference with the SQL Statistics section
4. Check the Instance Efficiency Percentages
5. Generate a summary with:
   - Root cause analysis for each major wait event
   - Specific init.ora parameter recommendations
   - SQL tuning recommendations with evidence
6. Format output as a structured report with severity levels
```
For someone who has spent years writing runbooks, this format should feel natural. The difference is that this runbook is not for a junior DBA to follow—it is for an AI agent to execute autonomously.
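To make the Level 1 metadata concrete: a runtime only needs to read the frontmatter to know a skill exists. The following is a minimal stdlib sketch of that step—`parse_frontmatter` is a hypothetical helper, not part of any Anthropic SDK.

```python
# Minimal sketch: extract only the Level-1 metadata (name + description)
# from a SKILL.md file, without loading the instruction body.
# "parse_frontmatter" is a hypothetical helper, not an official API.

def parse_frontmatter(text: str) -> dict:
    """Parse simple key/value pairs between the first two '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta, key = {}, None
    for line in lines[1:]:
        if line.strip() == "---":          # end of frontmatter
            break
        if ":" in line and not line.startswith(" "):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
        elif key:                          # indented continuation line
            meta[key] += " " + line.strip()
    return meta

skill = """---
name: oracle-awr-analysis
description: Analyze Oracle AWR reports to identify performance bottlenecks
  and generate recommendations for database tuning.
---
## Instructions
1. Read the uploaded AWR report
"""

meta = parse_frontmatter(skill)
print(meta["name"])   # oracle-awr-analysis
```

Note that the instruction body below the second `---` is never touched here; that is exactly what keeps Level 1 cheap.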
Progressive Disclosure: Loading Context On Demand
One of the most important concepts from the guide is progressive disclosure—a three-level system for loading information efficiently:
| Level | What Loads | When It Loads | IT Analogy |
|---|---|---|---|
| Level 1 | YAML frontmatter (name + description) | Always loaded in the system prompt | Service discovery — the AI knows what skills exist |
| Level 2 | Full SKILL.md body with instructions | Only when the AI determines the skill is relevant | On-demand documentation lookup |
| Level 3 | Reference files, scripts, templates | Only when explicitly needed during execution | Lazy loading of dependencies |
Why does this matter? Because context windows are like memory. Every token loaded into the context window is working memory consumed. If you dump 50 runbooks into a prompt, the AI’s performance degrades—just like an application with memory pressure.
Progressive disclosure is the AI equivalent of just-in-time resource allocation. Load what you need, when you need it.
For those managing large environments, this is analogous to how you architect monitoring solutions: you do not push every metric to every dashboard. You define tiers—high-level health checks always visible, detailed diagnostics available on drill-down, and raw telemetry accessible only when investigating specific incidents.
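The three levels can be sketched as a lazy loader: metadata parsed eagerly, the body and reference files read only when asked for. This is a conceptual stdlib illustration, not Anthropic's implementation; the `Skill` class and file layout are assumptions.

```python
import tempfile
from pathlib import Path

class Skill:
    """Hypothetical lazy loader illustrating three-level progressive disclosure."""

    def __init__(self, folder: Path):
        self.folder = folder
        text = (folder / "SKILL.md").read_text()
        _, front, self._body = text.split("---", 2)
        # Level 1: frontmatter is parsed eagerly (always "in the system prompt")
        self.meta = dict(
            line.split(":", 1) for line in front.strip().splitlines() if ":" in line
        )

    def body(self) -> str:
        # Level 2: full instructions, read only when the skill is judged relevant
        return self._body.strip()

    def reference(self, name: str) -> str:
        # Level 3: reference files, read only when execution needs them
        return (self.folder / name).read_text()

# Build a throwaway skill folder for the demo
root = Path(tempfile.mkdtemp()) / "awr-analysis"
root.mkdir(parents=True)
(root / "SKILL.md").write_text(
    "---\nname: awr-analysis\ndescription: Analyze AWR reports\n---\nRead the report."
)
(root / "sizing.md").write_text("Start with 2 OCPUs per 100 sessions.")

s = Skill(root)
print(s.meta["name"].strip())    # Level 1 only: awr-analysis
print(s.body())                  # Level 2: Read the report.
print(s.reference("sizing.md"))  # Level 3, loaded on demand
```

Until `body()` or `reference()` is called, only the few frontmatter tokens occupy context—the working-memory analogy made literal.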
Building Your AI Ecosystem: A Practical Architecture
Now let us translate these concepts into a practical architecture that IT professionals can implement today.
Layer 1: The Foundation — AI Coding Assistants
Your AI ecosystem starts with a capable AI assistant. Today, the leading options include:
- Claude (Anthropic) — via Claude.ai, Claude Code CLI, or API
- Gemini (Google) — via Gemini CLI, Gemini Canvas, or API
- ChatGPT (OpenAI) — via ChatGPT, Codex CLI, or API
- GitHub Copilot — integrated IDE experience
These are your compute engines. Just as you would not run production workloads on a single VM, you should not rely on a single AI tool. Each has strengths:
| Capability | Best Suited For |
|---|---|
| Deep reasoning and analysis | Complex architecture reviews, root cause analysis |
| Code generation | Terraform/Ansible modules, shell scripts, SQL queries |
| Document synthesis | Runbook creation, incident post-mortems, design docs |
| Multi-turn conversation | Iterative debugging, exploratory analysis |
Layer 2: The Connective Tissue — Model Context Protocol (MCP)
MCP is an open protocol that gives AI assistants access to external tools and data sources. Think of it as an API gateway for AI agents.
For IT professionals, MCP servers can connect your AI to:
- Database systems — query execution, schema exploration, performance metrics
- Cloud APIs — OCI, AWS, Azure, GCP resource management
- Monitoring platforms — metrics queries, log analysis, alert management
- Version control — Git operations, code review, change tracking
- Ticketing systems — Jira, ServiceNow ticket creation and updates
```
┌─────────────────────────────────────┐
│        AI Assistant (Claude)        │
│                                     │
│  ┌──────────┐    ┌──────────────┐   │
│  │  Skills  │    │    Memory    │   │
│  │ (Recipes)│    │  (Context)   │   │
│  └──────────┘    └──────────────┘   │
│                                     │
└─────────────┬───────────────────────┘
              │ MCP Protocol
    ┌─────────┼──────────┐
    │         │          │
    ▼         ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐
│Database│ │ Cloud  │ │Monitor │
│ Server │ │  API   │ │  API   │
│(Oracle,│ │(OCI,   │ │(APM,   │
│ MySQL) │ │ AWS)   │ │ LA)    │
└────────┘ └────────┘ └────────┘
```
The power of MCP is that it transforms your AI from a conversational chatbot into an operational agent that can actually query your systems, analyze real data, and take actions.
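MCP itself is a JSON-RPC-based protocol with its own handshake and message shapes; as a conceptual illustration only, here is a stdlib sketch of the dispatch pattern behind the "API gateway" idea. The `tool` decorator, `handle` function, and `query_active_sessions` are hypothetical names, not part of any MCP SDK.

```python
import json

# Conceptual sketch of tool dispatch: the assistant names a tool,
# the server routes the call and returns a structured result.
# This is NOT the real MCP wire protocol.

TOOLS = {}

def tool(fn):
    """Register a callable so the agent can invoke it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def query_active_sessions(db: str) -> list:
    # A real server would query V$SESSION through a database driver here.
    return [{"sid": 101, "status": "ACTIVE", "db": db}]

def handle(request_json: str) -> str:
    req = json.loads(request_json)
    result = TOOLS[req["tool"]](**req.get("args", {}))
    return json.dumps({"tool": req["tool"], "result": result})

print(handle('{"tool": "query_active_sessions", "args": {"db": "PRODCDB"}}'))
```

The value of standardizing this layer is that any assistant speaking the protocol can use any server exposing it—tools become infrastructure, not per-chatbot integrations.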
Layer 3: The Intelligence — Skills Library
This is where your institutional knowledge lives. Your skills library is a collection of SKILL.md files organized by domain:
```
.agents/
├── skills/
│   ├── database/
│   │   ├── awr-analysis/
│   │   │   ├── SKILL.md
│   │   │   └── scripts/
│   │   │       └── parse_awr.py
│   │   ├── capacity-planning/
│   │   │   ├── SKILL.md
│   │   │   └── references/
│   │   │       └── sizing_guidelines.md
│   │   └── incident-response/
│   │       ├── SKILL.md
│   │       └── templates/
│   │           └── incident_report.md
│   ├── cloud/
│   │   ├── oci-deployment/
│   │   │   └── SKILL.md
│   │   └── cost-optimization/
│   │       └── SKILL.md
│   └── monitoring/
│       ├── log-analysis/
│       │   └── SKILL.md
│       └── alert-triage/
│           └── SKILL.md
└── workflows/
    ├── weekly-health-check.md
    └── incident-response.md
```
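A tree like this can be indexed programmatically to build the Level 1 "service discovery" catalog. The sketch below walks a skills directory for SKILL.md files using only the standard library; `build_catalog` and the demo layout are assumptions for illustration.

```python
import tempfile
from pathlib import Path

# Sketch: build a name -> folder catalog by walking the skills tree.
# Only the frontmatter "name:" line is read -- Level 1 discovery.

def build_catalog(root: Path) -> dict:
    catalog = {}
    for skill_md in sorted(root.rglob("SKILL.md")):
        for line in skill_md.read_text().splitlines():
            if line.startswith("name:"):
                catalog[line.split(":", 1)[1].strip()] = skill_md.parent
    return catalog

# Demo with a throwaway tree mirroring the layout above
root = Path(tempfile.mkdtemp())
for rel in ["database/awr-analysis", "cloud/cost-optimization"]:
    d = root / "skills" / rel
    d.mkdir(parents=True)
    (d / "SKILL.md").write_text(f"---\nname: {rel.split('/')[-1]}\n---\n")

print(sorted(build_catalog(root)))   # ['awr-analysis', 'cost-optimization']
```

Because the catalog is just files in Git, it versions, reviews, and diffs exactly like infrastructure code.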
Each skill encodes a specific capability:
Example: Database Incident Response Skill
```markdown
---
name: db-incident-response
description: Triage and diagnose Oracle database performance incidents
  using AWR, ASH, and system metrics. Generate structured incident reports.
---

## Trigger Conditions

Activate when the user reports:

- Database performance degradation
- Application timeout errors
- High wait event alerts
- ORA- errors in production

## Workflow

### Phase 1: Triage (< 5 minutes)

1. Query current active sessions: `SELECT * FROM V$SESSION WHERE STATUS = 'ACTIVE'`
2. Check for blocking sessions: `SELECT * FROM V$LOCK WHERE BLOCK > 0`
3. Review the latest AWR snapshot for Top Wait Events
4. Classify severity: P1 (outage), P2 (degraded), P3 (warning)

### Phase 2: Diagnosis (< 15 minutes)

1. Generate an ASH report for the incident window
2. Identify top SQL by elapsed time
3. Check for plan changes via DBA_HIST_SQL_PLAN
4. Review system statistics (CPU, I/O, memory)

### Phase 3: Report Generation

Use the template in `templates/incident_report.md` to generate:

- Timeline of events
- Root cause analysis
- Immediate actions taken
- Long-term recommendations
```
This is your runbook, but supercharged. The AI does not just read it—it executes it, querying your actual database through MCP, analyzing real metrics, and generating a report you can hand to management.
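Deterministic steps inside a skill are often best delegated to a script the AI calls rather than reasoned about free-form. The Phase 1 severity classification, for example, could be a small helper like the one below—the function name and thresholds are illustrative assumptions, not Oracle guidance.

```python
# Sketch of the Phase 1 "classify severity" step as a deterministic helper
# a skill's scripts/ folder might contain. Thresholds are illustrative only.

def classify_severity(active_sessions: int, blocking_sessions: int,
                      top_wait_pct: float) -> str:
    """Map triage metrics to the P1/P2/P3 scheme used in the skill above."""
    if blocking_sessions > 10 or top_wait_pct >= 90:
        return "P1"  # outage-level contention
    if blocking_sessions > 0 or top_wait_pct >= 50 or active_sessions > 200:
        return "P2"  # degraded
    return "P3"      # warning

print(classify_severity(active_sessions=150, blocking_sessions=3, top_wait_pct=40.0))
```

Pushing judgment calls like this into reviewed code keeps the AI's output reproducible: two runs over the same metrics always yield the same severity.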
Layer 4: The Memory — Persistent Context
One of the most transformative capabilities in modern AI systems is persistent memory—the ability for an AI to remember context across conversations.
For IT professionals, this means:
- Environment knowledge: The AI remembers your database versions, patch levels, cluster topology, and naming conventions
- Historical context: Past incidents, performance baselines, and known issues
- Team preferences: Your coding standards, documentation templates, and approval workflows
- Project state: Where you left off on a migration, what has been tested, what remains
This is the equivalent of having a team member who never forgets the institutional knowledge that typically lives in someone’s head or scattered across wikis.
Practical implementation: AI tools like Claude support memory through knowledge items and conversation history. You can accelerate this by creating a structured context file:
```markdown
# Environment Context

## Database Infrastructure
- Production: Oracle 19c RAC (2-node) on OCI BM.Standard.E4
- Standby: Active Data Guard in different region
- PDB architecture: 1 CDB, 4 PDBs (HR, FIN, SCM, CX)

## Cloud Infrastructure
- Primary region: us-ashburn-1
- DR region: us-phoenix-1
- Compartment structure: prod/nonprod/shared-services

## Monitoring Stack
- OCI Logging Analytics for log consolidation
- OCI APM for application monitoring
- Custom metrics via OCI Monitoring API
- Alerting via OCI Notifications + PagerDuty

## Standards
- Terraform for all infrastructure provisioning
- Ansible for database patching
- Git branching: main → staging → feature/*
- Change management: ServiceNow tickets required for prod changes
```
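One simple way to use such a file today is to prepend it to every new conversation or API call, so the assistant starts with environment knowledge instead of asking for it. A minimal sketch, assuming a context string like the one above:

```python
# Sketch: inject the environment context ahead of each user prompt.
# The CONTEXT string and separator format are illustrative choices.

CONTEXT = """# Environment Context
- Production: Oracle 19c RAC (2-node) on OCI
- Primary region: us-ashburn-1
- Change management: ServiceNow tickets required for prod changes
"""

def with_context(user_prompt: str, context: str = CONTEXT) -> str:
    """Return a prompt with the standing environment context prepended."""
    return f"{context.strip()}\n\n---\n\n{user_prompt}"

prompt = with_context("Why is the FIN PDB slow this morning?")
print(prompt.splitlines()[0])   # # Environment Context
```

Tools with native memory make this automatic; the point is that the context lives in one reviewed file, not scattered across conversations.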
The AIOps Transformation: From Reactive to Predictive
For IT operations teams, this AI ecosystem enables a fundamental shift in how you work:
Traditional IT Operations
```
Alert fires → Engineer investigates → Manual diagnosis → Fix applied
              (15-30 min)             (30-60 min)        (variable)
```
Each step requires context loading: What environment is this? What changed recently? Has this happened before? What is the standard fix?
AI-Augmented Operations (AIOps)
```
Alert fires → AI agent triages with full context → Diagnosis + recommendations ready
              (< 2 min)                             (< 5 min)
```
The AI agent already knows your environment, has access to your monitoring data via MCP, and follows your incident response skill. Your role shifts from executor to reviewer and decision-maker.
What This Looks Like in Practice
Scenario: A P2 database performance alert fires at 2 AM.
Without AI ecosystem:
- On-call DBA receives page
- VPNs into the environment
- Tries to remember which database, which application
- Runs ad-hoc diagnostic queries
- Googles error codes
- Eventually identifies the issue
- Manually writes the incident report next morning
With AI ecosystem:
- On-call DBA receives page with AI-generated triage summary
- Summary includes: affected database, top wait events, suspect SQLs, comparison to historical baseline, and recommended actions
- DBA reviews the analysis, confirms the diagnosis
- AI generates the incident report with full evidence
- Total time: 15 minutes instead of 2 hours
Getting Started: Your First 30 Days
Here is a practical roadmap for building your AI ecosystem:
Week 1: Foundation
- Set up Claude (or your preferred AI) with a professional plan
- Create your environment context file (see template above)
- Start a `.agents/` directory in your team’s Git repository
- Write your first `SKILL.md` for a task you do frequently
Week 2: Skills Development
- Convert your top 3 most-used runbooks into skills
- Add reference files (scripts, templates, documentation)
- Test each skill by running through it with your AI assistant
- Iterate based on what the AI gets right and wrong
Week 3: MCP Integration
- Explore available MCP servers for your technology stack
- Set up at least one MCP connection (database, cloud API, or monitoring)
- Create a skill that leverages MCP for real data analysis
- Document the integration for your team
Week 4: Team Adoption
- Share your skills library with your team
- Run a workshop demonstrating the AI ecosystem
- Collect feedback and prioritize the next set of skills to build
- Establish governance: who reviews and approves skills?
Testing and Quality: The Verification Loop
Anthropic’s guide emphasizes that skills should be tested just like code. This resonates deeply with IT professionals who understand the importance of validation.
The guide recommends three levels of testing:
1. Manual Testing: Run through the skill interactively and verify the output matches expectations. This is your smoke test.
2. Scripted Testing: Create test scenarios with known inputs and expected outputs. For a database analysis skill, this might mean feeding it a sample AWR report and verifying it identifies the correct bottlenecks.
3. Metrics Monitoring: Track how often skills are triggered correctly, how many tool calls they make, and their failure rates. This is your SLA monitoring for AI operations.
The key metrics to watch:
| Metric | Target | Why It Matters |
|---|---|---|
| Trigger accuracy | > 95% | Does the right skill activate for the right task? |
| Completion rate | > 90% | Does the skill complete without errors? |
| Tool call efficiency | Minimize unnecessary calls | Fewer calls = faster execution + lower cost |
| Output quality | Consistent with templates | Does the output match your standards? |
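Most of these metrics fall out of a simple execution log. The sketch below computes trigger accuracy, completion rate, and average tool calls from such a log; the log schema is a hypothetical assumption—adapt it to whatever telemetry you actually capture.

```python
# Sketch: derive the table's metrics from a per-execution log.
# The log entry shape is an illustrative assumption, not a standard.

log = [
    {"skill": "awr-analysis", "triggered_correctly": True,  "completed": True,  "tool_calls": 4},
    {"skill": "awr-analysis", "triggered_correctly": True,  "completed": False, "tool_calls": 9},
    {"skill": "alert-triage", "triggered_correctly": False, "completed": True,  "tool_calls": 2},
]

def metric(entries: list, key: str) -> float:
    """Fraction of log entries where the boolean field `key` is true."""
    return sum(1 for e in entries if e[key]) / len(entries)

avg_calls = sum(e["tool_calls"] for e in log) / len(log)

print(f"trigger accuracy: {metric(log, 'triggered_correctly'):.0%}")  # 67%
print(f"completion rate:  {metric(log, 'completed'):.0%}")            # 67%
print(f"avg tool calls:   {avg_calls:.1f}")                           # 5.0
```

With thresholds from the table, the same numbers can feed an alert: a skill dipping below 95% trigger accuracy gets flagged for review like any SLO breach.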
Security and Governance: Operating AI in Enterprise Environments
For IT professionals working in regulated environments, governance is non-negotiable. Here are the key considerations:
Data Classification
- Never encode credentials, tokens, or secrets in skill files
- Use environment variables or secret management tools for sensitive configuration
- Classify your skills by the data sensitivity they handle
Access Control
- Store skills in version-controlled repositories with appropriate access controls
- Use code review processes for skill changes, just like infrastructure code
- Implement approval workflows for skills that interact with production systems
Audit and Compliance
- Log all AI agent actions, especially those involving production systems
- Maintain an audit trail of skill executions and their outcomes
- Regularly review and update skills to reflect policy changes
Responsible Use
- Always have a human review AI-generated recommendations before applying to production
- Define clear escalation paths for when AI diagnosis is uncertain
- Establish boundaries: what can the AI do autonomously vs. what requires human approval?
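The autonomy boundary in that last point can be enforced in code rather than left to the model's judgment. A minimal sketch, with illustrative action names and a default-deny policy (all identifiers here are assumptions):

```python
# Sketch of an autonomy boundary: actions the agent may run on its own
# vs. those queued for human approval. Action names and policy are illustrative.

AUTONOMOUS = {"read_metrics", "generate_report", "query_awr"}
NEEDS_APPROVAL = {"kill_session", "restart_instance", "alter_parameter"}

def authorize(action: str) -> str:
    """Gate an agent-requested action against the written policy."""
    if action in AUTONOMOUS:
        return "run"
    if action in NEEDS_APPROVAL:
        return "queue_for_human"
    return "deny"   # default-deny anything unrecognized

print(authorize("query_awr"))         # run
print(authorize("restart_instance"))  # queue_for_human
print(authorize("drop_table"))        # deny
```

Keeping the two sets in a version-controlled file means widening the agent's autonomy is itself a reviewed change, subject to the same approval workflow as production code.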
The Bigger Picture: Skills as a Cross-Platform Standard
One of the most compelling aspects of the skills framework is that SKILL.md is an open standard. It works across:
- Claude Code (Anthropic)
- Gemini CLI (Google)
- OpenAI Codex (OpenAI)
- Any AI assistant that can read Markdown files
This means your investment in building skills is portable. If you switch AI providers—or use multiple providers for different tasks—your skills library moves with you. This is analogous to how Terraform provides cloud-agnostic infrastructure definitions: write once, execute anywhere.
For IT organizations evaluating AI tools, this portability significantly de-risks adoption. You are not locked into a vendor; you are building a knowledge asset that appreciates over time.
Conclusion: The AI Architect Mindset
The transition from traditional IT operations to AI-augmented operations is not about learning a new tool. It is about adopting a new architecture mindset:
1. Think in systems, not prompts: Design your AI ecosystem as an architecture with layers, interfaces, and governance—not as a collection of chat conversations.
2. Encode knowledge, not just commands: Your institutional knowledge—runbooks, best practices, lessons learned—is your most valuable asset. Package it as skills so it can be leveraged consistently.
3. Design for progressive disclosure: Load context efficiently. Not everything needs to be in the prompt. Structure your knowledge so the AI can find what it needs, when it needs it.
4. Verify and iterate: Test your skills like you test your infrastructure. Monitor their effectiveness. Continuously improve them based on real-world usage.
5. Govern like production: Apply the same discipline to your AI ecosystem that you apply to your production environments—access control, change management, audit trails, and security reviews.
The AI Architect does not just use AI tools. They design the system in which AI tools operate most effectively—bringing together the right models, the right context, the right skills, and the right governance to create an ecosystem that amplifies the entire team’s capabilities.
In the next installments of this series, we will dive deeper into practical implementations: building MCP integrations for Oracle databases, creating monitoring skills for OCI, and designing AI-powered incident response workflows.
The future of IT operations is not AI replacing DBAs and Cloud Architects. It is AI making every DBA and Cloud Architect ten times more effective.
References
- The Complete Guide to Building Skills for Claude (PDF) — Anthropic
- Model Context Protocol (MCP) Documentation — Anthropic
- Building Effective Agents — Anthropic Engineering
- Claude Code Documentation — Anthropic
- OCI Observability and Management — Oracle Cloud Infrastructure