Reference notes.
AI agents are systems that use LLMs to autonomously plan and execute tasks. Unlike simple chat interactions, agents can use tools, maintain state, and work towards goals over multiple steps.
Core Concepts
Agent Loop
The fundamental pattern:
while goal not achieved:
1. Observe current state
2. Reason about next action
3. Execute action (often using tools)
4. Update state with results
Components
- LLM — The reasoning engine
- Tools — External capabilities the agent can invoke
- Memory — Short-term (conversation) and long-term (persistent storage)
- Planning — Strategy for achieving goals
Reasoning Patterns
ReAct (Reasoning + Acting)
Interleaves thinking and action:
Thought: I need to find the current weather in London.
Action: get_weather(location="London")
Observation: Partly cloudy, 15°C
Thought: Now I can answer the user's question.
Answer: The weather in London is partly cloudy at 15°C.
Chain-of-Thought (CoT)
Step-by-step reasoning before answering. Can be triggered with “Let’s think step by step” or explicit reasoning structure.
Tree-of-Thought (ToT)
Explores multiple reasoning branches:
- Generate several possible next steps
- Evaluate each branch
- Select most promising
- Backtrack if needed
Useful for planning and problem-solving where the first approach may fail.
Reflexion
Self-evaluation and improvement:
- Attempt task
- Evaluate outcome
- Reflect on failures
- Retry with learned insights
Tool Use
Function Calling
Modern LLMs support structured tool definitions:
{
"name": "search_web",
"description": "Search the web for information",
"parameters": {
"query": {"type": "string", "description": "Search query"}
}
}The model outputs structured calls that the system executes.
Common Tool Categories
- Information retrieval — Web search, database queries, file reading
- Code execution — Run code, shell commands
- External APIs — Send emails, create tickets, update systems
- Human interaction — Ask for clarification, approval
Model Context Protocol (MCP)
An open standard (created by Anthropic, now under the Linux Foundation) for connecting AI agents to external tools and data sources. Provides a universal interface so agents can access tools regardless of the underlying provider — similar to how USB-C standardised device connectivity.
- Client-server architecture (agent = client, tool provider = server)
- De facto standard — supported by 12+ frameworks (OpenAI Agents SDK, Claude Agent SDK, LangChain, CrewAI, PydanticAI, Google ADK, and others)
- Growing ecosystem of pre-built MCP servers
- MCP Specification
Agentic AI Foundation (AAIF)
Formed under the Linux Foundation (Dec 2025) to govern agentic AI standards:
- Founding contributions: MCP (Anthropic), goose (Block), AGENTS.md (OpenAI)
- Platinum members: AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, OpenAI
- MCP at 97M+ monthly SDK downloads and 10,000+ active servers by late 2025
- AGENTS.md adopted by 60,000+ open-source projects and major agent frameworks
Tool Design Principles
- Clear, specific descriptions
- Well-defined parameter schemas
- Appropriate granularity (not too broad or narrow)
- Error handling and feedback
Memory
Short-term Memory
Conversation history within context window. Strategies for long conversations:
- Summarisation
- Sliding window
- Relevant message retrieval
Long-term Memory
Persistent storage across sessions:
- Vector Databases for semantic retrieval
- Structured databases for facts
- Knowledge graphs for relationships
Working Memory
Scratchpad for intermediate results during complex tasks.
Hybrid Memory Architectures
Pure vector retrieval has known weaknesses for temporal sequencing, contradiction resolution, and structured facts. The state of practice in 2026 combines several stores:
- Vector search for unstructured semantic recall
- Graph traversal for entities and relationships
- Episodic / temporal stores for event sequences and “what happened when”
- Key-value scratch for active session state
Notable systems:
- Mem0 — Three-tier (user / session / agent) with hybrid vector + graph + KV. Self-edits on contradiction rather than appending. Reported as a strong baseline on the LOCOMO long-conversation benchmark at ECAI 2025.
- Letta / MemGPT — OS-inspired model: core memory (RAM-like, always in-context) + archival memory (searchable disk-like store), with the agent issuing its own paging calls.
- Zep, Cognee — Graph-native, emphasising temporal and relational structure that flat vector stores handle poorly.
Multi-Agent Systems
Architectures
Hierarchical
Orchestrator Agent
├── Research Agent
├── Coding Agent
└── Review Agent
Peer-to-peer
Agents communicate directly, no central controller.
Debate
Multiple agents argue different perspectives, synthesise conclusion.
Coordination Patterns
- Handoff — Pass task to specialist agent
- Collaboration — Agents work on subtasks in parallel
- Critique — One agent reviews another’s work
- Voting — Multiple agents vote on decisions
Interoperability
- Agent2Agent (A2A) — Google’s open protocol for agent-to-agent communication across different frameworks and vendors. Complements MCP (which connects agents to tools) by connecting agents to each other.
Agentic Coding
Agents specialised for software development tasks.
Capabilities
- Computer Use — Directly control a desktop environment by simulating mouse clicks, keystrokes, and analysing screenshots (e.g., Anthropic Computer Use API, OmniParser).
- Navigate and understand codebases
- Write, edit, and refactor code
- Run tests and fix failures
- Execute shell commands
- Interact with version control
Products
Best Practices
- Provide clear project context (AGENTS.md, CLAUDE.md)
- Use version control for safety
- Review changes before committing
- Start with smaller, well-defined tasks
Human-in-the-Loop
Approval Gates
Require human approval for:
- Destructive actions (delete, overwrite)
- External communications (emails, API calls)
- Expensive operations
- Uncertain decisions
Feedback Integration
- Corrections improve future behaviour
- Preferences guide decision-making
- Escalation when confidence is low
Building Agents
Frameworks
- LangChain — Popular, batteries-included
- LangGraph — Graph-based agent workflows
- OpenAI Agents SDK — OpenAI’s agent framework with handoffs and guardrails
- Google Agent Development Kit (ADK) — Google’s multi-agent framework
- LlamaIndex — Data-focused agents
- CrewAI — Multi-agent orchestration
- AutoGen — Microsoft’s multi-agent framework
- Semantic Kernel — Microsoft’s AI orchestration SDK
- Pydantic AI — Type-safe agent framework
Patterns
Minimal agent:
while True:
response = llm.chat(messages, tools=tools)
if response.tool_calls:
for call in response.tool_calls:
result = execute_tool(call)
messages.append(tool_result(result))
else:
return response.contentWith planning:
plan = llm.generate_plan(task)
for step in plan:
result = execute_step(step)
if not successful(result):
plan = llm.replan(task, result, plan)Evaluation
Metrics
- Task completion — Did it achieve the goal?
- Efficiency — Steps taken, tokens used
- Safety — Avoided harmful actions?
- Cost — API costs, compute time
Benchmarks
- SWE-bench — Real GitHub issues. SWE-bench Verified is the most-cited coding-agent benchmark in 2026; Claude Opus 4.7 leads at ~87.6%.
- WebArena — Web task automation
- AgentBench — Diverse agent tasks
- GAIA — General AI assistants
- OSWorld — Computer-use on a real desktop; still in the high-30s for frontier models, the hardest of the major agent benchmarks.
- Tau-Bench / Tau²-Bench (Sierra, 2024-2025) — Tool-agent-user interactions with policy adherence; measures customer-service-style multi-turn workflows under real schemas.
- Terminal-Bench (Stanford, 2025) — 89 hard, expert-curated CLI tasks designed for long-horizon shell agents.
- METR HCAST + Time Horizons — Reports the 50%-time horizon: the human task duration at which an agent succeeds half the time. This metric has been roughly doubling every ~7 months and is the cleanest single-number capability trend for autonomy.
Benchmark gaming caveat: UC Berkeley’s RDI Center showed in April 2026 that an automated scanning agent broke all eight major agent benchmarks via reward hacking, achieving near-perfect scores without solving the tasks. Treat single-benchmark numbers sceptically.
Challenges
- Reliability — Agents can get stuck, loop, or make errors
- Cost — Many LLM calls add up
- Latency — Sequential actions are slow
- Safety — Autonomous actions need guardrails
- Debugging — Hard to trace multi-step failures