Autonomous AI agents have been promised as the next revolution in AI applications, but building reliable, effective agents remains challenging in practice. Generative AI (GenAI) agent implementations go beyond a single prompt-response-they involve smart design patterns for tool use, memory management, decision-making, and more. This review maps out key design patterns in agentic workflows and provides practical guidance on when and how to use them effectively.

If you are new to this space, note that we're using LangGraph (from LangChain) as an orchestration framework in many examples, but the concepts apply generally. Credit goes to many sources cited throughout (LangChain blogs, Anthropic's research, etc.) for inspiring these patterns.

Before getting into patterns, let's clarify what we mean by an "agent" and why building them is tricky today. Then we'll explore a taxonomy of agent design patterns – from memory mechanisms to multi-agent orchestration – each illustrated with real-world examples.

What is an Agent?

Broadly speaking, an AI agent is an LLM given some form of agency to decide on and execute actions toward a goal. This usually means the model can call functions or tools in a loop, adjusting its plan based on intermediate results, rather than just answering a single prompt. Some define an agent narrowly as a fully-autonomous system operating continuously and deciding how to solve tasks, while others use the term for any multi-step LLM process that isn't a fixed script¹. In this review, we'll use "agentic system" to encompass both dynamic agents and more structured workflows (we'll differentiate these soon).

Crucially, an agent goes beyond a one-shot completion: it chooses its own next steps. For example, a retrieval-augmented QA bot that first decides which documents to fetch, then reads them, then answers, can be seen as an agent (the LLM controls the retrieval step). Agents might select from multiple tools, route queries, or iteratively refine outputs.

Monolithic vs. Multi-Agent Approaches

There are two main architectural approaches to building AI agents²:

Monolithic agent: a single large model handles the entire task end-to-end, making all decisions and sub-decisions based on its full context understanding. This leverages emergent abilities of large models and avoids information loss from splitting tasks.
Multi-agent system: a team of specialized agents (often smaller models or distinct prompts) each handle a subtask, coordinated to solve the overall task. Instead of one generalist brain, you have specialists (planner, researcher, coder, etc.) that communicate.

In theory, a sufficiently large monolithic agent with infinite context would be ideal for complex tasks, since dividing context can throw away useful information. In practice, constraints like limited context windows, model cost, and the desire for modularity push us toward multi-agent solutions in many cases². If different parts of a task truly require different expertise or contexts, multi-agent designs can shine. But if a single model could solve it given all information, it usually should – splitting into multiple calls may add overhead and error points.

A useful perspective comes from an OpenAI engineer's comment³:

Single big LLM calls work best when the task's pieces are tightly coupled (interrelated info or steps), or you need immediate output without waiting on intermediate steps. Multiple smaller calls work better when the task can be cleanly decomposed into independent parts, or when you want branching logic (e.g. a decision tree of subtasks), or want certain processing done with guaranteed rules outside the LLM.

In other words, if you can break a problem into truly self-contained steps (where each step doesn't need the full context of others), multiple agents/calls can improve efficiency or cost. If not, a single model run might be more effective.

Workflows vs. Agents

It's also useful to distinguish hardcoded workflows vs. dynamic agents⁴. Anthropic describes it like this:

Workflows: LLMs + tools orchestrated through a predefined, fixed sequence in code. (The developer decides the steps ahead of time.)
Agents: LLMs that dynamically control their own sequence of tool usage and steps. (The LLM decides what to do next, within some allowed tools/instructions.)

In practice, there's a continuum. A retrieval-augmented generation (RAG) pipeline with a fixed "search → answer" sequence is a workflow. A fully autonomous AutoGPT that decides everything on the fly is an agent. Many systems blur the line by giving the LLM some freedom (e.g. choose between routes) but within a high-level structure.

Workflow vs. Agentic system – a fixed script vs. dynamic decision-making

When to use which? If a task can be solved with a simple prompt or a straightforward script, that's usually more efficient and reliable than a complex agent. Only add agentic complexity when needed: typically for open-ended problems where you can't predict the steps in advance⁴. Even then, it pays to constrain agents with some structure.

Challenges in Practice

Building useful agents today comes with serious challenges²⁵:

Reliability: LLMs still hallucinate and make mistakes. Stringing multiple steps together compounds the error rate. For tasks requiring exact or factual outputs, an agent needs careful guardrails or verification at each step, or the errors might snowball.
Speed & Cost: Early experiments with GPT-4-based agents revealed they can be slow and expensive. Multiple model calls, especially with large models, incur latency and API costs. (Though note: token prices have been plummeting – OpenAI's GPT-4o dropped from ~$36 to ~$4 per million tokens in 17 months, ~80% cost reduction per year⁶. This trend suggests cost might become less of an obstacle over time.)
User Trust: It's hard for users to trust a "black box" agent making decisions autonomously. There have been public failures (e.g. an airline's customer-support chatbot that gave a customer misleading info, leading to a legal order⁵). If an agent handles sensitive tasks (money, medical advice, etc.), users (and regulators) will demand transparency and oversight.
Complexity: More moving parts (multiple prompts, tools, memory, etc.) means more surface area for bugs. Debugging a misbehaving agent – figuring out which prompt or tool caused an issue – can be difficult without good logging and introspection tools.

Given these challenges, many experienced teams recommend starting as simple as possible⁴. Often, you might not need a fancy multi-step agent if a well-crafted single LLM call with retrieval or some functions can do the job.

That said, when simple single-call solutions fall short, agentic patterns can significantly improve performance. The rest of this document assumes you have a task that does benefit from an agentic approach, and we'll explore how to design it well.

Philosophies for Building LLM Agents

Here are a few guiding principles we adopt when designing AI agent workflows today:

Prefer structured state machines over fully-autonomous loops. Deterministic or at least constrained flows (think: fancy API endpoints with AI inside) are more reliable than letting an agent run wild. Wherever possible, limit an agent's degrees of freedom to what's needed for the task.
Get it working before optimizing. Don't prematurely micro-optimize token usage or latency. Model API prices are dropping rapidly⁶, and hardware is improving. It's usually better to first build a useful solution, then refine cost if needed. An overly frugal design that doesn't meet user needs is wasted effort.
Keep tasks narrow and well-defined. Today's agents perform best on bounded domains or specific problems (coding, customer support, etc.). "Jack of all trades" agents tend to be brittle. Build domain-specific workflows with clear success criteria.
Stay flexible with tools and models. The LLM landscape evolves quickly. Use abstractions (or frameworks like LangChain/LangGraph) that let you swap out the model or vector store or API without completely rewiring your app. Don't tie yourself to one provider's quirks if you can avoid it.

With those principles in mind, let's dive into the design patterns that enable advanced agent capabilities. We will group patterns into categories for clarity. Each pattern is essentially a reusable solution or mechanism that addresses a common need in agent design. Along the way, we'll highlight real-world examples (case studies) and relevant research.

Agentic Design Patterns

We'll cover the following categories of patterns:

Memory and State Management – techniques for agents to remember information during and across sessions.
Control Flow Patterns – routing, chaining, looping, and branching structures that guide an agent's sequence of actions.
Multi-Agent Orchestration – architectures for using multiple agents (or multiple LLMs) collaboratively.
Self-Reflection and Improvement – getting agents to critique and refine their own outputs in iterative loops.
Human Interaction (Human-in-the-Loop) – incorporating human feedback or control within an agent's operation.
Tool Use and Integration – patterns for using external tools, knowledge sources, or multi-modal inputs/outputs.
Reliability and Evaluation – strategies to evaluate, verify, or debug agent behavior.
Emerging Patterns – experimental or cutting-edge techniques.

Each pattern includes a description, practical tips, and often a footnoted source or example.

Memory and State Management Patterns

Memory in agent systems refers to how the agent retains information over time. Unlike stateless LLM API calls, an agent might need to recall previous conversation turns, user preferences, or facts it discovered earlier. We can draw an analogy to human memory:

Short-term memory: information the agent keeps in the immediate context (e.g. the recent conversation history passed in the prompt). This is limited by the model's context window (thread-scoped memory).
Long-term memory: information stored outside the prompt, which the agent can retrieve when needed (e.g. via a vector database or knowledge graph). This persists across sessions and can be shared among threads.
Sensory memory: raw inputs like recent observations that can be encoded (e.g. using embeddings) for the agent to quickly process.

In practice, short-term memory is handled by prompt history (or a summary of history), and long-term memory is handled by an external store that the agent can query. For instance, you might save embeddings of important facts or user details in a vector store and have the agent query it when relevant.

Schema for agent memory: short-term context vs. long-term vector store

There are a few common memory patterns:

Conversation History Management: Naively prepending the entire chat history will eventually hit context limits. Instead, use strategies like message filtering (drop irrelevant or low-priority turns), trimming (remove or summarize oldest turns when context is full), or conversation summarization (dynamically summarize older parts of the dialogue)⁷. The agent can carry a concise summary plus recent turns to keep context short and relevant.
Long-Term Knowledge Base: Storing facts, user profile data, or past interactions in a database that can be queried via semantic search or keys. At runtime, before the agent answers a question or decides on an action, it can retrieve pertinent info and include it in the prompt. For example, an agent might have a vector DB of company policies to consult when answering HR questions. A noteworthy project is MemInsight (2025), which proposed an autonomous memory module for LLM agents that periodically summarizes new information and stores it, reportedly improving personalization and recall on long-running tasks⁸.
- Multi-vector indexing is a technique where you index different aspects of data separately to improve recall (e.g. store both exact text and high-level summary vectors). This can help the agent find relevant info more accurately when querying⁹.
Memory Extraction and Storage: An agent can be designed to explicitly save new information it learns. For instance, after a conversation, it might extract key facts about the user's preferences and call an API (or a "SaveMemory" tool) to store them to long-term storage. Next time, it can retrieve those facts to personalize responses.
Generative "NPC" Agents with Memory: An interesting line of research created simulated characters (agents) with long-term memory. For example, Stanford's "Generative Agents" had agents in a sandbox environment remembering and reflecting on events in their lives (who they met, what they learned) and using those memories to guide future behavior¹⁰. They used a combination of natural language entries for recent events and summarized "memories" stored externally, demonstrating architectures with both episodic memory (specific events) and semantic memory (general knowledge) that accumulate over time.
Memory-Conditioned Behavior: The agent might adapt its style or strategy based on memory. E.g., if long-term memory shows a user always rejects verbose answers, the agent can shorten its replies or change tone accordingly.

The "Lost in Conversation" Problem: A Core Memory Challenge

Here's a surprising finding that changes how we should think about conversational AI: all LLMs get dramatically worse when users chat with them over multiple turns instead of giving complete instructions upfront. This isn't just a minor issue-it's a fundamental reliability problem that affects every major model.

A major 2025 study tested 15 top LLMs (including GPT-4 and Claude 3 models) across over 200,000 conversations. The results were consistent and alarming: every single model showed a 39% average performance drop when handling realistic multi-turn conversations compared to single, well-specified prompts¹¹.

This "lost in the conversation" effect indicates that our current memory strategies (like appending or summarizing conversation history) are not fully solving the context management problem. Even with memory, agents can degrade over a session.

What Goes Wrong in Multi-Turn Conversations?

Jumping to Conclusions Too Early: LLMs try to write full solutions before getting all the details, then get stuck defending their early (often wrong) assumptions.
Bloated, Off-Target Responses: Answers become 20-300% longer and filled with irrelevant assumptions. The AI essentially overthinks and overwrites.
Tunnel Vision on First and Last Messages: Like humans, LLMs pay most attention to the beginning and end of conversations, missing crucial details shared in the middle.
Error Snowballing: Early mistakes compound as the conversation continues. The AI doubles down on wrong paths instead of course-correcting.
Lost in Their Own Words: The more verbose the AI gets, the more it confuses itself in subsequent turns.

What Doesn't Work (Surprising Results):

Lower Temperature Doesn't Help: Unlike single-turn prompts, cranking down randomness (temperature=0) barely improves multi-turn reliability.
More Reasoning Power Doesn't Fix It: Even advanced "reasoning models" still suffer from this problem.
Partial Fixes Fall Short: While techniques like periodic "recap" strategies (repeating all previous instructions) help, they don't fully restore single-turn reliability, only recovering about 15-20% of the lost performance¹¹.

Practical Takeaways for Agent Builders:

Encourage Complete Upfront Instructions: Guide users to provide all requirements in one comprehensive prompt rather than piecing them together over multiple messages.
Reset When Things Go Sideways: If a conversation starts going off-track, it's often better to start fresh with a consolidated summary of what the user actually wants.
Use Recap Strategically: Periodically summarize and restate all accumulated requirements. Another strategy is giving the model explicit schemas for memory retrieval – e.g., instruct it: "Before answering, recall if the user provided any preference earlier (yes/no) and what it was." By forcing a self-query of memory, it may surface facts that would otherwise be buried in the dialogue history.
Human Oversight is Critical: Given the unreliability, human review becomes even more important for multi-turn agents, especially in high-stakes situations.
Design Escape Hatches: Build mechanisms to detect when conversations are failing and either reset the context or escalate to human assistance.

The bottom line is that making AI agents feel "natural" through conversational interactions might actually make them less reliable. Sometimes the best user experience isn't the most human-like one-it's the one that plays to the AI's strengths rather than its weaknesses¹¹.

Persistence and State

Persistence refers to maintaining the agent's state across runs or even if the process restarts. For example, if your agent is deployed as a service, you might want a user's session state (including conversation memory, any variables, etc.) to be saved so that if the service restarts or scales to a new machine, the state isn't lost.

In frameworks like LangGraph, checkpoints are used to persist state at certain steps. This enables:

Resuming or "Time Travel": You can save the state after each agent action. If something goes wrong or you want to debug, you can replay a past state (deterministically getting the same result as before) or even fork from a past state by altering something and continuing from there¹². This is extremely useful for debugging complex agent behaviors (think of it like version control or an undo for agent reasoning). For instance, if an agent made a wrong decision at step 5 of a 10-step reasoning chain, you could roll back to step 4, adjust the prompt or provide correction, and then let it proceed differently from there.
Fault Tolerance: If the agent crashes midway (maybe an external API failed or was rate-limited), a checkpoint allows restarting from the last good state instead of from scratch.
Human Replay: For human-in-the-loop scenarios (discussed later), you might pause the agent, get input, and then resume. Persistence ensures the agent's state (memory, intermediate results) is still there when resuming, even if there was a time gap.

LangGraph's thread concept automatically persists the graph's state per session. If you come back to a conversation later, it can load the last state and continue. It also supports cross-thread persistence for sharing data across different sessions or agents (e.g., a common knowledge base all agents use)¹³. This means an agent can write to a memory store that outlives one conversation thread and is accessible to others – useful for multi-user systems or global facts storage.

On the engineering side, frameworks are enabling persistent memory that outlives the agent's process. For instance, LangChain/LangGraph supports saving conversation state and vector memories to disk or a database, so that if you come back tomorrow, the agent can pick up where you left off. The challenge is what to store – saving everything verbatim is infeasible, and saving only high-level summaries might omit crucial specifics. Some new research explores hybrid approaches: store detailed logs short-term and distilled facts long-term, or maintain multiple levels of memory (recent context window vs. archived memory) and teach the agent when to query the archives¹⁴. This is analogous to how humans have working memory versus long-term memory.

Persisting agent state across time (with checkpointing for rewind/fast-forward)

Case Study: Time Travel Debugging. Imagine a coding agent that writes and executes code. It got a compilation error on attempt #1. With persistent state and time-travel, the developer can rewind to the code generation step, tweak the prompt or give a hint, and then let it continue from there – effectively guiding the agent to success while diagnosing the failure point. This approach was used to iteratively refine an agent that writes and tests code: whenever it hit a runtime error, the engineer could rewind and adjust the logic, then re-run, without restarting the whole process¹². Such controlled "replaying" made debugging easier and improved the agent's reliability over time.

Memory Update Strategies: Hot Path vs. Background. One nuance in memory design is when the agent updates its long-term memory. There are two patterns⁶¹³:

Hot Path (Synchronous) Memory: The agent updates memory during the conversation flow. For example, it might call a SaveMemory tool as one of its steps before responding to the user. This ensures the memory is immediately up-to-date for the next user query, but it adds extra latency and complexity (the agent is multitasking: solving the user's request and saving info simultaneously).
Background (Asynchronous) Memory: The agent focuses on responding to the user first, and only after the conversation turn is done (or during idle time) does a separate process update the memory store. In LangGraph you can schedule a "memory update graph" to run a few seconds after the main agent responds, performing tasks like summarizing the conversation and storing key facts. If a new user message comes in before the memory update has run, the system can cancel the pending update and schedule a new one after the latest turn, ensuring only the latest conversation gets saved. This pattern avoids slowing down the user interaction – memory is built subconsciously in the background.

Comparing the two: the hot path approach gives immediate memory updates and lets the agent consciously decide what to remember (which can be good for fine control), but it increases response time and intermingles memory logic with task logic. The background approach keeps the user experience snappy and decouples memory logic, at the cost of the memory lagging slightly behind and needing extra logic to trigger the background job⁶. Many systems use background memory updates for things like conversation summarization: e.g., after each user query, schedule a job to extract any important facts and store them, so that over time the agent builds a long-term profile without ever pausing the live conversation.

Finally, note that what to store in memory is highly application-specific⁶. Some agents store user preferences (semantic memory), others store entire episodes or examples of successful outcomes (episodic memory), and some update their own instructions or prompts over time (procedural memory). The CoALA paper (2024) provides a framework mapping human memory types to agent memory, and many of these patterns are being experimented with in modern agents.

In summary, memory and persistence patterns are about extending the effective context of an agent beyond a single prompt. When used well, they make agents feel more consistent and "personalized" over time and allow complex interactions that build on earlier steps. It's crucial, however, to balance immediacy vs. thoroughness in memory updates (hot vs. background) and to have robust state management (checkpoints, cross-session storage) to support these capabilities.

References: Strategies for memory in LLMs have been widely discussed – see LangChain's documentation on managing conversation history⁷ and adding semantic search memory⁹. The LangChain blog provides an overview of hot path vs. background memory updating and shares a template implementing both in LangGraph⁶. For a deeper dive into memory types (semantic, episodic, procedural) and how they map to agents, see Saptak Sen's Long-Term Agentic Memory essay (2025) and the CoALA paper (Sumers et al. 2024). LangGraph's docs cover practical features like state checkpoints¹² and cross-thread memory sharing¹³.

Control Flow Patterns (Routing, Chaining, Looping)

Not all tasks are a straight line from prompt to answer. Control flow patterns let us structure an agent's decision-making path in useful ways, often by splitting tasks into multiple steps or branches. Here are several key patterns:

Prompt Chaining

Prompt chaining is the simplest multi-step workflow: break a task into a sequence of deterministic steps, each handled by an LLM prompt (or tool). The output of one step feeds into the next prompt. Unlike a free-form agent, the sequence is fixed; there might be simple conditional checks, but the path doesn't diverge arbitrarily.

Use prompt chaining when a task naturally decomposes into a few subtasks that always occur in the same order⁴. You sacrifice some speed (multiple calls) to potentially gain accuracy by focusing the model on one sub-problem at a time. Examples:

Write an outline for an essay (step 1), then expand each section into a full essay (step 2). The final output is presumably better structured than doing it in one go.
Generate a draft response to a customer email, then have another prompt evaluate if the draft adheres to company policy. If not, maybe revise or flag for human review (this adds a conditional check after generation).
Take a user request, then produce a series of API calls to fulfill it. This could be a chain: "parse user request → plan API calls → execute calls → format results".

One important variant is adding gating functions between steps – basically, a small piece of code that checks something about the output before moving on (length, JSON validity, presence of certain keywords, etc.). If the check fails, you can retry that step or handle the issue (e.g., ask the model to fix the JSON format) before proceeding. This ensures the chain stays on track.

Prompt chaining is more of a workflow pattern than an "agent" per se, since the flow is predetermined. But it's often the starting point before introducing more agentic flexibility. Many initial LLM applications (like a multi-step data extraction pipeline or form-filling assistant) are implemented as prompt chains.

Routing (Dynamic Decision Paths)

Routing means using the LLM (or a classifier) to pick one of several routes or handling strategies for an input. It's like a switch or router in a network: based on the query, send it to the appropriate specialist or prompt.

Routing workflow – an LLM selects one of N paths to follow

Routing is useful when you have distinct categories of requests that require different processing⁴. For example:

An AI assistant that handles both general chit-chat and technical support. You first ask an LLM (or use rules) to classify the query: "Is this casual conversation, a support issue, a billing question, or something else?" Then route accordingly (small-talk agent vs. support agent vs. billing agent).
A system that chooses model variants based on complexity: e.g. route simple questions to a cheap small model, but complex ones to a more powerful model (saving costs on easy tasks).
A web service that either answers from a knowledge base or falls back to a generative answer if the KB lacks info. A router could decide if the confidence score from the knowledge base is high enough; if not, call the LLM to generate an answer.

The routing decision can be made by an LLM with a carefully crafted prompt asking it to pick a label, or by a traditional ML classifier if you have training data, or even by rule-based heuristics. Anthropic notes that having separate, specialized prompts for each category can improve performance, since each prompt can be optimized for its domain without trying to be one-size-fits-all⁴.

One key to routing is ensuring the output is structured (e.g., the router LLM should output exactly the name of the route or a JSON with a route field). This avoids ambiguity in parsing its decision.

Case Study: A Customer Support Agent might incorporate routing by first classifying the sentiment and urgency of a user's message¹⁵. If it's an angry complaint, route to a "high-priority resolution" path (maybe loop in a human or use a special apology tone). If it's a simple FAQ, route to an FAQ-answering path (perhaps a knowledge base lookup). Nir Diamant's customer support agent example uses an LLM to categorize the query type and sentiment, then uses that to decide how to answer – that's the routing pattern in action.

Looping and Iteration (State Machines)

When you allow an LLM-driven process to iterate (loop) until a condition is met, you get a state machine style agent. This is more unpredictable than a fixed chain because you don't know beforehand how many iterations it will do.

For instance, you might have an agent that alternates between two steps: (1) propose an action, (2) critique or get feedback, and loop until the critique is satisfied. Or a classic ReAct agent that keeps doing "Thought → Tool → Observation" cycles until it decides to stop. These are essentially while-loops where the LLM decides whether to continue.

Combining routing (decision making) with looping gives you a basic agent architecture: at each step the LLM decides an action (which could be "finish now" or could be calling a tool that leads to a new state), and you loop until a termination condition is met. Many frameworks implement this pattern under the hood: e.g., LangChain's ReAct agent is a loop where the prompt generates either an answer or a tool call, and it repeats if it was a tool call.

Why loops? Some tasks are interactive or exploratory. For example, a research agent might search the web, read an article, realize it needs more info, search again, read more, and only then formulate an answer. We as developers might not know upfront how many search steps are needed – so we let it loop until it decides it's done.

Guarding loops: Always have safeguards like a max iterations count or other break conditions, as a stuck agent could loop forever (e.g., if it keeps misunderstanding or encountering an error and not resolving it). Also consider timeouts or user-intervention triggers if loops run too long.

Parallelization and Branching

Parallelization means having an agent or workflow execute multiple steps concurrently to save time, and then combine the results. There are two main forms⁴:

Sectioning (fan-out/fan-in): Split a task into independent parts, solve each in parallel, then aggregate. For example, to summarize a long document, split it into chunks, summarize each chunk in parallel, then have another step combine those summaries.
Voting (redundant runs): Run the same subtask multiple times (perhaps with slight prompt variations or different models) and then choose or aggregate the outcomes. For instance, run a code generation agent 3 times and pick the version that passes tests (or even have them vote on the best solution).

Parallel execution workflow – fan-out to multiple calls, then fan-in to aggregate

Parallelism is especially useful to reduce latency when dealing with large inputs or multiple data sources. It's also a way to increase reliability by ensembling model outputs:

By sectioning, you ensure each chunk of text or each sub-problem gets focused attention from a model without exceeding context limits, and you get results faster than processing sequentially.
By voting or multiple attempts, you can either pick the majority/best, or even combine answers (like taking an average or merging content). This can mitigate randomness – if one run fails or hallucinates, maybe another will succeed.

Many orchestrators support branching. In LangGraph, you can design a graph that "fans out" from one node into multiple parallel nodes and then "joins" them. You often need a reducer function to specify how to combine results (concatenate lists, pick max, etc.). Make sure the tasks are truly independent to avoid race conditions or the need for cross-talk between parallel branches.

Case Study: Report mAIstro (an internal tool we built for writing reports) uses parallelization heavily. When given a report topic with multiple subtopics, it will spawn separate LLM calls to research and draft each section in parallel, then use another step to compile the sections into a final report. This makes generating a multi-section report much faster than doing sections one by one, and it allowed using more total context (each section got its own context for retrieval, etc.). Another example: a News Summarizer agent might take the day's top 5 news articles and summarize all of them in parallel, then output a combined digest¹⁵. By parallelizing the summarization of each article, the overall response is ready 5x faster than summarizing sequentially.

Parallelization is a pattern to use carefully – ensure tasks are truly independent (to avoid needing synchronization between branches, which complicates things). Also be mindful of rate limits: parallel calls multiply your API usage in short bursts.

Map-Reduce Pattern

A specific case of parallel sectioning is the Map-Reduce pattern (inspired by the classical MapReduce programming model):

Map step: Use the LLM (or a tool) to generate or extract intermediate results for each item in a list. These items can be processed concurrently.
Reduce step: Aggregate those results into a final result.

For example, if you have a list of topics and you want a joke about each, you can Map: "generate a joke about X" for each X (in parallel), then Reduce: "compile all these jokes into one response". Or Map: "retrieve relevant info for each question", then Reduce: "compile a Q&A document from all retrieved info".

Importantly, the list of items for the map step might itself be generated by an LLM in a prior step. This gives rise to dynamic map-reduce workflows: the agent decides at runtime how many and which subtasks to execute, then those subtasks run (possibly in parallel), and finally the results are combined.

LangGraph supports this dynamically. A node can output a list of items (like tasks or sections), then a subsequent node is configured to run separately for each item, and finally another node collects the outputs¹⁶. The tricky part is you may not know ahead of time how many items the map step will produce. With a dynamic graph, you can handle that by generating edges on the fly for each item.

Dynamic map-reduce in an agent: the agent decides tasks and spawns multiple calls, then aggregates results

This pattern is great for tasks like:

Multi-document Q&A: First find a set of relevant documents (Map: retrieve top N docs for the query, possibly using multiple queries in parallel with different keywords), then have the LLM read each and answer the question just from that document, then finally combine those answers (Reduce) into a consolidated answer with evidence.
Brainstorming and Filtering: Map: generate 10 ideas for X. Reduce: evaluate each idea (maybe with another LLM or a scoring function) and pick the top 3.
Data extraction at scale: Map: run an LLM or regex over each paragraph of a text to extract a certain field. Reduce: merge all these extracted pieces into one structured output (or a report).

By structuring it explicitly as map-reduce, you introduce a bit more determinism (each piece is processed the same way) and you can often parallelize the map step. Some workflows even do multiple levels of map-reduce (hierarchical) – e.g., first map over documents to get per-doc answers, then map over those answers to get refined answers, etc., then final reduce.

Structured Loops with Termination Checks

We touched on loops earlier; another pattern is adding self-checks to loops. Instead of a potentially endless ReAct loop, you might implement something like:

The agent loops with a max of K iterations.
On each iteration, after the agent's action, a check is run: e.g., evaluate if the goal is achieved or if progress has stalled.
If the check indicates success or no progress, break out to avoid waste.

One example is a search-until-found loop: Use an LLM to generate search queries and read results in a loop. After each retrieval, have a criterion (possibly an LLM judge) check: "Did the result likely contain the answer?" If yes, stop; if not, iterate another search. This is more robust than either a fixed one-shot search or an infinite search – it adapts to the difficulty of the query but has a safety net.

Another example: in code generation, an agent could attempt to compile/run code, see the error, then loop back to fix the code. You limit to e.g. 3 attempts. That's effectively a structured loop with a termination condition (success or reaching attempt limit). Many coding agents implement this compile-test-fix loop with a cap on retries to avoid infinite bug-fix cycles.

In summary, control flow patterns give you building blocks to make agent processes more modular and robust:

Chaining simplifies complex tasks by breaking them down.
Routing ensures inputs go through the right process or model.
Loops allow iterative refinement or tool use until done.
Parallel branches speed things up and allow ensemble behavior.
Map-reduce (including dynamic map-reduce) handles unknown numbers of subtasks elegantly.
Structured termination prevents runaway loops.

Combining these patterns is common. For instance, an agent might route to one of several chained workflows, each of which may have an internal loop or parallel step. Designing the control flow is a lot like designing a program or state machine, except some decisions are learned (via prompts) rather than hardcoded.

Next, we'll look at scaling beyond a single agent into multi-agent systems.

Multi-Agent Orchestration Patterns

Sometimes one agent (even with tools) isn't enough, either due to complexity or a need for specialization. Multi-agent systems involve multiple LLMs (or multiple prompts of possibly the same LLM) that coordinate to solve a task. There are different architectures for this coordination:

Peer collaboration (network): Multiple agents talk to each other relatively symmetrically and decide among themselves how to solve the task. This is hard to manage but mimics a team brainstorming.
Supervisor or Manager-Agent: One agent (the "boss") looks at the task and delegates parts to other specialist agents (the "workers"). The boss then assembles the results. This is akin to an orchestrator-worker pattern⁴.
Tool-based specialists: A variant of the above where the specialist agents are wrapped as callable tools that a main agent can invoke. For example, you might implement separate agents for math, coding, and writing, and expose each as a "function" with an API (like solve_math(problem) calls the math agent internally). The main agent uses function calls to invoke them as needed.
Hierarchical teams: You can have layers of agents – e.g., a top-level planner agent spawns sub-tasks to two mid-level agents, each of which could further spawn their own sub-agents. This gets complex but might be useful for very elaborate processes where tasks naturally break into sub-tasks of sub-tasks.

Multi-agent team with a supervisor coordinating specialized agents

Interoperability and Communication

A key question for multi-agent setups is the communication protocol: How do agents exchange information? You might use a shared memory or "blackboard" (a common datastore where agents post updates), or direct messaging (one agent's output is fed into another's input), or a turn-taking dialogue if it's two agents conversing.

In April 2025, Google introduced the Agent2Agent (A2A) protocol, an open standard to let agents communicate and coordinate across different platforms¹⁷. The idea is that an agent from Vendor A could call upon an agent from Vendor B as a helper, or a company could mix-and-match best-of-breed agents in one workflow. A2A handles message passing, authentication, and session state between agents, aiming for "universal interoperability." This is a recognition that the future may consist of an ecosystem of agents that need a common language, much like how microservices have APIs.

The Challenge of Multi-Agent Systems

Despite the excitement, a sobering 2025 study titled "Why Do Multi-Agent LLM Systems Fail?" found that current multi-agent frameworks often show minimal performance gains over single-agent approaches on benchmarks¹⁸. Through analyzing 7 frameworks on 200+ tasks, the authors built a taxonomy of 14 distinct failure modes in multi-agent systems. These fall into three broad categories:

Specification issues: The overall task or roles of agents are ambiguously defined, so agents either duplicate work or leave gaps.
Inter-agent misalignment: Agents may pursue conflicting sub-goals or miscommunicate, especially without a shared understanding of the objective.
Task verification gaps: No single agent sees the whole picture, and if there's no robust final checker, errors go unnoticed.

They found problems like agents ping-ponging wrong answers to each other, "leader" agents spawning excessive subagents for trivial tasks, or conversely failing to delegate when they should. The study provides a roadmap calling for better coordination mechanisms and verification steps. In essence, multi-agent systems introduce new surface area for errors that can be hard to fix.

Best Uses of Multi-Agent Setups

According to both anecdotal reports and research, multi-agent setups work best when a task can be parallelized or modularized naturally. Anthropic's team, for instance, built a multi-agent "research assistant" that spawns parallel search agents for different facets of a question¹⁹. This shines for complex questions that require gathering diverse information quickly.

However, the cost is also higher: multi-agent runs used ~15× more tokens than single-agent chats in their experiments, translating to higher latency and expense. Thus, multi-agent is only worth it when the task is high-value enough to justify the cost (e.g. in-depth research, not a simple FAQ answer).

Limitations: Besides cost, some tasks simply don't decompose well. If every subtask depends on understanding the entire context, splitting it among agents leads to information loss. Anthropic found that many coding tasks, for example, weren't a good fit for multi-agent approaches because they require a tight, shared context that's easier for a single model to handle⁴.

We're likely to see hybrid systems where a powerful primary agent uses a few specialized helper-agents/tools, rather than swarms of dozens of agents. Keeping the team small and well-organized mitigates many failure modes. Even in multi-agent land, less can be more.

Orchestrator-Worker Pattern

As described by Anthropic⁴, one LLM (the orchestrator) dynamically breaks down a task and distributes pieces to other LLMs (workers), then synthesizes the results. This pattern shines when you can't know ahead what subtasks are needed or how many. For example, a coding agent given "implement this feature" might determine it needs to edit 3 different source files, and it will create 3 worker subtasks (one per file). Another example: a research agent might spawn multiple search/query tasks in parallel via workers, then gather their findings to answer a complex question.

Orchestrator-workers pattern – one boss delegating tasks and combining results

This is essentially a multi-agent approach (the workers could even be the same underlying model or different models specialized per task type). The orchestrator acts as a dynamic router and planner combined.

Mixture-of-Experts (MoE) / Mixture-of-Agents: Instead of one agent with a huge prompt or many tools, you have multiple focused agents and a router that picks which agent(s) should handle a given input. This is similar to routing but at an agent level. Together AI's recent demo with Groq hardware showcased a "mixture-of-agents" where a user query is passed to a manager that decides which expert agent – say one for coding, one for math, one for general knowledge – should respond, and then possibly combines their answers²⁰. This leverages the strengths of different models (e.g., using a code-tuned model for code tasks, a dialogue model for others) in one system.

Multi-Agent Conversations: Another emerging pattern is to have agents engage in dialogue with each other to reach a solution. For instance, you might have a "questioner" agent and an "answerer" agent: the questioner's job is to probe for missing information or clarify requirements, and the answerer tries to produce a solution. They chat until satisfied. This was explored in research to improve factual accuracy – the questioning agent forces the other to justify or clarify, hopefully leading to a more correct final answer. It's also used in simulations (e.g., two agents role-play an interviewer and interviewee to test a chatbot's knowledge or practice an interaction).

Multi-agent dialogues can also be used for evaluation: have two agents debate or one agent quiz another to assess correctness. Some evaluation harnesses simulate a user-agent conversation with an AI user to test the agent's performance in a realistic interactive setting²¹.

Coordination & State: In complex multi-agent systems, managing state and information flow is crucial. Each agent might have its own memory or context window. Often, you will structure it so that each agent has a subset of the global state relevant to its role. LangGraph's Subgraph feature is helpful here: you can encapsulate an agent's internal prompt and tools in a subgraph node that has its own local state schema, exposing only certain outputs to the parent graph²². Essentially, each agent runs somewhat independently, and you control what they share with others. This prevents one agent's irrelevant context from cluttering another's prompt.

Subgraph example: each agent subgraph has its own internal steps and state, interfacing through defined inputs/outputs

Case Studies:

ATLAS (Academic Task Learning Agent System)¹⁵ – a multi-agent system for student assistance. It has a Coordinator agent that interacts with the user and delegates to specialist agents: a Planner agent (for creating study plans), a Notetaker agent (for summarizing content), and an Advisor agent (for answering conceptual questions). Each of these is implemented as a distinct LLM prompt/agent. The Coordinator orchestrates the overall flow, deciding when to invoke each specialist and how to merge their outputs. ATLAS shows how multi-agent design can tackle a complex educational workflow by breaking it into clear pieces with defined roles.
AutoGen Research Team (Microsoft)¹⁵ – an example using Microsoft's AutoGen framework to create a team of agents with specific roles: an Admin (facilitator), a Planner (who outlines tasks), a Developer (writes code), a Tester (runs code and reports issues), etc. They collaboratively solve a problem (like writing a program). The Admin manages turn-taking and ensures everyone stays on task. This showcases the orchestrator-worker pattern in a coding scenario, improving reliability by dividing responsibilities (and it draws inspiration from human software teams).
Kiroku (AI document assistant) – this system (open-sourced by Harrison Chase) is essentially a writing assistant that heavily uses human-in-the-loop but also multiple internal agents for different tasks. For example, one agent might propose content for a section of a report, another agent checks for consistency or style, and a human overseer approves each section. Kiroku highlights that even with many agents, you often keep a human supervisor in the loop for final approval. It's a complex agent network designed to draft documents (like medical reports) with accuracy and compliance, using agents to enforce each requirement (one for medical terminology, one for clarity, etc.) and a human to sign off the final output.

Multi-agent systems can be powerful but are even more complex to get right than single-agent flows. Some tips:

Keep the number of agents minimal – more agents means more communication overhead and points of failure.
Clearly define each agent's role and ensure their prompts/instructions reflect that role (to avoid overlap or contradictions).
Use a shared knowledge base if needed (so agents have some common facts to refer to), but avoid giving every agent the entire state unnecessarily (to keep their context focused).
Simulate or test each agent independently first, then in combination. It's easier to debug one agent's prompt than an entire swarm at once.
Monitor the interactions – tools like LangSmith (for logging and visualizing LLM calls) or LangGraph's built-in trace viewer help a lot in debugging who said what when. Logging each message passed between agents is crucial for diagnosing issues in multi-agent coordination.

In practice, many "multi-agent" apps are actually implemented as a single agent that can call multiple tools (where each tool might internally invoke an LLM). The boundary can blur. For example, an "SQL Analyst Agent" might have one tool for direct database queries and another tool that is actually "ask the data analyst agent for help" – to the orchestrator, it's just calling a function, but behind that function a separate agent is doing work. Use whatever abstraction makes development and maintenance easier for you.

Self-Reflection and Self-Improvement Patterns

One exciting aspect of agentic systems is the ability for the AI to evaluate and improve its own outputs. Instead of relying solely on humans to spot errors or give feedback, we can design agents that critique themselves or each other. This can greatly enhance reliability and performance. Several patterns fall in this category:

LLM-as-a-Judge (Self-Evaluation)

The idea of using an LLM to grade or evaluate another LLM's output has gained popularity. For instance, after an agent produces an answer, you can feed the question and answer into a separate "judge" prompt: "Given the user's request and the assistant's answer, score from 1-10 how well it met the requirements and explain any errors." The judge LLM's analysis can then be used to decide if the answer is good to return or if the agent should try again.

However, an LLM judge can be biased or flawed in similar ways to the original model – it might incorrectly criticize a correct answer or miss subtle issues. A trick to improve this is a self-improving feedback loop:

Start with an LLM-as-judge using a generic prompt or criteria.
Whenever the judge's evaluation disagrees with human feedback (e.g., a user says the answer was wrong but the judge thought it was fine), log that case. Then periodically update the judge prompt (or fine-tune it) with a few-shot example of that scenario: "If the question is X and the assistant said Y, a human marked it wrong because... So you (the judge) should learn to catch that."
Over time, the judge gets better aligned with human preferences by incorporating real-world corrections into its evaluation process²³.

In other words, treat the evaluator like a model you are tuning – use human corrections to continuously refine it. This creates a sort of evaluation flywheel: more data → better judge → better evaluation of your main agent → targeted improvements in the agent itself.

One LLM ("Judge") evaluating another's output, with human feedback to improve the judge

This pattern is used in constructing automated evaluation pipelines (e.g., for checking the quality of summaries or code outputs, where the LLM judge can save a lot of human effort if reliable). It's important to keep a human in the loop at least periodically to make sure the judge isn't drifting or reinforcing subtle biases. Research like "Who Validates the Validators?" explores how to better align LLM-based evaluators with human judgments²³. The key takeaway: LLM evaluators are useful but should be calibrated and validated against real human opinions.

"Reflexion" (with an x) refers to a specific approach where an agent, after attempting a solution, reflects on what might be missing or wrong, and then tries again. This was proposed by Shinn et al. (2023) for improving task-solving by allowing the model to correct itself after a failure.

A simple implementation:

The agent produces an initial solution (say, code that likely has bugs, or an answer that might be incomplete or incorrect).
The agent is then prompted to critique its own output: e.g., "Critique the above output. What could be improved or what might be incorrect or still needed?"
The agent generates a critique or list of mistakes.
Use that critique to guide a second attempt at the solution. For example, include the critique as additional instructions: "Given the critique, produce an improved solution."
Optionally, loop this process multiple times or until the critique finds no major issues.

This effectively has the agent play both "student" and "teacher" to itself. It's surprisingly effective in practice for things like math word problems or coding challenges, where the first answer might be wrong but the model can spot its own mistake upon reflection. The critique step forces the model to consider the requirements or correctness criteria explicitly.

A variant is to use a second model as the critic (like GPT-4 as critic for a GPT-3.5 answer). But even a single model can do it in two stages by role-playing evaluator and solver.

Case Study: An evaluator-optimizer loop as described by Anthropic⁴ is basically this pattern: one LLM generates a solution, another (or the same in a different mode) evaluates it, and based on that feedback, the first tries again, repeating until the evaluation is satisfactory. They give examples like writing a story with certain criteria, where another model checks if the story meets those criteria, iterating until it does.

Evaluator-optimizer loop – one model proposes, another evaluates, and feedback loops until criteria are met

Another interesting use is in tree search. There's a concept called Language Model Monte Carlo Tree Search (LM-MCTS) where an agent tries multiple reasoning paths (branching out possible solutions), and at each intermediate step a value function (which could be an LLM or a heuristic) evaluates partial solutions to decide which branches to explore further. Reflection can serve as that value function, steering the search towards the most promising line of reasoning by pruning bad trajectories early.

Self-Healing Code Agents

When agents write code, one common approach is to let them run the code (or at least compile it) and use the results to improve the code. This has become a pattern in itself:

The agent generates code based on a prompt or specification.
Execute the code or run tests (this might be done by the agent calling a tool).
If an error or failing test is encountered, feed that back into the agent: e.g., include the error message in the prompt with instructions to fix it.
The agent debugs and produces a new version of the code. Loop as needed (with a cap on attempts).

It's essentially using the environment (runtime errors and test results) as feedback instead of a learned evaluator. Many code agents (OpenAI Codex, Replit's Ghostwriter, Amazon CodeWhisperer, etc.) do this automatically in their backend. It dramatically increases success rates in coding tasks, because the model doesn't have to get it perfect on the first try – it can learn from mistakes.

A Self-Healing Codebase agent¹⁵ extends this idea with memory: it not only fixes one bug, but also stores the bug+fix as knowledge so that if a similar error happens in the future (even in a different part of the code), it can recognize "aha, I've seen this error and the fix before." In that system, they vector-embedded error messages and their associated fixes. On a new error, they retrieve similar past errors to guide the new fix. This is an innovative blend of long-term memory + reflection specifically for coding. Over time, the agent "learns" from each coding task and builds up a library of solutions to common pitfalls.

Red Teaming and Fuzz Testing Agents

To improve an agent, you need to find its weaknesses. Red teaming means stress-testing the agent with tricky or adversarial inputs to see how it fails. One pattern is using generative approaches to create these stress tests:

LangFuzz (an experimental tool by LangChain) generates pairs of similar inputs designed to probe consistency²⁴. For example, it might take a question and slightly alter a detail (metamorphic testing) to create a new question. If the agent's answers to the pair are dramatically inconsistent (and they shouldn't be), that flags a potential issue (at least one of the answers is likely wrong).
These flagged cases can then be added to a test suite or dataset for regression testing. They can also be fed back to the agent (or used in fine-tuning) to improve it.

Another tactic is to simulate malicious or edge-case users. Multi-agent simulations can do this: for example, one agent acts as an "attacker" trying to trick the main agent into breaking rules or revealing secrets, while the main agent is supposed to resist. By observing where the main agent fails in these simulated attacks, developers can patch those vulnerabilities (either via prompt changes or additional guardrails). This method was used by some researchers and companies to red-team GPT-4 by having one AI attempt to socially engineer another AI.

By incorporating such tests into your development cycle (and even automating them), you create a continuous improvement loop: every time the agent fails a new edge case, you either adjust the prompt, add a guard, or fine-tune on that case, then test again. Over time, the agent becomes much more robust.

Learning from Experience

So far, we mostly discussed patterns at the prompt orchestration level. But there's also the angle of learning from actual usage. If you have users interacting with your agent, you can capture outcomes (was the user satisfied? Did they have to correct the agent? Did an error occur?) and feed this back into training or prompt engineering:

Maintain a log of failures and their fixes as few-shot examples (similar to the self-improving judge above, but applied to the agent itself). Periodically retrain or re-prompt the agent to handle those cases better.
Use reinforcement learning from human feedback (RLHF) on your specific task: e.g., fine-tune the model to prefer actions that led to success in your logs vs. those that didn't. This is non-trivial outside large orgs, but conceptually an extension of these ideas.
At minimum, do offline evaluation on real conversation transcripts to identify common failure modes and address them (through better prompts, new tools, or additional training data). Even without formal RLHF, iteratively updating your prompts based on real failures is a simple form of learning.

The theme across these reflection and improvement patterns is closing the loop: Agents are not fire-and-forget scripts; you monitor how they perform, have mechanisms (automated or human-guided) to evaluate that performance, and loops to make them better either on the fly or in the next version. This transforms agent building from a one-time prompt engineering effort into a continual process of evaluation and refinement.

Human-in-the-Loop Patterns

Despite fancy AI capabilities, human oversight remains incredibly important for agent systems – especially in high-stakes or customer-facing scenarios. Human-in-the-loop (HITL) design means a human can intervene or contribute at certain points in the agent's operation. Rather than full autonomy, the agent and human collaborate.

There are different degrees of human involvement:

Approval/Confirmation Checkpoints: The agent pauses at a certain step and asks a human supervisor (or the end-user) "Should I proceed with doing X?" or "Here is my plan/result, do you approve or want to change anything?" If the human says no or provides edits, the agent incorporates that and continues. This is common when an agent is about to do something irreversible or high-impact (e.g., before an agent executes a large transaction, require human approval).
Edit and Continue: The agent generates something intermediate (like a plan, an email draft, or a piece of code) and a human can edit that output directly. The agent then takes the human-edited version as the basis for the next steps. This is powerful because the human can make nuanced corrections that would be hard to specify in advance. For instance, if an agent is about to call an API with certain parameters and the human sees one parameter is wrong, they can just fix it in the JSON, then let the agent proceed.
Disambiguation Queries: The agent might proactively ask the user (or a human operator) questions when it's unsure. Instead of guessing and possibly going astray, it can say "I have two ways to interpret your request, can you clarify which one you mean?" This uses human input to steer the agent at decision points.
Fallback to Human: If the agent gets stuck, detects its own low confidence, or runs into a scenario it's not programmed to handle, it can hand off the task to a human entirely or escalate to a human for guidance. Many real-world deployments do this: e.g., a customer support chatbot that seamlessly transfers the chat to a human agent when it can't help or when the user explicitly requests a human.
Human-as-a-Service (Agents Hiring Humans): Pushing this concept further, some projects treat humans as on-demand "workers" that agents can pro-actively call when needed. This is sometimes jokingly called the PayMan pattern¹⁵:
- An agent is given access to a special tool, e.g. request_human_help(task_description, price). When invoked, this would post the task to a human marketplace or alert a human operator.
- The agent might use this if it encounters something it absolutely cannot do, such as solving a CAPTCHA, making a phone call, or performing a physical-world action.
- The human completes the task and the result is returned to the agent, which then continues its process.
- While not common in production due to ethical and latency concerns, this pattern underscores a philosophy: use the AI for what it's good at, and involve humans for the rest. The agent becomes an orchestrator of both AI and human resources.

In design, adding HITL means identifying points in the flow where a pause for human input is valuable. This often corresponds to:

Right before final actions with external effects (sending an email, making a purchase, deleting data).
After some analysis but when multiple options are possible or the agent is unsure (so a human decision can prevent wasted effort).
When the agent's confidence is below a threshold (if you have a way to estimate that, e.g., from an LLM judge or a heuristic).
On any trigger that the agent might be producing a disallowed or risky output (a human review for safety).

For implementation, frameworks like LangGraph allow you to insert breakpoint nodes which essentially halt the graph and await human input²⁵. The agent's state can even be modified by the human at that point (e.g., the human can edit the agent's proposed solution directly). Once the input is provided, the agent resumes from that state. For example, LangGraph has a tutorial where an agent is creating UI code; it generates some HTML/CSS, then a human can preview it live and either approve (continue) or give feedback. The feedback is inserted into the agent's state (like adding an instruction "make the button blue instead of red") and then the agent continues to refine the UI code with those instructions²⁶.

Agent paused for human input – e.g., user can approve or edit the agent's plan before continuing

Case Studies:

Kiroku (Document Assistant) – heavily uses human input. The AI suggests sections of a document (like a patient medical report), but the human user can edit each section, reorder them, delete or add sections, etc., before finalizing. Essentially, the AI does the heavy lifting of drafting and organizing, but a human is in the loop at every step ensuring the final output meets requirements. This pattern acknowledges that fully automating complex writing might be too error-prone; instead, AI handles perhaps 80% of the work, with a human doing the final 20% polish and sign-off.
Customer Support Chatbots – in production, many AI customer support bots have a flow like: if the user asks to speak to a human, or if the AI's confidence in its answer is low, or if the conversation goes in circles, then escalate to a human agent. The human agent sees the conversation history and continues from there, often with suggestions from the AI. This kind of HITL ensures a user is not left frustrated by an AI that can't solve their issue. It's essentially a safety net.

The challenge with HITL is to make the human-agent handoff smooth. If a human intervenes, the system should handle that gracefully – e.g., incorporate the human's input into the agent's state (so the agent doesn't ignore the correction) and maintain context so the human doesn't have to repeat information. It also requires UI considerations: the user or supervisor needs a clear way to provide input (approve, edit, comment, etc.), and they need to understand what the agent is asking for.

From a design perspective, figuring out where to put human checkpoints is key. Too many, and you negate the efficiency gains of automation (the process becomes annoying to the user). Too few, and you risk errors going unchecked or users losing trust. A common compromise is to have the agent handle routine cases autonomously, but route uncertain or risky cases to humans. This way, human experts focus attention only where it's most needed – a form of hybrid intelligence that pairs AI speed with human judgment.

In summary, human-in-the-loop patterns acknowledge that AI is not infallible and that human guidance can dramatically improve quality and trust. Especially in enterprise and high-stakes domains, these patterns are often non-negotiable for production deployment.

Tool Use and Integration Patterns

One of the superpowers of modern LLM agents is their ability to use external tools (via API calls, function calling, etc.). This allows them to operate in the real world: browse the web, execute code, query databases, control IoT devices – in general, to augment their text-only knowledge with actions. Several patterns revolve around effective tool use:

ReAct and Tool-Using Agents

The ReAct framework (by Yao et al., 2022¹) is a popular baseline for tool use. In ReAct, the agent interleaves reasoning ("Thought") and acting ("Action") steps. A prompt is structured so the model outputs either a thought (which might be followed by a tool call) or a final answer. The agent uses tools to gather information until it decides it has enough to answer.

This has become the template for many general-purpose agents:

The prompt includes instructions and possibly few-shot examples showing how to use tools. Typically it lists available tools and a format like: "Thought: <the agent's reasoning>\nAction: <tool_name>[<tool_input>]"
On each loop, the agent can pick one of the available tools and provide input.
The environment (the code outside the model) executes that tool (e.g., calls the search API, runs a calculator) and returns the result.
The result is fed back into the model's context (appended as an "Observation").
The model then continues with another Thought, maybe another Action, and so on, until it outputs a final answer.

Pattern-wise, key aspects are:

Having a well-defined tool interface: Each tool should have a clear description and input/output format so the model knows how to use it. Good tool documentation in the prompt significantly improves reliability⁴. With OpenAI's function calling or similar, you provide a JSON schema for tool inputs/outputs, which helps the model call tools correctly.
Chain-of-Thought prompting: Encouraging the model to think step-by-step (thoughts) before actions, which improves the quality of its decisions. ReAct implicitly does this by alternating reasoning and action.
Handling tool errors or limits: The agent's prompt or code should handle cases like a tool returning nothing useful or encountering an error (e.g., a search returning no results, an API call failing). The agent might then try a different approach or ultimately apologize it can't do it.
Termination condition: In a ReAct loop, the agent needs a way to decide to stop using tools and provide the final answer. This is often when the thought says something like "I now have enough information" and then it outputs the answer instead of an action.

ReAct agents are typically goal-directed but not strictly constrained in how many steps to take, making them quite autonomous. They are powerful for open-ended tasks like "research this topic and write a report" or "find and book the best flight for me" where multiple steps and external info are needed. The drawback is unpredictability and the potential for doing something you didn't expect (hence the need for safeguards and possibly human oversight, as discussed).

Anthropic's description of building agents⁴ essentially describes this pattern: an agent will plan and execute in a loop, possibly pausing for human input or to reconsider. They caution that a fully autonomous loop is hard to control, which is why all the earlier patterns (routing, validations, etc.) often need to be layered around a ReAct core to keep it reliable.

Retrieval-Augmented Generation (RAG) Agents

Retrieval-Augmented Generation (RAG) is so common it's worth highlighting as its own pattern. RAG means the agent taps into an external knowledge source (usually via vector similarity search or a database query) to ground its responses in up-to-date or specific data.

A basic RAG pipeline is:

Take the user's query and generate an appropriate search or database query (this could be as simple as a vector similarity search on a document index).
Fetch relevant documents or facts.
Insert those documents (or their summaries) into the prompt, along with the question, so the LLM can base its answer on that information.
The LLM generates an answer, often with direct quotes or citations from the retrieved docs to increase factuality and trust.

An agentic RAG pattern is when the retrieval isn't just one fixed step before answering, but rather controlled by the agent in a loop or dynamically:

The agent might do multiple retrieval steps. For example: ask one query, read results, realize something is missing, then do another query with a refined question, and so on.
The agent might choose among multiple data sources or search strategies. For instance, it could decide: "This question is about a specific person's resume, I should search our internal HR database, not the general docs." Or "Let me query the vector store for technical docs, and also do a keyword search in the FAQ database, then merge results."
The agent might verify the information from retrieval. For example, after generating an answer, it could double-check each factual claim against the source docs (and if a claim isn't supported, it might trigger another search or mark it for human review).

Some advanced retrieval patterns include:

Multi-hop Retrieval: The agent does a chain of retrievals where each informs the next. E.g., user asks: "Who won the Nobel Prize in Physics in the year the author of The Road Not Taken was born?" The agent might need to break this into steps: find who wrote The Road Not Taken (Robert Frost), find what year he was born (1874), then find Nobel Physics winner in 1874. A static single-step RAG might not handle that well, but an agent can do it with multiple linked queries.
Graph-augmented Retrieval (GraphRAG): Instead of (or in addition to) a vector store, the agent queries a structured knowledge graph. For example, if you have data in a Neo4j graph, the agent might translate a user's question into a Cypher query (via a tool) to get precise relational info. Graph-based retrieval is especially useful for multi-hop questions about relationships. Microsoft researchers proposed a "From Local to Global GraphRAG" approach where they extract entities and relations from text into a graph, then use network analysis and prompting to answer complex queries²⁷. Tomaz Bratanic reproduced this with Neo4j and LangChain: the agent first identifies key entities in the query, queries the graph for related info, and uses that to formulate the answer. The pattern here is combining unstructured and structured data sources seamlessly.
Self-Querying Retrievers: A technique where the LLM itself generates a search query that includes not just keywords but also metadata filters, which a specialized retriever can interpret. For example, LangChain has a SelfQueryRetriever that allows an LLM to say: "search the documents for 'climate policy' where year:2021 and type:report." The retriever then converts those filters to a structured database or index query. This results in more targeted retrieval, as the model can leverage its understanding of the user query to narrow down results beyond pure semantic similarity.
Ensemble Retrieval: Using multiple retrieval methods in parallel and combining results. For instance, an agent might do both a keyword search and a vector search, or search in multiple indexes (one index of fine-grained paragraphs and another of high-level summaries). It then merges the two result sets. This can yield better coverage: the vector search might catch something the keyword search missed and vice versa. IncarnaMind is an example project that used an ensemble of sliding-window chunk retrieval (fine detail) and broader document-level retrieval (coarse context) to answer questions about personal documents¹⁵. The agent queries both and uses the combination to form a more accurate answer.

The core idea of RAG is giving the agent access to knowledge beyond its parameters. This greatly reduces hallucinations and allows handling of queries on up-to-the-minute information or proprietary data. The patterns above add flexibility: multi-step retrieval lets it handle complex queries, graph retrieval adds precision for relational data, self-query improves relevance, and ensemble methods increase recall.

Case Study: The Sophisticated Controllable RAG Agent¹⁵ is an open-source project that implements a complex RAG workflow as a deterministic LangGraph. It breaks down hard questions into sub-questions, performs iterative retrieval and re-planning, and verifies each piece of the answer against sources to avoid hallucination. It's not just one agent prompt – it's a network of nodes handling each stage (question decomposition, document search, reading, answer assembly, source citation). They manage to avoid hallucinations by verifying every claim with a retrieved source. This is an example of combining many patterns: planning, multi-step retrieval, reflection (checking the answer), and a structured flow that the developers control. The trade-off is complexity and more steps, but it's a glimpse into how one might design agents for enterprise Q&A that must be correct and traceable.

Another example: Realm-X (AppFolio's AI assistant for property management) uses parallel retrieval and workflow branching to answer user questions about real estate data. If a query is about a specific property, it routes to an agent that pulls data from internal databases; if it's a general question, it might use a different path. They reported boosting accuracy by 2x through these strategies, and using LangGraph to parallelize tasks (like fetching different pieces of info simultaneously) to reduce latency. This shows the real-world impact of careful RAG design and control flow – it can significantly improve correctness and speed in a production agent.

Structured Output Enforcement

When an agent must return data in a specific format (JSON, XML, a SQL query, etc.), we use structured output techniques to reduce errors:

The simplest is to put in the prompt: "Return the answer as JSON with fields X, Y, Z." and maybe give an example output. Models often follow this, but as outputs get larger or more complex, errors (like missing a bracket or misnaming a field) can occur.
More robust: use the model's native structured output features. For instance, OpenAI's function calling allows you to define a schema for output, and the model will respond with a JSON object that can be programmatically parsed (or you get a Python dict directly in the API response).
If function calling isn't available, some use libraries like Guardrails or regex-based validators to catch format issues and have the model correct them. For example, if you expect valid JSON and the model returns invalid JSON, you feed the error back in: "The JSON was invalid because X, please fix it."
A newer technique is using JSON Patch Operations (the approach behind the open-source tool TrustCall). Instead of asking the model to output a full complex JSON, you ask it to output a JSON Patch (a standardized format for describing changes to a JSON document). You can start with an empty or partial JSON, and have the model iteratively add or modify parts via patches until the schema is complete and valid. Each patch is small and easier for the model to get right, and after each patch you can validate the JSON and only continue if needed. This "patch, don't regenerate" approach has been shown to be faster and more reliable for filling complex schemas²⁸.

For instance, if the agent's job is to extract information into a deeply nested JSON structure, doing it in one shot often leads to some part missing or incorrectly placed. With a patch-based approach, the model can focus on one section at a time. If it makes a mistake, it only has to patch that part, not rewrite the whole thing. TrustCall (by W. Hinthorn) implements this: it asks the LLM to generate JSON Patch instructions rather than full JSON. The benefits reported are faster and cheaper structured output generation (because the model doesn't waste tokens rewriting large chunks), resilience to validation errors (you catch and fix them incrementally), and accurate updates to existing data without accidentally deleting other parts. It works across use cases like information extraction, routing (filling a routing schema), and multi-step agent tool use where intermediate data is in JSON.

One can combine function calling with patching: e.g., the function schema could expect a list of JSON Patch operations, which you then apply to an in-memory object.

Code Outputs and Validation: Similarly, if an agent outputs code, you can enforce structure by providing templates or using linters. For example, if the agent should output only a SQL query, give a few-shot prompt where every output is a SQL snippet with no extra commentary. Then programmatically test that the query is syntactically valid (perhaps by running an EXPLAIN on it) and doesn't have forbidden clauses. If it fails, prompt the agent to fix the SQL. For more general code, incorporate compilation or unit tests into the loop (like the self-healing code pattern above) – in effect, the "structure" being enforced is passing all tests.

Another strategy is piecewise output assembly: have the agent output a certain part in one step, then use another step to wrap or merge parts. E.g., first generate a function body, then have another prompt to insert that into a provided class template. This divides the task so the model's outputs are more constrained at each step (which can improve correctness).

In summary, enforcing structured output often means shifting some work from the model to the system. Instead of relying on the model's internal reliability, we guide it with scaffolding: schemas, patch instructions, iterative validation. This reduces errors significantly in real-world pipelines where structured data is needed.

References: OpenAI's function calling documentation²⁹ describes how forcing an output schema can be done via the API (this was a big step forward for tool use and output formatting). The TrustCall approach was introduced in 2023 – see the GitHub README for TrustCall which explains the JSON patch method and its benefits. This approach is gaining traction for any scenario where JSON outputs are large or dynamic. Also see the "Structured output with LangChain" guides¹⁵ for examples of using Pydantic schemas and step-by-step output construction.

Agents can be extended beyond text inputs and outputs, allowing multi-modal interaction:

Voice Integration: By plugging in speech-to-text and text-to-speech, you can create agents that users can talk to and that speak back. For instance, use OpenAI's Whisper or Google's speech API to transcribe user speech into text, feed that to your agent, then take the agent's text response and use a service like ElevenLabs or Amazon Polly to synthesize speech. The agent's core logic remains text-based, but it's wrapped in a voice interface. Patterns to consider: you might need the agent to produce more conversational, auditory-friendly output (shorter sentences, maybe adding verbal cues like "hmm" or controlling tone). It's usually good to allow barge-in or handle partial speech input (for responsiveness). Some applications also use voice for notifications ("Agent reads out the summary of new emails" etc.).
Vision and UI Control: Agents can interpret images or control graphical interfaces. For example, OpenAI's vision-capable models (GPT-4 with vision) can analyze an image provided in the prompt and output text about it. This can be part of a tool-using agent: one tool could be "see_image" which returns a description of an image, enabling the agent to reason about visual data. Conversely, for control: tools like Selenium or robotic process automation can let an agent click buttons or fill forms on a webpage. A visual agent might get a description of the webpage DOM or a screenshot and then output an action like "click the 'Submit' button" which the tool executes. There are community projects integrating LangChain with browser automation to allow an agent to operate web apps through a combination of reading the HTML and simulating clicks/keystrokes¹⁵. This pattern effectively treats the whole web UI as a tool.
Robotics and IoT: In a physical environment, an agent might have tools to move a robot arm, take a photo, read a sensor, etc. The same principles apply – define actions, get observations. NASA JPL's ROSA agent, for example, connects a LangChain agent to a Robot Operating System (ROS) interface, allowing natural language commands to control a robot¹⁵. The agent interprets user commands into sequences of robot actions (like navigation or manipulating an object), possibly querying the environment via sensors in between. Key here is safety – often a human-in-loop is used for critical actions, and the action space is constrained to avoid dangerous moves.

Multi-modal integration often boils down to treating each non-text modality as either an input conversion or a tool. For input modalities (speech, images), you convert them to text or structured data the agent can handle. For output modalities (speech, GUI actions), you convert the agent's text or structured output into the desired format or action.

One challenge is that multi-modal interactions can increase the agent's context needs significantly. E.g., describing a complex image may produce a lot of text tokens. This requires efficient summarization or region-of-interest focusing. Another is that multi-modal outputs can be slow (speaking long text takes time, executing a sequence of UI actions takes time), so expectations of latency must be managed.

Despite these challenges, multi-modality is a frontier where agents can become far more useful. Think of an agent that can read a diagram or chart for you and answer questions, or one that can fill out webforms to accomplish tasks online, or an assistant that you can talk to hands-free. Many of the previously discussed patterns (memory, loops, etc.) still apply in multi-modal settings, just with additional data types in the mix.

Planning and Reasoning Patterns

Agents that have to come up with a complex plan before execution use patterns often called Plan-and-Execute or Meta-planning:

The agent first generates a high-level plan of steps to take (without executing them yet). For example, "To solve this, I will: 1) gather requirements, 2) generate two options, 3) compare options, 4) present the best option."
Then either the same agent or another agent/tool goes through the plan steps one by one. Sometimes the plan is purely for the agent's own guidance (it's not shown to the user, just used internally).
If a step fails or yields unexpected info, the agent might revise the plan on the fly. Or it might re-plan entirely if a new obstacle emerges.

This is useful when tasks benefit from forethought. Without an explicit plan, an agent might meander; with a plan, it has a roadmap. Some frameworks encourage writing out a plan in the prompt before proceeding to act. It's similar to chain-of-thought but more structured. OpenAI's function calling examples sometimes use a "plan" function that just outputs a plan, then an "execute" function.

Another reasoning pattern is ReAct with deliberation – sometimes called "ReAct+Reflect" or ReWoo (Reasoning Without Observation). The idea is to separate reasoning that requires external info from reasoning that can be done with internal knowledge. For example, an agent might first reason through a problem ignoring tools ("If I had all info, how would I solve this?") to avoid getting distracted by unnecessary tool use, then identify exactly what needs to be looked up and use tools for those missing pieces, then finalize the answer. This can lead to more efficient tool use because the agent doesn't just blindly start searching; it has a hypothesis to test or a specific gap to fill.

Autonomous Tool Discovery and Workflow Generation

A challenge with tool integration has been that the set of available tools is usually static and predefined. A cutting-edge idea is letting agents discover or learn new tools on the fly. Meta's Toolformer was an early step that fine-tuned a model to insert API calls into its text during training³⁰. More recently, we see efforts to have a meta-agent read API documentation and decide how to use an unfamiliar tool. The Model Context Protocol (MCP) and Google's A2A standard both envision scenarios where an agent can dynamically query for available tools and their capabilities[^29, 40].

A fascinating related pattern is agents creating and then calling new tools for themselves. Microsoft's AutoGen framework demonstrated agents that collaborate to write small Python functions which then get executed as tools¹⁵. Similarly, a paper called "WorkTeam" had multiple agents translate a natural language instruction into a formal workflow-essentially writing a little program with steps-and then executing it³¹. This blurs the line between programming and agent prompting. While powerful, it amplifies any errors, so researchers are exploring sandboxing these auto-generated tools.

Security: Prompt Injection Defense

As agents gain more power through tools, security becomes paramount. A very recent work proposes design patterns for securing agents against prompt injection attacks³². Prompt injection is when malicious input tricks the agent into ignoring its instructions or doing something harmful.

Some recommended patterns include:

Running the LLM in a more constrained mode.
Strictly validating outputs (e.g., if expecting an SQL query, disallow anything that's not a SELECT).
Using sandboxes for tool execution, so even if an injection causes the agent to execute code, it's in a safe environment with minimal permissions.
Confirmation checkpoints, where the agent must double-confirm dangerous actions with a human.
Context segmentation, where user content is kept separate from system prompts to avoid confusion.

The research emphasizes that security-by-design is becoming part of the agentic pattern language. We are likely to see frameworks providing out-of-the-box safety wrappers.

Large-Scale Deployment Patterns

When you deploy agents in a production system serving many users, some additional patterns and considerations arise:

Logging and Tracing: It's vital to log each action, tool call, and LLM prompt/response (excluding sensitive data) for debugging and monitoring. Tools like LangSmith or custom logging allow you to replay conversations, see where an agent might have gone wrong, and gather stats (like how many tokens each query uses, or how often each tool is called). Logging is also critical for auditability – if an agent made a decision (especially a wrong one), you want a record of why. Many teams build internal dashboards to inspect agent traces in near-real-time.
Batching and Async Processing: If you have many agent invocations (e.g., a thousand users asking things at once), treat LLM calls as asynchronous operations and batch them when possible. Some LLM APIs support sending multiple prompts in one request (which amortizes overhead and can be cheaper). For example, if your agent has a step where it needs to summarize 10 documents, instead of doing 10 serial calls, you might batch those into 2 calls of 5 each (if the API allows multiple completions per call) or run them truly in parallel with an async framework.
Task Queue and Workers: For background or long-running agents, use a job queue (like Redis Queue or Celery) and worker processes. This is more of a traditional software architecture pattern, but it applies: you don't want your web server thread blocking for 30 seconds while an agent does complex work. Instead, put the task in a queue, return an immediate response ("Your request is being processed"), and later provide the result (via WebSocket update or email, etc.). This also helps with retrying failed tasks and scaling horizontally by adding more worker machines.
Dynamic Scaling and Resource Management: Agents can be heavy on memory and CPU (especially if running local models). You may need to dynamically spin up more instances or use serverless functions for spikes in usage. It's good to implement backpressure (e.g., if too many requests, queue them or politely reject some) to avoid overwhelming the system or hitting rate limits on external APIs.
Monitoring and Alerts: Set up monitoring on key metrics: error rates (how many agent tasks fail or throw exceptions), latency (if it spikes, something might be wrong), cost (if tokens used per task jumps due to a prompt bug, you want to catch that), and user feedback (spikes in thumbs-down ratings or support tickets might indicate a problem with the agent). Use alerts to get notified of anomalies. Essentially, treat your agent like a microservice that needs ops monitoring.
Continuous Evaluation and A/B Testing: Regularly run a suite of test queries (maybe dozens or hundreds of example tasks) through your agent to check for regressions. If you update a prompt or add a new tool, run these tests and compare outputs before and after. Automated eval might include both LLM-as-judge scoring and specific checks (like "did it cite sources for each answer?"). Moreover, A/B testing different agent versions with real users can be insightful – for example, you might compare a version using GPT-4 vs. GPT-3.5 to see if the quality gain is worth the cost, or test a new planning strategy with a small percentage of traffic before rolling it out fully.
Versioning and Rollback: As you iterate on your agent (which you will, frequently), version your prompts and chains. If a new version performs worse, be ready to rollback. Also consider having a staging environment where new changes run on test users or internal users first.
Security and Permissions: If your agent can use tools that perform actions (like sending emails or making purchases), implement proper authentication and permission checks. Usually the agent will call a backend API that then does the action, so that backend can enforce rules (e.g., "don't allow spending over $X without approval" or "only access this user's data, not others"). The agent itself should be sandboxed – e.g., if it's allowed to execute code, run that in a secure sandbox environment.
Data Privacy: If your agent logs conversations or stores long-term memory, be mindful of personal data. Either avoid storing sensitive info or anonymize it. Comply with regulations (GDPR etc.) by providing ways to delete user data from the memory store upon request. Vector databases should be treated as containing personal data if you store user utterances or profile info in them.

Reliability and Evaluation

A significant development is the rise of new benchmarks and evaluation methodologies that highlight where agents still fall short.

Emerging Benchmarks for Agentic Systems

Beyond synthetic puzzles, new benchmarks focus on real-world capabilities and failure modes:

Real-World Task Benchmarks: Benchmarks like BEARCUBS focus on "computer-using" web agents-tasks like booking a flight or finding and editing a Google Doc. In BEARCUBS, human success was 84.7% but state-of-the-art agents only achieved ~24%⁸. That huge gap underscores how brittle current agents can be in uncontrolled environments where small things like pop-ups can derail them.
Economic Decision Making: The EconEvals benchmark tests agents in unknown scenarios requiring budgeting, game-theoretic reasoning, or adapting to changing "market" conditions. It highlights that LLMs lack consistent rational decision-making over multiple turns when there are hidden or shifting variables³³.
Inner Monologue and Theory of Mind: A quirky benchmark called ROLETHINK evaluates how well role-playing agents can maintain an inner thought process distinct from their outward persona. Early results show models have a shallow grasp of this, often failing to conceal information or make correct inferences about others' beliefs³⁴.
Safety and Tool Use Evaluation: A "SAFEARENA" benchmark tests autonomous web agents on tasks that have potential for harm (like navigating to disallowed content)³⁵. It was found that many agents fail to recognize unsafe actions, spurring interest in tool permissioning systems that give each tool a safety profile.

Trust and Transparency

We're seeing more discussion of trustworthy agent design. To deploy agents in high-stakes settings, we need systematic ways to audit their decision trails, verify their knowledge sources, and ensure they follow constraints³⁶. This could mean requiring agents to produce:

Attribution: Citations for factual claims, already common in RAG.
Rationales: Explanations for why an action was chosen.

However, getting truthful rationales is tricky; there's a risk of rationalizing, where the agent generates a plausible but untrue explanation. Research into interpretable AI agents is just beginning.

Automated Chain Analysis (Identifying Weak Links)

As agent workflows become complex, it can be hard to know which part of the chain is most responsible for failures. A tool/pattern called SCIPE (Systematic Chain Improvement and Problem Evaluation) aims to automate error analysis of agent chains³⁷.

The idea is to run the agent on test cases and have an LLM "judge" evaluate the output of each node in the chain. By computing statistics for each node (how often it fails independently, how often it fails due to upstream errors), SCIPE pinpoints the node that, if improved, would most likely yield the biggest boost to final accuracy. This targeted approach can save a lot of time compared to guessing.

This pattern of systematic chain evaluation is emerging as a best practice for complex agents. It's like unit testing and profiling for traditional software, but adapted to LLM-based flows.

Case Study: Paradigm – an AI-powered spreadsheet startup – runs thousands of agents in parallel to automate data tasks across spreadsheet cells. They leveraged extensive logging and monitoring to handle this scale. By tracing each agent's operations with LangSmith, they could track token usage per user and even implemented a usage-based billing model³⁸. They also analyzed logs to optimize performance. Paradigm's case illustrates that beyond designing a single agent's logic, designing the system around the agents (for monitoring, cost management, scaling) is equally important for success.

Whew! We've covered a lot of patterns, from the fundamentals to advanced techniques. Designing an AI agent is more like architecting a complex system than just calling an API once. You have to think about data flow, error handling, modularity, and user experience – all under the uncertainty that comes with LLM outputs.

To round things out, let's highlight a couple of emerging patterns that are still experimental but intriguing, as well as some closing thoughts and references.

Emerging and Advanced Patterns

Beyond the mainstream techniques above, there are some cutting-edge or niche patterns worth knowing. These are areas of active development and research in the agent community:

Edge and Local Agents with Function Calling

Most examples assume large cloud-based models, but there's growing interest in running agents on smaller, local models (for privacy or offline use) and still having tool use via function calling. OpenAI's function calling was limited to their own API, but recent open-source efforts like TinyAgent demonstrate similar structured tool use on small models. The pattern is to fine-tune or prompt smaller LMs (like Llama 2 7B/13B, etc.) to output a special format indicating function calls, effectively enabling ReAct-like behavior without needing an API that natively supports it.

While smaller models are less capable in reasoning, the hope is that with the right prompting and perhaps some finetuning on tool-use demonstrations, they can manage basic agent tasks (especially if the tools do heavy lifting, like a calculator or database lookup). This would allow agentic apps to run at the edge (e.g., on a phone or embedded device) without sending data to the cloud, which is attractive for privacy. It's an emerging area: expect to see improvements in the tool-use abilities of local models throughout 2025.

Ensemble of Agents (Debate and Diversity)

We touched on ensembling outputs and multi-agent collaboration. Another emerging idea is to have multiple agents tackle the same problem in different ways and then reconcile their answers:

Debate format: Two agents with opposing viewpoints argue a topic, and a judge agent (or human) decides who is more convincing. This was proposed as a way to get more truthful answers (the agents call out each other's errors).
Diverse thought chains: You prompt multiple instances of an agent with slightly different prompts or randomness to get a variety of solutions, then either vote or take the best. For example, to solve a hard math problem, you might run an agent 5 times and see if at least one run gets it right (then optionally verify it). The runs could also exchange information – e.g., each agent writes down its reasoning, and another agent tries to synthesize a final answer from all of them (like crowd-sourced reasoning).

The pattern here is leveraging multiple independent attempts to increase overall robustness. It's especially useful when a single run has a significant chance of failure or getting stuck. By having parallel "brains" working on it, you reduce the chance that all of them fail in the same way. Of course, this increases cost linearly with the number of agents, so it's a trade-off.

Ensembling is already used in research for evaluation (to reduce variance in judging outputs) and in some products where reliability is paramount (they might run two different models and only trust the answer if they agree, for instance). We may see frameworks offering easier ways to spin up agent ensembles and merge results in the future.

Conclusion: The Path Forward

The hype around fully autonomous AI agents has cooled somewhat, as many discovered that agents are still early, often expensive, and sometimes unreliable with current tech². The past year has taught us that:

Larger context and fancy prompts don't automatically yield coherent multi-turn performance (the 39% drop issue). We need new strategies there.
More agents and tools can solve more complex tasks, but they introduce new failure modes and expenses – careful orchestration is needed to reap the benefits.
Self-improvement loops help, but without grounding in external feedback, they hit a ceiling.
Security can't be an afterthought. As agents become more autonomous, ensuring they don't go off the rails is paramount.

However, by leveraging the design patterns we've covered and smartly integrating humans and traditional software logic, we can build practical, narrow-purpose agents that deliver value today.

The most promising near-term approach is to use AI to augment existing processes rather than aim for total autonomy². In practice, that means:

Solve well-scoped tasks where success criteria are clear. For example, "extract these fields from these documents" or "answer support tickets about password resets." Avoid letting the agent drift into undefined territory. Agents shine when their goal can be clearly defined and evaluated.
Keep humans in the loop for critical decisions and oversight. Human-AI collaboration, where the AI handles grunt work and the human handles exceptions, is a proven strategy in many deployments. It improves quality and builds user trust.
Add autonomy incrementally. Perhaps start with a workflow (fixed chain) and gradually let the LLM make more decisions as you gain confidence in its behavior. You don't have to release an AutoGPT-like free-roaming agent on day one.
Set realistic expectations. Even with all these patterns, agents can fail in unexpected ways. Be transparent with users about the AI's limitations and ensure there's a fallback to a human or a safe state. It's better to under-promise and over-deliver.
Continuously evaluate and iterate. Treat agent behavior bugs like software bugs – use tools to find, log, and fix them. Update your prompts, add new tools, or adjust the chain design as you learn from real usage. The work doesn't end when the agent is deployed; in a sense, that's when it begins.

As we solve these challenges, the dream of reliable autonomous AI agents comes closer to reality, step by step.

Open Challenges: A few big-picture challenges remain open as of 2025:

Robustness and Generalization: Agents still struggle with anything outside their training distribution or with adapting to unexpected changes.
Evaluation Metrics: We lack good automated metrics for agent success beyond task completion. Human evaluation is often needed to truly assess if an agent's output is useful, correct, and safe.
Efficiency: The token appetite of advanced agents is enormous (as noted, multi-agent systems can use 4× to 15× more tokens¹⁹). If we want agents running continuously, we need to make them more efficient through techniques like caching, using smaller models for subtasks, or state compression.
Ethics and Alignment: With more autonomous behavior comes a need to ensure agents act in alignment with human values. Aligning agents is arguably harder than aligning a single-turn chatbot, because the agent might discover creative but unapproved strategies to achieve its goals.

It's an exciting time: every month brings new techniques for agent design. By applying the patterns we've reviewed, you can craft AI agents that are not just hype, but genuinely helpful.

References:

Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models" (Oct 2022). Introduced the ReAct prompting approach for LLMs to decide when to use tools and how to chain reasoning steps. ↩ ↩²
Adrian Krebs, ""AI Agents: Hype vs. Reality", Kadoa blog (Dec 2024). An overview of the challenges facing autonomous agents and perspective on monolithic vs. multi-agent designs, arguing for pragmatic use-cases first. ↩ ↩² ↩³ ↩⁴ ↩⁵
Hacker News discussion on single vs. multiple LLM calls (2024). An OpenAI engineer's insight on when to prefer one big model call versus a sequence of calls, shared in a comment (summarized in the text above). ↩
Anthropic, "Building effective agents" (Dec 2024) – a blog post outlining patterns like prompt chaining, routing, parallelization, etc., and best practices for agent design drawn from Anthropic's research and experience. ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶ ↩⁷ ↩⁸ ↩⁹ ↩¹⁰ ↩¹¹ ↩¹² ↩¹³
Air Canada chatbot incident – The Guardian (2023). Example of legal issues when an airline's AI assistant provided misleading info, leading to a court injunction (underscoring the importance of reliability). ↩ ↩²
Andrew Ng, The Batch newsletter (Aug 2024). Noted the drop in GPT-4's token pricing by ~$79 per year and advised focusing on functionality over cost optimization (since cost trends are favorable). ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶
LangChain Docs – "How to manage conversation history" (2023). Techniques for filtering, trimming, and summarizing chat history to keep prompts concise and relevant. ↩ ↩²
Reddit (r/LangChain), "10 Agent Papers You Should Read from March 2025" – Community summary of recent papers, including Plan-and-Act, MemInsight, and the BEARCUBS benchmark (agents at 24.3% vs humans 84.7% on web tasks). ↩ ↩²
LangChain Docs – "Adding long-term memory (semantic search)" (2023). Discusses multi-vector indexing and advanced retrieval methods for agent memory. ↩ ↩²
Park et al., "Generative Agents: Interactive Simulacra of Human Behavior" (2023). Created simulated characters with long-term memory and reflection. ↩
Laban et al., "LLMs Get Lost In Multi-Turn Conversation" (2025). arXiv:2501.05321. Comprehensive study of 15 LLMs across 200,000+ simulated conversations revealing systematic 39% performance degradation in multi-turn vs single-turn interactions, with identified behavioral patterns including premature solution attempts, answer bloat, loss-in-middle-turns, and cascading errors. ↩ ↩² ↩³
LangGraph Documentation – "Time Travel" (2023). Describes replaying and forking agent states for debugging and iterative improvement. ↩ ↩² ↩³
LangGraph Documentation – "Cross-thread persistence" (2023). Describes sharing state between agent sessions (useful for long-term memory and global knowledge bases). ↩ ↩² ↩³
Saptak Sen, Essay on long-term agent memory (2025). Explores hybrid approaches to memory storage, analogous to human working vs. long-term memory. ↩
Nir Diamant's GenAI Agents project (2023) – Open-source examples of many agent patterns (ATLAS academic agent, multi-agent research assistant, self-healing code agent, etc.) available on GitHub. These illustrate how the concepts discussed can be implemented with LangChain/LangGraph in real applications. ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶ ↩⁷ ↩⁸ ↩⁹ ↩¹⁰ ↩¹¹ ↩¹²
LangGraph Tutorial – "Map-reduce operations" (2023). Example of dynamically spawning graph branches based on LLM-generated lists and aggregating results (implementing flexible map-reduce workflows). ↩
Google Developers Blog, "Announcing the Agent2Agent (A2A) Protocol" (Apr 2025). Introduced a standardized protocol for agent interoperability. ↩
Cemri et al., "Why Do Multi-Agent LLM Systems Fail?" (Apr 2025). Empirical taxonomy of 14 failure modes in multi-agent systems. ↩
Anthropic Engineering, "How we built our multi-agent research system" (Oct 2024). Detailed the orchestrator–worker agent pattern, noting token usage was ~15× higher than single-agent. ↩ ↩²
Together AI demo, "Mixture-of-Agents architecture" (2024). Demonstrated a system routing queries to multiple specialized 8B Groq model agents and then combining their answers (showing MoE approach to agents). ↩
LangChain Evaluation – Simulated chat evaluation (2023). Using two agents (one as a fake user, one as assistant) to generate conversations and then scoring the assistant, as a way to automate evals. ↩
LangGraph Docs – "Subgraphs for multi-agent systems" (2023). How to encapsulate an agent within a subgraph node for better modularity and isolation in a larger graph (each sub-agent manages its own state). ↩
Shreya Shankar et al., "Who Validates the Validators? Aligning LLM-Assisted Evaluation with Human Preferences" (Apr 2024). Proposes methods (EvalLLM, EvalGen) for improving LLM-based evaluators using human feedback, highlighting the importance of calibrating AI judges. ↩ ↩²
LangChain LangFuzz (Sep 2023). Experimental library for fuzz-testing LLM applications by generating pairs of similar inputs to find inconsistencies in outputs (metamorphic testing for AI). ↩
LangGraph Tutorial – "Human-in-the-loop and Breakpoints" (2023). Shows how to insert a pause for user input in an agent workflow and resume after, including editing agent state mid-run. ↩
Community example – "Web UI Generator Agent" by Ajit A. (2023). Implements a loop of an agent generating Tailwind CSS code and a human reviewing it each cycle, using LangGraph interrupts to allow accept/reject feedback. ↩
Poliakov & Shvai, "Improving RAG for Multi-Hop Queries using Database Filtering with LLM-Extracted Metadata" (2024). Research showing that combining LLM reasoning with structured database queries can improve multi-hop question answering (an example of advanced RAG techniques). ↩
W. Hinthorn, "trustcall: Reliable and efficient structured data extraction using JSON patch operations" (Aug 2023). GitHub README and demo for the TrustCall library. Introduces the JSON patch approach to structured outputs. ↩
OpenAI, "Function Calling" announcement (Jun 2023). Describes how models can return structured data via function calls, which helps implement reliable tool use and output formatting. ↩
Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023). Fine-tuned a model to insert API calls into its text. ↩
Anonymous, "WorkTeam" (2025). Paper describing multiple agents collaborating to translate natural language into an executable workflow. ↩
Beurer-Kellner et al., "Design Patterns for Securing LLM Agents against Prompt Injections" (June 2025). Proposed principled patterns to mitigate prompt injection attacks. ↩
Anonymous, "EconEvals" (2025). A benchmark testing agent decision-making in scenarios with shifting variables. ↩
Anonymous, "ROLETHINK" (2025). A benchmark evaluating the separation of inner thought and outward persona in role-playing agents. ↩
Anonymous, "SAFEARENA" (2025). A benchmark testing autonomous web agents on tasks with potential for harm. ↩
AryaXAI, "Risks of Current Agents" (2025). Paper breaking down risks of agents across memory, tools, and environments. ↩
LangChain Blog – "SCIPE: Systematic Chain Improvement and Problem Evaluation" (Nov 2024). Discusses the SCIPE tool for identifying the weakest node in an LLM chain by analyzing intermediate failures. ↩
LangChain Case Study – "How Paradigm runs and monitors thousands of agents in parallel" (Sep 2024). Describes Paradigm's use of LangChain and LangSmith for an AI spreadsheet product, including parallel agent orchestration and usage-based pricing. ↩