Autonomous Agents & Multi-Agent Collaboration

Published: April 16, 2024

Building autonomous agents with LLMs at their core is a promising concept. Projects like AutoGPT and GPT-Engineer highlight the potential of LLMs as powerful general problem solvers. LLMs are also being used for various multi-agent collaboration scenarios:

  1. Behavior Simulation: Using generative agents in a sandbox to mimic human behavior or simulate user behaviors in recommendation systems.
  2. Data Construction: Collecting and evaluating multi-party conversations or generating detailed instructions for complex tasks using role-playing agents.
  3. Performance Improvement: Enhancing performance through role adoption, improving factual correctness and reasoning with multi-agent debates, addressing thought degeneration in self-reflection, and improving negotiation strategies in role-playing games.

Researchers have found that systems of multiple agents, each with unique attributes and roles, can handle complex tasks more effectively, create more realistic simulations, and even align social behaviors in LLMs. Some work involves designing interactive environments where these agents can interact to achieve goals, like creating believable social interactions or improving negotiation outcomes.

This research area is expanding the capabilities of LLMs beyond single-agent tasks to collaborative multi-agent systems, offering innovative ways to tackle complex problems that are otherwise difficult for individual agents or traditional computational methods.

Key Components of Autonomous Agents

In an LLM-powered autonomous agent system, the LLM functions as the agent’s brain, complemented by several key components:


Planning

  • Subgoal and Decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks. Chain of Thought (CoT) (Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks.
  • Reflection and Refinement: The agent performs self-criticism and self-reflection over past actions, learns from mistakes, and refines its approach for future steps, thereby improving the quality of final results. ReAct (Yao et al. 2023) integrates reasoning and acting within the LLM by extending the action space to combine task-specific discrete actions with the language space. The former lets the LLM interact with the environment (e.g. use a Wikipedia search API), while the latter prompts the LLM to generate reasoning traces in natural language.
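
To make the ReAct idea concrete, here is a minimal sketch of a reason-and-act loop. It is not the paper's implementation: `scripted_llm`, `search`, and the `Action: name[arg]` text format are assumptions made for illustration, and the "LLM" is a scripted stand-in so the example runs offline.

```python
import re
from typing import Callable, Dict, Optional

def scripted_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call: replies based on the trace so far."""
    if "Observation:" not in prompt:
        return "Thought: I should look this up.\nAction: search[capital of France]"
    return "Thought: The observation answers the question.\nAction: finish[Paris]"

def search(query: str) -> str:
    """Toy tool standing in for something like a Wikipedia search API."""
    facts = {"capital of France": "Paris is the capital of France."}
    return facts.get(query, "No result.")

# Assumed action syntax: "Action: tool_name[argument]".
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*)\]")

def react(question: str,
          llm: Callable[[str], str],
          tools: Dict[str, Callable[[str], str]],
          max_steps: int = 5) -> Optional[str]:
    """Alternate between reasoning (free text) and acting (tool calls)."""
    trace = f"Question: {question}"
    for _ in range(max_steps):
        reply = llm(trace)            # reasoning trace plus a proposed action
        trace += "\n" + reply
        match = ACTION_RE.search(reply)
        if match is None:
            break                     # the model produced no parseable action
        name, arg = match.group(1), match.group(2)
        if name == "finish":
            return arg                # the model declared a final answer
        tool = tools.get(name)
        observation = tool(arg) if tool else f"Unknown tool: {name}"
        trace += f"\nObservation: {observation}"  # feed the result back in
    return None
```

Calling `react("What is the capital of France?", scripted_llm, {"search": search})` walks through one search step and one finish step. With a real model, the growing `trace` is what gives the LLM both its own earlier reasoning and the tool observations.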


Memory

  • Short-term Memory: Uses in-context learning to process information.
  • Long-term Memory: Provides the agent with the capability to retain and recall extensive information over extended periods, often by leveraging an external vector store and fast retrieval.
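
The long-term memory idea can be sketched as a store that embeds each item and recalls the closest matches to a query. This is only an illustration under loud assumptions: `embed` is a toy bag-of-words counter rather than a learned embedding model, the linear scan stands in for a real vector index, and `LongTermMemory` is a hypothetical class name.

```python
import math
from collections import Counter
from typing import List, Tuple

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class LongTermMemory:
    """Hypothetical external store: append-only, searched by similarity."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def recall(self, query: str, k: int = 1) -> List[str]:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

In this picture, short-term memory is simply whatever text is placed in the prompt, and `recall` decides which stored items are worth spending that limited context on.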

Tool Use

  • Learning to Call External APIs: The agent learns to call external APIs for additional information, including current data, code execution capabilities, and access to proprietary information sources.
  • MRKL (Modular Reasoning, Knowledge and Language): A neuro-symbolic architecture for autonomous agents, proposed to contain a collection of “expert” modules, with the general-purpose LLM working as a router that sends each inquiry to the most suitable expert module. These modules can be neural (e.g., deep learning models) or symbolic (e.g., math calculator, currency converter, weather API).
  • TALM (Tool Augmented Language Models) and Toolformer: Fine-tune an LM to learn to use external tool APIs. The dataset is expanded based on whether a newly added API call annotation can improve the quality of model outputs.
  • ChatGPT Plugins and OpenAI API Function Calling: Examples of LLMs augmented with tool use capability working in practice. The collection of tool APIs can be provided by other developers (as in Plugins) or self-defined (as in function calls).
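
The routing pattern behind these approaches can be sketched as follows. This is a generic illustration, not the MRKL system or the OpenAI function-calling API: the JSON call format, `fake_router_llm`, `add`, `get_weather`, and `dispatch` are all hypothetical, and the router is a keyword rule standing in for an LLM.

```python
import json
import re
from typing import Any, Callable, Dict

def add(a: float, b: float) -> float:
    """Symbolic expert: exact arithmetic."""
    return a + b

def get_weather(city: str) -> str:
    """Stand-in for a weather API; returns a canned string."""
    return f"Sunny in {city} (canned response)"

TOOLS: Dict[str, Callable[..., Any]] = {"add": add, "get_weather": get_weather}

def fake_router_llm(query: str) -> str:
    """Hypothetical router: emits a JSON tool call, as an LLM might be prompted to."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", query)]
    if len(numbers) >= 2:
        return json.dumps({"name": "add", "arguments": {"a": numbers[0], "b": numbers[1]}})
    city = query.split()[-1].strip("?")
    return json.dumps({"name": "get_weather", "arguments": {"city": city}})

def dispatch(query: str,
             router: Callable[[str], str] = fake_router_llm,
             tools: Dict[str, Callable[..., Any]] = TOOLS) -> Any:
    """Ask the router which tool to use, then run that tool with its arguments."""
    call = json.loads(router(query))
    tool = tools.get(call["name"])
    if tool is None:
        raise KeyError(f"router selected unknown tool: {call['name']}")
    return tool(**call["arguments"])
```

The design point is that the model only chooses a tool name and arguments. The tool itself runs as ordinary code, so the arithmetic or lookup does not depend on the model's own recall.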

Figure: Overview of an LLM-powered autonomous agent system.

Why Use Multi-Agent Systems?

Many studies, including those from MIT and Google Brain, demonstrate that LLMs produce better results when multiple instances with different roles propose and debate their individual responses and reasoning processes over several rounds to arrive at a consensus. This method, referred to as a "multi-agent society," significantly advances LLM capabilities and paves the way for breakthroughs in language generation and understanding. It comes with a few benefits:

  • Black-Box Access: This approach requires only black-box access to language model generations, eliminating the need for internal model information such as likelihoods or gradients.
  • Versatility: It can be used with common public model-serving interfaces.
  • Complementary Methods: The method is also orthogonal to other model generation improvements such as retrieval or prompt engineering.
  • Cost Efficiency: While the debate process is more costly, involving multiple model instances and rounds, it produces significantly improved answers and can generate additional training data, creating a model self-improvement loop.
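
A rough sketch of the debate loop follows. It only needs each agent's text output, which is the black-box property noted above. The agents here are deterministic stubs rather than LLM instances, and `make_agent`, `debate`, and the majority-vote rule are assumptions chosen for illustration, not the procedure of any particular paper.

```python
from collections import Counter
from typing import Callable, List, Tuple

Agent = Callable[[str, List[str]], str]

def make_agent(initial_answer: str) -> Agent:
    """Stub agent: keeps its own answer unless a strict majority of peers disagree."""
    def agent(question: str, peer_answers: List[str]) -> str:
        if not peer_answers:
            return initial_answer                 # round 0: independent proposal
        top, count = Counter(peer_answers).most_common(1)[0]
        return top if count > len(peer_answers) / 2 else initial_answer
    return agent

def debate(question: str, agents: List[Agent], rounds: int = 2) -> Tuple[str, List[str]]:
    """Each round, every agent sees the other agents' latest answers and may revise."""
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    consensus = Counter(answers).most_common(1)[0][0]
    return consensus, answers
```

With real models, each `agent` call would be a prompt containing the question plus the peers' answers and reasoning. The cost grows with both the number of agents and the number of rounds, which is the trade-off the last bullet describes.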

ChatDev: A Case Study

ChatDev is a framework that demonstrates how to handle software development complexity using LLMs. It organizes agents into teams similar to those in a real company, such as design, coding, testing, and documentation teams. These agents, assuming roles like CEO, CTO, professional programmers, and test engineers, collaborate to simulate the entire software development process. It can produce some pretty impressive software from a single prompt.

Figure: Software contributed by the ChatDev community.

ChatDev's Process

ChatDev follows a structured process that mirrors the waterfall model, dividing the development into four stages: designing, coding, testing, and documenting. This method helps prevent common issues like code hallucinations.

  1. Designing: Innovative ideas are generated through collaborative brainstorming, and technical requirements are defined.
  2. Coding: Source code is developed and reviewed.
  3. Testing: Components are integrated, and feedback from interpreters is utilized for debugging.
  4. Documenting: Environment specifications and user manuals are generated.

Key Mechanisms in ChatDev

  1. Role Specialization: Ensures each agent fulfills its designated function.
  2. Memory Stream: Maintains a comprehensive record of previous dialogues for informed decision-making.
  3. Self-Reflection: Prompts agents to reflect on proposed decisions to streamline processes and prevent irrelevant discussions.

Coding and Testing

ChatDev employs "thought instruction" to clarify and specify coding instructions, reducing confusion and ensuring accurate final code. Across coding and testing, the coder writes the code, the reviewer checks it for issues without running it (static debugging), and the tester runs the code to verify its functionality (dynamic debugging).
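
The write, review, and run cycle can be sketched as a feedback loop. Everything here is a simplified assumption rather than ChatDev's implementation: `stub_coder` is a scripted stand-in for the programmer agent, the reviewer only checks that the code parses, and the tester runs a single hard-coded check. Real generated code should be executed in a sandbox, not with a bare `exec`.

```python
import ast
from typing import Callable, Optional, Tuple

def stub_coder(task: str, feedback: Optional[str]) -> str:
    """Hypothetical programmer agent: the first draft has a bug, the revision fixes it."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"

def review(source: str) -> Optional[str]:
    """Static debugging: inspect the code without running it (here, a parse check)."""
    try:
        ast.parse(source)
        return None
    except SyntaxError as exc:
        return f"SyntaxError: {exc}"

def run_tests(source: str) -> Optional[str]:
    """Dynamic debugging: execute the code and report interpreter feedback."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # illustration only; sandbox this in practice
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
        return None
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"

def develop(task: str,
            coder: Callable[[str, Optional[str]], str],
            max_rounds: int = 3) -> Tuple[str, int]:
    """Loop until both the reviewer and the tester report no issues."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        source = coder(task, feedback)
        feedback = review(source) or run_tests(source)
        if feedback is None:
            return source, round_no
    raise RuntimeError(f"no passing code after {max_rounds} rounds: {feedback}")
```

The feedback string is the important part: it is what gets handed back to the coder, so the next draft is conditioned on a concrete error rather than a vague request to try again.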


Documenting

After the design, coding, and testing phases, ChatDev utilizes agents to generate thorough project documentation, including user manuals and environment specifications.

Generative Agents Simulation: A Case Study

Generative Agents (Park et al. 2023) is a super fun experiment in which 25 virtual characters, each controlled by an LLM-powered agent, live and interact in a sandbox environment inspired by The Sims. Generative agents create believable simulacra of human behavior for interactive applications.

The design of generative agents combines an LLM with memory, planning and reflection mechanisms to enable agents to behave conditioned on past experience, as well as to interact with other agents.

  • Memory stream: a long-term memory module (external database) that records a comprehensive list of agents’ experiences in natural language.
    • Each element is an observation, an event directly provided by the agent.
    • Inter-agent communication can trigger new natural language statements.
  • Retrieval model: surfaces the context to inform the agent’s behavior, according to relevance, recency and importance.
    • Recency: recent events have higher scores
    • Importance: distinguish mundane from core memories. Ask LM directly.
    • Relevance: based on how related it is to the current situation / query.
  • Reflection mechanism: synthesizes memories into higher-level inferences over time and guides the agent’s future behavior. Reflections are higher-level summaries of past events (note that this is a bit different from self-reflection above).
    • Prompt the LM with the 100 most recent observations and ask it to generate the 3 most salient high-level questions given that set of observations/statements. Then ask the LM to answer those questions.
  • Planning & Reacting: translate the reflections and the environment information into actions
    • Planning is essentially about optimizing believability in the moment vs. over time.
    • Prompt template: {Intro of an agent X}. Here is X's plan today in broad strokes: 1)
    • Relationships between agents and observations of one agent by another are all taken into consideration for planning and reacting.
    • Environment information is represented in a tree structure.
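
The retrieval step above can be sketched as a score that adds up recency, importance, and relevance. The specifics below are assumptions for illustration and are not confirmed by the text above: the exponential decay rate, the equal weighting of the three terms, the hard-coded importance values (the list above says to ask the LM for these), and the word-overlap relevance measure standing in for embedding similarity.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Memory:
    text: str
    created_hour: float   # when the observation was recorded (simulation hours)
    importance: float     # 0..1; assumed to come from asking the LM, hard-coded here

def recency(memory: Memory, now_hour: float, decay: float = 0.99) -> float:
    """Recent events score higher; the decay rate is an assumed value."""
    return decay ** (now_hour - memory.created_hour)

def relevance(memory: Memory, query: str) -> float:
    """Toy relevance: word overlap (Jaccard) between memory and query."""
    a, b = set(memory.text.lower().split()), set(query.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(memories: List[Memory], query: str, now_hour: float, k: int = 2) -> List[Memory]:
    """Return the k memories with the highest combined score."""
    def score(m: Memory) -> float:
        return recency(m, now_hour) + m.importance + relevance(m, query)
    return sorted(memories, key=score, reverse=True)[:k]

stream = [
    Memory("ate breakfast at home", created_hour=9, importance=0.1),
    Memory("Isabella is planning a party at the cafe", created_hour=2, importance=0.8),
    Memory("watered the plants", created_hour=1, importance=0.1),
]
top = retrieve(stream, "who is planning the party", now_hour=10, k=1)
```

In this example the party memory is older than the breakfast one but still ranks first, because its importance and relevance outweigh the small recency penalty. That is the behavior the three-factor score is meant to produce.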

Figure: The generative agent architecture (Image source: Park et al. 2023).

This fun simulation results in emergent social behavior, such as information diffusion, relationship memory (e.g. two agents continuing the conversation topic) and coordination of social events (e.g. host a party and invite many others).

Interface Considerations

When considering the user interface (UI) for agent-based systems, drawing inspiration from intuitive interfaces like OpenAI's chat interface for GPTs is beneficial. A video game-like interface is often proposed for understanding AI employees, data, task flows, and complexities.

Key Projects and Architectures

Some noteworthy projects and agentic architectures include:

Challenges and Future Directions

Despite the potential, several challenges remain:

  • Finite Context Length: Limited context capacity restricts the inclusion of detailed instructions and historical information.
  • Long-Term Planning: Effective exploration of solution spaces and adjusting plans in real-time remain challenging.
  • Reliability of Natural Language Interface: The natural language interface can sometimes produce unreliable outputs, necessitating robust parsing mechanisms.
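
As one example of a robust parsing mechanism, an agent can search the model's reply for the first valid JSON object with the expected fields instead of assuming the whole reply is clean JSON. The function name `parse_action` and the `tool`/`input` field names are hypothetical, chosen only for this sketch.

```python
import json
import re
from typing import Optional, Sequence

def parse_action(raw: str, required: Sequence[str] = ("tool", "input")) -> Optional[dict]:
    """Return the first JSON object in `raw` that has all required keys, else None."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", raw):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores any trailing text after it.
            obj, _end = decoder.raw_decode(raw, match.start())
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; try the next one
        if isinstance(obj, dict) and all(key in obj for key in required):
            return obj
    return None
```

Returning `None` rather than raising lets the caller decide what to do with a malformed reply, for example re-prompting the model with a reminder of the expected format.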


Multi-agent collaboration with LLMs is a rapidly evolving field, with innovative solutions to complex problems. By using the collective capabilities of multiple agents, we can achieve more accurate, efficient, and creative outcomes. Whether in software development, scientific discovery, or behavior simulation, the possibilities are vast and exciting.