🤯 AI Breakthrough: ReasoningBank – Smarter Learning! 🧠

April 23, 2026 | AI


🧠Quick Intel


  • ReasoningBank, developed by Google Cloud AI, the University of Illinois Urbana-Champaign, and Yale University, is a memory framework that distills reasoning from agent experiences.
  • With a single retrieved memory item (k = 1), the success rate is 49.7%; retrieving more items per task (k > 1) degrades performance, dropping it to 34.2%.
  • The Memory Extractor, powered by an LLM, analyzes trajectories to create structured memory items labeled as “Success” or “Failure.”
  • ReasoningBank, using parallel scaling (k = 5), outperforms all baselines on the WebArena, Mind2Web, and SWE-Bench-Verified datasets, achieving an 8.3-percentage-point improvement in overall success rate over the memory-free baseline (40.5% → 48.8%) with Gemini-2.5-Flash.
  • With Gemini-2.5-Pro, ReasoningBank achieves a 57.4% resolve rate versus 54.0% for the no-memory baseline, saving 1.3 steps per task.
  • Memory-aware test-time scaling (MaTTS) further improves performance, reaching 56.3% overall SR on WebArena with Gemini-2.5-Pro, reducing average steps from 8.8 to 7.1 per task.
  • ReasoningBank’s memory evolves, transitioning from simple checklists to complex, adaptive strategies, as demonstrated in a case study involving “User-Specific Information Navigation.”
  • Efficiency gains are most significant on successful trajectories, reducing task completion steps by 2.1 on the Shopping subset (26.9% relative reduction).
    📝Summary


    Researchers from Google Cloud AI, the University of Illinois Urbana-Champaign, and Yale University have developed ReasoningBank, a novel memory framework designed to improve agent performance. Unlike existing approaches like trajectory and workflow memory, ReasoningBank employs a closed-loop process with retrieval, extraction, and consolidation stages. An LLM analyzes trajectories to create structured memory items, distinguishing between successes and failures. Experiments across datasets like WebArena and Mind2Web demonstrated significant improvements, boosting overall success rates by up to 8.3 percentage points and reducing interaction steps. Notably, the framework’s memory evolves, transitioning from basic procedural checklists to sophisticated, adaptive strategies, highlighting a dynamic learning process.

    💡Insights



    THE CHALLENGE OF AMNESIA IN AI AGENTS
    AI agents frequently encounter new tasks without prior knowledge, approaching each one as if it’s their first. Despite repeated attempts at similar problems, they consistently replicate past mistakes, losing valuable learning opportunities. This inherent amnesia poses a significant hurdle in developing truly adaptable and efficient AI systems.

    REASONINGBANK: A NEW APPROACH TO AI MEMORY
    Researchers at Google Cloud AI, the University of Illinois Urbana-Champaign, and Yale University have introduced ReasoningBank, a novel memory framework designed to overcome the limitations of existing agent memory systems. Unlike traditional methods, ReasoningBank doesn’t simply record actions but distills lessons from successes and failures into reusable reasoning strategies. This framework centers around a closed-loop process encompassing memory retrieval, extraction, and consolidation.

    THE THREE STAGES OF REASONINGBANK
    ReasoningBank operates through three distinct stages to enhance an agent’s learning process. First, the agent queries the framework using embedding-based similarity search to retrieve relevant memory items, which are injected directly into the agent’s system prompt; the default setting retrieves a single memory item per task (k = 1), since retrieving more items (k > 1) has been shown to degrade performance. Next, a Memory Extractor, powered by the same LLM as the agent, analyzes the task trajectory to generate structured memory items, each consisting of a title, a description, and content summarizing reasoning steps or operational insights. The extractor differentiates between successful and failed trajectories, treating successes as validated strategies and failures as counterfactual pitfalls. Finally, the consolidation stage appends the newly extracted items to the memory store, closing the loop for future retrieval.
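    The retrieval stage can be sketched in Python. The memory-item fields (title, description, content, and a Success/Failure label) follow the paper's description; the function names, embedding shapes, and prompt format below are illustrative assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MemoryItem:
    """One distilled reasoning strategy, per the paper's description."""
    title: str        # short name of the strategy
    description: str  # one-line summary
    content: str      # reasoning steps or operational insights
    label: str        # "Success" or "Failure"


def retrieve_memories(query_emb: np.ndarray,
                      items: list[MemoryItem],
                      item_embs: np.ndarray,
                      k: int = 1) -> list[MemoryItem]:
    """Return the top-k memory items by cosine similarity to the query."""
    # Normalize so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    m = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    scores = m @ q
    top = np.argsort(scores)[::-1][:k]
    return [items[i] for i in top]


def build_prompt(task: str, memories: list[MemoryItem]) -> str:
    """Inject the retrieved memories directly into the system prompt."""
    lessons = "\n".join(f"- {m.title}: {m.content}" for m in memories)
    return f"Relevant past lessons:\n{lessons}\n\nTask: {task}"
```

    Note the default k=1 here mirrors the paper's finding that retrieving a single memory item works best.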

    MEMORY EXTRACTION AND VALIDATION
    The Memory Extractor employs an LLM-as-a-Judge to decide whether a trajectory succeeded, based on the user query, the trajectory, and the final page state. The judge need not be perfectly accurate: experiments show ReasoningBank remains robust even when judge reliability is reduced. New memory items are appended to the ReasoningBank store, maintained as JSON with pre-computed embeddings to enable efficient cosine-similarity search. This closed-loop system lets the agent continually refine its approach based on learned experience.
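    A minimal sketch of this consolidation step, assuming a JSON file as the store and a hypothetical `embed_fn` embedding model (the paper does not publish its code or the embedder's interface):

```python
import json

import numpy as np


def consolidate(store_path: str, new_items: list[dict], embed_fn) -> None:
    """Append newly extracted memory items to the JSON store,
    pre-computing their embeddings for later cosine-similarity search.

    `embed_fn` stands in for whatever embedding model is used and is
    assumed to return a NumPy vector for a piece of text.
    """
    try:
        with open(store_path) as f:
            store = json.load(f)
    except FileNotFoundError:
        store = []  # first run: start with an empty memory bank
    for item in new_items:
        record = dict(item)
        # Embed the title and description once, at write time.
        record["embedding"] = embed_fn(
            item["title"] + " " + item["description"]).tolist()
        store.append(record)
    with open(store_path, "w") as f:
        json.dump(store, f)
```

    Pre-computing embeddings at consolidation time keeps retrieval cheap: each query only needs one embedding call plus a dot product against the stored vectors.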

    MEMORY-AWARE TEST-TIME SCALING (MaTTS)
    To further enhance performance, ReasoningBank integrates with test-time compute scaling. MaTTS generates multiple trajectories for the same task and applies self-contrast, comparing what went right and wrong across all rollouts, to extract higher-quality, more reliable memory items. The system favors parallel scaling (55.1% SR) over sequential scaling (54.5% SR) because parallel rollouts continue to provide diverse experiences for the agent to contrast and learn from.
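    Assuming hypothetical `run_agent` and `contrast_fn` interfaces (the paper does not specify them), the parallel MaTTS loop reduces to a short sketch:

```python
def matts_parallel(task: str, run_agent, contrast_fn, k: int = 5):
    """Memory-aware test-time scaling, parallel variant (sketch).

    `run_agent` rolls out one trajectory for the task; `contrast_fn`
    stands in for the LLM self-contrast step that compares what went
    right and wrong across rollouts. Both are assumed interfaces.
    """
    # Parallel scaling: k independent rollouts of the same task.
    trajectories = [run_agent(task) for _ in range(k)]
    # Self-contrast over all k rollouts yields higher-quality,
    # more reliable memory items than any single trajectory alone.
    memory_items = contrast_fn(task, trajectories)
    return trajectories, memory_items
```

    In a sequential variant, each rollout would instead refine the previous one; the paper reports that parallel rollouts stay more diverse, which is why this variant scores slightly higher.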

    PERFORMANCE AND SCALABILITY ACROSS BENCHMARKS
    ReasoningBank consistently outperforms existing baselines across multiple benchmarks, including WebArena, Mind2Web, and SWE-Bench-Verified. On WebArena with Gemini-2.5-Flash, ReasoningBank improved overall success rate by +8.3 percentage points over the memory-free baseline, while reducing average interaction steps by up to 1.4 compared to other memory baselines. The efficiency gains are particularly pronounced on successful trajectories, reducing task completion steps by 26.9% on the Shopping subset. On Mind2Web, ReasoningBank delivers consistent gains across cross-task, cross-website, and cross-domain evaluation splits, with the most significant improvements observed in the cross-domain setting. On SWE-Bench-Verified, results vary by backbone model, achieving a 57.4% resolve rate with Gemini-2.5-Pro, saving 1.3 steps per task. Adding MaTTS (parallel scaling, k=5) further enhances results, reaching 56.3% overall SR on WebArena with Gemini-2.5-Pro, reducing average steps from 8.8 to 7.1 per task.

    Our editorial team uses AI tools to aggregate and synthesize global reporting. Data is cross-referenced with public records as of April 2026.