🤯 FAPO: AI's Secret to Perfect Language 🚀

June 21, 2026 | AI


🧐 Quick Intel


  • FAPO, a Claude Code-driven system, addresses accuracy degradation at scale within LLM pipelines.
  • The system won 15 of 18 model-benchmark comparisons against GEPA, with a mean gain of +14.1pp across three task models: GPT-4.1-mini, GPT-5.4-mini, and Gemma 3-12B.
  • FAPO demonstrated a mean gain of +33.8pp when escalating to pipeline changes on HoVer and IFBench.
  • In Cisco's evaluation, FAPO beat GEPA in 9 of 12 trials using prompt optimization alone.
  • FAPO's closed loop cycles through six stages and sorts failures into four categories: retrieval, cascading, format, and reasoning.
  • The system supports three providers: OpenAI, Baseten, and SageMaker, managing prompts, datasets, chain definitions, and scorers within a tenant.
  • FAPO utilizes a dataset containing case_id, task_type, context, expected, and metadata for pipeline optimization.
📝 Summary


    Cisco's FAPO, a Claude Code-driven system, is a significant advance in optimizing Large Language Model pipelines. Operating in a closed loop, the system evaluates a dataset against an initial prompt, identifies failures, and proposes variants over repeated iterations orchestrated by Claude Code agents. Across eighteen model-benchmark comparisons against GEPA (six benchmarks, three models), FAPO won fifteen and achieved a mean gain of +14.1 percentage points. In twelve of those trials FAPO relied on prompt optimization alone and outperformed GEPA in nine. On the HoVer and IFBench benchmarks, where it escalated to pipeline changes, the mean gain reached +33.8 percentage points. The system's ability to pinpoint retrieval, cascading, format, and reasoning failures within multi-step pipelines is what makes its accuracy improvements targeted.

    💡 Insights


    CHAPTER 1: THE CHALLENGE OF LLM PROMPT RELIABILITY
    Small wording changes in a prompt can drastically alter the accuracy of a Large Language Model (LLM) application, sometimes swinging performance by as much as 20 percent. Traditional methods often fail to scale: a fix tuned on a few examples frequently breaks down on larger, more complex deployments. The core difficulty is diagnosing failures in multi-step pipelines, where a wrong final answer can originate at any one stage. Finding the root cause requires detailed, hands-on inspection of each stage's intermediate output.
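
The stage-by-stage inspection described above can be sketched in a few lines of Python. This is only an illustration of the idea, not FAPO's code: the `StageTrace` structure, the stage names, and the per-stage checks are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class StageTrace:
    """Intermediate output recorded for one pipeline stage (hypothetical structure)."""
    name: str
    output: str


def first_failing_stage(
    trace: List[StageTrace],
    checks: Dict[str, Callable[[str], bool]],
) -> Optional[str]:
    """Return the name of the earliest stage whose output fails its check, or None."""
    for stage in trace:
        check = checks.get(stage.name)
        if check is not None and not check(stage.output):
            return stage.name
    return None


# Toy trace: retrieval found the right document, but the summary came back empty,
# so the final answer is wrong for a reason that starts upstream of the last stage.
trace = [
    StageTrace("retrieve", "doc_17, doc_42"),
    StageTrace("summarize", ""),
    StageTrace("answer", "unknown"),
]
checks = {
    "retrieve": lambda out: "doc_42" in out,
    "summarize": lambda out: len(out) > 0,
    "answer": lambda out: out != "unknown",
}
print(first_failing_stage(trace, checks))  # summarize
```

Checking stages in order matters: the last stage is also wrong here, but only because an earlier stage handed it nothing to work with.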

    CHAPTER 2: INTRODUCING FAPO – Fully Automated Prompt Optimization
    To address this bottleneck, Cisco AI developed FAPO (Fully Automated Prompt Optimization), a system driven by Claude Code that automates the refinement of LLM pipelines. FAPO begins with a user-supplied dataset and an initial prompt, then iteratively evaluates, classifies failures, proposes variations, validates them, and repeats the cycle, all orchestrated by Claude Code agents. This closed-loop approach lets prompt optimization scale beyond what manual tuning can cover.
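
As a rough picture of that evaluate, classify, propose, validate cycle, here is a minimal Python sketch. It is not FAPO's implementation: the `optimize` function, its callback parameters, and exact-match scoring are assumptions for illustration, and the toy validates on the same cases it inspects, which a real system with separate data splits would not do.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, str]
Failure = Tuple[Case, str]    # (case, model output)
Labelled = Tuple[Case, str]   # (case, failure category)


def optimize(
    prompt: str,
    cases: List[Case],
    run: Callable[[str, Case], str],
    classify: Callable[[Case, str], str],
    propose: Callable[[str, List[Labelled]], str],
    target: float,
    max_iters: int = 10,
) -> Tuple[str, float]:
    """Hypothetical closed loop: keep a candidate prompt only if it scores higher."""

    def evaluate(p: str) -> Tuple[float, List[Failure]]:
        outputs = [run(p, c) for c in cases]
        failures = [(c, o) for c, o in zip(cases, outputs) if o != c["expected"]]
        return 1 - len(failures) / len(cases), failures

    best = prompt
    best_acc, failures = evaluate(best)
    for _ in range(max_iters):
        if best_acc >= target:
            break
        # Label each failure so the proposal step can target a category.
        labelled = [(c, classify(c, o)) for c, o in failures]
        candidate = propose(best, labelled)
        cand_acc, cand_failures = evaluate(candidate)
        if cand_acc > best_acc:  # validation step: reject non-improving variants
            best, best_acc, failures = candidate, cand_acc, cand_failures
    return best, best_acc


# Toy stand-ins for the model call, the classifier, and the proposal agent.
cases = [{"context": "a", "expected": "A"}, {"context": "b", "expected": "B"}]
run = lambda p, c: c["context"].upper() if "uppercase" in p else c["context"]
classify = lambda c, o: "format"
propose = lambda p, labelled: p + " Answer in uppercase."

best_prompt, best_acc = optimize("Echo the input.", cases, run, classify, propose, target=1.0)
print(best_prompt, best_acc)
```

In the real system each callback would be an agent or a model call; the sketch only shows how the stages feed one another and where the loop stops.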

    CHAPTER 3: FAPO'S CORE MECHANICS AND ARCHITECTURE
    FAPO operates within a "tenant" framework, in which each tenant is an isolated optimization project. A tenant directory contains the task's prompts, dataset, chain definition, scorer, and configuration. The core engine, named hephaestus, handles evaluation, chain execution, and scoring, and supports three providers: OpenAI, Baseten, and SageMaker. Test cases are processed by chains, which are LangGraph state graphs, and the initial prompt is scaffolded by Claude. The loop then cycles through six distinct stages until the target accuracy is reached.
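
To make the dataset part of a tenant more concrete, here is a small Python sketch of one record using the five fields listed in the Quick Intel section (case_id, task_type, context, expected, metadata). The field types, the JSON Lines storage format, and the `load_cases` helper are assumptions for illustration; the source does not describe the on-disk format.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Case:
    """One dataset record; field names come from the article, types are assumed."""
    case_id: str
    task_type: str
    context: str
    expected: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def load_cases(jsonl_text: str) -> List[Case]:
    """Parse one JSON object per non-empty line (JSON Lines is an assumed format)."""
    cases = []
    for line in jsonl_text.splitlines():
        if line.strip():
            cases.append(Case(**json.loads(line)))
    return cases


# Invented sample records, only to show the shape of the data.
sample = "\n".join([
    json.dumps({"case_id": "c1", "task_type": "qa", "context": "2 + 2 = ?",
                "expected": "4", "metadata": {"split": "train"}}),
    json.dumps({"case_id": "c2", "task_type": "qa", "context": "Capital of France?",
                "expected": "Paris"}),
])
cases = load_cases(sample)
print([c.case_id for c in cases])
```

Keeping `metadata` as a free-form dictionary leaves room for per-case details such as a split label without changing the record shape.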

    CHAPTER 4: FAPO VS. GEPA – A Comparative Analysis
    FAPO was tested against GEPA (Genetic-Pareto), a state-of-the-art prompt optimizer. GEPA uses evolutionary search to optimize prompts within multi-step pipelines. Across six benchmarks and three task models (GPT-4.1-mini, GPT-5.4-mini, and Gemma 3-12B), FAPO outperformed GEPA in 15 of 18 model-benchmark comparisons, with a mean gain of +14.1pp. On the HoVer and IFBench benchmarks, where FAPO escalated to pipeline changes, the mean gain reached +33.8pp.
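
Six benchmarks times three models gives the 18 comparisons, and "pp" means percentage points, an absolute difference between two accuracy scores. The short Python sketch below shows how a win count and a mean gain of this kind are typically computed. The `summarize` helper is hypothetical, the three score pairs are invented placeholders rather than numbers from the evaluation, and averaging over every comparison (losses included) is an assumption about how the reported mean was formed.

```python
from typing import List, Tuple


def summarize(pairs: List[Tuple[float, float]]) -> Tuple[int, int, float]:
    """Given (ours, baseline) accuracies in percent, return wins, total, mean gain in pp."""
    gains = [ours - baseline for ours, baseline in pairs]
    wins = sum(g > 0 for g in gains)
    return wins, len(pairs), sum(gains) / len(gains)


# Placeholder scores for illustration only; NOT results from the FAPO evaluation.
toy_pairs = [(80.0, 70.0), (65.0, 66.0), (90.0, 85.0)]
wins, total, mean_gain = summarize(toy_pairs)
print(f"{wins}/{total} wins, mean gain {mean_gain:+.1f}pp")  # 2/3 wins, mean gain +4.7pp
```

A gain from 70% to 80% is +10pp, which is not the same as a 10 percent relative improvement; mixing the two up is a common source of inflated claims.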

    CHAPTER 5: FAPO'S INNOVATIONS AND GUARDRAILS
    FAPO's architecture incorporates several key innovations. It targets multi-step LLM pipelines rather than individual prompts, and it takes the fastest path to a first optimization run by having Claude Code create the tenant files. Guardrails limit overfitting: case-level inspection is restricted to training-split cases, while validation and test sets are used only for aggregate scores. Variants are immutable, and an independent reviewer verifies each proposal before it is executed, which keeps every iteration auditable and controlled.
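
One way to picture immutable variants behind a review gate is the Python sketch below. It is an assumption-laden illustration, not FAPO's design: the `Variant` and `VariantLog` names, the fields, and the toy reviewer rule are all hypothetical.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Variant:
    """A prompt variant that cannot be modified after creation (hypothetical fields)."""
    variant_id: int
    parent_id: Optional[int]
    prompt: str
    rationale: str


class VariantLog:
    """Append-only history; a proposal is recorded only if the reviewer approves it."""

    def __init__(self, reviewer: Callable[[Variant], bool]) -> None:
        self._reviewer = reviewer
        self._variants: List[Variant] = []

    def propose(self, parent: Optional[Variant], prompt: str, rationale: str) -> Optional[Variant]:
        candidate = Variant(
            variant_id=len(self._variants),
            parent_id=parent.variant_id if parent else None,
            prompt=prompt,
            rationale=rationale,
        )
        if not self._reviewer(candidate):
            return None  # rejected proposals never enter the history
        self._variants.append(candidate)
        return candidate

    def history(self) -> Tuple[Variant, ...]:
        return tuple(self._variants)


# Toy reviewer: approve only proposals that explain themselves.
log = VariantLog(reviewer=lambda v: bool(v.rationale.strip()))
root = log.propose(None, "Answer the question.", "initial scaffold")
rejected = log.propose(root, "Answer the question!!!", "")
child = log.propose(root, "Answer the question in one word.", "fix format failures")

try:
    child.prompt = "edited in place"  # frozen dataclass: this raises
except FrozenInstanceError:
    print("variants are immutable")
print([(v.variant_id, v.parent_id) for v in log.history()])
```

Because every accepted variant keeps a pointer to its parent and cannot be edited afterwards, the history doubles as an audit trail of what was tried and why.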