🤯 AI Designs Winning Strategies 🏆🔥

Summary

Google DeepMind researchers have explored a new approach to algorithm design using an LLM-powered system called AlphaEvolve. The system utilizes evolutionary coding, iteratively modifying source code – specifically, implementations of Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO) – to improve performance in Multi-Agent Reinforcement Learning games like Kuhn Poker and Leduc Poker. AlphaEvolve operates through a distributed process, with an LLM, Gemini 2.5 Pro, generating code variations and evaluating them against proxy games. Experiments yielded a VAD-CFR variant that outperformed standard CFR across 10 of 11 games, and a SHOR-PSRO variant achieving top performance in 8 of 11 games, demonstrating the potential of LLMs to automate and accelerate algorithmic discovery.

INSIGHTS


[DISCOVERING ALGORITHMS THROUGH AUTOMATED SEARCH]
Researchers are tackling the complex problem of designing algorithms for Multi-Agent Reinforcement Learning (MARL) in imperfect-information games, such as poker, traditionally through manual iteration and expert intuition. This manual process involves iterating over weighting schemes, discounting rules, and equilibrium solvers, a labor-intensive approach that relies heavily on trial and error.

[INTRODUCING ALPHAEVOLVE: AN LLM-POWERED APPROACH]
Google DeepMind’s research team has developed a novel framework called AlphaEvolve, utilizing a Large Language Model (LLM) – Gemini 2.5 Pro – to automate this algorithm design process. Instead of manually tweaking parameters, AlphaEvolve employs an evolutionary coding agent to search for optimal algorithm variants, offering a significant shift from traditional methods.

[CFR: AN ITERATIVE APPROACH TO REGRET MINIMIZATION]
The core of AlphaEvolve’s approach centers around Counterfactual Regret Minimization (CFR), an iterative algorithm that decomposes regret minimization across information sets. At each iteration, CFR accumulates ‘counterfactual regret’ – quantifying the potential gain from playing differently – and derives a new policy proportional to this positive regret. Over numerous iterations, the strategy converges towards a Nash Equilibrium (NE), a stable state where no player can improve their outcome by unilaterally changing their strategy. Variants like Discounted CFR (DCFR) and Predictive CFR+ (PCFR+) enhance convergence by incorporating discounting or predictive update rules, all initially designed through manual refinement.
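The regret-matching step described above can be sketched in a few lines. This is a minimal illustration of deriving a policy proportional to positive cumulative regret at a single information set; the function name and array shapes are illustrative, not taken from the paper's code.

```python
import numpy as np

def regret_matching(cumulative_regret: np.ndarray) -> np.ndarray:
    """Derive the current policy proportional to positive cumulative regret.

    Falls back to a uniform policy when no action has positive regret.
    """
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))

# Example: regrets favour action 0 and, to a lesser degree, action 2;
# action 1 has negative regret and receives zero probability.
policy = regret_matching(np.array([3.0, -1.0, 1.0]))
# positive regrets [3, 0, 1] normalise to [0.75, 0.0, 0.25]
```

Variants like DCFR or PCFR+ change how the cumulative regret is accumulated (discounting, prediction) before this normalisation step, which is exactly the surface AlphaEvolve mutates.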

[PSRO: A HIGH-LEVEL ABSTRACTION FOR POLICY OPTIMIZATION]
In contrast to CFR, Policy Space Response Oracles (PSRO) operates at a higher level of abstraction. It maintains a population of policies, builds a payoff tensor (the meta-game) by calculating expected utilities for all combinations of policies, and then utilizes a meta-strategy solver to determine a probability distribution over this population. Best responses are trained against this distribution and iteratively added to the population; the meta-strategy solver is the critical design choice targeted for automated discovery. The exact best response oracle is computed via value iteration, and exact payoff values are used, eliminating Monte Carlo sampling noise.
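A toy sketch of the meta-game construction and the simplest possible meta-strategy solver (uniform over the population) may make the abstraction concrete. All function names are illustrative, and the utility function here is a stand-in for exact payoff computation:

```python
import numpy as np

def build_meta_game(population_a, population_b, expected_utility):
    """Payoff matrix M[i, j]: expected utility of policy i vs policy j."""
    M = np.zeros((len(population_a), len(population_b)))
    for i, pi in enumerate(population_a):
        for j, pj in enumerate(population_b):
            M[i, j] = expected_utility(pi, pj)
    return M

def uniform_meta_solver(meta_game):
    """The simplest meta-strategy solver: uniform over the population.

    This is the seed solver for the PSRO experiments; evolved solvers
    replace this function with far richer logic.
    """
    n = meta_game.shape[0]
    return np.full(n, 1.0 / n)

# Toy example: two "policies" identified by a single scalar, with a
# made-up zero-sum utility standing in for exact game evaluation.
pop = [0.2, 0.8]
M = build_meta_game(pop, pop, lambda a, b: a - b)
sigma = uniform_meta_solver(M)  # distribution the next best response targets
```

Each PSRO iteration would then train a best response against `sigma` and append it to the population before rebuilding the meta-game.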

[ALPHAEVOLVE’S IMPLEMENTATION: A DISTRIBUTED EVOLUTIONARY SYSTEM]
AlphaEvolve is a distributed evolutionary system that leverages LLMs to mutate source code rather than numeric parameters. The process begins with a population initialized with a standard implementation of CFR+ (as the seed for CFR experiments) or Uniform (for both PSRO solver classes). At each generation, the parent algorithm with the highest fitness is selected, its source code is passed to the LLM with a prompt to modify it, and the resulting candidate is evaluated on proxy games. Valid candidates are added to the population, and multi-objective optimization is supported, randomly selecting a fitness metric per generation to guide parent sampling. The fitness signal is based on exploitability after K iterations, evaluated on a fixed set of training games: 3-player Kuhn Poker, 2-player Leduc Poker, 4-card Goofspiel, and 5-sided Liars Dice. Final evaluation occurs on a separate test set of larger, unseen games.
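The generation loop described above can be summarised in pseudocode-like Python. This is a deliberately simplified single-process sketch (the real system is distributed), with `llm_mutate` standing in for the Gemini 2.5 Pro call and `evaluate` standing in for the proxy-game exploitability evaluation; all names are illustrative.

```python
import random

def evolve(seed_code, llm_mutate, evaluate, metrics, generations=10):
    """Minimal sketch of AlphaEvolve's outer loop (names illustrative).

    llm_mutate(code) -> mutated source string (stands in for the LLM)
    evaluate(code)   -> dict of fitness metrics on the proxy games,
                        or None if the candidate is invalid
    """
    population = [(seed_code, evaluate(seed_code))]
    for _ in range(generations):
        # Multi-objective support: a fitness metric is sampled at random
        # each generation to guide parent selection.
        metric = random.choice(metrics)
        parent, _ = max(population, key=lambda p: p[1][metric])
        candidate = llm_mutate(parent)
        fitness = evaluate(candidate)
        if fitness is not None:  # only valid candidates join the population
            population.append((candidate, fitness))
    return max(population, key=lambda p: p[1][metrics[0]])[0]
```

In the paper's setting the seed is CFR+ (for CFR experiments) or Uniform (for PSRO), and fitness is derived from exploitability after K iterations on the four training games.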

[THE CFR SEARCH SPACE: PYTHON CLASSES FOR REGRET ACCUMULATION]
For the CFR experiments, the evolvable search space consists of three Python classes: RegretAccumulator, PolicyFromRegretAccumulator, and PolicyAccumulator. These classes govern regret accumulation, current policy derivation, and average policy accumulation, respectively. The interface is designed to represent all known CFR variants as special cases, allowing for flexible algorithm exploration.
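A sketch of how these three classes might fit together, instantiated with vanilla-CFR behavior as the special case, is shown below. The class names come from the paper; the method names and signatures are assumptions made for illustration.

```python
import numpy as np

class RegretAccumulator:
    """Accumulates counterfactual regret per information set.

    Vanilla CFR sums regrets directly; variants like DCFR would apply
    iteration-dependent discounting inside update().
    """
    def __init__(self, num_actions):
        self.regret = np.zeros(num_actions)

    def update(self, instantaneous_regret, iteration):
        self.regret += instantaneous_regret

class PolicyFromRegretAccumulator:
    """Derives the current policy from accumulated regret (regret matching)."""
    def policy(self, acc):
        positive = np.maximum(acc.regret, 0.0)
        total = positive.sum()
        if total > 0:
            return positive / total
        return np.full(len(acc.regret), 1.0 / len(acc.regret))

class PolicyAccumulator:
    """Accumulates the average policy, which converges toward equilibrium.

    Vanilla CFR weights all iterations equally; LCFR would weight the
    contribution in update() by the iteration number.
    """
    def __init__(self, num_actions):
        self.total = np.zeros(num_actions)

    def update(self, policy, iteration):
        self.total += policy

    def average(self):
        return self.total / self.total.sum()
```

Because every known CFR variant is expressible by changing these three bodies, the LLM can mutate them freely while the surrounding game-tree traversal stays fixed.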

[THE PSRO SEARCH SPACE: EVOLVED SOLVERS FOR META-STRATEGY COMPUTATION]
The evolvable components for PSRO are TrainMetaStrategySolver and EvalMetaStrategySolver – the meta-strategy solvers used during oracle training and during exploitability evaluation, respectively.

[VAD-CFR: A VOLATILITY-ADAPTIVE DISCOUNTED VARIANT]
The evolved CFR variant is Volatility-Adaptive Discounted CFR (VAD-CFR). Rather than the linear averaging and static discounting found in the standard CFR family, VAD-CFR incorporates three distinct adaptive mechanisms. It is benchmarked against standard CFR, CFR+, Linear CFR (LCFR), DCFR, PCFR+, DPCFR+, and HS-PCFR+(30) across 1000 iterations, with exploitability computed exactly.

[SHOR-PSRO: A HYBRID META-SOLVER WITH ANNEALING SCHEDULES]
The evolved PSRO variant is Smoothed Hybrid Optimistic Regret PSRO (SHOR-PSRO). This hybrid meta-solver constructs a meta-strategy by linearly blending two components at each internal solver iteration. The training solver uses a dynamic annealing schedule over the outer PSRO iterations, shifting from greedy exploitation to equilibrium finding: the diversity bonus decays from 0.05 to 0.001, and the softmax temperature drops from 0.5 to 0.01. The number of internal solver iterations also scales with population size, and the training solver returns the time-averaged strategy across internal iterations for stability. The evaluation-time solver uses fixed parameters: λ = 0.01, diversity bonus = 0.0, temperature = 0.001. It runs more internal iterations (base 8000, scaling with population size) and returns the last-iterate strategy rather than the average, for a reactive, low-noise exploitability estimate. This training/evaluation asymmetry was a product of the search, not a human design choice.
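The annealing schedules are the easiest part of SHOR-PSRO to make concrete. The sketch below shows only the parameter scheduling: the endpoints (0.05 → 0.001, 0.5 → 0.01, λ = 0.01, base 8000) come from the source, but the linear schedule shape, the base training iteration count, and the population-scaling rule are assumptions for illustration.

```python
def anneal(start, end, t, t_max):
    """Linear schedule over the outer PSRO iterations.

    Schedule shape is assumed; the source gives endpoints only.
    """
    frac = 0.0 if t_max <= 1 else min(t / (t_max - 1), 1.0)
    return start + frac * (end - start)

def train_solver_params(t, t_max, population_size, base_iters=1000):
    """Training-time parameters shift from greedy exploitation toward
    equilibrium finding as the outer iteration t advances."""
    return {
        "diversity_bonus": anneal(0.05, 0.001, t, t_max),
        "temperature": anneal(0.5, 0.01, t, t_max),
        # Scaling rule is an assumption; the paper says only that
        # internal iterations grow with population size.
        "inner_iterations": base_iters + population_size,
        "returns": "time-average",
    }

# Evaluation-time solver: fixed, low-noise, last-iterate.
EVAL_SOLVER_PARAMS = {
    "lam": 0.01,
    "diversity_bonus": 0.0,
    "temperature": 0.001,
    "base_inner_iterations": 8000,  # also scales with population size
    "returns": "last-iterate",
}
```

The contrast between the two dictionaries captures the training/evaluation asymmetry the search discovered: a smoothed, averaged solver for stable training targets, and a sharp, last-iterate solver for exploitability measurement.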

[EVALUATION AND GENERALIZATION: A TEST-TRAIN PROTOCOL]
The evaluation protocol separates training and test games to assess generalization. The training set for both CFR and PSRO experiments consists of 3-player Kuhn Poker, 2-player Leduc Poker, 4-card Goofspiel, and 5-sided Liars Dice. The test set used in the main body of the paper consists of 4-player Kuhn Poker, 3-player Leduc Poker, 5-card Goofspiel, and 6-sided Liars Dice – larger and more complex variants not seen during evolution. A full sweep across 11 games is included in the appendix. Algorithms are fixed after training-phase discovery before test evaluation begins.
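The protocol can be stated compactly as configuration plus a freeze-then-score step. The game lists below are taken directly from the source; the function and the exploitability callback are illustrative placeholders.

```python
# Training games (seen during evolution) vs. test games (held out).
TRAIN_GAMES = ["3-player Kuhn Poker", "2-player Leduc Poker",
               "4-card Goofspiel", "5-sided Liar's Dice"]
TEST_GAMES = ["4-player Kuhn Poker", "3-player Leduc Poker",
              "5-card Goofspiel", "6-sided Liar's Dice"]

def evaluate_generalization(frozen_algorithm, exploitability):
    """Score a discovered algorithm, fixed after the training phase,
    on the larger games never seen during evolution."""
    return {game: exploitability(frozen_algorithm, game)
            for game in TEST_GAMES}
```

Freezing the algorithm before touching the test set is what lets the reported VAD-CFR and SHOR-PSRO results be read as generalization rather than overfitting to the training games.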

This article is AI-synthesized from public sources and may not reflect original reporting.