OpenAI’s Prover-Verifier Game Algorithm Aims to Explain AI Decisions

OpenAI has unveiled a new algorithm designed to shed light on the often opaque decision-making processes of large language models (LLMs) such as GPT-4. This development is particularly relevant for fields such as healthcare, law, and the military, where understanding AI decisions carries substantial weight.

Understanding the Prover-Verifier Setup

The algorithm is built on the “Prover-Verifier Game,” a conceptual framework first devised at the University of Toronto and the Vector Institute for Artificial Intelligence in 2021. The framework involves two AI agents: a more capable “prover” and a simpler “verifier.” The prover’s objective is to persuade the verifier that an answer is valid, whether or not it actually is, while the verifier’s job is to determine whether the provided answer holds up.

Iterative Refinement and Model Training

Researchers at OpenAI had two fine-tuned versions of GPT-4 play this game. The models worked through elementary math problems over several rounds, with the prover alternating between honest and deceptive behavior. The verifier, unaware of the prover’s intent, assessed the responses based on its training data. The findings are presented in a new paper from OpenAI.

After each cycle of the game, both models were retrained on the data gathered in prior rounds. This repetition helped the verifier become better at spotting deceitful answers and improved the prover’s ability to explain its reasoning in a way humans could understand. Human reviewers also contributed by rating the clarity of the prover’s explanations. (A toy sketch of this training loop appears at the end of this article.)

Finding a Balance Between Accuracy and Legibility

The researchers observed that models optimized only for correctness often produced solutions that were hard to follow: human assessors made nearly twice as many mistakes when evaluating highly optimized solutions as they did with less optimized ones. Training strong models to generate text that simpler models can easily verify, they found, also leaves humans better positioned to assess that text. The finished algorithm is designed to strike a balance between accuracy and human readability. OpenAI suggests that this approach could lead to AI systems that not only produce accurate results but also offer outputs that are easy to verify, improving trust and safety in practical applications.

Impact on Research and Community

OpenAI researcher Jan Hendrik Kirchner emphasized the value of sharing these insights with the broader community to address the issue of legibility. Co-author Yining Chen noted that the methodology holds promise for aligning future models that exceed human intelligence, ensuring their outputs remain verifiable and reliable.

The research paper, titled “Prover-Verifier Games Improve Legibility of LLM Outputs,” is available on OpenAI’s website. Yining Chen and Jan Hendrik Kirchner led the study, with contributions from Angela Baek, Yuri Burda, Thomas Degry, Harri Edwards, Elie Georges, Cary Hudson, Jan Leike, Nat McAleese, Wes McCabe, Lindsay McCallum, and Freddie Sulit.
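
For readers who want a concrete picture of the round structure described above, here is a minimal sketch in Python. It is an illustrative assumption, not OpenAI’s actual implementation: the names ToyModel, prover_answer, and verifier_accepts, and the toy scoring rule, are invented for this sketch. A “prover” alternates between helpful and sneaky answers to simple addition problems, a toy “verifier” gets stricter as it accumulates data, and both are “retrained” on each round’s transcripts.

```python
import random
from dataclasses import dataclass, field

@dataclass
class ToyModel:
    """Stand-in for a fine-tuned LLM; it just accumulates training examples."""
    name: str
    memory: list = field(default_factory=list)

    def retrain(self, examples):
        # Real training would update model weights; here we only collect data.
        self.memory.extend(examples)

def prover_answer(problem, role):
    # A helpful prover returns the correct sum; a sneaky prover perturbs it slightly.
    x, y = problem
    return x + y if role == "helpful" else x + y + random.choice([-1, 1])

def verifier_accepts(verifier, problem, answer):
    # Toy acceptance rule: the verifier's tolerance for error shrinks as it sees
    # more data from earlier rounds, mimicking a verifier that gets better at
    # catching sneaky answers over time.
    tolerance = max(0, 2 - len(verifier.memory) // 20)
    x, y = problem
    return abs(answer - (x + y)) <= tolerance

prover = ToyModel("prover")
verifier = ToyModel("verifier")
problems = [(random.randint(1, 9), random.randint(1, 9)) for _ in range(20)]

for round_idx in range(5):
    transcripts = []
    for problem in problems:
        role = random.choice(["helpful", "sneaky"])   # prover's hidden intent
        answer = prover_answer(problem, role)
        accepted = verifier_accepts(verifier, problem, answer)
        transcripts.append((problem, answer, role, accepted))
    # After each round, both models are retrained on the round's transcripts.
    prover.retrain(transcripts)
    verifier.retrain(transcripts)
    fooled = sum(1 for _, _, role, ok in transcripts if role == "sneaky" and ok)
    print(f"round {round_idx}: sneaky answers accepted = {fooled}")
```

In the actual study, both roles were played by fine-tuned versions of GPT-4 and the retraining step updates model weights rather than a list, but the round structure mirrors the one described above: play the game, collect the resulting transcripts, retrain both models, and repeat.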