LLM performance in audit supervision: Retriever vs. Reasoner
A Banco de España Occasional Paper evaluates retrieval-augmented generation (RAG) pipelines for external audit supervision, comparing open-weight and proprietary LLMs across several retrieval strategies.
Llama 70B matches cloud models with strong retrieval
The paper compares lexical, semantic, hybrid, and oracle retrieval across open-weight models (Llama 3B, Mistral 7B, Llama 70B) and proprietary cloud models (Kimi, Claude Sonnet 4.6).
Operational correctness is scored against supervisor-provided ground truth, using 20 bank audit reports and a 30-question central bank template.
Semantic retrieval yields a sizeable and statistically robust accuracy uplift of 6.2–6.3 percentage points for Kimi and Llama 70B.
Under symmetric strong retrieval, Llama 70B becomes statistically indistinguishable from the best cloud benchmark.
Smaller on-premise models, however, remain constrained on higher-complexity judgement questions, indicating a capacity threshold for practical substitution in regulated settings.
Prompt shifts can materially affect evaluation outcomes, and the degree of sensitivity is strongly model-dependent.
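The claim that two systems are "statistically indistinguishable" can be illustrated with a paired comparison over the same scored items. The sketch below uses a paired bootstrap on per-item correctness flags; this is an assumed, commonly used procedure and not necessarily the test applied in the paper. The function name `paired_bootstrap_diff` and the sample flags are hypothetical and for illustration only.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Return (observed, lower, upper) for accuracy(A) - accuracy(B).

    correct_a and correct_b are equal-length lists of 0/1 flags, one per
    (report, question) item, scored against the same ground truth.
    lower and upper bound an approximate 95% percentile interval.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("both systems must be scored on the same non-empty item set")
    n = len(correct_a)
    rng = random.Random(seed)
    observed = (sum(correct_a) - sum(correct_b)) / n
    diffs = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each item's pair of flags together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo_idx = max((n_resamples * 25) // 1000, 0)
    hi_idx = max((n_resamples * 975) // 1000 - 1, 0)
    return observed, diffs[lo_idx], diffs[hi_idx]


if __name__ == "__main__":
    # Synthetic flags for 600 items (20 reports x 30 questions); NOT the paper's data.
    system_a = [1] * 430 + [1] * 20 + [0] * 15 + [0] * 135
    system_b = [1] * 430 + [0] * 20 + [1] * 15 + [0] * 135
    observed, lower, upper = paired_bootstrap_diff(system_a, system_b)
    print(f"diff={observed:.4f}, approx. 95% interval=({lower:.4f}, {upper:.4f})")
```

If the interval contains zero, the data do not separate the two systems at that level. Resampling individual items treats questions from the same report as independent; resampling whole reports would be a more conservative variant.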
RAG for reliable audit evidence
Supervisory reviews of external audit reports require structured evidence extraction and expert judgement under strict confidentiality.
Large language models (LLMs) offer productivity gains but also carry risks such as hallucinated content and overreliance on their output.
Retrieval-augmented generation (RAG) frameworks mitigate these risks by grounding LLM outputs in curated internal document repositories, thereby improving factual accuracy and reducing hallucinations.
Because retrieval and generation are separate modules, this design also supports confidentiality and facilitates auditable deployments.
The paper implements an experimental design that explicitly separates retrieval and reasoning margins in a supervisory RAG pipeline, benchmarking multiple retrieval strategies and generation models to clarify their respective contributions to performance.
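One way to picture the separation of retrieval and reasoning margins is as a grid of accuracies over (model, retriever) pairs, with oracle retrieval serving as the upper bound on what retrieval can contribute. The sketch below is illustrative only: the margin definitions, the function names `accuracy_grid` and `margins`, and the sample records are assumptions made here, not the paper's exact formulas or data.

```python
from collections import defaultdict


def accuracy_grid(records):
    """Aggregate (model, retriever, is_correct) records into {(model, retriever): accuracy}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, retriever, is_correct in records:
        totals[(model, retriever)] += 1
        hits[(model, retriever)] += int(is_correct)
    return {key: hits[key] / totals[key] for key in totals}


def margins(grid, model, baseline_retriever, reference_model, oracle="oracle"):
    """Return (retrieval_margin, reasoning_margin) under the assumed definitions.

    retrieval_margin: accuracy the model gains when its baseline retriever is
        replaced by oracle retrieval (how much retrieval holds it back).
    reasoning_margin: gap to the reference model when both receive oracle
        retrieval (how much the generator itself holds it back).
    """
    retrieval_margin = grid[(model, oracle)] - grid[(model, baseline_retriever)]
    reasoning_margin = grid[(reference_model, oracle)] - grid[(model, oracle)]
    return retrieval_margin, reasoning_margin


if __name__ == "__main__":
    # Made-up records with placeholder model names; not results from the paper.
    records = [
        ("small-model", "lexical", True), ("small-model", "lexical", False),
        ("small-model", "lexical", False), ("small-model", "lexical", False),
        ("small-model", "oracle", True), ("small-model", "oracle", True),
        ("small-model", "oracle", False), ("small-model", "oracle", False),
        ("large-model", "oracle", True), ("large-model", "oracle", True),
        ("large-model", "oracle", True), ("large-model", "oracle", False),
    ]
    grid = accuracy_grid(records)
    print(margins(grid, "small-model", "lexical", "large-model"))
```

Holding retrieval fixed at the oracle while varying the model isolates the generator's contribution; holding the model fixed while varying the retriever isolates retrieval's contribution.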
A threshold for practical substitution
Strong retrieval can lift a large open-weight model to the performance of proprietary cloud models, a critical finding for regulated settings.
Yet smaller models consistently struggle with complex judgement tasks, highlighting a clear capacity threshold.
This implies that while RAG is transformative, core model reasoning capability remains a binding constraint for full substitution in financial supervision.