Retriever or reasoner? Decomposing retrieval-augmented generation performance in external audit supervision

Series: Occasional Papers. 2613.

Authors: Andrés Alonso-Robisco, José Manuel Carbó, Carlos José García, Jorge Quintana and Javier Tarancón

Abstract

Rationale

Supervisory reviews of external audit reports require reliable evidence extraction from long, confidential documents. This paper evaluates whether RAG systems can support that workflow by pre-filling supervisory templates, while disentangling the contribution of retrieval quality from the reasoning capacity of the language model.

Takeaways

  • Before answering a question about an external audit report, the RAG system must locate the passage where the answer is most likely to be found. This step matters: semantic retrieval (based on similarity of meaning through numerical representations of text) raises accuracy by about 6.2-6.3 percentage points for large language models such as Kimi and Llama 70B.
  • However, larger models are not always better. With semantic retrieval, Llama 70B performs very close to Kimi, while smaller models such as Mistral 7B and Llama 3B fall behind.
  • Questions that require abstract human judgement remain the main limitation of RAG systems. Automation is more reliable for tasks involving factual information, while interpretive tasks still require human oversight.
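
The retrieval step described in the first takeaway can be illustrated with a minimal sketch. It is not the paper's implementation: the passage names, the three-dimensional vectors and the `retrieve` helper are all hypothetical, and a real system would obtain the vectors from an embedding model rather than writing them by hand. The sketch only shows the ranking idea, in which passages are ordered by cosine similarity to the question vector.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, passages, k=1):
    """Return the ids of the k passages most similar to the query vector."""
    ranked = sorted(
        passages,
        key=lambda p: cosine(query_vec, p["vec"]),
        reverse=True,
    )
    return [p["id"] for p in ranked[:k]]


# Hypothetical hand-written vectors standing in for real embeddings
# of passages from an audit report.
passages = [
    {"id": "going_concern", "vec": [0.9, 0.1, 0.0]},
    {"id": "audit_fees", "vec": [0.1, 0.8, 0.2]},
    {"id": "signatory", "vec": [0.0, 0.2, 0.9]},
]

# Hypothetical vector for a question about going-concern doubts.
query_vec = [0.8, 0.2, 0.1]

top = retrieve(query_vec, passages, k=2)
print(top)
```

Only the top-ranked passages would then be handed to the language model, which is why the quality of this ranking can be evaluated separately from the model's reasoning.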