Building an Evaluation Harness for Financial RAG: What I Learned About LLM-as-Judge Calibration
The author built a retrieval-augmented generation (RAG) system for financial Q&A, using SEC filings and the FinanceBench benchmark. They found a significant discrepancy between the scores assigned by an LLM-as-judge and the system's actual performance, which led to lessons on calibrating LLM judges.
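One way to quantify a gap like the one described above is to compare the judge's verdicts with human-verified labels on the same answers. The sketch below is illustrative only and is not the author's harness: the function name `agreement_and_kappa`, the label strings, and the sample data are all hypothetical, and it assumes "actual performance" means correctness as checked by a human reviewer. It reports raw agreement and Cohen's kappa, which corrects agreement for chance.

```python
from collections import Counter


def agreement_and_kappa(judge, human):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)

    # Fraction of items where the judge and the human gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    judge_freq = Counter(judge)
    human_freq = Counter(human)
    labels = set(judge) | set(human)
    expected = sum(judge_freq[label] * human_freq[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label everywhere; kappa is
        # undefined, so report perfect agreement by convention.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Hypothetical verdicts for six answers (not data from the post).
judge_labels = ["correct", "correct", "correct", "incorrect", "correct", "correct"]
human_labels = ["correct", "incorrect", "correct", "incorrect", "incorrect", "correct"]

observed, kappa = agreement_and_kappa(judge_labels, human_labels)
judge_accuracy = judge_labels.count("correct") / len(judge_labels)
human_accuracy = human_labels.count("correct") / len(human_labels)

print(f"accuracy per judge: {judge_accuracy:.2f}")
print(f"accuracy per human: {human_accuracy:.2f}")
print(f"raw agreement:      {observed:.2f}")
print(f"Cohen's kappa:      {kappa:.2f}")
```

In this made-up sample the judge marks more answers correct than the human reviewer does, and kappa comes out well below the raw agreement rate. A pattern like that is a signal to recalibrate the judge prompt or rubric before relying on its scores.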