Mike Thrift
Marketing Manager
PHANTOM (NeurIPS 2025): Measuring LLM Hallucination Detection in Financial Documents
PHANTOM (NeurIPS 2025) is the first benchmark to measure LLM hallucination detection on real SEC filings across context lengths up to 30,000 tokens. Qwen3-30B-A3B-Thinking leads with F1=0.882, while 7B models score near random guessing, which has direct implications for autonomous accounting agents.
FinMaster Benchmark: Why LLMs Score 96% on Financial Literacy but 3% on Statement Generation
FinMaster (arXiv:2505.13533) benchmarks o3-mini, Claude 3.7 Sonnet, and DeepSeek-V3 across 183 financial tasks. Models score 96% on financial literacy but collapse to 3% on statement generation, and multi-step consulting tasks lose 21 accuracy points to error propagation.
ReAct: Synergizing Reasoning and Acting in Language Models
ReAct (Yao et al., ICLR 2023) interleaves chain-of-thought reasoning with tool actions in a single trajectory. It outperforms pure CoT on fact verification and beats imitation learning on embodied tasks by 34 percentage points. This analysis covers the paper's failure modes (search-induced distraction and compounding errors) and what they mean for autonomous agents writing back to Beancount ledgers.