Bean Labs Research Log
Open experiments and findings from Bean Labs, the Finance AI Agent research initiative by Beancount.io.
BIRD Benchmark: The Real-Database Gap in LLM Text-to-SQL
The BIRD benchmark (NeurIPS 2023) tests LLMs on 95 real databases — GPT-4 reaches only 54.89% execution accuracy with domain hints and 34.88% without, a 20-point gap that directly shapes what a natural-language BQL interface for Beancount would need to solve.
Verifiably Safe Tool Use for LLM Agents: STPA Meets MCP
CMU and NC State researchers propose using System-Theoretic Process Analysis (STPA) and a capability-enhanced Model Context Protocol to derive formal safety specifications for LLM agent tool use, with Alloy-based verification demonstrating absence of unsafe flows in a calendar scheduling case study.
GraphRAG: From Local to Global Query-Focused Summarization
Microsoft's GraphRAG builds a Leiden-partitioned entity graph over a text corpus and precomputes community summaries to answer global sensemaking questions that standard vector RAG cannot handle — but a 2025 bias audit shows its 72–83% win rates collapse after correcting for position and length artifacts in LLM-as-judge evaluation.
FinAuditing: LLMs Score Under 14% on Real SEC XBRL Auditing Tasks
FinAuditing tests 13 LLMs zero-shot on 1,102 real SEC XBRL filing instances; top scores are 13.86% on financial math verification and 12.42% on concept retrieval—results that directly bound what AI accounting tools can be trusted to automate without external tooling.
InvestorBench: Benchmarking LLM Agents on Financial Trading Decisions
InvestorBench (ACL 2025) tests 13 LLM backbones on backtested stock, crypto, and ETF trading using cumulative return and Sharpe ratio — not QA accuracy. Qwen2.5-72B tops the stock leaderboard at 46.15% CR; finance-tuned models backfire on equities. Model size predicts performance more reliably than domain fine-tuning.
StructRAG (ICLR 2025): Picking the Right Document Structure Beats GraphRAG by 28 Points
StructRAG (ICLR 2025) routes each query to a task-appropriate structure type — table, graph, catalogue, algorithm, or chunk — before reasoning, scoring 28 points higher than GraphRAG on the Loong benchmark while running 22× faster, with the DPO-trained router alone accounting for a 15-point accuracy gain.
Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
A 2026 Stanford preprint equalizes thinking-token budgets across five multi-agent architectures and finds single-agent LLMs match or beat multi-agent systems on multi-hop reasoning — with theoretical grounding in the Data Processing Inequality and implications for finance AI agent design.
M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?
M3MAD-Bench stress-tests Multi-Agent Debate across 9 models, 5 domains, and vision-language settings, finding that Collective Delusion causes 65% of failures, adversarial debate cuts accuracy by up to 12.8%, and Self-Consistency typically matches debate accuracy at lower token cost.
AGrail: Adaptive Safety Guardrails for LLM Agents That Learn Across Tasks
AGrail (ACL 2025) introduces a two-LLM cooperative guardrail that adapts safety checks at inference time via test-time adaptation, achieving 0% prompt injection attack success and 95.6% benign action preservation on Safe-OS. By comparison, GuardAgent and LLaMA-Guard block up to 49.2% of legitimate actions.
ShieldAgent: Verifiable Safety Policy Reasoning for LLM Agents
ShieldAgent (ICML 2025) replaces LLM-based guardrails with probabilistic rule circuits built on Markov Logic Networks, achieving 90.4% accuracy on agent attacks with 64.7% fewer API calls. This post also examines what that means for verifiable safety in financial AI systems.
Atlas: Joint Retriever-Reader Pre-Training Beats 540B-Parameter LLMs with 11B Parameters
Atlas (JMLR 2023) achieves 42.4% accuracy on Natural Questions with only 64 training examples—beating PaLM 540B by 3 points using 11B parameters—by jointly pre-training a Contriever-based dense retriever with a T5 Fusion-in-Decoder reader. Analysis covers retrieval accuracy limits, 587GB index infrastructure costs, and implications for Beancount ledger QA systems.
Fusion-in-Decoder: How Multi-Passage Retrieval Improves Generative QA
Izacard and Grave's FiD architecture encodes each retrieved passage independently, then fuses them in the decoder, outperforming RAG-Sequence by 4–11 points on NQ and TriviaQA. This post examines the design and its implications for Beancount ledger QA, where multi-entry synthesis across transactions is the norm.