Mike Thrift

Marketing Manager

May 13, 2026·mike

FinQA: The Benchmark Measuring AI Numerical Reasoning on Financial Reports

FinQA (EMNLP 2021) built 8,281 QA pairs from S&P 500 earnings reports requiring multi-step arithmetic programs. Neural models scored 61% at release versus 91% for human experts; accuracy collapses to 22% on three-or-more-step programs. The failure modes — domain constants, cross-modality grounding, chain length — map directly to the challenges Beancount agents face today.

FinanceBench: Why Vector-Store RAG Fails on Real Financial Documents

FinanceBench evaluates 16 AI configurations against 10,231 questions from real SEC filings; shared-vector-store RAG answers correctly only 19% of the time, and even GPT-4-Turbo with the oracle passage reaches just 85% accuracy — showing that numerical reasoning, not retrieval, is the binding constraint for enterprise finance AI.

DSPy: Replacing Brittle Prompt Engineering with Compiled LLM Pipelines

DSPy replaces hand-crafted prompt strings with declarative signatures and a metric-driven compiler—boosting Llama2-13b from 9.4% to 46.9% on GSM8K math reasoning and offering a more maintainable path for production finance AI pipelines.

LATS: Language Agent Tree Search — 추론, 행동, 계획을 하나의 프레임워크로 통합

LATS(Language Agent Tree Search, ICML 2024)는 ReAct, Tree of Thoughts, Reflexion을 단일 MCTS 프레임워크로 통합하여 GPT-4와 함께 HumanEval에서 92.7%의 pass@1을 달성했습니다. Git 기반의 Beancount 장부의 경우, 운영 환경에서 LATS를 제한하는 상태 복원 요구 사항을 아주 쉽게 충족할 수 있습니다.

Self-RAG: Adaptive Retrieval and Self-Critique for LLMs

Self-RAG (ICLR 2024 Oral) trains a language model to decide when to retrieve and then grade its own results using four reflection tokens — reaching 55.8% on PopQA and 80.2 FactScore on biographies while outperforming ChatGPT on five benchmarks. Analysis covers the mechanism, ablation results, reproducibility limits, and implications for finance AI agents over Beancount ledgers.

Voyager: Skill Libraries as the Foundation for Lifelong AI Agent Learning

Voyager, a GPT-4-powered Minecraft agent from NVIDIA and Caltech, demonstrates that a persistent code skill library enables genuine lifelong learning without fine-tuning — discovering 3.3× more items than prior state-of-the-art. The pattern maps directly onto long-horizon Beancount ledger automation, though financial correctness demands staging layers that game sandboxes never require.

HippoRAG: Neurobiologically Inspired Long-Term Memory for LLMs

HippoRAG (NeurIPS 2024) builds a knowledge graph from OpenIE triples and applies Personalized PageRank at query time, reaching 89.1% Recall@5 on 2WikiMultiHopQA versus 68.2% for ColBERTv2—with direct implications for querying complex financial ledgers across multi-year transaction histories.

AgentBench：评估作为代理的 LLM —— 对金融 AI 可靠性的启示

AgentBench（Liu 等人，ICLR 2024）在 8 个交互式环境中对 27 个大语言模型进行了基准测试 —— GPT-4 的综合得分为 4.01，而表现最好的开源模型仅为 0.96。三种主要的失败模式（知识图谱失败中 67.9% 为超出任务限制、数据库失败中 53.3% 为格式错误以及无效操作）直接对应了在真实账本上部署 Beancount 回写代理的风险。

BloombergGPT and the Limits of Domain-Specific LLMs in Finance

Bloomberg trained a 50B-parameter LLM on 569B tokens of financial data and beat general models on sentiment and table-reasoning benchmarks — then GPT-4 matched it without any finance-specific pretraining. What the $10M experiment reveals about domain pretraining trade-offs, tokenization of numbers, and why tool-use is more reliable than model internals for accounting agents.

AutoGen: Multi-Agent Conversation Frameworks for Finance AI

AutoGen (Wu et al., 2023) introduces a multi-agent conversation framework where LLM-backed agents pass messages to complete tasks; a two-agent setup lifts MATH benchmark accuracy from 55% to 69%, and a dedicated SafeGuard agent improves unsafe-code detection by up to 35 F1 points — findings directly applicable to building safe, modular Beancount automation pipelines.

Gorilla: How Retrieval-Aware Training Reduces LLM API Hallucinations from 78% to 11%

Gorilla (Patil et al., NeurIPS 2024) fine-tunes a 7B LLaMA model with Retriever-Aware Training on retrieved API documentation, cutting hallucination rates from 78% to 11% versus GPT-4 zero-shot — with direct implications for finance AI write-back agents where wrong account names or inverted signs are correctness failures, not annoyances.

MemGPT: Virtual Context Management for LLM Agents

MemGPT applies OS-style virtual memory paging to LLMs, using three-tier storage — working memory, recall, and archival — to give agents persistent recall across sessions; on multi-session chat benchmarks, MemGPT with GPT-4 achieves 92.5% accuracy versus a 32.1% fixed-context baseline.

llm

machine-learning

automation

Showing 61–72 of 87 posts

Prev6 / 8Next