Beancount.io

33 tagged with "Plain-Text Accounting"

Research grounded in plain-text accounting formats and workflows

BIRD Benchmark: The Real-Database Gap in LLM Text-to-SQL
·mike

The BIRD benchmark (NeurIPS 2023) tests LLMs on 95 real databases — GPT-4 reaches only 54.89% execution accuracy with domain hints and 34.88% without, a 20-point gap that directly shapes what a natural-language BQL interface for Beancount would need to solve.

beancount
ai
llm
database
+3
GraphRAG: From Local to Global Query-Focused Summarization
·mike

Microsoft's GraphRAG builds a Leiden-partitioned entity graph over a text corpus and precomputes community summaries to answer global sensemaking questions that standard vector RAG cannot handle — but a 2025 bias audit shows its 72–83% win rates collapse after correcting for position and length artifacts in LLM-as-judge evaluation.

ai
llm
machine-learning
beancount
+3
StructRAG (ICLR 2025): Picking the Right Document Structure Beats GraphRAG by 28 Points
·mike

StructRAG (ICLR 2025) routes each query to a task-appropriate structure type — table, graph, catalogue, algorithm, or chunk — before reasoning, scoring 28 points higher than GraphRAG on the Loong benchmark while running 22× faster, with the DPO-trained router alone accounting for a 15-point accuracy gain.

ai
llm
machine-learning
beancount
+3
Fusion-in-Decoder: How Multi-Passage Retrieval Improves Generative QA
·mike

Izacard and Grave's FiD architecture encodes each retrieved passage independently, then fuses them in the decoder, outperforming RAG-Sequence by 4–11 points on NQ and TriviaQA. This post examines the design and its implications for Beancount ledger QA, where synthesizing answers across multiple transaction entries is the norm.

ai
machine-learning
llm
beancount
+2
IRCoT: Interleaving Retrieval with Chain-of-Thought for Multi-Step QA
·mike

IRCoT interleaves BM25 retrieval with each step of a chain-of-thought reasoning loop, improving retrieval recall by 11.3 points and F1 by 7.1 points on HotpotQA over one-step RAG, and shows that a 3B model can beat GPT-3 175B when the retrieval strategy is right.

ai
llm
machine-learning
automation
+3
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
·mike

Lewis et al.'s NeurIPS 2020 paper introduced the hybrid RAG architecture—a BART-large generator paired with a FAISS-indexed retriever over 21 million Wikipedia passages—achieving 44.5 EM on Natural Questions and establishing the parametric/non-parametric split that now underlies most production AI systems. This review covers RAG-Sequence vs. RAG-Token trade-offs, the retrieval collapse failure mode, and what stale indexes mean for financial AI built on append-only Beancount ledgers.

ai
machine-learning
llm
data-science
+2
LATS (Language Agent Tree Search): Unifying Reasoning, Acting, and Planning in One Framework
·mike

LATS (Language Agent Tree Search, ICML 2024) unifies ReAct, Tree of Thoughts, and Reflexion into a single MCTS framework, achieving 92.7% pass@1 on HumanEval with GPT-4. For Git-backed Beancount ledgers, the state-restoration requirement that limits LATS in production settings is very easy to satisfy.

ai
llm
machine-learning
automation
+3
Self-RAG: Adaptive Retrieval and Self-Critique for LLMs
·mike

Self-RAG (ICLR 2024 Oral) trains a language model to decide when to retrieve and to grade its own output using four types of reflection tokens, reaching 55.8% on PopQA and a FactScore of 80.2 on biographies while outperforming ChatGPT on five benchmarks. The analysis covers the mechanism, ablation results, reproducibility limits, and implications for finance AI agents over Beancount ledgers.

ai
machine-learning
llm
technology
+3
Voyager: Skill Libraries as the Foundation for Lifelong AI Agent Learning
·mike

Voyager, a GPT-4-powered Minecraft agent from NVIDIA and Caltech, demonstrates that a persistent code skill library enables genuine lifelong learning without fine-tuning — discovering 3.3× more items than prior state-of-the-art. The pattern maps directly onto long-horizon Beancount ledger automation, though financial correctness demands staging layers that game sandboxes never require.

ai
llm
machine-learning
automation
+3
HippoRAG: Neurobiologically Inspired Long-Term Memory for LLMs
·mike

HippoRAG (NeurIPS 2024) builds a knowledge graph from OpenIE triples and applies Personalized PageRank at query time, reaching 89.1% Recall@5 on 2WikiMultiHopQA versus 68.2% for ColBERTv2—with direct implications for querying complex financial ledgers across multi-year transaction histories.

llm
ai
machine-learning
beancount
+3
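AgentBench: Evaluating LLMs as Agents, and What It Means for Financial AI Reliability
·mike

AgentBench (Liu et al., ICLR 2024) benchmarks 27 large language models across 8 interactive environments: GPT-4 reaches an overall score of 4.01, while the best-performing open-source model scores only 0.96. The three main failure modes (task limit exceeded, which accounts for 67.9% of knowledge-graph failures; invalid format, which accounts for 53.3% of database failures; and invalid actions) map directly onto the risks of deploying Beancount write-back agents on real ledgers.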

ai
llm
machine-learning
automation
+3
BloombergGPT and the Limits of Domain-Specific LLMs in Finance
·mike

Bloomberg trained a 50B-parameter LLM on 569B tokens of mixed financial and general-purpose text and beat general models on sentiment and table-reasoning benchmarks; GPT-4 then matched it without any finance-specific pretraining. This post examines what the $10M experiment reveals about domain pretraining trade-offs, the tokenization of numbers, and why tool use is more reliable than model internals for accounting agents.

llm
ai
machine-learning
finance
+3
Showing 13–24 of 33 posts