33 posts tagged with "Plain-Text Accounting"
Research grounded in plain-text accounting formats and workflows
Uncertainty-Aware Deferral for LLM Agents: When to Escalate from Small to Large Models
ReDAct runs a small model by default and escalates to an expensive model only when token-level perplexity signals uncertainty, achieving 64% cost savings over GPT-5.2-only while matching or exceeding its accuracy — a directly applicable pattern for Beancount transaction-categorization agents.
OpenHands: Open Platform for AI Software Agents and What It Means for Finance Automation
OpenHands is an MIT-licensed, Docker-sandboxed agent platform where CodeAct achieves 26% on SWE-Bench Lite — a sobering benchmark that establishes what AI agents can reliably do today, and why the first productive finance deployments should be tightly scoped rather than autonomous.
LLMs Score 2.3% on Beancount DSL Generation: The LLMFinLiteracy Benchmark
The LLMFinLiteracy benchmark finds that five open-weight ~7B models generate fully correct Beancount transactions only 2.3% of the time, with failures concentrated in accounting reasoning rather than syntax, pointing to compiler-in-the-loop feedback as the critical missing ingredient for reliable write-back agents.
TableMaster: Adaptive Reasoning for Table Understanding with LLMs
TableMaster is a prompting-only pipeline that reaches 78.13% on WikiTQ with GPT-4o-mini—13 points above Chain-of-Table—by combining table-of-focus extraction, semantic verbalization, and adaptive switching between text and symbolic reasoning. Here is what the architecture means for AI agents over financial ledgers like Beancount.
τ²-bench: Measuring the Cost of Dual-Control in Conversational AI Agents
τ²-bench extends agent benchmarking to dual-control settings where both the AI and the user invoke tools over shared state — finding that active users cut success rates by 18–25 percentage points, with direct implications for Beancount agents sharing write access with human users.
GAIA Benchmark: Measuring What Frontier AI Agents Can Actually Do
GAIA benchmarks 466 real-world tasks across three difficulty levels; frontier agents reached 74.55% in mid-2026 versus 92% for humans, and the remaining Level 3 gap maps directly to the multi-step coordination challenges in automated Beancount ledger workflows.
WorkArena: How LLM Web Agents Perform on Real Enterprise Knowledge Work
WorkArena benchmarks LLM web agents on 33 real ServiceNow tasks — GPT-4o reaches 42.7% overall but 0% on list-filter tasks, exposing a hard wall between form-filling and structured UI interaction that maps directly to challenges in Beancount ledger automation.
τ-bench: Measuring AI Agent Reliability in Real-World Tool-Use Domains
τ-bench shows that top LLMs like Claude 3.5 Sonnet drop from a pass^1 of 0.692 to a pass^4 of 0.462 on retail customer-service tasks, where pass^k requires all k independent trials of a task to succeed. That consistency cliff has direct implications for any write-back agent operating on a Beancount ledger.
Chain-of-Table: Evolving Tables in the LLM Reasoning Chain
Chain-of-Table (ICLR 2024) improves LLM tabular reasoning by evolving the table itself as the intermediate state — achieving 67.31% on WikiTQ vs. 61.48% for prior baselines, with a +10.25 point advantage on tables exceeding 4,000 tokens and direct applicability to Beancount ledger query agents.
TableLlama: Can a 7B Open Model Match GPT-4 on Table Understanding?
TableLlama fine-tunes Llama 2 (7B) on 2.6M table-task examples and beats GPT-4 on structural tasks like column type annotation (F1 94 vs 32), but falls 33 points short on WikiTQ compositional reasoning — a calibrated benchmark for what 7B open models can and cannot do in finance AI today.
TAPAS: Weakly Supervised Table QA Without SQL, and What It Means for Beancount
TAPAS (Google Research, ACL 2020) answers table questions by selecting cells and applying scalar aggregations, without generating SQL. This post analyzes the architecture, its 12-point SQA accuracy gain, and why the cell-selection paradigm fits small Beancount ledger queries but breaks down at scale.
DIN-SQL: Decomposed In-Context Learning for Text-to-SQL
DIN-SQL (NeurIPS 2023) decomposes text-to-SQL into schema linking, complexity classification, and SQL generation stages, lifting GPT-4 from 67.4% to 85.3% execution accuracy on Spider without fine-tuning — and the same decomposition strategy maps directly onto natural language interfaces for Beancount's BQL query language.