Beancount.io

33 posts tagged with "Plain-Text Accounting"

Research grounded in plain-text accounting formats and workflows

Uncertainty-Aware Deferral for LLM Agents: When to Escalate from Small to Large Models
·mike

ReDAct runs a small model by default and escalates to an expensive model only when token-level perplexity signals uncertainty, achieving 64% cost savings over GPT-5.2-only while matching or exceeding its accuracy — a directly applicable pattern for Beancount transaction-categorization agents.

ai
llm
automation
machine-learning
+4
OpenHands: Open Platform for AI Software Agents and What It Means for Finance Automation
·mike

OpenHands is an MIT-licensed, Docker-sandboxed agent platform where CodeAct achieves 26% on SWE-Bench Lite — a sobering benchmark that establishes what AI agents can reliably do today, and why the first productive finance deployments should be tightly scoped rather than autonomous.

ai
open-source
automation
llm
+4
LLMs Score 2.3% on Beancount DSL Generation: The LLMFinLiteracy Benchmark
·mike

The LLMFinLiteracy benchmark finds that five open-weight ~7B models generate fully correct Beancount transactions only 2.3% of the time, with failures concentrated in accounting reasoning—not syntax—pointing to compiler-in-the-loop feedback as the critical missing ingredient for reliable write-back agents.

llm
beancount
plain-text-accounting
ai
+4
TableMaster: Adaptive Reasoning for Table Understanding with LLMs
·mike

TableMaster is a prompting-only pipeline that reaches 78.13% on WikiTQ with GPT-4o-mini—13 points above Chain-of-Table—by combining table-of-focus extraction, semantic verbalization, and adaptive switching between text and symbolic reasoning. Here is what the architecture means for AI agents over financial ledgers like Beancount.

ai
llm
machine-learning
beancount
+4
τ²-bench: Measuring the Cost of Dual-Control in Conversational AI Agents
·mike

τ²-bench extends agent benchmarking to dual-control settings where both the AI and the user invoke tools over shared state — finding that active users cut success rates by 18–25 percentage points, with direct implications for Beancount agents sharing write access with human users.

ai
llm
automation
beancount
+2
GAIA Benchmark: Measuring What Frontier AI Agents Can Actually Do
·mike

GAIA benchmarks 466 real-world tasks across three difficulty levels; frontier agents reached 74.55% in mid-2026 versus 92% for humans, and the remaining Level 3 gap maps directly to the multi-step coordination challenges in automated Beancount ledger workflows.

ai
llm
machine-learning
automation
+3
WorkArena: How LLM Web Agents Perform on Real Enterprise Knowledge Work
·mike

WorkArena benchmarks LLM web agents on 33 real ServiceNow tasks — GPT-4o reaches 42.7% overall but 0% on list-filter tasks, exposing a hard wall between form-filling and structured UI interaction that maps directly to challenges in Beancount ledger automation.

ai
llm
automation
enterprise-software
+3
τ-bench: Measuring AI Agent Reliability in Real-World Tool-Use Domains
·mike

τ-bench shows that top LLMs like Claude 3.5 Sonnet drop from pass^1 of 0.692 to pass^4 of 0.462 in retail customer-service tasks, where pass^k is the chance that all k independent trials of a task succeed. This consistency cliff has direct implications for any write-back agent operating on a Beancount ledger.

ai
llm
machine-learning
automation
+3
Chain-of-Table: Evolving Tables in the LLM Reasoning Chain
·mike

Chain-of-Table (ICLR 2024) improves LLM tabular reasoning by evolving the table itself as the intermediate state — achieving 67.31% on WikiTQ vs. 61.48% for prior baselines, with a +10.25 point advantage on tables exceeding 4,000 tokens and direct applicability to Beancount ledger query agents.

ai
llm
machine-learning
beancount
+3
TableLlama: Can a 7B Open Model Match GPT-4 on Table Understanding?
·mike

TableLlama fine-tunes Llama 2 (7B) on 2.6M table-task examples and beats GPT-4 on structural tasks like column type annotation (F1 94 vs 32), but falls 33 points short on WikiTQ compositional reasoning — a calibrated benchmark for what 7B open models can and cannot do in finance AI today.

llm
ai
machine-learning
beancount
+3
TAPAS: Weakly Supervised Table QA Without SQL, and What It Means for Beancount
·mike

TAPAS (Google Research, ACL 2020) answers table questions by selecting cells and applying scalar aggregations — no SQL generated. This post analyzes the architecture, its 12-point SQA accuracy gain, and why the cell-selection paradigm fits small Beancount ledger queries but breaks down at scale.

ai
machine-learning
llm
data-science
+4
DIN-SQL: Decomposed In-Context Learning for Text-to-SQL
·mike

DIN-SQL (NeurIPS 2023) decomposes text-to-SQL into schema linking, complexity classification, and SQL generation stages, lifting GPT-4 from 67.4% to 85.3% execution accuracy on Spider without fine-tuning — and the same decomposition strategy maps directly onto natural language interfaces for Beancount's BQL query language.

ai
llm
database
queries
+3
Showing 1–12 of 33 posts