Beancount.io

89 tagged with "AI"

Artificial intelligence research and applications in finance and accounting

SWE-agent: How Interface Design Unlocks Automated Software Engineering
·mike

SWE-agent (NeurIPS 2024) introduces Agent-Computer Interfaces (ACIs), purpose-built layers between LLMs and software environments, and reports a 10.7-percentage-point improvement over raw shell access and 12.47% resolution on SWE-bench with GPT-4 Turbo. The takeaway: for autonomous coding agents, interface design can be a bigger bottleneck than model capability.

ai
llm
automation
machine-learning
+4
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
·mike

SWE-bench evaluates language models on 2,294 real GitHub issues across 12 Python repositories using execution-based tests. At publication, Claude 2 resolved only 1.96% of issues with realistic retrieval. The benchmark has since become the de facto standard for coding agents, and its retrieval and patch-length failure modes are directly relevant to Beancount write-back agents.

ai
llm
machine-learning
beancount
+3
CodeAct: Why Executable Python Code Makes LLM Agents 20% More Accurate
·mike

CodeAct (ICML 2024) replaces JSON tool-calling with executable Python code, improving GPT-4 agent success rates by ~20 percentage points on multi-tool tasks and reducing interaction turns by 30% — with direct implications for building reliable Beancount reconciliation agents.

ai
llm
automation
machine-learning
+3
LLMs Cannot Self-Correct Reasoning Yet — ICLR 2024 Findings and Finance AI Implications
·mike

Huang et al. (ICLR 2024) show that LLMs asked to review their own reasoning without external feedback consistently lose accuracy: GPT-4 drops from 95.5% to 91.5% on GSM8K. This post looks at what that means for designing reliable Beancount journal entry agents.

llm
ai
machine-learning
automation
+3
Tree of Thoughts: Deliberate Problem Solving with LLM Search
·mike

Tree of Thoughts (ToT) achieves 74% on Game of 24 vs 4% for standard GPT-4 CoT by organizing LLM reasoning into a branching search tree with pruning and backtracking — with direct implications for multi-step financial classification and tax optimization in Beancount workflows.

ai
llm
machine-learning
automation
+2
CRITIC: Why LLM Self-Correction Requires External Tool Feedback
·mike

CRITIC (ICLR 2024) reports gains of 7.7 F1 points on open-domain QA and a 79.2% toxicity reduction by grounding LLM revision in external tool signals. Its verify-then-correct loop maps directly onto write-back safety for Beancount finance agents.

ai
llm
machine-learning
automation
+4
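The verify-then-correct loop described in the CRITIC summary above can be sketched for a ledger setting. This is a minimal illustration, not the paper's implementation: the external "tool" is assumed to be a simple double-entry balance check, and `check_balanced`, `verify_then_correct`, and `stub_revise` are hypothetical names. `stub_revise` stands in for the model call that would revise a draft using the tool's feedback.

```python
from decimal import Decimal

def check_balanced(postings):
    """External check: postings must sum to zero in every currency."""
    totals = {}
    for _account, amount, currency in postings:
        totals[currency] = totals.get(currency, Decimal("0")) + Decimal(amount)
    residuals = {c: t for c, t in totals.items() if t != 0}
    return (not residuals), residuals

def verify_then_correct(draft, revise, max_rounds=3):
    """Accept a draft only once the external check passes; otherwise revise."""
    for _ in range(max_rounds):
        ok, residuals = check_balanced(draft)
        if ok:
            return draft
        # Feed the tool's findings back to the reviser (a model call in practice).
        draft = revise(draft, residuals)
    return None  # never balanced: refuse to write back

def stub_revise(draft, residuals):
    """Hypothetical stand-in for a model: rebalance the last posting."""
    *rest, (account, _old_amount, currency) = draft
    total = sum(Decimal(a) for _, a, c in rest if c == currency)
    return rest + [(account, str(-total), currency)]

# Hypothetical draft with a transposed-digit error in the second posting.
draft = [
    ("Expenses:Food", "42.10", "USD"),
    ("Assets:Checking", "-41.20", "USD"),
]
result = verify_then_correct(draft, stub_revise)
print(result)
```

The point of the design is that acceptance depends on the external check rather than on the model's own judgment of its draft; if no revision passes within `max_rounds`, nothing is written.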
Reflexion: Language Agents That Learn from Mistakes Without Retraining
·mike

Reflexion (NeurIPS 2023) lets LLM agents improve by storing verbal post-mortems in an episodic buffer — no weight updates required. It reaches 91% on HumanEval with GPT-4 but fails on WebShop, revealing a structural constraint: verbal reinforcement only works when the evaluator produces a crisp, actionable signal. Here is what that means for building a self-correcting Beancount ledger agent.

ai
llm
machine-learning
automation
+2
Self-Consistency: Majority Voting Improves Chain-of-Thought Accuracy
·mike

Self-consistency replaces "greedy" chain-of-thought decoding with majority voting over N sampled reasoning paths, raising GPT-3's accuracy on GSM8K by 17.9 percentage points without additional training. It applies directly to multi-step financial calculations, where a single model decoding is unreliable.

ai
llm
machine-learning
automation
+3
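The voting step in the self-consistency summary above is small enough to sketch. This is an illustrative example only: the sampled answers are made-up values standing in for final answers extracted from N reasoning paths, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer and the share of paths that agree."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers from five sampled reasoning paths.
sampled = ["1250.00", "1250.00", "1205.00", "1250.00", "1520.00"]
answer, agreement = majority_vote(sampled)
print(answer, agreement)
```

The agreement ratio is a convenient by-product: a finance agent could treat low agreement as a signal to escalate to a human instead of posting the result.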
PAL: Program-Aided Language Models for Reliable Financial Arithmetic
·mike

PAL (Program-Aided Language Models) achieves a +38pp accuracy gain over chain-of-thought on arithmetic-heavy tasks by delegating computation to a Python interpreter — a directly applicable architecture for reliable Beancount ledger queries and finance AI.

ai
llm
machine-learning
beancount
+3
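The division of labor in the PAL summary above, where the model writes a program and an interpreter does the arithmetic, can be sketched as follows. This is a toy illustration under stated assumptions: `generated_program` is hand-written text standing in for model output, `run_generated` is a hypothetical helper, and the restricted namespace is not a real sandbox.

```python
from decimal import Decimal

# Hypothetical program text, standing in for what a model might emit.
generated_program = '''
opening = Decimal("1520.35")
deposits = [Decimal("200.00"), Decimal("49.99")]
fees = Decimal("12.50")
result = opening + sum(deposits) - fees
'''

def run_generated(program: str) -> Decimal:
    """Execute the program and return the variable it names `result`."""
    # Only `sum` and `Decimal` are exposed. This limits accidents but is
    # NOT a security boundary; untrusted code needs real isolation.
    namespace = {"__builtins__": {"sum": sum}, "Decimal": Decimal}
    exec(program, namespace)
    return namespace["result"]

balance = run_generated(generated_program)
print(balance)
```

Using `Decimal` rather than floats keeps the interpreter's arithmetic exact to the cent, which is the property that matters when the model's own arithmetic cannot be trusted.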
Can LLMs Reason Over Tabular Data? What Four Benchmarks Tell Us About Finance AI
·mike

Four 2024–2025 benchmarks show GPT-4 scoring 42% on real-world table QA versus 86% for humans, with complex aggregations collapsing to 19.6%—and Beancount's native syntax sits at the worst-performing end of the serialization hierarchy for LLM input.

ai
llm
beancount
data-science
+3
Constitutional AI for Accounting Agents: RLAIF, Policy Rules, and Goodharting Risks
·mike

Anthropic's Constitutional AI paper (Bai et al., 2022) trains LLMs to follow rules using AI-generated feedback rather than human harm labels. This research log examines how the RLAIF critique-revise-preference pipeline maps onto write-back safety for autonomous Beancount ledger agents — and what Goodharting, calibration failures, and dual-use risks look like when the "constitution" is a chart of accounts instead of an ethics ruleset.

ai
machine-learning
llm
automation
+3
Chain-of-Thought Prompting: Precision-Recall Trade-offs for Finance AI
·mike

A close reading of Wei et al.'s 2022 Chain-of-Thought paper and what it means for finance AI — why CoT raises precision but may cut recall on rare-event detection, why the scale threshold matters for production agents, and what a finance team building on LLMs should watch out for.

ai
llm
machine-learning
data-science
+3