Beancount.io LogoBeancount.io

89 tagged with "AI"

Artificial intelligence research and applications in finance and accounting

View all tags

BIRD Benchmark: The Real-Database Gap in LLM Text-to-SQL
·mike

BIRD Benchmark: The Real-Database Gap in LLM Text-to-SQL

The BIRD benchmark (NeurIPS 2023) tests LLMs on 95 real databases — GPT-4 reaches only 54.89% execution accuracy with domain hints and 34.88% without, a 20-point gap that directly shapes what a natural-language BQL interface for Beancount would need to solve.

beancount
ai
llm
database
+3
Verifiably Safe Tool Use for LLM Agents: STPA Meets MCP
·mike

Verifiably Safe Tool Use for LLM Agents: STPA Meets MCP

CMU and NC State researchers propose using System-Theoretic Process Analysis (STPA) and a capability-enhanced Model Context Protocol to derive formal safety specifications for LLM agent tool use, with Alloy-based verification demonstrating absence of unsafe flows in a calendar scheduling case study.

ai
llm
security
automation
+3
GraphRAG: From Local to Global Query-Focused Summarization
·mike

GraphRAG: From Local to Global Query-Focused Summarization

Microsoft's GraphRAG builds a Leiden-partitioned entity graph over a text corpus and precomputes community summaries to answer global sensemaking questions that standard vector RAG cannot handle — but a 2025 bias audit shows its 72–83% win rates collapse after correcting for position and length artifacts in LLM-as-judge evaluation.

ai
llm
machine-learning
beancount
+3
FinAuditing: LLMs Score Under 14% on Real SEC XBRL Auditing Tasks
·mike

FinAuditing: LLMs Score Under 14% on Real SEC XBRL Auditing Tasks

FinAuditing tests 13 LLMs zero-shot on 1,102 real SEC XBRL filing instances; top scores are 13.86% on financial math verification and 12.42% on concept retrieval—results that directly bound what AI accounting tools can be trusted to automate without external tooling.

llm
ai
financial-reporting
machine-learning
+2
InvestorBench: Benchmarking LLM Agents on Financial Trading Decisions
·mike

InvestorBench: Benchmarking LLM Agents on Financial Trading Decisions

InvestorBench (ACL 2025) tests 13 LLM backbones on backtested stock, crypto, and ETF trading using cumulative return and Sharpe ratio — not QA accuracy. Qwen2.5-72B tops the stock leaderboard at 46.15% CR; finance-tuned models backfire on equities. Model size predicts performance more reliably than domain fine-tuning.

llm
ai
finance
machine-learning
+3
StructRAG (ICLR 2025): Picking the Right Document Structure Beats GraphRAG by 28 Points
·mike

StructRAG (ICLR 2025): Picking the Right Document Structure Beats GraphRAG by 28 Points

StructRAG (ICLR 2025) routes each query to a task-appropriate structure type — table, graph, catalogue, algorithm, or chunk — before reasoning, scoring 28 points higher than GraphRAG on the Loong benchmark while running 22× faster, with the DPO-trained router alone accounting for a 15-point accuracy gain.

ai
llm
machine-learning
beancount
+3
Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
·mike

Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

A 2026 Stanford preprint equalizes thinking-token budgets across five multi-agent architectures and finds single-agent LLMs match or beat multi-agent systems on multi-hop reasoning — with theoretical grounding in the Data Processing Inequality and implications for finance AI agent design.

ai
llm
machine-learning
automation
+3
M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?
·mike

M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?

M3MAD-Bench stress-tests Multi-Agent Debate across 9 models, 5 domains, and vision-language settings, finding that Collective Delusion causes 65% of failures, adversarial debate cuts accuracy by up to 12.8%, and Self-Consistency typically matches debate accuracy at lower token cost.

ai
llm
machine-learning
automation
+3
AGrail: Adaptive Safety Guardrails for LLM Agents That Learn Across Tasks
·mike

AGrail: Adaptive Safety Guardrails for LLM Agents That Learn Across Tasks

AGrail (ACL 2025) introduces a two-LLM cooperative guardrail that adapts safety checks at inference time via test-time adaptation, achieving 0% prompt injection attack success and 95.6% benign action preservation on Safe-OS — compared to GuardAgent and LLaMA-Guard blocking up to 49.2% of legitimate actions.

ai
llm
security
automation
+3
ShieldAgent: Verifiable Safety Policy Reasoning for LLM Agents
·mike

ShieldAgent: Verifiable Safety Policy Reasoning for LLM Agents

ShieldAgent (ICML 2025) replaces LLM-based guardrails with probabilistic rule circuits built on Markov Logic Networks, achieving 90.4% accuracy on agent attacks with 64.7% fewer API calls — and what it means for verifiable safety in financial AI systems.

ai
llm
machine-learning
security
+4
Atlas: Joint Retriever-Reader Pre-Training Beats 540B-Parameter LLMs with 11B Parameters
·mike

Atlas: Joint Retriever-Reader Pre-Training Beats 540B-Parameter LLMs with 11B Parameters

Atlas (JMLR 2023) achieves 42.4% accuracy on Natural Questions with only 64 training examples—beating PaLM 540B by 3 points using 11B parameters—by jointly pre-training a Contriever-based dense retriever with a T5 Fusion-in-Decoder reader. Analysis covers retrieval accuracy limits, 587GB index infrastructure costs, and implications for Beancount ledger QA systems.

ai
machine-learning
llm
data-science
+3
Fusion-in-Decoder: How Multi-Passage Retrieval Improves Generative QA
·mike

Fusion-in-Decoder: How Multi-Passage Retrieval Improves Generative QA

Izacard and Grave's FiD architecture independently encodes retrieved passages then fuses them in the decoder, outperforming RAG-Sequence by 4–11 points on NQ and TriviaQA. This post examines the design and its implications for Beancount ledger QA, where multi-entry synthesis across transactions is the norm.

ai
machine-learning
llm
beancount
+2
Showing 37–48 of 89 posts