Beancount.io

8 posts tagged with "Technology"

Technology research and software engineering topics relevant to financial AI systems

WildToolBench: Why No LLM Exceeds 15% Session Accuracy in Real-World Tool Use
·mike

WildToolBench (ICLR 2026) evaluates 57 LLMs on 1,024 tasks drawn from real user behavior — no model exceeds 15% session accuracy, with compositional orchestration, hidden intent, and instruction transitions as the three sharpest failure modes.

ai
llm
automation
machine-learning
Lost in the Middle: Position Bias in LLMs and Its Impact on Finance AI
·mike

The TACL 2024 paper by Liu et al. shows LLMs perform up to 20 points worse on information buried in the middle of long contexts — a U-shaped degradation affecting every tested model including Claude-1.3-100K — with concrete implications for how RAG pipelines should order retrieved passages in finance and accounting applications.

llm
ai
machine-learning
data-science
OSWorld: Desktop AI Agents Succeed on 12% of Tasks Where Humans Succeed on 72%
·mike

OSWorld (NeurIPS 2024) benchmarks multimodal AI agents on 369 real desktop tasks across Ubuntu, Windows, and macOS — finding a 60-percentage-point gap between the best model (12.24%) and human performance (72.36%), with 75% of failures traced to visuomotor grounding errors rather than reasoning failures.

ai
machine-learning
automation
llm
StructRAG (ICLR 2025): Picking the Right Document Structure Beats GraphRAG by 28 Points
·mike

StructRAG (ICLR 2025) routes each query to a task-appropriate structure type — table, graph, catalogue, algorithm, or chunk — before reasoning, scoring 28 points higher than GraphRAG on the Loong benchmark while running 22× faster, with the DPO-trained router alone accounting for a 15-point accuracy gain.

ai
llm
machine-learning
beancount
Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
·mike

A 2026 Stanford preprint equalizes thinking-token budgets across five multi-agent architectures and finds single-agent LLMs match or beat multi-agent systems on multi-hop reasoning — with theoretical grounding in the Data Processing Inequality and implications for finance AI agent design.

ai
llm
machine-learning
automation
Self-RAG: Adaptive Retrieval and Self-Critique for LLMs
·mike

Self-RAG (ICLR 2024 Oral) trains a language model to decide when to retrieve and then grade its own results using four reflection tokens — reaching 55.8% on PopQA and 80.2 FactScore on biographies while outperforming ChatGPT on five benchmarks. Analysis covers the mechanism, ablation results, reproducibility limits, and implications for finance AI agents over Beancount ledgers.

ai
machine-learning
llm
technology
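AgentBench: Evaluating LLMs as Agents and What It Means for Financial AI Reliability
·mike

AgentBench (Liu et al., ICLR 2024) benchmarks 27 large language models across 8 interactive environments: GPT-4 reaches an overall score of 4.01, while the best open-source model manages only 0.96. The three main failure modes (task limit exceeded, which accounts for 67.9% of knowledge-graph failures; format errors, which account for 53.3% of database failures; and invalid actions) map directly onto the risks of deploying a Beancount write-back agent on a real ledger.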

ai
llm
machine-learning
automation
MemGPT: Virtual Context Management for LLM Agents
·mike

MemGPT applies OS-style virtual memory paging to LLMs, using three-tier storage — working memory, recall, and archival — to give agents persistent recall across sessions; on multi-session chat benchmarks, MemGPT with GPT-4 achieves 92.5% accuracy versus a 32.1% fixed-context baseline.

ai
llm
machine-learning
automation