Beancount.io

65 tagged with "Beancount"

Beancount ledger format, tooling, and ecosystem research

Multiagent LLM Debate: Real Accuracy Gains, Uncontrolled Compute, and Collective Delusion
·mike

A close reading of Du et al.'s ICML 2024 multiagent debate paper — which reports 14.8-point accuracy gains on arithmetic — alongside 2025 rebuttals showing equal-budget single agents match debate performance, and an analysis of why Collective Delusion (65% of debate failures) poses specific risks for AI-assisted ledger commits.

ai
llm
machine-learning
automation
+2
LLMs Are Not Useful for Time Series Forecasting: What NeurIPS 2024 Means for Finance AI
·mike

A NeurIPS 2024 Spotlight paper ablates three LLM-based time series forecasting methods — OneFitsAll, Time-LLM, and CALF — and finds that removing the language model improves accuracy in most cases, with up to a 1,383× training speedup. For finance AI applications like Beancount balance prediction, lightweight purpose-built models consistently beat repurposed LLMs.

ai
machine-learning
forecasting
data-science
+3
Fine-Tuning vs. RAG: Why Retrieval Wins for Injecting New Knowledge into LLMs
·mike

An empirical comparison of RAG and unsupervised fine-tuning across 7B-parameter LLMs shows RAG achieving 0.875+ accuracy on post-cutoff facts while fine-tuning plateaus at 0.504, with direct implications for Beancount agent design and any system requiring frequent knowledge updates.

ai
llm
machine-learning
data-science
+3
IRCoT: Interleaving Retrieval with Chain-of-Thought for Multi-Step QA
·mike

IRCoT interleaves BM25 retrieval with each step of a chain-of-thought reasoning loop, achieving +11.3 retrieval recall and +7.1 F1 on HotpotQA over one-step RAG, and shows that a 3B model can beat GPT-3 175B when the retrieval strategy is right.

ai
llm
machine-learning
automation
+3
FLARE: Active Retrieval Augmented Generation
·mike

FLARE (EMNLP 2023) improves on standard RAG by triggering retrieval mid-generation using token-probability confidence thresholds, reaching 51.0 EM on 2WikiMultihopQA versus 39.4 for single-retrieval — but calibration failures in instruction-tuned chat models limit its reliability for production finance agents.

ai
machine-learning
llm
retrieval-augmented-generation
+3
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
·mike

Lewis et al.'s NeurIPS 2020 paper introduced the hybrid RAG architecture—a BART-large generator paired with a FAISS-indexed retriever over 21 million Wikipedia passages—achieving 44.5 EM on Natural Questions and establishing the parametric/non-parametric split that now underlies most production AI systems. This review covers RAG-Sequence vs. RAG-Token trade-offs, the retrieval collapse failure mode, and what stale indexes mean for financial AI built on append-only Beancount ledgers.

ai
machine-learning
llm
data-science
+2
FinQA: The Benchmark Measuring AI Numerical Reasoning on Financial Reports
·mike

FinQA (EMNLP 2021) built 8,281 QA pairs from S&P 500 earnings reports requiring multi-step arithmetic programs. Neural models scored 61% at release versus 91% for human experts; accuracy collapses to 22% on three-or-more-step programs. The failure modes — domain constants, cross-modality grounding, chain length — map directly to the challenges Beancount agents face today.

ai
machine-learning
llm
finance
+2
DSPy: Replacing Brittle Prompt Engineering with Compiled LLM Pipelines
·mike

DSPy replaces hand-crafted prompt strings with declarative signatures and a metric-driven compiler—boosting Llama2-13b from 9.4% to 46.9% on GSM8K math reasoning and offering a more maintainable path for production finance AI pipelines.

ai
llm
machine-learning
automation
+2
LATS (Language Agent Tree Search): Unifying Reasoning, Acting, and Planning in a Single Framework
·mike

LATS (Language Agent Tree Search, ICML 2024) unifies ReAct, Tree of Thoughts, and Reflexion in a single MCTS framework, achieving 92.7% pass@1 on HumanEval with GPT-4. For Git-based Beancount ledgers, the state-restoration requirement that limits LATS in production environments is very easy to satisfy.

ai
llm
machine-learning
automation
+3
Self-RAG: Adaptive Retrieval and Self-Critique for LLMs
·mike

Self-RAG (ICLR 2024 Oral) trains a language model to decide when to retrieve and then grade its own results using four reflection tokens — reaching 55.8% on PopQA and 80.2 FactScore on biographies while outperforming ChatGPT on five benchmarks. Analysis covers the mechanism, ablation results, reproducibility limits, and implications for finance AI agents over Beancount ledgers.

ai
machine-learning
llm
technology
+3
Voyager: Skill Libraries as the Foundation for Lifelong AI Agent Learning
·mike

Voyager, a GPT-4-powered Minecraft agent from NVIDIA and Caltech, demonstrates that a persistent code skill library enables genuine lifelong learning without fine-tuning — discovering 3.3× more items than prior state-of-the-art. The pattern maps directly onto long-horizon Beancount ledger automation, though financial correctness demands staging layers that game sandboxes never require.

ai
llm
machine-learning
automation
+3
HippoRAG: Neurobiologically Inspired Long-Term Memory for LLMs
·mike

HippoRAG (NeurIPS 2024) builds a knowledge graph from OpenIE triples and applies Personalized PageRank at query time, reaching 89.1% Recall@5 on 2WikiMultiHopQA versus 68.2% for ColBERTv2—with direct implications for querying complex financial ledgers across multi-year transaction histories.

llm
ai
machine-learning
beancount
+3
Showing 37–48 of 65 posts