Beancount.io

40 posts tagged with "Data Science"

Data science methods applied to financial datasets and accounting workflows

Fusion-in-Decoder: How Multi-Passage Retrieval Improves Generative QA
·mike

Izacard and Grave's FiD architecture independently encodes retrieved passages then fuses them in the decoder, outperforming RAG-Sequence by 4–11 points on NQ and TriviaQA. This post examines the design and its implications for Beancount ledger QA, where multi-entry synthesis across transactions is the norm.

ai
machine-learning
llm
beancount
+2
LLMs Are Not Useful for Time Series Forecasting: What NeurIPS 2024 Means for Finance AI
·mike

A NeurIPS 2024 Spotlight paper ablates three LLM-based time series forecasting methods — OneFitsAll, Time-LLM, and CALF — and finds that removing the language model improves accuracy in most cases, with up to a 1,383× training speedup. For finance AI applications like Beancount balance prediction, lightweight purpose-built models consistently beat repurposed LLMs.

ai
machine-learning
forecasting
data-science
+3
TAT-LLM: Fine-Tuned LLaMA 2 for Discrete Reasoning over Financial Tables and Text
·mike

TAT-LLM fine-tunes LLaMA 2 7B with LoRA on financial table-text QA benchmarks and reaches 64.60% EM on FinQA, beating GPT-4's 63.91%, by decomposing reasoning into deterministic Extract-Reason-Execute steps that eliminate arithmetic errors.

llm
ai
machine-learning
finance
+3
Fine-Tuning vs. RAG: Why Retrieval Wins for Injecting New Knowledge into LLMs
·mike

Empirical comparison of RAG vs. unsupervised fine-tuning across 7B-parameter LLMs shows RAG achieves 0.875+ accuracy on post-cutoff facts while fine-tuning plateaus at 0.504 — with direct implications for Beancount agent design and any system requiring frequent knowledge updates.

ai
llm
machine-learning
data-science
+3
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
·mike

Lewis et al.'s NeurIPS 2020 paper introduced the hybrid RAG architecture—a BART-large generator paired with a FAISS-indexed retriever over 21 million Wikipedia passages—achieving 44.5 EM on Natural Questions and establishing the parametric/non-parametric split that now underlies most production AI systems. This review covers RAG-Sequence vs. RAG-Token trade-offs, the retrieval collapse failure mode, and what stale indexes mean for financial AI built on append-only Beancount ledgers.

ai
machine-learning
llm
data-science
+2
MultiHiertt: Benchmarking Numerical Reasoning Over Multi-Hierarchical Financial Tables
·mike

MultiHiertt (ACL 2022) introduces 10,440 QA pairs from real financial reports averaging 3.89 hierarchical tables each; state-of-the-art models score 38% F1 versus 87% for humans, with a 15-point penalty for cross-table questions — quantifying the retrieval gap finance AI must close.

ai
machine-learning
llm
financial-reporting
+3
ConvFinQA: Multi-Turn Financial QA and the 21-Point Gap Between Models and Human Experts
·mike

ConvFinQA (EMNLP 2022) extends FinQA into multi-turn conversation over S&P 500 earnings reports, finding that the best fine-tuned model achieves 68.9% execution accuracy versus 89.4% for human experts—and drops to 52.4% on hybrid multi-aspect conversations where models must carry numerical context across different financial topics.

ai
llm
machine-learning
finance
+3
TAT-QA: Hybrid Table-Text QA Benchmark for Financial Annual Report Reasoning
·mike

TAT-QA is a 16,552-question benchmark over hybrid table-plus-text financial report contexts that showed evidence grounding — not arithmetic — is the core bottleneck in finance AI; by 2024, fine-tuned 7B LLMs reached 83% F1, closing most of the gap against a 91% human ceiling.

ai
machine-learning
llm
finance
+2
FinanceBench: Why Vector-Store RAG Fails on Real Financial Documents
·mike

FinanceBench evaluates 16 AI configurations against 10,231 questions from real SEC filings; shared-vector-store RAG answers correctly only 19% of the time, and even GPT-4-Turbo with the oracle passage reaches just 85% accuracy — showing that numerical reasoning, not retrieval, is the binding constraint for enterprise finance AI.

ai
llm
machine-learning
financial-reporting
+3
Self-Consistency: Majority Voting Improves Chain-of-Thought Accuracy
·mike

Self-consistency replaces greedy chain-of-thought decoding with a majority vote over N sampled reasoning paths, raising GPT-3's accuracy on GSM8K by 17.9 percentage points without additional training, and applies directly to multi-step financial calculations where a single decoding pass is unreliable.

ai
llm
machine-learning
automation
+3
PAL: Program-Aided Language Models for Reliable Financial Arithmetic
·mike

PAL (Program-Aided Language Models) achieves a +38pp accuracy gain over chain-of-thought on arithmetic-heavy tasks by delegating computation to a Python interpreter — a directly applicable architecture for reliable Beancount ledger queries and finance AI.

ai
llm
machine-learning
beancount
+3
Can LLMs Reason Over Tabular Data? What Four Benchmarks Tell Us About Finance AI
·mike

Four 2024–2025 benchmarks show GPT-4 scoring 42% on real-world table QA versus 86% for humans, with complex aggregations collapsing to 19.6%—and Beancount's native syntax sits at the worst-performing end of the serialization hierarchy for LLM input.

ai
llm
beancount
data-science
+3