40 tagged with "Data Science"
Data science methods applied to financial datasets and accounting workflows
Fusion-in-Decoder: How Multi-Passage Retrieval Improves Generative QA
Izacard and Grave's FiD architecture encodes each retrieved passage independently, then fuses them in the decoder, outperforming RAG-Sequence by 4–11 points on NQ and TriviaQA. This post examines the design and its implications for Beancount ledger QA, where multi-entry synthesis across transactions is the norm.
LLMs Are Not Useful for Time Series Forecasting: What NeurIPS 2024 Means for Finance AI
A NeurIPS 2024 Spotlight paper ablates three LLM-based time series forecasting methods — OneFitsAll, Time-LLM, and CALF — and finds that removing the language model improves accuracy in most cases, with up to a 1,383× training speedup. For finance AI applications like Beancount balance prediction, lightweight purpose-built models consistently beat repurposed LLMs.
TAT-LLM: Fine-Tuned LLaMA 2 for Discrete Reasoning over Financial Tables and Text
TAT-LLM fine-tunes LLaMA 2 7B with LoRA on financial table-text QA benchmarks and reaches 64.60% EM on FinQA, beating GPT-4's 63.91%, by decomposing reasoning into deterministic Extract-Reason-Execute steps that eliminate arithmetic errors.
Fine-Tuning vs. RAG: Why Retrieval Wins for Injecting New Knowledge into LLMs
Empirical comparison of RAG vs. unsupervised fine-tuning across 7B-parameter LLMs shows RAG achieves 0.875+ accuracy on post-cutoff facts while fine-tuning plateaus at 0.504 — with direct implications for Beancount agent design and any system requiring frequent knowledge updates.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Lewis et al.'s NeurIPS 2020 paper introduced the hybrid RAG architecture—a BART-large generator paired with a FAISS-indexed retriever over 21 million Wikipedia passages—achieving 44.5 EM on Natural Questions and establishing the parametric/non-parametric split that now underlies most production AI systems. This review covers RAG-Sequence vs. RAG-Token trade-offs, the retrieval collapse failure mode, and what stale indexes mean for financial AI built on append-only Beancount ledgers.
MultiHiertt: Benchmarking Numerical Reasoning Over Multi-Hierarchical Financial Tables
MultiHiertt (ACL 2022) introduces 10,440 QA pairs from real financial reports averaging 3.89 hierarchical tables each; state-of-the-art models score 38% F1 versus 87% for humans, with a 15-point penalty for cross-table questions — quantifying the retrieval gap finance AI must close.
ConvFinQA: Multi-Turn Financial QA and the 21-Point Gap Between Models and Human Experts
ConvFinQA (EMNLP 2022) extends FinQA into multi-turn conversation over S&P 500 earnings reports, finding that the best fine-tuned model achieves 68.9% execution accuracy versus 89.4% for human experts—and drops to 52.4% on hybrid multi-aspect conversations where models must carry numerical context across different financial topics.
TAT-QA: Hybrid Table-Text QA Benchmark for Financial Annual Report Reasoning
TAT-QA is a 16,552-question benchmark over hybrid table-plus-text financial report contexts that showed evidence grounding — not arithmetic — is the core bottleneck in finance AI; by 2024, fine-tuned 7B LLMs reached 83% F1, closing most of the gap against a 91% human ceiling.
FinanceBench: Why Vector-Store RAG Fails on Real Financial Documents
FinanceBench evaluates 16 AI configurations against 10,231 questions from real SEC filings; shared-vector-store RAG answers correctly only 19% of the time, and even GPT-4-Turbo with the oracle passage reaches just 85% accuracy — showing that numerical reasoning, not retrieval, is the binding constraint for enterprise finance AI.
Self-Consistency: Majority Voting Improves Chain-of-Thought Accuracy
Self-consistency replaces greedy chain-of-thought decoding with a majority vote over N sampled reasoning paths, raising GPT-3's accuracy on GSM8K by 17.9 percentage points without additional training. It applies directly to multi-step financial calculations, where a single decoding pass from the model is unreliable.
PAL: Program-Aided Language Models for Reliable Financial Arithmetic
PAL (Program-Aided Language Models) achieves a +38pp accuracy gain over chain-of-thought on arithmetic-heavy tasks by delegating computation to a Python interpreter — a directly applicable architecture for reliable Beancount ledger queries and finance AI.
Can LLMs Reason Over Tabular Data? What Four Benchmarks Tell Us About Finance AI
Four 2024–2025 benchmarks show GPT-4 scoring 42% on real-world table QA versus 86% for humans, with complex aggregations collapsing to 19.6%—and Beancount's native syntax sits at the worst-performing end of the serialization hierarchy for LLM input.