40 tagged with "Data Science"

Data science methods applied to financial datasets and accounting workflows

June 24, 2026·mike

AnoLLM: Fine-Tuning LLMs for Tabular Anomaly Detection in Financial Data

AnoLLM (ICLR 2025) reformulates tabular anomaly detection as LLM density estimation — fine-tuning on normal rows and scoring by negative log-likelihood. It outperforms classical methods on mixed-type fraud datasets but offers no edge on purely numerical data, with real implications for detecting anomalies in Beancount ledger entries.

TableMaster: Adaptive Reasoning for Table Understanding with LLMs

TableMaster is a prompting-only pipeline that reaches 78.13% on WikiTQ with GPT-4o-mini—13 points above Chain-of-Table—by combining table-of-focus extraction, semantic verbalization, and adaptive switching between text and symbolic reasoning. Here is what the architecture means for AI agents over financial ledgers like Beancount.

Zero-Shot Anomaly Detection with LLMs: How GPT-4 Performs on Tabular Data

GPT-4 achieves 74.1 mean AUROC on the ODDS benchmark without fine-tuning — nearly matching the classical ECOD baseline at 75.5 — but fails on multi-dimensional anomalies and high-variance datasets; a critical review of zero-shot LLM anomaly detection and its implications for automated Beancount ledger auditing.

DocFinQA: Long-Context Financial Reasoning on Full SEC Filings

DocFinQA replaces FinQA's curated 700-word passages with full 123,000-word SEC filings, exposing a 175× context increase that nearly halves GPT-4 accuracy on long documents. Retrieval pipelines fail to surface the right chunk 45% of the time at HR@3 — and long-context models are not a substitute.

GAIA Benchmark: Measuring What Frontier AI Agents Can Actually Do

GAIA benchmarks 466 real-world tasks across three difficulty levels; frontier agents reached 74.55% in mid-2026 versus 92% for humans, and the remaining Level 3 gap maps directly to the multi-step coordination challenges in automated Beancount ledger workflows.

OSWorld: Desktop AI Agents Succeed on 12% of Tasks Where Humans Succeed on 72%

OSWorld (NeurIPS 2024) benchmarks multimodal AI agents on 369 real desktop tasks across Ubuntu, Windows, and macOS — finding a 60-percentage-point gap between the best model (12.24%) and human performance (72.36%), with 75% of failures traced to visuomotor grounding errors rather than reasoning failures.

Chain-of-Table: Evolving Tables in the LLM Reasoning Chain

Chain-of-Table (ICLR 2024) improves LLM tabular reasoning by evolving the table itself as the intermediate state — achieving 67.31% on WikiTQ vs. 61.48% for prior baselines, with a +10.25 point advantage on tables exceeding 4,000 tokens and direct applicability to Beancount ledger query agents.

TAPAS: Weakly Supervised Table QA Without SQL, and What It Means for Beancount

TAPAS (Google Research, ACL 2020) answers table questions by selecting cells and applying scalar aggregations — no SQL generated. This post analyzes the architecture, its 12-point SQA accuracy gain, and why the cell-selection paradigm fits small Beancount ledger queries but breaks down at scale.

GraphRAG: From Local to Global Query-Focused Summarization

Microsoft's GraphRAG builds a Leiden-partitioned entity graph over a text corpus and precomputes community summaries to answer global sensemaking questions that standard vector RAG cannot handle — but a 2025 bias audit shows its 72–83% win rates collapse after correcting for position and length artifacts in LLM-as-judge evaluation.

InvestorBench: Benchmarking LLM Agents on Financial Trading Decisions

InvestorBench (ACL 2025) tests 13 LLM backbones on backtested stock, crypto, and ETF trading using cumulative return and Sharpe ratio — not QA accuracy. Qwen2.5-72B tops the stock leaderboard at 46.15% CR; finance-tuned models backfire on equities. Model size predicts performance more reliably than domain fine-tuning.

M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?

M3MAD-Bench stress-tests Multi-Agent Debate across 9 models, 5 domains, and vision-language settings, finding that Collective Delusion causes 65% of failures, adversarial debate cuts accuracy by up to 12.8%, and Self-Consistency typically matches debate accuracy at lower token cost.

Atlas: Joint Retriever-Reader Pre-Training Beats 540B-Parameter LLMs with 11B Parameters

Atlas (JMLR 2023) achieves 42.4% accuracy on Natural Questions with only 64 training examples—beating PaLM 540B by 3 points using 11B parameters—by jointly pre-training a Contriever-based dense retriever with a T5 Fusion-in-Decoder reader. Analysis covers retrieval accuracy limits, 587GB index infrastructure costs, and implications for Beancount ledger QA systems.

machine-learning

llm

data-science

Showing 13–24 of 40 posts

Prev2 / 4Next