Beancount.io

40 posts tagged with "Data Science"

Data science methods applied to financial datasets and accounting workflows

FinRAGBench-V: Multimodal RAG with Visual Citations in the Financial Domain
·mike

FinRAGBench-V (EMNLP 2025) is the first large-scale benchmark for multimodal RAG with visual citations in finance, covering 112K+ document pages and 1,394 human-annotated QA pairs. Top models achieve only 20–61% block-level citation recall, and multimodal retrieval outperforms text-only by nearly 50 percentage points.

ai
llm
machine-learning
finance
WildToolBench: Why No LLM Exceeds 15% Session Accuracy in Real-World Tool Use
·mike

WildToolBench (ICLR 2026) evaluates 57 LLMs on 1,024 tasks drawn from real user behavior — no model exceeds 15% session accuracy, with compositional orchestration, hidden intent, and instruction transitions as the three sharpest failure modes.

ai
llm
automation
machine-learning
LLM Confidence and Calibration: A Survey of What the Research Actually Shows
·mike

A systematic survey of LLM confidence estimation and calibration methods—white-box logit approaches, consistency-based SelfCheckGPT, and semantic entropy—reveals that verbalized confidence scores from GPT-4 achieve only ~62.7% AUROC, barely above chance, with direct implications for deploying uncertainty-aware agents in finance and accounting.

llm
ai
machine-learning
trust
FinToolBench: Evaluating LLM Agents on Real-World Financial Tool Use
·mike

FinToolBench pairs 760 live financial API tools with 295 executable queries to test LLM agents on real financial tasks. It finds that GPT-4o's conservative 22.7% invocation rate yields higher answer quality (CSS 0.670) than Qwen3-8B's aggressive 87.1% TIR, while intent mismatch exceeds 50% across all tested models.

ai
llm
automation
machine-learning
OmniEval: Omnidirectional RAG Evaluation Benchmark for the Financial Domain
·mike

OmniEval (EMNLP 2025) benchmarks RAG systems across 5 task types × 16 financial topics using 11.4k auto-generated test cases. The best systems achieve only 36% numerical accuracy — concrete evidence that RAG pipelines need validation layers before writing to structured financial ledgers.

ai
machine-learning
llm
finance
LLM Anomaly Detection Survey (NAACL 2025): Strong Taxonomy, Absent Tabular Coverage
·mike

A critical reading of Xu and Ding's NAACL 2025 survey on LLM-based anomaly and OOD detection: the detection-vs-generation taxonomy holds up, but the near-total absence of tabular coverage means financial AI practitioners must themselves adapt insights drawn from vision models.

ai
llm
machine-learning
fraud-detection
Found in the Middle: Calibrating Positional Attention Bias Improves Long-Context RAG
·mike

A training-free inference-time calibration subtracts positional bias from LLM attention weights, recovering up to 15 percentage points of RAG accuracy when retrieved documents are buried mid-context. The post also covers what this means for finance-specific agent pipelines.

ai
llm
machine-learning
data-science
Fin-RATE: How LLMs Fail at Cross-Period and Cross-Entity Financial Analysis
·mike

Fin-RATE benchmarks 17 LLMs on 7,500 expert-curated QA pairs from 2,472 SEC filings, revealing an 18.60% accuracy collapse under longitudinal tracking and a 54-point drop for finance-specialized Fin-R1 on cross-entity tasks — with the retrieval pipeline, not the backbone model, as the binding bottleneck.

llm
ai
machine-learning
analytics
FinDER: Real Analyst Queries Expose a 74% Recall Gap in Financial RAG
·mike

FinDER benchmarks RAG on 5,703 real hedge fund analyst queries against S&P 500 10-K filings; E5-Mistral achieves only 25.95% context recall, and abbreviation-heavy queries cost 8.2 precision points — evidence that query normalization, not better embeddings, is the first fix for finance AI pipelines.

ai
llm
machine-learning
finance
Lost in the Middle: Position Bias in LLMs and Its Impact on Finance AI
·mike

The TACL 2024 paper by Liu et al. shows LLMs perform up to 20 points worse on information buried in the middle of long contexts — a U-shaped degradation affecting every tested model including Claude-1.3-100K — with concrete implications for how RAG pipelines should order retrieved passages in finance and accounting applications.

llm
ai
machine-learning
data-science
AD-LLM Benchmark: GPT-4o Hits 0.93+ AUROC Zero-Shot for Text Anomaly Detection
·mike

AD-LLM benchmarks GPT-4o and Llama 3.1 8B across three anomaly detection roles — zero-shot detector, data augmenter, and model selector — on five NLP datasets; GPT-4o reaches AUROC 0.93–0.99 zero-shot, but LLM-based model selection remains unreliable, with direct implications for financial audit AI.

llm
ai
machine-learning
data-science
CausalTAD: Causal Column Ordering for LLM Tabular Anomaly Detection
·mike

CausalTAD improves LLM-based tabular anomaly detection by reordering table columns to respect causal dependencies before serialization, lifting average AUC-ROC from 0.803 to 0.834 over AnoLLM on mixed-type benchmarks — with direct implications for detecting anomalies in structured ledger data.

llm
ai
machine-learning
fraud-detection
Showing 1–12 of 40 posts