8 tagged with "Queries"

Query generation, table reasoning, and structured data retrieval for financial AI

June 22, 2026·mike

TableMaster: Adaptive Reasoning for Table Understanding with LLMs

TableMaster is a prompting-only pipeline that reaches 78.13% on WikiTQ with GPT-4o-mini—13 points above Chain-of-Table—by combining table-of-focus extraction, semantic verbalization, and adaptive switching between text and symbolic reasoning. Here is what the architecture means for AI agents over financial ledgers like Beancount.

Chain-of-Table: Evolving Tables in the LLM Reasoning Chain

Chain-of-Table (ICLR 2024) improves LLM tabular reasoning by evolving the table itself as the intermediate state — achieving 67.31% on WikiTQ vs. 61.48% for prior baselines, with a +10.25 point advantage on tables exceeding 4,000 tokens and direct applicability to Beancount ledger query agents.

TableLlama: Can a 7B Open Model Match GPT-4 on Table Understanding?

TableLlama fine-tunes Llama 2 (7B) on 2.6M table-task examples and beats GPT-4 on structural tasks like column type annotation (F1 94 vs 32), but falls 33 points short on WikiTQ compositional reasoning — a calibrated benchmark for what 7B open models can and cannot do in finance AI today.

TAPAS: Weakly Supervised Table QA Without SQL, and What It Means for Beancount

TAPAS (Google Research, ACL 2020) answers table questions by selecting cells and applying scalar aggregations — no SQL generated. This post analyzes the architecture, its 12-point SQA accuracy gain, and why the cell-selection paradigm fits small Beancount ledger queries but breaks down at scale.

MAC-SQL: Multi-Agent Collaborative Text-to-SQL

MAC-SQL (COLING 2025) uses three specialized agents — Selector for schema reduction, Decomposer for question decomposition, and Refiner for execution-guided SQL correction — to reach 59.59% execution accuracy on the BIRD benchmark; ablation shows the Refiner contributes the most (+4.63 points), with direct implications for Beancount ledger query generation.

DIN-SQL: Decomposed In-Context Learning for Text-to-SQL

DIN-SQL (NeurIPS 2023) decomposes text-to-SQL into schema linking, complexity classification, and SQL generation stages, lifting GPT-4 from 67.4% to 85.3% execution accuracy on Spider without fine-tuning — and the same decomposition strategy maps directly onto natural language interfaces for Beancount's BQL query language.

BIRD Benchmark: The Real-Database Gap in LLM Text-to-SQL

The BIRD benchmark (NeurIPS 2023) tests LLMs on 95 real databases — GPT-4 reaches only 54.89% execution accuracy with domain hints and 34.88% without, a 20-point gap that directly shapes what a natural-language BQL interface for Beancount would need to solve.

GraphRAG: From Local to Global Query-Focused Summarization

Microsoft's GraphRAG builds a Leiden-partitioned entity graph over a text corpus and precomputes community summaries to answer global sensemaking questions that standard vector RAG cannot handle — but a 2025 bias audit shows its 72–83% win rates collapse after correcting for position and length artifacts in LLM-as-judge evaluation.

llm

machine-learning

beancount