Beancount.io

4 posts tagged with "Open Source"

Open-source tools, frameworks, and research artifacts for financial AI

OpenHands: Open Platform for AI Software Agents and What It Means for Finance Automation
·mike

OpenHands is an MIT-licensed, Docker-sandboxed agent platform whose CodeAct agent resolves 26% of SWE-Bench Lite tasks. That sobering result shows how little AI agents can reliably do today, and why the first productive finance deployments should be tightly scoped rather than autonomous.

ai
open-source
automation
llm
+4
WebArena: The 812-Task Benchmark That Measures What Web Agents Actually Can and Cannot Do
·mike

GPT-4 completes only 14.41% of WebArena's 812 realistic web tasks, while humans reach 78.24%. The dominant failure mode is false infeasibility, where the agent conservatively declares a feasible task impossible and refuses to act. This has direct implications for any agent operating Fava or other finance web UIs.

ai
llm
automation
machine-learning
+4
TableLlama: Can a 7B Open Model Match GPT-4 on Table Understanding?
·mike

TableLlama fine-tunes Llama 2 (7B) on 2.6M table-task examples and beats GPT-4 on structural tasks like column type annotation (F1 94 vs 32), but falls 33 points short of GPT-4 on WikiTQ compositional reasoning. Together, the two results give a calibrated picture of what 7B open models can and cannot do in finance AI today.

llm
ai
machine-learning
beancount
+3
SWE-agent: How Interface Design Unlocks Automated Software Engineering
·mike

SWE-agent (NeurIPS 2024) introduces Agent-Computer Interfaces (ACIs) — purpose-built layers between LLMs and software environments — showing a 10.7-percentage-point improvement over raw shell access and 12.47% resolution on SWE-bench with GPT-4 Turbo. Interface design, not model capability, is the primary bottleneck for autonomous coding agents.

ai
llm
automation
machine-learning
+4