FinTagging: the ultimate LLM benchmark for XBRL tagging and US-GAAP concept linking

AI Botpress · Oct 9, 2025


Benchmarking LLMs on real XBRL facts and concepts

We present FinTagging, a table-aware, end-to-end benchmark that evaluates LLMs on extracting financial facts and aligning them to the full US-GAAP taxonomy. By decoupling the task into FinNI (numeric identification) and FinCL (concept linking), we deliver realistic, zero-shot evaluation over both narrative text and complex tables. Results show strong extraction but weak fine-grained alignment, underscoring the need for schema-aware reasoning, structure-aware context, and task-specific adaptation.

Key points

Over 2 million companies publish financial reports, yet 2023 saw 6,500 tagging errors across 33,000 SEC filings, highlighting the stakes of accurate XBRL tagging.
FinTagging reframes XBRL tagging into two subtasks—FinNI for fact extraction and FinCL for taxonomy-driven concept alignment—covering the 10k+ US-GAAP label space.
The benchmark draws from 30 SEC 10-K filings (2024), totaling 81,325 facts (69,451 standard; 11,874 extensions), 76,835 sentences, and 5,450 financial tables.
We curated 3,354 sentences and 3,245 table sequences, yielding 52,740 annotated numerical entities across 5 item types linked to 2,288 unique US-GAAP concepts.
FinNI-eval contains 6,599 samples with structured JSON outputs; FinCL-eval spans 52,572 query–answer pairs against a taxonomy of 17,388 unique concepts.
On macro-averaged F1, DeepSeek-V3 led with 0.0582, followed by GPT-4o at 0.0508 and DeepSeek-R1-Distill-Qwen-32B at 0.0266.
On micro-averaged F1, DeepSeek-V3 was again strongest (0.1132), with GPT-4o competitive (0.0860); neither was fine-tuned for the task.
In FinNI, larger models dominated; smaller Qwen2.5 variants scored below 0.1 F1, and the financial-domain Fino1-8B underperformed without task-specific training.
In FinCL, DeepSeek-V3 achieved the top accuracy (0.1715), while most models fell below 0.11, reflecting the difficulty of fine-grained concept disambiguation.
An ablation showed that framing tagging as flat extreme classification yields zero precision, recall, and F1, while structure-aware retrieval (SAC) beat fixed windows in FinCL (e.g., table Acc@100 of 0.2353 vs. 0.0050).
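The gap between the macro and micro scores above is easier to read with a small worked example. The sketch below uses the standard textbook definitions of macro- and micro-averaged F1; the benchmark's exact scoring script is not described here, so treat the function names and the toy labels as assumptions made for illustration only.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision or recall is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_micro_f1(gold, pred):
    """Return (macro_f1, micro_f1) for single-label predictions.

    A prediction of None means the model produced no tag for that fact.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1          # the gold concept was missed
            if p is not None:
                fp[p] += 1      # a wrong concept was asserted
    labels = set(gold) | {p for p in pred if p is not None}
    # Macro: average per-concept F1, so rare concepts weigh as much as common ones.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool all counts first, so frequent concepts dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Toy data (not from the benchmark): one frequent concept, two rare ones.
gold = ["Revenues"] * 8 + ["NetIncomeLoss", "InterestExpense"]
pred = ["Revenues"] * 8 + ["Revenues", None]
macro, micro = macro_micro_f1(gold, pred)
print(f"macro-F1={macro:.4f} micro-F1={micro:.4f}")
```

In this toy case the model gets every frequent fact right and every rare one wrong, so micro-F1 stays high while macro-F1 collapses. A large label space with a long tail of rarely used concepts produces the same pattern, which is one plausible reading of why the reported macro scores sit below the micro scores.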

Takeaway

Want cleaner XBRL without losing weekends to spreadsheets? Start with a two-step pipeline: let LLMs extract numbers (FinNI) and then sanity-check concept linking (FinCL) with structure-aware context and retrieval. Keep a human-in-the-loop for GAAP hair-splitting, because even top models confuse lookalike tags, and no, shouting thousands of flat labels at them won’t help. Pilot with your filings, track macro and micro F1, and only then scale—otherwise you’re just automating the creation of better-looking mistakes.
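As a rough illustration of that two-step pipeline, here is a minimal sketch. Everything in it is a stand-in: a regex plays the role of the LLM extractor (FinNI), token-overlap retrieval over a four-entry toy taxonomy plays the role of concept linking (FinCL), and the `margin` threshold for human review is a hypothetical knob, not something the benchmark defines.

```python
import re

# Toy taxonomy (assumption): concept name -> label text. The real US-GAAP
# taxonomy has thousands of concepts and far richer metadata.
TAXONOMY = {
    "Revenues": "revenues total revenue",
    "NetIncomeLoss": "net income loss",
    "InterestExpense": "interest expense",
    "IncomeTaxExpenseBenefit": "income tax expense benefit",
}

NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")

def extract_numbers(sentence):
    """Step 1 (stand-in for FinNI): find numeric mentions in the text."""
    return [m.group() for m in NUMBER.finditer(sentence)]

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def rank_concepts(context, k=3):
    """Step 2 (stand-in for FinCL): rank concepts by Jaccard token overlap."""
    ctx = tokens(context)
    scored = []
    for name, label in TAXONOMY.items():
        lab = tokens(label)
        union = ctx | lab
        score = len(ctx & lab) / len(union) if union else 0.0
        scored.append((score, name))
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, name breaks ties
    return scored[:k]

def tag(sentence, margin=0.15):
    """Tag each number and flag it for human review when the top two
    candidates score within `margin` of each other (lookalike tags)."""
    ranked = rank_concepts(sentence)
    (best_score, best), (second_score, _) = ranked[0], ranked[1]
    return [
        {
            "value": value,
            "concept": best,
            "score": round(best_score, 3),
            "needs_review": best_score - second_score < margin,
        }
        for value in extract_numbers(sentence)
    ]

print(tag("Interest expense was 1,250 for the year."))
print(tag("Total expense was 40."))
```

The first sentence links cleanly to one concept, while the second is ambiguous between several near-equal candidates and is routed to a reviewer. In a real pilot the regex and overlap scorer would be replaced by model calls and a proper retriever, but the review gate on close candidates is the part worth keeping.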

Sources

Revolutionizing Financial Data Extraction: The Ultimate Benchmark for LLM-Driven XBRL Tagging and Concept Linking
