Benchmarking LLMs on real XBRL facts and concepts
We present FinTagging, a table-aware, end-to-end benchmark that evaluates LLMs on extracting financial facts and aligning them to the full US-GAAP taxonomy. By decoupling the task into FinNI (numeric identification) and FinCL (concept linking), we deliver realistic, zero-shot evaluation over both narrative text and complex tables. Results show strong extraction but weak fine-grained alignment, underscoring the need for schema-aware reasoning, structure-aware context, and task-specific adaptation.
Key points
- Over 2 million companies publish financial reports, yet 2023 saw 6,500 tagging errors across 33,000 SEC filings, highlighting the stakes of accurate XBRL tagging.
- FinTagging reframes XBRL tagging into two subtasks—FinNI for fact extraction and FinCL for taxonomy-driven concept alignment—covering the 10k+ US-GAAP label space.
- The benchmark draws from 30 SEC 10-K filings (2024), totaling 81,325 facts (69,451 standard; 11,874 extensions), 76,835 sentences, and 5,450 financial tables.
- We curated 3,354 sentences and 3,245 table sequences, yielding 52,740 annotated numerical entities across 5 item types linked to 2,288 unique US-GAAP concepts.
- FinNI-eval contains 6,599 samples with structured JSON outputs; FinCL-eval spans 52,572 query–answer pairs against a taxonomy of 17,388 unique concepts.
- At the macro level, DeepSeek-V3 led with macro-F1 0.0582, followed by GPT-4o at 0.0508 and DeepSeek-R1-Distill-Qwen-32B at 0.0266.
- Micro-level performance was strongest for DeepSeek-V3 (micro-F1 0.1132) and competitive for GPT-4o (0.0860) without fine-tuning.
- In FinNI, larger models dominated; smaller Qwen2.5 variants scored below 0.1 F1, and the financial-domain Fino1-8B underperformed without task-specific training.
- In FinCL, DeepSeek-V3 achieved the top accuracy (0.1715), while most models fell below 0.11, reflecting the difficulty of fine-grained concept disambiguation.
- An ablation showed that treating the task as flat extreme classification over the full label set yields zero precision, recall, and F1, while structure-aware retrieval (SAC) beat fixed windows in FinCL (e.g., table Acc@100 of 0.2353 vs. 0.0050).
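The gap between the macro and micro scores above is easier to read with the two averages side by side. The sketch below uses the standard textbook definitions, not necessarily the benchmark's own scoring script; the function names and the toy labels are assumptions made for illustration.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision and recall are both undefined or zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_micro_f1(gold, pred):
    """gold and pred are parallel lists of concept labels; pred may hold None for a missed fact.

    Assumption: macro-F1 averages over every label seen in gold or pred.
    """
    labels = set(gold) | {p for p in pred if p is not None}
    counts = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        counts[label] = (tp, fp, fn)

    # Macro: every concept counts equally, however rare it is.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts first, so frequent concepts dominate.
    micro = f1(*(sum(col) for col in zip(*counts.values())))
    return macro, micro


# Toy data (hypothetical): one frequent concept tagged well, two rare ones missed.
gold = ["Revenues", "Revenues", "Revenues", "Assets", "Liabilities"]
pred = ["Revenues", "Revenues", "Revenues", "Liabilities", None]
macro, micro = macro_micro_f1(gold, pred)
print(round(macro, 3), round(micro, 3))  # macro is about 0.333, micro about 0.667
```

With thousands of rarely used concepts in the label space, missing the rare ones drags macro-F1 down much faster than micro-F1, which is the pattern the reported scores show.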
Takeaways
Want cleaner XBRL without losing weekends to spreadsheets? Start with a two-step pipeline: let LLMs extract numbers (FinNI) and then sanity-check concept linking (FinCL) with structure-aware context and retrieval. Keep a human-in-the-loop for GAAP hair-splitting, because even top models confuse lookalike tags, and no, shouting thousands of flat labels at them won’t help. Pilot with your filings, track macro and micro F1, and only then scale—otherwise you’re just automating the creation of better-looking mistakes.
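The two-step pipeline with a human in the loop can be sketched as follows. This is only a toy under stated assumptions: a regex stands in for the LLM extraction step (FinNI), token overlap against a three-entry hypothetical taxonomy stands in for retrieval-based concept linking (FinCL), and the `margin` threshold for routing a fact to review is an arbitrary illustrative value.

```python
import re

# Hypothetical mini-taxonomy: concept name -> label words (not the real US-GAAP list).
TAXONOMY = {
    "Revenues": "revenues total revenue",
    "CostOfRevenue": "cost of revenue",
    "NetIncomeLoss": "net income loss",
}

NUMBER = re.compile(r"\$?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_facts(sentence):
    """Step 1 (FinNI stand-in): find numeric values and keep the sentence as context."""
    return [(m.group(), sentence) for m in NUMBER.finditer(sentence)]


def rank_concepts(context, k=2):
    """Step 2 (FinCL stand-in): rank candidate concepts by Jaccard token overlap."""
    ctx = set(re.findall(r"[a-z]+", context.lower()))
    scored = []
    for concept, label in TAXONOMY.items():
        words = set(label.split())
        scored.append((len(ctx & words) / len(ctx | words), concept))
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


def tag(sentence, margin=0.1):
    """Link each extracted value to a concept and flag near-ties for human review."""
    results = []
    for value, context in extract_facts(sentence):
        (best_score, best), (runner_score, _) = rank_concepts(context, k=2)
        results.append({
            "value": value,
            "concept": best,
            "needs_review": best_score - runner_score < margin,
        })
    return results


print(tag("Total revenue was $1,250 million."))
print(tag("Revenue 500"))  # lookalike tags tie, so this fact is routed to a reviewer
```

The point of the design is the `needs_review` flag: when two lookalike concepts score almost the same, the sketch refuses to pick silently and hands the decision to a person.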