mirror of
https://github.com/docling-project/docling.git
synced 2026-05-17 13:10:38 +00:00
1eb5c21dab
docs(xbrl): add notebook for XBRL parsing Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
508 lines
17 KiB
Plaintext
Vendored
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/xbrl_conversion.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# XBRL Document Conversion\n",
    "\n",
    "This example demonstrates how to parse XBRL (eXtensible Business Reporting Language) documents using Docling, completely offline.\n",
    "\n",
    "XBRL is a standard XML-based format used globally by companies, regulators, and financial institutions for exchanging business and financial information in a structured, machine-readable format. It's widely adopted for regulatory filings (e.g., SEC filings in the US).\n",
    "\n",
    "## What you'll learn\n",
    "\n",
    "- How to configure Docling to parse XBRL documents offline\n",
    "- How to provide a local taxonomy package for XBRL validation\n",
    "- How to extract structured data from XBRL instance documents\n",
    "- How to export XBRL content to various formats (Markdown, JSON, etc.)\n",
    "\n",
    "The data used in this notebook was fetched from the [SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR)](https://www.sec.gov/search-filings) system."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Install Docling with XBRL support:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/Users/ceb/git/docling-3/.venv/bin/python: No module named pip\n",
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install -q docling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download Sample XBRL Data\n",
    "\n",
    "For this example, we'll use a sample XBRL instance document and its taxonomy. In a real scenario, you would have your own XBRL files and taxonomy packages.\n",
    "\n",
    "We'll download the test data from the Docling repository:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading XBRL instance file...\n",
      "Downloaded: xbrl_data/mlac-20251231.xml\n",
      "Downloading taxonomy files...\n",
      " Downloaded: mlac-20251231.xsd\n",
      " Downloaded: mlac-20251231_cal.xml\n",
      " Downloaded: mlac-20251231_def.xml\n",
      " Downloaded: mlac-20251231_lab.xml\n",
      " Downloaded: mlac-20251231_pre.xml\n",
      "Downloading taxonomy package...\n",
      " Downloaded: taxonomy_package.zip\n",
      "\n",
      "All files downloaded successfully!\n"
     ]
    }
   ],
   "source": [
    "import urllib.request\n",
    "from pathlib import Path\n",
    "\n",
    "# Create directories for XBRL data\n",
    "data_dir = Path(\"xbrl_data\")\n",
    "taxonomy_dir = data_dir / \"taxonomy\"\n",
    "taxonomy_dir.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "# Base URL for test data\n",
    "base_url = (\n",
    "    \"https://raw.githubusercontent.com/docling-project/docling/main/tests/data/xbrl/\"\n",
    ")\n",
    "\n",
    "# Download XBRL instance file\n",
    "instance_file = data_dir / \"mlac-20251231.xml\"\n",
    "if not instance_file.exists():\n",
    "    print(\"Downloading XBRL instance file...\")\n",
    "    urllib.request.urlretrieve(f\"{base_url}mlac-20251231.xml\", instance_file)\n",
    "    print(f\"Downloaded: {instance_file}\")\n",
    "\n",
    "# Download taxonomy files\n",
    "taxonomy_files = [\n",
    "    \"mlac-20251231.xsd\",\n",
    "    \"mlac-20251231_cal.xml\",\n",
    "    \"mlac-20251231_def.xml\",\n",
    "    \"mlac-20251231_lab.xml\",\n",
    "    \"mlac-20251231_pre.xml\",\n",
    "]\n",
    "\n",
    "print(\"Downloading taxonomy files...\")\n",
    "for filename in taxonomy_files:\n",
    "    target_file = taxonomy_dir / filename\n",
    "    if not target_file.exists():\n",
    "        urllib.request.urlretrieve(f\"{base_url}mlac-taxonomy/{filename}\", target_file)\n",
    "        print(f\" Downloaded: {filename}\")\n",
    "\n",
    "# Download taxonomy package (contains URL mappings for offline parsing)\n",
    "taxonomy_package = taxonomy_dir / \"taxonomy_package.zip\"\n",
    "if not taxonomy_package.exists():\n",
    "    print(\"Downloading taxonomy package...\")\n",
    "    urllib.request.urlretrieve(\n",
    "        f\"{base_url}mlac-taxonomy/taxonomy_package.zip\", taxonomy_package\n",
    "    )\n",
    "    print(\" Downloaded: taxonomy_package.zip\")\n",
    "\n",
    "print(\"\\nAll files downloaded successfully!\")"
   ]
  },
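  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The taxonomy package is an ordinary ZIP archive, so its contents can be listed with Python's standard `zipfile` module before handing it to Docling. The following is a minimal sketch, assuming the download cell above has run and `taxonomy_package` points to the archive. It makes no assumption about which entries the archive holds and only prints the first few names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import zipfile\n",
    "\n",
    "# Read-only inspection: print the first entries of the taxonomy package\n",
    "with zipfile.ZipFile(taxonomy_package) as archive:\n",
    "    for name in archive.namelist()[:10]:\n",
    "        print(name)"
   ]
  },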
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configure XBRL Backend\n",
    "\n",
    "To parse XBRL documents offline, we need to:\n",
    "\n",
    "1. Enable local resource fetching (for taxonomy files)\n",
    "2. Disable remote resource fetching (for offline operation)\n",
    "3. Provide the path to the local taxonomy directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "XBRL converter configured successfully!\n"
     ]
    }
   ],
   "source": [
    "from docling.datamodel.backend_options import XBRLBackendOptions\n",
    "from docling.datamodel.base_models import InputFormat\n",
    "from docling.document_converter import DocumentConverter, XBRLFormatOption\n",
    "\n",
    "# Configure XBRL backend options\n",
    "backend_options = XBRLBackendOptions(\n",
    "    enable_local_fetch=True,  # Allow reading local taxonomy files\n",
    "    enable_remote_fetch=False,  # Disable remote fetching for offline operation\n",
    "    taxonomy=taxonomy_dir,  # Path to local taxonomy directory\n",
    ")\n",
    "\n",
    "# Create document converter with XBRL support\n",
    "converter = DocumentConverter(\n",
    "    allowed_formats=[InputFormat.XML_XBRL],\n",
    "    format_options={\n",
    "        InputFormat.XML_XBRL: XBRLFormatOption(backend_options=backend_options)\n",
    "    },\n",
    ")\n",
    "\n",
    "print(\"XBRL converter configured successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "💡 Because the converter must read the supporting taxonomy files, set the `enable_local_fetch` option to **True** in the XBRL backend settings.  \n",
    "💡 In addition to the XBRL report's own taxonomy files, you need a *taxonomy package*: a bundle containing URL remappings that enables completely offline parsing. If you prefer not to supply a taxonomy package, omit it and set `enable_remote_fetch` to **True** in the XBRL backend settings. The backend will then fetch the web‑referenced files from the remote publishers and cache them locally for reuse."
   ]
  },
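  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The online alternative can be sketched as follows. This is an illustration rather than a tested configuration: it reuses only the option names from the previous code cell and assumes that `taxonomy_dir` holds the report's own schema and linkbase files but no taxonomy package. The names `online_options` and `online_converter` are introduced here for illustration only, and converting with this converter requires network access."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from docling.datamodel.backend_options import XBRLBackendOptions\n",
    "from docling.datamodel.base_models import InputFormat\n",
    "from docling.document_converter import DocumentConverter, XBRLFormatOption\n",
    "\n",
    "# Same options as the offline setup, except that remote fetching is allowed\n",
    "# so that web-referenced taxonomy files can be retrieved from their publishers.\n",
    "online_options = XBRLBackendOptions(\n",
    "    enable_local_fetch=True,\n",
    "    enable_remote_fetch=True,\n",
    "    taxonomy=taxonomy_dir,  # assumed to contain no taxonomy package\n",
    ")\n",
    "\n",
    "online_converter = DocumentConverter(\n",
    "    allowed_formats=[InputFormat.XML_XBRL],\n",
    "    format_options={\n",
    "        InputFormat.XML_XBRL: XBRLFormatOption(backend_options=online_options)\n",
    "    },\n",
    ")"
   ]
  },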
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert XBRL Document\n",
    "\n",
    "Now we can convert the XBRL instance document. The converter will:\n",
    "\n",
    "- Parse the XBRL instance file\n",
    "- Validate it against the local taxonomy\n",
    "- Extract metadata, text blocks, and numeric facts\n",
    "- Convert everything to a unified DoclingDocument representation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Converting XBRL document: xbrl_data/mlac-20251231.xml\n",
      "\n",
      "Conversion successful!\n",
      "Document name: mlac-20251231\n",
      "Number of items: 292\n"
     ]
    }
   ],
   "source": [
    "# Convert the XBRL document\n",
    "print(f\"Converting XBRL document: {instance_file}\")\n",
    "result = converter.convert(instance_file)\n",
    "doc = result.document\n",
    "\n",
    "print(\"\\nConversion successful!\")\n",
    "print(f\"Document name: {doc.name}\")\n",
    "print(f\"Number of items: {len(list(doc.iterate_items()))}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inspect Document Structure\n",
    "\n",
    "Let's examine the structure of the converted document:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document structure:\n",
      " text: 267\n",
      " table: 23\n",
      " title: 1\n",
      " key_value_region: 1\n"
     ]
    }
   ],
   "source": [
    "from docling_core.types.doc import DocItemLabel\n",
    "\n",
    "# Count items by type\n",
    "item_counts = {}\n",
    "for item, _ in doc.iterate_items():\n",
    "    label = item.label\n",
    "    item_counts[label] = item_counts.get(label, 0) + 1\n",
    "\n",
    "print(\"Document structure:\")\n",
    "for label, count in sorted(item_counts.items(), key=lambda x: x[1], reverse=True):\n",
    "    print(f\" {label.value}: {count}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## View Sample Content\n",
    "\n",
    "Let's look at some of the extracted content:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sample text content:\n",
      "\n",
      "- None\n",
      "\n",
      "- We are a special purpose acquisition company with no business operations. Since our initial public offering, our sole business activity has been identifying and evaluating suitable acquisition transac...\n",
      "\n",
      "- We depend on digital technologies, including information systems, infrastructure and cloud applications and services, including those of third parties with which we may deal. Sophisticated and deliber...\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Display first few text items\n",
    "print(\"Sample text content:\\n\")\n",
    "text_count = 0\n",
    "for item, _ in doc.iterate_items():\n",
    "    if item.label == DocItemLabel.TEXT and text_count < 3:\n",
    "        print(f\"- {item.text[:200]}...\" if len(item.text) > 200 else f\"- {item.text}\")\n",
    "        print()\n",
    "        text_count += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## View Key-Value Pairs\n",
    "\n",
    "XBRL numeric facts are extracted as key-value pairs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total key-value pairs extracted: 319\n",
      "\n",
      "EntityPublicFloat -> 239160600\n",
      "EntityCommonStockSharesOutstanding -> 23805000\n",
      "EntityCommonStockSharesOutstanding -> 7187500\n",
      "Cash -> 452680\n",
      "Cash -> 1383392\n",
      "OtherPrepaidExpenseCurrent -> 16840\n",
      "OtherPrepaidExpenseCurrent -> 23669\n",
      "PrepaidInsurance -> 87776\n",
      "PrepaidInsurance -> 92500\n",
      "AssetsCurrent -> 557296\n"
     ]
    }
   ],
   "source": [
    "# Display sample key-value pairs\n",
    "graph_data = doc.key_value_items[0].graph\n",
    "print(f\"Total key-value pairs extracted: {len(graph_data.links)}\\n\")\n",
    "for link in graph_data.links[:10]:\n",
    "    source = next(\n",
    "        item for item in graph_data.cells if item.cell_id == link.source_cell_id\n",
    "    )\n",
    "    target = next(\n",
    "        item for item in graph_data.cells if item.cell_id == link.target_cell_id\n",
    "    )\n",
    "    print(f\"{source.text} -> {target.text}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "💡 The current backend implementation flattens all key‑value pairs in an XBRL report. Future improvements will preserve the rich taxonomy of those data points."
   ]
  },
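  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the pairs are flattened, the same concept name can appear more than once, as `Cash` does in the sample output. A minimal sketch of grouping the values by concept name follows. It uses only the `GraphData` attributes already shown in the key-value cell and assumes that cell has run; `cells_by_id` and `values_by_concept` are names introduced here for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "# Index the cells once so that each link lookup is a dictionary access\n",
    "cells_by_id = {cell.cell_id: cell for cell in graph_data.cells}\n",
    "\n",
    "# Collect every value reported for a given concept name\n",
    "values_by_concept = defaultdict(list)\n",
    "for link in graph_data.links:\n",
    "    key = cells_by_id[link.source_cell_id].text\n",
    "    value = cells_by_id[link.target_cell_id].text\n",
    "    values_by_concept[key].append(value)\n",
    "\n",
    "print(values_by_concept[\"Cash\"])"
   ]
  },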
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export to Markdown\n",
    "\n",
    "Export the document to Markdown format for easy reading:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Markdown export (first 2000 characters):\n",
      "\n",
      "# 10-K MOUNTAIN LAKE ACQUISITION CORP. 2025-12-31\n",
      "\n",
      "None\n",
      "\n",
      "We are a special purpose acquisition company with no business operations. Since our initial public offering, our sole business activity has been identifying and evaluating suitable acquisition transaction candidates. Therefore, we do not consider that we face significant cybersecurity risk and have not adopted any cybersecurity risk management program or formal processes for assessing cybersecurity risk.\n",
      "\n",
      "We depend on digital technologies, including information systems, infrastructure and cloud applications and services, including those of third parties with which we may deal. Sophisticated and deliberate attacks on, or security breaches in, our information systems or infrastructure, or the information systems or infrastructure of third parties or the cloud, could lead to corruption or misappropriation of our assets, proprietary information and sensitive or confidential data. Because of our reliance on the technologies of third parties, we also depend upon the personnel and the processes of third parties to protect against cybersecurity threats. In the event of a cybersecurity incident impacting us, the management team will report to the board of directors and provide updates on the management team's incident response plan for addressing and mitigating any risks associated with the cybersecurity incident. As an early-stage company without significant investments in data security protection, there can be no assurance that we will have sufficient resources to adequately protect against, or to investigate and remediate any vulnerability to, cyber incidents. It is possible that any of these occurrences, or a combination of them, could have adverse consequences on our business and lead to financial loss.\n",
      "\n",
      "As of the date of this Report, we have not identified any risks from cybersecurity threats, including as a result of any previous cybersecurity incidents, that we believe have, or are likely to, materially affect \n",
      "\n",
      "...\n",
      "\n",
      "Full markdown saved to: xbrl_data/output.md\n"
     ]
    }
   ],
   "source": [
    "# Export to Markdown\n",
    "markdown_content = doc.export_to_markdown()\n",
    "\n",
    "# Display first 2000 characters\n",
    "print(\"Markdown export (first 2000 characters):\\n\")\n",
    "print(markdown_content[:2000])\n",
    "print(\"\\n...\")\n",
    "\n",
    "# Save to file (explicit encoding so the result does not depend on the platform default)\n",
    "output_md = data_dir / \"output.md\"\n",
    "output_md.write_text(markdown_content, encoding=\"utf-8\")\n",
    "print(f\"\\nFull markdown saved to: {output_md}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export to JSON\n",
    "\n",
    "Export the complete document structure to JSON:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document exported to JSON: xbrl_data/output.json\n",
      "File size: 1538.26 KB\n"
     ]
    }
   ],
   "source": [
    "# Export to JSON\n",
    "output_json = data_dir / \"output.json\"\n",
    "doc.save_as_json(output_json)\n",
    "\n",
    "print(f\"Document exported to JSON: {output_json}\")\n",
    "print(f\"File size: {output_json.stat().st_size / 1024:.2f} KB\")"
   ]
  },
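  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check the exported file, it can be read back with Python's standard `json` module. The following is a minimal sketch, assuming the export cell above has run and `output_json` points to the saved file. It prints only the top-level keys and does not assume any particular schema."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# Read the exported file back and list its top-level keys\n",
    "with output_json.open(encoding=\"utf-8\") as f:\n",
    "    exported = json.load(f)\n",
    "\n",
    "print(sorted(exported))"
   ]
  },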
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this example, we demonstrated:\n",
    "\n",
    "✅ How to configure Docling for offline XBRL parsing  \n",
    "✅ How to provide a local taxonomy for XBRL validation  \n",
    "✅ How to convert XBRL instance documents to DoclingDocument  \n",
    "✅ How to extract metadata, text blocks, and numeric facts  \n",
    "✅ How to export XBRL content to Markdown and JSON formats  \n",
    "\n",
    "### Key Points\n",
    "\n",
    "- **Offline operation**: By setting `enable_remote_fetch=False`, all processing happens locally\n",
    "- **Taxonomy support**: The local taxonomy directory should contain all necessary schema and linkbase files\n",
    "- **Structured extraction**: XBRL numeric facts are extracted as key-value pairs with graph representation\n",
    "- **Text blocks**: HTML text blocks in XBRL are converted to structured content\n",
    "\n",
    "### Note on Future Changes\n",
    "\n",
    "⚠️ The current implementation uses `DoclingDocument`'s `GraphData` object to represent key-value pairs. This design will change in a future release of the `docling-core` library."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}