mirror of
https://github.com/docling-project/docling-core.git
synced 2026-05-17 13:10:44 +00:00
c73904e68e
354 lines
27 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "922d396f",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Table annotations"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "50437c89",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from docling_core.types.doc.document import DoclingDocument\n",
|
|
"\n",
|
|
"file_path = \"2408.09869v3.json\"\n",
|
|
"pages = {5} # pages to serialize (for output brevity)\n",
|
|
"\n",
|
|
"doc = DoclingDocument.load_from_json(file_path)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "d35192ea",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"from typing import Optional\n",
"from rich.console import Console\n",
"from rich.panel import Panel\n",
"\n",
"\n",
"def print_excerpt(\n",
"    txt: str,\n",
"    *,\n",
"    limit: int = 2000,\n",
"    title: Optional[str] = None,\n",
"    min_width: int = 80,\n",
"    table_end: str = \"--|\",\n",
"):\n",
"    excerpt = txt[:limit]\n",
"    # size the panel to the widest table separator row, but never below min_width;\n",
"    # default=-1 keeps max() from raising on an empty excerpt:\n",
"    width = max(\n",
"        max((ln.rfind(table_end) for ln in excerpt.splitlines()), default=-1) + len(table_end) + 4,\n",
"        min_width,\n",
"    )\n",
"    console = Console(width=width)\n",
"    console.print(Panel(f\"{excerpt}{'...' if len(txt) > limit else ''}\", title=title))"
|
|
]
|
|
},
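The width rule used by `print_excerpt` can be checked on its own: the panel is sized to the right-most table row terminator (`--|`) plus some padding, but never below `min_width`. The following is a minimal standalone sketch using only the standard library; `panel_width` and its `padding` parameter are hypothetical names introduced here for illustration and are not part of rich or docling-core.

```python
def panel_width(
    text: str,
    *,
    table_end: str = "--|",
    min_width: int = 80,
    padding: int = 4,
) -> int:
    """Return a console width that fits the widest Markdown table separator row."""
    # str.rfind returns -1 when the marker is absent from a line;
    # default=-1 gives the same result when the text has no lines at all.
    last_marker = max((ln.rfind(table_end) for ln in text.splitlines()), default=-1)
    return max(last_marker + len(table_end) + padding, min_width)


sample = "| a | b |\n|---|---|\n| 1 | 2 |"
print(panel_width(sample, min_width=5))  # width driven by the separator row
print(panel_width("no table here"))      # falls back to min_width
```

Text without any table rows, including empty text, simply falls back to `min_width`.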
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "a51271ac",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Adding a table annotation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "557791de",
|
|
"metadata": {},
|
|
"source": [
|
|
"Below we add a demo table annotation, picking the first table for illustrative purposes.\n",
|
|
"\n",
"Note that `MiscAnnotation` accepts arbitrary dict data in its `content` field.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "add64711",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"from docling_core.types.doc.document import DescriptionAnnotation, MiscAnnotation\n",
"\n",
"assert doc.tables, \"No table available in this document\"\n",
"table = doc.tables[0]\n",
"\n",
"table.add_annotation(\n",
"    annotation=DescriptionAnnotation(\n",
"        text=\"A typical Docling setup runtime characterization.\",\n",
"        provenance=\"model-foo\",\n",
"    ),\n",
")\n",
"\n",
"table.add_annotation(\n",
"    annotation=MiscAnnotation(\n",
"        content={\n",
"            \"type\": \"performance data\",\n",
"            \"sentiment\": 0.85,\n",
"            # ...\n",
"        },\n",
"    ),\n",
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "81408ae6",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Default serialization"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "b1be8540",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭────────────────────────────────────────────────────────────────────────────────── pages={5} ───────────────────────────────────────────────────────────────────────────────────╮\n",
|
|
"│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report. │\n",
|
|
"│ │\n",
|
|
"│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We │\n",
|
|
"│ show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the │\n",
|
|
"│ pypdfium backend, using 4 and 16 threads. │\n",
|
|
"│ │\n",
|
|
"│ A typical Docling setup runtime characterization. │\n",
|
|
"│ │\n",
|
|
"│ | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend | │\n",
|
|
"│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| │\n",
|
|
"│ | | | TTS | Pages/s | Mem | TTS | Pages/s | Mem | │\n",
|
|
"│ | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB | │\n",
|
|
"│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB | │\n",
|
|
"│ │\n",
|
|
"│ ## 5 Applications │\n",
|
|
"│ │\n",
|
|
"│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can │\n",
|
|
"│ provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment │\n",
|
|
"│ of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented │\n",
|
|
"│ generation (RAG), we provi... │\n",
|
|
"╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
|
|
"</pre>\n"
|
|
],
|
|
"text/plain": [
|
|
"╭────────────────────────────────────────────────────────────────────────────────── pages={5} ───────────────────────────────────────────────────────────────────────────────────╮\n",
|
|
"│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report. │\n",
|
|
"│ │\n",
|
|
"│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We │\n",
|
|
"│ show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the │\n",
|
|
"│ pypdfium backend, using 4 and 16 threads. │\n",
|
|
"│ │\n",
|
|
"│ A typical Docling setup runtime characterization. │\n",
|
|
"│ │\n",
|
|
"│ | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend | │\n",
|
|
"│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| │\n",
|
|
"│ | | | TTS | Pages/s | Mem | TTS | Pages/s | Mem | │\n",
|
|
"│ | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB | │\n",
|
|
"│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB | │\n",
|
|
"│ │\n",
|
|
"│ ## 5 Applications │\n",
|
|
"│ │\n",
|
|
"│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can │\n",
|
|
"│ provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment │\n",
|
|
"│ of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented │\n",
|
|
"│ generation (RAG), we provi... │\n",
|
|
"╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
"from docling_core.transforms.serializer.markdown import (\n",
"    MarkdownDocSerializer,\n",
"    MarkdownParams,\n",
")\n",
"\n",
"ser = MarkdownDocSerializer(\n",
"    doc=doc,\n",
"    params=MarkdownParams(\n",
"        pages=pages,\n",
"    ),\n",
")\n",
"ser_out = ser.serialize()\n",
"ser_txt = ser_out.text\n",
"\n",
"print_excerpt(ser_txt, title=f\"{pages=}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "50b513c1",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Custom serialization"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "add5b785",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"from typing import Any\n",
"\n",
"from docling_core.transforms.serializer.base import SerializationResult\n",
"from docling_core.transforms.serializer.common import create_ser_result\n",
"from docling_core.transforms.serializer.markdown import MarkdownAnnotationSerializer\n",
"from docling_core.types.doc.document import MiscAnnotation, DocItem\n",
"\n",
"\n",
"class CustomAnnotationSerializer(MarkdownAnnotationSerializer):\n",
"    def serialize(\n",
"        self,\n",
"        *,\n",
"        item: DocItem,\n",
"        doc: DoclingDocument,\n",
"        **kwargs: Any,\n",
"    ) -> SerializationResult:\n",
"        text_parts: list[str] = []\n",
"\n",
"        # reusing result from parent serializer:\n",
"        parent_res = super().serialize(\n",
"            item=item,\n",
"            doc=doc,\n",
"            **kwargs,\n",
"        )\n",
"        text_parts.append(parent_res.text)\n",
"\n",
"        # custom serialization logic (appending misc annotation result):\n",
"        for ann in item.get_annotations():\n",
"            if isinstance(ann, MiscAnnotation):\n",
"                out_txt = \"\".join([f\"- {k}: {ann.content[k]}\\n\" for k in ann.content])\n",
"                text_parts.append(out_txt)\n",
"        text_res = \"\\n\\n\".join(text_parts)\n",
"        return create_ser_result(text=text_res, span_source=item)"
|
|
]
|
|
},
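The formatting step that `CustomAnnotationSerializer` applies to each `MiscAnnotation` can be tried without loading a document: every key of `content` becomes one Markdown bullet, and the collected parts are joined with a blank line. The following is a minimal sketch on plain dicts and strings; `render_misc_content` and `join_parts` are hypothetical helper names used only for illustration and are not docling-core APIs.

```python
def render_misc_content(content: dict) -> str:
    """Render each key/value pair as one Markdown bullet line."""
    return "".join(f"- {key}: {value}\n" for key, value in content.items())


def join_parts(parts: list[str]) -> str:
    """Join already-serialized parts with a blank line between them."""
    return "\n\n".join(parts)


# Stand-ins for the parent serializer's text and the misc annotation content.
description = "A typical Docling setup runtime characterization."
misc = render_misc_content({"type": "performance data", "sentiment": 0.85})

print(join_parts([description, misc]))
```

Each bullet already ends in a newline, so the joined text carries a trailing newline. That is probably why the serialized output shows an extra blank line between the bullets and the table.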
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"id": "e1107ddb",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭────────────────────────────────────────────────────────────────────────────────── pages={5} ───────────────────────────────────────────────────────────────────────────────────╮\n",
|
|
"│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report. │\n",
|
|
"│ │\n",
|
|
"│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We │\n",
|
|
"│ show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the │\n",
|
|
"│ pypdfium backend, using 4 and 16 threads. │\n",
|
|
"│ │\n",
|
|
"│ A typical Docling setup runtime characterization. │\n",
|
|
"│ │\n",
|
|
"│ - type: performance data │\n",
|
|
"│ - sentiment: 0.85 │\n",
|
|
"│ │\n",
|
|
"│ │\n",
|
|
"│ | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend | │\n",
|
|
"│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| │\n",
|
|
"│ | | | TTS | Pages/s | Mem | TTS | Pages/s | Mem | │\n",
|
|
"│ | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB | │\n",
|
|
"│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB | │\n",
|
|
"│ │\n",
|
|
"│ ## 5 Applications │\n",
|
|
"│ │\n",
|
|
"│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can │\n",
|
|
"│ provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment │\n",
|
|
"│ of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as r... │\n",
|
|
"╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
|
|
"</pre>\n"
|
|
],
|
|
"text/plain": [
|
|
"╭────────────────────────────────────────────────────────────────────────────────── pages={5} ───────────────────────────────────────────────────────────────────────────────────╮\n",
|
|
"│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report. │\n",
|
|
"│ │\n",
|
|
"│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We │\n",
|
|
"│ show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the │\n",
|
|
"│ pypdfium backend, using 4 and 16 threads. │\n",
|
|
"│ │\n",
|
|
"│ A typical Docling setup runtime characterization. │\n",
|
|
"│ │\n",
|
|
"│ - type: performance data │\n",
|
|
"│ - sentiment: 0.85 │\n",
|
|
"│ │\n",
|
|
"│ │\n",
|
|
"│ | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend | │\n",
|
|
"│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| │\n",
|
|
"│ | | | TTS | Pages/s | Mem | TTS | Pages/s | Mem | │\n",
|
|
"│ | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB | │\n",
|
|
"│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB | │\n",
|
|
"│ │\n",
|
|
"│ ## 5 Applications │\n",
|
|
"│ │\n",
|
|
"│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can │\n",
|
|
"│ provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment │\n",
|
|
"│ of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as r... │\n",
|
|
"╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
"ser = MarkdownDocSerializer(\n",
"    doc=doc,\n",
"    annotation_serializer=CustomAnnotationSerializer(),\n",
"    params=MarkdownParams(\n",
"        pages=pages,\n",
"    ),\n",
")\n",
"ser_out = ser.serialize()\n",
"ser_txt = ser_out.text\n",
"\n",
"print_excerpt(ser_txt, title=f\"{pages=}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "fb350716",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": ".venv",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.12.4"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|