{ "cells": [ { "cell_type": "markdown", "id": "922d396f", "metadata": {}, "source": [ "# Table annotations" ] }, { "cell_type": "code", "execution_count": 1, "id": "50437c89", "metadata": {}, "outputs": [], "source": [ "from docling_core.types.doc.document import DoclingDocument\n", "\n", "file_path = \"2408.09869v3.json\"\n", "pages = {5} # pages to serialize (for output brevity)\n", "\n", "doc = DoclingDocument.load_from_json(file_path)" ] }, { "cell_type": "code", "execution_count": 2, "id": "d35192ea", "metadata": {}, "outputs": [], "source": [ "from typing import Optional\n", "from rich.console import Console\n", "from rich.panel import Panel\n", "\n", "\n", "def print_excerpt(\n", " txt: str,\n", " *,\n", " limit: int = 2000,\n", " title: Optional[str] = None,\n", " min_width: int = 80,\n", " table_end: str = \"--|\",\n", "):\n", " excerpt = txt[:limit]\n", " width = max(\n", " max([ln.rfind(table_end) for ln in excerpt.splitlines()]) + len(table_end) + 4,\n", " min_width,\n", " )\n", " console = Console(width=width)\n", " console.print(Panel(f\"{excerpt}{'...' 
if len(txt) > limit else ''}\", title=title))" ] }, { "cell_type": "markdown", "id": "a51271ac", "metadata": {}, "source": [ "## Adding a table annotation" ] }, { "cell_type": "markdown", "id": "557791de", "metadata": {}, "source": [ "Below we add a demo table annotation, picking the first table for illustrative purposes.\n", "\n", "Note that `MiscAnnotation` accepts arbitrary dict data in its `content` field.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "add64711", "metadata": {}, "outputs": [], "source": [ "from docling_core.types.doc.document import DescriptionAnnotation, MiscAnnotation\n", "\n", "assert doc.tables, \"No table available in this document\"\n", "table = doc.tables[0]\n", "\n", "table.add_annotation(\n", " annotation=DescriptionAnnotation(\n", " text=\"A typical Docling setup runtime characterization.\",\n", " provenance=\"model-foo\",\n", " ),\n", ")\n", "\n", "table.add_annotation(\n", " annotation=MiscAnnotation(\n", " content={\n", " \"type\": \"performance data\",\n", " \"sentiment\": 0.85,\n", " # ...\n", " },\n", " ),\n", ")" ] }, { "cell_type": "markdown", "id": "81408ae6", "metadata": {}, "source": [ "## Default serialization" ] }, { "cell_type": "code", "execution_count": 4, "id": "b1be8540", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
╭────────────────────────────────────────────────────────────────────────────────── pages={5} ───────────────────────────────────────────────────────────────────────────────────╮\n",
       "│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.                                                      │\n",
       "│                                                                                                                                                                                │\n",
       "│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We     │\n",
       "│ show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the     │\n",
       "│ pypdfium backend, using 4 and 16 threads.                                                                                                                                      │\n",
       "│                                                                                                                                                                                │\n",
       "│ A typical Docling setup runtime characterization.                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│ | CPU                              | Thread budget   | native backend   | native backend   | native backend   | pypdfium backend   | pypdfium backend   | pypdfium backend   | │\n",
       "│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| │\n",
       "│ |                                  |                 | TTS              | Pages/s          | Mem              | TTS                | Pages/s            | Mem                | │\n",
       "│ | Apple M3 Max                     | 4               | 177 s 167 s      | 1.27 1.34        | 6.20 GB          | 103 s 92 s         | 2.18 2.45          | 2.56 GB            | │\n",
       "│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16         | 375 s 244 s      | 0.60 0.92        | 6.16 GB          | 239 s 143 s        | 0.94 1.57          | 2.42 GB            | │\n",
       "│                                                                                                                                                                                │\n",
       "│ ## 5 Applications                                                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can     │\n",
       "│ provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment  │\n",
       "│ of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented │\n",
       "│ generation (RAG), we provi...                                                                                                                                                  │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
       "
\n" ], "text/plain": [ "╭────────────────────────────────────────────────────────────────────────────────── pages={5} ───────────────────────────────────────────────────────────────────────────────────╮\n", "│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report. │\n", "│ │\n", "│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We │\n", "│ show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the │\n", "│ pypdfium backend, using 4 and 16 threads. │\n", "│ │\n", "│ A typical Docling setup runtime characterization. │\n", "│ │\n", "│ | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend | │\n", "│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| │\n", "│ | | | TTS | Pages/s | Mem | TTS | Pages/s | Mem | │\n", "│ | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB | │\n", "│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB | │\n", "│ │\n", "│ ## 5 Applications │\n", "│ │\n", "│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can │\n", "│ provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment │\n", "│ of different structures in the document, such as tables, figures, section structure or references. 
For popular generative AI application patterns, such as retrieval-augmented │\n", "│ generation (RAG), we provi... │\n", "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from docling_core.transforms.serializer.markdown import (\n", " MarkdownDocSerializer,\n", " MarkdownParams,\n", ")\n", "\n", "ser = MarkdownDocSerializer(\n", " doc=doc,\n", " params=MarkdownParams(\n", " pages=pages,\n", " ),\n", ")\n", "ser_out = ser.serialize()\n", "ser_txt = ser_out.text\n", "\n", "print_excerpt(ser_txt, title=f\"{pages=}\")" ] }, { "cell_type": "markdown", "id": "50b513c1", "metadata": {}, "source": [ "## Custom serialization" ] }, { "cell_type": "code", "execution_count": 5, "id": "add5b785", "metadata": {}, "outputs": [], "source": [ "from typing import Any\n", "\n", "from docling_core.transforms.serializer.base import SerializationResult\n", "from docling_core.transforms.serializer.common import create_ser_result\n", "from docling_core.transforms.serializer.markdown import MarkdownAnnotationSerializer\n", "from docling_core.types.doc.document import MiscAnnotation, DocItem\n", "\n", "\n", "class CustomAnnotationSerializer(MarkdownAnnotationSerializer):\n", " def serialize(\n", " self,\n", " *,\n", " item: DocItem,\n", " doc: DoclingDocument,\n", " **kwargs: Any,\n", " ) -> SerializationResult:\n", " text_parts: list[str] = []\n", "\n", " # reusing result from parent serializer:\n", " parent_res = super().serialize(\n", " item=item,\n", " doc=doc,\n", " **kwargs,\n", " )\n", " text_parts.append(parent_res.text)\n", "\n", " # custom serialization logic (appending misc annotation result):\n", " for ann in item.get_annotations():\n", " if isinstance(ann, MiscAnnotation):\n", " out_txt = \"\".join([f\"- {k}: {ann.content[k]}\\n\" for k in ann.content])\n", " 
text_parts.append(out_txt)\n", " text_res = \"\\n\\n\".join(text_parts)\n", " return create_ser_result(text=text_res, span_source=item)" ] }, { "cell_type": "code", "execution_count": 6, "id": "e1107ddb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
╭────────────────────────────────────────────────────────────────────────────────── pages={5} ───────────────────────────────────────────────────────────────────────────────────╮\n",
       "│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.                                                      │\n",
       "│                                                                                                                                                                                │\n",
       "│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We     │\n",
       "│ show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the     │\n",
       "│ pypdfium backend, using 4 and 16 threads.                                                                                                                                      │\n",
       "│                                                                                                                                                                                │\n",
       "│ A typical Docling setup runtime characterization.                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│ - type: performance data                                                                                                                                                       │\n",
       "│ - sentiment: 0.85                                                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│                                                                                                                                                                                │\n",
       "│ | CPU                              | Thread budget   | native backend   | native backend   | native backend   | pypdfium backend   | pypdfium backend   | pypdfium backend   | │\n",
       "│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| │\n",
       "│ |                                  |                 | TTS              | Pages/s          | Mem              | TTS                | Pages/s            | Mem                | │\n",
       "│ | Apple M3 Max                     | 4               | 177 s 167 s      | 1.27 1.34        | 6.20 GB          | 103 s 92 s         | 2.18 2.45          | 2.56 GB            | │\n",
       "│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16         | 375 s 244 s      | 0.60 0.92        | 6.16 GB          | 239 s 143 s        | 0.94 1.57          | 2.42 GB            | │\n",
       "│                                                                                                                                                                                │\n",
       "│ ## 5 Applications                                                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can     │\n",
       "│ provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment  │\n",
       "│ of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as r...                │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
       "
\n" ], "text/plain": [ "╭────────────────────────────────────────────────────────────────────────────────── pages={5} ───────────────────────────────────────────────────────────────────────────────────╮\n", "│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report. │\n", "│ │\n", "│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We │\n", "│ show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the │\n", "│ pypdfium backend, using 4 and 16 threads. │\n", "│ │\n", "│ A typical Docling setup runtime characterization. │\n", "│ │\n", "│ - type: performance data │\n", "│ - sentiment: 0.85 │\n", "│ │\n", "│ │\n", "│ | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend | │\n", "│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| │\n", "│ | | | TTS | Pages/s | Mem | TTS | Pages/s | Mem | │\n", "│ | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB | │\n", "│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB | │\n", "│ │\n", "│ ## 5 Applications │\n", "│ │\n", "│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. 
For example, Docling can │\n", "│ provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment │\n", "│ of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as r... │\n", "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ser = MarkdownDocSerializer(\n", " doc=doc,\n", " annotation_serializer=CustomAnnotationSerializer(),\n", " params=MarkdownParams(\n", " pages=pages,\n", " ),\n", ")\n", "ser_out = ser.serialize()\n", "ser_txt = ser_out.text\n", "\n", "print_excerpt(ser_txt, title=f\"{pages=}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "fb350716", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 5 }