docling-core/examples/table_annotations.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "922d396f",
   "metadata": {},
   "source": [
    "# Table annotations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "50437c89",
   "metadata": {},
   "outputs": [],
   "source": [
    "from docling_core.types.doc.document import DoclingDocument\n",
    "\n",
    "file_path = \"2408.09869v3.json\"\n",
    "pages = {5}  # pages to serialize (for output brevity)\n",
    "\n",
    "doc = DoclingDocument.load_from_json(file_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d35192ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Optional\n",
    "from rich.console import Console\n",
    "from rich.panel import Panel\n",
    "\n",
    "\n",
    "def print_excerpt(\n",
    "    txt: str,\n",
    "    *,\n",
    "    limit: int = 2000,\n",
    "    title: Optional[str] = None,\n",
    "    min_width: int = 80,\n",
    "    table_end: str = \"--|\",\n",
    "):\n",
    "    \"\"\"Print the first `limit` characters of `txt` in a panel wide enough for its tables.\"\"\"\n",
    "    excerpt = txt[:limit]\n",
    "    # start index of the rightmost `table_end` marker across all lines (-1 if none, or if empty)\n",
    "    table_end_pos = max((ln.rfind(table_end) for ln in excerpt.splitlines()), default=-1)\n",
    "    width = max(table_end_pos + len(table_end) + 4, min_width)\n",
    "    console = Console(width=width)\n",
    "    console.print(Panel(f\"{excerpt}{'...' if len(txt) > limit else ''}\", title=title))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a51271ac",
   "metadata": {},
   "source": [
    "## Adding a table annotation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "557791de",
   "metadata": {},
   "source": [
    "Below we add two demo annotations to the first table of the document, chosen for illustration.\n",
    "\n",
    "Note that `MiscAnnotation` accepts arbitrary dict data in its `content` field.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "add64711",
   "metadata": {},
   "outputs": [],
   "source": [
    "from docling_core.types.doc.document import DescriptionAnnotation, MiscAnnotation\n",
    "\n",
    "assert doc.tables, \"No table available in this document\"\n",
    "table = doc.tables[0]\n",
    "\n",
    "table.add_annotation(\n",
    "    annotation=DescriptionAnnotation(\n",
    "        text=\"A typical Docling setup runtime characterization.\",\n",
    "        provenance=\"model-foo\",\n",
    "    ),\n",
    ")\n",
    "\n",
    "table.add_annotation(\n",
    "    annotation=MiscAnnotation(\n",
    "        content={\n",
    "            \"type\": \"performance data\",\n",
    "            \"sentiment\": 0.85,\n",
    "            # ...\n",
    "        },\n",
    "    ),\n",
    ")"
   ]
  },
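  {
   "cell_type": "markdown",
   "id": "81408ae6",
   "metadata": {},
   "source": [
    "## Default serialization\n",
    "\n",
    "With default settings, the Markdown serializer renders the description annotation as a paragraph directly above the table, while the misc annotation is left out of the output."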
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b1be8540",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭────────────────────────────────────────────────────────────────────────────────── pages={5} ───────────────────────────────────────────────────────────────────────────────────╮\n",
       "│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.                                                      │\n",
       "│                                                                                                                                                                                │\n",
       "│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We     │\n",
       "│ show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the     │\n",
       "│ pypdfium backend, using 4 and 16 threads.                                                                                                                                      │\n",
       "│                                                                                                                                                                                │\n",
       "│ A typical Docling setup runtime characterization.                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│ | CPU                              | Thread budget   | native backend   | native backend   | native backend   | pypdfium backend   | pypdfium backend   | pypdfium backend   | │\n",
       "│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| │\n",
       "│ |                                  |                 | TTS              | Pages/s          | Mem              | TTS                | Pages/s            | Mem                | │\n",
       "│ | Apple M3 Max                     | 4               | 177 s 167 s      | 1.27 1.34        | 6.20 GB          | 103 s 92 s         | 2.18 2.45          | 2.56 GB            | │\n",
       "│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16         | 375 s 244 s      | 0.60 0.92        | 6.16 GB          | 239 s 143 s        | 0.94 1.57          | 2.42 GB            | │\n",
       "│                                                                                                                                                                                │\n",
       "│ ## 5 Applications                                                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can     │\n",
       "│ provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment  │\n",
       "│ of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented │\n",
       "│ generation (RAG), we provi...                                                                                                                                                  │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
       "</pre>\n"
      ],
      "text/plain": [
       "╭────────────────────────────────────────────────────────────────────────────────── pages={5} ───────────────────────────────────────────────────────────────────────────────────╮\n",
       "│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.                                                      │\n",
       "│                                                                                                                                                                                │\n",
       "│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We     │\n",
       "│ show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the     │\n",
       "│ pypdfium backend, using 4 and 16 threads.                                                                                                                                      │\n",
       "│                                                                                                                                                                                │\n",
       "│ A typical Docling setup runtime characterization.                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│ | CPU                              | Thread budget   | native backend   | native backend   | native backend   | pypdfium backend   | pypdfium backend   | pypdfium backend   | │\n",
       "│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| │\n",
       "│ |                                  |                 | TTS              | Pages/s          | Mem              | TTS                | Pages/s            | Mem                | │\n",
       "│ | Apple M3 Max                     | 4               | 177 s 167 s      | 1.27 1.34        | 6.20 GB          | 103 s 92 s         | 2.18 2.45          | 2.56 GB            | │\n",
       "│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16         | 375 s 244 s      | 0.60 0.92        | 6.16 GB          | 239 s 143 s        | 0.94 1.57          | 2.42 GB            | │\n",
       "│                                                                                                                                                                                │\n",
       "│ ## 5 Applications                                                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can     │\n",
       "│ provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment  │\n",
       "│ of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented │\n",
       "│ generation (RAG), we provi...                                                                                                                                                  │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from docling_core.transforms.serializer.markdown import (\n",
    "    MarkdownDocSerializer,\n",
    "    MarkdownParams,\n",
    ")\n",
    "\n",
    "ser = MarkdownDocSerializer(\n",
    "    doc=doc,\n",
    "    params=MarkdownParams(\n",
    "        pages=pages,\n",
    "    ),\n",
    ")\n",
    "ser_out = ser.serialize()\n",
    "ser_txt = ser_out.text\n",
    "\n",
    "print_excerpt(ser_txt, title=f\"{pages=}\")"
   ]
  },
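  {
   "cell_type": "markdown",
   "id": "50b513c1",
   "metadata": {},
   "source": [
    "## Custom serialization\n",
    "\n",
    "To include the misc annotation as well, we subclass `MarkdownAnnotationSerializer`: the subclass reuses the parent's output and appends each key-value pair of a `MiscAnnotation` as a list item."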
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "add5b785",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Any\n",
    "\n",
    "from docling_core.transforms.serializer.base import SerializationResult\n",
    "from docling_core.transforms.serializer.common import create_ser_result\n",
    "from docling_core.transforms.serializer.markdown import MarkdownAnnotationSerializer\n",
    "from docling_core.types.doc.document import MiscAnnotation, DocItem\n",
    "\n",
    "\n",
    "class CustomAnnotationSerializer(MarkdownAnnotationSerializer):\n",
    "    def serialize(\n",
    "        self,\n",
    "        *,\n",
    "        item: DocItem,\n",
    "        doc: DoclingDocument,\n",
    "        **kwargs: Any,\n",
    "    ) -> SerializationResult:\n",
    "        text_parts: list[str] = []\n",
    "\n",
    "        # reusing result from parent serializer:\n",
    "        parent_res = super().serialize(\n",
    "            item=item,\n",
    "            doc=doc,\n",
    "            **kwargs,\n",
    "        )\n",
    "        text_parts.append(parent_res.text)\n",
    "\n",
    "        # custom serialization logic (appending misc annotation result):\n",
    "        for ann in item.get_annotations():\n",
    "            if isinstance(ann, MiscAnnotation):\n",
    "                out_txt = \"\".join([f\"- {k}: {ann.content[k]}\\n\" for k in ann.content])\n",
    "                text_parts.append(out_txt)\n",
    "        text_res = \"\\n\\n\".join(text_parts)\n",
    "        return create_ser_result(text=text_res, span_source=item)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "e1107ddb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">╭────────────────────────────────────────────────────────────────────────────────── pages={5} ───────────────────────────────────────────────────────────────────────────────────╮\n",
       "│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.                                                      │\n",
       "│                                                                                                                                                                                │\n",
       "│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We     │\n",
       "│ show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the     │\n",
       "│ pypdfium backend, using 4 and 16 threads.                                                                                                                                      │\n",
       "│                                                                                                                                                                                │\n",
       "│ A typical Docling setup runtime characterization.                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│ - type: performance data                                                                                                                                                       │\n",
       "│ - sentiment: 0.85                                                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│                                                                                                                                                                                │\n",
       "│ | CPU                              | Thread budget   | native backend   | native backend   | native backend   | pypdfium backend   | pypdfium backend   | pypdfium backend   | │\n",
       "│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| │\n",
       "│ |                                  |                 | TTS              | Pages/s          | Mem              | TTS                | Pages/s            | Mem                | │\n",
       "│ | Apple M3 Max                     | 4               | 177 s 167 s      | 1.27 1.34        | 6.20 GB          | 103 s 92 s         | 2.18 2.45          | 2.56 GB            | │\n",
       "│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16         | 375 s 244 s      | 0.60 0.92        | 6.16 GB          | 239 s 143 s        | 0.94 1.57          | 2.42 GB            | │\n",
       "│                                                                                                                                                                                │\n",
       "│ ## 5 Applications                                                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can     │\n",
       "│ provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment  │\n",
       "│ of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as r...                │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
       "</pre>\n"
      ],
      "text/plain": [
       "╭────────────────────────────────────────────────────────────────────────────────── pages={5} ───────────────────────────────────────────────────────────────────────────────────╮\n",
       "│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.                                                      │\n",
       "│                                                                                                                                                                                │\n",
       "│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We     │\n",
       "│ show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the     │\n",
       "│ pypdfium backend, using 4 and 16 threads.                                                                                                                                      │\n",
       "│                                                                                                                                                                                │\n",
       "│ A typical Docling setup runtime characterization.                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│ - type: performance data                                                                                                                                                       │\n",
       "│ - sentiment: 0.85                                                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│                                                                                                                                                                                │\n",
       "│ | CPU                              | Thread budget   | native backend   | native backend   | native backend   | pypdfium backend   | pypdfium backend   | pypdfium backend   | │\n",
       "│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------| │\n",
       "│ |                                  |                 | TTS              | Pages/s          | Mem              | TTS                | Pages/s            | Mem                | │\n",
       "│ | Apple M3 Max                     | 4               | 177 s 167 s      | 1.27 1.34        | 6.20 GB          | 103 s 92 s         | 2.18 2.45          | 2.56 GB            | │\n",
       "│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16         | 375 s 244 s      | 0.60 0.92        | 6.16 GB          | 239 s 143 s        | 0.94 1.57          | 2.42 GB            | │\n",
       "│                                                                                                                                                                                │\n",
       "│ ## 5 Applications                                                                                                                                                              │\n",
       "│                                                                                                                                                                                │\n",
       "│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can     │\n",
       "│ provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment  │\n",
       "│ of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as r...                │\n",
       "╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "ser = MarkdownDocSerializer(\n",
    "    doc=doc,\n",
    "    annotation_serializer=CustomAnnotationSerializer(),\n",
    "    params=MarkdownParams(\n",
    "        pages=pages,\n",
    "    ),\n",
    ")\n",
    "ser_out = ser.serialize()\n",
    "ser_txt = ser_out.text\n",
    "\n",
    "print_excerpt(ser_txt, title=f\"{pages=}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}