docling-langchain/examples/docling_loader.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_langchain.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# RAG with LangChain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Step | Tech | Execution | \n",
    "| --- | --- | --- |\n",
    "| Embedding | Hugging Face / Sentence Transformers | 💻 Local |\n",
    "| Vector store | Milvus | 💻 Local |\n",
    "| Gen AI | OpenAI-compatible API | 💻 Local (e.g. via LM Studio) or 🌐 Remote | "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example leverages the\n",
    "[LangChain Docling integration](../../integrations/langchain/), along with a Milvus\n",
    "vector store, as well as sentence-transformers embeddings.\n",
    "\n",
    "The presented `DoclingLoader` component enables you to:\n",
    "- use various document types in your LLM applications with ease and speed, and\n",
    "- leverage Docling's rich format for advanced, document-native grounding.\n",
    "\n",
    "`DoclingLoader` supports two different export modes:\n",
    "- `ExportType.MARKDOWN`: if you want to capture each input document as a separate\n",
    "  LangChain document, or\n",
    "- `ExportType.DOC_CHUNKS` (default): if you want to have each input document chunked and\n",
    "  to then capture each individual chunk as a separate LangChain document downstream.\n",
    "\n",
    "The example allows exploring both modes via parameter `EXPORT_TYPE`; depending on the\n",
    "value set, the example pipeline is then set up accordingly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 👉 For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use GPU-enabled runtime.\n",
    "- For Colab, adding `--no-warn-conflicts` may be needed due to the pre-populated Python env."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install --progress-bar off langchain-docling langchain-core langchain-huggingface sentence-transformers langchain_milvus \"pymilvus[milvus_lite]\" langchain-text-splitters langchain-classic langchain-openai python-dotenv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "from tempfile import mkdtemp\n",
    "\n",
    "from dotenv import load_dotenv\n",
    "from langchain_core.prompts import PromptTemplate\n",
    "\n",
    "from langchain_docling.loader import ExportType\n",
    "\n",
    "\n",
    "def _get_env_from_colab_or_os(key):\n",
    "    try:\n",
    "        from google.colab import userdata\n",
    "\n",
    "        try:\n",
    "            return userdata.get(key)\n",
    "        except userdata.SecretNotFoundError:\n",
    "            pass\n",
    "    except ImportError:\n",
    "        pass\n",
    "    return os.getenv(key)\n",
    "\n",
    "\n",
    "load_dotenv()\n",
    "\n",
    "# https://github.com/huggingface/transformers/issues/5486:\n",
    "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",
    "\n",
    "FILE_PATH = [\"https://arxiv.org/pdf/2408.09869\"]  # Docling Technical Report\n",
    "EMBED_MODEL_ID = \"sentence-transformers/all-MiniLM-L6-v2\"\n",
    "LM_BASE_URL = \"http://localhost:1234/v1\"\n",
    "LM_API_KEY = _get_env_from_colab_or_os(\"LM_API_KEY\") or \"none\"\n",
    "LM_MODEL_ID = \"openai/gpt-oss-20b\"\n",
    "EXPORT_TYPE = ExportType.DOC_CHUNKS\n",
    "QUESTION = \"Which are the main AI models in Docling?\"\n",
    "PROMPT = PromptTemplate.from_template(\n",
    "    \"Context information is below.\\n---------------------\\n{context}\\n---------------------\\nGiven the context information and not prior knowledge, answer the query.\\nQuery: {input}\\nAnswer:\\n\",\n",
    ")\n",
    "TOP_K = 3\n",
    "MILVUS_URI = str(Path(mkdtemp()) / \"docling.db\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Document loading\n",
    "\n",
    "Now we can instantiate our loader and load documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-11-14 16:46:07,321 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]\n",
      "2025-11-14 16:46:07,341 - INFO - Going to convert document batch...\n",
      "2025-11-14 16:46:07,342 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad\n",
      "2025-11-14 16:46:07,348 - INFO - Loading plugin 'docling_defaults'\n",
      "2025-11-14 16:46:07,350 - WARNING - The plugin langchain_docling will not be loaded because Docling is being executed with allow_external_plugins=false.\n",
      "2025-11-14 16:46:07,350 - INFO - Registered picture descriptions: ['vlm', 'api']\n",
      "2025-11-14 16:46:07,356 - INFO - Loading plugin 'docling_defaults'\n",
      "2025-11-14 16:46:07,358 - WARNING - The plugin langchain_docling will not be loaded because Docling is being executed with allow_external_plugins=false.\n",
      "2025-11-14 16:46:07,358 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']\n",
      "2025-11-14 16:46:12,976 - INFO - Auto OCR model selected ocrmac.\n",
      "2025-11-14 16:46:12,979 - INFO - Accelerator device: 'mps'\n",
      "2025-11-14 16:46:22,098 - INFO - Accelerator device: 'mps'\n",
      "2025-11-14 16:46:22,579 - INFO - Processing document 2408.09869v5.pdf\n",
      "2025-11-14 16:46:27,952 - INFO - Finished converting document 2408.09869v5.pdf in 21.08 sec.\n"
     ]
    }
   ],
   "source": [
    "from docling.chunking import HybridChunker\n",
    "from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\n",
    "\n",
    "from langchain_docling import DoclingLoader\n",
    "\n",
    "loader = DoclingLoader(\n",
    "    file_path=FILE_PATH,\n",
    "    export_type=EXPORT_TYPE,\n",
    "    chunker=HybridChunker(\n",
    "        tokenizer=HuggingFaceTokenizer.from_pretrained(\n",
    "            model_name=EMBED_MODEL_ID,\n",
    "        ),\n",
    "    ),\n",
    ")\n",
    "\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    <strong>INFO</strong>: Using the <code>HybridChunker</code> can sometimes lead to a warning from the transformers library, however this is a \"false alarm\" — for details check <a href=\"https://docling-project.github.io/docling/faq/#hybridchunker-triggers-warning-token-indices-sequence-length-is-longer-than-the-specified-maximum-sequence-length-for-this-model\">here</a>.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Determining the splits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "if EXPORT_TYPE == ExportType.DOC_CHUNKS:\n",
    "    splits = docs\n",
    "elif EXPORT_TYPE == ExportType.MARKDOWN:\n",
    "    from langchain_text_splitters import MarkdownHeaderTextSplitter\n",
    "\n",
    "    splitter = MarkdownHeaderTextSplitter(\n",
    "        headers_to_split_on=[\n",
    "            (\"#\", \"Header_1\"),\n",
    "            (\"##\", \"Header_2\"),\n",
    "            (\"###\", \"Header_3\"),\n",
    "        ],\n",
    "    )\n",
    "    splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]\n",
    "else:\n",
    "    raise ValueError(f\"Unexpected export type: {EXPORT_TYPE}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Inspecting some sample splits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "- d.page_content='Version 1.0\\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\\nAI4K Group, IBM Research R¨ uschlikon, Switzerland'\n",
      "- d.page_content='Abstract\\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'\n",
      "- d.page_content='1 Introduction\\nConverting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge variability in formats, weak standardization and printing-optimized characteristic, which discards most structural features and metadata. With the advent of LLMs and popular application patterns such as retrieval-augmented generation (RAG), leveraging the rich content embedded in PDFs has become ever more relevant. In the past decade, several powerful document understanding solutions have emerged on the market, most of which are commercial software, cloud offerings [3] and most recently, multi-modal vision-language models. As of today, only a handful of open-source tools cover PDF conversion, leaving a significant feature and quality gap to proprietary solutions.\\nWith Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition we developed and presented in the recent past [12, 13, 9]. Docling is designed as a simple, self-contained python library with permissive license, running entirely locally on commodity hardware. Its code architecture allows for easy extensibility and addition of new features and models.\\nHere is what Docling delivers today:'\n",
      "...\n"
     ]
    }
   ],
   "source": [
    "for d in splits[:3]:\n",
    "    print(f\"- {d.page_content=}\")\n",
    "print(\"...\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ingestion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-11-14 16:46:28,707 - INFO - Use pytorch device_name: mps\n",
      "2025-11-14 16:46:28,708 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2\n",
      "/Users/pva/work/github.com/DS4SD/docling-langchain/venv/lib/python3.12/site-packages/milvus_lite/__init__.py:15: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.\n",
      "  from pkg_resources import DistributionNotFound, get_distribution\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "from pathlib import Path\n",
    "from tempfile import mkdtemp\n",
    "\n",
    "from langchain_huggingface.embeddings import HuggingFaceEmbeddings\n",
    "from langchain_milvus import Milvus\n",
    "\n",
    "embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)\n",
    "\n",
    "\n",
    "milvus_uri = str(Path(mkdtemp()) / \"docling.db\")  # or set as needed\n",
    "vectorstore = Milvus.from_documents(\n",
    "    documents=splits,\n",
    "    embedding=embedding,\n",
    "    collection_name=\"docling_demo\",\n",
    "    connection_args={\"uri\": milvus_uri},\n",
    "    index_params={\"index_type\": \"FLAT\"},\n",
    "    drop_old=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## RAG"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_classic.chains import create_retrieval_chain\n",
    "from langchain_classic.chains.combine_documents import create_stuff_documents_chain\n",
    "from langchain_openai import ChatOpenAI\n",
    "\n",
    "retriever = vectorstore.as_retriever(search_kwargs={\"k\": TOP_K})\n",
    "\n",
    "llm = ChatOpenAI(model=LM_MODEL_ID, base_url=LM_BASE_URL, api_key=LM_API_KEY)\n",
    "\n",
    "\n",
    "def clip_text(text, threshold=100):\n",
    "    return f\"{text[:threshold]}...\" if len(text) > threshold else text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2025-11-14 16:46:36,912 - INFO - HTTP Request: POST http://localhost:1234/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Question:\n",
      "Which are the main AI models in Docling?\n",
      "\n",
      "Answer:\n",
      "The two primary AI models that Docling ships with are:\n",
      "\n",
      "1. **Layout Analysis Model** – an accurate object‑detector for identifying page elements such as headings, paragraphs, images, etc.\n",
      "\n",
      "2. **TableFormer** – a state‑of‑the‑art table‑structure recognition model used to parse and reconstruct tables from documents.\n",
      "\n",
      "Source 1:\n",
      "  text: \"3.2 AI models\\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re...\"\n",
      "  source: https://arxiv.org/pdf/2408.09869\n",
      "  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/50', 'parent': {'$ref': '#/body'}, 'children': [], 'content_layer': 'body', 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 108.0, 't': 404.873, 'r': 504.003, 'b': 330.866, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n",
      "\n",
      "Source 2:\n",
      "  text: \"3 Processing pipeline\\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ...\"\n",
      "  source: https://arxiv.org/pdf/2408.09869\n",
      "  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/26', 'parent': {'$ref': '#/body'}, 'children': [], 'content_layer': 'body', 'label': 'text', 'prov': [{'page_no': 2, 'bbox': {'l': 108.0, 't': 272.749, 'r': 504.003, 'b': 176.92399999999998, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 796]}]}], 'headings': ['3 Processing pipeline'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n",
      "\n",
      "Source 3:\n",
      "  text: \"6 Future work and contributions\\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ...\"\n",
      "  source: https://arxiv.org/pdf/2408.09869\n",
      "  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/76', 'parent': {'$ref': '#/body'}, 'children': [], 'content_layer': 'body', 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 322.2, 'r': 504.003, 'b': 259.10300000000007, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}, {'self_ref': '#/texts/77', 'parent': {'$ref': '#/body'}, 'children': [], 'content_layer': 'body', 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 251.654, 'r': 504.003, 'b': 199.07799999999997, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 402]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}\n"
     ]
    }
   ],
   "source": [
    "question_answer_chain = create_stuff_documents_chain(llm, PROMPT)\n",
    "rag_chain = create_retrieval_chain(retriever, question_answer_chain)\n",
    "resp_dict = rag_chain.invoke({\"input\": QUESTION})\n",
    "\n",
    "clipped_answer = clip_text(resp_dict[\"answer\"], threshold=500)\n",
    "print(f\"Question:\\n{resp_dict['input']}\\n\\nAnswer:\\n{clipped_answer}\")\n",
    "for i, doc in enumerate(resp_dict[\"context\"]):\n",
    "    print()\n",
    "    print(f\"Source {i+1}:\")\n",
    "    print(f\"  text: {json.dumps(clip_text(doc.page_content, threshold=350))}\")\n",
    "    for key in doc.metadata:\n",
    "        if key != \"pk\":\n",
    "            val = doc.metadata.get(key)\n",
    "            clipped_val = clip_text(val) if isinstance(val, str) else val\n",
    "            print(f\"  {key}: {clipped_val}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}