docs: add line-based chunker documentation and examples (#3210)

Signed-off-by: anish.raghavendra <anish.raghavendra@ibm.com>
Co-authored-by: anish.raghavendra <anish.raghavendra@ibm.com>
Anish Raghavendra
2026-03-30 14:25:31 +05:30
committed by GitHub
parent 8522b00146
commit 3a64f41af8
4 changed files with 927 additions and 76 deletions
+38 -3
@@ -71,10 +71,38 @@ tokens), &
chunks with same headings & captions) — users can opt out of this step via param
`merge_peers` (by default `True`)
👉 Usage examples:
### Table Chunking with Repeated Headers
- [Hybrid chunking](../examples/hybrid_chunking.ipynb)
- [Advanced chunking & serialization](../examples/advanced_chunking_and_serialization.ipynb)
When chunking tables with [`HybridChunker`](#hybrid-chunker), you can control how table headers are handled:
- **`repeat_table_header`** (default: `True`): When enabled, table headers are repeated at the beginning of each chunk when a table spans multiple chunks. This ensures each chunk maintains context about the table structure.
- **`omit_header_on_overflow`** (default: `False`): When enabled together with `repeat_table_header=True`, this parameter controls what happens when a wide table row does not fit in a chunk alongside the repeated header:
    - If a table row fits within the token limit **without** the header but would overflow **with** the header, the header is omitted for that specific row
    - This keeps the row intact instead of splitting it, which makes better use of the token budget for structured content
    - Particularly useful for tables with very wide headers or when working with strict token limits
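The rule described above can be sketched in plain Python. This is an illustrative model only, not docling's implementation: the function `repeats_header` and its integer token counts are hypothetical, and the behavior for a row that exceeds the limit even without the header is an assumption.

```python
def repeats_header(
    row_tokens: int,
    header_tokens: int,
    max_tokens: int,
    repeat_table_header: bool = True,
    omit_header_on_overflow: bool = False,
) -> bool:
    """Decide whether a continuation chunk starting at this row gets the header.

    Hypothetical helper that mirrors the documented behavior of the two
    parameters; it is not part of the docling API.
    """
    if not repeat_table_header:
        return False

    fits_alone = row_tokens <= max_tokens
    fits_with_header = header_tokens + row_tokens <= max_tokens

    # The header is dropped only when doing so lets the row fit unsplit.
    if omit_header_on_overflow and fits_alone and not fits_with_header:
        return False

    # Assumption: a row too long even on its own keeps the header,
    # since it has to be split regardless.
    return True


# A short row keeps its header; a wide row loses it only when opted in.
print(repeats_header(row_tokens=50, header_tokens=30, max_tokens=100))
print(repeats_header(80, 30, 100, omit_header_on_overflow=True))
```

In this model the header is dropped only when dropping it is what allows the row to stay in one piece, which matches the first bullet above.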
## Line-Based Token Chunker
!!! note "To access `LineBasedTokenChunker`"
    - If you are using the `docling` package, you can import as follows:
        ```python
        from docling.chunking import LineBasedTokenChunker
        ```
    - If you are only using the `docling-core` package, make sure to install
        the `chunking` extra, then import as follows:
        ```python
        from docling_core.transforms.chunker.line_chunker import LineBasedTokenChunker
        ```
The `LineBasedTokenChunker` is a tokenization-aware chunker that preserves line boundaries, particularly useful for structured content like tables, code, logs, and lists. It attempts to keep lines intact within chunks, only splitting a line if it exceeds the maximum token limit on its own.
Key capabilities:
- Prioritizes keeping entire lines within a single chunk
- Supports adding a repeated prefix to each chunk (e.g., table headers for context)
- Offers overflow handling via the `omit_prefix_on_overflow` parameter: when `True`, the prefix is omitted for lines that would overflow with it but fit without it
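The capabilities listed above can be illustrated with a minimal, self-contained sketch. It does not use or reproduce the `LineBasedTokenChunker` API: `chunk_lines`, `count_tokens`, and `split_long_line` are hypothetical helpers, and counting whitespace-separated words stands in for a real tokenizer.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def split_long_line(line: str, budget: int) -> List[str]:
    """Split a line into pieces of at most `budget` words each."""
    words = line.split()
    return [" ".join(words[i : i + budget]) for i in range(0, len(words), budget)]


def chunk_lines(
    lines: List[str],
    max_tokens: int,
    prefix: str = "",
    omit_prefix_on_overflow: bool = False,
) -> List[str]:
    """Pack whole lines into chunks of at most `max_tokens`, repeating `prefix`."""
    prefix_tokens = count_tokens(prefix)
    if prefix_tokens >= max_tokens:
        raise ValueError("prefix must be shorter than max_tokens")
    budget = max_tokens - prefix_tokens  # room left for lines after the prefix
    head = [prefix] if prefix else []

    chunks: List[str] = []
    body: List[str] = []  # lines collected for the chunk being built
    used = 0  # tokens used by `body` (prefix not counted)

    def flush() -> None:
        nonlocal body, used
        if body:
            chunks.append("\n".join(head + body))
        body, used = [], 0

    for line in lines:
        n = count_tokens(line)
        if n <= budget:
            # Normal case: keep the whole line, opening a new chunk if needed.
            if used + n > budget:
                flush()
            body.append(line)
            used += n
        elif omit_prefix_on_overflow and n <= max_tokens:
            # Fits on its own but not with the prefix: emit it without the prefix.
            flush()
            chunks.append(line)
        else:
            # Too long to keep whole: split it, one prefixed chunk per piece.
            flush()
            for piece in split_long_line(line, budget):
                chunks.append("\n".join(head + [piece]))
    flush()
    return chunks


header = "id name city"
rows = ["1 Ann Oslo", "2 Bob Rome", "3 Cy a b c d e"]
for chunk in chunk_lines(rows, max_tokens=9, prefix=header, omit_prefix_on_overflow=True):
    print(chunk, end="\n---\n")
```

With `omit_prefix_on_overflow=True`, the wide third row is emitted whole without the header. With the default `False`, this sketch splits the row and repeats the header on each piece.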
## Hierarchical Chunker
@@ -83,3 +111,10 @@ the [`DoclingDocument`](./docling_document.md) to create one chunk for each indi
detected document element, by default only merging together list items (can be opted out
via param `merge_list_items`). It also takes care of attaching all relevant document
metadata, including headers and captions.
## Usage Examples
- [Hybrid chunking](../examples/hybrid_chunking.ipynb)
- [Line-based chunking](../examples/line_based_chunking.ipynb)
- [Advanced chunking & serialization](../examples/advanced_chunking_and_serialization.ipynb)
+247 -73
@@ -141,17 +141,29 @@
"\n",
"=== 1 ===\n",
"chunk.text:\n",
"'IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 19…'\n",
"chunker.contextualize(chunk):\n",
"'IBM\\nIBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since th…'\n",
"\n",
"=== 2 ===\n",
"chunk.text:\n",
"'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willa…'\n",
"chunker.contextualize(chunk):\n",
"'IBM\\n1910s1950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889…'\n",
"\n",
"=== 2 ===\n",
"=== 3 ===\n",
"chunk.text:\n",
"'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson,…'\n",
"chunker.contextualize(chunk):\n",
"'IBM\\n1910s1950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John …'\n",
"\n",
"=== 3 ===\n",
"=== 4 ===\n",
"chunk.text:\n",
"'He implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan, \"THINK\", became a mantra for each compa…'\n",
"chunker.contextualize(chunk):\n",
"'IBM\\n1910s1950s\\nHe implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan, \"THINK\", became a mantr…'\n",
"\n",
"=== 5 ===\n",
"chunk.text:\n",
"'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.…'\n",
"chunker.contextualize(chunk):\n",
@@ -282,114 +294,126 @@
"'IBM\\nIBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'\n",
"\n",
"=== 2 ===\n",
"chunk.text (63 tokens):\n",
"'IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\n",
"chunker.contextualize(chunk) (64 tokens):\n",
"'IBM\\nIBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems. During the 1960s and 1970s, the'\n",
"chunk.text (56 tokens):\n",
"'IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems.'\n",
"chunker.contextualize(chunk) (57 tokens):\n",
"'IBM\\nIBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed \"International Business Machines\" in 1924 and soon became the leading manufacturer of punch-card tabulating systems.'\n",
"\n",
"=== 3 ===\n",
"chunk.text (44 tokens):\n",
"\"IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\n",
"chunker.contextualize(chunk) (45 tokens):\n",
"\"IBM\\nIBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\n",
"chunk.text (51 tokens):\n",
"\"During the 1960s and 1970s, the IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\n",
"chunker.contextualize(chunk) (52 tokens):\n",
"\"IBM\\nDuring the 1960s and 1970s, the IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]\"\n",
"\n",
"=== 4 ===\n",
"chunk.text (63 tokens):\n",
"'IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,'\n",
"chunker.contextualize(chunk) (64 tokens):\n",
"'IBM\\nIBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 1990s,'\n",
"chunk.text (59 tokens):\n",
"'IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad.'\n",
"chunker.contextualize(chunk) (60 tokens):\n",
"'IBM\\nIBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad.'\n",
"\n",
"=== 5 ===\n",
"chunk.text (61 tokens):\n",
"'IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\n",
"chunker.contextualize(chunk) (62 tokens):\n",
"'IBM\\nIBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005. IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\n",
"chunk.text (36 tokens):\n",
"'Since the 1990s, IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005.'\n",
"chunker.contextualize(chunk) (37 tokens):\n",
"'IBM\\nSince the 1990s, IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005.'\n",
"\n",
"=== 6 ===\n",
"chunk.text (62 tokens):\n",
"\"As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\n",
"chunker.contextualize(chunk) (63 tokens):\n",
"\"IBM\\nAs one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database, the SQL programming\"\n",
"chunk.text (29 tokens):\n",
"'IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\n",
"chunker.contextualize(chunk) (30 tokens):\n",
"'IBM\\nIBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'\n",
"\n",
"=== 7 ===\n",
"chunk.text (63 tokens):\n",
"'language, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\n",
"chunker.contextualize(chunk) (64 tokens):\n",
"'IBM\\nlanguage, and the UPC barcode. The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing'\n",
"chunk.text (59 tokens):\n",
"\"As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database,\"\n",
"chunker.contextualize(chunk) (60 tokens):\n",
"\"IBM\\nAs one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database,\"\n",
"\n",
"=== 8 ===\n",
"chunk.text (5 tokens):\n",
"'Awards.[16]'\n",
"chunker.contextualize(chunk) (6 tokens):\n",
"'IBM\\nAwards.[16]'\n",
"chunk.text (12 tokens):\n",
"'the SQL programming language, and the UPC barcode.'\n",
"chunker.contextualize(chunk) (13 tokens):\n",
"'IBM\\nthe SQL programming language, and the UPC barcode.'\n",
"\n",
"=== 9 ===\n",
"chunk.text (56 tokens):\n",
"'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\n",
"chunk.text (59 tokens):\n",
"'The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing Awards.[16]'\n",
"chunker.contextualize(chunk) (60 tokens):\n",
"'IBM\\n1910s1950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine'\n",
"'IBM\\nThe company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing Awards.[16]'\n",
"\n",
"=== 10 ===\n",
"chunk.text (60 tokens):\n",
"\"(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\n",
"chunker.contextualize(chunk) (64 tokens):\n",
"\"IBM\\n1910s1950s\\n(1889);[19] and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16, 1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the\"\n",
"chunk.text (19 tokens):\n",
"'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E.'\n",
"chunker.contextualize(chunk) (23 tokens):\n",
"'IBM\\n1910s1950s\\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E.'\n",
"\n",
"=== 11 ===\n",
"chunk.text (59 tokens):\n",
"'Computing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\n",
"chunker.contextualize(chunk) (63 tokens):\n",
"'IBM\\n1910s1950s\\nComputing-Tabulating-Recording Company (CTR) based in Endicott, New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington,'\n",
"chunk.text (44 tokens):\n",
"'Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19]'\n",
"chunker.contextualize(chunk) (48 tokens):\n",
"'IBM\\n1910s1950s\\nPitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19]'\n",
"\n",
"=== 12 ===\n",
"chunk.text (13 tokens):\n",
"'D.C.; and Toronto, Canada.[22]'\n",
"chunker.contextualize(chunk) (17 tokens):\n",
"'IBM\\n1910s1950s\\nD.C.; and Toronto, Canada.[22]'\n",
"chunk.text (31 tokens):\n",
"\"and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16,\"\n",
"chunker.contextualize(chunk) (35 tokens):\n",
"\"IBM\\n1910s1950s\\nand Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16,\"\n",
"\n",
"=== 13 ===\n",
"chunk.text (60 tokens):\n",
"'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\n",
"chunker.contextualize(chunk) (64 tokens):\n",
"'IBM\\n1910s1950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called'\n",
"chunk.text (39 tokens):\n",
"'1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the Computing-Tabulating-Recording Company (CTR) based in Endicott,'\n",
"chunker.contextualize(chunk) (43 tokens):\n",
"'IBM\\n1910s1950s\\n1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the Computing-Tabulating-Recording Company (CTR) based in Endicott,'\n",
"\n",
"=== 14 ===\n",
"chunk.text (59 tokens):\n",
"\"on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\n",
"chunker.contextualize(chunk) (63 tokens):\n",
"\"IBM\\n1910s1950s\\non Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later, was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business\"\n",
"chunk.text (55 tokens):\n",
"'New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York;\\nDayton, Ohio; Detroit, Michigan; Washington, D.C.; and Toronto, Canada.[22]'\n",
"chunker.contextualize(chunk) (59 tokens):\n",
"'IBM\\n1910s1950s\\nNew York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York;\\nDayton, Ohio; Detroit, Michigan; Washington, D.C.; and Toronto, Canada.[22]'\n",
"\n",
"=== 15 ===\n",
"chunk.text (23 tokens):\n",
"\"practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\n",
"chunker.contextualize(chunk) (27 tokens):\n",
"\"IBM\\n1910s1950s\\npractices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\n105\"\n",
"chunk.text (42 tokens):\n",
"'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J.'\n",
"chunker.contextualize(chunk) (46 tokens):\n",
"'IBM\\n1910s1950s\\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J.'\n",
"\n",
"=== 16 ===\n",
"chunk.text (50 tokens):\n",
"'Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later,'\n",
"chunker.contextualize(chunk) (54 tokens):\n",
"'IBM\\n1910s1950s\\nWatson, Sr., fired from the National Cash Register Company by John Henry Patterson, called on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later,'\n",
"\n",
"=== 17 ===\n",
"chunk.text (50 tokens):\n",
"\"was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\u200a105\"\n",
"chunker.contextualize(chunk) (54 tokens):\n",
"\"IBM\\n1910s1950s\\nwas made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\\u200a105\"\n",
"\n",
"=== 18 ===\n",
"chunk.text (59 tokens):\n",
"'He implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\n",
"chunker.contextualize(chunk) (63 tokens):\n",
"'IBM\\n1910s1950s\\nHe implemented sales conventions, \"generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker\".[25][26] His favorite slogan,'\n",
"\n",
"=== 17 ===\n",
"chunk.text (60 tokens):\n",
"'\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\n",
"chunker.contextualize(chunk) (64 tokens):\n",
"'IBM\\n1910s1950s\\n\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America, Asia and Australia.[25] Watson never liked the'\n",
"\n",
"=== 18 ===\n",
"chunk.text (57 tokens):\n",
"'clumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\n",
"chunker.contextualize(chunk) (61 tokens):\n",
"'IBM\\n1910s1950s\\nclumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27] the name was changed on February 14,'\n",
"\n",
"=== 19 ===\n",
"chunk.text (21 tokens):\n",
"'1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\n",
"chunker.contextualize(chunk) (25 tokens):\n",
"'IBM\\n1910s1950s\\n1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\n",
"chunk.text (49 tokens):\n",
"'\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America,'\n",
"chunker.contextualize(chunk) (53 tokens):\n",
"'IBM\\n1910s1950s\\n\"THINK\", became a mantra for each company\\'s employees.[25] During Watson\\'s first four years, revenues reached $9 million ($158 million today) and the company\\'s operations expanded to Europe, South America,'\n",
"\n",
"=== 20 ===\n",
"chunk.text (60 tokens):\n",
"'Asia and Australia.[25] Watson never liked the clumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27]'\n",
"chunker.contextualize(chunk) (64 tokens):\n",
"'IBM\\n1910s1950s\\nAsia and Australia.[25] Watson never liked the clumsy hyphenated name \"Computing-Tabulating-Recording Company\" and chose to replace it with the more expansive title \"International Business Machines\" which had previously been used as the name of CTR\\'s Canadian Division;[27]'\n",
"\n",
"=== 21 ===\n",
"chunk.text (29 tokens):\n",
"'the name was changed on February 14,\\n1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\n",
"chunker.contextualize(chunk) (33 tokens):\n",
"'IBM\\n1910s1950s\\nthe name was changed on February 14,\\n1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'\n",
"\n",
"=== 22 ===\n",
"chunk.text (22 tokens):\n",
"'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.'\n",
"chunker.contextualize(chunk) (26 tokens):\n",
@@ -410,6 +434,156 @@
"\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table chunking with header repetition\n",
"\n",
"When chunking documents with tables, the `HybridChunker` can repeat table headers in each chunk to maintain context. This is particularly useful for wide tables where the content spans multiple chunks.\n",
"\n",
"Let's demonstrate this with a CSV file containing customer data."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document has 1 items\n",
"\n",
"First few lines of the CSV table:\n",
"| Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |\n",
"|---------|-----------------|--------------|-------------|---------------------------------|-------------------|----------------------------|------------------------|-----------------------|-----------------------------|-------\n"
]
}
],
"source": [
"# Convert a CSV file with a wide table\n",
"CSV_SOURCE = \"../../tests/data/csv/csv-comma.csv\"\n",
"\n",
"csv_result = DocumentConverter().convert(source=CSV_SOURCE)\n",
"csv_doc = csv_result.document\n",
"\n",
"print(f\"Document has {len(list(csv_doc.iterate_items()))} items\")\n",
"print(\"\\nFirst few lines of the CSV table:\")\n",
"print(csv_doc.export_to_markdown()[:500])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's chunk this table with header repetition enabled. We'll use a small token limit to force the table to be split across multiple chunks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total chunks created: 5\n",
"\n",
"============================================================\n",
"Chunk 1:\n",
"============================================================\n",
"| Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |\n",
"| - | - | - | - | - | - | - | - | - | - | - | - || 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | Chile | 229.077.5154 | 397.884.0519x718 | ...\n",
"\n",
"Tokens: 131\n",
"Has table header: True\n",
"\n",
"============================================================\n",
"Chunk 2:\n",
"============================================================\n",
"| Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |\n",
"| - | - | - | - | - | - | - | - | - | - | - | - || 2 | 1Ef7b82A4CAAD10 | Preston | Lozano, Dr | Vega-Gentry | East Jimmychester | Djibouti | 5153435776 | 686-620-1820...\n",
"\n",
"Tokens: 132\n",
"Has table header: True\n",
"\n",
"============================================================\n",
"Chunk 3:\n",
"============================================================\n",
"| Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |\n",
"| - | - | - | - | - | - | - | - | - | - | - | - || 3 | 6F94879bDAfE5a6 | Roy | Berry | Murillo-Perry | Isabelborough | Antigua and Barbuda | +1-539-402-0259 | (496)97...\n",
"\n",
"Tokens: 141\n",
"Has table header: True\n",
"\n"
]
}
],
"source": [
"from docling_core.transforms.chunker.hierarchical_chunker import (\n",
"    ChunkingDocSerializer,\n",
"    ChunkingSerializerProvider,\n",
")\n",
"from docling_core.transforms.serializer.markdown import (\n",
"    MarkdownParams,\n",
"    MarkdownTableSerializer,\n",
")\n",
"\n",
"\n",
"# Create a custom serializer provider that uses Markdown for tables\n",
"class MDTableSerializerProvider(ChunkingSerializerProvider):\n",
"    def get_serializer(self, doc):\n",
"        return ChunkingDocSerializer(\n",
"            doc=doc,\n",
"            table_serializer=MarkdownTableSerializer(),\n",
"            params=MarkdownParams(compact_tables=True),\n",
"        )\n",
"\n",
"\n",
"small_tokenizer = HuggingFaceTokenizer(\n",
"    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),\n",
"    max_tokens=200,\n",
")\n",
"\n",
"chunker_with_headers = HybridChunker(\n",
"    tokenizer=small_tokenizer,\n",
"    repeat_table_header=True,  # Repeat headers in each chunk\n",
"    serializer_provider=MDTableSerializerProvider(),  # Use Markdown table format\n",
")\n",
"\n",
"csv_chunks = list(chunker_with_headers.chunk(csv_doc))\n",
"\n",
"print(f\"Total chunks created: {len(csv_chunks)}\\n\")\n",
"\n",
"# Display the first few chunks to show header repetition\n",
"for i, chunk in enumerate(csv_chunks[:3], 1):\n",
"    print(f\"{'=' * 60}\")\n",
"    print(f\"Chunk {i}:\")\n",
"    print(f\"{'=' * 60}\")\n",
"    chunk_text = chunk.text\n",
"    # Show first 300 characters of each chunk\n",
"    preview = chunk_text[:300] + \"...\" if len(chunk_text) > 300 else chunk_text\n",
"    print(preview)\n",
"    print(f\"\\nTokens: {small_tokenizer.count_tokens(chunk_text)}\")\n",
"    print(f\"Has table header: {chunk_text.startswith('|')}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each chunk starts with the table header row, ensuring that every chunk maintains the context of what each column represents. This is especially important when:\n",
"\n",
"- Feeding chunks to an embedding model for semantic search\n",
"- Processing chunks independently in downstream tasks\n",
"- Working with wide tables that naturally span multiple chunks\n",
"\n",
"For more advanced control over header handling in wide tables, including the `omit_header_on_overflow` parameter, see the [Line-based chunking example](../line_based_chunking)."
]
}
],
"metadata": {
+641
View File
@@ -0,0 +1,641 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Line-Based Token Chunking\n",
"## Overview\n",
"The `LineBasedTokenChunker` is a tokenization-aware chunker that preserves line boundaries. It's particularly useful for structured content like tables, code, or logs where line boundaries are semantically important.\n",
"\n",
"Key features:\n",
"- **Line preservation**: Keeps entire lines within a single chunk when possible\n",
"- **Prefix support**: Add repeated context (e.g., table headers) to each chunk\n",
"- **Overflow handling**: Choose between splitting lines and omitting the prefix when a line does not fit together with the prefix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install -qU pip docling transformers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from docling_core.transforms.chunker.line_chunker import LineBasedTokenChunker\n",
"from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\n",
"from transformers import AutoTokenizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 1: Basic Table Chunking with Prefix\n",
"\n",
"In this example, we'll chunk a table while repeating the header in each chunk."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Max tokens: 50\n",
"Prefix token count: 34\n",
"\n",
"Total chunks: 3\n",
"\n",
"=== Chunk 1 ===\n",
"| Name | Age | Department |\n",
"|------|-----|------------|\n",
"| Alice | 30 | Engineering |\n",
"| Bob | 25 | Marketing |\n",
"\n",
"Tokens: 48\n",
"\n",
"=== Chunk 2 ===\n",
"| Name | Age | Department |\n",
"|------|-----|------------|\n",
"| Charlie | 35 | Sales |\n",
"| Diana | 28 | HR |\n",
"\n",
"Tokens: 48\n",
"\n",
"=== Chunk 3 ===\n",
"| Name | Age | Department |\n",
"|------|-----|------------|\n",
"| Eve | 32 | Finance |\n",
"\n",
"Tokens: 41\n",
"\n"
]
}
],
"source": [
"# Set up the tokenizer with a small token limit\n",
"tokenizer = HuggingFaceTokenizer(\n",
"    tokenizer=AutoTokenizer.from_pretrained(\"sentence-transformers/all-MiniLM-L6-v2\"),\n",
"    max_tokens=50,  # Small limit to demonstrate chunking\n",
")\n",
"\n",
"# Create chunker with table header prefix\n",
"chunker = LineBasedTokenChunker(\n",
"    tokenizer=tokenizer,\n",
"    prefix=\"| Name | Age | Department |\\n|------|-----|------------|\\n\",\n",
"    omit_prefix_on_overflow=False,  # Always include prefix (default)\n",
")\n",
"\n",
"# Sample table rows\n",
"lines = [\n",
"    \"| Alice | 30 | Engineering |\\n\",\n",
"    \"| Bob | 25 | Marketing |\\n\",\n",
"    \"| Charlie | 35 | Sales |\\n\",\n",
"    \"| Diana | 28 | HR |\\n\",\n",
"    \"| Eve | 32 | Finance |\\n\",\n",
"]\n",
"\n",
"print(f\"Max tokens: {chunker.max_tokens}\")\n",
"print(f\"Prefix token count: {chunker.prefix_len}\\n\")\n",
"\n",
"chunks = chunker.chunk_text(lines)\n",
"\n",
"print(f\"Total chunks: {len(chunks)}\\n\")\n",
"for i, chunk in enumerate(chunks, 1):\n",
"    print(f\"=== Chunk {i} ===\")\n",
"    print(chunk)\n",
"    print(f\"Tokens: {tokenizer.count_tokens(chunk)}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 2: Handling Wide Tables with `omit_prefix_on_overflow`\n",
"\n",
"When working with wide tables, some rows fit within the token limit on their own but overflow once the header is prepended. With `omit_prefix_on_overflow=True`, the prefix is omitted for such rows so that the row itself stays intact."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Prefix token count: 47\n",
"Max tokens: 30\n",
"\n",
"Token counts:\n",
" Line 1: 11 tokens (with prefix: 58 tokens)\n",
" Line 2: 11 tokens (with prefix: 58 tokens)\n",
" Line 3: 17 tokens (with prefix: 64 tokens)\n",
"\n"
]
}
],
"source": [
"# Set up the tokenizer with a very small token limit\n",
"tokenizer = HuggingFaceTokenizer(\n",
"    tokenizer=AutoTokenizer.from_pretrained(\"sentence-transformers/all-MiniLM-L6-v2\"),\n",
"    max_tokens=30,  # Very small limit to force overflow\n",
")\n",
"\n",
"# Create chunker with a longer prefix\n",
"prefix = (\n",
"    \"| Name | Age | Department | Location |\\n|------|-----|------------|----------|\\n\"\n",
")\n",
"\n",
"print(f\"Prefix token count: {tokenizer.count_tokens(prefix)}\")\n",
"print(f\"Max tokens: {tokenizer.get_max_tokens()}\\n\")\n",
"\n",
"# Sample lines - some will be too long with prefix\n",
"lines = [\n",
"    \"| Alice Johnson | 30 | Engineering | San Francisco |\\n\",\n",
"    \"| Bob Smith | 25 | Marketing | New York |\\n\",\n",
"    \"| Charlie Brown with a very long name | 35 | Sales Department | Los Angeles |\\n\",\n",
"]\n",
"\n",
"# Check token counts for each line\n",
"print(\"Token counts:\")\n",
"for i, line in enumerate(lines, 1):\n",
"    line_tokens = tokenizer.count_tokens(line)\n",
"    with_prefix = line_tokens + tokenizer.count_tokens(prefix)\n",
"    print(f\" Line {i}: {line_tokens} tokens (with prefix: {with_prefix} tokens)\")\n",
"print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Without `omit_prefix_on_overflow` (default behavior)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"============================================================\n",
"WITHOUT omit_prefix_on_overflow (may split long lines)\n",
"============================================================\n",
"\n",
"Total chunks: 5\n",
"\n",
"--- Chunk 1 ---\n",
"\n",
"| Name | Age | Department | Location\n",
"Tokens: 8\n",
"Has prefix: False\n",
"\n",
"--- Chunk 2 ---\n",
"\n",
" |\n",
"|------|-----|------------|--\n",
"Tokens: 30\n",
"Has prefix: False\n",
"\n",
"--- Chunk 3 ---\n",
"--------|\n",
"\n",
"Tokens: 9\n",
"Has prefix: False\n",
"\n",
"--- Chunk 4 ---\n",
"| Alice Johnson | 30 | Engineering | San Francisco |\n",
"| Bob Smith | 25 | Marketing | New York |\n",
"\n",
"Tokens: 22\n",
"Has prefix: False\n",
"\n",
"--- Chunk 5 ---\n",
"| Charlie Brown with a very long name | 35 | Sales Department | Los Angeles |\n",
"\n",
"Tokens: 17\n",
"Has prefix: False\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/anish/Desktop/Programs/docling/.venv/lib/python3.12/site-packages/docling_core/transforms/chunker/line_chunker.py:83: UserWarning: Chunks prefix is too long (47 tokens) for chunk size 30. It will be split into multiple chunks and only included in the first chunk(s). Consider increasing max_tokens to accommodate the full prefix in each chunk.\n",
" warnings.warn(\n"
]
}
],
"source": [
"chunker_no_omit = LineBasedTokenChunker(\n",
"    tokenizer=tokenizer,\n",
"    prefix=prefix,\n",
"    omit_prefix_on_overflow=False,  # Default: always include prefix\n",
")\n",
"\n",
"chunks_no_omit = chunker_no_omit.chunk_text(lines)\n",
"\n",
"print(\"=\" * 60)\n",
"print(\"WITHOUT omit_prefix_on_overflow (may split long lines)\")\n",
"print(\"=\" * 60)\n",
"print(f\"\\nTotal chunks: {len(chunks_no_omit)}\\n\")\n",
"\n",
"for i, chunk in enumerate(chunks_no_omit, 1):\n",
"    print(f\"--- Chunk {i} ---\")\n",
"    print(chunk)\n",
"    print(f\"Tokens: {tokenizer.count_tokens(chunk)}\")\n",
"    print(f\"Has prefix: {chunk.startswith(prefix)}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
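"### With `omit_prefix_on_overflow=True`\n",
"\n",
"Note that in this run the prefix alone (47 tokens) is longer than `max_tokens` (30). The chunker therefore warns and splits the prefix into leading chunks, and the recorded output is identical for both settings. To see the two settings diverge, `max_tokens` should be large enough to hold the prefix but too small to hold the prefix together with a row."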
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"============================================================\n",
"WITH omit_prefix_on_overflow (keeps lines intact)\n",
"============================================================\n",
"\n",
"Total chunks: 5\n",
"\n",
"--- Chunk 1 ---\n",
"\n",
"| Name | Age | Department | Location\n",
"Tokens: 8\n",
"Has prefix: False\n",
"\n",
"--- Chunk 2 ---\n",
"\n",
" |\n",
"|------|-----|------------|--\n",
"Tokens: 30\n",
"Has prefix: False\n",
"\n",
"--- Chunk 3 ---\n",
"--------|\n",
"\n",
"Tokens: 9\n",
"Has prefix: False\n",
"\n",
"--- Chunk 4 ---\n",
"| Alice Johnson | 30 | Engineering | San Francisco |\n",
"| Bob Smith | 25 | Marketing | New York |\n",
"\n",
"Tokens: 22\n",
"Has prefix: False\n",
"\n",
"--- Chunk 5 ---\n",
"| Charlie Brown with a very long name | 35 | Sales Department | Los Angeles |\n",
"\n",
"Tokens: 17\n",
"Has prefix: False\n",
"\n"
]
}
],
"source": [
"chunker_with_omit = LineBasedTokenChunker(\n",
"    tokenizer=tokenizer,\n",
"    prefix=prefix,\n",
"    omit_prefix_on_overflow=True,  # Omit prefix for lines that would overflow\n",
")\n",
"\n",
"chunks_with_omit = chunker_with_omit.chunk_text(lines)\n",
"\n",
"print(\"=\" * 60)\n",
"print(\"WITH omit_prefix_on_overflow (keeps lines intact)\")\n",
"print(\"=\" * 60)\n",
"print(f\"\\nTotal chunks: {len(chunks_with_omit)}\\n\")\n",
"\n",
"for i, chunk in enumerate(chunks_with_omit, 1):\n",
"    print(f\"--- Chunk {i} ---\")\n",
"    print(chunk)\n",
"    print(f\"Tokens: {tokenizer.count_tokens(chunk)}\")\n",
"    print(f\"Has prefix: {chunk.startswith(prefix)}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 3: Chunking a DoclingDocument\n",
"\n",
"The `LineBasedTokenChunker` can also be used directly with `DoclingDocument` objects."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total chunks: 11\n",
"\n",
"=== Chunk 1 ===\n",
"Text: # IBM\n",
"\n",
"International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over ...\n",
"Tokens: 57\n",
"Doc items: 12\n",
"\n",
"=== Chunk 2 ===\n",
"Text: IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for ...\n",
"Tokens: 99\n",
"Doc items: 12\n",
"\n",
"=== Chunk 3 ===\n",
"Text: systems. During the 1960s and 1970s, the IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and ...\n",
"Tokens: 100\n",
"Doc items: 12\n",
"\n"
]
}
],
"source": [
"from docling.document_converter import DocumentConverter\n",
"\n",
"# Convert a document\n",
"converter = DocumentConverter()\n",
"result = converter.convert(\"../../tests/data/md/wiki.md\")\n",
"doc = result.document\n",
"\n",
"# Create chunker\n",
"tokenizer = HuggingFaceTokenizer(\n",
"    tokenizer=AutoTokenizer.from_pretrained(\"sentence-transformers/all-MiniLM-L6-v2\"),\n",
"    max_tokens=100,\n",
")\n",
"\n",
"chunker = LineBasedTokenChunker(\n",
"    tokenizer=tokenizer,\n",
"    prefix=\"\",  # No prefix for general documents\n",
")\n",
"\n",
"# Chunk the document\n",
"chunks = list(chunker.chunk(doc))\n",
"\n",
"print(f\"Total chunks: {len(chunks)}\\n\")\n",
"\n",
"# Display first few chunks\n",
"for i, chunk in enumerate(chunks[:3], 1):\n",
"    print(f\"=== Chunk {i} ===\")\n",
"    print(\n",
"        f\"Text: {chunk.text[:200]}...\"\n",
"        if len(chunk.text) > 200\n",
"        else f\"Text: {chunk.text}\"\n",
"    )\n",
"    print(f\"Tokens: {tokenizer.count_tokens(chunk.text)}\")\n",
"    print(f\"Doc items: {len(chunk.meta.doc_items)}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 4: Handling Large Prefixes\n",
"\n",
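"When a prefix exceeds the `max_tokens` limit, the chunker issues a warning and splits the prefix into multiple chunks. These prefix chunks are emitted once, ahead of the content chunks, instead of being repeated in every chunk."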
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Large prefix token count: 130 tokens\n",
"Max tokens allowed: 25 tokens\n",
"\n",
"⚠️ Warning issued:\n",
" Chunks prefix is too long (130 tokens) for chunk size 25. It will be split into multiple chunks and only included in the first chunk(s). Consider increasing max_tokens to accommodate the full prefix in each chunk.\n",
"\n",
"Number of prefix chunks: 6\n",
"Prefix len (for single chunk): 0\n",
"\n",
"Prefix chunks:\n",
" Chunk 1: 25 tokens\n",
" Content: \n",
"This is a very long table header that contains a lot of information This is a very long table heade...\n",
"\n",
" Chunk 2: 25 tokens\n",
" Content: \n",
" information This is a very long table header that contains a lot of information This is a very lon...\n",
"\n",
" Chunk 3: 25 tokens\n",
" Content: \n",
" of information This is a very long table header that contains a lot of information This is a very ...\n",
"\n",
" Chunk 4: 24 tokens\n",
" Content: \n",
" lot of information This is a very long table header that contains a lot of information This is a v...\n",
"\n",
" Chunk 5: 24 tokens\n",
" Content: \n",
" contains a lot of information This is a very long table header that contains a lot of information ...\n",
"\n",
" Chunk 6: 7 tokens\n",
" Content: header that contains a lot of information \n",
"\n",
"Total chunks (including prefix chunks): 7\n",
"Content chunks: 1\n",
"\n",
"Chunk 1 [PREFIX CHUNK]:\n",
" Content: \n",
"This is a very long table header that contains a lot of information This is a very long table heade...\n",
" Tokens: 25\n",
"\n",
"Chunk 2 [PREFIX CHUNK]:\n",
" Content: \n",
" information This is a very long table header that contains a lot of information This is a very lon...\n",
" Tokens: 25\n",
"\n",
"Chunk 3 [PREFIX CHUNK]:\n",
" Content: \n",
" of information This is a very long table header that contains a lot of information This is a very ...\n",
" Tokens: 25\n",
"\n",
"Chunk 4 [PREFIX CHUNK]:\n",
" Content: \n",
" lot of information This is a very long table header that contains a lot of information This is a v...\n",
" Tokens: 24\n",
"\n",
"Chunk 5 [PREFIX CHUNK]:\n",
" Content: \n",
" contains a lot of information This is a very long table header that contains a lot of information ...\n",
" Tokens: 24\n",
"\n",
"Chunk 6 [PREFIX CHUNK]:\n",
" Content: header that contains a lot of information \n",
" Tokens: 7\n",
"\n",
"Chunk 7 [CONTENT CHUNK]:\n",
" Content: Row 1: Some data here\n",
"Row 2: More data here\n",
"Row 3: Even more data\n",
"\n",
" Tokens: 18\n",
"\n"
]
}
],
"source": [
"import warnings\n",
"\n",
"# Create a very long prefix that exceeds max_tokens\n",
"tokenizer = HuggingFaceTokenizer(\n",
"    tokenizer=AutoTokenizer.from_pretrained(\"sentence-transformers/all-MiniLM-L6-v2\"),\n",
"    max_tokens=25,  # Small limit\n",
")\n",
"\n",
"large_prefix = (\n",
"    \"This is a very long table header that contains a lot of information \" * 10\n",
")\n",
"\n",
"print(f\"Large prefix token count: {tokenizer.count_tokens(large_prefix)} tokens\")\n",
"print(f\"Max tokens allowed: {tokenizer.get_max_tokens()} tokens\\n\")\n",
"\n",
"# Create chunker with large prefix - will trigger warning\n",
"with warnings.catch_warnings(record=True) as w:\n",
"    warnings.simplefilter(\"always\")\n",
"\n",
"    chunker_large = LineBasedTokenChunker(\n",
"        tokenizer=tokenizer,\n",
"        prefix=large_prefix,\n",
"    )\n",
"\n",
"    if w:\n",
"        print(\"⚠️ Warning issued:\")\n",
"        print(f\" {w[0].message}\\n\")\n",
"\n",
"print(f\"Number of prefix chunks: {len(chunker_large.prefix_chunks)}\")\n",
"print(f\"Prefix len (for single chunk): {chunker_large.prefix_len}\\n\")\n",
"\n",
"# Show the prefix chunks\n",
"print(\"Prefix chunks:\")\n",
"for i, prefix_chunk in enumerate(chunker_large.prefix_chunks, 1):\n",
"    preview = prefix_chunk[:100] + \"...\" if len(prefix_chunk) > 100 else prefix_chunk\n",
"    print(f\" Chunk {i}: {tokenizer.count_tokens(prefix_chunk)} tokens\")\n",
"    print(f\" Content: {preview}\\n\")\n",
"\n",
"# Test chunking with the large prefix\n",
"lines = [\n",
"    \"Row 1: Some data here\\n\",\n",
"    \"Row 2: More data here\\n\",\n",
"    \"Row 3: Even more data\\n\",\n",
"]\n",
"\n",
"chunks_large = chunker_large.chunk_text(lines)\n",
"\n",
"print(f\"Total chunks (including prefix chunks): {len(chunks_large)}\")\n",
"print(f\"Content chunks: {len(chunks_large) - len(chunker_large.prefix_chunks)}\\n\")\n",
"\n",
"# Display all chunks\n",
"for i, chunk in enumerate(chunks_large, 1):\n",
"    is_prefix_chunk = i <= len(chunker_large.prefix_chunks)\n",
"    chunk_type = \"[PREFIX CHUNK]\" if is_prefix_chunk else \"[CONTENT CHUNK]\"\n",
"\n",
"    print(f\"Chunk {i} {chunk_type}:\")\n",
"    preview = chunk[:100] + \"...\" if len(chunk) > 100 else chunk\n",
"    print(f\" Content: {preview}\")\n",
"    print(f\" Tokens: {tokenizer.count_tokens(chunk)}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"### When to use `LineBasedTokenChunker`\n",
"\n",
"- You need to preserve line boundaries (tables, code, logs)\n",
"- You want to add context (headers, metadata) to each chunk\n",
"- You're working with structured text where lines have semantic meaning\n",
"- You need fine-grained control over how lines are split\n",
"\n",
"### When to use `omit_prefix_on_overflow=True`\n",
"\n",
"- Working with wide tables or long prefixes\n",
"- Token budget is limited\n",
"- Line integrity is more important than consistent formatting\n",
"- You can handle chunks without the prefix in downstream processing\n",
"\n",
"### When to use `omit_prefix_on_overflow=False` (default)\n",
"\n",
"- You need the prefix in every chunk for context\n",
"- Consistent formatting is critical\n",
"- Downstream processing requires the prefix to understand the content\n",
"- Working with narrow content where the prefix doesn't cause overflow"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
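The line-packing behaviour that the notebook above describes (keep whole lines together, repeat a prefix in every chunk, optionally drop the prefix when only that lets a line fit) can be sketched without any dependencies. This is a minimal illustration and not the `LineBasedTokenChunker` implementation: `count_tokens` and `pack_lines` are hypothetical names, a whitespace split stands in for a real tokenizer, the prefix is assumed to fit within `max_tokens`, and a line that is too long on its own is emitted as an oversized chunk rather than split.

```python
def count_tokens(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def pack_lines(lines, prefix="", max_tokens=10, omit_prefix_on_overflow=False):
    """Greedily pack whole lines into chunks, prepending `prefix` to each chunk.

    Illustrative sketch only: assumes count_tokens(prefix) <= max_tokens
    and never splits a line.
    """
    prefix_len = count_tokens(prefix)
    chunks = []
    body, body_len, with_prefix = "", 0, True

    for line in lines:
        line_len = count_tokens(line)
        budget = max_tokens - (prefix_len if with_prefix else 0)
        if body and body_len + line_len > budget:
            # The line does not fit: close the current chunk and start a new one.
            chunks.append((prefix if with_prefix else "") + body)
            body, body_len, with_prefix = "", 0, True
        if not body:
            fits_alone = line_len <= max_tokens
            fits_with_prefix = prefix_len + line_len <= max_tokens
            # Drop the prefix only when doing so is what makes the line fit.
            with_prefix = not (
                omit_prefix_on_overflow and fits_alone and not fits_with_prefix
            )
        body += line
        body_len += line_len

    if body:
        chunks.append((prefix if with_prefix else "") + body)
    return chunks


header = "name age\n"  # 2 tokens under the whitespace counter
wide_rows = ["a b c d\n", "x y\n"]  # 4 tokens and 2 tokens

# Default: the header is always kept, so the first chunk exceeds the limit of 5.
print(pack_lines(wide_rows, prefix=header, max_tokens=5))

# With omission enabled, the wide row is emitted on its own, without the header.
print(pack_lines(wide_rows, prefix=header, max_tokens=5, omit_prefix_on_overflow=True))
```

The sketch chooses between keeping and dropping the prefix once per chunk, at the moment the chunk is started, which keeps the bookkeeping simple. The real chunker may make this decision differently.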
+1
View File
@@ -105,6 +105,7 @@ nav:
- ✂️ Serialization & chunking:
- examples/serialization.ipynb
- examples/hybrid_chunking.ipynb
- examples/line_based_chunking.ipynb
- examples/advanced_chunking_and_serialization.ipynb
- 📤 Information extraction:
- examples/extraction.ipynb