docling/tests/data/latex/example_01.tex
Aditya Sasidhar e6ccb8b2c1 feat: added support for parsing LaTeX (.tex) documents (#2890)
* feat: added support for parsing LaTeX (.tex) documents

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat: implement PR #2890 feedback for LaTeX backend

- Add text formatting options (bold, italic, underline) for LaTeX macros
- Enhance image embedding with PIL and ImageRef.from_pil()
- Refactor list processing to use GroupItem structure
- Refactor bibliography to use GroupItem structure
- Add nested list test coverage
- All tests passing (39/39), all linters passing

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135b431d489cd8bf3982524505a0bbd8696d

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat: enhance latex backend with robustness fixes and ground truth

- Add custom macro expansion for improved text quality
- Fix preamble filtering to remove metadata garbage
- Support recursive \input{} and \include{} file loading
- Organize test data into subdirectories for complex papers
- Add full end-to-end ground truth for 4 major arXiv papers (Attention, Mistral, DeepSeek, OTSL)
- Pass all 41 unit tests and pre-commit checks

Addresses @cau-git feedback for ground-truth data.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: minor formatting in test file

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat: enhance LaTeX backend with robust math and figure support

- Fixed re.error: bad escape in macro expansion by using lambda in re.sub
- Fixed sentences breaking at inline math ($) by preserving it within paragraphs
- Improved figure environment with proper grouping and structured representation
- Fixed crashes on documents starting with % comments
- Added comprehensive unit tests and updated all ground truth data

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
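
The `re.error: bad escape` fix noted above can be illustrated with a minimal sketch. It is not the backend's actual code: `expand_macros` and the `macros` mapping are hypothetical names chosen for illustration. When the replacement passed to `re.sub` is a string, Python parses backslash sequences in it as a template, so a LaTeX body such as `\mathbb{R}` raises `re.error`. A callable replacement is inserted literally.

```python
import re


def expand_macros(text: str, macros: dict[str, str]) -> str:
    """Replace each \\name with its body (hypothetical helper, for illustration)."""
    for name, body in macros.items():
        # Match \name only when it is not a prefix of a longer macro name.
        pattern = re.compile(r"\\" + re.escape(name) + r"(?![A-Za-z])")
        # A callable's return value is used verbatim, with no template parsing,
        # so backslashes in the body are safe.
        text = pattern.sub(lambda _match, body=body: body, text)
    return text


macros = {"R": r"\mathbb{R}"}  # assumed example macro definition
source = r"Let $x \in \R$."

# Naive version: the string replacement is parsed as a template and "\m" is rejected.
naive_failed = False
try:
    re.sub(r"\\R", macros["R"], source)
except re.error:
    naive_failed = True

expanded = expand_macros(source, macros)
print(expanded)
```

Escaping the replacement by doubling its backslashes would also work, but a callable avoids having to escape every macro body by hand.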

* WIP: saving work for laptop migration

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: resolve most line-breaking issues; some still remain

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: generalized LaTeX macro parsing and robustness improvements

This commit addresses several issues with LaTeX parsing:
- Correctly handle unknown macros (like \ion{N}{2}) inline to avoid line breaks.
- Fix extraction of structural macros (section, caption, etc.) vs text-only groups.
- Address PR feedback regarding inline math spacing and splitting.
- Regenerate ground truth files reflecting these improvements.

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: apply automatic formatting fixes

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: fix ruff linter and formatter errors

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: typing issues identified by mypy

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: apply formatting fixes to tests

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: update groundtruth files for latex backend

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: resolve the awkward line-breaking issue caused by mishandling the text buffer

* fix: add the ground truth files missing from the previous commit

* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: 7e032635ef
I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: aeba688384

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: run pre-commit as requested

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

---------

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
2026-02-10 15:13:09 +01:00


\documentclass{article}
\title{Sample Document}
\author{Test Author}
\begin{document}
\maketitle
\section{Introduction}
This is the first paragraph of the introduction.
\section{Background}
Some background information here with \textbf{bold} and \textit{italic} text.
\begin{itemize}
\item First item in unordered list
\item Second item in unordered list
\end{itemize}
\begin{enumerate}
\item First item in ordered list
\item Second item in ordered list
\end{enumerate}
\subsection{Nested Section}
A subsection with more content.
\end{document}