mirror of
https://github.com/docling-project/docling.git
synced 2026-05-17 13:10:38 +00:00
e6ccb8b2c1
* feat: added support for parsing LaTeX (.tex) documents

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat: implement PR #2890 feedback for LaTeX backend

  - Add text formatting options (bold, italic, underline) for LaTeX macros
  - Enhance image embedding with PIL and ImageRef.from_pil()
  - Refactor list processing to use GroupItem structure
  - Refactor bibliography to use GroupItem structure
  - Add nested list test coverage
  - All tests passing (39/39), all linters passing

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

  I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135b431d489cd8bf3982524505a0bbd8696d

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

  I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: f19f135b431d489cd8bf3982524505a0bbd8696d

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat: enhance latex backend with robustness fixes and ground truth

  - Add custom macro expansion for improved text quality
  - Fix preamble filtering to remove metadata garbage
  - Support recursive \input{} and \include{} file loading
  - Organize test data into subdirectories for complex papers
  - Add full end-to-end ground truth for 4 major arXiv papers (Attention, Mistral, DeepSeek, OTSL)
  - Pass all 41 unit tests and pre-commit checks

  Addresses @cau-git feedback for ground-truth data.

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: minor formatting in test file

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* feat: enhance LaTeX backend with robust math and figure support

  - Fixed re.error: bad escape in macro expansion by using lambda in re.sub
  - Fixed sentences breaking at inline math ($) by preserving it within paragraphs
  - Improved figure environment with proper grouping and structured representation
  - Fixed crashes on documents starting with % comments
  - Added comprehensive unit tests and updated all ground truth data

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* WIP: saving work for laptop migration

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* got rid of the line breaking issues, though some still exist

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: generalized LaTeX macro parsing and robustness improvements

  This commit addresses several issues with LaTeX parsing:

  - Correctly handle unknown macros (like \ion{N}{2}) inline to avoid line breaks.
  - Fix extraction of structural macros (section, caption, etc.) vs text-only groups.
  - Address PR feedback regarding inline math spacing and splitting.
  - Regenerate ground truth files reflecting these improvements.

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: apply automatic formatting fixes

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: fix ruff linter and formatter errors

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: typing issues identified by mypy

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* style: apply formatting fixes to tests

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fix: update groundtruth files for latex backend

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* fixed the awkward line breaking issue, which came from not accounting for the text buffer

* I forgot to add the groundtruth, so here it is

* DCO Remediation Commit for Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

  I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: 7e032635ef
  I, Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>, hereby add my Signed-off-by to this commit: aeba688384

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

* Ran the pre-commit checks as requested

  Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>

---------

Signed-off-by: Aditya Sasidhar <telikicherlaadityasasidhar@gmail.com>
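The commit notes mention fixing "re.error: bad escape" in macro expansion by passing a lambda to re.sub. The sketch below illustrates why that helps; it is not the backend's actual code. The macro table `MACROS` and the helper `expand_macros` are hypothetical names chosen for illustration. When the replacement is a string, `re.sub` interprets backslash escapes in it, so a LaTeX body such as `\mathbb{R}` fails on `\m`. A callable replacement is inserted literally.

```python
import re

# Hypothetical macro table (assumption): macro name -> replacement body.
MACROS = {"R": r"\mathbb{R}", "eps": r"\varepsilon"}


def expand_macros(text, macros):
    """Replace each \\name with its body, treating the body as literal text."""
    for name, body in macros.items():
        # Match \name only when it is not the prefix of a longer macro name.
        pattern = re.compile(r"\\" + re.escape(name) + r"(?![A-Za-z])")
        # A callable replacement bypasses escape processing in the body.
        text = pattern.sub(lambda _match, body=body: body, text)
    return text


expanded = expand_macros(r"Let $x \in \R$ and $\eps > 0$.", MACROS)
```

Passing the body through `re.escape` would not be a substitute here, since that function is meant for patterns rather than replacement strings; the callable form sidesteps escape handling entirely.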
27 lines
551 B
TeX
Vendored
\documentclass{article}
\title{Sample Document}
\author{Test Author}
\begin{document}
\maketitle

\section{Introduction}
This is the first paragraph of the introduction.

\section{Background}
Some background information here with \textbf{bold} and \textit{italic} text.

\begin{itemize}
\item First item in unordered list
\item Second item in unordered list
\end{itemize}

\begin{enumerate}
\item First item in ordered list
\item Second item in ordered list
\end{enumerate}

\subsection{Nested Section}
A subsection with more content.

\end{document}