mirror of
https://github.com/docling-project/docling.git
synced 2026-05-17 13:10:38 +00:00
101233ebe2
* fix(latex): fully unwrap deeply nested formatting macros
Two related bugs when formatting macros are nested:
1. `\textcolor{color}{...}` extracted the color name alongside the
text content because `_nodes_to_text` fell through to the generic
else branch, which concatenates all arguments. E.g.
`\section{\textcolor{blue}{\textbf{[SEP]}}}` produced heading text
"blue [SEP]" instead of "[SEP]".
2. `\textsc`, `\textsf`, `\textrm`, `\textnormal`, `\mbox` and
`\textcolor`/`\colorbox` are listed in MACROS_STRUCTURAL, so when
one appeared mid-sentence, `_process_macro_node_inline` flushed the
text buffer and called `_process_macro`, which creates a new doc
node. This split inline paragraphs into fragments.
Fix:
- Add explicit handlers for MACROS_TEXT_STYLE and textcolor/colorbox
in `_process_macro_node_inline` (before the MACROS_STRUCTURAL flush
path) so they are accumulated inline like MACROS_TEXT_FORMATTING.
- Add matching handlers in `_nodes_to_text` so color names are
skipped and only the text-content argument is returned.
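A minimal sketch of bug 1 and its fix, using a toy node model rather than
the backend's real parser nodes. The `Chars` and `Macro` classes, the two
macro sets and both function names are hypothetical stand-ins for
illustration; only the behavior (concatenating every argument versus
keeping the last one) mirrors the description above.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Chars:
    text: str


@dataclass
class Macro:
    name: str
    # Each argument is itself a list of nodes, e.g. {blue} and {\textbf{...}}.
    args: List[List["Node"]] = field(default_factory=list)


Node = Union[Chars, Macro]

# Hypothetical stand-ins for the backend's macro classifications.
TEXT_STYLE = {"textbf", "textit", "textsc", "textsf", "textrm", "textnormal", "mbox"}
COLOR_MACROS = {"textcolor", "colorbox"}


def nodes_to_text_generic(nodes: List[Node]) -> str:
    """Old behavior: the generic branch concatenates every macro argument."""
    parts = []
    for node in nodes:
        if isinstance(node, Chars):
            parts.append(node.text)
        else:
            parts.extend(nodes_to_text_generic(arg) for arg in node.args)
    return " ".join(p for p in parts if p)


def nodes_to_text_fixed(nodes: List[Node]) -> str:
    """Fixed behavior: style and color macros contribute only their last argument."""
    parts = []
    for node in nodes:
        if isinstance(node, Chars):
            parts.append(node.text)
        elif node.name in TEXT_STYLE or node.name in COLOR_MACROS:
            if node.args:
                # The text content is the last argument; the color name is skipped.
                parts.append(nodes_to_text_fixed(node.args[-1]))
        else:
            parts.extend(nodes_to_text_fixed(arg) for arg in node.args)
    return " ".join(p for p in parts if p)


# Toy tree for the contents of \section{\textcolor{blue}{\textbf{[SEP]}}}
heading = [
    Macro("textcolor", [[Chars("blue")], [Macro("textbf", [[Chars("[SEP]")]])]])
]

print(nodes_to_text_generic(heading))  # blue [SEP]
print(nodes_to_text_fixed(heading))    # [SEP]
```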
Fixes #3207
Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
* fix(latex): fully unwrap deeply nested formatting macros
Two related bugs when formatting macros are nested inside each other:
1. `\textcolor{color}{...}` extracted the color name alongside the
text content because `_nodes_to_text` fell through to the generic
else branch, which concatenates all arguments. E.g.
`\section{\textcolor{blue}{\textbf{[SEP]}}}` produced heading text
"blue [SEP]" instead of "[SEP]".
2. `\textsc`, `\textsf`, `\textrm`, `\textnormal`, `\mbox` and
`\textcolor`/`\colorbox` are listed in MACROS_STRUCTURAL, so when
one appeared mid-sentence, `_process_macro_node_inline` flushed the
text buffer and called `_process_macro`, which creates a new doc
node. This split inline paragraphs into fragments.
Fix:
- Add MACROS_COLOR_INLINE constant for textcolor/colorbox to keep
all macro classifications in one place (constants.py).
- Add explicit handlers for MACROS_TEXT_STYLE and MACROS_COLOR_INLINE
in `_process_macro_node_inline` (before the MACROS_STRUCTURAL flush
path) so they are accumulated inline like MACROS_TEXT_FORMATTING.
- Merge the identical MACROS_TEXT_FORMATTING and MACROS_TEXT_STYLE
branches in `_nodes_to_text` into a single branch.
- Use argnlist[-1] instead of reversed() iteration for
MACROS_COLOR_INLINE since the text content is always the last arg,
consistent with _extract_macro_arg.
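The ordering described in the fix list (inline handlers checked before the
MACROS_STRUCTURAL flush path) can be sketched with a toy model. This is an
illustration under stated assumptions, not the backend's code: `Chars`,
`Macro`, `flatten` and `process_inline` are hypothetical names, the set
contents are abbreviated, and the `inline_first` flag exists only to
compare the old and new ordering side by side.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Chars:
    text: str


@dataclass
class Macro:
    name: str
    args: List[List["Node"]] = field(default_factory=list)


Node = Union[Chars, Macro]

# Toy versions of the classifications; the real sets live in constants.py.
MACROS_TEXT_STYLE = {"textsc", "textsf", "textrm", "textnormal", "mbox"}
MACROS_COLOR_INLINE = {"textcolor", "colorbox"}
MACROS_STRUCTURAL = MACROS_TEXT_STYLE | MACROS_COLOR_INLINE | {"section"}


def flatten(nodes: List[Node]) -> str:
    """Return the text of a node list, keeping only each macro's last argument."""
    out = []
    for node in nodes:
        if isinstance(node, Chars):
            out.append(node.text)
        elif node.args:
            out.append(flatten(node.args[-1]))
    return "".join(out)


def process_inline(nodes: List[Node], inline_first: bool) -> List[str]:
    """Return the text of each emitted doc item for one paragraph."""
    items: List[str] = []
    buffer: List[str] = []

    def flush() -> None:
        text = "".join(buffer).strip()
        if text:
            items.append(text)
        buffer.clear()

    for node in nodes:
        last_arg = flatten(node.args[-1]) if isinstance(node, Macro) and node.args else ""
        if isinstance(node, Chars):
            buffer.append(node.text)
        elif inline_first and node.name in (MACROS_TEXT_STYLE | MACROS_COLOR_INLINE):
            # New path: accumulate inline, no flush, no new doc item.
            buffer.append(last_arg)
        elif node.name in MACROS_STRUCTURAL:
            # Structural path: flush pending text, then emit a separate item.
            flush()
            items.append(last_arg)
        else:
            buffer.append(last_arg)
    flush()
    return items


paragraph = [
    Chars("The "),
    Macro("textsc", [[Chars("Bert")]]),
    Chars(" model uses "),
    Macro("textcolor", [[Chars("red")], [Chars("masking")]]),
    Chars("."),
]

print(process_inline(paragraph, inline_first=False))
# ['The', 'Bert', 'model uses', 'masking', '.']
print(process_inline(paragraph, inline_first=True))
# ['The Bert model uses masking.']
```

Indexing the last argument directly, rather than iterating in reverse,
keeps the color-name argument out of the result without a loop.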
Fixes #3207
Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
* refactor(latex): extract _macro_node_to_text to reduce complexity
Split the macro-handling branch of `_nodes_to_text` into a dedicated
`_macro_node_to_text` helper so that cyclomatic complexity stays within
the ruff C901 limit (was 31, now below 30 for both methods).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
* fix(latex): migrate nested-formatting test to tests/test_latex/
Upstream reorganised all latex tests from tests/test_backend_latex.py
into tests/test_latex/. Move test_latex_nested_formatting_macros to
tests/test_latex/test_macros.py and fix ruff-reported style nits.
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
---------
Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
Co-authored-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>