Commit Graph

69 Commits

Author SHA1 Message Date
geoHeil 5b1df788ef ci: tighten pre-commit guardrails (#3346)
* ci: tighten pre-commit guardrails

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: validate pre-commit guardrail changes

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: switch hook validation to prek

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: exempt active slim plan from max-lines

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: move max-lines config under github

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: fail on uncovered tach modules

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: ignore generated docs in max-lines check

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: clarify local validation tasks

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* docs: refine agent instructions

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: replace mypy with ty

(cherry picked from commit 382afbde8f00abfaeba95ea9c8e9cc603f27a2d9)
Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

* ci: replace justfile with makefile

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>

---------

Signed-off-by: Georg Heiler <georg.kf.heiler@gmail.com>
2026-05-08 15:07:11 +02:00
yasqdb df5fbc3858 docs(readme): improve structure and clarity (#3366)
* Fix README formatting/issues

Signed-off-by: Yassine ben zineb <yassine.benzineb@medtech.tn>

* Fix README formatting/issues

Signed-off-by: Yassine ben zineb <yassine.benzineb@medtech.tn>

* Fix README formatting/issues

Signed-off-by: Yassine ben zineb <yassine.benzineb@medtech.tn>

* Fix README formatting/issues

Signed-off-by: Yassine ben zineb <yassine.benzineb@medtech.tn>

* Fix README formatting/issues

Signed-off-by: Yassine ben zineb <yassine.benzineb@medtech.tn>

* Quickstart + 'what is docling'

Signed-off-by: Yassine ben zineb <yassine.benzineb@medtech.tn>

* fix: line92

Signed-off-by: Yassine ben zineb <yassine.benzineb@medtech.tn>

---------

Signed-off-by: Yassine ben zineb <yassine.benzineb@medtech.tn>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2026-05-06 14:06:32 +02:00
Said Gürbüz f5fa294e17 chore(readme): fix broken Apify badge (typo) (#3296)
fix(readme): correct malformed Apify actor badge URL

Replace duplicate '?' with '&' in Apify badge query string.
The invalid URL caused the badge to render as a broken '?' icon.
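
The fix described above can be sketched in Python. This is only an illustration: the helper name and the example URL are hypothetical, since the actual badge URL is not shown in this log.

```python
def fix_duplicate_query_separator(url: str) -> str:
    """Keep the first '?' as the query separator and turn any later '?' into '&'."""
    head, sep, tail = url.partition("?")
    if not sep:
        # No query string at all, nothing to repair.
        return url
    return head + "?" + tail.replace("?", "&")


# Hypothetical malformed badge URL with a duplicate '?'.
broken = "https://example.com/badge?actor=demo?type=runs"
print(fix_duplicate_query_separator(broken))
```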

Signed-off-by: Saidgurbuz <said.guerbuez@inf.ethz.ch>
2026-04-14 16:28:41 +02:00
Peter W. J. Staar d5af473a70 docs: Chart understanding in README (#3253)
Update README.md

Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>
2026-04-09 12:11:40 +02:00
Peter W. J. Staar 96d7c7ec79 feat: route plain-text and Quarto/R Markdown files to the Markdown backend (#3161)
* feat: route plain-text and Quarto/R Markdown files to the Markdown backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the README and index.md

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* _mime_from_extension: added a check for extensions in the intersection of the XML_USPTO and MD extension lists (currently just txt).
  These ambiguous extensions are skipped, leaving mime=None, so the full content-probing chain
  (_detect_html_xhtml → _detect_csv → text/plain fallback) runs instead of prematurely assigning text/markdown.

  _guess_from_content: removed the `elif InputFormat.MD in formats` MD fallback for text/plain content. Unrecognised .txt content
  now correctly returns None. MD is only returned for explicit MIME types (text/markdown, text/x-markdown), which come from
  unambiguous extensions like .md, .text, .qmd, .rmd.
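
A minimal sketch of the extension check described above, assuming illustrative extension sets. The names `MD_EXTENSIONS`, `XML_USPTO_EXTENSIONS`, and `mime_from_extension` are stand-ins for this example and are not the actual docling code.

```python
from typing import Optional

# Illustrative extension lists (assumptions, not the real docling tables).
MD_EXTENSIONS = {"md", "txt", "text", "qmd", "rmd"}
XML_USPTO_EXTENSIONS = {"xml", "txt"}

# Extensions claimed by both formats cannot be resolved by name alone.
AMBIGUOUS_EXTENSIONS = MD_EXTENSIONS & XML_USPTO_EXTENSIONS


def mime_from_extension(ext: str) -> Optional[str]:
    """Return a MIME type for unambiguous Markdown extensions, else None."""
    ext = ext.lower().lstrip(".")
    if ext in AMBIGUOUS_EXTENSIONS:
        # Leave the MIME type unset so content probing can decide later.
        return None
    if ext in MD_EXTENSIONS:
        return "text/markdown"
    return None


for name in (".qmd", ".txt", ".pdf"):
    print(name, mime_from_extension(name))
```

Returning None for ambiguous extensions defers the decision to whatever content-based detection runs afterwards, which is the behaviour the commit message describes.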

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* ran pre-commit

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2026-03-20 16:38:16 +01:00
Cesar Berrospi Ramis 1eb5c21dab docs: add XBRL conversion example notebook and update feature listings (#3039)
docs(xbrl): add notebook for XBRL parsing

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-27 16:09:19 +01:00
Cesar Berrospi Ramis d276e60561 feat: export to WebVTT format (#3036)
* style(cli): apply python 3.10+ syntax, remove unnecessary imports

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* feat(vtt): export of DoclingDocument to WebVTT format

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* build: pin docling-core version 2.66.0

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2026-02-27 14:22:52 +01:00
Peter W. J. Staar 704ef0afba docs: Add LaTeX and WebVTT as supported types (#2974)
* Update README.md

Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>

* Fix spelling of 'WebVTT' in README

Corrected the spelling of 'WebVTT' in the features list.

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* updated the index

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2026-02-10 19:59:23 +01:00
Michele Dolfi 7f386587ed feat: Drop support for Python 3.9 (#2905)
* chore: drop support for Python 3.9

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* disable CI for python 3.9

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix: test bump version

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add chore to the changelog but without bumping the version

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* force newer langchain-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* fix linter for 3.10

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add python 3.9 removal notice

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* avoid upgrading docling-core

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* restore semantic release settings

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2026-01-23 10:15:58 +01:00
Michele Dolfi dd03b53117 docs: discord badge with join link (#2473)
* add discord link

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Add Discord link to social section in mkdocs.yml

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* Add Discord link to getting started documentation

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-10-16 10:13:50 +02:00
Cesar Berrospi Ramis 46efaaefee feat: add a backend parser for WebVTT files (#2288)
* feat: add a backend parser for WebVTT files

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: update README with VTT support

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: add description to supported formats

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* chore: upgrade docling-core to unescape WebVTT in markdown

Pin the new release of docling-core 2.48.2.
Do not escape HTML reserved characters when exporting WebVTT documents to markdown.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* test: add missing copyright notice

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-09-22 15:24:34 +02:00
Christoph Auer 17afb664d0 feat: Add granite-docling model (#2272)
* adding granite-docling preview

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the model specs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* typo

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use granite-docling and add to the model downloader

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docs and README

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update final repo_ids for GraniteDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update final repo_ids for GraniteDocling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix model name in CLI usage example

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Fix VLM model name in README.md

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-09-17 15:15:49 +02:00
Roy Derks e5cd7020bd docs: Add instructions for using Docling with MCP to README (#2219)
* docs: Add instructions for using Docling with MCP to README

* DCO Remediation Commit for Roy Derks <10717410+royderks@users.noreply.github.com>

Signed-off-by: Roy Derks <roy.derks@ibm.com>

* DCO Remediation Commit for Roy Derks <10717410+royderks@users.noreply.github.com>

I, Roy Derks <10717410+royderks@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 4b9ba1d0ef

Signed-off-by: Roy Derks <roy.derks@ibm.com>

* docs: reorganize documentation on MCP server

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

* docs: align README with documentation index page

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

---------

Signed-off-by: Roy Derks <roy.derks@ibm.com>
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
Co-authored-by: Roy Derks <roy.derks@ibm.com>
Co-authored-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
2025-09-10 10:02:28 +02:00
Panos Vagenas a9f41b088e docs: add information extraction example (#2199)
* docs: add information extraction example

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update README

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* minor typo

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update README

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-09-05 11:27:09 +02:00
Michele Dolfi 7b5f86098d docs: add chat with dosu (#1984)
add chat with dosu

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-07-24 11:07:36 +02:00
Peter W. J. Staar f3ae3029b8 docs: update readme and add ASR example (#1836)
* updated the README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added minimal_asr_pipeline

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Updated README and added ASR example

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Updated docs.index.md

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated CI and mkdocs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added link to existing audio file

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added link to existing audio file

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatting

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-06-23 18:55:16 +02:00
Michele Dolfi cdd401847a feat: simplify dependencies, switch to uv (#1700)
* refactor with uv

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* constraints for onnxruntime

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* more constraints

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-03 15:18:54 +02:00
Peter W. J. Staar cfdf4cea25 feat: new vlm-models support (#1570)
* feat: adding new vlm-models support

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the transformers

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* got microsoft/Phi-4-multimodal-instruct to work

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* working on vlm's

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactoring the VLM part

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* all working, now serious refactoring necessary

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactoring the download_model

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the formulate_prompt

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* pixtral 12b runs via MLX and native transformers

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the VlmPredictionToken

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* refactoring minimal_vlm_pipeline

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the MyPy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added pipeline_model_specializations file

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* need to get Phi4 working again ...

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* finalising last points for vlms support

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the pipeline for Phi4

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* streamlining all code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixing the tests

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the html backend to the VLM pipeline

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the static load_from_doctags

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* restore stable imports

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use AutoModelForVision2Seq for Pixtral and review example (including rename)

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove unused value

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* refactor instances of VLM models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* skip compare example in CI

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use lowercase and uppercase only

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add new minimal_vlm example and refactor pipeline_options_vlm_model for cleaner import

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename pipeline_vlm_model_spec

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* move more arguments to options and simplify model init

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add supported_devices

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove not-needed function

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* exclude minimal_vlm

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* missing file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add message for transformers version

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename to specs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use module import and remove MLX from non-darwin

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove hf_vlm_model and add extra_generation_args

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use single HF VLM model class

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* remove torch type

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add docs for vision models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2025-06-02 17:01:06 +02:00
Michele Dolfi 64918a81ac docs: Add OpenSSF Best Practices badge (#1430)
* docs: add openssf badge

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add badge to docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-04-22 11:23:28 +02:00
Maxim Lysak 1c26769785 feat(SmolDocling): Support MLX acceleration in VLM pipeline (#1199)
* Initial implementation to support MLX for VLM pipeline and SmolDocling

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* mlx_model unit

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Add CLI choices for VLM pipeline and model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Initial implementation to support MLX for VLM pipeline and SmolDocling

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* mlx_model unit

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Add CLI choices for VLM pipeline and model

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated minimal vlm pipeline example

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* make vlm_pipeline python3.9 compatible

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed extract_text_from_backend definition

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated README

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated example

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated documentation

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* corrections in the documentation

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cosmetic changes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-19 15:38:54 +01:00
Michele Dolfi 1d680b0a32 docs: Linux Foundation AI & Data (#1183)
* point the auxiliary files to the community repo and add lfai in README

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update docs index

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-03-19 09:05:57 +01:00
Michele Dolfi 54a78c307d docs: move apify to docs (#1182)
move apify to docs

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2025-03-18 16:43:55 +01:00
Maxim Lysak 2f72167ff6 feat: updated vlm pipeline (with latest changes from docling-core) (#1158)
* Draft implementation of Doctag backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Updated VLM pipeline doctags to docling conversion, now properly supports lists

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* preparing to migrate to new doctags deserializer

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* re-using DocTagsDocument.from_doctags_and_image_pairs

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* satisfying mypy and other checks

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added support for force_backend_text parameter

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* removed unnecessary transformation

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned up

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Update tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Updated readme

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2025-03-18 15:44:51 +01:00
Václav Vančura 772487f9c9 feat(actor): Docling Actor on Apify infrastructure (#875)
* fix: Improve OCR results, tighten criteria before dropping bitmap areas (#719)

fix: Properly care for all bitmap elements in OCR

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Adam Kliment <adam@netmilk.net>

* chore: bump version to 2.15.1 [skip ci]

* Actor: Initial implementation

Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: .dockerignore update

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the Actor badge

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Moving the badge where it belongs

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

Signed-off-by: Václav Vančura <commit@vancura.dev>
Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Switching Docker to python:3.11-slim-bookworm

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance Docker security with proper user permissions

- Set proper ownership and permissions for runtime directory.
- Switch to non-root user for enhanced security.
- Use `--chown` flag in COPY commands to maintain correct file ownership.
- Ensure all files and directories are owned by `appuser`.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Optimize Dockerfile with security and size improvements

- Combine RUN commands to reduce image layers and overall size.
- Add non-root user `appuser` for improved security.
- Use `--no-install-recommends` flag to minimize installed packages.
- Install only necessary dependencies in a single RUN command.
- Maintain proper cleanup of package lists and caches.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add Docker image metadata labels

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update dependencies with fixed versions

Upgrade pip and npm to latest versions, pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. This change helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix apify-cli version problem

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Create Apify user home directory in Docker setup

Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update Docker configuration for improved security

- Add `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to ignore the Apify-tooling related warning.
- Improve readability with consistent formatting and spacing in RUN commands.
- Enhance security by properly setting up appuser home directory and permissions.
- Streamline directory structure and ownership for runtime operations.
- Remove redundant `.apify` directory creation as it's handled by the CLI.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Improve shell script robustness and error handling

The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:

- Added proper quoting around variables to prevent word splitting.
- Improved error messages and logging functionality.
- Implemented a cleanup trap to ensure temporary files are removed.
- Enhanced validation of input parameters and output formats.
- Added better handling of the log file and its storage.
- Improved command execution with proper evaluation.
- Added comments for better code readability and maintenance.
- Fixed potential security issues with proper variable expansion.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Improve script logging and error handling

- Initialize log file at `/tmp/docling.log` and redirect all output to it
- Remove exit on error trap, now only logs error line numbers
- Use temporary directory for timestamp file
- Capture Docling exit code and handle errors more gracefully
- Update log file references to use `LOG_FILE` variable
- Remove local log file during cleanup

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Updating Docling to 2.17.0

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding README

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: README update

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance Dockerfile with additional utilities and env vars

- Add installation of `time` and `procps` packages for better resource monitoring.
- Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
- Create a cache directory for EasyOCR to optimize storage usage.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: README update

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the Apify FirstPromoter integration

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding the "Run on Apify" button

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fixing example PDF document URLs

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding input document URL validation

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix quoting in `DOC_CONVERT_CMD` variable

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Documentation update

Removing the dollar signs due to what we discovered at https://cirosantilli.com/markdown-style-guide/#dollar-signs-in-shell-code

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add specific error codes for better error handling

- `ERR_INVALID_INPUT` for missing document URL
- `ERR_URL_INACCESSIBLE` for inaccessible URLs
- `ERR_DOCLING_FAILED` for Docling command failures
- `ERR_OUTPUT_MISSING` for missing or empty output files
- `ERR_STORAGE_FAILED` for failures in storing the output document
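
The error codes listed above could be modelled as follows. The Actor itself is described as a shell script, so this Python version is only an illustrative sketch: `ActorError`, `check_input`, and `format_error` are hypothetical names, and only the code names and their meanings come from the commit message.

```python
from enum import Enum
from typing import Optional


class ActorError(Enum):
    """Error codes named in the commit message, with their stated meanings."""

    ERR_INVALID_INPUT = "missing document URL"
    ERR_URL_INACCESSIBLE = "inaccessible URL"
    ERR_DOCLING_FAILED = "Docling command failure"
    ERR_OUTPUT_MISSING = "missing or empty output file"
    ERR_STORAGE_FAILED = "failure storing the output document"


def check_input(document_url: Optional[str]) -> Optional[ActorError]:
    """Return ERR_INVALID_INPUT when no URL is given, otherwise None."""
    if not document_url:
        return ActorError.ERR_INVALID_INPUT
    return None


def format_error(error: ActorError) -> str:
    """Render an error code together with its description for logging."""
    return f"{error.name}: {error.value}"


error = check_input("")
if error is not None:
    print(format_error(error))
```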

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance error handling and data logging

- Add `apify pushData` calls to log errors when the document URL is missing or inaccessible.
- Introduce dataset record creation with processing results, including a success status and output file URL.
- Modify completion message to indicate successful processing and provide a link to the results.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Normalize key-value store terminology

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Enhance `README.md` with output details

Added detailed information about the Actor's output storage to the `README.md`. This includes specifying where processed documents, processing logs, and dataset records are stored.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding CHANGELOG.md

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Adding dataset schema

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update README with output URL details

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fix the Apify call syntax and final result URL message

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Add section on Actors to README

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Replace Docling CLI with docling-serve API

This commit transitions the Actor from using the full Docling CLI package to the more lightweight docling-serve API. Key changes include:

- Redesign Dockerfile to use docling-serve as base image
- Update actor.sh to communicate with API instead of running CLI commands
- Improve content type handling for various output formats
- Update input schema to align with API parameters
- Reduce Docker image size from ~6GB to ~600MB
- Update documentation and changelog to reflect architectural changes

The image size reduction will make the Actor more cost-effective for users while maintaining all existing functionality including OCR capabilities.

Issue: No official docling-serve Docker image is currently available, which will be addressed in a future commit.
Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Overhaul the implementation using official docling-serve image

This commit completely revamps the Actor implementation with the following improvements:

1) CRITICAL CHANGE: Switch to official docling-serve image
   * Now using quay.io/ds4sd/docling-serve-cpu:latest as base image
   * Eliminates need for custom docling installation
   * Ensures compatibility with latest docling-serve features
   * Provides more reliable and consistent document processing

2) Fix Apify Actor KVS storage issues:
   * Standardize key names to follow Apify conventions:
     - Change "OUTPUT_RESULT" to "OUTPUT"
     - Change "DOCLING_LOG" to "LOG"
   * Add proper multi-stage Docker build:
     - First stage builds dependencies including apify-cli
     - Second stage uses official image and adds only necessary tools
   * Fix permission issues in Docker container:
     - Set up proper user and directory permissions
     - Create writable directories for temporary files and models
     - Configure environment variables for proper execution

3) Solve EACCES permission errors during CLI version checks:
   * Create temporary HOME directory with proper write permissions
   * Set APIFY_DISABLE_VERSION_CHECK=1 environment variable
   * Add NODE_OPTIONS="--no-warnings" to suppress update checks
   * Support --no-update-notifier CLI flag when available

4) Improve code organization and reliability:
   * Create reusable upload_to_kvs() function for all KVS operations
   * Ensure log files are uploaded before tools directory is removed
   * Set proper MIME types based on output format
   * Add detailed error reporting and proper cleanup
   * Display final output URLs for easy verification

This major refactoring significantly improves reliability and maintainability by leveraging the official docling-serve image while solving persistent permission and storage issues. The Actor now properly follows Apify standards while providing a more robust document processing pipeline.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Refactor `actor.sh` and add `docling_processor.py`

Refactor the `actor.sh` script to modularize functions for finding the Apify CLI, setting up a temporary environment, and cleaning it up. Introduce a new function, `get_actor_input()`, to handle input detection more robustly. Replace inline Python conversion logic with an external script, `docling_processor.py`, for processing documents via the docling-serve API.

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Update CHANGELOG and README for Docker and API changes

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Removing obsolete actor.json keys

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Fixed input getter

Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Always output a zip

Signed-off-by: Adam Kliment <adam@netmilk.net>

* Actor: Resolving conflicts with main

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Resolving conflicts with main (pass 2)

Signed-off-by: Václav Vančura <commit@vancura.dev>

* Actor: Updated main Readme and Actor Readme

Signed-off-by: Adam Kliment <adam@netmilk.net>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Adam Kliment <adam@netmilk.net>
Signed-off-by: Václav Vančura <commit@vancura.dev>
Co-authored-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Adam Kliment <adam@netmilk.net>
2025-03-18 10:17:44 +01:00
Michele Dolfi fa16b12316 chore: move to docling-project org (#1160)
* chore: rename org

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Update docs/faq/index.md

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* update github pages

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* revert test content

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-03-14 12:35:29 +01:00
Michele Dolfi 357d41cc47 docs: Enrichment models (#1097)
* warning for develop examples

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add docs for enrichment models

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* minor reorg of top-level docs (#1098)

* minor reorg of top-level docs

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* fix typo [no ci]

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* trigger ci

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-03-04 14:24:38 +01:00
Peter W. J. Staar d7c082894e docs: updated the readme with upcoming features (#831)
* updated the readme with upcoming features

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the docs-index

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
2025-01-30 09:52:54 +01:00
Panos Vagenas 6875913e34 docs: document Docling JSON parsing (#819)
* docs: document Docling JSON parsing

Also:
- factored out and expanded supported formats
- reorged feature list

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* update feature list, minor fixes

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

---------

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-28 13:23:30 +01:00
Panos Vagenas 4fa8028bd8 docs: add LangChain docs (#717)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2025-01-09 14:12:05 +01:00
Peter W. J. Staar 0d11e30dd8 fix: Enable HTML export in CLI and add options for image mode (#513)
* updated README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed duck in title

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the index.md

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the cli to export html

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added html to cli

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* removed the duck emoji, added the  in the cli. Currently, the reference seems broken

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* cleaning up the comments

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reference is now working

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Clean up styling and docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Pin docling-core>=2.7.1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-12-06 12:37:57 +01:00
Panos Vagenas 84c46fdeb3 docs: extend integration docs & README (#456)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-28 09:41:21 +01:00
Peter W. J. Staar 2a1d3fd221 chore: update the README (#409)
* chore: update the README

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Update README.md

Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>

* chore: update the docs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Peter W. J. Staar <91719829+PeterStaar-IBM@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-21 17:28:53 +01:00
Michele Dolfi 97d571af97 chore: add downloads in README, security policy and update ci actions (#401)
* add security policy

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update deprecated actions

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add comment about licenses for new dependencies

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add pypi downloads badge

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add citation file

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-21 13:59:45 +01:00
Michele Dolfi 7b013abcf3 fix: python3.9 support (#396)
* fixes for python3.9

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin docling-parse with python3.9 wheels

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* update deps

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-11-20 15:21:40 +01:00
Panos Vagenas a84ec276b0 docs: update badges & credits (#248)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 13:57:06 +01:00
Panos Vagenas 5ce02c5c59 docs: add coming-soon section (#235)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-11-05 08:53:02 +01:00
Peter W. J. Staar 94a5290789 chore: update the with input formats and DoclingDocument (#188)
---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
2024-10-30 15:02:28 +01:00
Christoph Auer 3023f18ba0 feat: Support AsciiDoc and Markdown input format (#168)
* updated the base-model and added the asciidoc_backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* updated the asciidoc backend

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Ensure all models work only on valid pages (#158)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* ci: run ci also on forks (#160)


---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>

* fix: fix legacy doc ref (#162)

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>

* docs: typo fix (#155)

* Docs: Typo fix

- Corrected spelling of invidual to automatic

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>

* add synchronize event for forks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>

* feat: add coverage_threshold to skip OCR for small images (#161)

* feat: add coverage_threshold to skip OCR for small images

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* filter individual boxes

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* rename option

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* chore: bump version to 2.1.0 [skip ci]

* adding tests for asciidocs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* first working asciidoc parser

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatted the code

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding test_02.asciidoc

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Drafting Markdown backend via Marko library

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* work in progress on MD backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* md_backend produces docling document with headers, paragraphs, lists

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Improvements in md parsing

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Detecting and assembling tables in markdown in temporary buffers

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added initial docling table support to md_backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Cleaned code, improved logging for MD

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixes MyPy requirements, and rest of pre-commit

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed example run_md, added origin info to md_backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* working on asciidocs, struggling with ImageRef

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* able to parse the captions and image URIs

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the mypy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* Update all backends with proper filename in DocumentOrigin

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update to docling-core v2.1.0

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for MD Backend, to avoid duplicated text inserts into docling doc

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fix styling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Added support for code blocks and fenced code in MD

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* cleaned prints

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Added proper processing of in-line textual elements for MD backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed issues with duplicated paragraphs and incorrect lists in pptx

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

* Fixed issue with group ordering in pptx backend, added debug log into run with formats

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
2024-10-23 16:14:26 +02:00
Panos Vagenas b8d2286dd1 chore: various minor docs fixes (#169)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-22 15:29:36 +02:00
ABHISHEK FADAKE f799e777c1 docs: typo fix (#155)
* Docs: Typo fix

- Corrected spelling of invidual to automatic

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>

* add synchronize event for forks

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: ABHISHEK FADAKE <31249309+fadkeabhi@users.noreply.github.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-18 13:56:48 +02:00
Maxim Lysak 034a411057 docs: add graphical band in readme (#154)
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-17 18:15:40 +02:00
Michele Dolfi 61c092f445 docs: add use docling (#150)
---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-17 18:14:48 +02:00
Christoph Auer 7d3be0edeb feat!: Docling v2 (#117)
---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Maxim Lysak <mly@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
Co-authored-by: Maxim Lysak <mly@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-16 21:02:03 +02:00
Panos Vagenas d504432c1e docs: introduce docs site (#141)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-14 14:13:13 +02:00
Panos Vagenas 5f1bd9e9c8 docs: simplify LlamaIndex example using Docling extension (#135)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-10-09 22:17:56 +02:00
Michele Dolfi f96ea86a00 feat: add options for choosing OCR engines (#118)
---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
2024-10-08 19:07:08 +02:00
Michele Dolfi d44c62d7ce feat: windows support (#122)
* feat: windows support

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* add Windows in README

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
2024-10-03 14:23:47 +02:00
Panos Vagenas c05b692d69 docs: document chunking (#111)
[skip ci]

Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-27 11:16:04 +02:00
Christoph Auer d6df76f90b feat: Support tableformer model choice (#90)
* Support tableformer model choice

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update datamodel structure

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update docs

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add test unit for table options

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Ensure import backwards-compatibility for PipelineOptions

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update README

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Adjust parameters on custom_convert

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

* Update Dockerfile

Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
2024-09-26 21:37:08 +02:00
Panos Vagenas f8f2303348 docs: document CLI, minor README revamp (#100)
Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com>
2024-09-24 09:21:28 +02:00