* Move client SDK to docling
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup and test shims
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update to released docling version
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: harden ray dispatcher durability and recovery
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove dispatcher_handoff_timeout
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* More cleanup, recover comments
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Bound Ray dispatcher recovery RPCs for liveness
Add dispatcher_rpc_timeout and liveness_fail_after to RayOrchestratorConfig.
Bound both dispatcher health checks and runtime refresh RPCs with
asyncio.wait_for so head-loss cannot wedge the supervisor on an
unbounded await. Track continuous dispatcher unhealthiness and expose
is_liveness_healthy() for bounded liveness decisions.
Also extend Ray hardening tests to cover both get_health and
refresh_runtime timeout paths.
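
A minimal sketch of the bounded-RPC idea described above, not the actual RayOrchestrator code. The class name `LivenessTracker` and its fields are hypothetical; only `asyncio.wait_for`, the two config names, and `is_liveness_healthy()` come from this message.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


class LivenessTracker:
    """Hypothetical helper: tracks continuous dispatcher unhealthiness."""

    def __init__(self, dispatcher_rpc_timeout: float, liveness_fail_after: float) -> None:
        self.dispatcher_rpc_timeout = dispatcher_rpc_timeout  # bound for one RPC (s)
        self.liveness_fail_after = liveness_fail_after        # tolerated unhealthy span (s)
        self._unhealthy_since: Optional[float] = None

    async def check(self, rpc: Callable[[], Awaitable[object]]) -> bool:
        """Run one RPC bounded by the timeout; return True if it succeeded."""
        try:
            await asyncio.wait_for(rpc(), timeout=self.dispatcher_rpc_timeout)
        except Exception:
            # Timeouts and RPC errors both count as unhealthy. Cancellation is
            # a BaseException and is deliberately not swallowed here.
            if self._unhealthy_since is None:
                self._unhealthy_since = time.monotonic()
            return False
        self._unhealthy_since = None
        return True

    def is_liveness_healthy(self) -> bool:
        """Healthy unless the dispatcher has been down for too long."""
        if self._unhealthy_since is None:
            return True
        return time.monotonic() - self._unhealthy_since < self.liveness_fail_after
```

Because every await is bounded, a lost Ray head shows up as a failed check instead of a hung supervisor.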
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Make Ray runtime initialization lazy in RayOrchestrator
RayOrchestrator.__init__() previously called ray.init(), serve.start(),
and deploy_processor() synchronously, coupling API pod startup to Ray
head availability. This caused crash-loop restarts when the Ray head
was unavailable and forced compensatory workarounds in docling-serve
(/ready shallow bypass, /livez liveness logic).
Move all Ray init calls into a new _initialize_ray_runtime() async
method invoked from process_queue(), so construction is Ray-free and
the pod can start serving requests before a Ray session is established.
Use asyncio.to_thread for the blocking Ray calls. Wrap the method body
in try/except BaseException (re-raising CancelledError) so any failure,
including SystemExit(15) from Ray internals, raises
DispatcherUnavailableError rather than escaping as an unhandled exception.
Apply the same BaseException / CancelledError pattern to
_refresh_dispatcher_runtime() and ensure_dispatcher_ready(), which
previously used except Exception and therefore missed SystemExit(15)
from Ray, returning HTTP 500 instead of the intended HTTP 503.
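
The BaseException / CancelledError pattern can be sketched as follows. This is an illustration under assumptions, not the real method bodies: `fake_blocking_init` and `readiness_status` are hypothetical, and `DispatcherUnavailableError` is redefined locally as a stand-in.

```python
import asyncio


class DispatcherUnavailableError(RuntimeError):
    """Local stand-in for the error type named in this message."""


def fake_blocking_init() -> None:
    # Hypothetical placeholder for the blocking Ray calls; simulates Ray
    # internals exiting with SystemExit(15).
    raise SystemExit(15)


async def initialize_runtime(blocking_init=fake_blocking_init) -> None:
    try:
        # Run the blocking call off the event loop.
        await asyncio.to_thread(blocking_init)
    except asyncio.CancelledError:
        # Cancellation must propagate untouched.
        raise
    except BaseException as exc:
        # `except Exception` would miss SystemExit; BaseException does not.
        raise DispatcherUnavailableError(repr(exc)) from exc


async def readiness_status(blocking_init=fake_blocking_init) -> int:
    """Hypothetical mapping of the outcome to an HTTP status code."""
    try:
        await initialize_runtime(blocking_init)
    except DispatcherUnavailableError:
        return 503
    return 200
```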
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix(ray): keep processing key cleanup in complete_task_atomic only
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* ray: replace task heartbeat WATCH loop with Lua and derive stale cutoff from heartbeat interval
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat(ray): add replica-owned execution lease helpers to RedisStateManager
Adds write_task_execution_lease(), update_task_execution_heartbeat(), and
get_task_execution_lease() to RedisStateManager. Extends finalize_task_*_atomic()
and complete_task_atomic() to delete task:{id}:execution at terminalization.
These methods are used by the next task (replica heartbeat in serve_deployment.py).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(ray): execution lease cleanup pass 2
- Delete update_task_processing_heartbeat() and its Lua script constant;
only caller (_maintain_processing_heartbeat) was removed in D2
- Remove heartbeat_at from mark_task_processing() mapping; the dispatch
key is now a pure admission record, not a heartbeat carrier
- Remove mark_task_processing() call from serve_deployment.py; the
execution lease is the authoritative "execution has begun" signal
- Rename Redis key task:{id}:processing → task:{id}:dispatch and method
get_task_processing_state() → get_task_dispatch_state_hash() to reflect
that this is a dispatcher-written dispatch record, not a replica state
- Wrap _process_convert and _process_chunk with asyncio.to_thread so the
replica event loop stays free during conversion, allowing the execution
lease heartbeat to fire throughout long-running tasks (D1b)
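
A small sketch of why the asyncio.to_thread wrapping matters, under stated assumptions: `blocking_convert` stands in for _process_convert, and the heartbeat only counts ticks where a real replica would refresh its execution lease.

```python
import asyncio
import time


def blocking_convert(duration: float) -> str:
    # Stand-in for a long, blocking conversion.
    time.sleep(duration)
    return "done"


async def run_with_heartbeat(duration: float, interval: float) -> tuple[str, int]:
    beats = 0

    async def heartbeat() -> None:
        nonlocal beats
        while True:
            await asyncio.sleep(interval)
            beats += 1  # a real replica would update its lease here

    hb = asyncio.create_task(heartbeat())
    try:
        # Off-loop execution keeps the event loop free, so heartbeat() runs.
        result = await asyncio.to_thread(blocking_convert, duration)
    finally:
        hb.cancel()
        await asyncio.gather(hb, return_exceptions=True)
    return result, beats
```

Calling `blocking_convert` directly inside the coroutine would block the loop for the whole conversion, and the heartbeat would not tick until it returned.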
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix(ray): stringify serve replica id before writing execution lease
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* refactor(ray): remove dead dispatch-state cleanup code
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move Ray runtime init under continuous supervision
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* DCO Remediation Commit for Christoph Auer <cau@zurich.ibm.com>
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: bd82fa1b7d
I, Christoph Auer <cau@zurich.ibm.com>, hereby add my Signed-off-by to this commit: 9c8e1a5a86
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Skip integration tests in CI
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Small cleanups
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Harden test_local_orchestrator
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: Control maximum concurrent redis requests to avoid pool exhaustion (#6)
* add initial ray_fair orchestrator
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* implementation with ray serve
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix serialization
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* more serialization fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* cannot msgpack the DocumentStream
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* hardening notifier
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* cleanup raydata param and add log level
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* cleanup params and implement object store memory
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add mtls
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* more logging
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* more logging
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* launch all tasks
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* rename params
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* update docs
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix creation of redis pools
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix: Watchdog: update the RQ job status to FAILED and remove it from StartedJobRegistry (#107)
* fix: Watchdog: update the RQ job status to FAILED and remove it from StartedJobRegistry
Signed-off-by: Pawel Rein <pawel.rein@prezi.com>
* fix formatter/linter
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Pawel Rein <pawel.rein@prezi.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
* add metadata for orchestrator
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fixes
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add dispatch state
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* rename workers to actors
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* rename fair_ray to ray
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* more rename
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix dispatch vs running
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add redis manager to the actors
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix running metrics
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix setting running
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* actor cleanup
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* feat: Expose classification filters for picture description (#105)
Preserve legacy picture description filters
Signed-off-by: drk <drukpa1455@gmail.com>
* feat: add on_result_fetched() no-op lifecycle hook to BaseOrchestrator
* feat: add consumed_ttl and on_result_fetched() to RQOrchestrator
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: add consumed_ttl and on_result_fetched() to LocalOrchestrator
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: add expire_result() to RedisStateManager
This method sets a TTL on an existing result key in Redis, enabling
crash-safe single-use deletion of results after they are fetched.
Developed test-first and verified by a unit test.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: add consumed_ttl and on_result_fetched() to RayOrchestrator
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: add connect() guard to expire_result matching peer methods
All 20+ other methods in RedisStateManager check `if not self.redis`
before using the client. expire_result was missing this guard and would
raise RuntimeError if called before connection establishment.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ensure no asyncio.task can be GCed early
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* apply re-formatting
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* add ray actor logging to jobkit
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: run RQ Job.fetch/get_status/get_position in thread pool to avoid blocking the event loop
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Ensure control over max ongoing requests per ray replica
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* refactor: rename consumed_ttl back to result_removal_delay
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* upgrade uv.lock
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* move Redis gating and RQ durable status into jobkit
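
The Redis gating referred to here (capping concurrent Redis requests so the connection pool is not exhausted) can be approximated with a semaphore. This is a sketch under assumptions: `RedisGate` and its counters are hypothetical names, and `op` stands in for any awaitable Redis call.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RedisGate:
    """Hypothetical gate limiting the number of in-flight Redis operations."""

    def __init__(self, max_concurrent: int) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, for inspection

    async def run(self, op: Callable[[], Awaitable[T]]) -> T:
        # Callers beyond the limit wait here instead of taking a connection.
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await op()
            finally:
                self.in_flight -= 1
```

Setting `max_concurrent` at or below the pool size means waiting happens at the gate rather than as pool-exhaustion errors. Create the gate inside the running event loop on older Python versions.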
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Pawel Rein <pawel.rein@prezi.com>
Signed-off-by: drk <drukpa1455@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: Paweł Rein <pawel.rein@prezi.com>
Co-authored-by: drk <136856552+drukpa1455@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix test
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix on python 3.14
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Pawel Rein <pawel.rein@prezi.com>
Signed-off-by: drk <drukpa1455@gmail.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <CAU@zurich.ibm.com>
Co-authored-by: Paweł Rein <pawel.rein@prezi.com>
Co-authored-by: drk <136856552+drukpa1455@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>