mirror of
https://github.com/swift-server/swift-aws-lambda-runtime.git
synced 2026-05-03 07:22:27 +00:00
34e89b4027
# Fix test hangs caused by Pool cancellation race conditions ## Summary This PR fixes two related race conditions in `Lambda+LocalServer+Pool.swift` that were causing the test suite to hang approximately 10% of the time. ## Problem The test suite exhibited intermittent hangs (~10% frequency) due to two bugs in the Pool implementation: 1. **Individual task cancellation bug**: When one task waiting for a specific `requestId` was cancelled, the cancellation handler would incorrectly cancel ALL waiting tasks instead of just the cancelled one. 2. **Server shutdown hang**: When the server shut down, waiting continuations in the pools were never cancelled, causing handlers to wait indefinitely for responses that would never arrive. ## Root Causes ### Root Cause #1: Cancellation Handler Removes ALL Continuations The `onCancel` handler in `Pool._next()` was removing all continuations from the `waitingForSpecific` dictionary when any single task was cancelled: ```swift onCancel: { // BUG: Removes ALL continuations, not just the cancelled task's for continuation in state.waitingForSpecific.values { toCancel.append(continuation) } state.waitingForSpecific.removeAll() } ``` This caused unrelated concurrent invocations to fail with `CancellationError` when one client cancelled their request. ### Root Cause #2: No Pool Cleanup During Server Shutdown When the server shut down (e.g., test completes), the task group was cancelled but the pools' waiting continuations were never notified. The `/invoke` endpoint handlers would continue waiting for responses that would never arrive because the Lambda function had stopped. ## Solution ### Fix #1: Only Remove Specific Continuation on Cancellation Modified the cancellation handler to only remove the continuation for the specific cancelled task: ```swift onCancel: { // Only remove THIS task's continuation let continuationToCancel = self.lock.withLock { state -> CheckedContinuation<T, any Error>? in if let requestId = requestId { return state.waitingForSpecific.removeValue(forKey: requestId) } else { let cont = state.waitingForAny state.waitingForAny = nil return cont } } continuationToCancel?.resume(throwing: CancellationError()) } ``` ### Fix #2: Add Pool Cleanup During Server Shutdown Added `cancelAll()` method to the Pool class and call it during server shutdown: ```swift func cancelAll() { let continuationsToCancel = self.lock.withLock { state -> [CheckedContinuation<T, any Error>] in var toCancel: [CheckedContinuation<T, any Error>] = [] if let continuation = state.waitingForAny { toCancel.append(continuation) state.waitingForAny = nil } for continuation in state.waitingForSpecific.values { toCancel.append(continuation) } state.waitingForSpecific.removeAll() return toCancel } for continuation in continuationsToCancel { continuation.resume(throwing: CancellationError()) } } ``` Called during server shutdown: ```swift let serverOrHandlerResult1 = await group.next()! group.cancelAll() // Cancel all waiting continuations in the pools to prevent hangs server.invocationPool.cancelAll() server.responsePool.cancelAll() ``` ## Changes ### Modified Files - **Sources/AWSLambdaRuntime/HTTPServer/Lambda+LocalServer+Pool.swift** - Fixed cancellation handler in `_next()` to only remove specific continuation - Added `cancelAll()` method for server shutdown cleanup - **Sources/AWSLambdaRuntime/HTTPServer/Lambda+LocalServer.swift** - Call `cancelAll()` on both pools during server shutdown ### New Files - **Tests/AWSLambdaRuntimeTests/LocalServerPoolCancellationTests.swift** - Added comprehensive test suite with 3 tests - `testCancellationOnlyAffectsOwnTask`: Verifies only the cancelled task receives CancellationError - `testConcurrentInvocationsWithCancellation`: Tests real-world scenario with 5 concurrent invocations - `testFIFOModeCancellation`: Ensures FIFO mode cancellation works correctly ## Testing ### Before Fix - Test suite hung ~10% of the time - When 1 task was cancelled, all 5 concurrent tasks received `CancellationError` - Streaming tests would occasionally hang during shutdown ### After Fix - All 91 tests pass consistently without hangs - When 1 task is cancelled, only that specific task receives `CancellationError` - Other tasks continue waiting normally - Server shutdown properly cleans up all waiting continuations - Multiple consecutive test runs confirm stability ### Test Coverage The new test suite reproduces both bugs and verifies the fixes: 1. **testCancellationOnlyAffectsOwnTask**: Creates 3 tasks waiting for different requestIds, cancels only one, and verifies the others are not affected 2. **testConcurrentInvocationsWithCancellation**: Simulates 5 concurrent invocations with one cancellation 3. **testFIFOModeCancellation**: Tests FIFO mode to ensure it still works correctly --------- Co-authored-by: Sebastien Stormacq <stormacq@amazon.lu>