Files
Sébastien Stormacq 34e89b4027 Fix Test hangs in Lambda+LocalServer (#630) (#631)
# Fix test hangs caused by Pool cancellation race conditions

## Summary

This PR fixes two related race conditions in
`Lambda+LocalServer+Pool.swift` that were causing the test suite to hang
approximately 10% of the time.

## Problem

The test suite exhibited intermittent hangs (~10% frequency) due to two
bugs in the Pool implementation:

1. **Individual task cancellation bug**: When one task waiting for a
specific `requestId` was cancelled, the cancellation handler would
incorrectly cancel ALL waiting tasks instead of just the cancelled one.

2. **Server shutdown hang**: When the server shut down, waiting
continuations in the pools were never cancelled, causing handlers to
wait indefinitely for responses that would never arrive.

## Root Causes

### Root Cause #1: Cancellation Handler Removes ALL Continuations

The `onCancel` handler in `Pool._next()` was removing all continuations
from the `waitingForSpecific` dictionary when any single task was
cancelled:

```swift
onCancel: {
    // BUG: Removes ALL continuations, not just the cancelled task's
    for continuation in state.waitingForSpecific.values {
        toCancel.append(continuation)
    }
    state.waitingForSpecific.removeAll()
}
```

This caused unrelated concurrent invocations to fail with
`CancellationError` when one client cancelled their request.

### Root Cause #2: No Pool Cleanup During Server Shutdown

When the server shut down (e.g., test completes), the task group was
cancelled but the pools' waiting continuations were never notified. The
`/invoke` endpoint handlers would continue waiting for responses that
would never arrive because the Lambda function had stopped.

## Solution

### Fix #1: Only Remove Specific Continuation on Cancellation

Modified the cancellation handler to only remove the continuation for
the specific cancelled task:

```swift
onCancel: {
    // Only remove THIS task's continuation
    let continuationToCancel = self.lock.withLock { state -> CheckedContinuation<T, any Error>? in
        if let requestId = requestId {
            return state.waitingForSpecific.removeValue(forKey: requestId)
        } else {
            let cont = state.waitingForAny
            state.waitingForAny = nil
            return cont
        }
    }
    
    continuationToCancel?.resume(throwing: CancellationError())
}
```

### Fix #2: Add Pool Cleanup During Server Shutdown

Added `cancelAll()` method to the Pool class and call it during server
shutdown:

```swift
func cancelAll() {
    let continuationsToCancel = self.lock.withLock { state -> [CheckedContinuation<T, any Error>] in
        var toCancel: [CheckedContinuation<T, any Error>] = []
        
        if let continuation = state.waitingForAny {
            toCancel.append(continuation)
            state.waitingForAny = nil
        }
        
        for continuation in state.waitingForSpecific.values {
            toCancel.append(continuation)
        }
        state.waitingForSpecific.removeAll()
        
        return toCancel
    }
    
    for continuation in continuationsToCancel {
        continuation.resume(throwing: CancellationError())
    }
}
```

Called during server shutdown:

```swift
let serverOrHandlerResult1 = await group.next()!
group.cancelAll()

// Cancel all waiting continuations in the pools to prevent hangs
server.invocationPool.cancelAll()
server.responsePool.cancelAll()
```

## Changes

### Modified Files

- **Sources/AWSLambdaRuntime/HTTPServer/Lambda+LocalServer+Pool.swift**
- Fixed cancellation handler in `_next()` to only remove specific
continuation
  - Added `cancelAll()` method for server shutdown cleanup

- **Sources/AWSLambdaRuntime/HTTPServer/Lambda+LocalServer.swift**
  - Call `cancelAll()` on both pools during server shutdown

### New Files

- **Tests/AWSLambdaRuntimeTests/LocalServerPoolCancellationTests.swift**
  - Added comprehensive test suite with 3 tests
- `testCancellationOnlyAffectsOwnTask`: Verifies only the cancelled task
receives CancellationError
- `testConcurrentInvocationsWithCancellation`: Tests real-world scenario
with 5 concurrent invocations
- `testFIFOModeCancellation`: Ensures FIFO mode cancellation works
correctly

## Testing

### Before Fix
- Test suite hung ~10% of the time
- When 1 task was cancelled, all 5 concurrent tasks received
`CancellationError`
- Streaming tests would occasionally hang during shutdown

### After Fix
- All 91 tests pass consistently without hangs
- When 1 task is cancelled, only that specific task receives
`CancellationError`
- Other tasks continue waiting normally
- Server shutdown properly cleans up all waiting continuations
- Multiple consecutive test runs confirm stability

### Test Coverage

The new test suite reproduces both bugs and verifies the fixes:

1. **testCancellationOnlyAffectsOwnTask**: Creates 3 tasks waiting for
different requestIds, cancels only one, and verifies the others are not
affected
2. **testConcurrentInvocationsWithCancellation**: Simulates 5 concurrent
invocations with one cancellation
3. **testFIFOModeCancellation**: Tests FIFO mode to ensure it still
works correctly

---------

Co-authored-by: Sebastien Stormacq <stormacq@amazon.lu>
2026-01-15 20:58:03 +01:00
..