Sébastien Stormacq 102f92aafb Fix race condition crash in LambdaRuntimeClient channel lifecycle (Bug #624) (#632)
This PR fixes a race condition in `LambdaRuntimeClient` that causes a
fatal crash when an old channel's `closeFuture` callback fires after a
new connection has been established. The fix adds proper channel
lifecycle tracking and replaces the fatal error with graceful handling.

## Problem

**Crash Location**: `LambdaRuntimeClient.swift:270` in `channelClosed()`

**Error Message**:
```
Fatal error: Invalid state: connected(SocketChannel { ... }), closed
```

**Root Cause**: A race condition in which:
1. An old channel's `closeFuture` callback fires
2. after a new connection has already been established
   (`connectionState = .connected`),
3. while `closingState` is still `.closed` from a previous close operation.
4. The code assumed this combination was impossible and crashed with `fatalError`.

This can occur when:
- Network conditions cause delayed channel cleanup
- Connection is recycled quickly (old channel still closing while new
one connects)
- Timing issues between channel close callbacks and new connection
establishment

## Solution

### Key Changes

1. **Added channel identity tracking**:
   ```swift
   private var channelsBeingClosed: Set<ObjectIdentifier> = []
   ```
   Tracks which channels are in the process of closing to distinguish old
   channels from the current one.

2. **Enhanced `connectionWillClose()`**:
   - Marks channels as "being closed" using `ObjectIdentifier`
   - Adds logging when old channels close while new connection is active

3. **Rewrote `channelClosed()` with defensive logic**:
   - **Early return for tracked old channels**: Handles them gracefully
     without affecting current connection
   - **Replaced `fatalError` with warning log**: The `(_, .closed)` case
     now logs a warning instead of crashing
   - **Channel identity checks**: Only transitions state if the closing
     channel is the CURRENT channel
   - **Removed unconditional state change**: Previously set
     `connectionState = .disconnected` for ANY channel close, now only for
     the current channel
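
The identity-tracking idea in the key changes above can be sketched in isolation. This is a minimal model, not the code from `LambdaRuntimeClient.swift`: `FakeChannel`, `ChannelLifecycleTracker`, `connected(_:)`, `isConnected(to:)`, and the `warnings` array are hypothetical names standing in for the real NIO channel, the client's state machine, and its logger. Only `channelsBeingClosed`, `connectionWillClose()`, and `channelClosed()` mirror names used in this PR, and their bodies here are assumptions about the general approach.

```swift
// Hypothetical stand-in for a NIO channel; only object identity matters here.
final class FakeChannel {}

// Simplified connection state (assumption: the real client tracks more cases).
enum ConnectionState {
    case disconnected
    case connected(FakeChannel)
}

struct ChannelLifecycleTracker {
    private(set) var connectionState: ConnectionState = .disconnected
    private(set) var channelsBeingClosed: Set<ObjectIdentifier> = []
    // Stand-in for log output so the behavior can be inspected.
    private(set) var warnings: [String] = []

    // Hypothetical helper: record a newly established connection.
    mutating func connected(_ channel: FakeChannel) {
        connectionState = .connected(channel)
    }

    // Mark a channel as "being closed" by its object identity.
    mutating func connectionWillClose(_ channel: FakeChannel) {
        channelsBeingClosed.insert(ObjectIdentifier(channel))
    }

    // Called when a channel's close notification arrives, possibly late.
    mutating func channelClosed(_ channel: FakeChannel) {
        let wasTracked = channelsBeingClosed.remove(ObjectIdentifier(channel)) != nil

        // Only the CURRENT channel is allowed to change the connection state.
        if case .connected(let current) = connectionState, current === channel {
            connectionState = .disconnected
            return
        }

        // Old or unknown channel: leave the state untouched and log instead of crashing.
        if wasTracked {
            warnings.append("old channel closed; current connection state left unchanged")
        } else {
            warnings.append("unexpected close for an untracked channel; ignored")
        }
    }

    // Hypothetical helper used to inspect the state in the example below.
    func isConnected(to channel: FakeChannel) -> Bool {
        if case .connected(let current) = connectionState {
            return current === channel
        }
        return false
    }
}

// Simulate the race: the old channel's close arrives after a new connection exists.
var tracker = ChannelLifecycleTracker()
let oldChannel = FakeChannel()
tracker.connected(oldChannel)
tracker.connectionWillClose(oldChannel)

let newChannel = FakeChannel()
tracker.connected(newChannel)

tracker.channelClosed(oldChannel)  // late callback from the old channel
```

In this model the late close of `oldChannel` leaves the connection to `newChannel` in place and records a warning rather than trapping. `ObjectIdentifier` is only guaranteed unique while the object is alive, which is why the sketch removes the entry from the set as soon as the close is observed.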

### Why This Fixes the Bug

The fix addresses the race condition by:
- Distinguishing between "current channel closing" vs "old channel
closing"
- Handling old channel closes gracefully without crashing or corrupting
state
- Not overwriting connection state when old channels close
- Providing visibility through logging when the race condition occurs

## Changes

### Modified Files

- **Sources/AWSLambdaRuntime/HTTPClient/LambdaRuntimeClient.swift**
  - Added `channelsBeingClosed: Set<ObjectIdentifier>` property
  - Enhanced `connectionWillClose()` with channel tracking
  - Rewrote `channelClosed()` with defensive logic and identity checks
  - Replaced `fatalError` with warning log for unexpected states
  - Removed unconditional state change in `closeFuture` callback

**Lines Changed**: ~150 lines modified/added

**Backward Compatibility**: Fully compatible, no API changes

## Testing

### All Existing Tests Pass

```bash
swift test
# Result: 91 tests passed in 14 suites
```

All original functionality is preserved with no regressions.

### ⚠️ Note on Test Coverage

While we cannot reproduce the exact race condition from bug #624 in a
deterministic test (it requires specific network timing), the fix:
- Is logically sound for the described race condition
- Improves defensive programming around channel lifecycle
- Replaces a fatal crash with graceful handling + logging
- Should prevent the crash by properly tracking channel identity

## Related Issues

Fixes #624

---------

Co-authored-by: Sebastien Stormacq <stormacq@amazon.lu>
2026-01-27 08:53:31 +00:00