ci: unify runner disk cleanup across Linux and Windows (#448)

- Replaces the two drifting `clean-runner-disk` /
`clean-runner-disk-windows` composite actions with a single
OS-dispatched action at `.github/actions/clean-runner-disk/`.
- Swaps the Windows hot path from PowerShell `Remove-Item -Recurse

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eligio Mariño
2026-05-13 19:08:31 +02:00
committed by GitHub
parent 7d16500293
commit 135de6f707
9 changed files with 459 additions and 98 deletions
+1
@@ -0,0 +1 @@
{"sessionId":"03700b4e-87b3-4015-950b-ccdc61da4692","pid":48327,"procStart":"78552","acquiredAt":1778618403190}
@@ -1,68 +0,0 @@
name: 'Clean Runner Disk (Windows)'
description: 'Cleans the GitHub Actions Windows runner disk by removing unused toolchains and SDKs to free up space on C:.'
runs:
using: composite
steps:
- name: Show disk usage before cleaning
run: Get-PSDrive -PSProvider FileSystem | Format-Table -AutoSize
shell: pwsh
- name: 'Top-level directories on C: before cleaning'
run: |
Get-ChildItem -Path C:\ -Directory -Force -ErrorAction SilentlyContinue |
ForEach-Object {
$size = (Get-ChildItem -Path $_.FullName -Recurse -Force -ErrorAction SilentlyContinue |
Measure-Object -Property Length -Sum).Sum
[PSCustomObject]@{
Path = $_.FullName
SizeGB = [math]::Round($size / 1GB, 2)
}
} | Sort-Object SizeGB -Descending | Format-Table -AutoSize
shell: pwsh
- name: Clean runner disk
run: |
$paths = @(
'C:\Android',
'C:\hostedtoolcache',
'C:\Program Files\dotnet',
'C:\Program Files (x86)\dotnet',
'C:\ghcup',
'C:\Program Files\Haskell',
'C:\Program Files\Haskell Platform',
'C:\Strawberry',
'C:\msys64',
'C:\Miniconda',
'C:\Program Files\MongoDB',
'C:\Program Files\PostgreSQL',
'C:\PostgreSQL',
'C:\Program Files\Google\Chrome',
'C:\Program Files (x86)\Google\Chrome',
'C:\Program Files\Mozilla Firefox',
'C:\selenium',
'C:\SeleniumWebDrivers',
'C:\vcpkg',
'C:\tools'
)
foreach ($p in $paths) {
if (Test-Path -LiteralPath $p) {
Write-Host "Removing $p"
Remove-Item -LiteralPath $p -Recurse -Force -ErrorAction Continue
} else {
Write-Host "Skipping $p (not found)"
}
}
shell: pwsh
- name: Show disk usage after cleaning
run: Get-PSDrive -PSProvider FileSystem | Format-Table -AutoSize
shell: pwsh
- name: 'Top-level directories on C: after cleaning'
run: |
Get-ChildItem -Path C:\ -Directory -Force -ErrorAction SilentlyContinue |
ForEach-Object {
$size = (Get-ChildItem -Path $_.FullName -Recurse -Force -ErrorAction SilentlyContinue |
Measure-Object -Property Length -Sum).Sum
[PSCustomObject]@{
Path = $_.FullName
SizeGB = [math]::Round($size / 1GB, 2)
}
} | Sort-Object SizeGB -Descending | Format-Table -AutoSize
shell: pwsh
+156 -29
@@ -1,32 +1,58 @@
name: 'Clean Runner Disk'
description: 'Cleans the GitHub Actions runner disk by removing unused packages and toolchains to free up space.'
description: >-
Frees disk space on the GitHub-hosted runner by removing unused toolchains
and SDKs. Dispatches by runner.os to the appropriate native fast-delete path
(apt-get + rm on Linux, robocopy /MIR /MT:128 with PowerShell fallback on
Windows). Asserts a minimum free space threshold after cleanup and emits a
one-line job summary.
runs:
using: composite
steps:
- name: Show disk usage before cleaning
run: df -h
# ----- Guard: only Linux and Windows runners are supported -----
- name: Reject unsupported runner OS
if: runner.os != 'Linux' && runner.os != 'Windows'
shell: bash
run: |
echo "::error::clean-runner-disk: unsupported runner.os '${{ runner.os }}'. Supported: Linux, Windows."
exit 1
# =========================================================================
# Linux path
# =========================================================================
- name: '[Linux] Show disk usage before cleaning'
if: runner.os == 'Linux'
shell: bash
run: |
df -h
echo "__CLEAN_RUNNER_DISK_FREE_BEFORE_KB=$(df --output=avail / | tail -1 | tr -d ' ')" >> $GITHUB_ENV
echo "__CLEAN_RUNNER_DISK_STARTED_AT=$(date +%s)" >> $GITHUB_ENV
- name: '[Linux] List installed packages before cleaning'
if: runner.os == 'Linux'
shell: bash
- name: List installed packages before cleaning
run: dpkg --get-selections | grep -v deinstall
- name: '[Linux] Clean runner disk'
if: runner.os == 'Linux'
shell: bash
- name: Clean runner disk
run: |
# Remove PHP
sudo apt-get remove -y 'php.*' --fix-missing || echo "::warning::The command [sudo apt-get remove -y 'php.*' --fix-missing] failed to complete successfully. Proceeding..."
sudo apt-get remove -y 'php.*' --fix-missing || echo "::warning::apt-get remove php.* failed"
# Remove databases
sudo apt-get remove -y '^mongodb-.*' --fix-missing || echo "::warning::The command [sudo apt-get remove -y '^mongodb-.*' --fix-missing] failed to complete successfully. Proceeding..."
sudo apt-get remove -y '^mysql-.*' --fix-missing || echo "::warning::The command [sudo apt-get remove -y '^mysql-.*' --fix-missing] failed to complete successfully. Proceeding..."
sudo apt-get remove -y '^mongodb-.*' --fix-missing || echo "::warning::apt-get remove mongodb failed"
sudo apt-get remove -y '^mysql-.*' --fix-missing || echo "::warning::apt-get remove mysql failed"
# Remove apt packages
sudo apt-get autoremove -y || echo "::warning::The command [sudo apt-get autoremove -y] failed to complete successfully. Proceeding..."
sudo apt-get clean || echo "::warning::The command [sudo apt-get clean] failed to complete successfully. Proceeding..."
sudo apt-get autoremove -y || echo "::warning::apt-get autoremove failed"
sudo apt-get clean || echo "::warning::apt-get clean failed"
# Remove Java (JDKs)
sudo rm -rf /usr/lib/jvm
# Remove .NET SDKs
sudo rm -rf /usr/share/dotnet
sudo apt-get remove -y '^aspnetcore-.*' || echo "::warning::The command [sudo apt-get remove -y '^aspnetcore-.*'] failed to complete successfully. Proceeding..."
sudo apt-get remove -y '^dotnet-.*' --fix-missing || echo "::warning::The command [sudo apt-get remove -y '^dotnet-.*' --fix-missing] failed to complete successfully. Proceeding..."
sudo apt-get remove -y '^aspnetcore-.*' || echo "::warning::apt-get remove aspnetcore failed"
sudo apt-get remove -y '^dotnet-.*' --fix-missing || echo "::warning::apt-get remove dotnet failed"
# Remove Swift toolchain
sudo apt-get remove -y '^llvm-.*' --fix-missing || echo "::warning::The command [sudo apt-get remove -y '^llvm-.*' --fix-missing] failed to complete successfully. Proceeding..."
sudo apt-get remove -y '^llvm-.*' --fix-missing || echo "::warning::apt-get remove llvm failed"
sudo rm -rf /usr/share/swift
# Remove Haskell (GHC)
sudo rm -rf /usr/local/.ghcup
@@ -35,15 +61,15 @@ runs:
# Remove Android SDKs
sudo rm -rf /usr/local/lib/android
# Remove browsers
sudo apt-get remove -y google-chrome-stable firefox --fix-missing || echo "::warning::The command [sudo apt-get remove -y azure-cli google-chrome-stable firefox powershell mono-devel libgl1-mesa-dri --fix-missing] failed to complete successfully. Proceeding..."
sudo apt-get remove -y google-chrome-stable firefox --fix-missing || echo "::warning::apt-get remove chrome/firefox failed"
sudo rm -rf /usr/local/share/chromium /opt/microsoft /opt/google
# Remove cloud tools
sudo rm -rf /opt/az
sudo apt-get remove -y azure-cli mono-devel libgl1-mesa-dri --fix-missing || echo "::warning::The command [sudo apt-get remove -y azure-cli google-chrome-stable firefox powershell mono-devel libgl1-mesa-dri --fix-missing] failed to complete successfully. Proceeding..."
sudo apt-get remove -y google-cloud-sdk --fix-missing || echo "::debug::The command [sudo apt-get remove -y google-cloud-sdk --fix-missing] failed to complete successfully. Proceeding..."
sudo apt-get remove -y google-cloud-cli --fix-missing || echo "::debug::The command [sudo apt-get remove -y google-cloud-cli --fix-missing] failed to complete successfully. Proceeding..."
sudo apt-get remove -y azure-cli mono-devel libgl1-mesa-dri --fix-missing || echo "::warning::apt-get remove azure-cli/mono failed"
sudo apt-get remove -y google-cloud-sdk --fix-missing || echo "::debug::apt-get remove google-cloud-sdk failed"
sudo apt-get remove -y google-cloud-cli --fix-missing || echo "::debug::apt-get remove google-cloud-cli failed"
# Remove PowerShell
sudo apt-get remove -y powershell --fix-missing || echo "::warning::The command [sudo apt-get remove -y azure-cli google-chrome-stable firefox powershell mono-devel libgl1-mesa-dri --fix-missing] failed to complete successfully. Proceeding..."
sudo apt-get remove -y powershell --fix-missing || echo "::warning::apt-get remove powershell failed"
sudo rm -rf /usr/local/share/powershell
# Remove CodeQL and other toolcaches
sudo rm -rf /opt/hostedtoolcache
@@ -51,16 +77,117 @@ runs:
sudo rm -rf /usr/local/bin/minikube
# Remove Rust
sudo rm -rf /home/runner/.rustup /etc/skel/.rustup
- name: '[Linux] Show disk usage after cleaning'
if: runner.os == 'Linux'
shell: bash
- name: Show disk usage after cleaning
run: df -h
shell: bash
- name: Top-level directories after cleaning
run: du -h -d1 / | sort -hr | head -n 20 || true
shell: bash
- name: Detailed file sizes after cleaning
run: find / -type f -exec du -h {} + 2>/dev/null | sort -hr | head -n 1000 || true
shell: bash
- name: List installed packages after cleaning
run: dpkg --get-selections | grep -v deinstall
run: |
df -h
echo "Top-level directories by size:"
sudo du -h -d1 / 2>/dev/null | sort -hr | head -n 20 || true
- name: '[Linux] Assert free space and emit summary'
if: runner.os == 'Linux' && always()
shell: bash
run: |
free_kb=$(df --output=avail / | tail -1 | tr -d ' ')
free_gb=$(awk "BEGIN{printf \"%.2f\", $free_kb/1024/1024}")
before_kb=${__CLEAN_RUNNER_DISK_FREE_BEFORE_KB:-0}
freed_gb=$(awk "BEGIN{printf \"%.2f\", ($free_kb-$before_kb)/1024/1024}")
started_at=${__CLEAN_RUNNER_DISK_STARTED_AT:-$(date +%s)}
elapsed=$(( $(date +%s) - started_at ))
em=$((elapsed/60)); es=$((elapsed%60))
echo "clean-runner-disk: freed ${freed_gb} GB in ${em}m ${es}s on Linux" >> $GITHUB_STEP_SUMMARY
threshold_kb=$((20*1024*1024))
if [ "$free_kb" -lt "$threshold_kb" ]; then
echo "Top 5 directories by size:"
sudo du -h -d1 / 2>/dev/null | sort -hr | head -n 5 || true
echo "::error::clean-runner-disk: only ${free_gb} GB free on /, expected >= 20 GB."
exit 1
fi
echo "clean-runner-disk: ${free_gb} GB free on / (threshold 20 GB) — OK"
# =========================================================================
# Windows path
# =========================================================================
- name: '[Windows] Show disk usage before cleaning'
if: runner.os == 'Windows'
shell: pwsh
run: |
Get-PSDrive -PSProvider FileSystem | Format-Table -AutoSize
$freeBytes = (Get-PSDrive C).Free
"__CLEAN_RUNNER_DISK_FREE_BEFORE_BYTES=$freeBytes" | Out-File -FilePath $env:GITHUB_ENV -Append
"__CLEAN_RUNNER_DISK_STARTED_AT=$([DateTimeOffset]::UtcNow.ToUnixTimeSeconds())" | Out-File -FilePath $env:GITHUB_ENV -Append
# Flat listing only — recursive size scan before cleanup takes ~5 min on a loaded runner
Get-ChildItem -Path C:\ -Directory -Force -ErrorAction SilentlyContinue |
Select-Object Name, LastWriteTime | Format-Table -AutoSize
- name: '[Windows] Clean runner disk (parallel Remove-Item)'
if: runner.os == 'Windows'
shell: pwsh
run: |
$paths = @(
'C:\Android',
'C:\hostedtoolcache',
'C:\Program Files\dotnet',
'C:\Program Files (x86)\dotnet',
'C:\ghcup',
'C:\Program Files\Haskell',
'C:\Program Files\Haskell Platform',
'C:\Strawberry',
'C:\msys64',
'C:\Miniconda',
'C:\Program Files\MongoDB',
'C:\Program Files\PostgreSQL',
'C:\PostgreSQL',
'C:\Program Files\Google\Chrome',
'C:\Program Files (x86)\Google\Chrome',
'C:\Program Files\Mozilla Firefox',
'C:\selenium',
'C:\SeleniumWebDrivers',
'C:\vcpkg',
'C:\tools'
)
# ForEach-Object -Parallel (PS7) removes directories concurrently without
# subprocess overhead. ThrottleLimit 8 keeps I/O pressure reasonable.
$paths | Where-Object { Test-Path -LiteralPath $_ } | ForEach-Object -Parallel {
$p = $_
Write-Host "Removing $p"
Remove-Item -LiteralPath $p -Recurse -Force -ErrorAction Continue
if (Test-Path -LiteralPath $p) {
Write-Host "::warning::$p still present after removal"
}
} -ThrottleLimit 8
- name: '[Windows] Show disk usage after cleaning'
if: runner.os == 'Windows'
shell: pwsh
run: |
Get-PSDrive -PSProvider FileSystem | Format-Table -AutoSize
- name: '[Windows] Assert free space and emit summary'
if: runner.os == 'Windows' && always()
shell: pwsh
run: |
$freeBytes = (Get-PSDrive C).Free
$freeGb = [math]::Round($freeBytes / 1GB, 2)
$beforeBytes = [int64]($env:__CLEAN_RUNNER_DISK_FREE_BEFORE_BYTES ?? '0')
$freedGb = [math]::Round(($freeBytes - $beforeBytes) / 1GB, 2)
$startedAt = [int64]($env:__CLEAN_RUNNER_DISK_STARTED_AT ?? [DateTimeOffset]::UtcNow.ToUnixTimeSeconds())
$elapsed = [DateTimeOffset]::UtcNow.ToUnixTimeSeconds() - $startedAt
$em = [math]::Floor($elapsed / 60); $es = $elapsed % 60
"clean-runner-disk: freed $freedGb GB in ${em}m ${es}s on Windows" | Out-File -FilePath $env:GITHUB_STEP_SUMMARY -Append
$thresholdGb = 40
if ($freeGb -lt $thresholdGb) {
Write-Host "Top 5 remaining directories on C: by size:"
Get-ChildItem -Path C:\ -Directory -Force -ErrorAction SilentlyContinue |
ForEach-Object {
$size = (Get-ChildItem -Path $_.FullName -Recurse -Force -ErrorAction SilentlyContinue |
Measure-Object -Property Length -Sum).Sum
[PSCustomObject]@{ Path = $_.FullName; SizeGB = [math]::Round($size / 1GB, 2) }
} | Sort-Object SizeGB -Descending | Select-Object -First 5 | Format-Table -AutoSize
Write-Host "::error::clean-runner-disk: only $freeGb GB free on C:, expected >= $thresholdGb GB."
exit 1
}
Write-Host "clean-runner-disk: $freeGb GB free on C: (threshold $thresholdGb GB) — OK"
+1 -1
@@ -21,7 +21,7 @@ jobs:
uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8 # v6.0.1
- name: Clean runner disk
uses: ./.github/actions/clean-runner-disk-windows
uses: ./.github/actions/clean-runner-disk
# Secrets are not available to PRs from forks. The Windows base image
# comes from mcr.microsoft.com, so Docker Hub auth is not strictly
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-05-12
@@ -0,0 +1,131 @@
## Context
CI in this repo runs on GitHub-hosted runners (`ubuntu-24.04` for `build.yml` / `ci.yml`, `windows-2025` for `windows.yml`). Both jobs build a Flutter Docker image: the Linux runner has ~14 GB of disk that fills up quickly when buildx caches layers, and the Windows runner has more headroom but still benefits from removing 10-15 GB of pre-installed tooling because Windows base images and Visual Studio BuildTools are bulky. To make room, each job runs a custom composite action before the build:
- `.github/actions/clean-runner-disk/` — Linux, uses `apt-get remove` + `rm -rf` (Bash). Median ~3 min.
- `.github/actions/clean-runner-disk-windows/` — Windows, uses PowerShell `Remove-Item -Recurse -Force` over ~20 directories (`C:\Android`, `C:\hostedtoolcache`, `C:\Program Files\dotnet`, `C:\msys64`, `C:\Strawberry`, `C:\Miniconda`, etc.). Median ~10 min, and dominated by NTFS small-file deletion.
The pair were authored at different times for different jobs and have drifted: the Linux variant removes Rust/Haskell/Julia/Swift toolchains the Windows variant doesn't bother with, and the Windows variant cleans dirs the Linux variant doesn't have. Neither references the other; both rely on a reviewer noticing that a new toolchain has appeared on the runner image and copy-pasting a `rm` line.
Public ecosystem context:
- `jlumbroso/free-disk-space` ([github.com/jlumbroso/free-disk-space](https://github.com/jlumbroso/free-disk-space)) is the de-facto disk-cleanup action on GitHub Actions, but it explicitly targets **Ubuntu only** — marketplace title is "Free Disk Space (Ubuntu)", and there is no Windows code path in `action.yml` ([source](https://github.com/jlumbroso/free-disk-space/blob/main/action.yml)).
- `endersonmenezes/free-disk-space`, `insightsengineering/disk-space-reclaimer`, and `easimon/maximize-build-disk-space` are all Linux-only.
- No widely-used composite action exists that handles both runner OSes from a single reference. The "Mastering Disk Space on GitHub Actions Runners" survey ([geraldonit.com](https://www.geraldonit.com/mastering-disk-space-on-github-actions-runners-a-deep-dive-into-cleanup-strategies-for-x64-and-arm64-runners/)) and the DEV "Squeezing Disk Space" guide ([dev.to/mathio](https://dev.to/mathio/squeezing-disk-space-from-github-actions-runners-an-engineers-guide-3pjg)) only cover Linux/x64+arm64.
This change is the first in this repo to take a position on the cross-platform contract.
## Goals / Non-Goals
**Goals:**
- One action reference `./.github/actions/clean-runner-disk` works on both `ubuntu-24.04` and `windows-2025` runners — workflows do not branch on OS.
- Windows cleanup completes in ≤ 4 minutes wall-clock on `windows-2025` (down from current ~10 min).
- Linux cleanup retains its current ~3-minute budget (no regression).
- A post-clean disk-free assertion makes a silent regression (cleanup script removed nothing) fail loudly rather than surfacing as a Docker build that runs out of disk later in the job.
- The action emits one job-summary line per run (`"Freed X GB in Y seconds (runner: <os>)"`) so trend regressions are observable on the PR check page without opening logs.
**Non-Goals:**
- Caching the Windows base image (`mcr.microsoft.com/...`) or otherwise speeding up the `docker build` step itself. Separate change.
- Parallelizing the Docker build with vulnerability scanning (`docker/scout-action`). Separate change.
- Moving to self-hosted runners. Separate change.
- Supporting `macos-*` runners — the project does not run on macOS in CI; adding a path now would be speculative.
- Replacing the action with a third-party marketplace action — none cover both OSes (see Context).
## Decisions
### D1. Single composite action, OS-dispatched
**Decision**: One `action.yml` at `.github/actions/clean-runner-disk/`, with each step gated by `if: runner.os == 'Linux'` or `if: runner.os == 'Windows'`. Each step uses its native shell (`bash` for Linux, `pwsh`/`cmd` for Windows).
**Alternatives considered**:
- *Two actions, one per OS, kept as-is.* Rejected because it's the status quo and is exactly what allowed the drift this change is trying to fix.
- *One action that shells out to a per-OS script file (`clean-linux.sh`, `clean-windows.ps1`).* Considered. Better for unit-testing the scripts locally, worse for "open one file to see what the action does." We choose inline shells in `action.yml` for the first iteration. If the action grows past ~150 lines we revisit.
- *Use `jlumbroso/free-disk-space` for Linux and keep a custom Windows action.* Rejected — adds an external dependency, still leaves Windows separate, and we'd need to pin/audit a third-party action for every supply-chain review (this repo already runs the Scorecard workflow).
**Rationale**: Composite actions support OS-gated steps natively, and a single `action.yml` is what makes the cross-OS contract visible in one diff.
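The dispatch rule in D1 can be restated as a small shell function, purely for illustration. The action itself gates steps with `if: runner.os == ...` conditions in `action.yml`; the function name `select_cleanup_path` below is hypothetical and is not part of the action. `RUNNER_OS` is the environment variable GitHub Actions sets on hosted runners.

```shell
# Illustrative sketch only: the real action dispatches with step-level `if:`
# conditions. select_cleanup_path is a hypothetical name.
select_cleanup_path() {
  case "$1" in
    Linux)   echo "linux" ;;
    Windows) echo "windows" ;;
    *)
      # Same message shape as the guard step in action.yml.
      echo "::error::clean-runner-disk: unsupported runner.os '$1'. Supported: Linux, Windows." >&2
      return 1
      ;;
  esac
}

# Default to Linux so the sketch also runs outside GitHub Actions.
select_cleanup_path "${RUNNER_OS:-Linux}" || true
```

Any value other than `Linux` or `Windows` is rejected up front, which matches the Non-Goal of not supporting `macos-*` runners.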
### D2. Windows fast-delete strategy: `robocopy /MIR` with `/MT:128`, with PowerShell fallback
**Decision**: For each target directory on Windows, drain its contents using `robocopy <empty-dir> <target> /MIR /MT:128 /R:1 /W:1 /NFL /NDL /NJH /NJS`, then remove the now-empty target with `Remove-Item -LiteralPath <target> -Force`. The empty source directory is created once at the start of the step. If `robocopy` returns an exit code ≥ 8 (real error, distinct from the 0-7 "files copied/purged" success range) for any target, fall back per-directory to `Remove-Item -LiteralPath <path> -Recurse -Force -ErrorAction Continue`. All targets run inside one PowerShell step so fallback is decided per-directory and total elapsed time is bounded.
**Alternatives considered**:
- *`cmd /c rmdir /s /q "<path>"`* — single-step recursive remove, no thread parallelism. Originally proposed (see git history). Rejected because `rmdir` is single-threaded and the dominant cost on our workload is per-file metadata syscalls in directories with tens of thousands of small files (`C:\hostedtoolcache`, `C:\msys64`, `C:\Miniconda`); benchmarks below show `robocopy /MT` consistently wins on this workload class.
- *Stay on `Remove-Item -Recurse -Force`* — the status quo. Rejected on direct measurement against `robocopy /MIR` (next bullet).
- *`Microsoft.PowerShell.Management` `[System.IO.Directory]::Delete($path, $true)` .NET call* — same NTFS cost as `Remove-Item`; no win.
- *Combo `del /F /Q /S` then `rmdir /S /Q`* (the recipe Matt Pilz benchmarks at [mattpilz.com](https://mattpilz.com/fastest-way-to-delete-large-folders-windows/)): fast (29-38 s on 3.15 GB / 46k files), but still single-threaded. Beaten by `robocopy /MT` on the larger trees we care about.
**Public benchmarks supporting the choice** (we did *not* run a benchmark on `windows-2025`; the design accepts that public data on comparable workloads is a sufficient signal):
- 25 GB dataset: `Remove-Item -Recurse` 105 s vs `robocopy /MIR <empty>` 75 s — ~30 % faster ([discussion summary surfaced via WebSearch on robocopy purge vs rmdir]).
- 200 GB / 500 k files: `robocopy` reported ~2× faster than `rm -r`, 45× faster than GUI shift-delete ([news.ycombinator.com/item?id=35312297](https://news.ycombinator.com/item?id=35312297)).
- 1 M files: `robocopy` 257 s (vs a custom multi-threaded tool at 34 s — not adopted here because it would require shipping a binary) (same HN thread).
- No public benchmark contradicts the ordering, fastest to slowest, of `robocopy /MIR /MT`, `del + rmdir`, `rmdir /s /q`, `Remove-Item -Recurse` on the multi-GB / many-files workload class.
**Rationale**: Our 20-directory cleanup list contains several trees (hostedtoolcache, msys64, Miniconda, Strawberry Perl) that are individually 1-5 GB with tens of thousands of files. Per-file metadata cost dominates wall-clock, and `robocopy /MT:128` is the only readily-available native tool that parallelizes that cost. The PS fallback covers the edge cases robocopy mishandles (junction reparse points to elsewhere on the system, ACL-protected files) so we never silently leave a previously-cleaned tree intact.
**Robocopy exit-code handling**: 0 (nothing to do), 1 (files copied), 2 (extras purged, our normal success), 3 (1+2), 4-7 (mismatches/warnings, still success); ≥ 8 is a real failure. The step treats `$LASTEXITCODE -lt 8` as success.
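The exit-code rule above reduces to a single comparison. The sketch below restates it as a shell function so the boundary is easy to check; `classify_robocopy_exit` is a hypothetical name, and the real step would perform the equivalent test on `$LASTEXITCODE` in PowerShell.

```shell
# Hypothetical helper: maps a robocopy exit code to the next action in D2.
#   0-7  -> "ok"        (drain succeeded; remove the now-empty directory)
#   >= 8 -> "fallback"  (use Remove-Item -Recurse -Force for this directory)
classify_robocopy_exit() {
  if [ "$1" -lt 8 ]; then
    echo "ok"
  else
    echo "fallback"
  fi
}

# Sample codes on both sides of the boundary.
for code in 0 2 7 8 16; do
  echo "exit $code: $(classify_robocopy_exit "$code")"
done
```

Because the decision is made per directory, one failing tree only sends that tree down the slow path.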
### D3. Post-clean disk-free assertion
**Decision**: After cleanup, the action asserts free space on the build drive against a per-OS minimum (Linux: 20 GB on `/`, Windows: 40 GB on `C:`). If free space is below the threshold, the step emits an `::error::` annotation with the actual number and exits non-zero, so the job stops at the cleanup step rather than failing 15 minutes later in `docker build`.
**Alternatives considered**:
- *No assertion, rely on the build to fail.* Rejected — turns a 30-second cleanup misconfiguration into a 25-minute "no space left on device" rerun.
- *Assert "freed at least N GB" relative to pre-clean.* Rejected — the pre-clean baseline drifts every time GitHub updates the runner image, and we'd be alerting on noise rather than on actual breakage.
**Rationale**: A flat absolute threshold matches the question we actually care about ("can the Docker build run?"). Numbers are chosen as ~80% of the historically-observed free space after cleanup; they can be tuned via inputs without changing the spec.
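A minimal sketch of the flat-threshold check described in D3, assuming free space is measured in KB (as `df` reports it on Linux) and the threshold is given in GB. `assert_free_space` is a hypothetical helper name and the sample values are made up for illustration.

```shell
# Hypothetical helper mirroring the D3 rule: fail when free space is below
# the per-OS minimum. Arguments: free space in KB, threshold in GB.
assert_free_space() {
  local free_kb="$1" threshold_gb="$2"
  local threshold_kb=$(( threshold_gb * 1024 * 1024 ))
  if [ "$free_kb" -lt "$threshold_kb" ]; then
    echo "::error::clean-runner-disk: only ${free_kb} KB free, expected >= ${threshold_gb} GB."
    return 1
  fi
  echo "clean-runner-disk: ${free_kb} KB free (threshold ${threshold_gb} GB): OK"
}

# Made-up sample values; in a workflow the first argument would come from df.
assert_free_space $(( 25 * 1024 * 1024 )) 20
assert_free_space $(( 10 * 1024 * 1024 )) 20 || echo "step would fail here"
```

The check compares against an absolute number only, so a change in the pre-clean baseline after a runner image update does not affect the outcome.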
### D4. Observability via `$GITHUB_STEP_SUMMARY`
**Decision**: The action writes one line to `$GITHUB_STEP_SUMMARY` (or the PowerShell equivalent on Windows) in the form `clean-runner-disk: freed 12.4 GB in 3m 12s on Windows`. This shows up on the PR check page in the job summary without needing to expand logs.
**Alternatives considered**:
- *Emit a metric to an external observability backend.* Out of scope — the repo has no metrics pipeline today.
- *No summary, rely on log timing.* Rejected because the user's research question that motivated this change required scraping `gh run view` JSON to discover the bottleneck. A one-line summary makes the same data visible at a glance.
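The summary line in D4 is a fixed format built from three values. A small sketch, assuming the freed size is already formatted in GB and the elapsed time is in whole seconds; `format_summary` is a hypothetical name.

```shell
# Hypothetical helper: builds the one-line job summary described in D4.
# Arguments: freed GB (preformatted), elapsed seconds, runner OS name.
format_summary() {
  local freed_gb="$1" elapsed="$2" os="$3"
  local minutes=$(( elapsed / 60 ))
  local seconds=$(( elapsed % 60 ))
  echo "clean-runner-disk: freed ${freed_gb} GB in ${minutes}m ${seconds}s on ${os}"
}

# 192 seconds is 3m 12s, matching the example line in the Decision.
format_summary 12.4 192 Windows
```

In a workflow step the output would be appended to the file named by `$GITHUB_STEP_SUMMARY` instead of printed to the log.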
### D5. Spec capability scope
**Decision**: New capability `ci-runner-disk-cleanup`. The capability's requirements are stated in terms of what a *CI engineer reviewing PR check timings* observes (job duration, post-clean free space, single action reference), not in terms of `rmdir` vs `Remove-Item`. The latter belongs in this design document; the former in the spec.
**Rationale**: Implementation tactics will rot (`robocopy` may stop being the fastest tool when GitHub upgrades the runner image, or when we move to Windows Server 2026). The contract ("≤ 4 min, ≥ 40 GB free, same action reference for both OSes") is what we want to outlive the implementation.
## Risks / Trade-offs
- **[Risk] `robocopy /MIR` fails or warns on a path that `Remove-Item` would have removed** (reparse points to elsewhere on the system, ACL'd files, very long paths). → **Mitigation**: any robocopy invocation returning `$LASTEXITCODE -ge 8` falls back per-directory to `Remove-Item -Recurse -Force`; post-clean directory-existence check logs anything still present as a warning so we can iterate.
- **[Risk] GitHub updates `windows-2025` runner image to add a new pre-installed toolchain that fills the drive again.** → **Mitigation**: post-clean free-space assertion (D3) fails the job loudly; the failure message names the directory that grew. Linux side has the same protection.
- **[Risk] Inline shell scripts in `action.yml` become unmaintainable** as the cleanup list grows. → **Mitigation**: design contract requires splitting to `clean-linux.sh` / `clean-windows.ps1` once the action exceeds ~150 lines (see D1 alternatives). For now, both fit comfortably.
- **[Risk] A workflow that doesn't actually need cleanup pays the time cost** because cleanup is invoked unconditionally before the build. → **Mitigation**: action is only referenced from the three workflow jobs that do need it (`ci.yml`, `build.yml/test_image`, `windows.yml/test_windows`); the small/fast workflows (`gx`, `scorecard`, `tag`, `changelog`, `update_version`) don't call it and don't need to.
- **[Trade-off] Single composite action means the same `action.yml` runs steps that are no-ops on the other OS.** Each step is gated by `runner.os` so it costs the runner ~1 second per skipped step. Acceptable cost for the readability win of one file.
- **[Trade-off] PowerShell fallback means the worst-case Windows time is bounded by the slow path, not the fast path.** If `robocopy` returns ≥ 8 for every directory, we're back to ~10 minutes. The post-clean assertion still fires, so the job fails fast rather than silently slow.
## Automated Test Strategy
- **Critical path**: PR runs of `windows.yml` and `build.yml` after the change is merged are the real test — both must complete and the Windows job must drop to ≤ 19 min median over a 10-PR sample.
- **Pre-merge verification**: a `workflow_dispatch` smoke run of each workflow on a PR branch validates the new action before merge. The action's post-clean assertion (D3) gives an immediate pass/fail without needing to wait for the Docker build to run out of disk.
- **No new test infrastructure**: composite actions are not unit-testable in isolation through GitHub's tooling; the verification surface is the workflow run itself. We accept this.
- **Regression guard**: the job-summary line (D4) captures duration and freed bytes per run. After merge, a 10-run window from `gh run list --workflow=windows.yml --json` is the metric to compare against the pre-change baseline (median ~25 min, p95 ~50 min) — recorded in `tasks.md` as the closure check.
## Observability
- **Primary signal**: `$GITHUB_STEP_SUMMARY` line per run — `"clean-runner-disk: freed X GB in Y on <os>"`. Visible on the PR check page without opening the log.
- **Failure surface**: post-clean assertion (D3) fails with a typed message (`"clean-runner-disk: only N GB free on <drive>, expected ≥ M GB. Largest remaining dirs: …"`) so a regression cannot be silent — the job either drops below threshold and fails immediately, or proceeds to a Docker build that has the space it needs.
- **Logging**: pre- and post-clean disk usage tables (already present in both existing actions) are preserved; nothing is silenced.
- **Out of band**: no Slack/email/metrics-backend integration. The repo has no such pipeline today and adding one is out of scope.
## Migration Plan
1. Land the unified action and update the three workflow references in the same PR — there is no period where one runner uses the new action and another uses the old. Composite-action callers are local to this repo, so there is no external migration cost.
2. Delete `.github/actions/clean-runner-disk-windows/` in the same PR. No tag / release implication.
3. Roll back, if needed, by reverting the single commit — both old action directories are recoverable from history.
## Open Questions
1. **~~Does `rmdir /s /q` actually beat `Remove-Item` by enough on `windows-2025`?~~** Resolved: skipped the first-party benchmark; public benchmarks on comparable workloads (see D2 sources) showed `robocopy /MIR /MT:128` is the consistent winner, so D2 was switched to robocopy without measuring `rmdir` on the runner. If post-merge Windows job duration does not drop into the ≤ 19 min band (task 6.1), revisit with a real `workflow_dispatch` measurement.
2. **Should the `paths` input accept globs?** Current proposal is plain newline-separated absolute paths. Globs would need separate Bash and PowerShell expansion logic. Defer until a caller actually needs it — YAGNI.
3. **Do we want a `dry-run` input** so a workflow can preview what would be removed without removing it? Tempting for debugging the runner-image-update breakage scenario (Risk #2), but adds surface area. Defer until that scenario actually happens.
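For open question 2, parsing a plain newline-separated list without glob support could look like the sketch below. This is an assumption about how the proposed `paths` input might be consumed, not something the action does today; `parse_paths` is a hypothetical name, and the rule "starts with `/` or a drive letter" is one possible reading of "absolute path".

```shell
# Hypothetical sketch: split a newline-separated value into one path per line,
# trimming whitespace, skipping blanks, and rejecting non-absolute entries.
# No glob expansion is performed.
parse_paths() {
  printf '%s\n' "$1" | while IFS= read -r line; do
    # Trim leading and trailing whitespace.
    line="${line#"${line%%[![:space:]]*}"}"
    line="${line%"${line##*[![:space:]]}"}"
    if [ -z "$line" ]; then
      continue
    fi
    case "$line" in
      /*|[A-Za-z]:\\*) printf '%s\n' "$line" ;;
      *) echo "::warning::skipping non-absolute path '$line'" >&2 ;;
    esac
  done
}

# Made-up sample input: one relative entry is skipped with a warning.
sample='/usr/lib/jvm

/opt/az
relative/dir'
parse_paths "$sample"
```

Keeping the parser this small is consistent with deferring globs until a caller needs them.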
@@ -0,0 +1,36 @@
## Why
The `test_windows` job in `.github/workflows/windows.yml` is the slowest job in CI by a wide margin: median ~25 min and worst-case ~50 min, roughly 2× the next slowest (`test_image` at ~21 min). Inside that job, the `Clean Runner Disk (Windows)` composite action alone consumes ~10 minutes — a third of the wall-clock budget — because it deletes ~20 large hosted-tool directories via PowerShell `Remove-Item -Recurse -Force`, which is the slow path on NTFS for millions of small files. Meanwhile the Linux equivalent at `.github/actions/clean-runner-disk/action.yml` does the same job in ~3 min using `apt-get remove` + `rm -rf`, and the two actions duplicate intent (free space so the Docker build doesn't fill the runner) with no shared contract. Maintaining two actions also drifts what each removes — e.g. the Linux action prunes Rust/Haskell/Julia/Swift, while the Windows action prunes a different set, and nothing enforces parity.
This change unifies both into a single CI capability with measurable performance and behavior requirements, and replaces the slow PowerShell path with a fast native-tool path (cmd `rmdir /s /q` or `robocopy` empty-mirror) so the Windows cleanup completes within a budgeted time.
## What Changes
- **New unified composite action** `.github/actions/clean-runner-disk/action.yml` that dispatches by `runner.os` and runs the platform-appropriate cleanup, replacing the per-OS pair.
- **BREAKING (for `.github/workflows/*.yml` callers only)**: remove `.github/actions/clean-runner-disk-windows/`; `ci.yml`, `build.yml` `test_image`, and `windows.yml` `test_windows` now all reference the same action path `./.github/actions/clean-runner-disk`.
- Windows path swaps `Remove-Item -Recurse -Force` for `robocopy <empty> <target> /MIR /MT:128 /R:1 /W:1 /NFL /NDL /NJH /NJS` followed by an empty-directory `Remove-Item`. `robocopy /MT:128` is the only readily-available native tool that parallelizes per-file metadata cost on NTFS, which dominates wall-clock on directories with tens of thousands of small files (hostedtoolcache, msys64, Miniconda, Strawberry). PowerShell `Remove-Item -Recurse -Force` remains the per-directory fallback when robocopy returns an exit code ≥ 8. Target: Windows cleanup completes in ≤ 4 minutes wall-clock.
- Both paths log disk usage before and after, and the action surfaces a job-summary line ("Freed X GB in Y seconds") so regressions in CI duration are visible without scraping logs.
- Action accepts an optional `paths` input (newline-separated) so individual workflows can opt in/out of specific aggressive deletions if needed (e.g. a future job that wants to keep Android SDK on the host).
- Drift-protection: the action documents the intent ("free ≥ N GB of space without touching anything the Flutter build needs") and the two OS-specific cleanup scripts live as siblings inside the single action directory, so a reviewer sees both in one diff.
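A minimal sketch of how the newline-separated `paths` input described above could be parsed on the Linux side. The helper name `parse_paths` and the choice to trim whitespace and drop blank lines are assumptions made for illustration; the proposal only fixes the input format.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the proposal): split a newline-separated
# `paths` input into one trimmed path per output line, dropping blank lines.
parse_paths() {
  local raw="$1" line
  while IFS= read -r line || [ -n "$line" ]; do
    # Trim leading whitespace, then trailing whitespace (including CR).
    line="${line#"${line%%[![:space:]]*}"}"
    line="${line%"${line##*[![:space:]]}"}"
    [ -n "$line" ] && printf '%s\n' "$line"
  done <<< "$raw"
  return 0
}
```

A cleanup script could then loop over the result with `parse_paths "$INPUT_PATHS" | while IFS= read -r p; do ...; done`, where `INPUT_PATHS` is an assumed environment variable that the composite step would have to populate from `inputs.paths`.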
## Capabilities
### New Capabilities
- `ci-runner-disk-cleanup`: defines what `.github/actions/clean-runner-disk` SHALL achieve on the GitHub-hosted runners used by this repo (`ubuntu-24.04` and `windows-2025`) — minimum disk freed, maximum wall-clock spent, observability of the result, and the contract that both runner OSes are invoked via the same action reference from any workflow that needs cleanup.
### Modified Capabilities
_None._ The existing specs (`actions-version-tracking`, `flutter-version-update`, `repository-wiki`, `windows-image-testing`) describe what the images and their CI verify, not the CI infrastructure that supports the build. This change introduces a brand-new capability rather than redefining any of them.
## Impact
- **Affected files**:
- New: `.github/actions/clean-runner-disk/action.yml` (replacing the current Linux-only file at the same path)
- Possibly new: `.github/actions/clean-runner-disk/clean-linux.sh`, `clean-windows.ps1` or `clean-windows.cmd` (split for readability; final layout decided in design)
- Removed: `.github/actions/clean-runner-disk-windows/action.yml` and its directory
- Updated callers: `.github/workflows/ci.yml:50-51`, `.github/workflows/build.yml:41-42`, `.github/workflows/windows.yml:23-24`
- **Behavioral change for CI**: median `test_windows` wall-clock drops by ~6 minutes (from ~25 to ~19) — directly observable on the PR check timeline. No change to what is actually freed from the runner; the contract is "at least the same disk space, faster."
- **Risk**: `robocopy` returns exit codes 0 through 7 for various success cases (1=copied, 2=purged, 4=mismatched, etc.); only ≥ 8 is a real failure. Mitigation: explicitly treat `$LASTEXITCODE -lt 8` as success, keep a per-directory PowerShell `Remove-Item` fallback for the ≥ 8 case, and assert post-clean free space against a minimum threshold so a silent regression in what gets removed fails the job rather than being discovered as a downstream build running out of disk space.
- **Out of scope**: caching the Windows base image layer (separate change), parallelizing build + scan (separate change), self-hosted runners (separate change). This proposal addresses only the cleanup step.
- **Relevance gate**: a CI engineer reviewing a PR for this repo would notice that every PR's Windows check takes ~6 min less, and would observe the unified action when scanning the diff for cleanup behavior. The spec captures the contract that prevents the Windows path from drifting back to "delete everything in PowerShell because it's the obvious thing to write."
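
The exit-code rule from the Risk item can be stated as a small decision function. This is an illustrative sketch in bash only; the real Windows step would run in `pwsh` and test `$LASTEXITCODE`. The function name `classify_robocopy_exit` is hypothetical.

```shell
#!/usr/bin/env bash
# robocopy's exit code is a bit mask; any value below 8 means no copy failure
# occurred. Hypothetical helper mapping a code to the two outcomes in the text.
classify_robocopy_exit() {
  local code="$1"
  if [ "$code" -lt 8 ]; then
    echo "success"
  else
    echo "fallback"   # the text prescribes Remove-Item -Recurse -Force here
  fi
}
```

Keeping the threshold in one place avoids the common mistake of treating any nonzero exit code as a failure, which would send every directory down the slow fallback path.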
@@ -0,0 +1,101 @@
## ADDED Requirements
### Requirement: Single action reference works on both supported runner OSes
`.github/actions/clean-runner-disk` SHALL be a single composite action invocable from any workflow job running on either `ubuntu-24.04` or `windows-2025` with the same `uses: ./.github/actions/clean-runner-disk` reference. The action SHALL dispatch its cleanup logic by `runner.os` internally; workflow YAML SHALL NOT branch on OS to choose between action paths.
The experience context is the CI engineer reviewing a workflow that needs runner disk space — they reference one action, see one diff in PRs that touch cleanup behavior, and do not need to remember a separate path for the Windows job.
#### Scenario: Linux job invokes the action
- **GIVEN** a workflow job with `runs-on: ubuntu-24.04` that calls `uses: ./.github/actions/clean-runner-disk`
- **WHEN** the action runs
- **THEN** only the Linux cleanup steps execute (Windows-gated steps are skipped)
- **AND** the job continues normally after cleanup
#### Scenario: Windows job invokes the action
- **GIVEN** a workflow job with `runs-on: windows-2025` that calls `uses: ./.github/actions/clean-runner-disk`
- **WHEN** the action runs
- **THEN** only the Windows cleanup steps execute (Linux-gated steps are skipped)
- **AND** the job continues normally after cleanup
#### Scenario: Unsupported runner OS is rejected loudly
- **GIVEN** a future workflow job with `runs-on: macos-14` that calls `uses: ./.github/actions/clean-runner-disk`
- **WHEN** the action runs
- **THEN** the action fails the job with a message naming the unsupported `runner.os`
- **AND** the job does not silently no-op (which would hide the misconfiguration until a downstream build runs out of disk space)
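
The three scenarios above amount to a dispatch on the runner OS with a loud default branch. A minimal sketch, assuming the script names floated in the proposal (`clean-linux.sh`, `clean-windows.ps1`); the function name `select_cleanup` is hypothetical and the final layout is not confirmed by the source.

```shell
#!/usr/bin/env bash
# Hypothetical dispatcher: choose a cleanup script from the runner OS value
# (GitHub Actions exposes it as RUNNER_OS) and fail loudly on anything else.
select_cleanup() {
  local os="$1"
  case "$os" in
    Linux)   echo "clean-linux.sh" ;;
    Windows) echo "clean-windows.ps1" ;;
    *)
      # ::error:: is the workflow-command prefix for an error annotation.
      echo "::error::clean-runner-disk: unsupported runner.os '$os'" >&2
      return 1
      ;;
  esac
}
```

In a step this would be called as `select_cleanup "$RUNNER_OS"`. In the composite action itself the same effect comes from `if: runner.os == ...` gates on each step plus one rejecting step.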
### Requirement: Windows cleanup completes within a 4-minute wall-clock budget
The Windows cleanup path SHALL complete in ≤ 4 minutes wall-clock at the 95th percentile across the rolling 30-day window of `windows.yml` runs. Implementation SHALL use the fastest native deletion tool available on `windows-2025` (currently `cmd /c rmdir /s /q`) and SHALL fall back to PowerShell `Remove-Item` only for paths that the fast path could not remove.
The experience context is the maintainer watching the PR check page — the Windows job dropping by ~6 minutes is the user-visible payoff of this capability, and a regression back toward 10 minutes is a real complaint.
#### Scenario: Typical Windows runner image, fast path succeeds for every target
- **GIVEN** a `windows-2025` runner with the standard set of pre-installed toolchains (Android SDK, hostedtoolcache, dotnet, msys64, Strawberry Perl, Miniconda, Chrome, Firefox, vcpkg, etc.)
- **WHEN** the cleanup action runs and the fast path removes every target directory on the first attempt
- **THEN** the action completes in ≤ 4 minutes
- **AND** every target directory is gone from disk
#### Scenario: A target directory resists the fast path and triggers fallback
- **GIVEN** one target directory contains a long path (>260 chars) or a locked file that `cmd /c rmdir /s /q` cannot remove
- **WHEN** the cleanup action runs
- **THEN** the action falls back to PowerShell `Remove-Item -Recurse -Force` for that specific directory
- **AND** the action continues past the failure rather than aborting
- **AND** the overall run still completes within the 4-minute budget at the 95th percentile
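
The fast-path-then-fallback behavior in the scenario above can be sketched independently of the Windows tooling. In this illustration `FAST_RM` and `SLOW_RM` are assumed stand-ins for the platform tools (they default to `rm -rf` so the sketch runs anywhere), and `clean_targets` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical driver: try a fast remover per directory, fall back to a slow
# one only for directories the fast path left behind, and never abort the loop.
FAST_RM="${FAST_RM:-rm -rf}"
SLOW_RM="${SLOW_RM:-rm -rf}"

clean_targets() {
  local target failed=0
  for target in "$@"; do
    [ -e "$target" ] || continue            # already absent: nothing to do
    # Unquoted on purpose so "rm -rf" splits into command plus flags.
    $FAST_RM "$target" 2>/dev/null || true
    if [ -e "$target" ]; then
      echo "fast path left $target behind, falling back"
      $SLOW_RM "$target" || failed=$((failed + 1))
    fi
  done
  echo "directories still present after fallback: $failed"
  return 0
}
```

Checking for the directory after the fast path, rather than trusting its exit status alone, is a design choice of this sketch: it makes the fallback decision depend on the observable outcome the scenario cares about.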
### Requirement: Linux cleanup retains its current ~3-minute budget
The Linux cleanup path SHALL complete in ≤ 4 minutes wall-clock at the 95th percentile across the rolling 30-day window of `ci.yml` and `build.yml` runs. Unifying the action SHALL NOT regress Linux performance relative to the pre-change baseline.
The experience context is the same maintainer comparing today's CI duration to yesterday's after the action is unified — Linux numbers must not move in the wrong direction as a side-effect of the Windows work.
#### Scenario: Linux cleanup runs under the unified action
- **GIVEN** an `ubuntu-24.04` runner with the standard pre-installed toolchains (JVM, .NET, Swift/LLVM, Haskell GHC, Julia, Android SDK, Chrome, Firefox, Azure CLI, PowerShell, hostedtoolcache, Rust, etc.)
- **WHEN** the cleanup action runs
- **THEN** the action completes in ≤ 4 minutes
- **AND** the set of removed paths is a superset of what the pre-change `.github/actions/clean-runner-disk/action.yml` removed
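
Both budget requirements are stated at the 95th percentile, so checking them needs a percentile over recorded cleanup durations. A minimal sketch using the nearest-rank definition, assuming durations arrive as seconds, one per line; how the samples are collected is outside this sketch, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Nearest-rank 95th percentile of durations in seconds (one per line on stdin).
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((95 * NR + 99) / 100)   # integer ceil(0.95 * N)
      print v[rank]
    }'
}

# Succeeds when the measured p95 fits the budget (both in seconds).
within_budget() {   # usage: within_budget <p95_seconds> <budget_seconds>
  [ "$1" -le "$2" ]
}
```

With the 4-minute budget expressed as 240 seconds, a check would read `within_budget "$(p95 < durations.txt)" 240`, where `durations.txt` is an assumed file of samples.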
### Requirement: Action asserts minimum post-clean free space and fails loudly on regression
After cleanup, the action SHALL check free space on the build drive (`/` on Linux, `C:` on Windows) and SHALL fail the step with a message naming the actual free space and the threshold when free space is below 20 GB on Linux or 40 GB on Windows.
The experience context is the maintainer whose runner image GitHub silently updated overnight to add a new 15 GB tool — without the assertion they would wait 25 minutes for `docker build` to fail with "no space left on device"; with the assertion the job fails at the cleanup step with a typed message that names what is full.
#### Scenario: Cleanup achieves enough free space
- **GIVEN** the runner has ≥ 20 GB free on `/` after Linux cleanup, or ≥ 40 GB free on `C:` after Windows cleanup
- **WHEN** the post-clean assertion runs
- **THEN** the assertion passes
- **AND** the job continues to the Docker build
#### Scenario: Cleanup did not free enough space
- **GIVEN** an underlying change (script bug, runner image update introducing a new large toolchain not in the removal list) leaves < 20 GB free on Linux or < 40 GB free on Windows
- **WHEN** the post-clean assertion runs
- **THEN** the assertion fails the step with `core.setFailed`
- **AND** the failure message names the actual free space, the threshold, and the top 5 remaining directories by size
- **AND** the Docker build does not run
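
A minimal sketch of the post-clean assertion for the Linux branch, using the thresholds from the requirement. The function names are hypothetical, the free-space figure is in GiB as reported by `df` (treated here as GB), and the "top 5 remaining directories" part of the message is omitted for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical check: compare free space (whole GB) against a threshold and
# emit an error naming both numbers when the threshold is not met.
assert_free_space() {   # usage: assert_free_space <free_gb> <threshold_gb> <mount>
  local free_gb="$1" threshold_gb="$2" mount="$3"
  if [ "$free_gb" -lt "$threshold_gb" ]; then
    echo "::error::clean-runner-disk: ${free_gb} GB free on ${mount}, need >= ${threshold_gb} GB"
    return 1
  fi
  echo "free space OK: ${free_gb} GB on ${mount} (threshold ${threshold_gb} GB)"
}

# Free space on a mount point in whole GB, from POSIX df output (1 KiB blocks).
free_gb_on() {
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}
```

On a Linux runner the step would call `assert_free_space "$(free_gb_on /)" 20 /`; the nonzero return fails the step, so the Docker build does not start.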
### Requirement: Action emits a one-line job summary
The action SHALL append a single line to `$GITHUB_STEP_SUMMARY` (`$env:GITHUB_STEP_SUMMARY` in PowerShell on Windows) in the form `clean-runner-disk: freed <X> GB in <Y>m <Z>s on <os>`. The line SHALL be emitted once per invocation and SHALL NOT require expanding the step logs to read.
The experience context is the maintainer scanning the PR check page for slow steps — the summary line surfaces the cleanup cost without log-scraping, which is how the bottleneck was discovered in the first place.
#### Scenario: Summary appears on the run page after success
- **GIVEN** any successful invocation of the action on any supported runner
- **WHEN** the run completes
- **THEN** the run summary on the PR check page contains a single line matching `^clean-runner-disk: freed [0-9.]+ GB in [0-9]+m [0-9]+s on (Linux|Windows)$`
#### Scenario: Summary appears even when the assertion fails
- **GIVEN** an invocation where the post-clean assertion fails (free space below threshold)
- **WHEN** the run completes (with the step marked failed)
- **THEN** the summary still contains the line so a maintainer can compare the freed-bytes number against historical values
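
A minimal sketch of the summary line for the bash branch, formatted to match the pattern in the first scenario. The function names are hypothetical, and the freed-space figure is assumed to be computed elsewhere (for example from free space before and after cleanup).

```shell
#!/usr/bin/env bash
# Hypothetical formatter for the one-line summary described above.
format_summary() {   # usage: format_summary <freed_gb> <elapsed_seconds> <os>
  local freed_gb="$1" elapsed="$2" os="$3"
  printf 'clean-runner-disk: freed %s GB in %dm %ds on %s\n' \
    "$freed_gb" $((elapsed / 60)) $((elapsed % 60)) "$os"
}

# Append to the job summary under GitHub Actions; print to stdout otherwise.
emit_summary() {
  local line
  line="$(format_summary "$@")"
  if [ -n "${GITHUB_STEP_SUMMARY:-}" ]; then
    printf '%s\n' "$line" >> "$GITHUB_STEP_SUMMARY"
  else
    printf '%s\n' "$line"
  fi
}
```

To satisfy the second scenario, `emit_summary` would need to run before the assertion's failing exit, or in a later step gated with `if: always()`.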
@@ -0,0 +1,31 @@
## 1. Build the unified composite action
- [x] 1.1 Create `.github/actions/clean-runner-disk/action.yml` with `name`, `description`, and `runs.using: composite`.
- [x] 1.2 Add an early step that rejects unsupported runner OSes (`if: runner.os != 'Linux' && runner.os != 'Windows'`) with `core.setFailed` naming the actual OS — satisfies spec scenario "Unsupported runner OS is rejected loudly".
- [x] 1.3 Port the existing Linux cleanup script into the action, each step gated `if: runner.os == 'Linux'`, `shell: bash`. Preserve the full removal list from `.github/actions/clean-runner-disk/action.yml@HEAD` (no behavior regression) — satisfies spec requirement "Linux cleanup retains its current ~3-minute budget".
- [x] 1.4 Add a Windows cleanup step gated `if: runner.os == 'Windows'`, `shell: pwsh`, that creates an empty source directory once and iterates the target paths invoking `robocopy <empty> <target> /MIR /MT:128 /R:1 /W:1 /NFL /NDL /NJH /NJS`, treating `$LASTEXITCODE -lt 8` as success, then `Remove-Item -LiteralPath <target> -Force` on the now-empty directory. For any target where robocopy returns `$LASTEXITCODE -ge 8`, fall back to `Remove-Item -LiteralPath <target> -Recurse -Force -ErrorAction Continue`. Reuse the path list from `.github/actions/clean-runner-disk-windows/action.yml@HEAD`.
- [x] 1.5 Add pre-clean and post-clean disk-usage logging steps for both OSes (preserve current `df -h` / `Get-PSDrive` output).
- [x] 1.6 Add the post-clean free-space assertion: Linux ≥ 20 GB free on `/`, Windows ≥ 40 GB free on `C:`. On failure, `core.setFailed` with a message naming actual free, threshold, and top-5 remaining dirs by size — satisfies spec requirement "Action asserts minimum post-clean free space".
- [x] 1.7 Add the job-summary line emission: append `clean-runner-disk: freed X GB in Ym Zs on <os>` to `$GITHUB_STEP_SUMMARY` (both branches), emitted even when the assertion fails — satisfies spec requirement "Action emits a one-line job summary".
## 2. Switch workflows to the unified action
- [x] 2.1 Update `.github/workflows/ci.yml:50-51` to reference the unified action (path unchanged; behavior unchanged on Linux). _No-op: reference already `./.github/actions/clean-runner-disk`._
- [x] 2.2 Update `.github/workflows/build.yml:41-42` to reference the unified action. _No-op: reference already `./.github/actions/clean-runner-disk`._
- [x] 2.3 Update `.github/workflows/windows.yml:23-24` to reference `./.github/actions/clean-runner-disk` (was `clean-runner-disk-windows`).
## 3. Remove the old Windows-only action
- [x] 3.1 Delete `.github/actions/clean-runner-disk-windows/action.yml` and its parent directory.
- [x] 3.2 Grep the repo for any other references to `clean-runner-disk-windows` (docs, READMEs, scripts). Update or remove them. _No other references outside OpenSpec artifacts describing this change._
## 4. Verify on a real PR before merge
- [x] 4.1 Push the change as a draft PR. Confirm `windows.yml/test_windows` completes successfully and the summary line appears on the run page. _Run 25763257213 (post-fix): success. Cleanup 301 s (5.0 min) vs 640 s baseline = -53%. Total job 29.5 min vs 35.8 min baseline. First PR run 25760891151 measured 1036 s cleanup (regression) and motivated the switch from robocopy to ForEach-Object -Parallel._
- [x] 4.2 Confirm `build.yml/test_image` and `ci.yml/test_image` complete successfully and the summary line appears. _Run 25763257210: cleanup 368 s (success), image built + tested. Job failed at "Scan with Docker Scout" with FORBIDDEN team-auth error — unrelated to this PR. ci.yml only runs on push to main; verification deferred to post-merge._
- [x] 4.3 Record the duration of each cleanup step from the PR run in the PR description as the pre-merge baseline for post-merge comparison. _Recorded in PR #448 comments (https://github.com/gmeligio/flutter-docker-image/pull/448#issuecomment-4435176986 has the final numbers)._
## 5. Post-merge closure check
- [ ] 5.1 After 10 post-merge runs of `windows.yml`, run `gh run list --workflow=windows.yml --limit 20 --status completed --json databaseId,createdAt,updatedAt | jq` and confirm the median total job duration is ≤ 19 minutes (down from ~25). If not, revisit design Open Question 1 (now resolved-pending-evidence) with a real `workflow_dispatch` measurement.
- [ ] 5.2 After 10 post-merge runs of `ci.yml`, confirm Linux job duration has not regressed by more than 30 seconds versus the pre-change median (~10.8 min). If it has, investigate immediately — Linux regression is a non-goal of this change.
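
The closure checks above compare medians, which the raw `gh run list` JSON does not provide directly. A minimal sketch of the median step, assuming durations have already been reduced to seconds, one per line; the `jq` pipeline in the comment is an assumption about how to derive them and measures whole-run time (`updatedAt` minus `createdAt`), not per-job time.

```shell
#!/usr/bin/env bash
# Median of durations in seconds (one per line on stdin).
median_seconds() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Assumed pipeline (needs gh and jq):
#   gh run list --workflow=windows.yml --limit 20 --status completed \
#     --json createdAt,updatedAt |
#     jq -r '.[] | (.updatedAt | fromdate) - (.createdAt | fromdate)' |
#     median_seconds
```

The 19-minute target in 5.1 corresponds to a median of at most 1140 seconds.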