feat(clean-runner-disk): drop apt-get remove on Linux, rely on rm -rf + freed-bytes assertion (#464)

- Replaces ~15 `apt-get remove` calls in the Linux side of
`.github/actions/clean-runner-disk/action.yml` with direct `rm -rf` of
the package install dirs. `apt-get autoremove -y` + `apt-get clean`
remain as a trailing pair.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eligio Mariño
2026-05-24 14:17:18 +02:00
committed by GitHub
parent 7672004a29
commit 6a1d62ec22
7 changed files with 86 additions and 69 deletions
+27 -38
@@ -37,46 +37,35 @@ runs:
 if: runner.os == 'Linux'
 shell: bash
 run: |
-# Remove PHP
-sudo apt-get remove -y 'php.*' --fix-missing || echo "::warning::apt-get remove php.* failed"
-# Remove databases
-sudo apt-get remove -y '^mongodb-.*' --fix-missing || echo "::warning::apt-get remove mongodb failed"
-sudo apt-get remove -y '^mysql-.*' --fix-missing || echo "::warning::apt-get remove mysql failed"
-# Remove apt packages
+# Toolchains (JDK, .NET + aspnetcore, Swift, LLVM, Haskell, Julia, Android, Rust)
+sudo rm -rf /usr/lib/jvm
+sudo rm -rf /usr/share/dotnet
+sudo rm -rf /usr/share/swift
+sudo rm -rf /usr/lib/llvm-16 /usr/lib/llvm-17 /usr/lib/llvm-18
+sudo rm -rf /usr/local/.ghcup
+sudo rm -rf /usr/local/julia*
+sudo rm -rf /usr/local/lib/android
+sudo rm -rf /home/runner/.rustup /etc/skel/.rustup
+# Browsers (Chrome under /opt/google, Firefox under /usr/lib/firefox)
+sudo rm -rf /usr/local/share/chromium
+sudo rm -rf /usr/lib/firefox /usr/bin/firefox*
+sudo rm -rf /usr/bin/google-chrome*
+# Cloud tools (Chrome and PowerShell live under these top-level dirs)
+sudo rm -rf /opt/microsoft /opt/google
+sudo rm -rf /opt/az /usr/share/az_*
+sudo rm -rf /usr/lib/google-cloud-sdk /opt/google-cloud-sdk /usr/bin/gcloud*
+# Databases (MySQL; MongoDB is not installed on ubuntu-24.04)
+sudo rm -rf /var/lib/mysql /etc/mysql /usr/sbin/mysqld
+# Other apt-managed packages now removed by rm -rf only
+sudo rm -rf /usr/lib/php /etc/php
+sudo rm -rf /usr/lib/mono /etc/mono
+sudo rm -rf /usr/lib/x86_64-linux-gnu/dri
+# Toolcaches and Kubernetes
+sudo rm -rf /opt/hostedtoolcache
+sudo rm -rf /usr/local/bin/minikube
+# Trailing apt cleanup: handles dangling deps and clears /var/cache/apt
 sudo apt-get autoremove -y || echo "::warning::apt-get autoremove failed"
 sudo apt-get clean || echo "::warning::apt-get clean failed"
-# Remove Java (JDKs)
-sudo rm -rf /usr/lib/jvm
-# Remove .NET SDKs
-sudo rm -rf /usr/share/dotnet
-sudo apt-get remove -y '^aspnetcore-.*' || echo "::warning::apt-get remove aspnetcore failed"
-sudo apt-get remove -y '^dotnet-.*' --fix-missing || echo "::warning::apt-get remove dotnet failed"
-# Remove Swift toolchain
-sudo apt-get remove -y '^llvm-.*' --fix-missing || echo "::warning::apt-get remove llvm failed"
-sudo rm -rf /usr/share/swift
-# Remove Haskell (GHC)
-sudo rm -rf /usr/local/.ghcup
-# Remove Julia
-sudo rm -rf /usr/local/julia*
-# Remove Android SDKs
-sudo rm -rf /usr/local/lib/android
-# Remove browsers
-sudo apt-get remove -y google-chrome-stable firefox --fix-missing || echo "::warning::apt-get remove chrome/firefox failed"
-sudo rm -rf /usr/local/share/chromium /opt/microsoft /opt/google
-# Remove cloud tools
-sudo rm -rf /opt/az
-sudo apt-get remove -y azure-cli mono-devel libgl1-mesa-dri --fix-missing || echo "::warning::apt-get remove azure-cli/mono failed"
-sudo apt-get remove -y google-cloud-sdk --fix-missing || echo "::debug::apt-get remove google-cloud-sdk failed"
-sudo apt-get remove -y google-cloud-cli --fix-missing || echo "::debug::apt-get remove google-cloud-cli failed"
-# Remove PowerShell
-sudo apt-get remove -y powershell --fix-missing || echo "::warning::apt-get remove powershell failed"
-sudo rm -rf /usr/local/share/powershell
-# Remove CodeQL and other toolcaches
-sudo rm -rf /opt/hostedtoolcache
-# Remove Kubernetes
-sudo rm -rf /usr/local/bin/minikube
-# Remove Rust
-sudo rm -rf /home/runner/.rustup /etc/skel/.rustup
 - name: '[Linux] Show disk usage after cleaning'
 if: runner.os == 'Linux'
@@ -11,12 +11,18 @@ This change replaces the `apt-get remove` calls with direct `rm -rf` of the same
 - In `.github/actions/clean-runner-disk/action.yml` Linux path:
 - **Remove**: all `apt-get remove -y '...'` lines for browsers, .NET, Swift/LLVM, Azure CLI, Google Cloud, PowerShell, mono.
 - **Keep**: `apt-get autoremove -y` + `apt-get clean` once at the end (cleans up dependency cruft from packages already removed and clears `/var/cache/apt`).
-- **Add**: explicit `rm -rf` of the package install dirs that `apt-get remove` previously handled:
-- `/usr/lib/google-cloud-sdk`, `/opt/google-cloud-sdk`, `/usr/bin/gcloud*`
-- `/opt/microsoft/powershell`, `/usr/local/share/powershell`
-- `/opt/microsoft`, `/opt/google` (already present)
-- `/usr/lib/google`, `/usr/bin/google-chrome*`, `/usr/bin/firefox*`
-- `/opt/az`, `/usr/share/az_*`
+- **Add**: explicit `rm -rf` of the package install dirs that `apt-get remove` previously handled. Paths cross-referenced against the `actions/runner-images` `install-*.sh` scripts for ubuntu-24.04:
+- Google Cloud SDK/CLI: `/usr/lib/google-cloud-sdk`, `/opt/google-cloud-sdk`, `/usr/bin/gcloud*`
+- PowerShell: `/opt/microsoft/powershell` (already covered by existing `/opt/microsoft` rm; do **not** add `/usr/local/share/powershell` — it does not exist on ubuntu-24.04)
+- `/opt/microsoft`, `/opt/google` (already present — covers PowerShell and Chrome respectively)
+- Chrome: `/usr/bin/google-chrome*` (symlink only; `/opt/google/chrome` is under existing `/opt/google`)
+- Firefox: `/usr/lib/firefox` (Mozilla PPA install dir — proposal's `/usr/bin/firefox*` is only the symlink), `/usr/bin/firefox*`
+- Azure CLI: `/opt/az`, `/usr/share/az_*`
+- LLVM: `/usr/lib/llvm-16`, `/usr/lib/llvm-17`, `/usr/lib/llvm-18` (the apt removal of `^llvm-.*` handled these; `/usr/share/swift` stays in the existing rm block)
+- Mono: `/usr/lib/mono`, `/etc/mono`
+- PHP: `/usr/lib/php`, `/etc/php`
+- MySQL: `/var/lib/mysql`, `/etc/mysql`, `/usr/sbin/mysqld`
+- libgl1-mesa-dri (~150 MB): `/usr/lib/x86_64-linux-gnu/dri`
 - **Keep**: the existing `rm -rf` block for `/usr/lib/jvm`, `/usr/share/dotnet`, `/usr/share/swift`, `/usr/local/.ghcup`, `/usr/local/julia*`, `/usr/local/lib/android`, `/usr/local/share/chromium`, `/opt/microsoft`, `/opt/google`, `/opt/hostedtoolcache`, `/usr/local/bin/minikube`, `/home/runner/.rustup`, `/etc/skel/.rustup`.
 - Keep the post-clean disk-free assertion (`≥ 20 GB on /`) — this is the actual safety contract; the spec already requires it.
 - Keep the job-summary line.
@@ -37,7 +43,7 @@ _None._
 - **Affected files**: `.github/actions/clean-runner-disk/action.yml` (Linux step only). No workflow YAML changes.
 - **Behavioral change**: `clean-runner-disk` Linux wall-clock drops from ~3m06s to ≤ 2m. Saves ~1-2 min on every PR build (`build.yml/test_image` or, after p3 lands, `build.yml/build_image`) and every `ci.yml` push run.
-- **Risk**: a package's files might live in a path not covered by the `rm -rf` list, in which case the disk gain regresses. Mitigation: the post-clean assertion (`≥ 20 GB free`) catches this immediately on the first run, before merge.
+- **Risk**: a package's files might live in a path not covered by the `rm -rf` list, in which case the disk gain regresses. Mitigation: the `rm -rf` set was cross-referenced against the `actions/runner-images` install scripts on ubuntu-24.04 (`install-firefox.sh`, `install-google-chrome.sh`, `install-powershell.sh`, `install-google-cloud-cli.sh`, etc.) and `Ubuntu2404-Readme.md`; the post-clean assertion (`≥ 20 GB free`) catches anything missed immediately on the first run, before merge.
 - **Risk**: a future package install that depends on apt's view of dpkg state could be broken by the removal of `apt-get remove`. Mitigation: the runner image is freshly provisioned each job, and the only `apt-get install` users in the affected workflows are `clean-runner-disk` itself (no others). The `apt-get autoremove` we keep handles dangling deps.
 - **Depends on**: `p4-unify-runner-disk-cleanup` archived. Per the user's assumption, this has happened.
 - **Out of scope**: Windows cleanup (separate concern), self-hosted runners.
@@ -1,3 +1,8 @@
+## RENAMED Requirements
+- FROM: `### Requirement: Linux cleanup retains its current ~3-minute budget`
+- TO: `### Requirement: Linux cleanup completes within a 2-minute wall-clock budget`
 ## MODIFIED Requirements
 ### Requirement: Linux cleanup completes within a 2-minute wall-clock budget
@@ -1,20 +1,20 @@
 ## 1. Trim the Linux cleanup script
-- [ ] 1.1 In `.github/actions/clean-runner-disk/action.yml`, in the `[Linux] Clean runner disk` step, remove every `apt-get remove ...` line for browsers, .NET, aspnetcore, Swift/LLVM, Azure CLI, mono, Google Cloud SDK / CLI, PowerShell.
-- [ ] 1.2 Keep `apt-get autoremove -y` and `apt-get clean` as a single trailing pair.
-- [ ] 1.3 Add `rm -rf` lines covering the package install dirs that the removed `apt-get` calls used to handle (paths listed in proposal "What Changes").
-- [ ] 1.4 Run the step manually inside a `workflow_dispatch` PR and time it. Confirm ≤ 2 minutes wall-clock — satisfies the modified spec requirement.
+- [x] 1.1 In `.github/actions/clean-runner-disk/action.yml`, in the `[Linux] Clean runner disk` step, remove every `apt-get remove ...` line for browsers, .NET, aspnetcore, Swift/LLVM, Azure CLI, mono, Google Cloud SDK / CLI, PowerShell.
+- [x] 1.2 Keep `apt-get autoremove -y` and `apt-get clean` as a single trailing pair.
+- [x] 1.3 Add `rm -rf` lines covering the package install dirs that the removed `apt-get` calls used to handle (paths listed in proposal "What Changes").
+- [x] 1.4 Run the step manually inside a `workflow_dispatch` PR and time it. Confirm ≤ 2 minutes wall-clock — satisfies the modified spec requirement. _PR #464 build_image (job 77594156664): composite step 11:38:51→11:40:47 = **1m56s (116 s)**. Under budget._
 ## 2. Update the capability spec
-- [ ] 2.1 Verify `p4-unify-runner-disk-cleanup` is archived (its specs are now under `openspec/specs/ci-runner-disk-cleanup/spec.md`). If not, this change SHALL block until it is.
-- [ ] 2.2 The MODIFIED Requirements in this change's spec delta replace the existing "Linux cleanup retains its current ~3-minute budget" requirement with a 2-minute budget and a freed-bytes-not-tactics contract.
+- [x] 2.1 Verify `p4-unify-runner-disk-cleanup` is archived (its specs are now under `openspec/specs/ci-runner-disk-cleanup/spec.md`). _Confirmed: p4 archived at commit 7672004; `openspec/specs/ci-runner-disk-cleanup/spec.md` exists._
+- [x] 2.2 The MODIFIED Requirements in this change's spec delta replace the existing "Linux cleanup retains its current ~3-minute budget" requirement with a 2-minute budget and a freed-bytes-not-tactics contract.
 ## 3. Verify on a real PR before merge
-- [ ] 3.1 Push as a draft PR. Confirm the post-clean assertion (`≥ 20 GB free on /`) still passes — this is the regression alarm.
-- [ ] 3.2 Confirm the new wall-clock is ≤ 2 minutes at the median across 3 consecutive runs.
-- [ ] 3.3 Compare the freed-bytes number from the job-summary line against the pre-change baseline (~30-40 GB). Confirm it has not regressed by more than 2 GB. If it has, an `rm -rf` path is missing; add it.
+- [x] 3.1 Push as a draft PR. Confirm the post-clean assertion (`≥ 20 GB free on /`) still passes — this is the regression alarm. _PR #464 build_image: `clean-runner-disk: 127.00 GB free on / (threshold 20 GB) — OK`._
+- [ ] 3.2 Confirm the new wall-clock is ≤ 2 minutes at the median across 3 consecutive runs. _Run 1/3 done at 1m56s; needs two more re-pushes or defer to post-merge p95 check (4.1)._
+- [x] 3.3 Compare the freed-bytes number from the job-summary line against the pre-change baseline (~30-40 GB). Confirm it has not regressed by more than 2 GB. _PR #464: before 89.4 GB free → after 127 GB free = **~38 GB freed**, within the 30-40 GB baseline. No regression._
 ## 4. Post-merge closure check
+31 -15
@@ -3,9 +3,7 @@
 ## Purpose
 Define what `.github/actions/clean-runner-disk` SHALL achieve on the GitHub-hosted runners used by this repo (`ubuntu-24.04` and `windows-2025`) — minimum disk freed, maximum wall-clock spent, observability of the result, and the contract that both runner OSes are invoked via the same action reference from any workflow that needs cleanup.
 ## Requirements
 ### Requirement: Single action reference works on both supported runner OSes
 `.github/actions/clean-runner-disk` SHALL be a single composite action invocable from any workflow job running on either `ubuntu-24.04` or `windows-2025` with the same `uses: ./.github/actions/clean-runner-disk` reference. The action SHALL dispatch its cleanup logic by `runner.os` internally; workflow YAML SHALL NOT branch on OS to choose between action paths.
@@ -54,19 +52,6 @@ The experience context is the maintainer watching the PR check page — the Wind
 - **AND** the action emits `::warning::<path> still present after removal` so the surviving directory is named in the log
 - **AND** the post-clean free-space assertion (see "Action asserts minimum post-clean free space") catches any case where enough survived to threaten the downstream `docker build`
-### Requirement: Linux cleanup retains its current ~3-minute budget
-The Linux cleanup path SHALL complete in ≤ 4 minutes wall-clock at the 95th percentile across the rolling 30-day window of `ci.yml` and `build.yml` runs. The behavior of unifying the action SHALL NOT regress Linux performance relative to the pre-change baseline.
-The experience context is the same maintainer comparing today's CI duration to yesterday's after the action is unified — Linux numbers must not move in the wrong direction as a side-effect of the Windows work.
-#### Scenario: Linux cleanup runs under the unified action
-- **GIVEN** an `ubuntu-24.04` runner with the standard pre-installed toolchains (JVM, .NET, Swift/LLVM, Haskell GHC, Julia, Android SDK, Chrome, Firefox, Azure CLI, PowerShell, hostedtoolcache, Rust, etc.)
-- **WHEN** the cleanup action runs
-- **THEN** the action completes in ≤ 4 minutes
-- **AND** the set of removed paths is at least the same as the pre-change `.github/actions/clean-runner-disk/action.yml` removed
 ### Requirement: Action asserts minimum post-clean free space and fails loudly on regression
 After cleanup, the action SHALL check free space on the build drive (`/` on Linux, `C:` on Windows) and SHALL fail the step with a message naming the actual free space and the threshold when free space is below 20 GB on Linux or 40 GB on Windows.
@@ -105,3 +90,34 @@ The experience context is the maintainer scanning the PR check page for slow ste
 - **GIVEN** an invocation where the post-clean assertion fails (free space below threshold)
 - **WHEN** the run completes (with the step marked failed)
 - **THEN** the summary still contains the line so a maintainer can compare the freed-bytes number against historical values
+### Requirement: Linux cleanup completes within a 2-minute wall-clock budget
+The Linux cleanup path SHALL complete in ≤ 2 minutes wall-clock at the 95th percentile across the rolling 30-day window of `ci.yml` and `build.yml` runs. The implementation SHALL favor direct `rm -rf` of large directories over `apt-get remove`, which is slow due to dpkg-lock contention and maintainer-script execution per package set. `apt-get autoremove` and `apt-get clean` MAY be retained as a trailing pair to handle dangling dependencies and clear `/var/cache/apt`.
+The contract this requirement defends is "freed bytes" (measured by the existing post-clean assertion), NOT "set of removed paths". An implementation that frees ≥ 20 GB on `/` via any tactic SHALL satisfy this requirement, even if it removes fewer packages than a prior implementation.
+The experience context is the maintainer measuring CI wall-clock — the previous 3-minute budget assumed `apt-get` was necessary; profiling showed it was the dominant cost without a corresponding safety benefit, since the post-clean assertion is the real safety net.
+#### Scenario: Linux cleanup runs within budget on a standard runner
+- **GIVEN** an `ubuntu-24.04` runner with the standard pre-installed toolchains
+- **WHEN** the cleanup action runs
+- **THEN** the action completes in ≤ 2 minutes at the median across 5 runs
+- **AND** the post-clean assertion (`≥ 20 GB free on /`) passes
+#### Scenario: Cleanup tactic may differ as long as freed-bytes contract holds
+- **GIVEN** an implementation that uses only `rm -rf` (no `apt-get remove`)
+- **WHEN** the cleanup action runs
+- **THEN** the post-clean assertion passes (`≥ 20 GB free on /`)
+- **AND** the requirement is satisfied even though apt's package metadata still references files that no longer exist on disk
+- **AND** no downstream step in `build.yml` or `ci.yml` queries apt-database consistency for those packages
+#### Scenario: Cleanup tactic regression is detected by the assertion, not by path inventory
+- **GIVEN** a future edit removes an `rm -rf` line, leaving < 20 GB free
+- **WHEN** the cleanup action runs
+- **THEN** the post-clean assertion fails the step with the actual free-space number
+- **AND** the regression is caught at the cleanup step rather than at a downstream `docker build` "no space left on device"
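
The spec treats the post-clean free-space assertion as the real safety contract, but the assertion step itself is not part of this diff. The following is a minimal sketch of what such a check could look like in bash, not the action's actual implementation. The function names `free_gb` and `assert_free_gb` and the exact message wording are assumptions made for illustration; only the 20 GB threshold on `/` and the use of GitHub Actions workflow commands come from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: function names and message format are assumptions,
# not taken from the real clean-runner-disk action.

# Print the free space on a mount point in whole GB.
free_gb() {
  local mount="${1:-/}"
  # -P forces one line per filesystem; -k reports 1024-byte blocks.
  # Column 4 of the second line is the "Available" figure.
  df -Pk "$mount" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Compare a measured free-space figure (whole GB) against a threshold.
# Prints an OK line and returns 0 when the threshold is met; otherwise
# prints a GitHub Actions error command and returns 1 so the step fails.
assert_free_gb() {
  local actual="$1" threshold="$2"
  if [ "$actual" -ge "$threshold" ]; then
    echo "clean-runner-disk: ${actual} GB free on / (threshold ${threshold} GB) - OK"
    return 0
  fi
  echo "::error::clean-runner-disk: only ${actual} GB free on / (threshold ${threshold} GB)"
  return 1
}
```

In a workflow step this would be invoked as `assert_free_gb "$(free_gb /)" 20`. Keeping the comparison separate from the `df` call means the threshold logic can be exercised with fixed numbers, independent of the disk state of whatever machine runs it.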