- Splits the monolithic `test_image` job (~20m serial) into `build_image` + parallel `test_image` (CST) and `scan_image` (Scout) consumers — wall-clock drops to ~15m (~5m saved per PR). --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: verified-commit[bot] <180343340+verified-commit[bot]@users.noreply.github.com>
5.9 KiB
Context
p2 establishes that the build job produces either a registry tag (image_ref) or an artifact (image_artifact). This change splits the existing serial test_image job into three jobs: build_image, test_image, scan_image. The latter two consume the handoff in parallel.
The non-trivial questions are: (a) how do consumer jobs get the bits onto their runner with minimum overhead, (b) what's the failure model when one consumer fails, and (c) how do we manage branch-protection migration without locking out merges during the transition.
Goals / Non-Goals
Goals:
test_imageandscan_imagerun as siblings, not in series.- The consumer jobs are thin: no
clean-runner-diskneeded unless the artifact path is in use (the pull does not write the full image to disk twice). - Branch-protection migration is documented and ordered so
mainis never unprotected and a stale required check name doesn't block merges indefinitely.
Non-Goals:
- Matrix-ing CST configs (the project has one
test/android.yml; if it ever grows, that's a separate change). - Caching the Scout vulnerability DB (no public knob for this per the action README).
- Self-hosted consumers.
Decisions
D1. Consumer jobs reach the image with the cheapest call each tool supports
Decision: The two consumers use different strategies because the two actions support different things:
docker/scout-action: passimage: registry://ghcr.io/<owner>/flutter-android:pr-N. Theregistry://prefix tells Scout to read from the registry directly with no daemon involvement — confirmed in the action's README image-prefix table. Nodocker pullstep on the registry path forscan_image.container-structure-test(viaplexsystems/container-structure-test-action): the action invokescontainer-structure-test test --image <input>with no--pullflag and no driver override. The CLI's defaultdockerdriver inspects the local Docker daemon only. Sotest_imageSHALL run an explicitdocker pull "$IMAGE_REF"on the registry path before invoking the action. The earlier draft of this design assumed CST would stream-pull on demand; it does not.
Alternatives considered:
- Replace the plexsystems action with raw
container-structure-test test --pull --image <ref>. One step instead of two, and drops a stale dependency (plexsystems' last release was Mar 2023; last commit Aug 2023). Rejected for now: requires a new install step (curl + chmod-and-pin), and the win is ~30 s. Worth revisiting if the action ever blocks an upgrade. - Swap to CST's
--driver tar(no daemon at all, e.g.crane export <ref> | container-structure-test --driver tar -). Smallest disk footprint, but the tar driver has historical limitations aroundcommandTeststhat rely on real process execution. Out of scope for p3. docker save-style artifact even for non-fork PRs. Rejected — wastes the registry-cache work from p1.
Rationale: Match the cheapest path to each action's actual behavior, verified against current upstream sources rather than assumed.
D2. Fork-PR consumers download-artifact + docker load, gated on image_artifact != ''
Decision: Both consumer jobs check needs.build_image.outputs.image_artifact. When non-empty, run a setup block: download-artifact → gunzip image.tar.gz → docker load. Then use the loaded image's tag (read from metadata-action output passed through build_image.outputs.image_local_tag) for the CST/Scout call.
Alternatives considered:
- Skip Scout entirely for fork PRs. Already the status quo (
build.yml:113). Keep it for the scan job; the test job has no such restriction. - Pull from a public mirror. No public mirror exists for a not-yet-merged fork PR's image.
Rationale: The ~2 min artifact handoff cost is paid only on fork PRs and is bounded.
D3. Branch protection migration
Decision: The new consumer job is named test_image — the same key as today's monolithic job. The check name test_image therefore continues to appear and continues to be required by branch-protection. The new build_image and scan_image checks are added but are NOT yet required. A follow-up admin action adds them to required status checks once the new layout has shown a few green runs.
Failure semantics during the transition:
- If
build_imagefails,test_imageis skipped. GitHub treats a skipped required check as not-reported → PR cannot merge. Equivalent safety to today. - If
scan_imagefails buttest_imagepasses, the PR can merge. This is a regression in safety relative to today (Scout currently blocks). The gap is bounded to the window between merging this change and the admin addingscan_imageto required checks. Acceptable transient.
Alternatives considered:
- Aggregator job that
needs: [build_image, scan_image]and exits 0 under a name likeimage-checks. Equivalent end-state, more YAML. Rejected because the name-preservation strategy already keepsmainprotected without an extra job. - Block this PR until protection is updated in the same merge window. Rejected — couples a workflow-change PR to admin availability.
Rationale: Preserving the existing check name by renaming the consumer is the smallest possible migration. The bounded scan-gap is the price; it is documented so the admin treats it as a follow-up rather than discovering it via an exploit.
Risks / Trade-offs
- R1: A consumer-job pull failure (transient GHCR error) fails one check while the other passes. Re-run-failed re-runs that single job, but the re-run consumes the same
pr-Ntag, which is safe. - R2: If p2 ever pushes an inconsistent
pr-N(build succeeded but push partially failed), both consumers will fail in the same way. p2's task list requires the push to be a hard failure of the build job, which prevents this. - R3: Fork-PR
docker loadmaterializes 5 GB on the consumer runner. Runclean-runner-diskfirst on the fork path.