Failures when concurrent lazy package installs race #2964
Unanswered
plobsing asked this question in Bug Report
-
Thank you for your report.
Yes, this is a known issue in aqua. The workaround is to avoid installing packages in parallel via lazy install.
To resolve this issue completely, we would need to introduce a lock mechanism or something similar, but I didn't want to do that because it would make aqua more complicated and could cause trouble related to lock files.
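For readers wondering what such a lock mechanism might look like, here is a minimal sketch of a per-package advisory lock. It is not aqua's code: `package_lock`, `install_once`, and the `.installed` marker file are hypothetical names chosen for illustration, and it assumes a Unix-like system where Python's `fcntl.flock` is available.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def package_lock(pkg_dir):
    """Hold an exclusive advisory lock for one package directory (hypothetical helper)."""
    lock_path = pkg_dir.rstrip(os.sep) + ".lock"
    os.makedirs(os.path.dirname(lock_path) or ".", exist_ok=True)
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def install_once(pkg_dir, extract):
    """Run extract(pkg_dir) only if no earlier lock holder already completed it."""
    marker = os.path.join(pkg_dir, ".installed")  # hypothetical completion marker
    with package_lock(pkg_dir):
        if os.path.exists(marker):
            return False  # another run finished while we waited for the lock
        os.makedirs(pkg_dir, exist_ok=True)
        extract(pkg_dir)
        open(marker, "w").close()  # written last, so its presence implies completeness
        return True


# Demo: the second call finds the marker and does not extract again.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "pkgs", "tool")
calls = []
first = install_once(pkg, calls.append)
second = install_once(pkg, calls.append)
print(first, second, len(calls))
```

Because the kernel drops an `flock` lock when the holding process exits, a lock file left behind by a crashed run does not block later runs. That may ease part of the lock-file concern, though it does not remove the added complexity.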
-
aqua info
Overview
Seems like concurrent archive extractions get in one another's way and get confused.
Sometimes Aqua sees a concurrent run's output directory, so it skips the download, but then fails validation because that concurrent run has only extracted part of the archive so far: some of the files we expect to find are still unextracted and are reported as missing.
Other times, it seems that laggards try to perform the download-and-extract themselves, but they fail when they attempt to overwrite the executable file that was extracted concurrently and is already in use by some other instance.
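The first failure mode can be shown in miniature without Aqua. The sketch below is an illustration only: `lazy_install`, `missing_files`, and the file names are hypothetical and are not Aqua's code. It assumes the behaviour inferred above, namely that an existing package directory is treated as "already extracted", and shows why validation then fails while another run is only partway through extraction.

```python
import os
import tempfile

EXPECTED_FILES = ["bin/tool", "LICENSE"]  # hypothetical package contents


def missing_files(pkg_dir):
    """Return the expected files that are not yet present in pkg_dir."""
    return [f for f in EXPECTED_FILES
            if not os.path.exists(os.path.join(pkg_dir, f))]


def lazy_install(pkg_dir):
    """Naive check-then-use: skip extraction whenever the directory exists."""
    if not os.path.isdir(pkg_dir):
        raise NotImplementedError("download and extract would happen here")
    return missing_files(pkg_dir)  # validation step


with tempfile.TemporaryDirectory() as root:
    pkg_dir = os.path.join(root, "pkg")
    # Simulate a concurrent run that has extracted only the first file so far.
    os.makedirs(os.path.join(pkg_dir, "bin"))
    open(os.path.join(pkg_dir, "bin", "tool"), "w").close()

    result = lazy_install(pkg_dir)
    print("missing:", result)
```

The directory exists, so the naive check skips extraction, yet the validation step still reports a file that the other run has not written yet.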
How to reproduce
I've only seen this on CI, only sporadically, and in scripts that are a bit too big to share; so far, I have been unable to isolate a reliable, minimal, local repro.
We've seen what seem to be concurrent-extraction-related failures in a couple of circumstances. We see it when running pre-commit, where it gets tripped up extracting shellcheck or terraform (from package tenv). We also see it in a custom push script when it does the equivalent of xargs flux push artifact to push many configs to ECR at once, where it gets tripped up extracting amazon-ecr-credential-helper.
aqua.yaml
Other related code such as local Registry
.pre-commit-config.yaml (for pre-commit
Executed command and output
(no direct analog for flux pusher, apologies)
Debug output
Output of a failed terraform_fmt run is here: https://gist.github.com/plobsing/1aa39f24d539cb533962ad33b6f7930c
Aqua logs are a bit interleaved with the rest of the output, but relevant segments indicate a failure to validate the extracted archive, which is reported in a couple of places. For example:
But in other cases, we also see execution passing to binaries from the tenv package, indicating that at least some of our concurrent invocations are successful (at least at the Aqua layer):
Output of a failed run of my concurrent flux pusher is here: https://gist.github.com/plobsing/8ad5373e498104b7c593bebff855fbac
The relevant bit seems to be
which I interpret as the extraction step of the archive trying to open docker-credential-ecr-login for writing, but being rejected with ETXTBSY (because there is already a file there and it is being used as the executable for some other process).
Expected behaviour
I would expect concurrent invocations of Aqua-managed tools not to interfere with one another, even if lazy installation needs to be performed. It'd also be nice to avoid duplicate work, but that's much less important.
I think there are a few ways that could be achieved. Maybe a lockfile at either aqua root-dir or individual-package granularity. Or some other scheme involving advisory file locking. Or download/extraction to a temporary location followed by an atomic filesystem operation to move the files to the intended destination. Or maybe even just using retries in more places.
Actual behaviour
Sometimes archive-extraction interferes between concurrent runs. See above for details.
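To make the "temporary location followed by an atomic move" option from the expected-behaviour section more concrete, here is a rough sketch. It is not how Aqua works today: `install_atomically`, the `.staging-` prefix, and the demo archive are hypothetical, and the sketch assumes a POSIX filesystem where renaming a directory within one filesystem is atomic and fails if the destination is a non-empty directory.

```python
import os
import shutil
import tarfile
import tempfile


def install_atomically(archive_path, dest_dir):
    """Extract archive_path so that dest_dir only appears once it is complete."""
    if os.path.isdir(dest_dir):
        return False  # already installed by an earlier or concurrent run
    parent = os.path.dirname(dest_dir) or "."
    os.makedirs(parent, exist_ok=True)
    # Stage next to dest_dir (same filesystem) so the final rename is atomic.
    staging = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    try:
        with tarfile.open(archive_path) as tar:
            tar.extractall(staging)
        try:
            os.rename(staging, dest_dir)
        except OSError:
            # A concurrent run moved its copy into place first; use theirs.
            if not os.path.isdir(dest_dir):
                raise
            return False
        return True
    finally:
        # No-op after a successful rename; cleans up the staging copy otherwise.
        shutil.rmtree(staging, ignore_errors=True)


# Demo with a tiny archive built on the fly.
root = tempfile.mkdtemp()
src = os.path.join(root, "tool")
with open(src, "w") as f:
    f.write("#!/bin/sh\necho hi\n")
archive = os.path.join(root, "tool.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(src, arcname="tool")

dest = os.path.join(root, "pkgs", "tool-v1")
first = install_atomically(archive, dest)
second = install_atomically(archive, dest)
print(first, second, os.listdir(os.path.join(root, "pkgs")))
```

With this shape, other runs either see no destination directory or a complete one, and an executable that is already in place is never reopened for writing. That would plausibly avoid both the missing-file validation errors and the ETXTBSY failures, at the cost of occasional duplicate downloads.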
Note
Seems this is not the first report of this kind of issue. #537 previously reported issues with shellcheck, although invoked through actionlint rather than pre-commit. That report led to #545. However, that change is already in the version I am using, v2.28.1, so it probably isn't a full fix.