Making ensure_stack_deployed more reliable #165
See 4e84664310/abra (L766-L808).

I've seen this not really work a bunch of times. I'm thinking it might be worth dropping into Python for this, in a separately written utility which can be vendored into `~/.abra/vendor` for use by `abra`. Then we can cover all the nitty-gritty edge cases of tracking the deployment state and confirming what actually happened with the deployment. This should be handled by swarmkit, but it hasn't been implemented in years, so it is unlikely to happen any time soon. Not knowing whether the deploy you just ran actually succeeded is a huge issue.

The main edge case I've seen: the new container is deployed successfully, but it immediately goes into a flip-flop of up/down/up/down status, with obvious error messages in the logs.
Sometimes a secret hasn't been configured, or there's some other config issue, and the container doesn't even start. Then there are no container logs and it's really hard to know what went wrong (in some cases, there are simply no logs at all!).
Seems like we could use the approach from https://stackoverflow.com/a/57403566: first list all the container hashes, then feed them into a Python script which asynchronously watches all events for those containers. If it sees die/kill/fail events, it reports failure. It could also wait a few seconds after the container comes up, to make sure it isn't flapping.
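A rough sketch of what the core of that watcher could look like, assuming we pipe it the output of `docker events --format '{{json .}}'` (each event is one JSON object per line, with real `id`/`Action` fields). The function name `classify_events` and the choice of failure actions are just illustrative, not anything abra ships:

```python
import json

# Docker container event actions that suggest the deploy went bad.
FAILURE_ACTIONS = {"die", "kill", "oom"}

def classify_events(event_lines, watched_ids):
    """Scan lines from `docker events --format '{{json .}}'` and report
    which of the watched containers emitted a failure event.

    event_lines: iterable of JSON strings, one event per line.
    watched_ids: set of container IDs (hashes) we just deployed.
    Returns a dict mapping container id -> list of failure actions seen.
    """
    failures = {}
    for line in event_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        cid = event.get("id", "")
        action = event.get("Action", "")
        if cid in watched_ids and action in FAILURE_ACTIONS:
            failures.setdefault(cid, []).append(action)
    return failures

if __name__ == "__main__":
    sample = [
        '{"Type": "container", "Action": "start", "id": "abc123"}',
        '{"Type": "container", "Action": "die", "id": "abc123"}',
        '{"Type": "container", "Action": "start", "id": "def456"}',
    ]
    print(classify_events(sample, {"abc123", "def456"}))
    # -> {'abc123': ['die']}
```

The "wait a few seconds" part would just mean keeping the event stream open for a grace period after the last `start` event before declaring success.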
Actually, I take it all back. This can be done in Bash 🐊 We can improve the current implementation; we just need to handle the service container ID shuffling and check the state correctly. If you run `docker events` while a deploy is running, you can see the events listing. The event types are also viewable in 5a5494a9a7/api/types.pb.go (L421-L428).
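One way to "check the state correctly", sketched in Python rather than Bash for readability: poll the new container's state a few times after the service converges (e.g. via `docker inspect -f '{{.State.Status}}'` once a second) and treat any transition back out of `running` as the up/down/up/down shuffle. The function name, the sample states, and the `settle_count` threshold are all invented for illustration:

```python
def deploy_settled(states, settle_count=3):
    """Decide from a sequence of polled container states whether the
    deploy has settled.

    states: list of observed states in order, e.g. from repeated
            `docker inspect -f '{{.State.Status}}'` calls.
    A deploy is settled only if the container never went down again
    after coming up, AND the last `settle_count` observations are all
    "running" (i.e. we watched it long enough to rule out flapping).
    """
    seen_running = False
    for state in states:
        if state == "running":
            seen_running = True
        elif seen_running:
            # Came up, then went down again: the flip-flop case.
            return False
    tail = states[-settle_count:]
    return len(tail) == settle_count and all(s == "running" for s in tail)

if __name__ == "__main__":
    print(deploy_settled(["created", "running", "running", "running"]))
    # -> True
    print(deploy_settled(["running", "exited", "running", "running", "running"]))
    # -> False (it flapped)
```

In the Bash implementation this would be a `sleep`/`docker inspect` loop, but the decision logic stays the same; the tricky part the comment mentions, following the container ID as swarm replaces tasks, would happen before this check.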