Making ensure_stack_deployed more reliable #165

Closed
opened 2021-05-17 10:39:40 +00:00 by decentral1se · 3 comments
Owner

See https://git.autonomic.zone/coop-cloud/abra/src/commit/4e84664310c7a8c495c544a2828de54e14825baf/abra#L766-L808. I've seen this not work properly a bunch of times. I am thinking it might be worth dropping into Python for this, as a separately written utility which can be vendored into `~/.abra/vendor` for use by `abra`. Then we can cover all the nitty-gritty edge cases of tracking the deployment state and working out what actually happened with the deployment. This should really be handled by swarmkit, but it hasn't been implemented in years, so it is unlikely to happen any time soon. Not knowing whether a deploy actually succeeded is a huge issue.
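
Roughly the shape of check I mean, sketched against the plain `docker` CLI (illustration only, not the current `ensure_stack_deployed` code; the stack name and timeout are made up):

```bash
#!/usr/bin/env bash
# Hedged sketch only: poll the stack's tasks until they all report "Running"
# or a deadline passes. STACK_NAME and TIMEOUT are made-up values.
STACK_NAME="example_app"
TIMEOUT=120

deadline=$(( $(date +%s) + TIMEOUT ))
while [ "$(date +%s)" -lt "$deadline" ]; do
  # Count tasks that swarm wants running but that do not yet report "Running ...".
  pending=$(docker stack ps "$STACK_NAME" \
    --filter "desired-state=running" \
    --format '{{.CurrentState}}' | grep -cv '^Running' || true)
  if [ "$pending" -eq 0 ]; then
    echo "all tasks for $STACK_NAME report Running"
    exit 0
  fi
  sleep 2
done

echo "timed out waiting for $STACK_NAME to converge" >&2
exit 1
```

A check like this still can't tell the difference between "not up yet" and "up but about to crash", which is exactly the edge case described below.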
decentral1se added this to the Beta release milestone 2021-05-17 10:39:40 +00:00
decentral1se added the enhancement label 2021-05-17 10:39:40 +00:00
Author
Owner

The main edge case I've seen is that the new container deploys successfully but then immediately goes into a flip-flop of up/down/up/down states, with obvious error messages in the logs.

Sometimes you haven't configured a secret, or there is some other config issue, and the container doesn't even start. There are no container logs, and it is really hard to know what went wrong (in some cases there are simply no logs at all!).

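One rough idea for catching the flapping case (again just a sketch, with a made-up stack name and grace period): wait a little after the deploy, then look for tasks that swarm has already had to shut down as failed.

```bash
# Hedged sketch of a flap check; STACK_NAME and GRACE_PERIOD are made up.
STACK_NAME="example_app"
GRACE_PERIOD=20

sleep "$GRACE_PERIOD"

# Tasks that crashed and were replaced show up with desired-state=shutdown and
# a "Failed ..." current state. On a first deploy a healthy stack has none of
# these; on a re-deploy, failed tasks from earlier runs would need filtering too.
failed=$(docker stack ps "$STACK_NAME" \
  --filter "desired-state=shutdown" \
  --format '{{.Name}}: {{.CurrentState}} {{.Error}}' | grep -c 'Failed' || true)

if [ "$failed" -gt 0 ]; then
  echo "$failed task(s) failed within ${GRACE_PERIOD}s of the deploy, looks like flapping" >&2
  exit 1
fi
```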
decentral1se referenced this issue from a commit 2021-05-31 21:23:29 +00:00
Author
Owner

Seems like we could use the approach from https://stackoverflow.com/a/57403566. We could first list all the container hashes and then feed them into a Python script which asynchronously watches the events for those containers; if it sees die/kill/fail events, it reports failure. It could also wait a few seconds once a container comes up, to make sure it isn't flapping.
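
The "list all container hashes" half of that is straightforward from the shell; a sketch (stack name made up, and note `docker ps` only sees containers on the node it runs on):

```bash
# Hedged sketch: collect the container IDs ("hashes") belonging to one stack's
# services, ready to feed to whatever ends up watching the events.
# STACK_NAME is made up; swarm labels each task container with
# com.docker.swarm.service.name=<stack>_<service>.
STACK_NAME="example_app"
container_ids=""
for service in $(docker stack services --format '{{.Name}}' "$STACK_NAME"); do
  container_ids="$container_ids $(docker ps --quiet \
    --filter "label=com.docker.swarm.service.name=${service}")"
done
echo "$container_ids"
```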
Author
Owner

Actually, I take it all back. This can be done in Bash 🐊 We can improve the current implementation; we just need to handle the service container ID shuffling and check the state correctly. If you run `docker events` while a deploy is running, you can see the event listing. Those events are also viewable in https://github.com/docker/swarmkit/blob/5a5494a9a7b408b790533a5e4e1cb43ca1c32aad/api/types.pb.go#L421-L428.
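
A rough shape for the `docker events` side, assuming the container IDs were collected as in the previous comment and a made-up grace period (values of the same `--filter` key are OR'd, different keys are AND'd):

```bash
# Hedged sketch: watch for die/kill events on the stack's containers for a
# grace period. GRACE_PERIOD is made up; $container_ids comes from the listing
# step sketched in the previous comment.
GRACE_PERIOD=30

args=()
for id in $container_ids; do
  args+=(--filter "container=${id}")
done

# `timeout` kills `docker events` once the grace period is over; if grep saw
# no matching event by then, the pipeline fails, i.e. nothing died.
if timeout "$GRACE_PERIOD" docker events "${args[@]}" \
     --filter "event=die" \
     --filter "event=kill" \
     --format '{{.Actor.Attributes.name}}: {{.Action}}' | grep -q .; then
  echo "a container died or was killed within ${GRACE_PERIOD}s of the deploy" >&2
  exit 1
fi

echo "no die/kill events within ${GRACE_PERIOD}s"
```

The `grep -q` makes the pipeline succeed as soon as any matching event shows up, so its exit status doubles as the verdict on the deploy.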
decentral1se self-assigned this 2021-06-08 12:34:16 +00:00