Make ensure_stack_deployed more reliable #177

Merged
decentral1se merged 1 commits from improved-stack-deploy-guarantees into main 2021-06-27 19:03:48 +00:00
Owner

Closes #165.

Cases to be solved from my testing:

  • [x] Successful deployment
  • [x] Failed update (we see the healthcheck fail on the new container)
  • [ ] Container refuses to start because MaxRetryAttempts has been reached (when no existing container is up, which is typically when you're trying to deploy the app for the first time)
  • [x] Handling missing healthchecks gracefully

Overall goals are: make it work and keep it fast.

All of these rely on healthchecks on each container, so we need to implement them in all our apps and make them standard from now on. Unfortunate, but I am getting better at understanding how they work! I have hope.

It seems like a new default deploy: ... configuration is emerging:

deploy:
  update_config:
    failure_action: rollback
    order: start-first
  rollback_config:
    order: start-first
  restart_policy:
    max_attempts: 3

The restart_policy ensures that the service won't be stuck in a restart loop forever.
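
As a rough illustration of how the deploy settings above could be checked across apps, here is a minimal Python sketch. It is not part of abra; the function name `missing_deploy_settings` and the dict-based service shape are assumptions made for this example, with the service assumed to be an already-parsed compose service definition.

```python
from __future__ import annotations

from typing import Any

# Recommended defaults, copied from the deploy config in this PR description.
RECOMMENDED_DEPLOY = {
    "update_config": {"failure_action": "rollback", "order": "start-first"},
    "rollback_config": {"order": "start-first"},
    "restart_policy": {"max_attempts": 3},
}


def missing_deploy_settings(service: dict[str, Any]) -> list[str]:
    """Return dotted paths of recommended settings that are absent or different.

    `service` is assumed to be one parsed entry from a compose file's
    `services:` mapping (hypothetical helper, not an abra function).
    """
    deploy = service.get("deploy") or {}
    problems = []
    for section, expected in RECOMMENDED_DEPLOY.items():
        actual = deploy.get(section) or {}
        for key, value in expected.items():
            if actual.get(key) != value:
                problems.append(f"deploy.{section}.{key}")
    # The deployment guarantees discussed here depend on a healthcheck existing.
    if "healthcheck" not in service:
        problems.append("healthcheck")
    return problems
```

A service that only sets `update_config` would be reported as missing the rollback order, the restart limit and the healthcheck.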

decentral1se force-pushed improved-stack-deploy-guarantees from f151018f10 to c7f838178c 2021-06-10 22:44:09 +00:00 Compare
Author
Owner

It seems a core element of making this run fast is configuring the healthcheck:

  • https://docs.docker.com/engine/reference/builder/#healthcheck
  • https://docs.docker.com/compose/compose-file/compose-file-v3/#healthcheck

The default settings are very generic. A redis container is up in about 3 seconds, but the first healthcheck won't run until 30 seconds in. So by tailoring each of these values to some better default, we can get speedy deployments and abra doesn't have to manage it. It pushes the complexity to swarm itself, which is ideal.

Also:

https://engineering.issuu.com/2018/09/12/confident-deployments
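
To make the timing point concrete, here is a small Python sketch of how the healthcheck values bound deployment speed. The defaults (interval 30s, timeout 30s, retries 3, start period 0s) are Docker's documented ones; the class and method names are made up for this example, and `latest_unhealthy` is only a rough upper bound under the simplifying assumption that every probe runs for the full timeout.

```python
from dataclasses import dataclass


@dataclass
class Healthcheck:
    """Healthcheck timing values in seconds (defaults mirror Docker's)."""

    interval: float = 30.0
    timeout: float = 30.0
    retries: int = 3
    start_period: float = 0.0

    def earliest_healthy(self, check_duration: float = 0.0) -> float:
        # The first probe only runs `interval` seconds after the container
        # starts, so nothing can be reported healthy before that.
        return self.interval + check_duration

    def latest_unhealthy(self) -> float:
        # Rough upper bound (assumption): failures inside the start period
        # do not count, and each counted probe waits `interval` and may
        # then take up to `timeout` before it is recorded as a failure.
        return self.start_period + self.retries * (self.interval + self.timeout)


default = Healthcheck()
tuned = Healthcheck(interval=5, timeout=3, retries=3)
print(default.earliest_healthy(), tuned.earliest_healthy())
print(default.latest_unhealthy(), tuned.latest_unhealthy())
```

With the defaults, a container that is ready in 3 seconds still waits for the 30 second mark; a 5 second interval brings that down to 5 seconds without any extra logic in abra.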
decentral1se force-pushed improved-stack-deploy-guarantees from c7f838178c to 4ed9f119d8 2021-06-15 21:53:16 +00:00 Compare
decentral1se force-pushed improved-stack-deploy-guarantees from 4ed9f119d8 to 7322105fd9 2021-06-17 08:32:42 +00:00 Compare
decentral1se force-pushed improved-stack-deploy-guarantees from 7322105fd9 to 4593451dab 2021-06-19 19:06:45 +00:00 Compare
decentral1se force-pushed improved-stack-deploy-guarantees from 4593451dab to 62999a1732 2021-06-27 19:01:30 +00:00 Compare
Author
Owner

OK, I'm gonna merge this as is now. One edge case is still not covered. Example output:

INFO: works_r5_should_also_work is healthy!
INFO: works_r4_disabled_health_check has no healthcheck configured, cannot guarantee this service comes up successfully...
INFO: Deploying: 2/5 (timeout: 1/60)
INFO: works_r3_no_health_check has no healthcheck configured, cannot guarantee this service comes up successfully...
INFO: Deploying: 3/5 (timeout: 2/60)
INFO: Deploying: 3/5 (timeout: 3/60)
INFO: Deploying: 3/5 (timeout: 4/60)
INFO: works_r1_should_work is healthy!
INFO: Deploying: 4/5 (timeout: 5/60)
WARNING: Healthcheck for new instance of works_r2_broken_health_check is failing (exit code: 127)
WARNING: /bin/sh: foobar: not found
ERROR: healthcheck for works_r2_broken_health_check is failing, this deployment did not succeed :(
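
The polling behaviour visible in the output above can be sketched roughly as follows. This is a Python illustration only, not the actual `ensure_stack_deployed` code; `wait_for_stack`, the `Health` enum and the `get_health` callback are hypothetical names, and how health is read from swarm is deliberately left to the caller.

```python
from __future__ import annotations

import time
from enum import Enum
from typing import Callable


class Health(Enum):
    NONE = "none"            # no healthcheck configured, or it is disabled
    STARTING = "starting"    # healthcheck exists but has not passed yet
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


def wait_for_stack(
    services: list[str],
    get_health: Callable[[str], Health],  # hypothetical lookup, supplied by caller
    max_polls: int = 60,
    pause: Callable[[], None] = lambda: time.sleep(1),
    log: Callable[[str], None] = print,
) -> bool:
    """Poll until every service is healthy or unverifiable; fail fast otherwise."""
    pending = list(services)
    for attempt in range(1, max_polls + 1):
        for name in list(pending):
            state = get_health(name)
            if state is Health.UNHEALTHY:
                log(f"ERROR: healthcheck for {name} is failing, this deployment did not succeed :(")
                return False
            if state is Health.HEALTHY:
                log(f"INFO: {name} is healthy!")
                pending.remove(name)
            elif state is Health.NONE:
                # Missing healthchecks are handled gracefully: warn and move on.
                log(f"INFO: {name} has no healthcheck configured, cannot guarantee this service comes up successfully...")
                pending.remove(name)
        if not pending:
            return True
        done = len(services) - len(pending)
        log(f"INFO: Deploying: {done}/{len(services)} (timeout: {attempt}/{max_polls})")
        pause()
    return False  # ran out of polls with services still starting
```

Failing fast on the first unhealthy service keeps the loop quick, which matches the "make it work and keep it fast" goal; services without a healthcheck are counted as done because there is nothing to wait for.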
decentral1se force-pushed improved-stack-deploy-guarantees from 62999a1732 to dccfff0c87 2021-06-27 19:02:01 +00:00 Compare
decentral1se force-pushed improved-stack-deploy-guarantees from dccfff0c87 to 93714a593b 2021-06-27 19:03:29 +00:00 Compare
decentral1se changed title from WIP: make ensure_stack_deployed reliable to Make ensure_stack_deployed more reliable 2021-06-27 19:03:42 +00:00
decentral1se merged commit 0ab2b3a652 into main 2021-06-27 19:03:48 +00:00
decentral1se deleted branch improved-stack-deploy-guarantees 2021-06-27 19:10:12 +00:00