Evaluate Deployment Health Metrics
Overview
This document is not yet comprehensive.
Once our metrics become more reliable, this runbook can hopefully be removed.
Tracking issue: https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2255
Auto-deploy automatically queries the health of our services before deploying a
package to production. The same checks can be run manually with the chatops
command /chatops run auto_deploy blockers.
These checks perform a few actions:
- Looking for any high severity incidents or change requests that should block deployments
- Determining if a deployment is active
- Determining if an environment is locked
- Determining if Canary is drained for a given environment
- Performing a Prometheus query to determine the health of our services.
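The checks listed above can be thought of as a set of independent gates whose results are combined into a single list of blockers. The following is a minimal sketch of that idea, not the actual auto-deploy implementation. The class, field, and function names are hypothetical, and it assumes that each condition (including a drained Canary) blocks a deploy on its own.

```python
from dataclasses import dataclass


@dataclass
class BlockerChecks:
    """Results of the pre-deploy checks. All field names are hypothetical."""
    blocking_incidents_or_changes: bool  # high-severity incident or change request is open
    deployment_active: bool              # another deployment is in progress
    environment_locked: bool             # the target environment is locked
    canary_drained: bool                 # Canary is drained for the environment
    services_healthy: bool               # outcome of the Prometheus health query


def blockers(checks: BlockerChecks) -> list[str]:
    """Return one reason per failing check; an empty list means nothing blocks.

    Assumption: every condition below blocks a deploy independently.
    """
    reasons = []
    if checks.blocking_incidents_or_changes:
        reasons.append("high-severity incident or change request open")
    if checks.deployment_active:
        reasons.append("a deployment is already active")
    if checks.environment_locked:
        reasons.append("environment is locked")
    if checks.canary_drained:
        reasons.append("canary is drained")
    if not checks.services_healthy:
        reasons.append("service health query reports unhealthy")
    return reasons
```

Returning the reasons rather than a single boolean keeps the output useful to a human deciding whether a particular blocker can be overridden.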
Evaluation
Sometimes our metrics are unreliable, and the last of these checks (the Prometheus health query) must be performed manually. To do so, open the following dashboard for each component and observe the Aggregate Error and Apdex SLIs, which are the top two left panels on each Overview dashboard. If either of these is violating its threshold, it should be deemed unsafe to perform a deploy.
- cny-api: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
- cny-git: https://dashboards.gitlab.net/d/git-main/git-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
- cny-web: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
- main-api: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1
- main-git: https://dashboards.gitlab.net/d/git-main/git-overview?orgId=1
- main-sidekiq: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- main-web: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1
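The dashboard links above follow a regular pattern, so they can be generated rather than copied by hand. The sketch below rebuilds them from a service name and a stage. The function name `overview_url` is hypothetical, and the pattern is inferred only from the links in this runbook; it may not hold for other dashboards.

```python
from urllib.parse import urlencode

BASE = "https://dashboards.gitlab.net/d"


def overview_url(service: str, stage: str = "main") -> str:
    """Build the Overview dashboard URL for a service ("api", "git", "web", "sidekiq").

    Assumption: the cny links carry the extra var-* parameters and the
    main links carry only orgId, as in the list in this runbook.
    """
    params = {"orgId": "1"}
    if stage == "cny":
        params.update({
            "var-PROMETHEUS_DS": "Global",
            "var-environment": "gprd",
            "var-stage": "cny",
        })
    return f"{BASE}/{service}-main/{service}-overview?{urlencode(params)}"


if __name__ == "__main__":
    for stage, services in (("cny", ["api", "git", "web"]),
                            ("main", ["api", "git", "sidekiq", "web"])):
        for service in services:
            print(f"{stage}-{service}: {overview_url(service, stage)}")
```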
Moving Forward
If it is determined safe to perform a deploy, proceed to do so per our standard procedure.
If it is not safe to proceed, engage the current Engineer On-Call (EOC). The situation may warrant opening an incident, an incident may already exist, or we may simply need approval to proceed with a deploy. This is very situation-specific, and it is up to the EOC to approve the deploy.
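The manual decision described in this runbook can be summarized as a small function. This is only a sketch: the function name and the example values are hypothetical, and the real thresholds should be read from the dashboard panels. It assumes Apdex is violated when it falls below its threshold and the error ratio is violated when it rises above its threshold.

```python
def next_step(apdex: float, error_ratio: float,
              apdex_threshold: float, error_threshold: float) -> str:
    """Map the two Overview SLIs to the next action in this runbook.

    Assumption: Apdex is "higher is better", error ratio is "lower is better".
    """
    apdex_ok = apdex >= apdex_threshold
    errors_ok = error_ratio <= error_threshold
    if apdex_ok and errors_ok:
        return "proceed with the standard deploy procedure"
    return "engage the EOC"


if __name__ == "__main__":
    # Hypothetical readings and thresholds, for illustration only.
    print(next_step(apdex=0.998, error_ratio=0.0002,
                    apdex_threshold=0.995, error_threshold=0.001))
```

A single violated SLI is enough to route the decision to the EOC, which matches the rule that either panel violating its threshold makes a deploy unsafe.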