Handling Deployment failures
Sometimes a deploy pipeline fails. We should always be aware of these failures through the notifications that come into Slack in the #announcements channel. Note that you will only see one failure notice per pipeline, so keep an eye on a pipeline after a retry.
If you cannot resolve the failure within 5 minutes, follow these steps for help:
- Consider the impact of the failure. If GitLab.com users are potentially affected (for example, any production failure), declare an incident and work to resolve it rapidly.
- If GitLab.com users aren't being affected, for example because the failure is on the staging-canary stage, start by opening an issue in the release issue tracker. Include the date, the package name, and details of the failure. Label it with "~release-blocker". Use this issue to track discussions with other teams and reach a resolution.
- For QA failures, we have additional guidance in the runbook for resolving QA failures.
- If at any time you need help from Dev-escalation (a developer) or Reliability, please report an incident. Make sure to apply the correct availability severity and "~delivery impact::*" labels.
- Once resolved, please add the Deploys-block-gprd::* and Deploys-block-gstg::* labels to record the delay so we can improve things.
Anyone can stop a deployment by following the steps in deployment blockers.
Failure Cases
Possible failures, and ways to respond for the individual deployment jobs:
- Missing branches or packages - The package pipeline on the Dev instance has failed, or is running for longer than expected. You can try to debug it yourself from https://dev.gitlab.org/gitlab/omnibus-gitlab/-/pipelines or reach out to #g_distribution to ask for support.
- Prepare job - This job ensures that servers are in the appropriate state in our load balancers and that the version of GitLab is the same across the fleet. This type of failure must be remediated manually; check the job log for guidance.
- Fleet Deploy - This can fail for a myriad of reasons. Each failure should be investigated to determine the appropriate course of action, accompanied by an issue where appropriate. A retry is usually safe to attempt, but use your judgement based on the error reported.
- QA - QA jobs run after the deployment completes. If any QA tasks fail, we should assume something has regressed with the deployment. Use the runbook to resolve the failure.
- Post Deploy Migrations - See post-deployment migration failures guideline.
Troubleshooting
Production deploy failed, but okay to leave in place
In situations where a production deploy fails for any reason (such as post-deploy migration failures), but it is deemed safe to leave production as-is, we need to ensure that future deploys are not blocked. Run the following from the deploy-tooling repository:
CURRENT_DEPLOY_ENVIRONMENT=<env> DEPLOY_VERSION=<version> ./bin/set-version
./bin/set-omnibus-updates gstg --enable
This will configure the <env>-omnibus-version chef role to the appropriate version and ensure that installation of that version is enabled. This happens automatically during successful pipelines.
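As a sketch, a concrete invocation for a production failure might look like the lines below. The version string is illustrative, and we are assuming set-omnibus-updates takes the environment as its first argument, as the gstg example above suggests; check each script's usage in deploy-tooling if unsure.
# Illustrative values only: use the package version that was actually deployed
CURRENT_DEPLOY_ENVIRONMENT=gprd DEPLOY_VERSION=13.1.202006091527-abc123def45.678901abcde ./bin/set-version
./bin/set-omnibus-updates gprd --enable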
These steps are slated to become ChatOps commands: https://gitlab.com/gitlab-com/gl-infra/delivery/issues/524
Prepare job discovered nodes in DRAIN
The prepare job (<ENV>-prepare) will fail if nodes are not in state UP or MAINT. In any other state, the prepare job will hard fail, noting which frontend server contains the state and which backend server is in that state. Unless there is known maintenance happening, there should not be a situation where a server isn't in either of the preferred states.
If a server is in state DRAIN, a prior deploy may not have fully completed. In this case, set the server into MAINT and retry.
This can be accomplished by following the documentation for setting haproxy server state. If you don’t have access to the nodes, ask the SRE on call for GitLab Production to help you with this. You can find the engineer on call through ChatOps:
/chatops run oncall production
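If you do have access, one way to move a server from DRAIN to MAINT yourself is through the HAProxy admin socket, as in the sketch below. The socket path and the backend/server names here are assumptions; the documentation for setting haproxy server state remains the authoritative procedure.
# Illustrative only: set the server to MAINT via the HAProxy runtime API
echo "set server web_backend/web-01 state maint" | sudo socat stdio /run/haproxy/admin.sock
# Verify the new state before retrying the prepare job
echo "show servers state web_backend" | sudo socat stdio /run/haproxy/admin.sock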
Fleet Deploy
- E: Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)
  - This is common if there's a collision between a chef run and the deploy trying to perform similar actions; see the sketch after this list for how to check what is holding the lock.
  - Though we've tried to eliminate these issues as much as possible, hitting retry is usually the best way to allow the deploy to continue.
- Timeout when setting haproxy state
  - This is common for our pages fleet when pages is starting up. The Pages service takes a long time to start, so hitting retry for this job helps. The timeout is already very high on this particular task; if we hit the timeout again, we should open an issue to investigate further.
  - If this happens on a server unrelated to the pages service starting, it must be deeply investigated on the node that exhibited the failure.
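For the apt lock collision above, a quick read-only check on the affected node might look like the following sketch (assuming you have SSH access to the node):
pgrep -fa chef-client                  # is a chef run currently in progress?
sudo fuser -v /var/lib/apt/lists/lock  # which process, if any, holds the apt lists lock?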