Handling Deployment failures
Sometimes a deploy pipeline fails. We should always be aware of these failures through the notifications that come into Slack in the #announcements channel. Note that you will only see one failure notice per pipeline, so keep an eye on a pipeline after a retry.
If you cannot resolve the failure within 5 minutes, follow these steps for help:
- Consider the impact of the failure. If GitLab.com users are potentially affected (for example, any production failure), declare an incident and work to resolve it rapidly.
- If GitLab.com users aren't being affected, for example because the failure is on the staging-canary stage, start by opening an issue in the release issue tracker. Include the date, the package name, and details of the failure. Label it with "~release-blocker". Use this issue to track discussions with other teams and reach a resolution.
- For QA failures, we have additional guidance in the runbook for resolving QA failures.
- If at any time you need help from Dev-escalation (a developer) or Reliability, please report an incident. Make sure to apply the correct availability severity and "~delivery impact::*" labels.
- Once resolved, please add the Deploys-block-gprd::* and Deploys-block-gstg::* labels to record the delay so we can improve things.
Anyone can stop a deployment by following the steps in deployment blockers.
Failure Cases
Possible failures, and ways to respond for the individual deployment jobs:
- Missing branches or packages - The package pipeline on the Dev instance has failed, or is running for longer than expected. You can try to debug it yourself from https://dev.gitlab.org/gitlab/omnibus-gitlab/-/pipelines or reach out to #g_distribution to ask for support.
- Prepare job - This job ensures that servers are in the appropriate state in our load balancers and that the version of GitLab is the same across the fleet. This type of failure must be remediated manually; check the job log for guidance.
- Fleet Deploy - This can fail for a myriad of reasons. Each failure should be investigated to determine the appropriate course of action, accompanied by an issue where appropriate. A retry is usually safe to attempt, but use your judgement based on the error reported.
- QA - QA jobs run after the deployment completes. If any QA tasks fail, we should assume something has regressed with the deployment. Use the runbook to resolve the failure.
- Post Deploy Migrations - See post-deployment migration failures guideline.
Troubleshooting
Production deploy failed, but okay to leave in place
In situations where a production deploy fails for any reason (such as post-deploy migration failures), but it is deemed safe to leave production as-is, we need to ensure that future deploys are not blocked. Run the following from the deploy-tooling repository:
CURRENT_DEPLOY_ENVIRONMENT=<env> DEPLOY_VERSION=<version> ./bin/set-version
./bin/set-omnibus-updates gstg --enable
This will configure the <env>-omnibus-version chef role to the appropriate version and ensure that installation of that version is enabled. This happens automatically during successful pipelines.
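As a sketch, a concrete invocation for a production failure might look like the lines below. The version string is illustrative, and we are assuming set-omnibus-updates takes the environment as its first argument, as the gstg example above suggests; check each script's usage in deploy-tooling if unsure.
# Illustrative values only: use the package version that was actually deployed
CURRENT_DEPLOY_ENVIRONMENT=gprd DEPLOY_VERSION=13.1.202006091527-abc123def45.678901abcde ./bin/set-version
./bin/set-omnibus-updates gprd --enable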
These steps are slated to become ChatOps commands: https://gitlab.com/gitlab-com/gl-infra/delivery/issues/524
Prepare job discovered nodes in DRAIN
The prepare job (<ENV>-prepare) will fail if nodes are not in state UP or MAINT. In any other state, the prepare job will hard fail, noting which frontend server contains the state and which backend server is in that state. Unless there is known maintenance happening, there should not be a situation where a server isn't in either of the preferred states.
If a server is in state DRAIN, a prior deploy may not have fully completed. In this case, set the server into MAINT and retry.
This can be accomplished by following the documentation for setting haproxy server state. If you don’t have access to the nodes, ask the SRE on call for GitLab Production to help you with this. You can find the engineer on call through ChatOps:
/chatops run oncall production
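If you do have access, one way to move a server from DRAIN to MAINT yourself is through the HAProxy admin socket, as in the sketch below. The socket path and the backend/server names here are assumptions; the documentation for setting haproxy server state remains the authoritative procedure.
# Illustrative only: set the server to MAINT via the HAProxy runtime API
echo "set server web_backend/web-01 state maint" | sudo socat stdio /run/haproxy/admin.sock
# Verify the new state before retrying the prepare job
echo "show servers state web_backend" | sudo socat stdio /run/haproxy/admin.sock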
Fleet Deploy
- E: Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)
  - This is common if there's a collision between a chef run and the deploy trying to perform similar actions; see the sketch after this list for how to check what is holding the lock.
  - Though we've tried to eliminate these issues as much as possible, hitting retry is usually the best way to allow the deploy to continue.
- Timeout when setting haproxy state
  - This is common for our pages fleet when pages is starting up. The Pages service takes a long time to start, so hitting retry for this job helps. The timeout is already very high on this particular task; if we hit the timeout again, we should open an issue to investigate further.
  - If this happens on a server unrelated to the pages service starting, it must be deeply investigated on the node that exhibited the failure.
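For the apt lock collision above, a quick read-only check on the affected node might look like the following sketch (assuming you have SSH access to the node):
pgrep -fa chef-client                  # is a chef run currently in progress?
sudo fuser -v /var/lib/apt/lists/lock  # which process, if any, holds the apt lists lock?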