Severity::1 Incidents
Overview
This document serves as an action reference for release managers during the highest severity issues affecting the availability of GitLab.com.
The current definition of an availability S1 states that there is an immediate impact on customers' business goals and that the problem must be resolved with the highest priority.
S1 incident
As an example of an S1 outage, we'll reference a scenario where, after an application deployment to GitLab.com, one of the most important workloads (such as CI pipelines) stops working for everyone on the platform.
At the start of the incident we would have:
- An incident initiated in the #incident-management Slack channel.
- The SRE on call, the communications manager, the current release manager, the developer on call, and a GitLab.com support member in the incident Zoom room.
If any of the above are missing, ensure that you speak up and invite the missing people or create the missing resources.
As a release manager, your task is to provide sufficient background on the tooling and the changes that were introduced into the environment.
This means:
- Provide the commit diff that was deployed to production.
  - E.g. use `/chatops auto_deploy status` to find the latest branch running in production and link it to the developer on call for further investigation, or check the Grafana dashboard to find the latest deployed commit (see the sketch after this list).
- Advise on the possible remediation options.
- Define a timeline for the specific remediation options chosen in the incident discussion.
  - E.g. think about what can be done immediately, what can wait a couple of hours, and what should be excluded from the conversation.
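A minimal sketch of assembling the production commit diff, assuming the previously deployed and newly deployed commit SHAs have already been identified (for example from the `auto_deploy status` output or the Grafana dashboard); the SHAs below are placeholders:

```shell
# Placeholder SHAs: replace with the previously deployed and currently deployed commits
PREVIOUS_SHA=1a2b3c4d
CURRENT_SHA=5e6f7a8b

# GitLab's compare view renders the full commit diff between the two deployments
echo "https://gitlab.com/gitlab-org/gitlab/-/compare/${PREVIOUS_SHA}...${CURRENT_SHA}"
```

Linking this compare URL in the incident channel gives the developer on call a single place to scan everything that shipped with the deployment.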
Remediation options
Consider the following options in combination with the preferred timeline:
- The source of the problem is unknown
  - Consult the feature flag log and consider disabling any recently enabled features (see the ChatOps example after this list).
  - Deployment rollback. If the deployment contains background database migrations, this option should be excluded.
- The source of the problem is known
  - Issue a post-deployment patch to patch the production fleet. Post-deployment patches can only be applied when the fix is in the Rails source code; they are applied to the Web, API, Git, and Sidekiq VMs as well as the Kubernetes clusters.
  - Revert the change that caused the issue and pick the revert into the auto-deploy branch (see the git sketch after this list). It is OK to tag immediately after picking, without waiting for pipelines to pass on the alternative repository (likely on dev.gitlab.org), by using the security release process pipeline override. If you use this option, confirm that all specs have passed before deploying to production.
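A hedged ChatOps example for disabling a recently enabled feature flag on GitLab.com; the flag name is a placeholder, and the exact invocation should be confirmed against the ChatOps documentation before running it:

```shell
# Disable a recently enabled feature flag on GitLab.com
# (some_recent_feature is a placeholder; take the real name from the feature flag log)
/chatops run feature set some_recent_feature false
```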
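A minimal git sketch of the revert-and-pick flow, assuming the offending merge commit and the current auto-deploy branch have already been identified; the SHA and branch name below are placeholders:

```shell
# Revert the offending merge commit on the default branch (SHA is a placeholder)
git fetch origin
git checkout -b revert-offending-change origin/master
git revert -m 1 0123456789abcdef
git push origin revert-offending-change   # open an MR and get the revert merged

# Pick the merged revert into the current auto-deploy branch (branch name is a placeholder)
git checkout 16-0-auto-deploy-2023051200
git cherry-pick -x <sha-of-the-merged-revert>
git push origin 16-0-auto-deploy-2023051200
```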
Timeline
It is important to note that in situations such as this one, the focus should be exclusively on reverting to the last known working state.
Challenge any decision to create a forward fix for the problem: if the fix turns out to be broken, time has been lost and there are now two problems. There are edge cases where a forward fix is acceptable, but they should be very rare.
In the first hour of the incident, it is common to consider the following options:
- Disabling feature flags
- Environment hot patch
- Deployment rollback
Past the first hour, the options to consider are:
- Reverting the offending code
- Creating a new deployment