Release Manager incident guide

Release Managers will encounter incidents as they go through their deployment and release tasks. Involvement comes either at the request of the EOC or IM, or because Delivery has raised an incident to request help with deployments or releases.

This guide explains what to expect and what our responsibilities are for the different types of incident.

Be sure to read through the full Incident Management handbook page to understand the overall process.

MR needs to be deployed to fix incident

graph TB;
   id1{Is the MR for an S1 incident?}

   id2(The EOC can choose to perform a hot-patch <br/>if we need to deploy it extremely fast, in less than 8 hours)
   id1-- Yes -->id2

   id3(Note that a deployment will remove the hot patch,<br/> so we cannot deploy after the hot patch until the MR has been merged)
   id2-->id3

   id3_2{Is the MR associated with a security issue?}
   id3-->id3_2

   id3_3{Does AppSec want a critical security release?}
   id3_2-- Yes -->id3_3

   id3_4(Coordinate with AppSec to <br/>start a critical security release)
   id3_3-- Yes -->id3_4

   id4{Is the MR associated with a security issue?}
   id1-- No -->id4

   id5{Does it need a speedy deployment?}
   id4-- No -->id5

   id6(Follow the docs to speed up the auto-deploy process)
   id5-- Yes -->id6
   id3_2-- No -->id6

   id7(The MR will be automatically included <br/>in the next auto-deploy package once it is merged)
   id5-- No -->id7

   id10(Ask the MR author to follow the <br/>security developer workflow to include the MR in the next security release)
   id4-- Yes -->id10
   id3_3-- No -->id10

Notes:

  1. These are recommendations to help guide release managers. However, release managers may sometimes have to use their judgement in an unusual situation. Feel free to ask the team for help in making decisions. You do not have to make them alone.
  2. Do not merge a security MR into security master before the start of a security release. This will cause canonical and security repositories to diverge, which will cause problems when the security release starts.
  3. Hot patches should only be done for S1 issues, and even then they should be used only when absolutely necessary.

References

  1. Speed up auto-deploy process
  2. Security developer workflow

Release Manager support request from EOC or IM

During ongoing incidents the EOC or IM may request support from release managers using the @release-managers Slack handle.

Release managers should treat this as a top priority request and join the Slack channel as well as the Zoom room if required.

Typically we will be asked to give details about:

  1. Any ongoing or recently deployed changes. A compare link between recent packages is helpful (see the sketch after this list)
  2. Suitability of the package for rollback. If suitable, you may want to recommend this option as a fast mitigation for software-change incidents
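
For the compare link, here is a minimal sketch of building a commit-range compare URL between two package SHAs. The repository path and the example SHAs are assumptions for illustration only; use the real values for the packages in question.

    # Minimal sketch: build a GitLab compare URL between two deployed package SHAs.
    # The repository path and the example SHAs below are placeholders.
    def package_compare_url(old_sha: str, new_sha: str,
                            repo: str = "https://gitlab.com/gitlab-org/security/gitlab") -> str:
        return f"{repo}/-/compare/{old_sha}...{new_sha}"

    print(package_compare_url("a1b2c3d4", "e5f6a7b8"))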

Remember that every incident issue has a comment on it providing useful information and links to help answer these questions. Take a look at the Useful information/tools to debug the incident section below for more details.

You may need to take action to prevent further deploys or to help get a mitigation deployed; this will be a rollback, revert, fix, or hot patch, depending on the specific incident. Discuss with the EOC and ask Delivery for support if you’re not sure how to proceed.

During these incidents please add comments to the incident issue and help the EOC complete the incident summary, timeline, labelling, and identification of corrective actions following the incident.

If the incident blocked deployments, please add the appropriate “deploys-blocked…” labels.

Release Manager requesting support

During times when we cannot deploy we are vulnerable: in the event of a high-severity problem we would be unable to quickly deploy a fix. Treat all blockers as high-risk and move fast to unblock.

We have two ways to record and track deployment failures: lightweight tracking and incidents. Follow these steps to decide which is right for this failure.

If a deployment or release failure needs investigation or action from dev-escalation or the EOC, the Release Manager should declare an incident with the correct availability severity. For incidents that don’t have customer impact, for example Staging failures or patch release failures, include the “backstage” label to assist the EOC with prioritization. Including one of the Delivery Impact labels will help make the impact of the incident visible.

For example, a staging deployment failure caused by an environment problem would be a Severity 3 incident with the “backstage” label and a “Delivery Impact::1” label.
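
As a rough illustration, an incident like this could be opened with the python-gitlab library. The token, project path, and exact label names below are assumptions, not the canonical process; check the current incident declaration instructions for the real values.

    import gitlab

    # Sketch only: token, project path, and label names are placeholders.
    gl = gitlab.Gitlab("https://gitlab.com", private_token="glpat-REDACTED")
    project = gl.projects.get("gitlab-com/gl-infra/production")  # assumed incident project

    incident = project.issues.create({
        "title": "Staging deployment failure: <short description>",
        "description": "What failed, when it failed, and the current impact on deployments.",
        "labels": ["incident", "backstage", "Delivery Impact::1", "severity::3"],
    })
    print(incident.web_url)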

Follow the instructions to raise a new incident.

Once the incident is created:

  1. You are the Owner for this incident. Be sure to keep people informed and be proactive in working towards a resolution
  2. Fill in as much information as you have on the incident issue. Assign it to yourself
  3. Join the incident Slack channel and consider joining the Zoom bridge
  4. Engage with the EOC, Dev escalation, or Quality On Call as needed to resolve the issue
  5. Keep the incident issue updated with the latest decisions and information uncovered

Once the incident has been mitigated or resolved:

  1. Set the incident issue label to “resolved” once the issue is solved
  2. Add the appropriate root cause and service labels
  3. Work with engineers to agree on corrective actions following this incident
  4. Make sure the summary and timeline sections of the description are fully completed
  5. Record the length of the delay we experienced by adding the appropriate “deploys-blocked…” labels

Tracking deployment blockers

To help track time and patterns in deployment blockers, any incident or Change Request (CR) that blocks a deployment should have the appropriate “Deploys-blocked-gstg-X” and “Deploys-blocked-gprd-X” labels added. See this incident for an example. Deployment delays are recorded and analyzed on https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/659
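
A minimal sketch of appending these labels to an existing incident issue with python-gitlab follows. The issue iid and the label duration values (the “X” suffix) are placeholders, and the project path is assumed.

    import gitlab

    gl = gitlab.Gitlab("https://gitlab.com", private_token="glpat-REDACTED")
    project = gl.projects.get("gitlab-com/gl-infra/production")  # assumed incident project
    incident = project.issues.get(12345)                         # hypothetical incident iid

    # Append the blocked-time labels without dropping the labels already present.
    incident.labels = incident.labels + ["Deploys-blocked-gstg-2h", "Deploys-blocked-gprd-2h"]
    incident.save()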

Useful information/tools to debug the incident

  1. Check the diff of the latest deployment to the environment.

    The /chatops run auto_deploy status command is useful for this.

    The GitLab Release Tools Bot puts a comment on the incident issue listing the current and previous package on production, along with the diff. This can be used for incidents affecting production.

    Also post the diff to the incident issue and incident Slack channel. Someone else might see something that you missed.

  2. Check if any feature flags have been turned on or turned off.

    The incident issue description has a link called Feature Flag Changes which links to logs showing feature flag changes in the last few hours.

    Every feature flag change creates an issue in https://gitlab.com/gitlab-com/gl-infra/feature-flag-log/-/issues?scope=all&sort=created_date&state=all, so that is also a good place to check. You can filter by labels to check for changes in a particular environment only.
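
    As a sketch, the same log can be queried with python-gitlab. The environment label name used below is an assumption; check the project's label list for the environment you need.

        import gitlab

        # The feature-flag-log project is assumed to be publicly readable.
        gl = gitlab.Gitlab("https://gitlab.com")
        project = gl.projects.get("gitlab-com/gl-infra/feature-flag-log")

        # "host::gitlab.com" is a placeholder; filter on whichever environment
        # label the project actually uses.
        for issue in project.issues.list(labels="host::gitlab.com",
                                         order_by="created_at", sort="desc"):
            print(issue.created_at, issue.title)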