How to deal with security fixes that introduce breaking changes?

Security fixes can have far-reaching impact or inadvertently introduce breaking changes. This could affect customers and/or GitLab.com operations and lead to a production incident.

This runbook guides Release Managers through these scenarios.

What to do if the security fix is involved in an incident? (Overview)

Ensure there are engineers with topical knowledge in the incident call. At the very minimum, the following people should be involved:
- An engineer from the team that developed the security fix, preferably the author or their manager.
- The security release manager or a security manager from AppSec.
Along with these engineers, evaluate the impact of the security fix. The following points should be clarified:
- Security fix severity. Note this might differ from the incident’s severity.
- Root of the incident. Was the incident a consequence of an actual vulnerability being fixed, or was it a bug associated with the security fix?
- Context of security fix and vulnerability:
  - Does the security fix repair a vulnerability, or is it a security improvement?
  - Is the security vulnerability already disclosed to the public? If not, what would be the impact of disclosing it?
Ensure EOC, IMOC and Support are aware of this incident. If the security fix is affecting some customers, it is likely to affect more of them.
Help mitigate the incident: operations on GitLab.com need to be restored.
Then, organize the security release: after the incident has been mitigated, the timeline of the security release should be re-evaluated to account for the re-introduction of the security fix.

At any point in time, escalate if necessary.

Mitigate the incident

Revert and rollback are options Release Managers can suggest to mitigate the incident. If the incident is categorized as an S1, hot-patching would be the preferred option. Avoid hot-patching for lower severities, since it’s an intrusive operation.

Revert

Depending on the severity, the safest and quickest way to mitigate the incident could be reverting the security fix:

- The revert needs to be performed in the Security repository.
- If necessary, emphasize to AppSec that reverting doesn’t leak the vulnerability to the public.
- To gain more time, consider a speedy deployment.
- Notify EOC, TAM and Support about deployment timings, e.g. a revert should land in production in approximately 8 hours.
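As an illustration of the last point, the notification timing can be sketched in a few lines of Python. The function name `revert_eta_message` and the message wording are hypothetical and not part of any GitLab tooling. The 8-hour default is only the example figure quoted above and should be replaced with the actual deployment estimate.

```python
from datetime import datetime, timedelta, timezone


def revert_eta_message(merged_at: datetime, hours_to_production: float = 8) -> str:
    """Build a short notice for EOC, TAM and Support (hypothetical helper).

    merged_at must be a timezone-aware datetime. The 8-hour default is an
    assumption taken from the example in this runbook, not a guarantee.
    """
    merged_utc = merged_at.astimezone(timezone.utc)
    eta_utc = merged_utc + timedelta(hours=hours_to_production)
    return (
        f"Revert of the security fix merged at {merged_utc:%Y-%m-%d %H:%M} UTC; "
        f"expected in production around {eta_utc:%Y-%m-%d %H:%M} UTC."
    )


if __name__ == "__main__":
    print(revert_eta_message(datetime.now(timezone.utc)))
```

Normalizing to UTC before formatting keeps the message unambiguous for people in different time zones.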

Rollback

Rollback is a fast way to mitigate the incident. However, rolling back a security fix has some downsides:

- It could imply rolling back other security fixes that were deployed at the same time as the one that caused the incident, which leaves GitLab.com vulnerable.
- Depending on the incident and security severities, rolling back might not be the most suitable option, e.g. if the security vulnerability is an S4 and the incident is an S3, it’s more reasonable to revert the MR and start the auto-deploy process.

Consider rolling back if the security fix was initially deployed in isolation.
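The guidance in this section can be summarized as a rough decision sketch. This is an aid for discussion in the incident call, not an automated policy. The names `Severity`, `Incident` and `suggest_mitigation` are hypothetical, and the ordering of the checks is an assumption drawn from the text above (hot-patch for an S1 incident, rollback only when the fix was deployed in isolation, revert otherwise).

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    S1 = 1
    S2 = 2
    S3 = 3
    S4 = 4


@dataclass
class Incident:
    severity: Severity
    # True when the security fix was deployed on its own, so rolling back
    # would not also remove other security fixes.
    fix_deployed_in_isolation: bool


def suggest_mitigation(incident: Incident) -> str:
    """Return a suggested mitigation; the final call stays with the people involved."""
    if incident.severity == Severity.S1:
        # Hot-patching is intrusive, so it is reserved for S1 incidents.
        return "hot-patch"
    if incident.fix_deployed_in_isolation:
        # Rolling back does not expose other fixes in this case.
        return "rollback"
    # Default: revert the fix in the Security repository and deploy.
    return "revert"


if __name__ == "__main__":
    print(suggest_mitigation(Incident(Severity.S3, fix_deployed_in_isolation=False)))
```

The sketch deliberately leaves out the severity of the vulnerability itself. As the S4/S3 example above shows, that comparison can still tip the decision towards a revert.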

Organize the security release.

Most likely, mitigating the incident required reverting the security fix. In that case, AppSec and Release Managers need to prepare the next steps to complete and publish the security release. There are two options:

- Remove the security fix from the security release. AppSec approval is required for this option, and it only applies if the security vulnerability was already disclosed to the public or if the security fix was a security improvement.
- Delay the security release. If the security fix can’t be dropped from the security release, the security release due date needs to be postponed to account for the re-introduction of the security fix.
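The conditions for choosing between the two options can be written down as a small sketch. The function and parameter names are hypothetical, and the result is only a starting point for the conversation between AppSec and Release Managers.

```python
def release_option(
    appsec_approved: bool,
    publicly_disclosed: bool,
    is_security_improvement: bool,
) -> str:
    """Pick between the two options described above (hypothetical helper)."""
    # Removal needs AppSec approval AND one of the two qualifying conditions.
    can_remove = appsec_approved and (publicly_disclosed or is_security_improvement)
    if can_remove:
        return "remove the fix from the security release"
    # Otherwise the fix has to be re-introduced, which postpones the due date.
    return "delay the security release"


if __name__ == "__main__":
    print(release_option(appsec_approved=True,
                         publicly_disclosed=False,
                         is_security_improvement=True))
```

Note that AppSec approval alone is not enough: an undisclosed vulnerability fix cannot be dropped under these rules.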

Additionally, Release Managers should consider “blocking” the security release by preventing more issues from being added. Processing and deploying security fixes added at the last minute could lead to other incidents and delay the security release even further. One way to block a security release is to actively keep an eye on the Tracking Security Issue and ensure no last-minute issues are added.

Reintroduce the security fix to the Security Release.

To avoid disclosing the security vulnerability associated with the incident, the security fix needs to be re-integrated into the Security Release. The DRIs for this task are the team that prepared the fix, AppSec and the TAM. There’s a runbook that guides the re-introduction of a security fix with breaking changes from the development perspective.

Release Managers should assist in scheduling the deployments and confirm that the new due date of the security release is viable.

Delay the security release

Delaying a security release is a decision that belongs to AppSec in coordination with Release Managers. From the Delivery perspective, delaying it by a couple of days is acceptable. However, delaying it further (e.g. 10 days or more) can impact several release activities, from the monthly release to QA:

- Monthly release is blocked - A monthly release can’t be prepared if there’s a pending security release.
- Security fixes are blocked - Customers can’t benefit from other security fixes associated with the security release, e.g. there could be S1/S2 security fixes waiting to be published.
- Risk of undetected bugs - With the nightly builds disabled, nightly tests aren’t running, which could lead to releasing undetected issues.
- No Gitaly updates - If Gitaly security fixes are associated with the security release, the Gitaly updates are paused until the security release is out.
- Limited ability to respond with patch releases, due to process complexity.
- Restriction on future security fixes, because there’s no new tracking issue to associate them with.

Escalate

- The Engineering Manager of Delivery should be made aware of this incident and the impact it could have on Release processes.
- Escalate to the Director of Infrastructure, Platform, when the security release is considerably delayed.
