Overview of the post-deploy migration pipeline

The post-deploy migration (PDM) pipeline executes pending post-deploy migrations on the staging and production environments. Introduced as part of https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/585, it decouples post-deploy migration execution from the auto-deploy packages, which makes those packages suitable for rollback if needed.

Broadly, the PDM pipeline:

  1. Executes post-deploy migrations on staging, then
  2. Triggers a QA pipeline against staging to confirm that tests still pass after the post-deploy migrations have run there, then
  3. Executes post-deploy migrations on production.
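
The ordering above can be sketched as a gated sequence. This is only an illustration, assuming each stage starts only if the previous one succeeded; the names `run_pdm`, `migrate`, and `run_qa` are hypothetical, and the real pipeline is defined as CI jobs rather than Python.

```python
from typing import Callable, List


def run_pdm(
    migrate: Callable[[str], bool],
    run_qa: Callable[[str], bool],
) -> List[str]:
    """Run the three PDM stages in order, stopping at the first failure.

    `migrate` and `run_qa` are hypothetical callbacks that take an
    environment name and return True on success.
    """
    completed: List[str] = []

    # Stage 1: post-deploy migrations on staging.
    if not migrate("staging"):
        return completed
    completed.append("migrate:staging")

    # Stage 2: QA against staging, after the migrations have run there.
    if not run_qa("staging"):
        return completed
    completed.append("qa:staging")

    # Stage 3: post-deploy migrations on production.
    if not migrate("production"):
        return completed
    completed.append("migrate:production")

    return completed
```

In this sketch a QA failure on staging means production is never touched, which is the point of running staging first.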

As with any other deploy pipeline, it also:

  • Notifies the start, end, and status of the execution on Slack.
  • Keeps a record of the post-deploy migrations executed by adding a comment to the monthly release issue.
  • Runs production checks for active incidents and ongoing change requests (CRs) before executing the post-deploy migrations on any environment.
  • Tracks the auto-deploy package running on GitLab.com on the GitLab canonical and security projects at the moment of the PDM execution.

How to execute post-deploy migrations?

To execute the PDM pipeline:

  1. Review the release manager dashboard for any pending post-deploy migrations.

  2. If there are pending post-deploy migrations, trigger the PDM pipeline:
# On Slack
/chatops run post_deploy_migrations execute

Note that the PDM pipeline honors the production checks: post-deploy migrations are not executed if there are active incidents or deployments, or if there are no pending post-deploy migrations.
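
The gating rule in the note above can be written as a single predicate. This is a minimal sketch: `ProductionStatus` and its fields are hypothetical stand-ins for whatever the production checks actually report, not the pipeline's real data model.

```python
from dataclasses import dataclass


@dataclass
class ProductionStatus:
    # Hypothetical fields, named here for illustration only.
    active_incidents: int
    ongoing_deployments: int
    pending_post_migrations: int


def should_execute(status: ProductionStatus) -> bool:
    """Return True only when nothing blocks the run and there is work to do."""
    no_blockers = status.active_incidents == 0 and status.ongoing_deployments == 0
    has_work = status.pending_post_migrations > 0
    return no_blockers and has_work
```

For example, `should_execute(ProductionStatus(0, 0, 3))` is true, while any active incident, any ongoing deployment, or an empty list of pending migrations makes it false.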

Handling post-deploy migrations failures

A failing post-deploy migration should not be retried without investigation, because data could be in an unstable state. Instead:

  1. Follow these steps to request support.
  2. Then work with the SRE on call to determine the next steps.

Report an incident

When creating incident issues, don’t worry about not having all the details yet; a link to a failing pipeline with some basic information is enough to begin. Once the issue is created, make sure to add any additional information to the incident, such as links to the merge request that introduced the migration.

Note that as a release manager your primary task at this point is to coordinate the effort to resolve the incident, rather than trying to resolve it yourself. This helps balance the workload, instead of one person having to do all the work.

If the incident happens at the end of your shift and there is no immediate need to resolve it, make this clear in both the issue and the appropriate Slack channels, and inform the next release manager about the state of things. This ensures the next release manager can take over the work when they begin their shift.

See the Release manager requesting support guide for more details on getting the engineer on call (EOC) and dev on-call involved.

Finding information about the post migration

To give the EOC, developers, and DBREs the information they need to debug the issue, identify the post-deploy migration that failed and the merge request that introduced it.

1. Determine the post-deploy migration that failed. Go to the pipeline that contains the failure: follow the link in the announcements channel message saying that the migrations for the environment failed, then open the $env-postdeploy-migrations job. Scroll to the bottom of the output and note both the error and the migration itself. The migration should have a name like 20220525201022 AddTemporaryIndexForVulnerabilityReadsClusterAgentIdMigration.

2. Determine the MR that introduced the post-deploy migration. Once you have the name of the migration that failed, search for the name (excluding the date part) in the gitlab-org/gitlab project. You should get a result pointing to an .rb file in the db/post_migrate folder. Click on the file in the search results, and then click on the commit message at the top. You should then see a link to the merge request that last touched that file (just below the line with parent). This is the MR that introduced the migration, and it should be noted on the incident issue.

3. Determine the engineers associated with the merge request. Once you have located the MR behind the failure, you can identify the engineers involved:

  • The engineer who created the change (the requester of the MR)
  • The database maintainer who reviewed the MR. To find them, look through the comments until you find an approval from a member of the Database maintainer group; the approval is given by approving the merge request and assigning the ~“database::approved” label.
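
For steps 1 and 2 above, the name copied from the job output can be split into its date part and class name, and the class name turned into a likely file name to search for. The sketch below assumes the name appears as 14 digits followed by a CamelCase class name, and relies on the usual Rails convention that a migration file is named `<timestamp>_<snake_case_class_name>.rb`. The helper names are hypothetical, and the derived path is a search hint, not a guarantee that the file exists.

```python
import re
from datetime import datetime
from typing import NamedTuple


class FailedMigration(NamedTuple):
    version: str           # 14-digit timestamp prefix (the "date part")
    class_name: str        # CamelCase migration class name
    created_at: datetime   # the timestamp parsed as YYYYMMDDHHMMSS


# Assumed shape: "<14 digits> <CamelCaseName>" somewhere in the line.
_MIGRATION_RE = re.compile(r"\b(\d{14})\s+([A-Z][A-Za-z0-9]*)\b")


def parse_failed_migration(line: str) -> FailedMigration:
    """Extract the version and class name from a line of job output."""
    match = _MIGRATION_RE.search(line)
    if match is None:
        raise ValueError(f"no migration name found in: {line!r}")
    version, class_name = match.groups()
    created_at = datetime.strptime(version, "%Y%m%d%H%M%S")
    return FailedMigration(version, class_name, created_at)


def underscore(class_name: str) -> str:
    """Convert CamelCase to snake_case, keeping acronyms together."""
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", class_name)
    s = re.sub(r"([a-z\d])([A-Z])", r"\1_\2", s)
    return s.lower()


def expected_path(migration: FailedMigration) -> str:
    """Build the conventional path of the migration file (an assumption)."""
    return f"db/post_migrate/{migration.version}_{underscore(migration.class_name)}.rb"
```

Searching gitlab-org/gitlab for `class_name` is the lookup the steps describe; the derived path is only a cross-check against the search result.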

Next Steps

  1. Determine if the post deploy migration is safe to retry

Now that you have identified the key people involved in the MR, reach out to them on Slack. Share the incident issue and the error from the failed migration, and ask whether it is safe to retry the migration. If it is, retry the migration job in the pipeline.

  2. Determine if the failure should block deployments

If the post-deploy migration is not safe to retry, analyze it with the EOC and the engineers involved in the MR to determine whether the failure should block deployments. The impact of the failure depends on the nature of the post-deploy migration and the operations it performs. If the failure is deemed to block deployments, treat it as a deployment blocker.

  3. Determine a plan for a mitigation strategy

Work with the EOC and the engineers responsible for the migration on a plan that allows deploys to continue and/or unblocks the post-deploy migration execution. Make sure this plan is documented in the incident once it is agreed upon.

The mitigation depends on the nature of the post-deploy migration, specifically whether it performs DDL or DML operations.
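
As a rough aid for telling the two kinds apart, a SQL statement can be classified by its leading keyword. This is a simplified sketch with a deliberately small keyword list and a hypothetical function name; GitLab migrations are written in Ruby, so it only applies to the SQL they end up issuing, and the final call on a failed migration still belongs to the engineers and the EOC.

```python
# Leading keywords for each kind of operation (not exhaustive).
DDL_KEYWORDS = {"CREATE", "ALTER", "DROP"}
DML_KEYWORDS = {"INSERT", "UPDATE", "DELETE"}


def classify_statement(sql: str) -> str:
    """Return "DDL", "DML", or "unknown" based on the first keyword."""
    stripped = sql.strip()
    if not stripped:
        return "unknown"
    first_word = stripped.split(None, 1)[0].upper()
    if first_word in DDL_KEYWORDS:
        return "DDL"
    if first_word in DML_KEYWORDS:
        return "DML"
    return "unknown"
```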

DDL Operations

Typical DDL operations include adding or dropping tables, columns, foreign keys, or indexes.

Often the best mitigation for failed DDL operations is to:

  1. Have the engineers create an MR to gitlab-org/gitlab that makes the migration in question a no-op

  2. Make sure that a deployment with that MR makes it to all environments

  3. Re-run the post-deploy migration pipeline through ChatOps as described in “How to execute post-deploy migrations?”

DML Operations

Typical DML operations include inserting, updating, or deleting data as part of a background migration or data cleanup exercise.

Depending on the migration in question, a common process for mitigating these failures is to:

  1. Work with the engineers and EOC to create a change request that marks the migration as complete in the database (example).

  2. Work with the engineers and EOC to create a change request that performs any work needed to clean up data from the failed migration (example).

  3. Ask the engineers to follow up on why the failure happened and to determine whether the migration is safe for self-managed customers. This should be documented on the incident.

Another option is to follow the process used for DDL Operations instead.

Final step

Once the incident is resolved, ping the release managers on the incident so that they are aware of it and can follow up on any resulting fixes to ensure those are included in the monthly release.

How to determine if a post-deploy migration has been executed on GitLab.com?

There are three ways to determine if a post-deploy migration has been executed on the GitLab staging and production environments:

  1. Through the merge request widget: In the merge request, if the environment widget indicates db/gstg and db/gprd, the post migration has been executed in staging and in production.
  2. Through the merge request labels: In the merge request, if the ~workflow::post-deploy-db-staging and ~workflow::post-deploy-db-production labels have been added, the post migration has been executed in the environment that matches the label name.
  3. Through ChatOps: Running /chatops run auto_deploy status <sha> outputs the environments the commit has been deployed to. If db/gstg and db/gprd are included, the post migration has been executed in staging and production.
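
The same check applies whether the environment names come from the merge request widget or from the ChatOps output: look for db/gstg and db/gprd. A minimal sketch, assuming the environment names have already been collected into a list (the function name is hypothetical, and no parsing of the widget or ChatOps output is attempted here):

```python
from typing import Dict, Iterable


def post_migration_status(environments: Iterable[str]) -> Dict[str, bool]:
    """Report whether the post migration ran on staging and production.

    `environments` is the list of environment names shown for the commit;
    db/gstg and db/gprd are the markers described in the list above.
    """
    envs = set(environments)
    return {
        "staging": "db/gstg" in envs,
        "production": "db/gprd" in envs,
    }
```

For a made-up input such as `["gstg", "db/gstg", "gprd"]`, this reports staging as executed and production as still pending.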

Utilities

The definition of the PDM pipeline can be found on: