Overview

GitLab.com deployments are initiated by GitLab ChatOps, which triggers a new pipeline on the Deployer project.

To see all available options for deploy, run /chatops run deploy --help

Creating a new deployment for upgrading GitLab

Deployments are initiated using GitLab ChatOps. For example, to initiate a deployment of 11.8.0 to staging:

/chatops run deploy 11.8.0.ee.0 gstg

Override variables

Variables can be set on the deployer pipeline to change its behavior or the jobs that will be executed. See variables.md.
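
As an illustration of the mechanism (the instance URL, project ID, trigger token, and variable name below are all placeholders; the real variable names are documented in variables.md), a pipeline with an override variable can be created through the standard GitLab pipeline trigger API:

    # Illustration only: trigger a deployer pipeline with an override variable set.
    # <gitlab-instance>, <project-id>, <trigger-token> and SOME_OVERRIDE_VARIABLE are placeholders.
    curl --request POST \
      --form "token=<trigger-token>" \
      --form "ref=master" \
      --form "variables[SOME_OVERRIDE_VARIABLE]=true" \
      "https://<gitlab-instance>/api/v4/projects/<project-id>/trigger/pipeline"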

Bypassing Failures

Skipping Production Promotion Checks

In cases where there may be an incident or a change that would block a deploy, perform the necessary investigation to determine if it is safe to proceed with a deploy. If you decide it is safe to proceed, do the following:

  • Set a variable called OVERRIDE_PRODUCTION_CHECKS_REASON on the promote manual job. The value of this variable should be the reasoning behind the decision to ignore the production checks. The reason is placed into the release issue for audit purposes.

For compliance reasons, make sure to have the EOC add a response to the “…started a deployment” comment on the release issue to record their approval of the override.

Deploy using ChatOps command

Another option is to use the --ignore-production-checks parameter of the ChatOps deploy command, with a reason indicating why the checks are being skipped. The reason is placed into the release issue for audit purposes.

Example command:

/chatops run deploy 11.8.0.ee.0 gprd --ignore-production-checks 'Insert a reason for skipping production checks here'

Skipping Canary Promotion Checks

In case a manual deploy to canary is required, the following command can be used:

/chatops run deploy 13.6.202011122020-fe7dcb0a4ee.886b1d4c02a gprd-cny

Skipping Prepare Job Failures

The CI job <ENV>-prepare may fail if a node is down in HAProxy. We can bypass the forced failure on the prepare job by using the --allow-precheck-failure option. Example:

/chatops run deploy 11.8.0.ee.0 gstg --allow-precheck-failure

This essentially sends the variable PRECHECK_IGNORE_ERRORS to the deployment pipeline.

Creating a new deployment for rolling back GitLab to an earlier version

Rollbacks are covered in detail in the rollback runbook.

CI/CD pipeline for a deploy

Assets

Assets are either extracted from the assets docker image, if it is available, or pulled from the omnibus package. This is done with a shell script in the deploy pipeline, in a job called <ENV>-assets that runs at the same time as the database migrations (see CI configuration).

After extraction, they are uploaded to an object storage bucket which serves as the origin for our asset CDN. It is assumed that all assets have hashed filenames, so a long cache lifetime is set (e.g., Cache-Control:public,max-age=31536000).
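
The real logic lives in the deployer scripts; the following is only a rough sketch of the idea, with the image, bucket, and paths replaced by placeholders:

    # Rough sketch only -- the real image, bucket and paths live in the deployer scripts.
    VERSION="11.8.0-ee.0"
    # Extract assets from the assets docker image when it is available...
    docker create --name assets "<registry>/gitlab-assets-ee:${VERSION}"
    docker cp assets:/assets ./public/assets
    docker rm assets
    # ...then upload with a long cache lifetime, since asset filenames are content-hashed.
    gsutil -m -h "Cache-Control:public,max-age=31536000" \
      cp -r ./public/assets "gs://<assets-bucket>/assets/"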

graph TB;

    w[web browser] --- a[Fastly CDN<br/>assets.gitlab-static.net];
    w --- c[CloudFlare CDN<br/>gitlab.com];
    a ---|/assets| b[Object Storage];

    c ---|/*| d[HAProxy];
    c ---|/assets| d[HAProxy];

    a ---|/*| c;
    d ---|/assets| b;
    subgraph fleet
      d --- v[VM Infrastructure];
      d --- k[K8s Infrastructure];
    end

When the browser requests an asset under /assets, the request will be for either assets.gitlab-static.net/assets or gitlab.com/assets:

  • If the request is for assets.gitlab-static.net/assets it will arrive at the Fastly CDN, which is configured to use a GCS object storage bucket as an origin for all requests to /assets.
  • If the request is for gitlab.com/assets, the request will go to the CloudFlare CDN, then HAProxy, which proxies to object storage.
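
One way to observe this routing is to inspect the response headers returned for an asset on each hostname (the asset path below is a placeholder; substitute a real fingerprinted filename):

    # Placeholder asset path -- substitute a real fingerprinted asset filename.
    curl -sI "https://assets.gitlab-static.net/assets/application-<hash>.css" | grep -iE 'cache-control|x-served-by'
    curl -sI "https://gitlab.com/assets/application-<hash>.css" | grep -iE 'cache-control|cf-cache-status'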

There is an outstanding issue to simplify this by removing the Fastly CDN: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11411

The <ENV>-assets job only runs automatically for our canary stages; for other stages it is set to manual. This is done primarily for speed: since assets are uploaded during the canary stage, there is no need to upload them a second time during the main stage deployment. The manual jobs remain for the unlikely case where an expedited deployment package skips canary (for example, because canary is drained and the deployment to canary is skipped). Having the job available means we can proceed by manually running the assets job during such emergencies.

Upgrade Pipeline

The following stages are run for a single environment upgrade:

graph LR
    a>prepare] --> b>track]
    b --> c>migrations and assets]
    c --> d>gitaly]

    subgraph fleet
      d --> e>praefect]
      e --> k8s
      subgraph k8s-workloads/gitlab-com
        k8s>kubernetes]
      end
    end

    k8s --> f>postdeploy migrations]
    f --> g>track]
    g --> h>cleanup]
    h --> i>gitlab-qa]

Change-lock periods for deployment

Before deploying to a stage that receives production traffic, there is an additional check for change-lock periods, which are configured in a configuration file in the change-lock project. This is not yet a product feature (https://gitlab.com/gitlab-org/gitlab-ce/issues/51738), so the deployer pipeline uses a script, published in a docker image, that is run before the production stages of deployment.

This YAML file specifies one or more windows during which the production jobs will fail; cron or date syntax is used to determine the change-lock period.

For more information about specifying change-lock periods, see the project README.

Note: The change-lock check only runs for automated deployments, where DEPLOY_USER=deployer. For a normal ChatOps-based deployment, the user is set to the name of the individual who initiated the deploy.
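
A minimal sketch of that gating logic (check-change-lock and changelock.yml are hypothetical stand-ins; the real check is the script shipped in the change-lock docker image):

    # Sketch only; check-change-lock and changelock.yml are hypothetical stand-ins for the real script and config.
    if [ "${DEPLOY_USER}" = "deployer" ]; then
      # Automated deployment: fail the production job if we are inside a change-lock window.
      check-change-lock --config changelock.yml || exit 1
    else
      echo "Deploy initiated by ${DEPLOY_USER} via ChatOps; change-lock check not enforced."
    fi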

CI/CD pipeline for a rollback

A rollback pipeline has the same stages, except that the gitaly and praefect jobs are optionally executed after all other stages, if required. This is necessary because it’s possible that gitlab-rails will have changes that are incompatible with earlier versions of Gitaly.

graph LR
    a>prepare] --> b>track]
    b --> k8s>kubernetes]

    subgraph k8s-workloads/gitlab-com
      k8s>kubernetes]
    end

    k8s --> c>track]
    c --> d>gitlab-qa]
    d --> e>cleanup]

    subgraph fleet
      f(gitaly-rollback)
      g(praefect-rollback)
    end

    e --> f
    e --> g

    classDef manual stroke-dasharray: 5 5
    class f,g manual

Rollback considerations for database migrations

  • Before initiating a rollback, background migrations should be evaluated. The DBRE on-call should assess the impact of rolling back. Note that clearing the background migration queue may not be the best course of action, as these migrations should be backwards compatible with previous application versions. For information on how to clear queues, see the Sidekiq troubleshooting guide.
  • If the current version introduced one or more post-deployment migrations, these migrations must be reverted before rolling back the code changes. This is a manual process and should be assessed by a DBRE before the rollback is initiated (see the sketch after this list). https://gitlab.com/gitlab-org/release/framework/issues/234 discusses how we can better deal with post-deploy migrations in the context of rollbacks.
  • Regular migrations can be reverted but they are not reverted in the rollback pipeline. Migrations are designed to be backwards compatible with the previous version of application code.
  • Rolling back more than one version without a very thorough review should never be attempted.
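
For reference, a single post-deployment migration can be reverted on a Rails node with the standard Rails mechanism; the version timestamp below is a placeholder, and the exact steps should be agreed with the DBRE:

    # Placeholder timestamp -- use the version of the post-deployment migration being reverted.
    sudo gitlab-rake db:migrate:down VERSION=20190402224749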

Stages in detail

  • prepare: The prepare stage is responsible for all non-destructive changes before a deployment. This is generally where we put checks and notifications before continuing with changes to servers. The following tasks are executed in the prepare stage:

    • notes the start time of the deploy
    • checks to see if the package is indexed in packagecloud
    • verifies haproxy status
    • verifies that there are no critical alerts
    • verifies the version is consistent across the fleet
  • track: We track deployments in this stage as “running”, and then after a deploy has completed, we track a deployment as either “success” or “failure”.

  • migrations: This stage runs online database migrations for staging and canary deployments. We do not run online migrations for production deployments because they are handled by the canary stage.

  • gitaly and praefect deploy: The Gitaly deploy happens before the rest of the fleet in case there are Rails changes in the version being deployed that take advantage of new Gitaly features. If the Gitaly version has not changed, the Omnibus package update is skipped and applied later when Chef runs after the pipeline completes. If the Gitaly version is updated, a Gitaly HUP is issued, which cleanly reloads the service and avoids downtime (see the sketch after this list of stages). Note that we MUST deploy to Gitaly prior to Praefect.

  • kubernetes: This stage includes the necessary job to trigger an upgrade to our Kubernetes infrastructure. Read more about this in section Kubernetes Upgrade.

  • postdeploy migrations: Post-deploy migrations are run last and may be a point of no return for upgrades, as they might make changes that are not backwards compatible with previous versions of the application.

    NOTE: Post-deploy migrations are executed on-demand outside of the deployer pipeline by release-tools. See this epic for more information.

  • cleanup: This stage handles any post-install tasks that need to run at the end of deployment. This includes Grafana annotations and starting Chef across the fleet.

  • gitlab-qa: The very last stage of the pipeline runs a set of QA tests against the environment on which the deploy is running.

    NOTE: QA pipelines are triggered outside of the deployer pipeline by release-tools. See this epic for more information.
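
As a sketch of the Gitaly reload mentioned in the gitaly and praefect step above (hostnames are placeholders, and in practice the reload is driven by the deploy-tooling task files):

    # Placeholder hostnames; in practice the deploy-tooling tasks drive this step.
    for host in gitaly-01.example.internal gitaly-02.example.internal; do
      ssh "${host}" "sudo gitlab-ctl hup gitaly"
    done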

Omnibus Installation

The omnibus installation is handled by a task file that has the following steps:

graph LR;
    a>lock environment] ==> b>install gitlab-ee];
    b ==> c>run gitlab-ctl reconfigure];
    c ==> d>restart services if necessary];
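
Expressed as shell, the task file is roughly equivalent to the following (the package version, and how the environment lock is taken, are illustrative):

    # Rough shell equivalent of the omnibus installation task file (illustrative only).
    VERSION="11.8.0-ee.0"
    # 1. Lock the environment so concurrent deploys cannot overlap (mechanism not shown here).
    # 2. Install the target package version.
    sudo apt-get install -y "gitlab-ee=${VERSION}"
    # 3. Apply configuration.
    sudo gitlab-ctl reconfigure
    # 4. Restart individual services only if the upgrade requires it, for example:
    #    sudo gitlab-ctl restart <service>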

Kubernetes Upgrade

Inside of deployer, we have a helper script, cng-versions-from-deploy-version, which matches the variety of ways that our deployment versions come into deployer and translates them into the format required by our GitLab Helm chart. These properly formatted versions are sent to downstream pipelines as new environment variables. Please view the comments in the script for details.

Also inside of deployer is a job to check and validate that the desired deployment images have been successfully built in our CNG project. This ensures we do not accidentally perform a deploy while missing essential components of our infrastructure. This is the wait_cng_build play.
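
For instance, whether a given image tag has been published can be verified with a registry manifest lookup (all names below are placeholders; wait_cng_build performs its own checks):

    # Placeholders -- substitute the CNG registry, image and version being validated.
    docker manifest inspect "<cng-registry>/<image>:<tag>" > /dev/null \
      && echo "image built" || echo "image missing"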

Deployer then triggers a pipeline in the k8s-workloads/gitlab-com project. The job responsible for the trigger is called <ENV>-kubernetes in deployer. This trigger sends all the variables for the deployment, and can be seen in our deployer .gitlab-ci.yml.

In k8s-workloads/gitlab-com, we have a special set of CI jobs specific to auto-deploy. These can be seen in our .gitlab-ci.yml file. These jobs only operate on the specific environment passed into the triggered pipeline. A dry-run is performed, which checks whether there are any outstanding changes that would be applied to the environment beyond the version change we expect from the deployment. If outstanding changes are detected, this job fails and the changes must be investigated and manually resolved (typically with a pipeline on master) before the job will succeed.
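
Conceptually the dry-run is similar to diffing the rendered chart against what is running in the cluster, for example with the helm-diff plugin (release, chart, and values names below are placeholders; the actual jobs use the project's own scripts and values files):

    # Conceptual illustration only (requires the helm-diff plugin); the real CI jobs
    # use the k8s-workloads/gitlab-com project's own wrappers, charts and values files.
    helm diff upgrade gitlab gitlab/gitlab --namespace gitlab --values values-gstg.yaml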

Repositories overview

graph TD
    subgraph patcher;
    a1[gitlab-ci.yml];
    a2[deploy-tooling submodule];
    end;
    subgraph deployer;
    b1[gitlab-ci.yml];
    b2[deploy-tooling submodule];
    b3[patcher submodule];
    b4[k8s-workloads/gitlab-com trigger];
    end;