Overview
GitLab.com deployments are initiated by GitLab Chatops, which triggers a new pipeline on the Deployer project.
To see all available options for deploy, run /chatops run deploy --help
Creating a new deployment for upgrading GitLab
Deployments are initiated using GitLab Chatops. For example, to initiate a deployment of 11.8.0 to staging:
/chatops run deploy 11.8.0.ee.0 gstg
Override variables
Variables can be set on the deployer pipeline to change its behavior or the jobs that will be executed. See variables.md.
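As a rough illustration only, a one-off deployer pipeline with a variable set could be started through the standard GitLab pipeline trigger API; the project ID, trigger token, and variable name below are placeholders and not values from this runbook (see variables.md for the variables the pipeline actually honours):

```bash
# Hypothetical sketch: start a deployer pipeline with a variable set.
# DEPLOYER_PROJECT_ID and TRIGGER_TOKEN are placeholders; the variable name is
# illustrative only.
curl --request POST \
  --form "token=${TRIGGER_TOKEN}" \
  --form "ref=master" \
  --form "variables[DEPLOY_ENVIRONMENT]=gstg" \
  "https://ops.gitlab.net/api/v4/projects/${DEPLOYER_PROJECT_ID}/trigger/pipeline"
```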
Bypassing Failures
Skipping Production Promotion Checks
In cases where there may be an incident or a change that would block a deploy, perform the necessary investigation to determine if it is safe to proceed with a deploy. If you decide it is safe to proceed, do the following:
- Set a variable in the promote manual job called OVERRIDE_PRODUCTION_CHECKS_REASON. The value of this variable should be the reasoning behind the decision to ignore the production checks. The reason is placed into the release issue for audit purposes.
For compliance reasons, make sure to have the EOC add a response to the “…started a deployment” comment on the release issue to record their approval of the override.
Deploy using ChatOps command
Another option is to use the ignore-production-checks parameter of the chatops deploy command with a reason indicating why these checks are skipped. The reason is placed into the release issue for audit purposes.
Example command:
/chatops run deploy 11.8.0.ee.0 gprd --ignore-production-checks 'Insert a reason for skipping production checks here'
Skipping Canary Promotion Checks
In case a manual deploy to canary is required, the following command can be used:
/chatops run deploy 13.6.202011122020-fe7dcb0a4ee.886b1d4c02a gprd-cny
Skipping Prepare Job Failures
The CI job <ENV>-prepare may fail if a node is down in HAProxy. We can bypass the forced failure on the prepare job by using the option allow-precheck-failure. Example:
/chatops run deploy 11.8.0.ee.0 gstg --allow-precheck-failure
This essentially sends the variable PRECHECK_IGNORE_ERRORS to the deployment pipeline.
Creating a new deployment for rolling back GitLab to an earlier version
Rollbacks are covered in detail in the rollback runbook.
CI/CD pipeline for a deploy
Assets
Assets are either extracted from the assets Docker image, if it is available, or pulled from the Omnibus package.
This is done with a shell script in the deploy pipeline, in a job called <ENV>-assets that runs at the same time as database migrations (see CI configuration).
After extraction they are uploaded to an object storage bucket which serves as the origin for our asset CDN.
It is assumed that all assets have hashed filenames, so a long cache lifetime is set (e.g., Cache-Control:public,max-age=31536000).
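As a rough sketch of what that extraction and upload amounts to (the image path, bucket name, and local paths below are placeholders, not the script's real values):

```bash
# Illustrative sketch of the <ENV>-assets job body; paths and bucket are placeholders.
VERSION="13.6.202011122020-fe7dcb0a4ee.886b1d4c02a"   # example auto-deploy version

# Extract the compiled assets from the assets image (the fallback to the
# omnibus package is not shown here).
container=$(docker create "registry.example.com/gitlab-assets-ee:${VERSION}")
docker cp "${container}:/assets" ./public/assets
docker rm "${container}"

# Upload to the object storage bucket that serves as the CDN origin, with a long
# cache lifetime since the filenames are content-hashed.
gsutil -m -h "Cache-Control:public,max-age=31536000" \
  cp -r ./public/assets "gs://example-assets-bucket/assets/"
```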
graph TB;
w[web browser] --- a[Fastly CDN<br/>assets.gitlab-static.net];
w --- c[CloudFlare CDN<br/>gitlab.com];
a ---|/assets| b[Object Storage];
c ---|/*| d[HAProxy];
c ---|/assets| d[HAProxy];
a ---|/*| c;
d ---|/assets| b;
subgraph fleet
d --- v[VM Infrastructure];
d --- k[K8s Infrastructure];
end
When the browser requests an asset under /assets, it will either be for assets.gitlab-static.net/assets or gitlab.com/assets:
- If the request is for assets.gitlab-static.net/assets, it will arrive at the Fastly CDN, which is configured to use a GCS object storage bucket as an origin for all requests to /assets.
- If the request is for gitlab.com/assets, the request will go to the CloudFlare CDN, then HAProxy, which proxies to object storage.
There is an outstanding issue to simplify this by removing the Fastly CDN: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11411
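To check which path served a given asset, you can inspect the response headers; the asset path below is a placeholder and the exact headers each CDN adds can vary:

```bash
# Fastly-fronted asset host: look for Fastly headers such as X-Served-By / X-Cache.
curl -sI "https://assets.gitlab-static.net/assets/some-asset.js" | grep -iE 'x-served-by|x-cache|via'

# gitlab.com path: look for Cloudflare headers such as CF-Ray / Server: cloudflare.
curl -sI "https://gitlab.com/assets/some-asset.js" | grep -iE 'cf-ray|server|x-cache'
```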
The <env>-assets job runs automatically only on our canary stages; it is set to manual for other stages. This is done primarily for speed: since assets are uploaded during the canary stage, there is no need to upload them a second time during the main stage deployment. The manual jobs remain for the unlikely case where a deployment package is expedited and canary is drained, so the deployment to canary is skipped. Having the job available means we can run the assets job manually during such emergency situations.
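If the canary stage was skipped, the manual <ENV>-assets job can be played from the pipeline UI or through the jobs API, roughly as sketched below; the project ID, job ID, and token are placeholders.

```bash
# Hypothetical sketch: play a manual job on the deployer pipeline via the API.
# DEPLOYER_PROJECT_ID, ASSETS_JOB_ID and GITLAB_TOKEN are placeholders.
curl --request POST \
  --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
  "https://ops.gitlab.net/api/v4/projects/${DEPLOYER_PROJECT_ID}/jobs/${ASSETS_JOB_ID}/play"
```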
Upgrade Pipeline
The following stages are run for a single environment upgrade:
graph LR
a>prepare] --> b>track]
b --> c>migrations and assets]
c --> d>gitaly]
subgraph fleet
d --> e>praefect]
e --> k8s
subgraph k8s-workloads/gitlab-com
k8s>kubernetes]
end
end
k8s --> f>postdeploy migrations]
f --> g>track]
g --> h>cleanup]
h --> i>gitlab-qa]
Change-lock periods for deployment
Before deploying to a stage that receives production traffic, there is an additional check for change-lock periods, which are configured in a configuration file in the change-lock project. This is not yet a product feature (https://gitlab.com/gitlab-org/gitlab-ce/issues/51738), so the deployer pipeline uses a script published in a Docker image that is run before the production stages of deployment.
This YAML file specifies one or more windows during which the production jobs will fail; it uses a cron or date syntax to determine the change-lock period.
For more information about specifying change-lock periods, see the project README.
Note: The change-lock check only runs for automated deployments, where DEPLOY_USER=deployer. For a normal ChatOps-based deployment, the user will be set to the name of the individual who initiated the deploy.
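Conceptually, the check amounts to something like the sketch below; the window dates and the script's interface are illustrative assumptions, and the real logic lives in the change-lock project.

```bash
# Illustrative sketch only: fail the production stage when inside a change-lock
# window and the deploy was started automatically (DEPLOY_USER=deployer).
window_start=$(date -u -d "2019-12-20 00:00" +%s)   # example window, not a real entry
window_end=$(date -u -d "2020-01-02 00:00" +%s)
now=$(date -u +%s)

if [ "${DEPLOY_USER}" = "deployer" ] && [ "${now}" -ge "${window_start}" ] && [ "${now}" -lt "${window_end}" ]; then
  echo "Inside a change-lock window; refusing to continue with production stages" >&2
  exit 1
fi
```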
CI/CD pipeline for a rollback
A rollback pipeline has the same stages, except that the gitaly and praefect jobs are optionally executed after all other stages, if required. This is necessary because it is possible that gitlab-rails will have changes that are incompatible with earlier versions of Gitaly.
graph LR
a>prepare] --> b>track]
b --> k8s>kubernetes]
subgraph k8s-workloads/gitlab-com
k8s>kubernetes]
end
k8s --> c>track]
c --> d>gitlab-qa]
d --> e>cleanup]
subgraph fleet
f(gitaly-rollback)
g(praefect-rollback)
end
e --> f
e --> g
classDef manual stroke-dasharray: 5 5
class f,g manual
Rollback considerations for database migrations
- Before initiating a rollback, background migrations should be evaluated. The DBRE on-call should assess the impact of rolling back. Note that clearing the background migration queue may not be the best course of action, as these migrations should be backwards compatible with previous application versions. For information on how to clear queues, see the Sidekiq troubleshooting guide.
- If the current version introduced one or more post-deployment migrations, these migrations must be reverted before rolling back the code changes (see the sketch after this list). This is a manual process and should be assessed by a DBRE before the rollback is initiated. https://gitlab.com/gitlab-org/release/framework/issues/234 discusses how we can better deal with post-deploy migrations in the context of rollbacks.
- Regular migrations can be reverted but they are not reverted in the rollback pipeline. Migrations are designed to be backwards compatible with the previous version of application code.
- Rolling back more than one version without a very thorough review should never be attempted.
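For reference, checking the background migration queue and reverting a single post-deployment migration on a Rails node look roughly like the commands below; the queue name is assumed, the migration version is a placeholder, and any actual revert must be coordinated with the DBRE on-call.

```bash
# Check how many background migration jobs are still queued (queue name assumed
# to be "background_migration").
sudo gitlab-rails runner 'puts Sidekiq::Queue.new("background_migration").size'

# Revert a single post-deployment migration by its timestamp version.
# 20190402224749 is a placeholder, not a real migration to revert.
sudo gitlab-rake db:migrate:down VERSION=20190402224749
```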
Stages in detail
prepare: The prepare stage is responsible for all non-destructive changes before a deployment. This is the general place where we put checks and notifications before continuing with changes to servers. The following tasks are executed in the prepare stage (one of them is sketched after this list):
- notes the start time of the deploy
- checks to see if the package is indexed in packagecloud
- verifies haproxy status
- verifies that there are no critical alerts
- verifies the version is consistent across the fleet
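As an illustration of the version-consistency check above, comparing the installed gitlab-ee package across a few hosts could look like the sketch below; the host names are placeholders and the real checks are implemented in deploy-tooling.

```bash
# Illustrative only: every host should report the same gitlab-ee version.
# The host list is a placeholder; deploy-tooling drives the real check.
hosts="web-01.example.com web-02.example.com api-01.example.com"

for host in ${hosts}; do
  printf '%s: ' "${host}"
  ssh "${host}" "dpkg-query -W -f='\${Version}\n' gitlab-ee"
done
```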
track: We track deployments in this stage as “running”, and then after a deploy has completed, we track a deployment as either “success” or “failure”.
migrations: This stage runs online database migrations for staging and canary deployments. We do not run online migrations for production deployments because they are handled by the canary stage.
gitaly and praefect deploy: The Gitaly deploy happens before the rest of the fleet in case there are Rails changes in the version being deployed that take advantage of new Gitaly features. If the Gitaly version is not changed, the Omnibus package update is skipped and applied later when Chef runs after the pipeline completes. If the Gitaly version is updated, a Gitaly hup is issued, which cleanly reloads the service and avoids downtime. Note that we MUST deploy to Gitaly prior to Praefect.
kubernetes: This stage includes the necessary job to trigger an upgrade to our Kubernetes infrastructure. Read more about this in the Kubernetes Upgrade section below.
postdeploy migrations: Post-deploy migrations are run last and may be a point-of-no-return for upgrades, as they might make changes that are not backwards compatible with previous versions of the application.
NOTE: Post-deploy migrations are executed on-demand outside of the deployer pipeline by release-tools. See this epic for more information.
cleanup: This stage handles any post-install tasks that need to run at the end of deployment. This includes Grafana annotations and starting Chef across the fleet.
gitlab-qa: The very last stage of the pipeline runs a set of QA tests against the environment on which the deploy is running.
NOTE: QA pipelines are triggered outside of the deployer pipeline by release-tools. See this epic for more information.
Omnibus Installation
The omnibus installation is handled by a task file that has the following steps:
graph LR;
a>lock environment] ==> b>install gitlab-ee];
b ==> c>run gitlab-ctl reconfigure];
c ==> d>restart services if necessary];
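In shell terms, the task roughly amounts to the following on an apt-based node; the environment lock is handled by deploy-tooling and not shown, and the package version is only an example.

```bash
# Rough shell equivalent of the omnibus installation steps on an apt-based node.
TARGET_VERSION="13.6.2-ee.0"   # example package version

sudo apt-get update
sudo apt-get install -y "gitlab-ee=${TARGET_VERSION}"

# Apply the new configuration.
sudo gitlab-ctl reconfigure

# Restart services; the real task only restarts those that need it, restarting
# everything is the blunt fallback shown here.
sudo gitlab-ctl restart
```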
Kubernetes Upgrade
Inside deployer, we have a helper script, cng-versions-from-deploy-version, which takes the variety of ways that our deployment versions come into deployer and translates them into the correct format required by our GitLab Helm chart. These properly formatted versions are sent to downstream pipelines as new environment variables. Please view the comments in the script for details.
Also inside deployer is a job that checks and validates that the desired deployment images have been successfully built in our CNG project. This ensures we do not accidentally perform a deploy while missing essential components of our infrastructure. This is the wait_cng_build play.
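Conceptually, the check confirms that each required image tag already exists in the CNG registry before anything is triggered downstream; a minimal sketch of that idea, with an assumed registry path and tag, looks like this:

```bash
# Illustrative sketch: verify that a CNG image tag exists before deploying.
# The registry path and tag are assumptions for the example, not the exact
# values the wait_cng_build play uses.
IMAGE="registry.gitlab.com/gitlab-org/build/cng/gitlab-webservice-ee"
TAG="v13.6.2"

if docker manifest inspect "${IMAGE}:${TAG}" > /dev/null 2>&1; then
  echo "Image ${IMAGE}:${TAG} is available"
else
  echo "Image ${IMAGE}:${TAG} has not been built yet" >&2
  exit 1
fi
```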
Deployer will then trigger a pipeline to the k8s-workloads/gitlab-com project.
The job responsible for the trigger is called <ENV>-kubernetes in deployer. This trigger sends all the variables for the deployment. The trigger can be seen in our deployer .gitlab-ci.yml.
In k8s-workloads/gitlab-com, we have a special set of CI jobs specific to auto-deploy. These can be seen in our .gitlab-ci.yml file. These jobs only operate on the specific environment passed into the triggered pipeline. A dry run is performed to check whether there are any outstanding changes that would be applied to the environment beyond the change we expect for deploying a new version. This job fails if outstanding changes are detected; they must be investigated and manually resolved (typically with a pipeline on master) before the job will succeed.
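The dry run is conceptually similar to rendering the manifests for the environment and diffing them against the cluster; the sketch below illustrates the idea rather than the project's exact tooling (render-manifests.sh is a placeholder).

```bash
# Illustration of the dry-run concept: render the gitlab-com manifests for the
# target environment and diff them against what is currently running.
# render-manifests.sh is a placeholder for however the charts are rendered.
./render-manifests.sh "${ENVIRONMENT}" > rendered.yaml

# kubectl diff exits non-zero when differences are found; anything beyond the
# expected version bump must be investigated before the job can succeed.
kubectl diff -f rendered.yaml
```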
Repositories overview
graph TD
subgraph patcher;
a1[gitlab-ci.yml];
a2[deploy-tooling submodule];
end;
subgraph deployer;
b1[gitlab-ci.yml];
b2[deploy-tooling submodule];
b3[patcher submodule];
b4[k8s-workloads/gitlab-com trigger];
end;
- gl-infra/deployer is the repository where the CI/CD configuration is maintained for defining the pipeline. It is sourced on https://ops.gitlab.net with a public mirror on https://gitlab.com.
- gl-infra/patcher for engineering is the repository where post-deployment patches live and can be accessed by all of engineering.
- gl-infra/patcher push mirror is the repository where post-deployment patcher pipelines are run. It is a private repo that can only be accessed by SREs.
- gl-infra/deploy-tooling is a common repository that is used as a submodule for all other repos that require Ansible code. This repository contains the plays, plugins and scripts for deployment. It is sourced on https://ops.gitlab.net with a public mirror on https://gitlab.com.
- k8s-workloads/gitlab-com is the repository that contains all the necessary components for GitLab.com that operate on Kubernetes Infrastructure.