How (and Why) We Moved to Spinnaker

Background

Just after the middle of last year, Target expanded beyond its on-prem infrastructure and began deploying portions of target.com to the cloud. The deployment platform was homegrown (codename Houston) and backed wholly by our public cloud provider. While in some respects that platform was on par with other prominent continuous deployment offerings, the actual method of deploying code was cumbersome and did not adhere to cloud best practices. These shortcomings led to a brief internal evaluation of various CI/CD platforms, which in turn led us to Spinnaker.

We chose Spinnaker because it integrates with CI tools we already use at scale (Jenkins), supports deploying to all major public cloud providers, and enforces software deployment best practices: every deployment is performed from an immutable image, a snapshot of config + code.

Supporting a Platform

The primary goal of Target’s cloud platform is to enable product teams to deploy and manage their applications across multiple cloud providers. We provide CI/CD, monitoring, and service discovery as services, and any application deployed through our platform gets those capabilities from a base image that is pre-configured to connect to each service’s endpoint.
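
To make that concrete, the pre-configuration amounts to baking the endpoints into the image itself. Here is a toy sketch of that kind of provisioning step; the endpoints and file path are hypothetical placeholders, not our actual configuration:

```python
#!/usr/bin/env python3
"""Toy bake-time provisioning step: write the platform's service
endpoints into the image so anything built on it can reach monitoring
and service discovery out of the box. Endpoints and paths are
hypothetical placeholders."""

import json
from pathlib import Path

# Hypothetical endpoints; in reality these come from platform config.
PLATFORM_SERVICES = {
    "service_discovery": "https://discovery.platform.example.com:8500",
    "monitoring": "https://metrics.platform.example.com:2003",
}


def write_platform_config(dest="/etc/platform/services.json"):
    """Persist the endpoint map where agents on the image expect it."""
    path = Path(dest)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(PLATFORM_SERVICES, indent=2))


if __name__ == "__main__":
    write_platform_config()
```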

Since these components are essentially products we provide to internal customers, we had to ensure the new CD platform was operationally supportable and highly available. So, as soon as we decided on Spinnaker, a handful of engineers from the Cloud Platform group set about making this happen.

Default Spinnaker scripts make it easy to stand up a single self-contained server with the microservices and persistence layer all together, but that wasn’t conducive to blue-green deployments, which would let us update Spinnaker without downtime for our internal customers.
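
The pattern we wanted is straightforward: stand up the new stack alongside the old one, verify it is healthy, shift traffic, then retire the old stack. Here is a minimal sketch of that cutover; the deploy, load-balancer, and teardown helpers are hypothetical placeholders, not our actual tooling:

```python
"""Minimal sketch of a blue-green cutover for a Spinnaker stack.
The deploy/load-balancer/teardown helpers are hypothetical
placeholders, not our actual tooling."""

import time
import urllib.request


def is_healthy(url, attempts=30, delay=10):
    """Poll a health endpoint until it returns HTTP 200 or we give up."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass
        time.sleep(delay)
    return False


def cut_over(deploy_stack, repoint_lb, destroy_stack,
             new_stack="spinnaker-green", old_stack="spinnaker-blue"):
    """Bring up the new stack next to the old one, then shift traffic."""
    health_url = deploy_stack(new_stack)   # provision the stack, return its health URL
    if not is_healthy(health_url):
        destroy_stack(new_stack)           # roll back; the old stack is untouched
        raise RuntimeError("new stack never became healthy")
    repoint_lb(new_stack)                  # users now hit the new stack
    destroy_stack(old_stack)               # retire the old stack
```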

We built jobs to package each Spinnaker component from the master branch of its upstream git repository, and wrote Terraform plans to manage the deployment of each stack. We set about making each component as resilient as possible. Front50, the Spinnaker component responsible for data persistence, uses Cassandra by default. Managing a Cassandra ring added too much overhead just to maintain Spinnaker’s configuration, so we borrowed a play from Netflix and configured Front50 to persist configuration in cloud storage instead. We’re also using the cloud provider’s managed cache offering rather than running our own highly-available Redis cluster.

Overcoming Challenges

We had our fair share of challenges along the way. We were running into various rate-limiting and throttling issues when hitting the cloud provider APIs. Spinnaker makes a lot of API calls in order to have a responsive user experience, but the cloud provider will only allow a small number of API calls per minute. Fortunately, Netflix also wrote a tool, Edda, to cache cloud resources. We configured Spinnaker to use that instead of a direct connection to the cloud provider’s API, which seems to be scaling much better.
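
The idea is simple: one poller talks to the provider on a fixed schedule, and everything else reads from its cache, so Spinnaker’s read volume never reaches the provider. Here is a toy sketch of that pattern (not how Edda is actually built), where describe_instances stands in for a real provider API call:

```python
"""Toy read-through cache in the spirit of Edda: one background poller
talks to the cloud provider; everything else reads the cached snapshot.
`describe_instances` stands in for a real provider call."""

import threading
import time


class CloudResourceCache:
    def __init__(self, describe_instances, poll_interval=60):
        self._describe = describe_instances  # the only code that hits the provider
        self._interval = poll_interval
        self._snapshot = []
        self._lock = threading.Lock()

    def start(self):
        threading.Thread(target=self._poll_forever, daemon=True).start()

    def _poll_forever(self):
        while True:
            data = self._describe()          # one API call per interval
            with self._lock:
                self._snapshot = data
            time.sleep(self._interval)

    def instances(self):
        """Any number of reads cost zero provider API calls."""
        with self._lock:
            return list(self._snapshot)
```

However many times the UI or a pipeline asks for instance state, the provider sees only one call per polling interval, which keeps us well under the rate limit.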

The other major challenge we faced was how to handle baking images for multiple regions. Originally we had configured Spinnaker to bake in our western region and then copy the image to the east. That was SLOW: the bake itself took about 5 minutes, but the copy took another 20-40 minutes. Fortunately, Spinnaker supports parallel multi-region baking: it takes the same base OS image in each region and installs the same packages on it, which in theory results in the same config + code everywhere. Unfortunately, the way Netflix implemented it worked for multiple regions in the same account, but not for separate accounts per region, which is how Target operates. One of our engineers found a workaround, and ultimately worked with an engineer at Netflix to get a more elegant solution incorporated upstream. Now that we can bake in parallel, the whole step takes 5 minutes instead of 25-45 minutes. A big improvement.
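
The speedup falls out of running the per-region bakes concurrently instead of copying a finished image across regions. Here is a minimal sketch of the pattern, where bake_image is a hypothetical stand-in for a single-region bake:

```python
"""Parallel multi-region bake, sketched with a thread pool.
`bake_image` is a hypothetical per-region bake (~5 minutes each);
running the bakes concurrently means the whole step takes roughly as
long as the slowest single bake, instead of bake + a 20-40 minute copy."""

from concurrent.futures import ThreadPoolExecutor

REGIONS = ["west", "east"]  # separate accounts per region in our setup


def bake_all(bake_image, packages):
    """Bake the same packages onto each region's base image in parallel."""
    with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
        futures = {region: pool.submit(bake_image, region, packages)
                   for region in REGIONS}
        # Each region produces its own image ID; the same base OS plus the
        # same packages should, in theory, yield the same config + code.
        return {region: future.result() for region, future in futures.items()}
```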

OpenStack Driver

One of the reasons we chose Spinnaker was its pluggable architecture, and we knew early on that we would put that to good use. Target has a sizable internal OpenStack environment, and we wanted to be able to leverage Spinnaker to deploy there, so we started development of a native OpenStack clouddriver.

The process worked exactly how contributing to an open source project should work. We asked the community if anyone else was interested in collaborating; Veritas Corporation, who also needed OpenStack support in Spinnaker, stepped up, and we set to work with several of their engineers. We met with core Spinnaker engineers from Netflix and Google, and they asked us to submit pull requests directly against the master branch in the public GitHub repository so we could get rapid feedback on the changes we were making.

A small group of engineers began speccing out the work and started development in late May, and by the end of September the driver had reached what we would call a stable state.

Autoscaling in OpenStack

During development of the OpenStack driver, we ran into an issue with the way autoscaling is implemented in Heat (OpenStack’s orchestration engine, which launches composite cloud applications from text-based templates that can be treated like code).

First, the APIs for load balancers didn’t support automatically adding instances to a member pool until the Mitaka release of OpenStack. We were running an earlier version, so our private cloud engineers swarmed and quickly upgraded our environment to Mitaka.

Second, and more troublesome, Heat doesn’t track the disparity between a scaling group’s desired instance count and its actual count. As a result, autoscaling in OpenStack isn’t really automatic; it requires some kind of intervention. To work around this, we updated the driver to mark the server group unhealthy in Heat whenever it detects a discrepancy between desired and actual.
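
Conceptually, the workaround is a reconciliation check: compare what the scaling group should be running with what is actually healthy, and flag the group when they disagree so Heat converges it back. Here is a minimal sketch, with hypothetical helpers standing in for the Heat and load-balancer API calls (not the clouddriver’s real implementation):

```python
"""Sketch of the desired-vs-actual reconciliation behind the workaround.
The three helpers are hypothetical stand-ins for Heat / load-balancer
API calls, not the clouddriver's real implementation."""


def reconcile(server_group,
              get_desired_count,      # e.g. read the scaling group's desired size
              count_healthy_members,  # e.g. count in-service pool members
              mark_unhealthy):        # e.g. flag the Heat resource as unhealthy
    """Flag the group so Heat will replace missing instances."""
    desired = get_desired_count(server_group)
    actual = count_healthy_members(server_group)
    if actual != desired:
        # Heat will not notice this drift on its own; marking the resource
        # unhealthy forces it to converge back to the desired count.
        mark_unhealthy(server_group,
                       reason=f"{actual} healthy of {desired} desired")
    return desired, actual
```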

What’s Next?

We’re extremely proud to share the OpenStack driver in Spinnaker with the community, and we hope that any organization that is using OpenStack will leverage the new driver to enable immutable deployments and autoscaling in that environment.

We’re currently growing our use of container-based deployments via Kubernetes, both internally and on public cloud providers. Spinnaker’s Kubernetes driver will enable us to deploy to any or all of the k8s clusters we’re running. Look for more about how we deploy and consume Kubernetes in a future techblog post.