Target and Elasticsearch: Maintaining an ELK stack over Peak Season

One of the strongest benefits of launching an application into the cloud is the pure on-demand scalability that it provides. I’ve had the privilege of working with the ELK stack (Elasticsearch, Logstash, Kibana) for log aggregation for the past two years. When we started, we were pleased with our search and query performance with tens of gigabytes of data in the production cluster. When Peak time hit, we reveled as our production clusters successfully managed half a terabyte of data(!). During Peak, Target hosted 14 Elasticsearch clusters in the cloud containing more...

Surviving (and thriving) Through Peak Season 2016 on the Digital Observability Team

It is 7:30 AM on a Monday morning in late October. I am waiting in line at Cafe Donuts to bring my team breakfast for our mandated ‘no work for one hour’. We just wrapped up a strenuous week of implementing a major upgrade to our Elastic logging cluster. Many digital teams are relying on this upgrade to position themselves to confidently monitor their application health during the most important day in retail: Black Friday. It was a successful, much-anticipated upgrade that resulted in many hours of overtime, late-night calls, and cross-team performance tests. Morale is high,...

Hadoop Rolling Upgrades

Hadoop upgrades over the last few years meant long outages where the Big Data platform team would shut down the cluster, perform the upgrade, start services, and then complete validation before notifying users it was OK to resume activity. This approach is a typical pattern for major upgrades even outside Target and reduces the complexity and risks associated with the upgrade. While this worked great for the platform team, it was not ideal for the hundreds of users and thousands of jobs that were dependent on the platform. That is why we decided to shake things up and go all in...

How (and Why) We Moved to Spinnaker

Background: Just after the middle of last year, Target expanded beyond its on-prem infrastructure and began deploying portions of target.com to the cloud. The deployment platform was homegrown (codename Houston), and was backed wholly by our public cloud provider. While in some respects that platform was on par with other prominent continuous deployment offerings, the actual method of deploying code was cumbersome and did not adhere to cloud best practices. These shortcomings led to a brief internal evaluation of various CI/CD platforms, which in turn led us to Spinnaker. We chose Spinnaker because it integrates with CI tools we already use...

Distributed Troubleshooting

Target’s open source big data platform contains a vast array of clustered technologies or ecosystems working together. Troubleshooting an issue within a single ecosystem is a difficult task, let alone an issue that spans several ecosystems. It is impractical for a single human to individually investigate ecosystems one at a time for potential problems. Without quick access to aggregated system metrics and logs, the house will burn to the ground long before an engineer can find the cause of an issue and resolve it. The Solution: How to identify, troubleshoot, and resolve a distributed issue? Fight fire with fire of...

Win the cloud with Winnaker!

I am happy to announce that we, at Target, decided to open source a tool called Winnaker. This tool will allow the user to audit Spinnaker from an end-user point of view. But first, what is Spinnaker? The first time I heard the word Spinnaker, my reaction was, “wait, what does that even mean in English?” Shortly after, I found myself implementing a demo of Spinnaker as a potential replacement for our internal cloud deployment tool. Spinnaker is a cloud-agnostic continuous delivery tool, which means we can push our code to any cloud...