REDstack

Hadoop Clusters as a Service

REDstack is Now Open Source!

We are officially open-sourcing REDstack, our sandbox tool for Big Data development at Target.

What is REDstack?

REDstack is a tool for provisioning kerberized clusters on OpenStack. We created it with four goals in mind:

  • Provide a secure environment, with the ability to leverage preconfigured LDAP and Kerberos servers.
  • Provide out-of-the-box usability, allowing you to log in with preconfigured user accounts.
  • Provide custom user-management utilities to administer the cluster.
  • Provide a fully customizable experience: everything is a configuration option in your build files (see the sketch after this list):
    • Cluster size, node sizes, types of nodes and node roles,
    • Hadoop configurations, heap sizes, and components,
    • All users, passwords, and secure assets.
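
To make that last goal concrete, here is a minimal sketch of the kinds of options a build file might cover, expressed as a Python dict for illustration. The field names and values are assumptions, not REDstack's actual schema; the repository documents the real build-file format.

```python
# Hypothetical build-file options, shown as a Python dict for illustration.
# These names are assumptions; REDstack's real build files define the schema.
cluster_config = {
    "cluster": {
        "name": "dev-sandbox",
        "node_count": 5,                       # cluster size
        "flavors": {"master": "m1.xlarge",     # node sizes per role
                    "worker": "m1.large"},
        "roles": {"master": ["namenode", "resourcemanager"],
                  "worker": ["datanode", "nodemanager"]},
    },
    "hadoop": {
        "components": ["HDFS", "YARN", "ZOOKEEPER"],
        "heap_sizes": {"namenode_mb": 4096, "datanode_mb": 2048},
    },
    "security": {
        "users": [{"name": "alice", "password": "change-me"}],
        "kerberos_realm": "EXAMPLE.COM",
    },
}
```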

Components

REDstack is made up of two major components:

  1. hdp-cloud - The cookbook
    • The cookbook is used by the application itself to install components and lay down cluster configuration.
    • The cookbook can be used independently of REDstack to manually provision a cluster.
  2. REDstack - The orchestration component
    • REDstack is a Python application that handles all of the high-level orchestration and timing associated with a full Hadoop installation:
      • Orchestrates the provisioning of resources over OpenStack APIs,
      • Controls and monitors parallel Chef deployment across the cluster,
      • Manages and monitors cluster component installation via HTTPS requests (a minimal sketch of this follows below).

    REDstack is bundled with a Docker image in which the configs are set up locally before an installation and all of the dependencies are already updated and configured.
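
To illustrate the last of those responsibilities, here is a minimal sketch of monitoring a component install by polling a cluster manager's REST API over HTTPS. The endpoint URL, JSON fields, and status values are assumptions for illustration, not REDstack's actual requests.

```python
import time

import requests  # third-party HTTP client


# Hypothetical status endpoint; the real resource depends on the cluster
# manager REDstack is driving (e.g., its install-request API).
STATUS_URL = "https://cluster-manager.example.com/api/v1/requests/1"


def wait_for_install(timeout_s: int = 3600, poll_s: int = 30) -> bool:
    """Poll the install request over HTTPS until it completes or times out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(STATUS_URL, verify=True, timeout=30)
        resp.raise_for_status()
        # "status" and its values are assumed field names for this sketch.
        status = resp.json().get("status", "PENDING")
        if status == "COMPLETED":
            return True
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"Install ended in state {status}")
        time.sleep(poll_s)
    return False
```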

How to Get Started

Head over to the repository at https://github.com/target/redstack and follow along. The repo has instructions on how to build and configure the clusters using the included Docker image.

History of the Project

Target’s Big Data Platform Team manages multiple Big Data environments, with hundreds of nodes and many petabytes of data. As mentioned in our prior blog posts, we depend heavily on Chef as a core part of our CI/CD pipeline. During our testing a couple of years ago, we identified a big opportunity: a way to do full integration testing with our Chef cookbooks. That opportunity opened the door for a new product.

Origins

Early on, we released a product internally called PushButton. This was a three-node cluster that ran on a standard-issue laptop at Target. It was kerberized, and it worked well as a sandbox environment for developers because it had the same security setup as the real clusters. PushButton, however, did not use the same cookbooks as the main cluster. We wanted something that could do what PushButton did, but with our real cookbooks in a larger environment. By creating a full cluster from scratch, we would be able to understand exactly how the cookbooks would function when applied to new nodes and make sure the cookbooks were in a constant working state. We also needed a somewhat larger sandbox; otherwise we would have difficulty testing high-availability (HA) components, or those that run across multiple nodes, like Apache ZooKeeper.

Toward REDstack

Our early exploration started by trying to adapt the existing PushButton work onto OpenStack. We used shell scripts to automate the creation of instances, Knife to bootstrap the nodes with the Chef recipes, and cURL requests to automate and monitor the install process. We got it working, but we still had to face our biggest challenge yet: integrating cookbooks meant for physical hardware onto virtualized nodes. The cookbooks expected particular configurations, such as physical drives already formatted and partitioned, or master services already defined and running. Instead, we had to get them working with our minimum 1x50GB virtual volumes, and anything we changed would still have to work on the physical nodes. After some difficult work, and with some clever tricks with attributes and Chef injection, we were able to preconfigure the nodes so that the recipes would recognize them, and we slowly committed our changes back to the ecosystem without impacting the main cluster’s health.

A Product is Born

By this point, we had written the entire process as a Python application and set it up on a nightly loop. Every night it would build an entire Hadoop cluster from scratch, smoke test it, and report the results to the team. And it worked! Word started to spread across the organization, and we were suddenly getting requests from users of our production cluster: they wanted to use REDstack to spin up a sandbox for Hadoop. Not only would it be more powerful and shareable by multiple people on a team, it would also look exactly like our production cluster, because it used all of the same configurations and assets.

Opportunity!

As a data engineer, wouldn’t it be nice to have a kerberized sandbox environment that looked like a production cluster, was easy to work with, and was user friendly? That is what these environments made possible, so we started trying to provide them to teams. It didn’t work very well initially: users cloned the repo, ran it on their laptops, and hit all sorts of issues with Chef versions, Ruby versions, gem versions, and Python versions. There were simply too many dependencies to manage across different environments, even with existing dependency-management tools. We needed a way to hide all of the complexity and eliminate the need for users to run anything on their own computers.

The Full-Stack Service

This led us to the development of our full-stack cluster-delivery service, Stacker. Stacker is an API running on top of a database; it orchestrates REDstack deploys in threads and listens for requests from a front-end web page. Users simply submit a request on the website, and a cluster is delivered to them in about an hour. To date, there have been more than 500 unique cluster requests, and at least 30 teams are actively using REDstack as part of their development process.
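
As a rough sketch of that shape, and not Stacker's actual code, a small API that accepts a cluster request and kicks off a deploy in a background thread might look like the following. The routes, the run_redstack_deploy hook, and the in-memory dict standing in for Stacker's database are all assumptions for illustration.

```python
import threading
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
deploys = {}  # in-memory stand-in for Stacker's database


def run_redstack_deploy(deploy_id: str, config: dict) -> None:
    """Placeholder for a full REDstack build; records the outcome."""
    try:
        # ... a real service would invoke REDstack with `config` here ...
        deploys[deploy_id]["status"] = "delivered"
    except Exception as exc:
        deploys[deploy_id]["status"] = f"failed: {exc}"


@app.route("/clusters", methods=["POST"])
def request_cluster():
    """Accept a cluster request and start a deploy in a background thread."""
    deploy_id = str(uuid.uuid4())
    deploys[deploy_id] = {"status": "building", "config": request.get_json()}
    threading.Thread(target=run_redstack_deploy,
                     args=(deploy_id, deploys[deploy_id]["config"]),
                     daemon=True).start()
    return jsonify({"id": deploy_id}), 202


@app.route("/clusters/<deploy_id>", methods=["GET"])
def cluster_status(deploy_id):
    """Let the front end poll for delivery status."""
    return jsonify(deploys.get(deploy_id, {"status": "unknown"}))
```

A production service would persist requests in the database and cap concurrent deploys; the threading here simply mirrors the deploys-in-threads behavior described above.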

Ongoing Development

Over time, REDstack has evolved to include multiple types of Big Data clusters: we now provide Elasticsearch and Druid clusters in addition to the original Hadoop clusters. Our service continues to evolve with new releases and versions of the software, along with additional ease-of-use work on our built-in functions for user management and cluster administration.

About the Author

Eric Krenz is a Senior Data Engineer on the Big Data Platform Team at Target.