Measuring the Performance of our OpenStack Cloud

Here at Target, we run our own private OpenStack cloud, but we have never been able to accurately measure the performance of our hardware. This lack of measurement prevents us from evaluating the performance gains of new hardware or of alternative technologies running as drivers inside OpenStack. It also prevents us from providing a Service Level Agreement (SLA) to our customers. Recently we have been striving to improve our OpenStack service, which led us to talk to our consumers directly.

One of the major points of feedback from our consumers was that the performance of the OpenStack cloud was lower than expected. Because we have not measured the performance of our cloud in the past, we have had no way of knowing whether new hardware or configuration changes improve consumer-facing performance. With our new OpenStack environment builds, we focused on changing this. But first we needed a tool to do the job.

Searching for a Tool

The first tool we looked at was Rally. Rally does performance testing of an OpenStack cloud. However, Rally focuses on the OpenStack API only. It is mainly used to test the functionality (via Tempest) and stability of the API under heavy load. Rally does contain a resource to boot an instance and run Linux CLI commands via user data. We tested this as a way to stage instances that run performance software. However, starting the software on every instance at the same time and collecting the results from it proved too difficult to be viable. Because of this, we deemed Rally not suitable for our needs.

The next tool we looked at was KloudBuster. KloudBuster is a tool that does performance testing inside an OpenStack instance. At the time of writing it provides two sets of tests: HTTP and storage. The HTTP test uses a traffic generator to measure requests per second and latency between instances. The storage test uses FIO to measure read/write IOPS and bandwidth. KloudBuster does what we were looking for: it measures the performance of instances inside of OpenStack. However, it does not support adding tests beyond the two included, it has limited configuration options for the environment setup, and it had stability issues inside our OpenStack cloud. Because of this, we deemed KloudBuster not suitable.

Creating Our Own

With the existing options not meeting all of our needs, we decided the best option was to create our own performance framework, one flexible enough to cover a wide variety of tests and environment setups. These were our requirements:

support the following tests: network, storage, and CPU
easily support adding future tests
flexible enough to support any OpenStack environment (including old versions)
ability to use fixed or floating IP addresses for connectivity
ability to use ephemeral or Cinder storage
return test results in a format usable by other software

Enter CloudPunch

CloudPunch is a tool developed by the OpenStack team here at Target. It is completely open source and released under the MIT license. CloudPunch has the following features:

Written 100% in Python - CloudPunch is written in the Python language including the sections that stage OpenStack and the tests that run. This was chosen to avoid reliance on other tools.
Create custom tests - Because tests are written in Python, custom-written tests can be run by simply dropping a file in a folder. These tests are not limited; a test can do anything Python can do.
Fully scalable - A test can include one instance or hundreds. A couple lines of configuration can drastically change the stress put on OpenStack hardware.
Test across OpenStack environments - Have multiple OpenStack environments or regions? Run tests across them to see performance metrics when they interact.
Run tests in an order or all at once - See single metric results such as network throughput or see how a high network throughput can affect network latency.
JSON and YAML support - Use a mix of JSON and YAML for both configuration and results.

How Does CloudPunch Work?

CloudPunch separates the process of running a test into three major roles:

Local Machine - The machine starting the test(s) and receiving the results outside of OpenStack.
Master - The OpenStack instance that relays communication between the outside and the inside of OpenStack. The local machine sends configuration to the master so the slaves can get it. The slaves send test results back to the master so the local machine can receive them.
Slave - The OpenStack instance that runs the test(s). It reports only to the master.

For more specific information on these roles, see here

To better explain the process from start to finish, I will go over an example of running a simple ping test between instances. To initially start CloudPunch, I give the CLI a configuration file, an environment file, and an OpenRC file.

CloudPunch on the local machine then begins to stage the OpenStack cloud (in this order) with a security group, a keypair, the master router, the master network, the master instance, the slave routers, the slave networks, then finally the slave instances. The local machine now waits for all of the instances to be ready by checking in with the master server via a Flask API. At the same time, all slave instances are working to register with the master instance to provide host information and say that they are ready.
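The waiting step described above can be sketched as a small polling loop. This is only an illustration, not CloudPunch's actual code: `wait_until` and `MasterState` are hypothetical names, and the in-memory `MasterState` stands in for the state that the real master exposes through its Flask API.

```python
import time


def wait_until(check, interval=1.0, timeout=600.0, sleep=time.sleep):
    """Call check() every `interval` seconds until it returns True.

    Returns True on success, or False once `timeout` seconds have passed.
    The `sleep` argument is injectable so the loop can be tested quickly.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


class MasterState:
    """Hypothetical in-memory stand-in for the master's bookkeeping."""

    def __init__(self, expected_slaves):
        self.expected_slaves = expected_slaves
        self.registered = {}  # hostname -> host information

    def register(self, hostname, info):
        # Called when a slave reports that it is up and ready.
        self.registered[hostname] = info

    def all_ready(self):
        # The test may start once every expected slave has registered.
        return len(self.registered) >= self.expected_slaves
```

In this sketch the local machine would call something like `wait_until(state.all_ready, interval=5.0)`, with the check replaced by an HTTP request to the master in a real deployment.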

Once the master server and all slaves are registered and ready, the local machine sends the configuration to the master server and signals that the test can now begin. The slave instances check in with the master server every ~1 second until the test status is ready. Once it is ready, the slaves pull down the configuration from the master server. The slaves then start the ping test by calling the ping.py file inside the slave directory. This file runs the ping command via the shell and captures the latency results. The slaves collect these results and, once the test is complete, send them back to the master instance. While all of this is happening, the local machine checks in with the master server every ~5 seconds until all slaves have posted results to the master.
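To give a feel for what a test like the ping test involves, here is a minimal sketch that runs ping and extracts per-reply latencies. It is not the contents of CloudPunch's ping.py; the function names are made up for illustration, and it assumes a Linux-style `ping` that accepts `-c` and prints `time=... ms` for each reply.

```python
import re
import subprocess

# Matches the per-reply latency printed by Linux ping, e.g. "time=0.45 ms".
_LATENCY_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_ping_latencies(output):
    """Return the per-reply latencies (in milliseconds) found in ping output."""
    return [float(value) for value in _LATENCY_RE.findall(output)]


def summarize(latencies):
    """Reduce a list of latencies to a small, JSON-friendly summary."""
    if not latencies:
        return {"count": 0}
    return {
        "count": len(latencies),
        "min_ms": min(latencies),
        "max_ms": max(latencies),
        "avg_ms": sum(latencies) / len(latencies),
    }


def run_ping(target, count=5):
    """Run the system ping command against `target` and summarize the result.

    Assumes a Linux-style ping supporting the -c (count) flag. The command
    is passed as an argument list, so no shell quoting is needed.
    """
    proc = subprocess.run(
        ["ping", "-c", str(count), target],
        capture_output=True,
        text=True,
        check=False,
    )
    return summarize(parse_ping_latencies(proc.stdout))
```

Returning a plain dictionary keeps the result easy to serialize when it is posted back to the master.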

Once all slaves have posted results, the local machine pulls down the results and saves them according to the configuration. With the test run complete, the local machine deletes all of the resources it created on the OpenStack cloud. I am left with the results of the ping test and no leftover resources sitting on the OpenStack cloud.
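One way to guarantee this kind of cleanup is to register a delete step as each resource is created. The sketch below uses Python's `contextlib.ExitStack` for that. It is an illustration under assumptions, not CloudPunch's implementation: `run_with_cleanup` and the `(create, delete)` pairs are hypothetical, and deleting in reverse creation order is an assumption rather than something stated above.

```python
from contextlib import ExitStack


def run_with_cleanup(steps, run_test):
    """Create resources, run the test, then delete everything that was created.

    `steps` is a list of (create, delete) pairs: create() returns a resource
    and delete(resource) removes it. ExitStack runs the registered deletes in
    reverse creation order, even if a later create() or the test itself fails.
    """
    with ExitStack() as stack:
        for create, delete in steps:
            resource = create()
            # Register the cleanup as soon as the resource exists.
            stack.callback(delete, resource)
        return run_test()
```

A usage sketch would pass one pair per staged resource, such as the security group, keypair, networks, and instances, together with a function that runs the test and returns its results.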

How Do I Get Started With CloudPunch?

See the getting started guide here

About the Author

Jacob Lube is an Engineer and part of Target’s Enterprise Cloud Engineering team focusing on anything OpenStack. Outside of OpenStack, Jacob enjoys riding bicycles, playing video games, and messing around in the Python language.

Tagged with openstack • performance • cloudpunch