We’re pleased to welcome Elizabeth K. Joseph, OpenStack Infrastructure Team, as our newest guest blogger. Elizabeth works as an Automation and Tools Engineer at HP. She focuses on supporting the OpenStack project infrastructure, where elastic-recheck – powered by the ELK Stack – is used in production for the OpenStack development teams. This infrastructure is also managed as its own open source project, and draws core contributors from HP, the OpenStack Foundation and Mirantis, as well as part-time contributors from various other companies in the OpenStack ecosystem.
Every day, the OpenStack project runs hundreds of patches through its continuous integration system to ensure code consistency, functionality and smooth integration with other projects in the OpenStack ecosystem.
This process works exceptionally well for the majority of patches, but as with any system pushing so much data through development infrastructure, some failures are unrelated to the patches being tested. It may be that a VM goes down unexpectedly during a test, an external resource is unavailable (nameserver, package repository, etc.), a service unrelated to the test locks up, or one of many race conditions or other transient bugs pops up.
Historically, the team worked to track these issues in bug reports, which developers could submit and then search through at http://status.openstack.org/rechecks/ to see if their failed change was tied to one of these types of “transient” bugs. This process was largely manual and only gave limited data about the frequency and types of patches these bugs were occurring on. It was also difficult to determine whether a “new” bug was only impacting your change, whether it had been going on for some time without being reported or whether the bug had simply gone away. There was also no automated way of notifying a developer that their change had encountered a known bug; they had to check the rechecks page themselves.
When a developer ran into one of these bugs, they would then rerun the tests with a comment referencing the bug in our code review system.
Back at the OpenStack Summit in the fall of 2012, Clark Boylan and Sean Dague came up with some ideas around a more automated solution, using Elasticsearch to address many of the issues with the manual process so that developers could be automatically notified if their change hit a known bug.
In 2013, Clark began implementing an Elasticsearch + Logstash + Kibana (the ELK stack) solution for the OpenStack infrastructure. The stable setup he finally came up with is fully open source and documented at http://ci.openstack.org/logstash.html.
Over that summer, Sean created some sample code to talk to the web service. Joe Gordon and Matthew Treinish then turned the sample code into elastic-recheck in September of 2013, when stress on the project infrastructure hit a high point and manual rechecks were common.
With elastic-recheck now in place, contributors can:
- Identify a pattern in the failure logs and visualize it in Kibana at http://logstash.openstack.org/, searching through a few weeks of logs to determine its frequency.
- Create a bug in our bug tracker for the error, then add a comment to the bug with the exact query identified via Kibana and a link to the Logstash URL for that query.
- Submit a simple YAML-based change to the elastic-recheck repository’s queries/ directory, which contains the list of bugs to track: https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries.
The format for this change is bugnumber.yaml, so for bug 1253896, a file called 1253896.yaml would be created with data from the bug, containing:
    query: >
      message:"SSHTimeout: Connection to the" AND
      message:"via SSH timed out." AND
      tags:"console.html" AND NOT
      build_name:"check-tempest-dsvm-neutron-heat-slow"
(You can see live files here: https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries/.)
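As a rough illustration of the naming convention and file layout described above, here is a minimal Python sketch that renders such a file. The helper name `render_query_file` and the way the query is split into clauses are assumptions made for this example; this is not code from the elastic-recheck repository.

```python
from typing import List, Tuple


def render_query_file(bug_number: int, clauses: List[str]) -> Tuple[str, str]:
    """Return (filename, file text) for a queries/ entry.

    Hypothetical helper for illustration only. Each clause goes on its own
    indented line under a YAML folded scalar (`>`), which a YAML parser
    folds back into a single space-separated query string.
    """
    filename = f"{bug_number}.yaml"
    body = "\n".join(f"  {clause}" for clause in clauses)
    return filename, f"query: >\n{body}\n"


# The query from bug 1253896, split into one clause per line.
clauses = [
    'message:"SSHTimeout: Connection to the" AND',
    'message:"via SSH timed out." AND',
    'tags:"console.html" AND NOT',
    'build_name:"check-tempest-dsvm-neutron-heat-slow"',
]

filename, text = render_query_file(1253896, clauses)
print(filename)
print(text)
```

The file name carries the bug number, so the bug report and its query can always be tied back to each other.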
Just like everything else in the OpenStack ecosystem, this patch is reviewed by peers and then merged into the system to be used.
From there, the elastic-recheck page is automatically generated with all bugs included as queries in elastic-recheck. The page includes frequency graphs and links to resources for each bug: http://status.openstack.org/elastic-recheck/.
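To give a feel for how a tracked query can become an actual search, here is a minimal sketch of an Elasticsearch request body. It assumes a recent Elasticsearch query DSL (`bool`, `query_string`, `range`) and Logstash's default `@timestamp` field; the function name `build_search_body` and the 14-day window are made up for this example, and this is not necessarily how elastic-recheck builds its own requests.

```python
import json
from typing import Any, Dict


def build_search_body(lucene_query: str, days: int = 14) -> Dict[str, Any]:
    """Build a search body for one tracked query (illustrative sketch).

    The Lucene-style query string is passed through unchanged in a
    `query_string` clause, and a `range` filter limits results to the
    last `days` days using Elasticsearch date math.
    """
    return {
        "query": {
            "bool": {
                "must": [{"query_string": {"query": lucene_query}}],
                "filter": [
                    {"range": {"@timestamp": {"gte": f"now-{days}d"}}}
                ],
            }
        }
    }


query = (
    'message:"SSHTimeout: Connection to the" AND '
    'message:"via SSH timed out." AND tags:"console.html"'
)
body = build_search_body(query, days=7)
print(json.dumps(body, indent=2))
```

A body like this would be sent to the cluster's search endpoint; the time window keeps the search focused on recent failures.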
When a patch hits a known bug, elastic-recheck then automatically posts a response to our code review system.
Now, when known failures are encountered, subsequent developers running into the same issue are automatically notified and can take the appropriate action immediately without having to search through the bug tracker. This innovation has been a boon for our developers, who can now move along more efficiently with development instead of getting stuck on transient bugs.
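The idea of matching a failure against known bugs and notifying the author can be sketched in a few lines of Python. This is a deliberate simplification: the real queries are Lucene expressions evaluated by Elasticsearch, while this sketch uses plain substring checks. The names `KNOWN_BUGS`, `match_known_bugs` and `build_comment`, the sample log line, and the comment wording are all invented for illustration.

```python
from typing import Dict, List

# Hypothetical, simplified signatures: bug number -> substrings that must
# all appear in the failed job's log. Real elastic-recheck queries are
# Lucene expressions run against Elasticsearch, not substring checks.
KNOWN_BUGS: Dict[int, List[str]] = {
    1253896: ["SSHTimeout: Connection to the", "via SSH timed out."],
}


def match_known_bugs(log_text: str, known: Dict[int, List[str]]) -> List[int]:
    """Return the sorted bug numbers whose signature fully matches the log."""
    return sorted(
        bug
        for bug, needles in known.items()
        if all(needle in log_text for needle in needles)
    )


def build_comment(bugs: List[int]) -> str:
    """Build a review comment; the wording here is invented for the sketch."""
    if not bugs:
        return "No known bug matched this failure."
    listed = ", ".join(f"bug {bug}" for bug in bugs)
    return f"This failure matches a known issue: {listed}."


# Invented sample log line for demonstration purposes.
sample_log = "SSHTimeout: Connection to the 10.0.0.5 via SSH timed out."
matched = match_known_bugs(sample_log, KNOWN_BUGS)
print(build_comment(matched))
```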
By using the ELK stack and elastic-recheck, we have given members of our community tools that allow them to:
- Search logs to determine all occurrences of a bug
- Identify bug trends such as: when it started, whether it was fixed and whether it's getting worse
- Get real-time reports to the code review system when a known match is found, so a patch author knows why the test failed
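Spotting a trend comes down to counting matches over time. Here is a small, self-contained sketch that buckets hit timestamps by day, assuming the timestamps have already been pulled out of the search results as ISO 8601 strings. The function name `hits_per_day` and the sample timestamps are invented for this example.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List


def hits_per_day(timestamps: List[str]) -> Dict[str, int]:
    """Count hits per calendar day, returned in chronological order."""
    counts = Counter(
        datetime.fromisoformat(ts).date().isoformat() for ts in timestamps
    )
    return dict(sorted(counts.items()))


# Invented sample data: three matches spread over two days.
sample = [
    "2013-11-20T10:15:00",
    "2013-11-20T18:02:11",
    "2013-11-21T03:40:09",
]

daily = hits_per_day(sample)
first_seen = next(iter(daily))  # earliest day with at least one hit
print(daily)
print(first_seen)
```

The first key shows when the bug started appearing, and rising or falling counts on later days suggest whether it is getting worse or has been fixed.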
You can find out more about elastic-recheck here:
- Code: http://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/
- Documentation: http://docs.openstack.org/infra/elastic-recheck/
Thanks to Clark Boylan, Joe Gordon and Sean Dague for reviewing this article, and to Matt Riedemann and Matthew Treinish who keep reviews for this project coming along as additional core reviewers.
Many thanks to Elizabeth for sharing the story of elastic-recheck. We're thrilled that our bits power ease of development for OpenStack, and that we've made life better for the hundreds of developers working on the project!