22 December 2017 Culture

Solving the Small but Important Issues with Fix-It Fridays

By Daniel Cecil

Question: How many engineers does it take to change a light bulb?


Answer: The light bulb works fine on the system in my office...


OK. It isn’t a great joke. But it’s the perfect setup for discussing an important topic here at Elastic: How do busy engineers, often working on large and gnarly projects, handle the small issues — like changing a metaphorical light bulb — that inevitably pop up from time to time?


The answer: Fix-It Friday.


The Elasticsearch code is housed in a public repository on GitHub and accessible to anyone. When a user finds bugs, spots missing features, or wants to make a specific request, they can flag it using the issues tab by simply submitting a new issue. The process is open and transparent — just the way we like it.


Each day, someone on the Elasticsearch team is assigned to a role called support dev help. In this role, the engineer has the dual duty of aiding the Elastic support team while looking for fresh issues in the Elasticsearch repository. When a new issue arises, the engineer will add a label to help the team prioritize when to tackle it, and how much effort it might take to solve it.


However, not all issues have a simple diagnosis, nor an easy fix.


“If there’s enough information, but it’s not clear that the issue is something we really want to handle due to policy, or maybe the person handling the ticket doesn’t have enough knowledge in the issue area to make a decision on it, then we can mark the ticket ‘discuss’ and it goes into the queue for Fix-It Friday,” said Colin Goodheart-Smithe, Elasticsearch Software Engineer.  


Elasticsearch Team Lead Clint Gormley created the Fix-It Friday initiative a little over three years ago as a time when these small issues were given to engineers to solve. That ambitious concept didn’t last very long. The team quickly learned that small issues often turned out to be big ones in disguise. (Think: the filament in the light bulb looks dead, but in reality the electricity is out.) So, the scope of Fix-It Friday evolved into a get together for discussing user requests and finding solutions. Since the Elastic team is distributed, the meetup also became a weekly opportunity to get off Slack and email and get focused on a team video call.


“It’s a good time,” said Gormley, “getting a group with such a wide range of expertise in one virtual room — it’s amazing.”


About 10 issues are discussed during a typical one hour Fix-It Friday session. Issues are later fixed and implemented or de-escalated. When asked whether there was a particular issue from a Fix-It Friday meeting that jumped out at him, or that he thought was quirky or fun, Gormley laughed. “We’ve only been through 12,000 issues or something ….”


But one seemingly small bug hiding something larger did spring to mind. Users reported heavy queries submitted to Elasticsearch never timing out, and Gormley recalled queries which ran for hours.


“Usually, our queries run milliseconds, so if one runs for an hour, you know you have a problem,” he explained.


In these situations users, thinking nothing is happening, run the query again. So, instead of one wildcard query running for an hour, they actually have two — or more. This isn’t exactly an issue that could break anything, but it had the potential to slow results and reduce resources. The issue was marked for discussion at a Fix-It Friday session. After a lengthy debate, Elastic engineers considered adding a default timeout, meaning in one hour’s time, the query got canceled. It seemed like a good idea at first. But with several eyeballs on the issue, another perspective developed.


Data is stored in indexes mapped out to shards, which are situated on different machines. When you run a query, it reaches out to all the shards, gathering the results and providing those results to the user. But what happens if one of the shards is missing due to a dying node on the shard, or when it gets disconnected from the network, causing the heavy query to fail? Should Elasticsearch show an exception? Or show only the results from the available shards?


Users performing a simple search might be happy with getting results only from available shards. But users performing analytics would want to know that they’re receiving partial results. For the timeout option, Elastic engineers decided that a silent timeout (when you do not get a notification that the query stopped running) was out of the question. They also considered throwing an exception so that the user knew something was wrong with the query. But what of other circumstances, such as a missing shard, that can create partial results? Should that throw a hard exception too? In the end, they decided to add a global and per-request setting to toggle this behavior. The timeout discussion turned out to be too large a decision for one engineer to make on their own.


“From a user perspective it’s important that we actually look at these things,” said Gormley. “Our users are very involved. If they’ve taken the time to write a decent issue, we owe it to them to respond appropriately.”


This is where the value of Fix-It Friday really comes into play — it’s a broadening of the collective Elastic mind. For engineers, Fix-It Friday is a chance to break from the day-to-day and think about new issues in different ways, providing an opportunity to meditate on an problem that may not be their particular focus but is part of the larger product. In the end, Fix-It Friday isn’t about simply fixing bugs, or fielding requests — it’s about widening the scope of what Elastic can do.


“It's about making decisions,” said Elasticsearch Software Engineer Adrien Grand. “It’s about which direction we want to take.”


“You see people asking us to add features that work on small datasets but won’t scale,” said Gormley. “If we make something as a small-scale solution, inevitably someone will want to use it on the big scale and it will fail. That kind of stuff is important for new devs to know so that they can make these decisions later on. There’s an ethos to how we develop; guiding principles of what to add, and what not to add.”


However, Gormley added, nothing is set in stone.


“That willingness to change minds is an important part of the Elastic culture.”


“As usual in open source,” added Adrien Grand, “no is temporary, but yes is forever.”