03 November 2014 Engineering

The People Behind the Projects: for the Commons

By Andrew Selden

The People Behind The Projects is a new series of first-person blog posts from the individuals who contribute to the fantastic, elastic world of Elasticsearch. We're a close-knit family at Elasticsearch. The goal of this series is to erode the boundaries between the so-called personal and the professional. We want to share our stories with you. So in this series, Elasticsearchers talk about where they're from, how they first started engaging with Elasticsearch, and what motivates them.

I meandered some in the early years. I had no idea what I wanted to do in college, but my parents always told my brother and me that we needed to get a Liberal Arts education. They actively discouraged us from being practical. I dabbled in philosophy, history, and political science before settling on Russian Language and Literature as a major.

After college, I started working at the Bioinformatics Lab at UPenn. This was in the early days of what came to be called "big data". There was no Hadoop. No Elasticsearch. None of the great tools we have today to manage massive data sets. We had to cobble together large Linux clusters using Perl, NFS, and a smattering of other technologies. We'd get these large grants from the National Science Foundation to perform analytics on various genomes. Today you would just rent space in the cloud, but 12 years ago there was no cloud. It's so interesting to see how the state of the industry has evolved. A lot of the computational techniques we were struggling to build a decade ago are now mainstream in just about every tech startup.

At the same time, I got my master's in Computer Science, taking courses part-time alongside work. After all, who wants to get out of school with student debt?

After finishing graduate school, I moved to the Bay Area and ended up as a search engineer for a large media surveillance company. We crawled the web, pulling in content in over 30 different languages to analyze and create competitive intelligence for our clients. We were very successful and built a great product, but I always felt that the software tools were too hard to use. Then I discovered Elasticsearch.

The technology probably wasn't even a year old when we introduced Elasticsearch into our company. But even in its earliest days it was so clearly superior to everything else. Clustering just worked. Search just worked. It was a breath of fresh air to have a tool I didn't have to fight to get working. It's been amazing to watch the evolution in how customers use Elasticsearch, from a purely search-oriented technology into a full-fledged analytics engine.

Elasticsearch comes at the problem from the perspective of a search engine. Most big data products come at the problem from the perspective of putting files on disks, scanning them from start to end. That approach is very linear, whereas Elasticsearch wants to find that relevant needle in the haystack in milliseconds. Easy to get data in. Easy to get data out.

For instance, you give the data to Hadoop. It will take that data and store it on disk. Later, if you want to analyze the data, it will allow you to launch a number of parallel processes, each of which scans a part of the file. Together they take the file apart and scan it from start to finish. It's linear within chunks. You give it data and it puts the bits on disks.

With Elasticsearch, the moment you give us data, we don't just take it and store it on disk. We do deep analysis on it. We construct data structures and put those on disk. The data structures are organized in such a way that you can do extremely fast lookups. If I want to search for the word "Dostoevsky" in a set of billions of documents, I don't have to sit idly while the system scans everything for hours. I just have to look up "Dostoevsky." It's one hop. The preprocessing was completed at the time of ingestion. Search engines and information retrieval algorithms are a well-understood problem, whereas parallel-processing data analytics is still a new arena, and the ELK stack is ahead of the curve here.
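
To make the "one hop" idea concrete, here is a minimal sketch of an inverted index in Python. It is a toy illustration under simplifying assumptions, not how Elasticsearch or Lucene is actually implemented: the `tokenize`, `build_index`, `scan`, and `lookup` names are made up for this example, the analysis step is a naive lowercase-and-split, and everything lives in memory.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Deliberately naive "analysis": lowercase, then split on non-word characters.
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    # Ingestion-time preprocessing: map each term to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def scan(docs, term):
    # The linear approach: read every document from start to end at query time.
    term = term.lower()
    return {doc_id for doc_id, text in docs.items() if term in tokenize(text)}


def lookup(index, term):
    # The search-engine approach: a single dictionary lookup at query time.
    return index.get(term.lower(), set())


docs = {
    1: "Dostoevsky wrote Crime and Punishment",
    2: "Tolstoy wrote War and Peace",
    3: "Notes from Underground is also by Dostoevsky",
}
index = build_index(docs)
print(sorted(lookup(index, "Dostoevsky")))  # [1, 3]
```

Both `scan` and `lookup` return the same answer. The difference is where the work happens: `scan` touches every document on every query, while `lookup` pays that cost once, at ingestion.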

For example, say you're Walmart, selling five types of electronics across thousands of retail outlets. You want the minimums, maximums, and averages per store for every type of product. You also want to find the most effective salespeople across your network. Those questions were traditionally answered over the course of weeks or months by using SQL on an Oracle analytics engine. With Elasticsearch's aggregations, we can assemble such business analytics within seconds because of the preprocessing and the parallel processing.
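
As a rough sketch of what that retail question could look like, the Python below builds a search request body that nests a `stats` aggregation inside two `terms` aggregations, then computes the same numbers locally on a few sample rows to show what each bucket contains. The field names (`store_id`, `product_type`, `price`), the sample data, and the helper names are all assumptions made up for illustration, and the exact request shape may differ between Elasticsearch versions.

```python
import json
from collections import defaultdict


def store_stats_query(store_field="store_id", product_field="product_type",
                      price_field="price"):
    # Hypothetical field names. One bucket per store, one sub-bucket per
    # product type, and min/max/avg (among others) computed by "stats".
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "aggs": {
            "per_store": {
                "terms": {"field": store_field},
                "aggs": {
                    "per_product": {
                        "terms": {"field": product_field},
                        "aggs": {
                            "price_stats": {"stats": {"field": price_field}}
                        },
                    }
                },
            }
        },
    }


def local_stats(sales):
    # Pure-Python equivalent of the buckets above, for small in-memory data.
    buckets = defaultdict(list)
    for sale in sales:
        buckets[(sale["store_id"], sale["product_type"])].append(sale["price"])
    return {
        key: {"min": min(prices), "max": max(prices), "avg": sum(prices) / len(prices)}
        for key, prices in buckets.items()
    }


# Made-up sample rows, purely for illustration.
sales = [
    {"store_id": "A", "product_type": "tv", "price": 300},
    {"store_id": "A", "product_type": "tv", "price": 500},
    {"store_id": "B", "product_type": "tv", "price": 400},
]

query = store_stats_query()
print(json.dumps(query, indent=2))
stats = local_stats(sales)
print(stats[("A", "tv")])
```

The local version loops over every sale in one process. The point of the aggregation request is that the cluster can do the equivalent bucketing in parallel across shards, over data structures it already built at ingestion.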

I joined Elasticsearch last year, and one of the first customers I supported was my former media surveillance company. I spent a day onsite there. From that point on, we've had calls every two weeks. We had a live debugging call recently between five guys at the company and two of our engineers, Robert Muir and Mike McCandless. Robert and Mike are the top contributors to Lucene. They are search experts. Everyone learns so much in these conversations. In our support conversations, there have been technical wins, but really it's also been about human relationships.

The conversation reminded me of a book that I've loved called For Common Things. It's a manifesto by Jedediah Purdy for civic responsibility and against the prevailing culture of cynical irony that seems to be ever-present. Civic responsibility at Elasticsearch is apparent when the team comes together to make a customer successful.