This blog post summarizes a trio of “big data” topics — metadata, distributed search, and dynamic categorization — that have actually been part of the information retrieval landscape for quite a while, but seem to periodically lose favor and then reappear over time.
I’m starting to believe that these recurring patterns are actually generational — meaning that each successive generation of technologists runs up against the same problems and tries to solve them with the technology of the day, but using the same old techniques.
What they don’t realize is that the workarounds have become institutionalized: the solutions of the past became limiting factors in the technology landscape that continue to hide the underlying problem, which is that the relative value of data is subjective and constantly changing. The corollary is that you can’t rely on data summarized or labeled at one point in time, or for one specific context, to decide what’s relevant in the current (or a future) context. Unfortunately, that tenet has been forgotten as we’ve all tried to do the best we can with what we have.
Uncovering the underlying big data problem
Data is the lifeblood of our digital world. The whole point of IT systems is to collect data about our operations and then use that information to detect problems or opportunities, to form and investigate hypotheses, and to make the best decisions possible. The first and most important step in any business operation is to find all potentially relevant information as quickly as possible. How this first step is accomplished is often the key to the success or failure of the operation.
There are legacy technological shortcomings that previously limited our ability to connect and analyze widely distributed or large-scale data stores in a unified way. In an attempt to overcome the age-old problems of “precision vs. recall vs. scalability vs. cost,” workarounds were devised to make data manageable and cost-effective at large scales, but those shortcuts eventually create more downstream problems.
The importance of unifying all data
Perhaps if you’ve read this far, you’re saying to yourself “yeah, but what you’re describing about ‘precision vs. recall’ applies mostly to human-derived unstructured data; structured data like event logs and metrics can easily be queried with filters on metadata, columns, and aggregations.” That’s true enough, but a large part of the problem we have in using our collected data (which, again, is the point!) is that it’s nearly impossible to get a comprehensive picture from disconnected silos — that proverbial “single pane of glass.”
Rows and columns, numbers and ranges . . . in many ways those are the simplest parts of a search application — not easy, mind you, but straightforward with regard to filtering, matching, and aggregations; it’s reliable old math. It’s in the unstructured and dimensional data (full-text, geospatial, vector-based, etc.) that a search engine provides the value that a database can’t.
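As a sketch of how the two styles can work together in a single request, here is a hypothetical Elasticsearch-style bool query that pairs exact structured filters (the "reliable old math") with a full-text relevance clause. The index fields and values are invented for illustration:

```python
# Sketch: one query combining structured filters with full-text relevance.
# Field names ("service.name", "http.status_code", "message") are
# hypothetical; the shape follows the Elasticsearch-style bool query.

def build_unified_query(service, status_range, text):
    return {
        "query": {
            "bool": {
                # Structured side: exact matches and numeric ranges,
                # evaluated as simple yes/no filters.
                "filter": [
                    {"term": {"service.name": service}},
                    {"range": {"http.status_code": {
                        "gte": status_range[0],
                        "lte": status_range[1],
                    }}},
                ],
                # Unstructured side: full-text scoring, which is the
                # value a search engine adds over a plain database filter.
                "must": [
                    {"match": {"message": text}},
                ],
            }
        }
    }

query = build_unified_query(
    "checkout", (500, 599), "timeout connecting to payment gateway"
)
```

The point of the sketch is that neither half is bolted on: filters narrow the candidate set cheaply, while the `match` clause ranks what remains by relevance.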
Archived data stored in monolithic data lake repositories isn’t readily available in a fully usable state the way it is in a search engine; it can only be accessed via whatever summarized or metadata-based lookups the chosen storage mechanism offers. The gulf between the operational data we can access (for short periods of time, due to historical cost and practicality factors) and the long-term data kept in “big data lakes” is huge, and it turns finding relevant archived data in a data lake and correlating it with our real-time results into a siloed-data problem.
Memory lane: 3 limited approaches to handling big data
This lack of unified storage, access, analysis, and cross-correlation across silos and time is where these problems crop up again and again, and it’s where we keep reaching for the same old unsuccessful approaches.
Each of the topics below is a distinct approach to handling “big data,” but all of them are hampered by the same hidden underlying problem, each surfacing as a different workaround.
1. Why metadata isn’t enough
I can hear you saying, “This is a boring old topic . . . why are we discussing metadata again?”
Because metadata has what should be obvious limitations that are being overlooked: metadata is being used as the basis for finding relevant data, and it’s not up to the task. The meaning of any given piece of data as it relates to answering a question (a query, a search) is subjective and cannot be predetermined, because each bit of new information adds to experience and understanding, subtly changing the calculation of value as it arrives.
But the problem isn’t that we create or use metadata; it’s how we use it that causes trouble. Using metadata to tag content with a label that determines its findability forevermore is counterproductive, even dangerous.
Read the “Metadata isn’t enough” white paper.
2. Distributed (not federated!) search
To me, “federated” is one of those terms that makes the hairs on the back of my neck stand up when I hear it. I suppose that reaction comes from multiple experiences trying to make federated search systems work for different organizations.
They began as well-meaning attempts to leverage existing IT assets and bring together the data needed to improve operations. It always ended the same way, though: wasted cycles spent futilely troubleshooting bad syntax or tracing connections between applications, and when that didn’t work, falling back on coded workarounds (we jokingly called it “duct tape and baling wire”), desperately trying anything and everything to keep our federated search system limping along.
Each and every time, everyone involved inevitably came to the conclusion that not only was the federated search system too difficult and expensive to maintain, but it didn’t even serve the mission it was created for in the first place.
A “federated” search application is middleware that tries to act as a “universal query translator”: a single endpoint used to submit queries to multiple “federated” data systems and return the combined results to the requestor. Seems like it should be relatively simple, but in reality there are so many pitfalls and inadvertent, interwoven side effects to be aware of with federated search that it’s hard to tease them all apart and summarize them. Let’s try anyway.
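To make the shape of the problem concrete, here is a minimal sketch of the "universal query translator" pattern, with two invented backends. Everything here (the backends, their data, their scoring) is hypothetical; the point is to show where the brittleness creeps in: every backend needs its own query translation and result normalization, and the merged ranking compares scores that were never comparable.

```python
# Two invented backends with incompatible result shapes and scoring.

SQL_ROWS = [{"title": "Outage report", "body": "database timeout at 02:00"}]
DOC_HITS = [{"title": "Timeout tuning guide", "relevance": 87}]

def query_sql_backend(text):
    # Translate to a substring filter; no relevance score exists here,
    # so we invent one (1.0) just to be able to merge results later.
    rows = [r for r in SQL_ROWS if text.lower() in r["body"].lower()]
    return [{"title": r["title"], "score": 1.0} for r in rows]

def query_doc_backend(text):
    # This backend returns scored hits, but on a 0-100 scale, so we
    # rescale it -- yet another per-backend reconciliation step.
    return [{"title": h["title"], "score": h["relevance"] / 100}
            for h in DOC_HITS if text.lower() in h["title"].lower()]

def federated_search(text):
    # Fan out, then merge by scores that are only loosely comparable
    # across systems -- the root of most federated-ranking complaints.
    combined = query_sql_backend(text) + query_doc_backend(text)
    return sorted(combined, key=lambda hit: hit["score"], reverse=True)

results = federated_search("timeout")
```

Every new backend multiplies the translation and normalization code, and the merged ordering is an artifact of the invented scores rather than of true relevance, which is exactly the failure mode described above.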
3. Dynamic categorization
Anyone here remember Yahoo Categories? Back when there was still a competition for internet search engine supremacy, Yahoo’s Categories were probably the most actively updated controlled vocabulary used in a public-facing application. At the time, the idea seemed like a good, human-friendly solution for dealing with the V’s of big data (by some accounts there are somewhere between 3 and 10 different V’s!) that led to “information overload.”
Know why categories never became the state of the art? Because it’s extremely difficult (nay, impossible) to maintain a controlled vocabulary of any kind that can suit every possible search context. Language is constantly morphing and growing, and every user has a unique reason for interrogating the data, and a different idea of what’s relevant to their search.
There are of course several different approaches to building and using categories, including some that have a definite (though limited) usefulness applied to searching. But the reason this is on our list of problem approaches is that by necessity, categories use a pre-built definition to find data that matches. As we found in the “Why Metadata isn’t enough” paper, there are limits to how far metadata/labeled data can take you.
But — there are other, more dynamic ways to do categorization, if you have the power of a distributed search platform in your arsenal!
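One way to picture the difference: instead of assigning documents to a fixed taxonomy up front, a search platform can derive category counts from whatever documents actually match the current query (this is what a terms aggregation does over a result set). The documents and tags below are invented for illustration:

```python
from collections import Counter

# Hypothetical mini-corpus; tags stand in for any indexed keyword field.
DOCS = [
    {"text": "pod restart loop in checkout service",
     "tags": ["kubernetes", "checkout"]},
    {"text": "checkout latency spike after deploy",
     "tags": ["latency", "checkout"]},
    {"text": "billing export failed overnight",
     "tags": ["billing"]},
]

def dynamic_categories(query):
    # Match first, then derive category counts from the matches.
    # The categories reflect this query's context, not a taxonomy
    # that was frozen at indexing time.
    hits = [d for d in DOCS if query in d["text"]]
    counts = Counter(tag for d in hits for tag in d["tags"])
    return counts.most_common()

cats = dynamic_categories("checkout")
```

Because the categories are computed at query time, a brand-new tag or a shift in language shows up in the facets immediately, with no vocabulary maintenance.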
Elastic: Search to the rescue!
As the promise of AI/ML begins to come to fruition (and we start to codify and automate big data operations), I want to chime in with a reminder about where we’ve been so that we can avoid the mistakes of the past. Underlying each of these old-school approaches is an attempted workaround to our inability to sift through the huge amounts of wildly disparate and widely distributed data. I believe we can finally move past that technology gap.
The common solution to all of these problems is a wholly new data access paradigm: Distributed Search — the ability to perform fast, comprehensive search both locally and remotely, at scale, using a common data platform.
It’s the obvious solution and feels like something that should have been tried before, but in truth it has only been technically possible for a few years (with the advent of Cross Cluster Search). Because of that, most people aren’t aware of the possibilities that Distributed Search opens up, and many of the ingrained assumptions and data design decisions made over the years were based on limitations built into our systems for handling and accessing big data.
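To illustrate the addressing model, here is a sketch of how a cross-cluster search request can be targeted in the Elasticsearch style: remote indices are prefixed with a configured cluster alias, so one request spans local and remote data. The cluster aliases and index names below are hypothetical:

```python
# Sketch: building the index-target portion of a cross-cluster search
# request. Remote indices are addressed as "<cluster_alias>:<index>".

def ccs_search_path(local_indices, remote_indices):
    # remote_indices maps cluster alias -> list of index patterns,
    # e.g. {"eu-archive": ["logs-2019-*"]}
    targets = list(local_indices)
    for alias, patterns in remote_indices.items():
        targets += [f"{alias}:{p}" for p in patterns]
    return "/" + ",".join(targets) + "/_search"

path = ccs_search_path(
    ["logs-current"],
    {"eu-archive": ["logs-2019-*"], "us-archive": ["logs-2020-*"]},
)
# One query now spans the hot local data and both remote archives,
# with no middleware translating queries or merging results by hand.
```

The design point is that the query language, scoring, and aggregations are the same everywhere, so the "universal translator" layer that federated search required simply disappears.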
The white papers linked here outline how these recurring patterns can be overcome with distributed search, but there are many other implications to having this core problem solved.
For one, we can begin refactoring our data operations to take advantage of a speed layer that can dynamically access all data to suit the context. That lets us redirect the endless cycles previously spent on disparate data synchronization, normalization, and rationalization toward improving (and applying) the analytics that rely on a common data picture, which is the real reason we’re collecting all that data in the first place.
Dive deeper into big data
Explore these topics further in the related white paper series: