
Investigative analysis of disjointed data in Elasticsearch with the Siren Platform

At Siren, we build a platform used for “investigative intelligence” in Law Enforcement, Intelligence, and Financial Fraud. Investigative intelligence is a specialisation of data analytics that serves the needs of those who are typically hunting for bad actors. Such investigations are the primary focus of law enforcement and intelligence, but are also critical to uncovering financial crime activities and for threat hunting in cybersecurity.

At the heart of Siren is the Elastic Stack. With its real-time responses and ability to deal with ever-increasing amounts of structured and unstructured data, Elasticsearch provides the ideal fabric for investigative analysis. Furthermore, thanks to the Siren and Elastic partnership, all the advanced features of an Elastic Platinum subscription can be used in Siren as part of US Federal Siren deployments, giving investigators unprecedented flexibility, capabilities, and operational resilience.

At Siren, we set out to use Elasticsearch to tackle two of the biggest problems in investigative intelligence: disjointed data and disjointed tools. With that in mind, in this blog post, we would like to share part of our approach with the broader community of technologists and architects who are trying to get the most out of their data.

Two challenges for modern investigative analytics

At the data level, the main challenge in investigative intelligence is disjointed data: investigators need to be able to follow non-obvious relationships across a myriad of diverse data sets and data sources. 

A more subtle challenge, however, is the diversity of the analysis needs, which traditionally forced organizations and analysts to either use many disconnected tools and backends (APIs) or embark on building expensive and rigid ad-hoc integrations. 

For example, investigators certainly need link analysis — to see and explore the connections between records — but they also need fast drilldowns, business intelligence (BI)-style visuals, and text search and analysis for unstructured data. 

Our challenge from the outset has been: How can we deliver a unified investigative analytics platform that is also architecturally modern and easy to deploy?

Stepping up to the challenge

With its real-time responses, its array of real-time analytics functions, and its powerful search engine — capable of dealing with fuzzy searches and noisy data — Elasticsearch immediately stood out as the backend providing the ideal starting point for investigative analysis.

That is why we built the Siren platform — a unified tool that enables big, disconnected data analytics — on top of Elasticsearch and Kibana.

The Siren platform

Analysing disjointed data in Elasticsearch: Data model, joins, and link analysis

The investigative world is made up of disjointed data that needs to be connected. People (for example) are connected to vehicles they own, which are connected to locations where they’ve been, which may be connected to events, and so on.

In Elasticsearch these are typically recorded in separate indices, possibly coming from all sorts of sources. Siren leverages the real-time speed of Elasticsearch to tie this data together for investigators, regardless of source or index.

Tying data together with an associative data model

In Siren Investigate — Siren’s frontend built on Kibana — administrators or advanced analysts define an Associative Data Model on top of their existing data, and this data model then drives all the analytic operations. 

The Data Model editor is visual, allowing you to define how tables are interconnected, typically by shared keys, which are then used to join the records. For example, an Associative Data Model in law enforcement can be defined to connect tables which contain persons with vehicles, cases, automatic camera licence plate readings, and more.

One uses the visual editor to specify the primary and foreign keys to be used as associations. For example, here are the connections for the Crimes index:

And the overall model can also be seen as a single picture, such as this graph visualization of the connections between persons and other entities. 

A view of the connections between persons and other entities

In another example — a cybersecurity scenario — it is common to use concepts such as IPs, MD5 hash values, emails or user IDs to tie together security logs. The following screenshot shows the relationships between different IPs.

A view of the relations between different IPs

All the examples above share specific identifiers, but, as we’ll see later, “fuzzy” relations can be similarly accounted for (e.g., relations coming from Natural Language Processing (NLP) or Entity Resolution).
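
To make the idea of an associative data model more concrete, here is a minimal sketch in plain Python. It is not Siren's internal representation: the `Relation` class, the `connected_indices` helper, and all index and field names are hypothetical, chosen only to mirror the law enforcement example above. It records which index joins to which on a shared key, and lists what a given index is connected to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One association in the model: source_index.source_field joins target_index.target_field."""
    source_index: str
    source_field: str
    target_index: str
    target_field: str


# Hypothetical index and field names, used only for illustration.
DATA_MODEL = [
    Relation("persons", "person_id", "vehicles", "owner_id"),
    Relation("vehicles", "plate", "plate_readings", "plate"),
    Relation("persons", "person_id", "crimes", "suspect_id"),
]


def connected_indices(model, index):
    """Return the indices directly associated with `index`, in either direction."""
    neighbours = set()
    for rel in model:
        if rel.source_index == index:
            neighbours.add(rel.target_index)
        elif rel.target_index == index:
            neighbours.add(rel.source_index)
    return sorted(neighbours)


print(connected_indices(DATA_MODEL, "vehicles"))
```

Keeping the relations as plain data, separate from the records themselves, is what lets one model drive many different operations.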

Data model-powered associative drilldowns (and link analysis)

The data model enables two special kinds of investigative capability: associative drilldowns and link analysis investigations.

Let’s see this in action with a financial investigation example, where we have companies that have received investments from investors, as well as articles that mention companies (and often their investments). This is represented by the following data model:

Here, the articles-to-companies relation comes from the NLP engine.

Thanks to the data model in Siren, we can drill down based on what’s connected to a set of records. 

For example, in the screenshot below, the relational navigator shows the user in real time how many records are connected to the current set (351,243 articles and 41,298 investments), and each connected set can be reached with the click of a button.

Clicking the investments button in the relational navigator brings us to a tailored dashboard, where we can drill down further.

The process can then be repeated, for example to move from here to the 2,535 investors who made these investments in 2012.

Under the hood, these real-time interactive associative capabilities and the relational button make use of the Siren Federate plugin, which extends the Elasticsearch query DSL to include cluster-scalable join/correlation capabilities.

Siren Federate also enables working across different backends: it has a series of drivers that can see data in remote backends as if they were in Elasticsearch (virtual indices).
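
Conceptually, each associative drilldown is a join on a shared key: take the key values of the current set, then keep the records in the connected set that carry one of those values. The sketch below shows that logic in plain, in-memory Python. It is not the Siren Federate API and says nothing about how the plugin executes joins across a cluster. The `drill_down` function and the sample records are hypothetical.

```python
def drill_down(current_set, key_field, target_records, target_field):
    """Return the target records whose key matches any record in the current set."""
    keys = {rec[key_field] for rec in current_set if key_field in rec}
    return [rec for rec in target_records if rec.get(target_field) in keys]


# Hypothetical sample data mirroring the companies / investments / investors model.
companies = [{"id": "c1", "name": "Acme"}, {"id": "c2", "name": "Globex"}]
investments = [
    {"company_id": "c1", "investor_id": "i1", "year": 2012},
    {"company_id": "c1", "investor_id": "i2", "year": 2012},
    {"company_id": "c3", "investor_id": "i1", "year": 2011},
]
investors = [{"id": "i1"}, {"id": "i2"}, {"id": "i3"}]

# First hop: companies -> investments. Second hop: investments -> investors.
related_investments = drill_down(companies, "id", investments, "company_id")
related_investors = drill_down(related_investments, "investor_id", investors, "id")

print(len(related_investments), len(related_investors))
```

The counts printed at the end play the role of the numbers shown on the relational navigator buttons: how many records sit one hop away from the current set.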

Pivoting to graph mode: Siren Link analysis

Being able to do associative drilldowns is great, but there are questions that no dashboard can answer. 

For example:

  • Which investors invested in which company?
  • Are they investing together or in groups?
  • Are there groups that appear to be investing in competing companies?

For these questions, Siren's ability to move from dashboards to link analysis is key.

I simply dragged and dropped the filtered Investments dashboards and I can quickly see how they connect

Elasticsearch aggregations for big (graph) link analysis 

Elasticsearch aggregations can be used on demand to summarize the graph. The sidebar of the link analysis visualization allows you to choose the aggregation criteria to display edges which summarize (e.g., count/rollup) all the nodes between two entities. 

In this example we’re counting the number of articles which co-mention the two companies, but also the significance of the co-mention, as output by the incredibly useful significant terms aggregation in Elasticsearch.


Link analysis: Articles displayed as aggregate links between companies
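
The counting half of this summary can be sketched in a few lines of Python. This is only an in-memory illustration of what an aggregated edge represents, not how Siren or Elasticsearch compute it, and it does not reproduce the significance score from the significant terms aggregation. The `aggregate_edges` function, the `companies` field, and the sample articles are hypothetical.

```python
from collections import Counter
from itertools import combinations


def aggregate_edges(articles):
    """Count, for each pair of companies, how many articles mention both."""
    counts = Counter()
    for article in articles:
        # De-duplicate and sort so that (A, B) and (B, A) land on the same edge.
        companies = sorted(set(article["companies"]))
        for pair in combinations(companies, 2):
            counts[pair] += 1
    return counts


# Hypothetical articles, each tagged with the companies it mentions.
articles = [
    {"companies": ["Acme", "Globex"]},
    {"companies": ["Globex", "Acme", "Initech"]},
    {"companies": ["Initech"]},
]

edges = aggregate_edges(articles)
print(edges[("Acme", "Globex")])
```

Each entry in `edges` corresponds to one aggregate link in the graph: a single edge standing in for all the article nodes that would otherwise sit between two companies.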

Efficient shortest path queries in Elasticsearch

Finding the shortest and most significant path across connected records (phone calls, messages, social links) is a typical example of a widely used investigative graph algorithm.

Efficient shortest path in Elasticsearch is another operation made possible by the Siren Federate plugin technology. Here it is in action finding connections between two users, six phone hops away.
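
For readers unfamiliar with the underlying idea, here is a textbook breadth-first search over an in-memory adjacency list. It is a generic sketch of unweighted shortest path, not the Siren Federate implementation, and it does not account for path significance. The `shortest_path` function and the call graph are hypothetical.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return the shortest list of nodes from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node we reached it from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in previous:
                continue  # already visited
            previous[neighbour] = node
            if neighbour == goal:
                # Walk the breadcrumbs back to the start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical call graph: who has phoned whom.
calls = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol"],
    "erin": [],
}

print(shortest_path(calls, "alice", "dave"))
```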

Another very important investigative graph algorithm is the ability to find a “common communicator” among nodes. In the following screenshots we find that a common communicator exists between these three companies. 

Identifying Microsoft as the common communicator as mentioned in articles with other companies
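
In its simplest form, a common communicator is a node that is directly connected to every node in a selected group, which amounts to intersecting neighbour sets. The sketch below illustrates that reading in plain Python. It is an assumption about the concept rather than a description of Siren's algorithm, and the `common_communicators` function and every company name other than Microsoft are hypothetical.

```python
def common_communicators(graph, nodes):
    """Return the nodes that are direct neighbours of every node in `nodes`."""
    neighbour_sets = [set(graph.get(n, ())) for n in nodes]
    if not neighbour_sets:
        return set()
    # Exclude the selected nodes themselves from the result.
    return set.intersection(*neighbour_sets) - set(nodes)


# Hypothetical co-mention graph: company -> companies it appears with in articles.
mentions = {
    "Acme": ["Microsoft", "Globex"],
    "Globex": ["Microsoft", "Acme"],
    "Initech": ["Microsoft"],
}

print(common_communicators(mentions, ["Acme", "Globex", "Initech"]))
```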

Web services and advanced geo/time/spatial analysis

Sometimes it’s not possible to get an answer with a simple operation, such as a shared key join. Siren supports calling remote web services and fitting their results back into the data model.

This capability can be used in many ways — for example, to pull in data on demand (e.g., to access remote knowledge or reference data) and ask for advanced computations on demand. 

Let’s take for example a COVID-19 simulation scenario: have two phones been in physical proximity for more than 15 minutes? In the next screenshot, Siren is configured to use a web service (which implements the complex logic required to deal with noisy and spotty data) and makes the results available for analysis.
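
To give a feel for the kind of computation such a web service might perform, here is a deliberately simplified Python sketch. It assumes clean, one-sample-per-minute location tracks, which real data rarely provides, and it is not the service used in the screenshot. The function names, the 10-metre distance threshold, and the coordinates are all hypothetical.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    earth_radius_m = 6_371_000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


def longest_contact_minutes(track_a, track_b, max_distance_m=10.0):
    """Longest run of consecutive minutes in which both phones were within max_distance_m.

    Each track maps a minute number to a (lat, lon) tuple.
    """
    longest = run = 0
    last_close = None
    for minute in sorted(track_a.keys() & track_b.keys()):
        if haversine_m(*track_a[minute], *track_b[minute]) <= max_distance_m:
            # Extend the run only if the previous minute was also close.
            run = run + 1 if last_close == minute - 1 else 1
            last_close = minute
            longest = max(longest, run)
    return longest


# Hypothetical tracks: together for minutes 0-15, then phone B moves away.
phone_a = {m: (53.2707, -9.0568) for m in range(20)}
phone_b = {m: (53.2707, -9.0568) if m < 16 else (53.2800, -9.0568) for m in range(20)}

minutes = longest_contact_minutes(phone_a, phone_b)
print(minutes, minutes > 15)
```

A production service would also need to handle gaps, location accuracy, and irregular sampling, which is exactly the complexity the text above refers to.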

As expected, geo time analysis is critical in investigative intelligence, and Siren builds on the extraordinary geo capabilities of Elasticsearch to provide this in an analyst-interactive way. The following screenshots illustrate some of the capabilities, which include graph over time evolution and analyst-activated Elasticsearch stored layers.

Using the timeline mode to view spatial and temporal data

Conclusion 

Elasticsearch is the ideal centerpiece backend for large-scale, interactive structured and unstructured data analysis. It was a natural foundational choice for Siren in its mission to provide a unified intelligence analytics experience — and connect disjointed data.

Interested? Try Siren now with our freely available Siren Community Edition and our getting started tutorial.

Learn more

Elastic and Siren: Protecting people, assets, and networks (video)

About Siren

Siren provides investigative intelligence based on Elasticsearch to some of the world’s largest and most complex organizations. 

About Dr. Giovanni Tummarello

Giovanni Tummarello, Ph.D., is a computer scientist and entrepreneur, and co-founder and Chief Product Officer at Siren.io. He led the team at the National University of Ireland Galway researching Knowledge Graphs, Search Engines, and related UI/UX, which then spun off into Siren. Previously, while at the FBK Institute in Trento, Italy, he led a Semantic Web team and co-founded the business information company Spaziodati.eu.