July 19, 2018

An Introduction to Elasticsearch SQL with Practical Examples - Part 1

Want to learn more about the differences between the Amazon Elasticsearch Service and our official Elasticsearch Service? Visit our AWS Elasticsearch comparison page.

The version 6.3 release of the Elastic Stack represented one of the most feature rich releases in recent memory. If the opening of X-Pack and addition of rollups weren’t enough to get you excited, we announced the addition of SQL support as a experimental feature. This represents one of the most sought after and requested features in Elasticsearch’s relatively short history.

In this blog series, we’ll explore some of the features and capabilities currently supported by Elasticsearch SQL. We will also openly discuss a few limitations and touch on plans for the future.

Target Audience

Historically at Elastic, there was a hesitance to add SQL to the product for a number of a reasons. Common questions included:

What would SQL support even constitute?

What features would we support? More commonly, this was articulated as “Do we support JOINs?

Expressions/Functions? Grouping?”

Do we need to support JDBC/ODBC connectivity?

Do we even need to support SQL or are we simply failing to provide sufficient material for new users to familiarise themselves with our own Domain Specific Language (DSL)?

After several initiatives we successfully narrowed down the required features to something we felt would be useful to our users. Based on discussions with our users, we believe SQL will be most valuable to two main audiences:

New users to the stack who might find the Elasticsearch DSL a little overwhelming as a starting point, or simply don’t need to learn the full syntax. For example, a user might be converting an existing SQL-based application for performance and scalability reasons, and simply wants the equivalent query without having to learn the full syntax. We also appreciate a common learning strategy to learning new “languages”, is to find equivalence based on what you already know.
Data consumers who really have no desire or need to learn the full Elasticsearch DSL. These users might be data scientists who simply want to extract the data for external processing. Alternatively they may be with less technical BI users who are broadly familiar with SQL and use this on a day-to-day basis.

Aside from the above audiences, there is also no refuting that, as a declarative language, SQL is an extremely attractive paradigm to all users, as this blog series will often illustrate. The prevalence of SQL is no doubt based on this ability to express the logic of the computation, and what you are trying to accomplish, without needing to define its control flow. Additionally, as we will illustrate, certain SQL queries elegantly define a problem that can be quite verbose using the Elasticsearch DSL equivalent. Where as the Elasticsearch DSL elegantly describes full text search problems, SQL can be more effective at describing structured analytics based queries.

What Elasticsearch SQL is, What Elasticsearch SQL isn’t…

Elasticsearch SQL — in this initial release — provides a read-only interface that complies with a subset of the ANSI SQL specification and allows Elasticsearch to be exposed as a tabular source. We also provide additional operators beyond this specification, exposing capabilities unique to Elasticsearch in comparison to RDBMS-based implementations. We’ve aimed to provide a lightweight and fast implementation that minimises external dependencies and moving parts. In no way does this first offering transform Elasticsearch into a fully relational database (with its associated properties) or replace the need for data modeling. Whilst some data manipulation functions and expressions are implemented by the SQL plugin, a principle of push down is adhered to whenever the result count and order is impacted or grouping is requested. This limits the current processing of data in the Elasticsearch SQL plugin to result manipulation only (e.g. functions on fields) and the client (JDBC driver/CLI or browser) to rendering. This approach exploits the scalability and performance of Elasticsearch, allowing it to do the heavy lifting.

Mapping Concepts: Indexes and Documents vs. Tables and Rows

In the early days of Elasticsearch, indexes and types were often referred to as analogous to RDBMS databases and tables, mainly to help users understand and work with potentially new concepts. As explained Elasticsearch 6.3 documentation, this was both an incorrect and potentially dangerous analogy. Although we are removing types, we still need a correct and usable logical equivalence between Elasticsearch’s schema-less document oriented model and SQL’s strongly typed concepts.

Fortunately, Elasticsearch indexes, like RDBMS tables, are physically isolated and should be used in broadly the same manner (i.e. to store related data). Rows and documents are also a natural analogy in that they provide a mechanism to group fields/columns — although a row tends to be strict (and have more enforcements) while a document tends to be a bit more flexible or loose (while still having a structure). Fields in Elasticsearch represent a name entry and support various data types, potentially containing multiple values (i.e. a list). With the exception of these multi-value fields, this concept maps directly to SQL columns. Note: If you attempt a SELECT on a multi-valued field, you will receive an error at query time.

Other concepts do not map as directly — a SQL database and Elasticsearch cluster have only loose equivalence. These, however, will not generally impact a user of Elasticsearch SQL. Further details on this topic can be found in our Mapping concepts across SQL and Elasticsearch documentation.

In summary, use the index name in your WHERE clause to target a specific index/table. You can expect unique documents to be returned as rows with fields mapped as columns. Due to this largely transparent mapping, we use these terms interchangeably moving forward.

Implementation Internals

The Elasticsearch SQL implementation consists of 4 execution phases:

The Parsing phase is responsible for converting the SQL query into an Abstract Syntax Tree (AST). Any syntax validation is performed at this stage, before the analyzer validates the AST and resolves any tables, columns, functions, aliases and namespaces to produce a logical plan. This plan is optimised including the removal of any redundant expressions, before being converted to a physical plan (i.e. Elasticsearch DSL) for execution. The Query Executor then runs the actual query and streams the results to the client, performing any type and tabular conversion as required e.g. aggregation tree to table.

Methods of Connectivity

In providing any SQL solution, connectivity support becomes a paramount consideration. Whilst a REST only interface might be acceptable for some users, most expect to be able to connect through standard interfaces — typically JDBC and ODBC. Whilst ODBC support is planned, and being actively developed, JDBC is available with this first release and available for download.

Importantly, all communication with these drivers will remain over HTTP through our REST interface. This has several compelling advantages:

Granting SQL access to your users is no different than opening and exposing an Elasticsearch port with it integrating natively with security. We are therefore also able to support SQL immediately in our hosted Elasticsearch Service on Elastic Cloud, with existing users able to combine with access control permissions OOTB.
This allow us to utilise SQL through the REST interface directly, and release an additional CLI client for convenience. We expect the latter to be particularly popular with administrators, who will be familiar with the command line means of interaction common in RDBMS.

The JDBC driver utilises the newly created XContent library responsible for query and response parsing (historically this code was tightly coupled with Elasticsearch). This avoids the need for the driver to depend on all of the Elasticsearch libraries, thus ensuring it remains lightweight and portable. This decoupling will continue to be improved in future releases, resulting in a smaller and faster driver.

Some Simple Examples

Let’s look at some examples that utilise a mixture of CLI and REST API. For our examples, we will use a sample dataset soon to be distributed with Kibana. If you can’t wait for the release, this flight dataset is also available at demo.elastic.co, where you can execute the following examples through the Kibana console. We provide links to demo.elastic.co, which should auto populate with the relevant query, throughout the blog. Alternatively, we provide the full list of queries to execute in the demo Kibana console. In some cases results may vary if there is no explicit ordering or restriction to the example query, due to the natural ordering of results in Elasticsearch when there is no relevancy or sort order applied.

Retrieving Elasticsearch Schema Information: DSL vs. SQL

Let’s first identify the schema of the table/index, and the fields available for us to play with. We’ll do this via the REST interface:

Request

POST _xpack/sql
{
  "query":"DESCRIBE flights"
}

Try on demo.elastic.co

Response

{
  "columns": [
    {
      "name": "column",
      "type": "keyword"
    },
    {
      "name": "type",
      "type": "keyword"
    }
  ],
  "rows": [
    [ "AvgTicketPrice", "REAL" ],
    [ "Cancelled", "BOOLEAN" ],
    [ "Carrier", "VARCHAR" ],
    [ "Dest", "VARCHAR" ],
    [ "DestAirportID", "VARCHAR" ],
    [ "DestCityName", "VARCHAR" ],
    [ "DestCountry", "VARCHAR" ],
    [ "DestLocation", "OTHER" ],
    [ "DestRegion", "VARCHAR" ],
    [ "DestWeather", "VARCHAR" ],
    [ "DistanceKilometers", "REAL" ],
    [ "DistanceMiles", "REAL" ],
    [ "FlightDelay", "BOOLEAN" ],
    [ "FlightDelayMin", "INTEGER" ],
    [ "FlightDelayType", "VARCHAR" ],
    [ "FlightNum", "VARCHAR" ],
    [ "FlightTimeHour", "VARCHAR" ],
    [ "FlightTimeMin", "REAL" ],
    [ "Origin", "VARCHAR" ],
    [ "OriginAirportID", "VARCHAR" ],
    [ "OriginCityName", "VARCHAR" ],
    [ "OriginCountry", "VARCHAR" ],
    [ "OriginLocation", "OTHER" ],
    [ "OriginRegion", "VARCHAR" ],
    [ "OriginWeather", "VARCHAR" ],
    [ "dayOfWeek", "INTEGER" ],
    [ "timestamp", "TIMESTAMP" ]
  ]
}

The above response can also be formatted in a tabular form through the url parameter ?format=txt. For example:

POST _xpack/sql?format=txt 
{
  "query":"DESCRIBE flights"
}

Try on demo.elastic.co

          
          column          |     type      
--------------------------+---------------
AvgTicketPrice            |REAL           
Cancelled                 |BOOLEAN        
Carrier                   |VARCHAR        
Carrier.keyword           |VARCHAR        
Dest                      |VARCHAR        
Dest.keyword              |VARCHAR        
DestAirportID             |VARCHAR        
DestAirportID.keyword     |VARCHAR        
DestCityName              |VARCHAR        
DestCityName.keyword      |VARCHAR        
DestCountry               |VARCHAR        
DestCountry.keyword       |VARCHAR        
DestLocation              |STRUCT         
DestLocation.lat          |VARCHAR        
DestLocation.lat.keyword  |VARCHAR        
DestLocation.lon          |VARCHAR        
DestLocation.lon.keyword  |VARCHAR        
DestRegion                |VARCHAR        
DestRegion.keyword        |VARCHAR        
DestWeather               |VARCHAR        
DestWeather.keyword       |VARCHAR        
DistanceKilometers        |REAL           
DistanceMiles             |REAL           
FlightDelay               |BOOLEAN        
FlightDelayMin            |BIGINT         
FlightDelayType           |VARCHAR        
FlightDelayType.keyword   |VARCHAR        
FlightNum                 |VARCHAR        
FlightNum.keyword         |VARCHAR        
FlightTimeHour            |REAL           
FlightTimeMin             |REAL           
Origin                    |VARCHAR        
Origin.keyword            |VARCHAR        
OriginAirportID           |VARCHAR        
OriginAirportID.keyword   |VARCHAR        
OriginCityName            |VARCHAR        
OriginCityName.keyword    |VARCHAR        
OriginCountry             |VARCHAR        
OriginCountry.keyword     |VARCHAR        
OriginLocation            |STRUCT         
OriginLocation.lat        |VARCHAR        
OriginLocation.lat.keyword|VARCHAR        
OriginLocation.lon        |VARCHAR        
OriginLocation.lon.keyword|VARCHAR        
OriginRegion              |VARCHAR        
OriginRegion.keyword      |VARCHAR        
OriginWeather             |VARCHAR        
OriginWeather.keyword     |VARCHAR        
dayOfWeek                 |BIGINT         
timestamp                 |TIMESTAMP

Moving forward we use the tabular response structure shown above, whenever providing an example response from the REST api. Achieving the same query via the console requires us to logon using:

./elasticsearch-sql-cli http://elastic@localhost:9200

After responding to the password prompt...

sql> DESCRIBE flights;
column      |     type
------------------+---------------
AvgTicketPrice    |REAL
Cancelled         |BOOLEAN
Carrier           |VARCHAR
Dest              |VARCHAR
DestAirportID     |VARCHAR
DestCityName      |VARCHAR
DestCountry       |VARCHAR
DestLocation      |OTHER
DestRegion        |VARCHAR
DestWeather       |VARCHAR
DistanceKilometers|REAL
DistanceMiles     |REAL
FlightDelay       |BOOLEAN
FlightDelayMin    |INTEGER
FlightDelayType   |VARCHAR
FlightNum         |VARCHAR
FlightTimeHour    |VARCHAR
FlightTimeMin     |REAL
Origin            |VARCHAR
OriginAirportID   |VARCHAR
OriginCityName    |VARCHAR
OriginCountry     |VARCHAR
OriginLocation    |OTHER
OriginRegion      |VARCHAR
OriginWeather     |VARCHAR
dayOfWeek         |INTEGER
timestamp         |TIMESTAMP
sql>

The above schema is also returned with any query for the fields being displayed in the SELECT clause, thus providing any potential driver with necessary type information required to format or operate on the results. For example, consider a simple SELECT with a LIMIT clause to keep the response short. By default we return 1000 rows.

Simple SELECT

POST _xpack/sql?format=txt
{
  "query":"SELECT FlightNum FROM flights LIMIT 1"
}

Try on demo.elastic.co (results may vary)

   FlightNum   
---------------
1Y0TZOE

This same REST request/response is consumed by the JDBC driver and console but concealed to the user.

sql> SELECT OriginCountry, OriginCityName FROM flights LIMIT 1;
OriginCountry |OriginCityName
---------------+---------------
US      |San Diego

Try on demo.elastic.co (results may vary)

Note if at any time you request a field that does not exist (case sensitive), the semantics of a tabular and strongly typed store mean an error will be returned - this differs from Elasticsearch behaviour where the field will simply not be returned. For example, modifying the above to use the field “OrigincityName” instead of “OriginCityName” results in a helpful error message:

{
  "error": {
    "root_cause": [
      {
        "type": "verification_exception",
        "reason": "Found 1 problem(s)\nline 1:8: Unknown column [OrigincityName], did you mean any of [OriginCityName, DestCityName]?"
      }
    ],
    "type": "verification_exception",
    "reason": "Found 1 problem(s)\nline 1:8: Unknown column [OrigincityName], did you mean any of [OriginCityName, DestCityName]?"
  },
  "status": 400
}

Try on demo.elastic.co

Likewise if we attempted to use a function or expression on a field that isn’t compatible, we are presented with an appropriate error. In general the Analyser will fail early when validating the AST. In order to achieve this, Elasticsearch must be aware of the index mapping and capabilities of each field. For this reason, any client accessing the SQL interface with security will need the appropriate permissions.

If we continue to provide every request and accompanying response, we’ll end up with an extremely long blog post! For the sake of brevity, here are a few queries of increasing complexity with notes of interest.

SELECT with WHERE and ORDER BY

“Find the 10 longest US based flights where the flight time is greater than 5 hrs.”

POST _xpack/sql?format=txt
{
  "query":"SELECT OriginCityName, DestCityName FROM flights WHERE FlightTimeHour > 5 AND OriginCountry='US' ORDER BY FlightTimeHour DESC LIMIT 10"
}

Try on demo.elastic.co

OriginCityName | DestCityName  
---------------+---------------
Atlanta        |Durban         
Louisville     |Melbourne      
Peoria         |Melbourne      
Albuquerque    |Durban         
Birmingham     |Durban         
Bangor         |Brisbane       
Seattle        |Durban         
Huntsville     |Sydney         
Savannah       |Shanghai       
Philadelphia   |Xi'an

The operator for restricting row counts varies per SQL implementation. For Elasticsearch SQL, we are consistent with Postgresql/Mysql in implementing the LIMIT operator.

Math

Just some random math...

sql> SELECT ((1 + 3) * 1.5 / (7 - 6)) * 2 AS random;
    random
---------------
12.0

Try on demo.elastic.co

This represents an example of where the server side performs some post processing for functions. This has no equivalent Elasticsearch DSL query.

Functions & Expressions

“Find all flights, after the month of June, where the flight time is greater than 5 hrs and order by the longest.”

POST _xpack/sql?format=txt
{
  "query":"SELECT MONTH_OF_YEAR(timestamp), OriginCityName, DestCityName FROM flights WHERE FlightTimeHour > 5 AND MONTH_OF_YEAR(timestamp) > 6 ORDER BY FlightTimeHour DESC LIMIT 10"
}

Try on demo.elastic.co

MONTH_OF_YEAR(timestamp [UTC])|OriginCityName |   DestCityName    
------------------------------+---------------+-------------------
7                             |Buenos Aires   |Shanghai           
7                             |Stockholm      |Sydney             
7                             |Chengdu        |Bogota             
7                             |Adelaide       |Cagliari           
7                             |Osaka          |Buenos Aires       
7                             |Buenos Aires   |Chitose / Tomakomai
7                             |Buenos Aires   |Shanghai           
7                             |Adelaide       |Washington         
7                             |Osaka          |Quito              
7                             |Buenos Aires   |Xi'an

These functions would typically require writing Painless for the equivalent result in Elasticsearch, whilst SQL’s functional declaratives avoid any scripting whatsoever. Notice also how we can use the function in both the WHERE and SELECT clause. The WHERE clause component is pushed down to Elasticsearch as it impacts the result count. The SELECT function, however, is handled by the server side plugin on presentation.

Note that a list of available functions can be retrieved with “SHOW FUNCTIONS”

Try on demo.elastic.co

Combining this with our earlier math ability, we can start to formulate queries which would be complex to articulate for most users in DSL.

“Find the distance and average speed of the fastest 2 flights (velocity) which leave on monday, tuesday or wednesday, between 9am and 11am, and are also greater than 500 km. Round the distance and speed to the nearest integer. If speed is equal, show the longest first.”

Try on demo.elastic.co

sql> SELECT timestamp, FlightNum, OriginCityName, DestCityName, ROUND(DistanceMiles) AS distance, ROUND(DistanceMiles/FlightTimeHour) AS speed, DAY_OF_WEEK(timestamp) AS day_of_week FROM flights WHERE DAY_OF_WEEK(timestamp) >= 0 AND DAY_OF_WEEK(timestamp) <= 2 AND HOUR_OF_DAY(timestamp) >=9 AND HOUR_OF_DAY(timestamp) <= 10 ORDER BY speed DESC, distance DESC LIMIT 2;
       timestamp        |   FlightNum   |OriginCityName | DestCityName  |   distance    |     speed     |  day_of_week
------------------------+---------------+---------------+---------------+---------------+---------------+---------------
2018-07-03T10:03:11.000Z|REPKGRT        |Melbourne      |Norfolk        |10199          |783            |2
2018-06-05T09:18:29.000Z|J72Y2HS        |Dubai          |Lima           |9219           |783            |2

A rather convoluted and odd question but hopefully you see the point. Notice also how we create field aliases and refer to these in the ORDER BY clause.

Also note there is no requirement to specify all fields in the SELECT clause that are used in the WHERE and ORDER BY. This likely differs from the SQL implementations you’ve used in the past. For example, the following is perfectly valid:

POST _xpack/sql
{
  "query":"SELECT timestamp, FlightNum FROM flights WHERE AvgTicketPrice > 500 ORDER BY AvgTicketPrice"
}

Try on demo.elastic.co

Translate SQL Queries to DSL

We’ve all had a SQL query we’ve struggled to express in Elasticsearch DSL or wondered if it is as optimal as it could be. One of the compelling features of the new SQL interface is its ability to assist new adopters of Elasticsearch with such problems. Using the REST interface we simply append /translate to the “sql” endpoint, to obtain the Elasticsearch query the driver would issue.

Let’s consider a few of the previous queries:

POST _xpack/sql/translate
{
  "query":"SELECT OriginCityName, DestCityName FROM flights WHERE FlightTimeHour > 5 AND OriginCountry='US' ORDER BY FlightTimeHour DESC LIMIT 10"
}

Try on demo.elastic.co

The equivalent DSL should be fairly obvious to any experienced user of Elasticsearch:

{
  "size": 10,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "FlightTimeHour": {
              "from": 5,
              "to": null,
              "include_lower": false,
              "include_upper": false,
              "boost": 1
            }
          }
        },
        {
          "term": {
            "OriginCountry.keyword": {
              "value": "US",
              "boost": 1
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": {
    "includes": [
      "OriginCityName",
      "DestCityName"
    ],
    "excludes": []
  },
  "sort": [
    {
      "FlightTimeHour": {
        "order": "desc"
      }
    }
  ]
}

The WHERE clause is converted into range and term queries as you would expect. Notice how the OriginCountry.keyword variant of the sub field is used for the exact term match vs the parent OriginCountry (which is of type text). The user is not required to know the differences in behaviour of the underlying mapping - the correct field type was selected automatically. Interestingly, the interface attempts to optimise retrieval performance by using docvalue_fields over the _source when available i.e. for exact types (numerics, dates, keywords) with doc values enabled. We can rely on Elasticsearch SQL to generate the most optimal DSL for the query specified.

Now consider the most complex query we used last time:

POST _xpack/sql/translate
{
  "query":"SELECT timestamp, FlightNum, OriginCityName, DestCityName, ROUND(DistanceMiles) AS distance, ROUND(DistanceMiles/FlightTimeHour) AS speed, DAY_OF_WEEK(timestamp) AS day_of_week FROM flights WHERE DAY_OF_WEEK(timestamp) >= 0 AND DAY_OF_WEEK(timestamp) <= 2 AND HOUR_OF_DAY(timestamp) >=9 AND HOUR_OF_DAY(timestamp) <= 10 ORDER BY speed DESC, distance DESC LIMIT 2"
}

Try on demo.elastic.co

And the response…

{
  "size": 2,
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "filter": [
              {
                "script": {
                  "script": {
                    "source": "(params.v0 <= doc[params.v1].value.getDayOfWeek()) && (doc[params.v2].value.getDayOfWeek() <= params.v3)",
                    "lang": "painless",
                    "params": {
                      "v0": 0,
                      "v1": "timestamp",
                      "v2": "timestamp",
                      "v3": 2
                    }
                  },
                  "boost": 1
                }
              },
              {
                "script": {
                  "script": {
                    "source": "doc[params.v0].value.getHourOfDay() >= params.v1",
                    "lang": "painless",
                    "params": {
                      "v0": "timestamp",
                      "v1": 9
                    }
                  },
                  "boost": 1
                }
              }
            ],
            "adjust_pure_negative": true,
            "boost": 1
          }
        },
        {
          "script": {
            "script": {
              "source": "doc[params.v0].value.getHourOfDay() <= params.v1",
              "lang": "painless",
              "params": {
                "v0": "timestamp",
                "v1": 10
              }
            },
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": false,
  "stored_fields": "_none_",
  "docvalue_fields": [
    "timestamp",
    "FlightNum",
    "OriginCityName",
    "DestCityName",
    "DistanceMiles",
    "FlightTimeHour"
  ],
  "sort": [
    {
      "_script": {
        "script": {
          "source": "Math.round((doc[params.v0].value) / (doc[params.v1].value))",
          "lang": "painless",
          "params": {
            "v0": "DistanceMiles",
            "v1": "FlightTimeHour"
          }
        },
        "type": "number",
        "order": "desc"
      }
    },
    {
      "_script": {
        "script": {
          "source": "Math.round(doc[params.v0].value)",
          "lang": "painless",
          "params": {
            "v0": "DistanceMiles"
          }
        },
        "type": "number",
        "order": "desc"
      }
    }
  ]
}

Our clauses for WHERE and ORDER BY have been converted into Painless scripts, and utilised in the sort and script query provided by Elasticsearch. These scripts are even parameterised to avoid compilations and exploit script caching.

As a side note, whilst the above represents the most optimal translation for the SQL statement it does not represent the best solution to the wider problem. In reality we would want to encode the day of the week, hour of the day and speed on the document at index time — thus allowing us to just use simple range queries. This is likely to be more performant than using Painless scripts to solve this particular problem. Some of these fields are actually even present on the document already for this reason. This is a common theme users should be cautious of: Whilst we can rely on the Elasticsearch SQL implementation to provide us with the optimal translation, it can only utilise the fields specified in the query and thus can’t necessarily provide an optimal solution to the larger problem. Considering the strengths of the underlying platform is required in order to achieve the optimal approach, and the _translate API can be the first step in this process.

Next Time

In A Practical Introduction to Elasticsearch Part II, we will continue to use the _translate API as we demonstrate some more complex Elasticsearch SQL features. We will also look at some of the limitations a user is likely to encounter in this first release. After that, I’ll provide a teaser of what is on roadmap!