As of Elasticsearch 8.15, and on Elasticsearch Serverless, ES|QL responses support the Apache Arrow streaming format. This blog post will show you how to take advantage of it in Python. In an earlier blog post, ES|QL to Pandas dataframes in Python, I demonstrated how to convert ES|QL queries to Pandas dataframes using CSV as an intermediate representation. Unfortunately, CSV requires explicit type declarations, is slow (especially for larger datasets), and does not handle nested arrays and objects. Apache Arrow lifts all these limitations.
Importing test data
First, let's import some test data. As before, we will be using the employees
sample data and mappings. The easiest way to load this dataset is to run these two Elasticsearch API requests in the Kibana Console.
Converting dataset to a Pandas DataFrame object
OK, with that out of the way, let's convert the full employees
dataset to a Pandas DataFrame object using the ES|QL Arrow export:
from elasticsearch import Elasticsearch
import pandas as pd

client = Elasticsearch(
    "https://[host].elastic-cloud.com",
    api_key="...",
)

response = client.esql.query(
    query="""
    FROM employees
    | DROP is_rehired,job_positions,salary_change*
    | LIMIT 500
    """,
    format="arrow",
)

df = response.to_pandas(types_mapper=pd.ArrowDtype)
print(df)
Even though this dataset only contains 100 records, we use a LIMIT
command to avoid ES|QL warning us about potentially missing records. This prints the following dataframe:
avg_worked_seconds birth_date ... salary still_hired
0 268728049 1953-09-02 00:00:00 ... 57305 True
1 328922887 1964-06-02 00:00:00 ... 56371 True
2 200296405 1959-12-03 00:00:00 ... 61805 False
3 311267831 1954-05-01 00:00:00 ... 36174 True
4 244294991 1955-01-21 00:00:00 ... 63528 True
.. ... ... ... ... ...
95 204381503 1954-09-16 00:00:00 ... 43889 False
96 206258084 1952-02-27 00:00:00 ... 71165 False
97 272392146 1961-09-23 00:00:00 ... 44817 False
98 377713748 1956-05-25 00:00:00 ... 73578 True
99 223910853 1953-04-21 00:00:00 ... 68431 True
[100 rows x 17 columns]
OK, so what actually happened here?
- Given format="arrow", Elasticsearch returns binary Arrow streaming data.
- The Elasticsearch Python client looks at the Content-Type header and creates a PyArrow object.
- Finally, PyArrow's Pandas integration converts the PyArrow object to a Pandas dataframe.
Note that the types_mapper=pd.ArrowDtype
parameter asks Pandas to use a PyArrow backend instead of a NumPy backend, since the source data is PyArrow. While this backend is not enabled by default for compatibility reasons, it has many advantages: it handles missing values, is faster, more interoperable and supports more types. (This is not a zero-copy conversion, however.)
For this example to work, the Pandas and PyArrow optional dependencies need to be installed. If you want to use another dataframe library such as Polars instead, you don't need Pandas and can directly use polars.from_arrow
to create a Polars DataFrame from the PyArrow table returned by the Elasticsearch client.
One limitation is that the ES|QL Arrow format does not currently support multi-valued fields, which is why we had to drop the is_rehired
, job_positions
and salary_change
columns. This limitation will be lifted in a future version of Elasticsearch.
Anyway, you now have a Pandas dataframe that you can use to analyze your data further. But you can also continue massaging the data using ES|QL, which is particularly useful when queries return more than 10,000 rows, the current maximum number of rows that ES|QL queries can return.
More complex queries
In the next example, we count how many employees speak a given number of languages using STATS ... BY
(not unlike GROUP BY
in SQL), and then sort the result on the languages
column using SORT
:
response = client.esql.query(
    query="""
    FROM employees
    | DROP is_rehired,job_positions,salary_change*
    | STATS count = COUNT(emp_no) BY languages
    | SORT languages
    | LIMIT 500
    """,
    format="arrow",
)

df = response.to_pandas(types_mapper=pd.ArrowDtype)
print(df)
Unlike with CSV, we did not have to specify any types, as Arrow data already includes types. Here's the result:
count languages
0 15 1
1 19 2
2 17 3
3 18 4
4 21 5
5 10 <NA>
21 employees speak 5 languages, wow! And 10 employees did not declare any spoken language. The missing value is denoted by <NA>
, which is consistently used for missing data with the PyArrow backend. If we had used the NumPy backend instead, this column would have been converted to floats and the missing value would have been a confusing NaN
, as NumPy integers don't have any sentinel value for missing data.
Queries with parameters
Finally, suppose that you want to expand the query from the previous section to only consider employees that speak N or more languages, with N being a variable parameter. For this we can use ES|QL's built-in support for parameters, which eliminates the risk of an injection attack associated with manually assembling queries with variable parts:
response = client.esql.query(
    query="""
    FROM employees
    | DROP is_rehired,job_positions,salary_change*
    | STATS count = COUNT(emp_no) BY languages
    | WHERE languages >= (?)
    | SORT languages
    | LIMIT 500
    """,
    format="arrow",
    params=[3],
)

df = response.to_pandas(types_mapper=pd.ArrowDtype)
print(df)
which prints the following:
count languages
0 17 3
1 18 4
2 21 5
Conclusion
As we saw, ES|QL's native Arrow support makes working with Pandas and other DataFrame libraries even nicer than using CSV, and it will continue to improve over time, with multi-valued field support coming in a future version of Elasticsearch.
Additional resources
If you want to learn more about ES|QL, the ES|QL documentation is the best place to start. You can also check out this other Python example using Boston Celtics data. To know more about the Python Elasticsearch client itself, you can refer to the documentation, ask a question on Discuss with the language-clients tag or open a new issue if you found a bug or have a feature request. Thank you!