This Week in Elasticsearch and Apache Lucene - 2020-03-06
Cloud Identity Provider
We are continuing to work on the Identity Provider (IdP) to enable single sign-on between our Cloud UI (console) and deployed applications, e.g. Kibana. We’re getting close to having something that we can merge into master and 7.x. We have recently added:
- Storing Service Providers as documents in an ES index in the security cluster (with automatic index creation & template setup)
- An API to add new Service Providers to that index
- Looking up a Service Provider from that index during SSO
- Parsing and validating SP initiated authentication requests
- Creating IdP metadata
- Processing secondary authentication (end user) credentials during SSO
Searchable Snapshots
We continue to work on searchable snapshots, discussing use cases, the technical building blocks, the features to be built out of this, and how it fits into the broader Elastic ecosystem. The initial deliverables are focused on providing a new cheaper warm tier on our hosted Cloud service, while future phases of the project are geared towards a new cold tier with near unlimited scaling capabilities.
On the development front, we have added timing information to the stats we collect about retrieving data from the blob store when searching a snapshot, and have also added support for frozen searchable snapshots. By limiting the number of searchable snapshot shards that can be searched concurrently, we expect to deal more gracefully with datasets whose total size exceeds the size of the cache.
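The idea of capping concurrent shard searches can be sketched with a semaphore. This is a minimal illustration, not the Elasticsearch implementation: `ShardSearchLimiter`, `search_all`, and the `max_concurrent` parameter are hypothetical names, and the shard search itself is a caller-supplied function.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ShardSearchLimiter:
    """Caps how many shards are searched at the same time (hypothetical helper)."""

    def __init__(self, max_concurrent):
        self._permits = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous searches observed

    def search(self, shard_id, search_fn):
        # Block until one of the permits is free, so at most
        # max_concurrent shards compete for cache space at once.
        with self._permits:
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return search_fn(shard_id)
            finally:
                with self._lock:
                    self._active -= 1


def search_all(shard_ids, search_fn, max_concurrent=2, workers=8):
    """Searches every shard with a thread pool, but throttled by the limiter."""
    limiter = ShardSearchLimiter(max_concurrent)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda s: limiter.search(s, search_fn), shard_ids))
    return results, limiter.peak
```

With a limit well below the number of worker threads, only a bounded set of shards touches the cache at any moment, which is the property the paragraph above relies on.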
We are adding a dedicated API to mount an index from a snapshot into a cluster. This API, which takes the repository, snapshot, and index name as input, makes it easy to make an index in a snapshot available for search, and will serve as a basis for implementing ILM actions that transition regular indices to snapshot-powered indices.
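To make the three inputs concrete, here is a small sketch that assembles such a mount request. The API was still being added at the time of writing, so the `_mount` path segment and the `index` body field are assumptions modeled on the existing snapshot REST layout, and `build_mount_request` is a hypothetical helper.

```python
import json
from urllib.parse import quote


def build_mount_request(repository, snapshot, index):
    """Returns (method, path, body) for a hypothetical mount call.

    The endpoint shape is an assumption; only the three inputs
    (repository, snapshot, index name) come from the description above.
    """
    path = "/_snapshot/{}/{}/_mount".format(
        quote(repository, safe=""), quote(snapshot, safe="")
    )
    body = json.dumps({"index": index})
    return "POST", path, body
```

The function only builds the request, so it can be handed to whichever HTTP client is in use.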
We have been benchmarking our current searchable snapshots implementation using different datasets (pmc, nyc_taxis, geonames) over different challenges and with various cache size settings. With this, we were able to confirm that search queries on snapshot-backed indices perform similarly to queries on regular indices, as long as the cache is large enough to fully cache the shard data. We also investigated how the feature performs when only part of the data fits into the cache. Here, the results vary more, depending on a number of factors, and we're looking at ways to improve our current implementation. We also measured raw S3 download performance by running full-restore benchmarks, identified potential optimizations to make searchable snapshot restores faster (more parallelization; smaller downloads on checksum verifications; async fetch of larger files), and fixed a number of bugs in the process.
Create API Key on behalf of another user
We have an open PR that would allow one user (typically a system user) to create an API key on behalf of another user (typically an end-user).
The API requires credentials for the end-user (a username+password, or an access token), but the authorization check is performed against the system user rather than the end-user.
We intentionally do not allow users to create API keys by default. For example, we want to support situations where a cluster administrator requires users to authenticate via SAML using a second-factor device. If those users were allowed to create API keys, that would give them an alternative means of authenticating to the cluster that does not use the SAML identity provider and bypasses the intended security controls.
However, API keys have uses within Elastic products for storing a credential that allows a background process to act on behalf of a user. The primary example of this is the Kibana Alerting project. An alert that interacts with Elasticsearch needs to be able to authenticate to Elasticsearch as the user that created the alert, but should not (and may not be able to) store the user’s password. Because alerting uses that API internally and never exposes it to the user, the risk profile is different; it does not give the user a way to bypass the main authentication process.
So it’s OK to let Kibana create API keys for users that are not permitted to create API keys of their own. But we do not want Kibana to be able to create API keys for any user it chooses. If it could do that, it would be able to create an API key for the elastic user and gain superuser privileges. So this new API allows Kibana to create API keys for users who would not necessarily be able to do that on their own, but requires that they first authenticate to Kibana (so Kibana is effectively restricted to creating API keys only for users who are logged in at the time).
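A request for this kind of API has to carry the end-user's credentials alongside the key definition. The sketch below shows one way such a body could be put together. The PR was still open, so every field name here (`grant_type`, `api_key`, and so on) and the helper `build_grant_request` are assumptions for illustration only.

```python
def build_grant_request(key_name, *, username=None, password=None, access_token=None):
    """Builds a hypothetical request body for creating a key on behalf of a user.

    Exactly one kind of end-user credential must be supplied:
    username+password, or an access token.
    """
    if (username is None) != (password is None):
        raise ValueError("username and password must be supplied together")
    has_password = username is not None
    has_token = access_token is not None
    if has_password == has_token:
        raise ValueError("supply either username+password or an access token")

    if has_token:
        body = {"grant_type": "access_token", "access_token": access_token}
    else:
        body = {"grant_type": "password", "username": username, "password": password}
    # The key is owned by the end-user identified by the credentials above;
    # the caller (e.g. a system user) is the one being authorized for the call.
    body["api_key"] = {"name": key_name}
    return body
```

Requiring the end-user's credentials in the body is what prevents the calling system user from minting keys for arbitrary accounts.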
We fixed a regression in Console’s proxy that would always overwrite the “host” header forwarded from Kibana. This regression was introduced in 7.5.0 and the fix ships with 7.6.2.
We have recently been revisiting the geo_line aggregation prototype, which currently exists as a WIP commit, looking at how best to utilize our existing allocation-efficient data structures. GeoLine is an agg that can stitch together "consecutive" points to create a single line shape. Think of a cargo vessel with a beacon that reports GPS coordinates once per hour. If you stitch those individual datapoints together, you get the spatio-temporal path the ship has taken.
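The stitching step in the cargo vessel example can be sketched in a few lines: order the reported points by time and emit them as one line. This is only a conceptual illustration; `stitch_line` is a hypothetical function and the GeoJSON-style output is an assumed representation, not the aggregation's actual response format.

```python
def stitch_line(points):
    """Turns (timestamp, lon, lat) reports into a single line shape.

    Points may arrive in any order; they are connected in timestamp order.
    """
    ordered = sorted(points, key=lambda p: p[0])
    return {
        "type": "LineString",
        "coordinates": [[lon, lat] for _, lon, lat in ordered],
    }
```

For example, three hourly beacon reports delivered out of order still produce a path that follows the ship's actual route.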
We migrated the prototype to use primitive arrays within Object BigArrays that grow with the help of Lucene's array utils. These arrays are sorted with Lucene's IntroSorter. We still need to sync with the Lucene folks to understand whether that is the best sorter for the job; it was chosen for now because it is the simplest.
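The combination of growable primitive arrays and a swap-based sorter can be illustrated as follows. This is a sketch under stated assumptions: `PointBuffer` and `oversize` are hypothetical, the "grow by roughly one eighth" policy is an assumed over-allocation strategy, and a plain insertion sort stands in for an introsort to keep the example short.

```python
from array import array


def oversize(min_size):
    """Capacity to allocate for min_size elements: about 1/8 extra, at least 3 (assumed policy)."""
    return min_size + max(3, min_size >> 3)


class PointBuffer:
    """Parallel primitive arrays (sort key, lon, lat) that grow and sort together."""

    def __init__(self):
        self.size = 0
        self.sort_keys = array("q", [0] * 4)
        self.lons = array("d", [0.0] * 4)
        self.lats = array("d", [0.0] * 4)

    def _grow(self, min_size):
        # Over-allocate so that repeated appends do not resize every time.
        if min_size > len(self.sort_keys):
            extra = oversize(min_size) - len(self.sort_keys)
            self.sort_keys.extend([0] * extra)
            self.lons.extend([0.0] * extra)
            self.lats.extend([0.0] * extra)

    def add(self, key, lon, lat):
        self._grow(self.size + 1)
        i = self.size
        self.sort_keys[i], self.lons[i], self.lats[i] = key, lon, lat
        self.size += 1

    def _swap(self, i, j):
        # Swapping all arrays in lockstep keeps each point's fields aligned.
        for values in (self.sort_keys, self.lons, self.lats):
            values[i], values[j] = values[j], values[i]

    def sort(self):
        # Insertion sort driven only by compare-and-swap on slot indices,
        # the same interface shape a swap-based introsort would use.
        for i in range(1, self.size):
            j = i
            while j > 0 and self.sort_keys[j - 1] > self.sort_keys[j]:
                self._swap(j - 1, j)
                j -= 1
```

Because the sorter only needs "compare slot i with slot j" and "swap slot i with slot j", no per-point objects are allocated, which is the allocation-efficiency motivation mentioned above.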
Ingest: Performance and Retries
We continue work on the POC to batch shard operations together. Performance testing showed how poorly this interacts with the existing request-count-limited queues, and we are investigating solutions. Benchmarking also shows the effectiveness of batching fsyncs together when there are very high levels of concurrency.
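Why batching fsyncs pays off under high concurrency can be seen in a toy model: every operation that arrives while an fsync is in flight waits and then shares the next one. The function below is a simplified simulation with hypothetical names, not the POC's code, and it assumes a single fsync at a time with a fixed duration.

```python
def count_batched_fsyncs(arrivals, fsync_duration):
    """Counts fsyncs needed when pending operations share one fsync.

    arrivals: sorted arrival times of operations that each need durability.
    Without batching, the count would simply be len(arrivals).
    """
    fsyncs = 0
    i = 0
    disk_free_at = 0.0
    while i < len(arrivals):
        # The next fsync starts once the disk is free and an operation is waiting.
        start = max(disk_free_at, arrivals[i])
        # Every operation that has arrived by then is covered by this fsync.
        while i < len(arrivals) and arrivals[i] <= start:
            i += 1
        fsyncs += 1
        disk_free_at = start + fsync_duration
    return fsyncs
```

In this model the saving grows with concurrency: the more operations arrive during one fsync, the more of them the following fsync covers.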
We worked with a community contributor on changing the response code returned when indexing is blocked because disk usage exceeds the flood stage. The new code indicates that the condition is retryable, as the index block is automatically released once enough disk space is available again. This fixes an issue where our ingest clients were not retrying requests in this situation because of the erroneous response status code.
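From a client's point of view, the change means a blocked indexing request can be retried with backoff instead of being treated as a permanent failure. The sketch below assumes the retryable status is 429 (Too Many Requests); the post does not name the code, so treat that value, `index_with_retry`, and its parameters as assumptions.

```python
import time

# Assumption: the retryable status is 429; the post does not state the code.
RETRYABLE_STATUSES = {429}


def index_with_retry(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Calls send() until it returns a non-retryable status or attempts run out.

    send: zero-argument callable returning an HTTP status code.
    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    """
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status not in RETRYABLE_STATUSES:
            return status
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status
```

A status that is not in the retryable set is returned immediately, which is how a client would have behaved for every blocked request before the fix.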
Apache Lucene Highlights
We have created the Lucene 8.5 branch in preparation for the next release. We updated the Elasticsearch repository with a new snapshot from this branch, but unfortunately had to revert the change because it introduced a concurrency issue in the IndexWriter. We are working with the community to address the issue.
Merge on commit
When using concurrent indexing, IndexWriter accumulates many small segments which, once written to disk, add search-time cost because every search must visit all of those tiny segments. The idea of this change is to collect those small segments and merge them during commit, creating a more efficient index layout.
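The effect on the index layout can be shown with a toy model in which a segment is just its document count. This is not Lucene's merge policy: `commit_with_merge` and the `small_threshold` parameter are hypothetical, and real merge selection is considerably more involved.

```python
def commit_with_merge(segments, small_threshold):
    """Returns the segment list after a commit that merges small segments.

    segments: document counts of the segments pending at commit time.
    Segments below small_threshold are combined into one; larger ones are kept.
    """
    small = [s for s in segments if s < small_threshold]
    result = [s for s in segments if s >= small_threshold]
    if len(small) > 1:
        result.append(sum(small))  # one merged segment replaces the tiny ones
    else:
        result.extend(small)  # nothing to merge with
    return result
```

For example, committing segments of 1000, 3, 5, 2, 800, and 4 documents with a threshold of 100 leaves three segments to search instead of six.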
The LatLonPoint implementation used by Elasticsearch projects geo data onto the plane. This means that when you define a polygon, its edges follow Cartesian geometry.
Geo3DPoint is another Lucene implementation, which projects geo data onto an ellipsoid. In this case, the edges of a polygon are defined using spherical geometry, and those edges are in fact geodesics of the ellipsoid. Until now, this implementation could only be used with the WGS84 ellipsoid. We have changed it so that it can take any ellipsoid.
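The practical difference between the two edge definitions shows up when you compare the midpoint of an edge under each model. The sketch below uses a perfect sphere rather than an ellipsoid to keep the math short, which is a simplifying assumption, and the function names are hypothetical.

```python
import math


def planar_midpoint(a, b):
    """Midpoint of an edge when (lat, lon) are treated as plane coordinates."""
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


def _to_unit_vector(lat, lon):
    la, lo = math.radians(lat), math.radians(lon)
    return (math.cos(la) * math.cos(lo), math.cos(la) * math.sin(lo), math.sin(la))


def great_circle_midpoint(a, b):
    """Midpoint of the geodesic between a and b on a sphere (assumed model).

    a, b: (lat, lon) in degrees; undefined for exactly antipodal points.
    """
    va, vb = _to_unit_vector(*a), _to_unit_vector(*b)
    sx, sy, sz = va[0] + vb[0], va[1] + vb[1], va[2] + vb[2]
    norm = math.sqrt(sx * sx + sy * sy + sz * sz)
    lat = math.degrees(math.asin(sz / norm))
    lon = math.degrees(math.atan2(sy, sx))
    return (lat, lon)
```

For an edge between (50, -60) and (50, 60), the planar midpoint stays at latitude 50 while the geodesic bows toward the pole, so a point can fall inside a polygon under one model and outside it under the other. Along the equator the two models agree.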
Breaking Changes in 8.0:
- Percentiles aggregation: disallow specifying same percentile values twice #52257
- Remove fixed_auto_queue_size threadpool type #52280
- Remove translog retention settings #51697
Breaking Changes in 7.7:
Breaking Changes in Elasticsearch Hadoop Plugin
Breaking Changes in 8.0:
- Boost default scroll size to 1000 #1429
Breaking Changes in Rally
Breaking Changes in 1.5.0:
- Require at least Python 3.6 #905
Breaking Changes in 1.4.1:
- Use zeros instead of whitespaces as padding bytes #899