Elasticsearch Search API: A new way to locate App Search documents

Elastic 8.2 introduces a new search API for App Search. The Elasticsearch Search API, now in beta, brings more of the flexibility and power of Elasticsearch to App Search. Elastic 8.2 also introduces a Search Explain API for App Search, which exposes the Elasticsearch queries generated by App Search. Use these Elasticsearch queries as the basis for your own.

In this post, we'll look at the new APIs and walk through several use cases for them.

The Elasticsearch Search API for App Search

In App Search v8.2, we’ve added a new beta API called the Elasticsearch Search API. With this API, you can query the App Search document indices using free-form Elasticsearch queries.

Perhaps you’ve been using App Search for a while, and although it’s powerful out of the box, you’d like to customize your search queries. App Search makes it very easy to get up and running with search. At the same time, it hides details and makes assumptions. The Elasticsearch Search API can fill this gap by providing direct access to query the underlying indices with Elasticsearch.

The API is available as:

GET /api/as/v0/engines/<engine-name>/elasticsearch/_search
POST /api/as/v0/engines/<engine-name>/elasticsearch/_search

The API accepts the following parameters:

  • request: JSON object with the following properties:
      • request.body: JSON. This query will be sent as-is to Elasticsearch.
      • request.query_params: List of parameters. A parameter is an object with a key and a value.
  • analytics: JSON object with the following properties:
      • analytics.query: String. Query associated with this request.
      • analytics.tags: List of tags to attach to this request.

Configuration

This API requires a private API key. In addition, the feature flag feature_flag.elasticsearch_search_api must be set to true in the Enterprise Search configuration file.

Keep in mind that with this API, results are formatted differently than Search API results. Documents are returned from Elasticsearch as-is, without applying any additional formatting. This means you can’t use this API as a drop-in replacement for the Search API.
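
Putting the endpoint, the parameters, and the private key together, a request could be assembled roughly as follows. This is a minimal sketch using only the Python standard library. The helper name build_search_request is hypothetical, the host, engine name, and key are placeholders borrowed from the curl example later in this post, and the Bearer authorization header is assumed to work the same way as in that example.

```python
import json
import urllib.request


def build_search_request(base_url, engine, private_key, body, query_params=None, analytics=None):
    """Build (but do not send) a POST request for the Elasticsearch Search API."""
    payload = {"request": {"body": body}}
    if query_params:
        # query_params is a list of {"key": ..., "value": ...} objects.
        payload["request"]["query_params"] = [
            {"key": key, "value": str(value)} for key, value in query_params.items()
        ]
    if analytics:
        payload["analytics"] = analytics
    url = f"{base_url}/api/as/v0/engines/{engine}/elasticsearch/_search"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {private_key}",  # assumed auth scheme
        },
        method="POST",
    )


# Placeholder host, engine, and key; replace with your own deployment's values.
request = build_search_request(
    "http://localhost:3002",
    "national-parks-demo",
    "private-abcdef",
    body={"query": {"match_all": {}}},
    query_params={"size": 0},
)
print(request.get_method(), request.full_url)
```

To actually send the request, pass it to urllib.request.urlopen and parse the JSON response body.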

Use cases

What kind of problems can you solve with this API? We’ll look at a few, using the National Parks sample engine that comes with App Search.

I want to count how many documents would match my query, without the overhead of the search results payload.

Provide a body to the API, and set the “size” parameter to 0. Example:

{
    "request": {
      "body": {"query": {"match_all": {}}},
      "query_params": [
        {"key": "size", "value": "0"}
      ]
    }
}
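
Reading the count back out might look like this. The sketch assumes the API returns the raw Elasticsearch search response, in which the total lives under hits.total. The sample_response below is hand-written for illustration and is not real output.

```python
def count_matches(response):
    """Return (count, is_exact) from an Elasticsearch search response."""
    total = response["hits"]["total"]
    # "relation" is "eq" for an exact count and "gte" for a lower bound;
    # by default Elasticsearch only counts accurately up to 10,000 hits.
    return total["value"], total["relation"] == "eq"


# Illustrative stand-in for a response to a size=0 request.
sample_response = {"hits": {"total": {"value": 42, "relation": "eq"}, "hits": []}}

count, is_exact = count_matches(sample_response)
print(count, is_exact)
```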

I want to count how many documents match my query, grouped by a certain field or fields.


Provide a body with aggregations (“aggs”) to the API, and set the “size” to 0. Multiple aggs can be specified. In this example, for brevity, I’m not specifying any query, so aggregations will be applied to all documents in the documents index. In reality, you will want to do some kind of searching and filtering:

{
  "request": {
    "body": {
      "aggs": {
        "top_states": {
          "terms": {
            "field": "states.enum",
            "size": 100
          }
        },
        "world_heritage_site": {
          "terms": {
            "field": "world_heritage_site.enum",
            "size": 10
          }
        }
      }
    },
    "query_params": [
      {
        "key": "size",
        "value": "0"
      }
    ]
  }
}
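
Each terms aggregation comes back as a list of buckets, so turning the response into per-value counts could be sketched as below. This assumes the raw Elasticsearch response shape (aggregations, then the aggregation name, then buckets). The bucket values in sample_response are invented for illustration.

```python
def terms_counts(response, agg_name):
    """Map each bucket key of a terms aggregation to its document count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Illustrative response fragment; the counts are not real data.
sample_response = {
    "aggregations": {
        "top_states": {
            "buckets": [
                {"key": "State A", "doc_count": 5},
                {"key": "State B", "doc_count": 3},
            ]
        },
        "world_heritage_site": {
            "buckets": [
                {"key": "false", "doc_count": 6},
                {"key": "true", "doc_count": 2},
            ]
        },
    }
}

for name in ("top_states", "world_heritage_site"):
    print(name, terms_counts(sample_response, name))
```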

I want to search for documents that are like a specific document in the same index.

Use Elasticsearch’s more_like_this (MLT) query. Example:

{
  "request": {
    "body": {
      "query": {
        "more_like_this": {
          "fields": [
            "title",
            "description"
          ],
          "like": [
            {
              "_id": "park_sequoia"
            }
          ],
          "min_term_freq": 1,
          "max_query_terms": 12
        }
      }
    },
    "query_params": [
      {
        "key": "size",
        "value": "100"
      }
    ]
  }
}
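
Because hits come back in Elasticsearch's own format, listing the similar documents means reading hits.hits directly. A minimal sketch, assuming the raw Elasticsearch response shape; the IDs and scores in sample_response are made up for illustration.

```python
def similar_documents(response):
    """Return (document id, score) pairs in the order Elasticsearch ranked them."""
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]


# Hand-written stand-in for a more_like_this response.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "park_example-a", "_score": 7.1},
            {"_id": "park_example-b", "_score": 3.4},
        ]
    }
}

for doc_id, score in similar_documents(sample_response):
    print(doc_id, score)
```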

I want to use a custom function to calculate document scores.


Why not? With a custom function, you can calculate document scores as a function of park area (in acres) and number of visitors:

{
  "request": {
    "body": {
      "query": {
        "function_score": {
          "script_score": {
            "script": {
              "source": "Math.log(doc['acres.float'].value * doc['visitors.float'].value)"
            }
          }
        }
      }
    },
    "query_params": [
      {
        "key": "size",
        "value": "100"
      }
    ]
  }
}
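
For intuition about how a logarithmic score of this kind behaves, the same arithmetic can be tried outside Elasticsearch. This is only a sketch of the formula described above, the natural log of area times visitors. The input numbers are hypothetical and do not describe real parks.

```python
import math


def log_score(acres, visitors):
    """Natural log of area times visitors, mirroring Math.log in the script."""
    return math.log(acres * visitors)


# Hypothetical parks: same area, ten times the visitors.
base = log_score(1_000, 100_000)
busier = log_score(1_000, 1_000_000)
print(round(busier - base, 3))  # prints 2.303
```

The logarithm compresses large differences: ten times the visitors adds only about 2.3 to the score, whatever the starting point.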

I want to retrieve a subset of documents without applying any scoring or grouping. These features are not useful to me and make the query slower.


This is what filter context in Elasticsearch is for: you can filter documents using a combination of criteria, and the matching documents won’t be scored. The following query selects all national parks in California within 300 miles of San Francisco International Airport:

{
  "request": {
    "body": {
      "query": {
        "bool": {
          "filter": [
            {
              "geo_distance": {
                "distance": "300mi",
                "location.location": {
                  "lat": 37.62126189231072,
                  "lon": -122.3790626898805
                }
              }
            },
            {
              "term": {
                "states.enum": "California"
              }
            }
          ]
        }
      }
    },
    "query_params": [
      {
        "key": "size",
        "value": "100"
      }
    ]
  }
}

I want to search for an exact match of a word or phrase, not a fuzzy match.

In its current version, App Search doesn’t make this easy. Text fields are tokenized, so the search runs against the resulting tokens rather than the exact terms. For example, the word “needle-like” is turned into two tokens: “needle” and “like”. So if you try to use a match query:

{
  "query": {
    "match": {
      "description": "needle-like"
    }
  }
}

You will find documents that match “needle” and/or “like”. In our sample National Parks index, this returns three documents. Here is a workaround using a runtime field:

{
  "request": {
    "body": {
      "query": {
        "bool": {
          "filter": {
            "term": {
              "has_exact_word": true
            }
          }
        }
      },
      "runtime_mappings": {
        "has_exact_word": {
          "type": "boolean",
          "script": {
            "source": "emit(doc['description.enum'].value.contains('needle-like'))"
          }
        }
      }
    },
    "query_params": [
      {
        "key": "size",
        "value": "100"
      }
    ]
  }
}

The query above returns only one document, the one that actually contains the exact word “needle-like”.

Here is another workaround using a script query:

{
  "request": {
    "body": {
      "query": {
        "bool": {
          "filter": {
            "script": {
              "script": "doc['description.enum'].value.contains('needle-like')"
            }
          }
        }
      }
    },
    "query_params": [
      {
        "key": "size",
        "value": "100"
      }
    ]
  }
}

These workarounds might temporarily solve a legitimate business problem, but performance would be severely degraded. The script query would have to scan every document in the index and, for an index of any significance, this quickly becomes unsustainable. The best way to solve this problem would be to apply a custom analyzer to your documents index. This ensures that text is tokenized in a way that makes sense for your set of documents.
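
To see why the match query over-matches while the script workarounds do not, the two behaviors can be imitated in a few lines. This is a rough approximation only: rough_tokens is a crude stand-in for a real analyzer, not what App Search or Elasticsearch actually runs, and the sample descriptions are hand-written rather than taken from the National Parks data.

```python
import re


def rough_tokens(text):
    """Crude analyzer stand-in: lowercase, then split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_any_token(query, text):
    """Imitate a match query with OR semantics: true if any query token appears."""
    return bool(rough_tokens(query) & rough_tokens(text))


def contains_exact(query, text):
    """Imitate the script workaround: a plain, case-sensitive substring check."""
    return query in text


# Hand-written sample descriptions for illustration.
descriptions = [
    "The trees have needle-like leaves.",
    "A needle of rock rises above the valley.",
    "Visitors like the quiet trails.",
    "Wetlands cover most of the park.",
]

token_hits = [d for d in descriptions if match_any_token("needle-like", d)]
exact_hits = [d for d in descriptions if contains_exact("needle-like", d)]
print(len(token_hits), len(exact_hits))  # prints 3 1
```

The substring check has to look at every description in full, which is the same reason the script query scales poorly on a large index.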

I want to add a runtime field to my documents and return it in my search.


Let’s add distance to SFO (in miles) to all documents in the National Parks index. The following query adds a runtime field, and includes it in “fields” to ensure it’s being returned in the response:

{
  "request": {
    "body": {
      "runtime_mappings": {
        "miles_to_sfo": {
          "type": "double",
          "script": {
            "source": "emit(0.00062137 * doc['location.location'].planeDistance(37.62126189231072, -122.3790626898805))"
          }
        }
      },
      "fields": [
        "miles_to_sfo"
      ]
    },
    "query_params": [
      {
        "key": "size",
        "value": "100"
      }
    ]
  }
}

Keep in mind that, because runtime fields are evaluated at query time, they will naturally be less performant than indexed fields. One thing you can do to improve query performance is ensure you’re only retrieving a subset of documents you actually need, by applying filters on other indexed fields. This means the runtime field doesn’t have to be evaluated for the whole dataset. If this is a query you will be making regularly, and especially if the index contains a lot of documents, you should consider promoting this field to an indexed field.
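
One way to follow that advice is to build the request body so the runtime field is always paired with a filter on an indexed field. The sketch below is an illustration under assumptions: distance_body is a hypothetical helper, and it reuses the states.enum term filter and the conversion factor from the earlier examples.

```python
METERS_TO_MILES = 0.00062137  # same factor as in the query above


def distance_body(lat, lon, state=None, field_name="miles_to_sfo"):
    """Build a search body with a distance runtime field and an optional state filter."""
    source = f"emit({METERS_TO_MILES} * doc['location.location'].planeDistance({lat}, {lon}))"
    body = {
        "runtime_mappings": {
            field_name: {"type": "double", "script": {"source": source}}
        },
        "fields": [field_name],
    }
    if state is not None:
        # Filtering on an indexed field narrows the set of documents
        # for which the runtime script has to be evaluated.
        body["query"] = {"bool": {"filter": [{"term": {"states.enum": state}}]}}
    return body


body = distance_body(37.62126189231072, -122.3790626898805, state="California")
print(sorted(body))
```

The result goes into request.body as-is, with size passed through query_params as in the other examples.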

The Search Explain API for App Search


The new Search Explain API is another useful tool that will help you write your Elasticsearch queries.

The Search Explain API accepts the same parameters as the App Search Search API. However, instead of running a search and returning results, it builds and returns the Elasticsearch query that App Search would run.

The API is available as:

GET /api/as/v0/engines/<engine-name>/search_explain
POST /api/as/v0/engines/<engine-name>/search_explain

You can see what happens when you search for “everglade” in App Search:

curl -XPOST 'http://localhost:3002/api/as/v0/engines/national-parks-demo/search_explain' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer private-abcdef' \
--data-raw '{
    "query": "everglade"
}'

Response:

{
  "meta": {
    "alerts": [],
    "warnings": [],
    "precision": 2,
    "engine": {
      "name": "national-parks-demo",
      "type": "default"
    },
    "request_id": "d3346586-46b0-419f-91a2-e051253ab455"
  },
  "query_string": "GET enterprise-search-engine-national-parks-demo/_search",
  "query_body": {
    "query": {
      "bool": {
        "must": {
          "function_score": {
            "boost_mode": "sum",
            "score_mode": "sum",
            "query": {
              "bool": {
                "must": [
                  {
                    "bool": {
                      "should": [
                        {
                          "multi_match": {
                            "query": "everglade",
                            "minimum_should_match": "1<-1 3<49%",
                            "type": "cross_fields",
                            "fields": [
                              "world_heritage_site^1.0",
                              "world_heritage_site.stem^0.95",
                              "world_heritage_site.prefix^0.1",
                              "world_heritage_site.joined^0.75",
                              "world_heritage_site.delimiter^0.4",
                              "description^2.4",
                              "description.stem^2.28",
                              "description.prefix^0.24",
                              "description.joined^1.8",
                              "description.delimiter^0.96",
                              "title^5.0",
                              "title.stem^4.75",
                              "title.prefix^0.5",
                              "title.joined^3.75",
                              "title.delimiter^2.0",
                              "nps_link^0.7",
                              "nps_link.stem^0.665",
                              "nps_link.prefix^0.07",
                              "nps_link.joined^0.525",
                              "nps_link.delimiter^0.28",
                              "states^2.8",
                              "states.stem^2.66",
                              "states.prefix^0.28",
                              "states.joined^2.1",
                              "states.delimiter^1.12",
                              "id^1.0"
                            ]
                          }
                        },
                        {
                          "multi_match": {
                            "query": "everglade",
                            "minimum_should_match": "1<-1 3<49%",
                            "type": "best_fields",
                            "fuzziness": "AUTO",
                            "prefix_length": 2,
                            "fields": [
                              "world_heritage_site.stem^0.1",
                              "description.stem^0.24",
                              "title.stem^0.5",
                              "nps_link.stem^0.07",
                              "states.stem^0.28"
                            ]
                          }
                        }
                      ]
                    }
                  }
                ]
              }
            },
            "functions": [
              {
                "script_score": {
                  "script": {
                    "source": "Math.max(_score + ((1.5 * (doc.containsKey(\"visitors.float\") && !doc[\"visitors.float\"].empty ? doc[\"visitors.float\"].value : 0))) - _score, 0)"
                  }
                }
              }
            ]
          }
        }
      }
    },
    "sort": [
      {
        "_score": "desc"
      },
      {
        "_doc": "desc"
      }
    ],
    "highlight": {
      "fragment_size": 300,
      "type": "plain",
      "number_of_fragments": 1,
      "order": "score",
      "encoder": "html",
      "require_field_match": false,
      "fields": {}
    },
    "size": 10,
    "from": 0,
    "timeout": "30000ms",
    "_source": [
      "visitors",
      "square_km",
      "world_heritage_site",
      "date_established",
      "description",
      "location",
      "id",
      "acres",
      "title",
      "nps_link",
      "states"
    ]
  }
}

Whoa. A lot seems to be happening there. App Search is:

  • combining 2 different multi-match queries, one of type cross_fields and another of type best_fields (with fuzziness)
  • calculating a script score that adds a boost of 1.5 times the number of visitors to the document score returned by Elasticsearch
  • applying field weights and boosts
  • adding highlighting
  • summing up the resulting document scores

This query could be used as a starting point, and modified as needed to achieve your search objectives.
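
As one possible workflow, the query_body from a Search Explain response can be wrapped in the request shape that the Elasticsearch Search API expects and then adjusted. A minimal sketch: explain_to_request is a hypothetical helper, and explain_response is a heavily trimmed stand-in for the response shown above.

```python
import copy


def explain_to_request(explain_response, overrides=None):
    """Wrap the generated query_body in an Elasticsearch Search API request."""
    body = copy.deepcopy(explain_response["query_body"])  # leave the original untouched
    body.update(overrides or {})  # shallow, top-level overrides such as "size"
    return {"request": {"body": body}}


# Trimmed stand-in for a Search Explain response.
explain_response = {
    "query_string": "GET enterprise-search-engine-national-parks-demo/_search",
    "query_body": {
        "query": {
            "multi_match": {
                "query": "everglade",
                "fields": ["title^5.0", "description^2.4"],
            }
        },
        "size": 10,
        "from": 0,
    },
}

payload = explain_to_request(explain_response, overrides={"size": 25})
print(payload["request"]["body"]["size"])  # prints 25
```

Deeper changes, such as editing field weights inside the query, can be made on the copied body before sending it.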

Summary

In this blog post, we gave you some tips for using the new Elasticsearch Search API in App Search. We provided several use cases, based on App Search feature requests we have received over time. We also let you take a peek into the inner workings of App Search, with the new Search Explain API.

We hope that this new API will empower you to build that perfect search experience you’ve always been looking for. Try it out with a free trial on Elastic Cloud. We’d love to hear what you build with it, and if you have any feedback, don’t hesitate to let us know.