Using the annotated-text field

The annotated-text field tokenizes text content in the same way as the more common text field (see "limitations" below) but also injects any marked-up annotation tokens directly into the search index:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "annotated_text"
      }
    }
  }
}

Such a mapping allows marked-up text, e.g. Wikipedia articles, to be indexed as both text and structured tokens. Annotations use a Markdown-like syntax, [text](values), in which the values are one or more URL-encoded strings separated by the & symbol.
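
To make the syntax concrete, here is a minimal Python sketch of how such markup could be split into the annotated text and its decoded values. The function name parse_annotations and the regular expression are illustrative assumptions, not part of Elasticsearch; the real parsing happens inside the field mapper at index time and may treat edge cases differently.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical pattern for [annotated text](value1&value2). It is an
# illustration only and does not handle nested or escaped brackets.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^\)\(]*)\)")


def parse_annotations(markup):
    """Return (annotated text, decoded values) pairs found in markup."""
    pairs = []
    for match in ANNOTATION.finditer(markup):
        text, raw_values = match.group(1), match.group(2)
        # Values are separated by "&" and URL-encoded ("+" means a space).
        values = [unquote_plus(v) for v in raw_values.split("&") if v]
        pairs.append((text, values))
    return pairs


print(parse_annotations("Investors in [Apple](Apple+Inc.) rejoiced."))
print(parse_annotations("[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"))
```

The first call yields the text Apple with the single value Apple Inc., and the second yields the text Jeff Beck with the two values Jeff Beck and Guitarist.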

We can use the "_analyze" API to test how an example annotation would be stored as tokens in the search index:

GET my_index/_analyze
{
  "field": "my_field",
  "text":"Investors in [Apple](Apple+Inc.) rejoiced."
}

Response:

{
  "tokens": [
    {
      "token": "investors",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "Apple Inc.", 
      "start_offset": 13,
      "end_offset": 18,
      "type": "annotation",
      "position": 2
    },
    {
      "token": "apple",
      "start_offset": 13,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "rejoiced",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

Note that the whole annotation token Apple Inc. is placed, unchanged, as a single token in the token stream and at the same position (position 2) as the text token (apple) it annotates.

We can now search for annotations using regular term queries, which don't tokenize the provided search values. Annotations are a more precise way of matching, as can be seen in this example where a search for Beck will not match Jeff Beck:

# Example documents
PUT my_index/_doc/1
{
  "my_field": "[Beck](Beck) announced a new tour"
}

PUT my_index/_doc/2
{
  "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"
}

# Example search
GET my_index/_search
{
  "query": {
    "term": {
      "my_field": "Beck"
    }
  }
}

In document 1, as well as tokenizing the plain text into single words, e.g. beck, we inject the single token value Beck at the same position as beck in the token stream.

Note that annotations can inject multiple tokens at the same position: in document 2 we inject both the very specific value Jeff Beck and the broader term Guitarist. This enables broader positional queries, e.g. finding mentions of a Guitarist near to strat.

A benefit of searching with these carefully defined annotation tokens is that a term query for Beck will not match document 2, which contains the tokens jeff, beck and Jeff Beck but no token that is exactly Beck.
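
The matching behaviour can be illustrated with a toy model in Python. This is only a sketch: toy_tokens and term_matches are hypothetical helpers, and the lowercasing word splitter is a crude stand-in for the real text analysis. Positions and offsets are ignored entirely.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical pattern for [annotated text](value1&value2).
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^\)\(]*)\)")


def toy_tokens(markup):
    """Approximate the set of indexed tokens for an annotated-text value."""
    tokens = set()
    for match in ANNOTATION.finditer(markup):
        # Annotation values are kept whole and unchanged.
        tokens.update(unquote_plus(v) for v in match.group(2).split("&") if v)
    # Replace each annotation with its plain text, then split into
    # lowercased words (an assumption standing in for the text analyzer).
    plain = ANNOTATION.sub(lambda m: m.group(1), markup)
    tokens.update(word.lower() for word in re.findall(r"\w+", plain))
    return tokens


def term_matches(term, markup):
    """Compare the search value to the tokens without analyzing it."""
    return term in toy_tokens(markup)


docs = {
    1: "[Beck](Beck) announced a new tour",
    2: "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat",
}
for term in ("Beck", "Jeff Beck", "Guitarist", "beck"):
    hits = [doc_id for doc_id, text in docs.items() if term_matches(term, text)]
    print(term, hits)
```

In this model Beck matches only document 1, Jeff Beck and Guitarist match only document 2, and the lowercased text token beck matches both documents.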

Any use of = signs in annotation values, e.g. [Prince](person=Prince), will cause the document to be rejected with a parse failure. In future we hope to have a use for the equals sign, so we actively reject documents that contain it today.
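
A client could check for this before indexing. The following Python sketch is one possible pre-check under stated assumptions: values_with_equals is a hypothetical helper, the sample sentences are made up, and only a literal = in the raw value is detected. Whether an encoded equals sign is also rejected is not covered here.

```python
import re

# Hypothetical pattern for [annotated text](value1&value2).
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^\)\(]*)\)")


def values_with_equals(markup):
    """Return the raw annotation values that contain an "=" sign."""
    offending = []
    for match in ANNOTATION.finditer(markup):
        for raw_value in match.group(2).split("&"):
            if "=" in raw_value:
                offending.append(raw_value)
    return offending


print(values_with_equals("[Prince](person=Prince) released an album"))
print(values_with_equals("[Prince](Prince) released an album"))
```

The first call reports person=Prince, so that text should be corrected before it is sent; the second call returns an empty list.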