Classify text

These NLP tasks enable you to identify the language of text and classify or label unstructured input text:

Language identification

Language identification enables you to determine the language of text.

A language identification model is provided in your cluster; you can use it in an inference processor of an ingest pipeline by referencing its model ID (lang_ident_model_1). For an example, refer to Add NLP inference to ingest pipelines.

The longer the text passed into the language identification model, the more accurately the model can identify the language. The model is fairly accurate on short samples (for example, 50-character streams) in certain languages, but languages that are similar to each other are harder to tell apart from a short character stream. If there is no valid text from which the language can be inferred, the model returns the special language code zxx. If you prefer a different default value, you can adjust your ingest pipeline to replace zxx predictions with the value of your choice, as shown in the sketch below.
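
For example, a pipeline along the lines of the following sketch runs the language identification model on incoming documents and then overrides zxx predictions with a fallback value. The pipeline name, the source field (contents), the target field, and the en fallback are placeholders chosen for this example; num_top_classes is optional and simply keeps the top three predictions.

PUT _ingest/pipeline/lang-ident-pipeline
{
    "processors": [
        {
            "inference": {
                "model_id": "lang_ident_model_1",
                "field_map": { "contents": "text" },
                "target_field": "ml.lang_ident",
                "inference_config": { "classification": { "num_top_classes": 3 } }
            }
        },
        {
            "set": {
                "if": "ctx.ml?.lang_ident?.predicted_value == 'zxx'",
                "field": "ml.lang_ident.predicted_value",
                "value": "en"
            }
        }
    ]
}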

Language identification takes Unicode boundaries into account when the feature set is built. If the text contains diacritical marks, the model uses that information to help identify the language. In certain cases, the model can detect the source language even if the text is not written in the script that the language traditionally uses; these languages are marked in the supported languages table below with the Latn subtag. Language identification supports Unicode input.
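
To experiment with transliterated text, you can also call the model directly with the infer trained models API and inspect the predicted language code; a *-Latn code in the response indicates that the model recognized a language written in Latin script. The request below is an illustrative sketch only: it assumes the model's default input field, text, and the romanized Hindi sample sentence is made up for this example.

POST _ml/trained_models/lang_ident_model_1/_infer
{
    "docs": [
        { "text": "aap kaise hain? main kal Delhi ja raha hoon." }
    ]
}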

Supported languages

The table below contains the ISO codes and the English names of the languages that language identification supports. If a language has a 2-letter ISO 639-1 code, the table contains that identifier. Otherwise, the 3-letter ISO 639-2 code is used. The Latn subtag indicates that the language is transliterated into Latin script.

Code      | Language           | Code      | Language           | Code      | Language
af        | Afrikaans          | hr        | Croatian           | pa        | Punjabi
am        | Amharic            | ht        | Haitian            | pl        | Polish
ar        | Arabic             | hu        | Hungarian          | ps        | Pashto
az        | Azerbaijani        | hy        | Armenian           | pt        | Portuguese
be        | Belarusian         | id        | Indonesian         | ro        | Romanian
bg        | Bulgarian          | ig        | Igbo               | ru        | Russian
bg-Latn   | Bulgarian          | is        | Icelandic          | ru-Latn   | Russian
bn        | Bengali            | it        | Italian            | sd        | Sindhi
bs        | Bosnian            | iw        | Hebrew             | si        | Sinhala
ca        | Catalan            | ja        | Japanese           | sk        | Slovak
ceb       | Cebuano            | ja-Latn   | Japanese           | sl        | Slovenian
co        | Corsican           | jv        | Javanese           | sm        | Samoan
cs        | Czech              | ka        | Georgian           | sn        | Shona
cy        | Welsh              | kk        | Kazakh             | so        | Somali
da        | Danish             | km        | Central Khmer      | sq        | Albanian
de        | German             | kn        | Kannada            | sr        | Serbian
el        | Greek, modern      | ko        | Korean             | st        | Southern Sotho
el-Latn   | Greek, modern      | ku        | Kurdish            | su        | Sundanese
en        | English            | ky        | Kirghiz            | sv        | Swedish
eo        | Esperanto          | la        | Latin              | sw        | Swahili
es        | Spanish, Castilian | lb        | Luxembourgish      | ta        | Tamil
et        | Estonian           | lo        | Lao                | te        | Telugu
eu        | Basque             | lt        | Lithuanian         | tg        | Tajik
fa        | Persian            | lv        | Latvian            | th        | Thai
fi        | Finnish            | mg        | Malagasy           | tr        | Turkish
fil       | Filipino           | mi        | Maori              | uk        | Ukrainian
fr        | French             | mk        | Macedonian         | ur        | Urdu
fy        | Western Frisian    | ml        | Malayalam          | uz        | Uzbek
ga        | Irish              | mn        | Mongolian          | vi        | Vietnamese
gd        | Gaelic             | mr        | Marathi            | xh        | Xhosa
gl        | Galician           | ms        | Malay              | yi        | Yiddish
gu        | Gujarati           | mt        | Maltese            | yo        | Yoruba
ha        | Hausa              | my        | Burmese            | zh        | Chinese
haw       | Hawaiian           | ne        | Nepali             | zh-Latn   | Chinese
hi        | Hindi              | nl        | Dutch, Flemish     | zu        | Zulu
hi-Latn   | Hindi              | no        | Norwegian          |           |
hmn       | Hmong              | ny        | Chichewa           |           |

Text classification

Text classification assigns the input text to the one of multiple classes that best describes it. The available classes depend on the model and the data set that was used to train it. Based on the number of classes, there are two main types of classification: binary classification, where there are exactly two classes, and multi-class classification, where there are more than two.

This task can help you analyze text for markers of positive or negative sentiment or classify text into various topics. For example, you might use a trained model to perform sentiment analysis and determine whether the following text is "POSITIVE" or "NEGATIVE":

{
    "docs": [{"text_field": "This was the best movie I’ve seen in the last decade!"}]
}
...

Likewise, you might use a trained model to perform multi-class classification and determine whether the following text is a news topic related to "SPORTS", "BUSINESS", "LOCAL", or "ENTERTAINMENT":

{
    "docs": [{"text_field": "The Blue Jays played their final game in Toronto last night and came out with a win over the Yankees, highlighting just how far the team has come this season."}]
}
...
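
Request bodies like the two above are what you pass to the infer trained models API (or configure in an inference processor). As a sketch, assuming a hypothetical deployed classification model with the ID my_text_classification_model and its default text_field input, the news example would be submitted like this:

POST _ml/trained_models/my_text_classification_model/_infer
{
    "docs": [{"text_field": "The Blue Jays played their final game in Toronto last night and came out with a win over the Yankees, highlighting just how far the team has come this season."}]
}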

Zero-shot text classification

The zero-shot classification task enables you to classify text without training a model on a specific set of classes. Instead, you provide the classes when you deploy the model or at inference time. The task uses a model that was trained on a large data set and has acquired a general understanding of language, and it asks that model how well each of the labels you provide fits your text.

This task enables you to analyze and classify your input text even when you don’t have sufficient training data to train a text classification model.

For example, you might want to perform multi-class classification and determine whether a news topic is related to "SPORTS", "BUSINESS", "LOCAL", or "ENTERTAINMENT". However, in this case the model is not trained specifically for news classification; instead, the possible labels are provided together with the input text at inference time:

{
    "docs": [{"text_field": "The S&P 500 gained a meager 12 points in the day’s trading. Trade volumes remain consistent with those of the past week while investors await word from the Fed about possible rate increases."}],
    "inference_config": {
        "zero_shot_classification": {
            "labels": ["SPORTS", "BUSINESS", "LOCAL", "ENTERTAINMENT"]
        }
    }
}

The task returns the following result:

...
{
    "predicted_value": "BUSINESS"
    ...
}
...

You can use the same model to perform inference with different classes, such as:

{
    "docs": [{"text_field": "Hello support team. I’m writing to inquire about the possibility of sending my broadband router in for repairs. The internet is really slow and the router keeps rebooting! It’s a big problem because I’m in the middle of binge-watching The Mandalorian!"}],
    "inference_config": {
        "zero_shot_classification": {
            "labels": ["urgent", "internet", "phone", "cable", "mobile", "tv"],
            "multi_label": true
        }
    }
}

The task returns the following result:

...
{
    "predicted_value": ["urgent", "internet", "tv"]
    ...
}
...

Since you can adjust the labels while you perform inference, this type of task is exceptionally flexible. If you are consistently using the same labels, however, it might be better to use a fine-tuned text classification model.