Classify text
These NLP tasks enable you to identify the language of text and classify or label unstructured input text:
Language identification
Language identification enables you to determine the language of text.
A language identification model is provided in your cluster, which you can use in an
inference processor of an ingest pipeline by using its model ID
(lang_ident_model_1). For an example, refer to Add NLP inference to ingest pipelines.
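As a rough illustration, a pipeline definition along these lines could be assembled as in the sketch below. The document field `contents`, the target field `_ml.lang_ident`, the `num_top_classes` value, and the helper name are assumptions chosen for this example, not values the text above prescribes. The sketch only builds and prints the request body; sending it to the cluster is left out.

```python
import json


def build_lang_ident_pipeline(source_field="contents", target_field="_ml.lang_ident"):
    """Return an ingest pipeline body that runs lang_ident_model_1 on one field.

    source_field and target_field are illustrative defaults, not required names.
    """
    return {
        "description": "Identify the language of incoming text",
        "processors": [
            {
                "inference": {
                    "model_id": "lang_ident_model_1",
                    # Ask for the top 3 candidate languages (assumed value).
                    "inference_config": {"classification": {"num_top_classes": 3}},
                    # Map the document field to the input name the model expects.
                    "field_map": {source_field: "text"},
                    "target_field": target_field,
                }
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_lang_ident_pipeline(), indent=2))
```

The resulting body would typically be stored as an ingest pipeline and referenced when indexing documents.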
The longer the text passed into the language identification model, the more accurately the
model can identify the language. It is fairly accurate on short samples (for
example, 50-character streams) in certain languages, but languages that are
similar to each other are harder to tell apart from a short character stream.
If there is no valid text from which the identity can be inferred, the model
returns the special language code zxx. If you prefer to use a different
default value, you can adjust your ingest pipeline to replace zxx predictions
with your preferred value.
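One way to do this replacement is a conditional `set` processor placed after the inference processor. The sketch below builds such a processor as a Python dictionary. The result path `ml.lang_ident.predicted_value` and the replacement value `und` are assumptions for this example; the actual path depends on the target field configured in your pipeline.

```python
def zxx_fallback_processor(result_path="ml.lang_ident.predicted_value", default="und"):
    """Build a `set` processor that overwrites a `zxx` prediction with `default`.

    result_path is a hypothetical location of the predicted language code.
    """
    # Turn "a.b.c" into the null-safe Painless expression "ctx.a?.b?.c".
    safe_path = "ctx." + "?.".join(result_path.split("."))
    return {
        "set": {
            # Only run when the model returned the "no valid text" code.
            "if": f"{safe_path} == 'zxx'",
            "field": result_path,
            "value": default,
        }
    }


if __name__ == "__main__":
    print(zxx_fallback_processor())
```

Appending the returned dictionary to the `processors` list of the pipeline keeps the fallback logic next to the inference step.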
Language identification takes into account Unicode boundaries when the feature set is
built. If the text has diacritical marks, then the model uses that information
for identifying the language of the text. In certain cases, the model can
detect the source language even if it is not written in the script that the
language traditionally uses. These languages are marked in the supported
languages table (see below) with the Latn subtag. Language identification supports
Unicode input.
Supported languages
The table below contains the ISO codes and the English names of the languages
that language identification supports. If a language has a 2-letter ISO 639-1
code, the table contains that identifier. Otherwise, the 3-letter ISO 639-2
code is used. The Latn subtag indicates that the language is transliterated into Latin
script.
Code | Language | Code | Language | Code | Language
af | Afrikaans | hr | Croatian | pa | Punjabi
am | Amharic | ht | Haitian | pl | Polish
ar | Arabic | hu | Hungarian | ps | Pashto
az | Azerbaijani | hy | Armenian | pt | Portuguese
be | Belarusian | id | Indonesian | ro | Romanian
bg | Bulgarian | ig | Igbo | ru | Russian
bg-Latn | Bulgarian | is | Icelandic | ru-Latn | Russian
bn | Bengali | it | Italian | sd | Sindhi
bs | Bosnian | iw | Hebrew | si | Sinhala
ca | Catalan | ja | Japanese | sk | Slovak
ceb | Cebuano | ja-Latn | Japanese | sl | Slovenian
co | Corsican | jv | Javanese | sm | Samoan
cs | Czech | ka | Georgian | sn | Shona
cy | Welsh | kk | Kazakh | so | Somali
da | Danish | km | Central Khmer | sq | Albanian
de | German | kn | Kannada | sr | Serbian
el | Greek, modern | ko | Korean | st | Southern Sotho
el-Latn | Greek, modern | ku | Kurdish | su | Sundanese
en | English | ky | Kirghiz | sv | Swedish
eo | Esperanto | la | Latin | sw | Swahili
es | Spanish, Castilian | lb | Luxembourgish | ta | Tamil
et | Estonian | lo | Lao | te | Telugu
eu | Basque | lt | Lithuanian | tg | Tajik
fa | Persian | lv | Latvian | th | Thai
fi | Finnish | mg | Malagasy | tr | Turkish
fil | Filipino | mi | Maori | uk | Ukrainian
fr | French | mk | Macedonian | ur | Urdu
fy | Western Frisian | ml | Malayalam | uz | Uzbek
ga | Irish | mn | Mongolian | vi | Vietnamese
gd | Gaelic | mr | Marathi | xh | Xhosa
gl | Galician | ms | Malay | yi | Yiddish
gu | Gujarati | mt | Maltese | yo | Yoruba
ha | Hausa | my | Burmese | zh | Chinese
haw | Hawaiian | ne | Nepali | zh-Latn | Chinese
hi | Hindi | nl | Dutch, Flemish | zu | Zulu
hi-Latn | Hindi | no | Norwegian | |
hmn | Hmong | ny | Chichewa | |
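When consuming predictions downstream, it can help to separate the base language code from the Latn subtag used in the table above. The following is a minimal sketch; the function name is hypothetical and not part of any Elasticsearch API.

```python
def split_language_code(code):
    """Split a code such as 'ja-Latn' into (language, script).

    Returns None for the script when the code carries no subtag.
    """
    language, _, script = code.partition("-")
    return language, (script or None)


if __name__ == "__main__":
    for code in ("ja-Latn", "haw", "zh"):
        print(code, "->", split_language_code(code))
```

A script value of `Latn` indicates that the text was recognized as a transliteration into Latin script rather than the script the language traditionally uses.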
Text classification
Text classification assigns the input text to one of multiple classes that best describe the text. The classes used depend on the model and the data set that was used to train it. Based on the number of classes, two main types of classification exist: binary classification, where the number of classes is exactly two, and multi-class classification, where the number of classes is more than two.
This task can help you analyze text for markers of positive or negative sentiment or classify text into various topics. For example, you might use a trained model to perform sentiment analysis and determine whether the following text is "POSITIVE" or "NEGATIVE":
{ "docs": [{"text_field": "This was the best movie I’ve seen in the last decade!"}] } ...
Likewise, you might use a trained model to perform multi-class classification and determine whether the following text is a news topic related to "SPORTS", "BUSINESS", "LOCAL", or "ENTERTAINMENT":
{ "docs": [{"text_field": "The Blue Jays played their final game in Toronto last night and came out with a win over the Yankees, highlighting just how far the team has come this season."}] } ...
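The request bodies above can be generated programmatically, and the predicted class read back from the response. The sketch below assumes the response carries an `inference_results` list whose entries contain a `predicted_value` key; that shape, the helper names, and the hand-written sample response are assumptions for illustration and may differ in your deployment.

```python
import json


def build_infer_request(texts, field="text_field"):
    """Wrap each input string in the docs structure shown above."""
    return {"docs": [{field: text} for text in texts]}


def predicted_labels(response):
    """Collect the predicted class of every result (assumed response shape)."""
    return [r.get("predicted_value") for r in response.get("inference_results", [])]


if __name__ == "__main__":
    body = build_infer_request(["This was the best movie I've seen in the last decade!"])
    print(json.dumps(body, indent=2))

    # Hand-written stand-in for a response, not real model output.
    sample_response = {"inference_results": [{"predicted_value": "POSITIVE"}]}
    print(predicted_labels(sample_response))
```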
Zero-shot text classification
The zero-shot classification task offers the ability to classify text without training a model on a specific set of classes. Instead, you provide the classes when you deploy the model or at inference time. It uses a model trained on a large data set that has gained a general language understanding and asks the model how well the labels you provided fit with your text.
This task enables you to analyze and classify your input text even when you don’t have sufficient training data to train a text classification model.
For example, you might want to perform multi-class classification and determine whether a news topic is related to "SPORTS", "BUSINESS", "LOCAL", or "ENTERTAINMENT". However, in this case the model is not trained specifically for news classification; instead, the possible labels are provided together with the input text at inference time:
{ "docs": [{"text_field": "The S&P 500 gained a meager 12 points in the day’s trading. Trade volumes remain consistent with those of the past week while investors await word from the Fed about possible rate increases."}], "inference_config": { "zero_shot_classification": { "labels": ["SPORTS", "BUSINESS", "LOCAL", "ENTERTAINMENT"] } } }
The task returns the following result:
... { "predicted_value": "BUSINESS" ... } ...
You can use the same model to perform inference with different classes, such as:
{ "docs": [{"text_field": "Hello support team. I’m writing to inquire about the possibility of sending my broadband router in for repairs. The internet is really slow and the router keeps rebooting! It’s a big problem because I’m in the middle of binge-watching The Mandalorian!"}], "inference_config": { "zero_shot_classification": { "labels": ["urgent", "internet", "phone", "cable", "mobile", "tv"] } } }
The task returns the following result:
... { "predicted_value": ["urgent", "internet", "tv"] ... } ...
Since you can adjust the labels while you perform inference, this type of task is exceptionally flexible. If you are consistently using the same labels, however, it might be better to use a fine-tuned text classification model.
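Because the labels travel with the request, swapping label sets amounts to building a different request body for the same model. The following sketch shows one way to do that. The helper name is hypothetical, and the `multi_label` option is included on the assumption that your Elasticsearch version supports it in the zero-shot configuration; the text above does not mention it.

```python
def zero_shot_request(text, labels, multi_label=False, field="text_field"):
    """Build a zero-shot request body with the labels supplied at inference time."""
    if not labels:
        raise ValueError("at least one label is required")
    return {
        "docs": [{field: text}],
        "inference_config": {
            "zero_shot_classification": {
                "labels": list(labels),
                # Assumed option: allow more than one label per input.
                "multi_label": multi_label,
            }
        },
    }


if __name__ == "__main__":
    news = zero_shot_request(
        "The S&P 500 gained a meager 12 points in the day's trading.",
        ["SPORTS", "BUSINESS", "LOCAL", "ENTERTAINMENT"],
    )
    support = zero_shot_request(
        "The internet is really slow and the router keeps rebooting!",
        ["urgent", "internet", "phone", "cable", "mobile", "tv"],
        multi_label=True,
    )
    print(news["inference_config"])
    print(support["inference_config"])
```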