Natural language processing (NLP) is a form of artificial intelligence (AI) that focuses on the ways computers and people can interact using human language. NLP techniques help computers analyze, understand, and respond to us using our natural modes of communication: speech and written text.
Natural language processing is a subspecialty of computational linguistics. Computational linguistics is an interdisciplinary field that combines computer science, linguistics, and artificial intelligence to study the computational aspects of human language.
The history of natural language processing goes back to the 1950s, when computer scientists first began exploring ways to teach machines to understand and produce human language. In 1950, mathematician Alan Turing proposed his famous Turing Test, which asks whether a machine can hold a written conversation convincingly enough that a human judge cannot tell it apart from a person. This is also when researchers began exploring the possibility of using computers to translate languages.
In its first decade of research, NLP relied on rule-based processing. By the 1960s, scientists had developed new ways to analyze human language using semantic analysis, parts-of-speech tagging, and parsing. They also developed the first corpora, which are large machine-readable documents annotated with linguistic information used to train NLP algorithms.
In the 1970s, scientists began using statistical NLP, which analyzes and generates natural language text using statistical models, as an alternative to rule-based approaches.
The 1980s saw a focus on developing more efficient algorithms for training models and improving their accuracy. This led to the rise of machine learning algorithms in NLP. Machine learning is the process of using large amounts of data to identify patterns, which are often used to make predictions.
Deep learning, neural networks, and transformer models have fundamentally changed NLP research. The emergence of deep neural networks, combined with the invention of transformer models and the "attention mechanism," has led to technologies like BERT and ChatGPT. Rather than simply matching keywords in a query, the attention mechanism weighs each term in a sequence by its relevance to the others. This is the technology behind some of the most exciting NLP applications in use right now.
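To make the idea concrete, here is a minimal, illustrative sketch of scaled dot-product attention, the operation at the heart of transformer models. The function name and the toy vectors are made up for the example; real models work on learned, high-dimensional embeddings.

```python
import math

def scaled_dot_product_attention(query, keys, values):
    """Toy scaled dot-product attention over plain Python lists.

    Each key is scored against the query, the scores are softmax-normalized
    into weights, and the output is the weighted average of the values.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# A query most similar to the second key draws most of its weight
# from the second value.
out = scaled_dot_product_attention(
    query=[1.0, 0.0],
    keys=[[0.0, 1.0], [1.0, 0.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

The key point is that every value contributes to the output, but in proportion to how relevant its key is to the query.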
Natural language processing works in several different ways. AI-based NLP involves using machine learning algorithms and techniques to process, understand, and generate human language. Rule-based NLP involves creating a set of rules or patterns that can be used to analyze and generate language data. Statistical NLP involves using statistical models derived from large datasets to analyze and make predictions on language. Hybrid NLP combines these three approaches.
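A rule-based approach can be as simple as hand-written patterns. The sketch below, with an illustrative (and deliberately non-exhaustive) regular expression, extracts date-like strings from raw text:

```python
import re

# A single hand-written rule: dates shaped like "12/25/2023".
# Illustrative only; a production rule set would cover many more formats.
DATE_RULE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

def extract_dates(text):
    """Apply the rule to find date-like strings in raw text."""
    return DATE_RULE.findall(text)

dates = extract_dates("The meeting moved from 12/25/2023 to 1/3/2024.")
# dates == ["12/25/2023", "1/3/2024"]
```

Rule-based systems like this are predictable and easy to audit, but every new case requires a new hand-written rule, which is exactly the limitation that pushed the field toward statistical and machine learning approaches.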
The AI-based approach to NLP is most popular today. Like with any other data-driven learning approach, developing an NLP model requires preprocessing of the text data and careful selection of the learning algorithm.
Step 1: Data preprocessing
This is the process of cleaning and preparing text so that an NLP algorithm can analyze it. Some common data preprocessing techniques include:
- Text mining extracts structured, analyzable data from large amounts of raw text.
- Tokenization splits text into individual units. These units can be punctuation, words, or phrases.
- Stop word removal eliminates common words, such as articles and prepositions, that aren't very helpful in analysis.
- Stemming and lemmatization reduce words to their basic root form, making it easier to identify their meaning.
- Part-of-speech tagging identifies nouns, verbs, adjectives, and other parts of speech in a sentence.
- Parsing analyzes the structure of a sentence and how its words relate to each other.
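The preprocessing steps above can be sketched in a few lines. This is an illustrative toy pipeline: the stop word list is tiny, and the suffix-stripping "stemmer" is deliberately naive (real systems use Porter stemming or lemmatization):

```python
import re

# Tiny illustrative stop word list; real lists contain hundreds of entries.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to"}

def preprocess(text):
    """Tokenize, lowercase, remove stop words, and crudely stem a sentence."""
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    # Naive suffix stripping standing in for stemming/lemmatization.
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]

tokens = preprocess("The runner was jogging and jumping")
# → ["runner", "jogg", "jump"]
```

Note how crude stemming produces non-words like "jogg"; that is acceptable when the goal is only to map related word forms to a shared key.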
Step 2: Algorithm development
This is the process of applying NLP algorithms to the preprocessed data in order to extract useful information from the text. These are some of the most common natural language processing tasks:
- Sentiment analysis determines the emotional tone or sentiment of a piece of text. Sentiment analysis labels words, phrases, and expressions as positive, negative, or neutral.
- Named entity recognition identifies and categorizes named entities such as people, locations, dates, and organizations.
- Topic modeling groups similar words and phrases together to identify the main topics or themes in a collection of documents or text.
- Machine translation uses machine learning to automatically translate text from one language to another.
- Language modeling predicts the likelihood of a sequence of words in a given context. It is used in autocomplete, autocorrect, and speech-to-text systems.
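The language modeling task above can be sketched with a simple bigram model: count which word follows which, then predict the most frequent follower, autocomplete-style. The corpus and function names here are illustrative; modern systems use neural networks trained on vastly more text.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count adjacent word pairs so we can predict a likely next word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

model = train_bigram_model([
    "natural language processing is fun",
    "natural language generation is hard",
])
next_word = predict_next(model, "natural")  # → "language"
```

Even this toy model captures the core idea of language modeling: the probability of the next word is estimated from how often it followed the current word in the training data.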
Two branches of NLP to note are natural language understanding (NLU) and natural language generation (NLG). NLU focuses on enabling computers to understand human language using similar tools that humans use. It aims to enable computers to understand the nuances of human language, including context, intent, sentiment, and ambiguity. NLG focuses on creating human-like language from a database or a set of rules. The goal of NLG is to produce text that can be easily understood by humans.
Some of the benefits of natural language processing include:
- Better communication: NLP allows for more natural interaction with search and conversational apps. It can adapt to different styles and sentiments, creating more convenient customer experiences.
- Efficiency: NLP can automate many tasks that would otherwise require human effort. A few examples include text summarization, social media and email monitoring, spam detection, and language translation.
- Content curation: NLP can identify the most relevant information for individual users based on their preferences. Understanding context and keywords leads to higher customer satisfaction. Making data more searchable can improve the efficacy of search tools.
NLP still faces many challenges. Human speech is irregular and often ambiguous, with multiple meanings depending on context. Yet, programmers have to teach applications these intricacies from the start.
Homonyms and irregular syntax can confuse models. And even the best sentiment analysis cannot always identify sarcasm and irony. It takes humans years to learn these nuances, and even then, it's hard to read tone over a text message or email, for example.
Text is published in many languages, while NLP models are typically trained on one specific language. Before feeding text into an NLP pipeline, you have to apply language identification to sort the data by language.
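Language identification can be sketched very simply by counting stop-word overlap. The profiles below are tiny and illustrative; real identifiers use character n-gram statistics or dedicated libraries.

```python
# Tiny illustrative stop-word profiles per language.
PROFILES = {
    "english": {"the", "and", "is", "of", "to"},
    "spanish": {"el", "la", "y", "de", "que"},
}

def identify_language(text):
    """Guess the language by counting stop-word overlap with each profile."""
    words = set(text.lower().split())
    return max(PROFILES, key=lambda lang: len(words & PROFILES[lang]))

lang = identify_language("el gato y la casa de madera")  # → "spanish"
```

The intuition carries over to production systems: very common function words (or character sequences) are strong, cheap signals of a text's language.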
Unspecific and overly general data will limit NLP's ability to accurately understand and convey the meaning of text. Specific domains often require more data to support substantive claims than most NLP systems have available, especially in industries that rely on up-to-date, highly specific information. New research, like ELSER (Elastic Learned Sparse Encoder), is working to address this issue and produce more relevant results.
Processing people's personal data also raises some privacy concerns. In industries like healthcare, NLP could extract information from patient files to fill out forms and identify health issues. These types of privacy concerns, data security issues, and potential bias make NLP difficult to implement in sensitive fields.
NLP has a wide variety of business applications:
- Chatbots and virtual assistants: Users can have conversations with your system. These are common customer service tools. They can also guide users through complicated workflows or help them navigate a site or solution.
- Semantic search: Often used in ecommerce to generate product recommendations. Rather than matching keywords literally, it interprets the intent and context behind a query to provide more relevant results and recommendations.
- NER: Identify information in text to fill out forms or make it more searchable. Educational institutions can use it to analyze student writing and automate grading. Plus, text-to-speech and speech-to-text capabilities make information more accessible and communication easier for people living with disabilities.
- Text summarization: Researchers across industries can quickly summarize large documents into concise, digestible text. The finance industry leverages this to analyze the news and social media to help predict market trends. The government and the legal industry use it to extract key information from documents.
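The semantic search application above can be sketched as vector similarity: documents and queries are embedded as vectors, and results are ranked by cosine similarity. The hand-made three-dimensional "embeddings" below are stand-ins for real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, doc_vecs):
    """Rank document names by similarity to the query embedding."""
    return sorted(doc_vecs,
                  key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
                  reverse=True)

# Hand-made toy "embeddings" standing in for a real embedding model.
docs = {
    "running shoes": [0.9, 0.1, 0.0],
    "coffee maker": [0.0, 0.2, 0.9],
}
ranking = semantic_search([0.8, 0.2, 0.1], docs)
# → ["running shoes", "coffee maker"]
```

Because similar meanings land near each other in the embedding space, a query can match a relevant document even when they share no keywords at all.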
ChatGPT and generative AI promise to transform the field. With technologies such as ChatGPT entering the market, new applications of NLP could be close on the horizon. We will likely see integrations with other technologies such as speech recognition, computer vision, and robotics, resulting in more advanced and sophisticated systems.
NLP will become more personalized, too, allowing machines to better understand individual users and adapt their responses and recommendations. NLP systems that can understand and generate multiple languages are a major growth area for international business. Most importantly, NLP systems are constantly getting better at generating natural-sounding language: They are sounding more and more human every day.
The release of the Elastic Stack 8.0 introduced the ability to upload PyTorch models into Elasticsearch to provide modern NLP in the Elastic Stack, including features such as named entity recognition and sentiment analysis.
The Elastic Stack currently supports transformer models that conform to the standard BERT model interface and use the WordPiece tokenization algorithm.
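The WordPiece algorithm mentioned above splits unknown words into known subword pieces using a greedy, longest-match-first strategy. Here is a simplified sketch over a toy vocabulary (real BERT vocabularies contain roughly 30,000 pieces):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece: split a word into subword
    units, marking word-internal pieces with the '##' continuation prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of a word, not its start
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece covers this span
        start = end
    return pieces

vocab = {"play", "##ing", "##ed", "un", "##play"}
tokens = wordpiece_tokenize("playing", vocab)  # → ["play", "##ing"]
```

Subword tokenization is what lets a fixed-size vocabulary cover an open-ended set of words: "unplayed" becomes `["un", "##play", "##ed"]` even if the full word was never seen in training.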
Here are the architectures currently compatible with Elastic:
- DPR bi-encoders
- SentenceTransformers bi-encoders built on supported transformer architectures
Elastic lets you leverage NLP to extract information, classify text, and provide better search relevance for your business. Get started with NLP with Elastic.