There is a vast collection of textual data on the internet and in various organizational databases today, the overwhelming majority of which is not structured in an easily accessible manner. Natural language processing (NLP) can be used to make sense of unstructured data collections in a way that allows the automatization of important decision-making processes that would otherwise require a significant investment of time and effort to achieve manually.
In this article, we discuss one of the ways in which NLP can be used to structure textual data. Domain classification, also known as topic labeling or topic identification, is a text classification method which is used to assign document domain or category labels to documents of various types and lengths. A “domain” or “category” can be understood as either a conversational domain, a particular segment of the industry, or even a specific genre of text, depending on the application. For instance, a textual database may contain documents that pertain to the Legal domain, Healthcare, the Hospitality industry, and many others. Organizations structure their data in this manner in order to make individual documents more readily accessible and the retrieval of relevant information more efficient.
This article is accompanied by a hands-on tutorial that navigates the reader through the entire process of building a domain classification pipeline from data preprocessing to the training and evaluation of an artificial neural network. With just basic knowledge of the Python programming language, anyone can use this tutorial to achieve around 84 percent accuracy in a sentence-level domain labeling task using the BBC news data set and learn about sentence embeddings, an increasingly popular numerical text representation method.
Automatic text classification methods, such as domain classification, make it possible for data owners to structure their data in a scalable and reproducible manner, which not only means that large numbers of documents can be sorted automatically in a very short amount of time, but also that classification criteria can remain consistent over longer periods. Moreover, text classification methods allow businesses to acquire valuable, real-time knowledge about the performance of their tools or services, which enables better and more accurate decision-making.
For example, a domain classification algorithm can be used to organize documents in an unstructured database by topic. This opens the way to a large variety of further, more specific applications, which can include anything from analyzing current trends in online discussion forums to selecting the right kind of training data for a machine translation (MT) system.
Other text classification tasks include natural language processing problems such as spam filtering, sentiment analysis, and language identification, all of which share the fundamental challenge of having to make sense of structural, linguistic, and semantic cues in written documents to successfully assign a correct category label to them. Domain identification, along with related tasks like the ones listed above, provides the foundation for a wide range of NLP solutions.
Domain classification can be performed either manually or automatically. However, since manual text classification requires intensive labor conducted by experts and is therefore no longer applicable in the age of big data, we focus on automatic methods for the remainder of this article.
Domain identification is a subfield of NLP, a discipline that lies at the intersection of linguistics and information technology. Its aim is to gain an understanding of human language, using the tools of computer science, primarily machine learning (ML) and artificial intelligence (AI).
Text classification algorithms are generally divided into three distinct categories: rule-based systems, machine learning models, and hybrid approaches. Rule-based systems, which were particularly popular in the early days of NLP, make use of carefully designed linguistic rules and heuristics that are able to identify patterns in the language of a specific document based on which domain labels can be applied. They often rely on meticulously crafted word lists that help determine the topic of a given text; consider, for example, the Aviation category, which is likely to contain words such as “aircraft”, “altitude”, and “radar”. However, although such algorithms can perform rather well at determining the domain of a piece of text and their inner workings are easy to understand, they require considerable domain expertise and considerable effort to create and maintain.
Nevertheless, instead of applications based primarily on manually prepared linguistic rules, the field is currently dominated by machine learning systems. These can be further separated into three main categories, which are supervised, unsupervised, and semi-supervised systems.
Supervised machine learning algorithms learn from associations between a collection of data points and their corresponding labels. For example, one can train an ML system to correctly identify the topic of news articles by showing the model thousands or even millions of examples from a variety of categories and employing one of many available learning mechanisms. Whether based on deliberately selected features, such as bag-of-words representations or tf-idf vectors, or characteristics of the data that the models discover themselves, they are capable of applying learned “knowledge” when predicting labels for previously unseen texts. Supervised learning mechanisms include the Naive Bayes algorithm, support vector machines, neural networks, and deep learning.
In contrast, unsupervised systems can be used when the training set does not contain any labels and so the model has to essentially learn based solely on the internal characteristics of the data it encounters. Clustering algorithms, which are popular in text classification, fall under this category. On the other hand, semi-supervised ML algorithms learn associations from a combination of labeled and unlabeled training data.
Many domain classification approaches employ a combination of handcrafted rules and machine learning techniques, which makes them reliable and flexible tools for data enhancement. These methods are often referred to as hybrid systems.
All machine learning algorithms require plentiful and high-quality training data in order to perform well. Fortunately, there are a number of open-source datasets on the internet that anyone can download free of charge and use to train or improve domain classification models.
One of the most popular benchmarks in text classification research is the BBC news data set, which contains 2225 articles from five different domains (Business, Entertainment, Politics, Sport, and Tech). This data set is also used in the accompanying tutorial.
Another collection of news articles, namely the Reuters-21578 data set, contains an even larger variety of domains that cover a broad range of economic subjects. Some examples are Coconut, Gold, Inventories, and Money-supply. As the name suggests, the collection contains 21,578 news articles of variable length.
Similarly, the 20 Newsgroups data set contains 20,000 messages taken from twenty newsgroups, which are organized into hierarchical categories. Some of the top-level categories are Computers, Recreation, Science, while lower-level domains include Windows, Hardware, Autos, Motorcycles, Religion, Politics.
Furthermore, many organizations possess a large number of internal datasets in the form of structured or unstructured collections of text that they have been collecting over the years. It is possible to convert such collections into useful data to train ML models, including domain classifiers and other advanced NLP applications, by first applying a set of data cleaning and enhancement techniques.
Automatic domain classification relies on NLP which involves a variety of techniques from text preprocessing tools to machine learning algorithms. Therefore, in order to build a domain identification system, one must be familiar with a programming language (Python is the most widely used one in the NLP community), various libraries and toolkits, and the basics of machine learning and statistical analysis.
The term “text preprocessing” refers to the steps in an NLP pipeline that are related to the preparation of data for ML systems. Basic preprocessing steps include, but are not limited to, tokenization (the splitting of sentences into roughly word-level chunks), lemmatization (the conversion of words into their dictionary form), and part-of-speech tagging (labeling each word according to its grammatical properties). Preprocessing can also refer to the process of converting human language into a numerical representation so that computers can make sense of the textual information. An example of this is vectorization, in which words or sentences are converted into vectors, which encodes their linguistic and semantic characteristics. Some Python libraries that offer text preprocessing capabilities include Stanza, spaCy, AllenNLP, TextBlob, and the NLTK platform.
When it comes to the ML component, there exists a variety of well-maintained and easy-to-use libraries to choose from. Scikit-learn is a popular option for beginners, as it has excellent documentation and offers multiple different classifier models that can be implemented with little effort. For deep learning, both the TensorFlow and PyTorch software libraries are popular choices. Using either of these platforms, anyone with basic programming skills can build efficient neural networks that are capable of performing at nearly state-of-the-art level at various NLP tasks.
In our tutorial, we build a domain classification system based on the BBC News data set. The goal is to create a domain classifier that is capable of assigning category labels to the sentences of the collection. This is somewhat different from the traditional use case of this task that requires documents, such as articles and reviews, to be sorted by topic rather than individual sentences. One might even argue that the task is considerably more difficult at this level, because sentences contain much less domain-specific information than longer texts. For example, consider the following sentences from the data set:
- But it was so dramatic!
- I found him very powerful.
- It’s not good, but it is very understandable.
Taken out of context, it is practically impossible to determine what domain these sentences belong to, regardless of whether a human or an ML model attempts to do so. In reality, domain classification might not be appropriate for such sentences at all.
In the tutorial, we cover the following steps of the domain classification pipeline:
- Python environment and packages
- Instructions on how to access the data
- Data preprocessing
- Splitting articles into sentences
- Creating sentence-level labels
- Separating the data set into training, development, and test sets
- Sentence vectorization
- Building a simple feedforward neural network
- Training and classification
Head over to this GitHub repository for the full tutorial.
9 minute read