Document classification is a method of automatically organizing unstructured text-based files such as .docx or .pdf into categories. By classifying files based on their content, text document classification can be used for consistent categorization even if file names are inconsistent or unrepresentative of the content, or if they are in different formats, such as scans or images.
Automatic document classification has three main use cases:
- Categorization – Automatically sort documents into categories such that they can be dealt with in batches
- Identification – Extract document characteristics such as language, genre or topic
- Analytics – Identify trends, patterns or correlations across multiple documents, such as in a meta-analysis of scientific literature or across an organization’s technical support tickets.
Document categorization is important for organizations that deal with a wide variety of paperwork, such as those in the public sector, healthcare, finance, and legal services. For example, a lawsuit may require a large number of documents to be submitted as evidence, such as contracts, email exchanges, invoices, bank statements, and transcripts. These may vary widely in format, structure, and content.
Classification can be used for document indexing, storage and retrieval. This can either be for storing files in a cloud-based service, or automatically categorizing content on a website. Text classification can enable easier and more efficient search queries.
To categorize a document, a machine learning algorithm can also take advantage of the document’s structure. Documents can be:
- Structured – These include any documents which are completed using templates such as forms, questionnaires, tests and the like.
- Semi-Structured – These include documents which have a format, but may differ from one document to another. For example, invoices typically contain the same type of information, but the positioning in the document may vary from one document to another.
- Unstructured – These are documents containing free-form text, such as letters and emails, contracts, articles, and others.
Types of Classification Models
Classification approaches can be divided into three categories: supervised, unsupervised, and rule-based.
In supervised classification, the model is trained using labeled documents grouped in predetermined classes. This entails a training set that is manually labeled by humans. We will mainly be focusing on supervised classification for the rest of this article.
In unsupervised classification, unlabeled documents are grouped into clusters based on similar characteristics. There are no predetermined categories to which the algorithm assigns documents.
Rule-based classification leverages sentiment analysis, morphology, lexis, syntax, and semantics to automatically tag the text and complete a classification task. For example, wording such as ‘Customer shall’ and ‘We shall’ is generally used within contracts and agreements.
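As a rough illustration of the rule-based idea, the sketch below scores a text against a few phrase patterns per category. The rule set, the category names, and the `classify_by_rules` helper are hypothetical and chosen only for this example; a real rule-based system would rely on far richer linguistic rules.

```python
import re

# Hypothetical rules: each category maps to phrases that suggest it.
RULES = {
    "contract": [r"\bcustomer shall\b", r"\bwe shall\b"],
    "invoice": [r"\binvoice number\b", r"\bamount due\b"],
}

def classify_by_rules(text, rules=RULES, default="unknown"):
    """Return the category whose patterns match most often, or `default`."""
    scores = {}
    for category, patterns in rules.items():
        scores[category] = sum(
            len(re.findall(p, text, flags=re.IGNORECASE)) for p in patterns
        )
    best = max(scores, key=scores.get)
    # If no pattern matched at all, fall back to the default label.
    return best if scores[best] > 0 else default

print(classify_by_rules("The Customer shall pay all fees. We shall deliver the goods."))
# contract
```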
Training Data Preparation and Preprocessing
To train a deep learning document classification model, the algorithm needs to be fed high-quality labeled data. To generate a high-quality training dataset, first consider the following:
- Define the classes or categories – Determine the categories into which the document classification model will sort documents. These may vary by use case, but some examples include categorizing news articles by topic (sports, politics, business), categorizing financial documents (invoices, statements, purchase orders) and categorizing human resources documents (passport, driving license, proof of address). The number of data points in each class should be balanced, as any imbalance would require reconfiguring the model or artificially balancing the dataset by undersampling or oversampling classes.
- Obtaining the dataset – This entails the collection of relevant data points for your use case. Thankfully, there are plenty of free and reputable datasets available on the internet. We have compiled a list of the major ones here.
- Formatting – This step ensures that all the documents are in a consistent text-based format. Especially important to note here are documents which are either images or scans. To include those in the training or test sets, we need to use an optical character recognition (OCR) tool to extract text and meta-data from images.
- Data cleaning and transformation – For a model to efficiently read text-based data, apply the following transformation processes:
- Case correction: convert all text to either uppercase or lowercase.
- Regex for non-alphanumeric characters: remove all characters which are not alphanumeric, such as punctuation.
- Word tokenization: split each page’s text string into a list of words.
- Stopword removal: stopwords are common words in a language, such as “the”, “is”, and “a”, that do not help in classifying individual documents. Stopwords can also be domain-specific words that appear frequently across many documents, such as the word ‘price’ in finance documents. These can be removed as well.
- Splitting data between training and testing – Once the dataset has been obtained and processed, split the data for training and testing. A common ratio is 80% for training and 20% for testing. The data should also be randomly shuffled and split in a stratified fashion, so that each class keeps the same proportion in both sets.
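The cleaning and splitting steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production pipeline: the stopword list is deliberately tiny, the regex keeps only ASCII letters and digits, and the `preprocess` and `stratified_split` helpers are names invented for this example.

```python
import random
import re
from collections import defaultdict

# A tiny illustrative stopword list; real projects use a fuller one.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip non-alphanumeric characters, tokenize, drop stopwords."""
    text = text.lower()
    # Assumption: ASCII-only text; accented letters would also be stripped here.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in stopwords]

def stratified_split(samples, test_ratio=0.2, seed=42):
    """Split (text, label) pairs so each class keeps roughly the same ratio."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    rng = random.Random(seed)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_ratio))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

print(preprocess("The invoice is due: 30 days, net!"))
# ['invoice', 'due', '30', 'days', 'net']
```

Splitting per class before shuffling the combined sets is what keeps the class proportions the same in the training and test data.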
Text Vectorization
For natural language processing (NLP) algorithms to process text and text-based files, these need to be converted into a format that the machine can understand, namely numerical values. To do that, we will be transforming text information into feature vectors. A vector is a list of numerical values representing some of the text’s characteristics.
One common approach for extracting features from text is to use the bag of words model: a model where for each document, an article in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.
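A minimal bag-of-words sketch, using only the standard library, might look like this. The `bag_of_words` helper and the whitespace tokenization are simplifying assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, vectors) where each vector holds word counts."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vectors = bag_of_words(["loan rate loan", "rate loan loan"])
print(vocab)    # ['loan', 'rate']
print(vectors)  # [[2, 1], [2, 1]]: same counts, word order is ignored
```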
There are multiple ways of converting text into vectors. These are packaged into machine learning libraries such as scikit-learn (also known as sklearn), an open source machine learning library that is particularly accessible to beginners.
Below is a list of models used to convert text-based files into vectors.
- Vector Space Models (VSMs) are types of models which embed words into a continuous vector space where semantically similar words are mapped to nearby points. VSMs include methods such as the ones below.
- Count-based methods compute how often a word co-occurs with its neighbors in a large text corpus, and then map these count statistics to a vector for each word, often a small, dense one. TF-IDF is a common count-based weighting.
- Predictive methods try to predict a word from its neighbors in terms of learned small, dense embedding vectors (e.g. skip-gram, CBOW). Word2Vec and Doc2Vec belong to this category of models.
- Word2Vec Model can detect synonymous words or suggest additional words for a partial sentence by training a neural network model to learn word associations from a large corpus of text. It uses either continuous bag-of-words (CBOW) or continuous skip-gram model architectures to produce a distributed representation of words.
- In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). CBOW is faster than skip-gram, but it tends to do less well on infrequent words.
- In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. Skip-gram does a better job for infrequent words than CBOW.
- Doc2Vec Model – A generalization of Word2Vec, Doc2Vec is an NLP tool for representing documents as vectors. It’s an unsupervised algorithm that learns fixed-length feature vector representations from variable-length pieces of text. These vectors can then be used in any machine learning classifier to predict the class label. Compared to Word2Vec, it adds a unique column for each text file to a matrix (called the paragraph matrix). A single-layer neural network is then trained, where the input is the words surrounding the current word together with the current paragraph column, and the output is a prediction of the current word.
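To make the difference between the two Word2Vec architectures concrete, the sketch below builds the training examples each one would see for a short sentence. It covers only how inputs and targets are paired, not the neural network itself, and the helper names are invented for this example.

```python
def context_window(tokens, index, window):
    """Words within `window` positions of tokens[index], excluding it."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: (context words) -> current word."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: current word -> each context word, one pair at a time."""
    return [
        (tokens[i], context)
        for i in range(len(tokens))
        for context in context_window(tokens, i, window)
    ]

tokens = ["the", "customer", "shall", "pay"]
print(cbow_examples(tokens, window=1)[1])
# (['the', 'shall'], 'customer')
print(skipgram_examples(tokens, window=1)[:3])
# [('the', 'customer'), ('customer', 'the'), ('customer', 'shall')]
```

Skip-gram produces one training pair per context word, which is part of why it is slower to train than CBOW on the same text.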
With count-based approaches we can calculate measures such as the term frequency-inverse document frequency (TF-IDF) vector for each document. TF-IDF is a statistic which represents each word’s importance in a document. The term frequency part uses a word’s frequency as a proxy for its importance. For example, if the word ‘loan’ is mentioned 20 times in a document, it is highly likely that the word is more important than if it was only mentioned once.
The document frequency represents the number of documents containing a given word to determine how common the word is. This minimizes the effect of domain-specific stop-words that do not add much information. The rationale behind calculating the inverse is that words which appear multiple times in multiple documents may not provide much information. However, if a word is repeated frequently in only one document, but not in the rest, the word represents a piece of information specific to that document.
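The following sketch computes TF-IDF by hand for three toy documents. It assumes one common formulation, term frequency as count divided by document length and inverse document frequency as log(N / document frequency); libraries often use smoothed variants, so their numbers will differ from the ones produced here.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {word: tf-idf weight} dict per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

docs = [
    ["loan", "loan", "price"],
    ["invoice", "price"],
    ["contract", "price"],
]
weights = tf_idf(docs)
print(weights[0])
```

Here ‘price’ appears in every document, so its weight is zero, while ‘loan’, which is frequent in only the first document, receives the highest weight there.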
Classifier Model
Once the text is converted to a vector format, it is ready for a machine learning classifier to learn the patterns present in the vectors of different document types and identify the correct distinctions for a classification problem.
Common classifier models for document classification include logistic regression, random forest, Naive Bayes, and the k-nearest neighbors algorithm.
Logistic Regression is a classification algorithm used when the target can be expressed as a binary output: a document either belongs to a category or it does not. The model outputs a probability, which is compared against a threshold to make that decision.
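As a small sketch of the binary case, the code below fits a logistic regression with plain stochastic gradient descent on made-up word-count features. The feature choice, learning rate, and epoch count are arbitrary assumptions for illustration; in practice a library implementation would normally be used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Probability that feature vector x belongs to the positive class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_logistic(vectors, labels, lr=0.5, epochs=200):
    """Fit weights and bias with plain stochastic gradient descent."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            error = predict_proba(x, weights, bias) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Hypothetical word-count features: [count("invoice"), count("shall")]
X = [[3, 0], [2, 0], [0, 2], [0, 3]]
y = [1, 1, 0, 0]  # 1 = invoice, 0 = not an invoice
weights, bias = train_logistic(X, y)
print(predict_proba([4, 0], weights, bias) > 0.5)  # True
print(predict_proba([0, 4], weights, bias) > 0.5)  # False
```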
Random Forest is a model which consists of a large number of individual decision trees that operate as an ensemble. Using a ‘wisdom-of-the-crowd’ approach, each individual tree in the random forest produces a class prediction. The class with the most votes becomes the trained model’s prediction. The key for a successful random forest prediction is low correlation between the decision trees within the model. Uncorrelated models can produce ensemble predictions that have higher accuracy compared to any of the individual predictions as the trees protect each other from their individual errors.
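The voting step, and the reason low correlation matters, can be sketched as follows. The tree outputs are hypothetical, and `ensemble_accuracy` assumes an idealized case: a two-class problem in which every tree is correct independently with the same probability.

```python
import math
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def ensemble_accuracy(p, n_trees):
    """Chance that a majority of n independent trees is right (binary case)."""
    needed = n_trees // 2 + 1
    return sum(
        math.comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Hypothetical outputs of five decision trees for one document.
tree_predictions = ["invoice", "purchase_order", "invoice", "invoice", "contract"]
print(majority_vote(tree_predictions))      # invoice
print(round(ensemble_accuracy(0.6, 5), 5))  # 0.68256
```

Under these assumptions, five trees that are each right 60% of the time vote correctly about 68% of the time. Real trees are never fully independent, so the actual gain is smaller.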
Naive Bayes classifier – Naive Bayes classification makes use of Bayes’ theorem to determine the probability of an item belonging to a category. Naive Bayes sorts items into categories based on whichever probability is highest. For example, what is the probability that a document containing the words “price”, “rate” or “VAT” is an invoice rather than a purchase order? The model is dubbed ‘naive’ because it treats the appearance of each word in a document as independent, with no correlation with other words in the text. This assumption rarely holds in natural language, as words have associated semantic fields, where, for example, the probability of the word ‘politics’ is related to the probability of the word ‘government’. Despite this, Naive Bayes works surprisingly well with the bag of words model, and has notably been used for spam detection.
K-nearest neighbors algorithm – A supervised learning algorithm which assigns new documents to categories by comparing new inputs to the ones used to train the algorithm. The KNN algorithm is generally used for datasets with fewer than one hundred thousand labeled non-textual data samples. K in KNN is a parameter that refers to the number of nearest neighbors to a particular data point that are included in the decision-making process. This is the core decision factor, as the classifier output depends on the class to which the majority of these neighboring points belong. For example, if K is 5, the algorithm takes into account the five nearest neighboring data points when determining the class of the object. Choosing the right value of K is known as parameter tuning. As the value of K increases, the prediction curve becomes smoother.
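A compact multinomial Naive Bayes sketch over bag-of-words tokens is shown below. The add-one (Laplace) smoothing, the choice to skip unseen words, and the toy training set are assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (tokens, label). Returns counts needed for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict_naive_bayes(tokens, model):
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # log P(class) + sum of log P(word | class), with add-one smoothing
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokens:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    (["price", "rate", "vat", "total"], "invoice"),
    (["vat", "price", "due"], "invoice"),
    (["order", "quantity", "delivery"], "purchase_order"),
    (["order", "quantity", "price"], "purchase_order"),
]
model = train_naive_bayes(training)
print(predict_naive_bayes(["price", "vat", "rate"], model))  # invoice
```

Summing log probabilities word by word is exactly where the independence assumption enters: each word contributes to the score without regard to the others.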
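A minimal KNN sketch with a majority vote among the K closest training vectors is given below. Euclidean distance and the two-dimensional toy features are assumptions for illustration; other distance measures are also used in practice.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, training, k=5):
    """training: list of (vector, label). Majority vote among k nearest."""
    neighbors = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D feature vectors, e.g. [count("price"), count("shall")]
training = [
    ([5, 0], "invoice"), ([4, 1], "invoice"), ([6, 0], "invoice"),
    ([0, 5], "contract"), ([1, 4], "contract"), ([0, 6], "contract"),
]
print(knn_predict([5, 1], training, k=3))  # invoice
print(knn_predict([1, 5], training, k=3))  # contract
```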
Conclusion
Document classification is a rapidly evolving space with many automation use cases across different verticals. It’s been a particularly helpful technique in improving services such as spam email filtering, with more use cases emerging in areas such as content moderation and document storage.
When working with supervised machine learning algorithms for document classification, the data labeling process is one of the most important factors that determine the quality of the algorithm’s output. Especially when dealing with specialized industries, such as finance, legal, government, and healthcare, data should be annotated by trained annotators who understand the nuances between different types of documents.
iMerit natural language understanding experts can help you optimize your document classification model by providing a consistent and high-quality flow of labeled training datasets. To learn more about our document and text labeling services, contact us today.