Navigating data preparation for natural language processing (NLP) applications comes with its fair share of complexity. With such a wide variety of tasks falling under NLP, developing clear selection criteria can help accelerate and improve a project's outcome. There are three key considerations when choosing your data labeling path:
- Security: NLP applications often involve sensitive data; can document security be guaranteed during the labeling process?
- Expertise: How much subject matter expertise do labelers need to produce an accurate dataset?
- Volume: How much data needs to be labeled?
In this piece, we'll look at three examples: question answering engines, NLP for legal texts, and medical NLP combined with computer vision, to illustrate the key criteria and considerations when selecting your data labeling solution.
Question Answering Engine
Question answering (QA) is an exciting subfield of NLP focused on building models designed to facilitate information retrieval around a specific topic or set of topics. QA systems can be used as the backend for a virtual assistant tool on a travel website, allowing users to ask a variety of questions about available flights. The system can then respond with relevant information that helps users coordinate their booking.
The best QA models leverage some form of deep learning, and therefore require access to very large datasets to be trained adequately. A fantastic dataset for question answering is the Stanford Question Answering Dataset (SQuAD), which contains over 100,000 questions based on Wikipedia articles, with each answer given as a span of text within the corresponding article.
For QA engines, a crowdsourced data labeling solution can be perfectly adequate, and SQuAD is a prime example: its questions were both posed and answered by crowdsourced workers, which worked well because answering them required only English reading comprehension.
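To make the structure concrete, here is a minimal sketch of a SQuAD-style record, along with the kind of alignment check a labeling pipeline can run on crowdsourced annotations. The field names follow the public SQuAD v1.1 JSON schema; the passage, question, and answer are invented for illustration.

```python
# A SQuAD-style record: the answer is a span of the context, located by
# character offset. (Passage and question are invented for illustration.)
record = {
    "context": "The Wright brothers made the first powered flight in 1903.",
    "question": "When was the first powered flight?",
    "answers": [
        {"text": "1903", "answer_start": 53},
    ],
}

# A basic labeling-quality check: the stored span must match the context
# exactly, or the crowdsourced annotation is misaligned.
for ans in record["answers"]:
    start = ans["answer_start"]
    span = record["context"][start : start + len(ans["text"])]
    assert span == ans["text"], f"misaligned annotation: {span!r}"
```

Checks like this are cheap to automate and catch a common crowdsourcing failure mode (off-by-one character offsets) without requiring any subject matter expertise.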
If the guidelines needed to generate a valid dataset are complicated, professional data annotation is typically the only answer, which can be done in-house (budget permitting) or by an annotation service. For example, if a QA system is being designed to answer a researcher's highly specialized questions, crowdsourcing will fail to meet this demand, as annotators lack the expertise required for accurate annotation.
NLP for Law
NLP has seen increasing adoption within the legal industry as firms leverage it to:
- Perform legal research
- Conduct electronic discovery
- Automate contract review
- Draft & analyze legal documents
- Predict rulings
Security is the biggest challenge when annotating data for use in a legal AI model. Legal documentation demands the highest level of security due to the proprietary nature of its content. To adequately protect confidentiality, the documents must be annotated by a professional service that can guarantee the security of anything shared while creating and annotating the datasets.
There's also the issue of legal expertise. Because legal documents are highly structured and dense, it's best to have a legal expert annotate and label them. Only individuals with a law degree, or who have worked in law in some capacity (clerk, secretary, paralegal, etc.), will be able to sift through complicated legal terminology and understand it.
Medical NLP and Combined Computer Vision
Medicine is another key area where NLP is increasingly finding new applications, particularly in the exam room. Doctors using speech-to-text models packaged in dictation software can easily record clinical notes verbally, allowing them to focus more on their patients while also shortening exam times.
Training an NLP system around the medical lexicon naturally demands a certain level of subject matter expertise. Correctly identifying and transcribing terms like “appendicitis” or “cephalexin” when annotating a dataset requires strong English fluency with some medical background and knowledge. This is why it’s best to source data labelers from a professional service that guarantees a high level of subject matter expertise.
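One way expertise shows up in practice is in quality checks on transcribed labels. The sketch below is a hypothetical helper, not part of any real annotation service: it flags candidate terms missing from a reference vocabulary so a subject matter expert can review them. A real pipeline would use a proper medical lexicon such as UMLS rather than this tiny illustrative set.

```python
# Illustrative reference vocabulary; a real system would use a full
# medical lexicon (e.g. UMLS), not a hand-picked set.
MEDICAL_VOCAB = {"appendicitis", "cephalexin", "glioblastoma"}

def flag_unknown_terms(transcript: str, vocab=MEDICAL_VOCAB) -> list[str]:
    """Return candidate medical terms not found in the reference
    vocabulary, so a subject matter expert can review them."""
    words = [w.strip(".,;:").lower() for w in transcript.split()]
    # Crude candidate filter (long words only) for the sake of the sketch.
    return [w for w in words if len(w) > 8 and w not in vocab]

flag_unknown_terms("Patient has appendicitis; given cephalexin.")  # → []
flag_unknown_terms("Patient has pancreatitis.")  # → ['pancreatitis']
```

Even a simple gate like this routes uncertain transcriptions to an expert reviewer instead of letting misheard drug names slip into the training set.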
The medical field is also seeing a number of scenarios in which NLP and computer vision (CV) models are combined. Because NLP and CV models typically require large amounts of training data, the crowdsourcing route often wins out, as it can result in substantial cost savings. This is the best choice for simpler use cases.
However, imagine a scenario in which a medical team is using diagnostic radiology images such as x-rays, MRIs, and CT scans to identify potential sources of cancer in patients. Suppose also that the system is twofold: a CV model identifies the suspicious tumor or lesion in the radiology image, and an NLP model then takes that selected region and labels it with text specifying the tumor's location and type. For example, it might take in an image of the brain and label it as "Glioblastoma located in the prefrontal cortex".
In this scenario, medical expertise is critical on behalf of the data labelers as differentiating between tumor subtypes can be challenging, even for highly trained doctors. Annotators will also require familiarity with the medical terms being used for the NLP portion. In such a circumstance, a specialized data labeling service is your best bet.
NLP encompasses a wide range of subfields, each with very different dataset structures and needs. For basic tasks such as named entity recognition (NER), sentiment analysis, or classification, traditional crowdsourcing may prove adequate. If your business works with NLP in a highly specialized domain where best-in-class data security and annotation accuracy are paramount, contact iMerit today to speak with an annotation expert.