
Medical AI: Is the field suffering from a lack of data?

June 18, 2020

Medical AI does not suffer from a lack of data. Millions upon millions of images (of internal organs, tumors, fetuses, skeletal components, dental features, retinas, etc.) reside in data banks in every city, state, and country. In addition, the increased use of video-driven technology for robotic, procedural, and non-invasive surgery adds to the “digitization” of clinical imaging. What secrets do these images hold? What similarities and differences might prompt researchers to devise new treatments, or clinicians to diagnose serious illnesses before they cause irreparable harm or death to the patient? Only time will tell.

Interestingly, data by itself is neither information nor knowledge. This view can be counter-intuitive in a clinical setting. An oncologist might insist that a CT scan, for example, contains information that is absolutely vital. But this is true only because the radiologist brings a vast amount of experience to the viewing of the image. Without that experience, the image is not useful. “Data is a set of discrete, objective facts about events,” writes data scientist Thomas Davenport. For data to be useful, he writes, it requires context, categorization, and other attributes that only human input can provide. This is the role of the data annotator, and it is critical.

The job is made more challenging by the fact that even the “objectivity” of a CT scan or other imaging data can be called into question. Just as different computer monitors vary in color and show differences ranging from subtle to extreme, different scanning systems may assign different gray-scale values to similar structures, leading to edge cases that call for expert evaluation. Does this gray area represent healthy tissue or diseased tissue? These questions must be answered before an image can be rendered mathematically to help train a machine learning model. As in any endeavor, perfection is impossible to achieve, but greater accuracy in annotation makes the iterative machine learning process far more efficient.
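
To make the scanner-variation point concrete, here is a minimal Python sketch. It is only an illustration: the gray-scale values, the 15-level offset between the two scanners, the cutoff of 130, and the function name `label_by_threshold` are all invented for this example and do not come from any real scanner or clinical protocol. It shows how a fixed numeric rule can label the same structure differently depending on which scanner produced the image, which is the kind of edge case that calls for an expert annotator.

```python
# Hypothetical illustration: the same five tissue samples imaged on two scanners.
# Every number here (values, offset, threshold) is invented for this sketch.

def label_by_threshold(pixels, threshold):
    """Label a gray-scale value 'diseased' if it meets the threshold, else 'healthy'."""
    return ["diseased" if p >= threshold else "healthy" for p in pixels]

# Gray-scale values (0-255) recorded by scanner A.
scanner_a = [90, 110, 128, 140, 170]

# Assumption: scanner B renders the same structures 15 levels darker.
SCANNER_B_OFFSET = -15
scanner_b = [p + SCANNER_B_OFFSET for p in scanner_a]

THRESHOLD = 130  # hypothetical cutoff between healthy and diseased tissue

labels_a = label_by_threshold(scanner_a, THRESHOLD)
labels_b = label_by_threshold(scanner_b, THRESHOLD)

# Samples whose label depends on the scanner are the edge cases
# that a naive rule cannot settle and an expert would need to review.
disagreements = [
    i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b
]
print(disagreements)
```

In this toy setup, the sample that scanner A records at 140 falls to 125 on scanner B and crosses the cutoff, so the two scanners disagree about it. A rule applied blindly would train a model on contradictory labels; an annotator who knows how each scanner behaves can resolve the conflict.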