iMerit Solutions Architect Mallory Dodd is featured on the MapScaping Podcast episode “Labels Matter,” hosted by Daniel O’Donohue. She discusses how to create and maintain training datasets for AI algorithms, the various types of annotation and labeling for machine and deep learning, and the importance of quality training data for AI and ML.
Here are 5 key takeaways from the session:
- Building training datasets is the process of human annotators going through and labeling raw or unstructured data according to the criteria an AI/ML algorithm needs to accurately identify and interpret its inputs. Building training data can take a long time: the more complex a model’s goal, the more training data is required to train it. Additionally, if one needs to pull multiple data points and add attributes from each training data source, this adds further time and complexity.
- Annotation is the human part of the process in generating training data for a model. Annotators physically mark out the features they want the model to learn and, if necessary, add additional tags to help the model describe the image. There are various types of annotation and labeling, some of which are better suited to specific use cases than others.
- The human-in-the-loop process is ongoing because the training datasets may need to be revisited if a new source of data becomes available or if the project’s requirements change.
- The key to getting the expected results from the final model is to use training data that is as close as possible to the data the model will run on. This is known as ground truthing. Differences in resolution can cause problems because the model is seeing the target in a different context than the one it was trained on, degrading the quality of the results, if the model works at all. Other differences to consider are the type of sensor used, the angle from which the target is viewed, and lighting and weather conditions.
- Human intelligence is necessary in order to make sure the end product is meaningful to humans. Keeping humans in the loop encourages transparency throughout development of the model, and ultimately results in a better product.
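To make the annotation step above concrete, here is a minimal sketch of what one human-labeled training example might look like: an annotator-drawn bounding box plus descriptive attribute tags. The function and field names (`make_annotation`, `image_id`, `bbox`, `attributes`) are illustrative assumptions, not the schema of any particular annotation tool.

```python
# Hypothetical sketch of a single labeled training example produced by a
# human annotator: a bounding box around a feature, a class label, and
# extra attribute tags that help describe the image to the model.
# All names here are illustrative, not a real tool's schema.

def make_annotation(image_id, label, bbox, attributes=None):
    """Bundle one annotator-drawn bounding box with its label and tags."""
    x, y, width, height = bbox
    if width <= 0 or height <= 0:
        raise ValueError("bounding box must have positive width and height")
    return {
        "image_id": image_id,
        "label": label,
        "bbox": {"x": x, "y": y, "w": width, "h": height},
        "attributes": attributes or {},
    }

# One annotated object in an aerial image, with attributes an annotator
# might add so the model can interpret the scene more precisely.
example = make_annotation(
    image_id="tile_0421",
    label="building",
    bbox=(134, 58, 40, 25),
    attributes={"roof_type": "flat", "occluded": False},
)
print(example["label"], example["bbox"]["w"])
```

A full training dataset is simply many thousands of records like this, which is why the human side of the process dominates the time and cost of building one.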