Why Solving Data Edge Cases is Key to Accelerating AI

February 02, 2023

It’s no secret that AI systems are being used in more and more high-stakes applications – self-driving cars, robotic surgery, and more. As AI advances, it’s becoming critical to ensure that AI systems navigate real-world anomalies successfully.

In this session, Shweta Shrivastava, Senior Director of Product Management for Behavior at Waymo, Vinesh Sukumar, Senior Director of Product Management at Qualcomm, and Dr. Itai Orr, VP of Technology at Autobrains Technologies, discuss what it takes to identify and solve data edge cases to maximize AI performance and achieve widespread adoption.

Defining Edge Cases

An edge case is a situation or condition that sits outside the norm. When defining edge cases, it’s important to differentiate between rare and hard. Rare is not the same as hard: rare refers to data scarcity, while hard refers to ambiguity.

In the autonomous vehicle industry, for example, a rare edge case might be a van leaving a trail of soap bubbles – very rare to encounter, yet minimally disruptive. A hard but common edge case is a construction site that closes off roads and adds diversion signs.

The panellists agree that edge cases will never be fully eliminated. Their frequency may decrease considerably as more data is captured, but the long tail will never be exhausted. ‘Long tail is job security’ is a running joke in the ML industry.

“Long tail will always exist. It can get into newer types of rare cases, but it will never go away.”
– Shweta Shrivastava, Sr. Director of Product Management for Behavior at Waymo

Even outside of the autonomous vehicle industry, most edge cases have been quite generic in nature, in the sense that it’s easy to define them comprehensively and multiple users are likely to encounter the same edge case without variation. However, ML practitioners must also consider edge cases that are specific to individual users, driven by their behavior. This starts blurring the line between edge cases and personalization, independent of the vertical.

Getting the Right Type of Data

To get ML models to behave as intended in real-world scenarios, they must be trained on suitable data. Once the model has a large enough dataset of common scenarios, edge cases need to be approached systematically. Using monitoring and real-world feedback that indicates how the model behaves, ML developers can work backwards from that behavior to understand what additional data the system needs to be trained on.

In the case of the AV industry, collecting more data of routine highway or city driving makes little difference to model performance. To extract only the edge cases from new data-collection driving sessions, companies in the space use filters and analyzers to obtain targeted data for specific situations.

To tackle edge cases, we need to collect interesting data, mine it, and filter it efficiently. This can only scale when as much of the human factor as possible is removed. Once an out-of-distribution case is found, the automated process must perform root cause analysis to understand where the problem lies.
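As a minimal sketch of what automated edge-case mining can look like, the snippet below flags predictions whose top softmax score falls below a threshold as candidate out-of-distribution cases. The function name and the 0.6 threshold are illustrative assumptions, not a specific production system; real pipelines typically combine several signals (uncertainty, disagreement between models, rarity of the scene).

```python
# Hypothetical confidence-based edge-case mining sketch.
# Predictions with a low top-class probability are flagged as potential
# out-of-distribution cases for root cause analysis or human review.

def flag_edge_cases(softmax_outputs, threshold=0.6):
    """Return indices of predictions whose max class probability is low."""
    flagged = []
    for i, probs in enumerate(softmax_outputs):
        if max(probs) < threshold:
            flagged.append(i)
    return flagged

# Example: three predictions; the second is ambiguous (no confident class).
preds = [
    [0.95, 0.03, 0.02],   # confident -> stays in the common pool
    [0.40, 0.35, 0.25],   # ambiguous -> candidate edge case
    [0.10, 0.85, 0.05],   # confident
]
print(flag_edge_cases(preds))  # -> [1]
```

The flagged indices would then feed the targeted filters and analyzers described above, rather than sending every frame to annotation.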

Common points of failure in the dataflow include the neural network, the post-processing phase, sensor input or calibration, and the neural path planner. A mature self-corrective pipeline can identify a solution and push it to the system.
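To make the idea of a self-corrective pipeline concrete, here is a toy sketch that routes a root-caused failure stage to a remediation action. The stage names and actions are purely illustrative assumptions for this article, not any company’s actual architecture.

```python
# Illustrative routing of a root-caused failure to a remediation step.
# Stages mirror the failure points named above; actions are hypothetical.

REMEDIATION = {
    "neural_network": "add failure scenario to training set and retrain",
    "post_processing": "tune post-processing heuristics",
    "sensor_calibration": "recalibrate or re-validate sensors",
    "path_planning": "update planner constraints and re-simulate",
}

def route_failure(stage):
    """Map a root-caused failure stage to its remediation action."""
    return REMEDIATION.get(stage, "escalate to human review")

print(route_failure("sensor_calibration"))  # -> recalibrate or re-validate sensors
print(route_failure("unknown_stage"))       # -> escalate to human review
```

The key design point is the fallback: anything the automated pipeline cannot classify is escalated rather than silently dropped.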

Complementing Real-World Data with Synthetic Data

Revisiting the first principles of MLOps and DataOps, we remember that training, test, and validation data must be kept separate. Building upon this, ML projects must have a robust set of quantitative KPIs for every metric that impacts performance and reliability. Relying on visual inspection alone is not enough.

“Separate your training data and validating data. Otherwise, you are grading your own homework. You have the risk of false confidence that your system can handle new situations.”
– Shweta Shrivastava, Sr. Director of Product Management for Behavior at Waymo
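The separation the quote calls for can be sketched as a simple disjoint split. The 80/10/10 ratios below are illustrative assumptions; the essential property is that the three sets never overlap, so the model is never grading its own homework.

```python
import random

# Minimal sketch of a disjoint train/validation/test split.
# Ratios (80/10/10) and the fixed seed are illustrative assumptions.

def split_dataset(samples, val_frac=0.1, test_frac=0.1, seed=42):
    shuffled = samples[:]                   # copy to avoid mutating the input
    random.Random(seed).shuffle(shuffled)   # deterministic, reproducible shuffle
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(1000)))
# The three sets partition the data with no overlap.
assert set(train).isdisjoint(val) and set(train).isdisjoint(test)
```

In practice, AV datasets also need the split to respect scene boundaries (frames from the same drive stay in the same set), which a naive per-sample shuffle like this does not guarantee.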

In the case of AVs, generalized, flexible models trained on large datasets are more suitable than multiple smaller systems, as the end goal is building a generalized driving system.

Leveraging synthetic data when training data is scarce is a valuable technique. Due diligence in this area requires answering the following questions:

  • Do I have the right data?
  • If not, can I complement it with synthetic data?
  • If I can complement using synthetic data, what would be the variables in the synthetic definition for better success?

Synthetic data can also be used to generate test data that helps evaluate model performance, especially against edge cases for which real-world data is limited. Synthetic data today is visual-heavy; the space can progress further by adding data generation capabilities for other types of inputs, such as lidar and radar. Stable diffusion models can also create photorealistic images, which can be annotated and used to train models on hard-to-find data.

“If you’re able to use the advances in generative AI to automatically generate the scenario for me, that’d be fantastic.”
– Vinesh Sukumar, Senior Director of Product Management at Qualcomm

However, it’s worth noting there’s a challenge in crossing the chasm between simulated and real-world data. In any simulated environment, there is always some physics aspect or other nuance that is not modelled, which means the model is tested against an imperfect version of reality.

To learn more about iMerit’s data annotation services, contact us today to talk to an expert.