Is it necessary for autonomous vehicles to ‘see’ like humans?

June 18, 2020

Before an autonomous vehicle can reason its way out of trouble on the road, it needs to understand what it is seeing. AV makers use some combination of three types of sensor – cameras, laser-based 3-D Light Detection and Ranging (LiDAR), and traditional 2-D radar – that work in tandem to sense the road, identify objects, and distinguish passing scenery from a genuine threat. LiDAR is becoming particularly attractive as a way to help the AV's computer judge distance and speed.
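As a rough illustration of how readings from complementary sensors might be combined, here is a minimal sketch of inverse-variance weighting of two distance estimates. The function name, the variance figures, and the weighting scheme are assumptions for illustration only, not any manufacturer's actual fusion pipeline.

```python
# Hypothetical sketch: fusing distance estimates from LiDAR and radar.
# Sensor variances and the inverse-variance weighting are illustrative
# assumptions, not a real vendor pipeline.

def fuse_range(estimates):
    """Combine (distance_m, variance) pairs using inverse-variance weights."""
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    return sum(w * d for w, (d, _) in zip(weights, estimates)) / total

# LiDAR is typically more precise at range; radar degrades less in rain.
lidar = (42.1, 0.05)   # (metres, variance)
radar = (41.5, 0.60)
print(round(fuse_range([lidar, radar]), 2))   # → 42.05
```

The more confident sensor (lower variance) dominates the fused estimate, which is one reason redundancy across sensor types is valuable when conditions degrade one of them.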

To safely navigate a city or any other streetscape, an autonomous vehicle needs to see the road much as a human being does. It needs to understand – and react to – four things: the objects on the road, and whether they are static or moving; the nature of those objects; their speed and direction; and any potential threat they pose. The algorithms powering autonomous vehicles need to, in milliseconds, absorb and process data from multiple sensor arrays and sensor types, evaluate shifting environmental conditions in real time, and issue instructions to the car's steering, braking, and acceleration systems as needed. All of that "If A then B" decision making relies on the algorithm's ability to infer real-time scenarios from the training data used to help it recognize and interpret the significance of sensor input.
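The four-step perceive-and-react loop above can be sketched in code. Everything here – the object fields, the time-to-contact thresholds, and the action names – is a hypothetical simplification for illustration, not a real AV control stack.

```python
# Minimal sketch of the "If A then B" decision step described above.
# Object fields, thresholds, and actions are hypothetical assumptions.

from dataclasses import dataclass

@dataclass
class TrackedObject:
    kind: str            # e.g. "pedestrian", "vehicle", "debris"
    distance_m: float    # range from the ego vehicle
    closing_mps: float   # positive = moving toward us
    moving: bool

def decide(obj: TrackedObject) -> str:
    """Map one perceived object to a control action."""
    if not obj.moving and obj.distance_m > 50:
        return "ignore"                  # static and far away: passing scenery
    time_to_contact = (obj.distance_m / obj.closing_mps
                       if obj.closing_mps > 0 else float("inf"))
    if time_to_contact < 2.0:
        return "brake"                   # imminent threat
    if time_to_contact < 5.0:
        return "slow"
    return "monitor"

print(decide(TrackedObject("pedestrian", 8.0, 5.0, True)))   # → brake
```

A real system evaluates many tracked objects per frame and arbitrates among conflicting actions, but the core structure – classify, estimate motion, assess threat, act – is the one described above.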

An autonomous vehicle relies on a slew of sensors for its various views, including sensors for adaptive cruise control, traffic sign recognition, cross-traffic alerts, and a host of others. Each of these sensors helps the vehicle navigate the road, monitor traffic and other objects, and react to changes in its immediate environment – everything from a "simple" curve in the road to an exponentially more complex chain-reaction series of collisions. Processing all of that data rests on a complex training program that takes sample data from each type of input, annotates it to break each image down into its component objects, and then correlates those objects across the many scenarios the car could encounter on the way from Point A to Point B.
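To make the annotation step concrete, here is a hypothetical sketch of what one labelled training record might look like: a frame from one sensor, broken down into its component objects as bounding boxes. The field names and categories are illustrative and follow no particular dataset format.

```python
# Hypothetical annotation record for one frame of training data: each
# image is broken down into labelled component objects (bounding boxes).
# Field names are made up for illustration.

from dataclasses import dataclass, field

@dataclass
class BoxLabel:
    category: str                       # "car", "cyclist", "traffic_sign", ...
    x: int                              # pixel bounding box: top-left corner
    y: int
    w: int                              # width and height in pixels
    h: int

@dataclass
class FrameAnnotation:
    frame_id: str
    sensor: str                         # "camera", "lidar", "radar"
    boxes: list = field(default_factory=list)

frame = FrameAnnotation("cam_000123", "camera")
frame.boxes.append(BoxLabel("pedestrian", 310, 140, 40, 95))
frame.boxes.append(BoxLabel("vehicle", 120, 160, 180, 110))
print(len(frame.boxes), frame.boxes[0].category)   # → 2 pedestrian
```

Millions of records like this, correlated across sensors and scenarios, form the raw material the training pipeline consumes.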

Training that system of onboard sensors and AV computers requires thousands upon thousands of hours of video (training data), broken down into its component objects, analyzed, categorized, and then fed into the algorithm to test whether the lesson has been learned. It takes many years to train a child to recognize and react to potentially dangerous objects before a parent will let go of their hand; the task is no less daunting for a machine.
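The train-then-test loop described above can be sketched with a toy classifier: labelled feature vectors train a nearest-centroid model, and held-out examples check whether the "lesson has been learned". The features, labels, and classifier choice are all assumptions for illustration; real perception models are deep networks trained on vastly larger data.

```python
# Toy sketch of the train-then-test loop: categorized examples train a
# nearest-centroid classifier, and held-out examples measure accuracy.
# Features and labels are made up for illustration.

def train(examples):
    """examples: list of (feature_vector, label). Returns per-label centroids."""
    sums, counts = {}, {}
    for vec, label in examples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def predict(centroids, vec):
    """Return the label whose centroid is nearest to vec."""
    def dist(lab):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[lab]))
    return min(centroids, key=dist)

train_set = [([0.9, 0.1], "vehicle"), ([0.8, 0.2], "vehicle"),
             ([0.1, 0.9], "pedestrian"), ([0.2, 0.8], "pedestrian")]
held_out = [([0.85, 0.15], "vehicle"), ([0.15, 0.85], "pedestrian")]

model = train(train_set)
accuracy = sum(predict(model, v) == lab for v, lab in held_out) / len(held_out)
print(accuracy)   # → 1.0
```

The held-out split is the key idea: accuracy on examples the model never saw is what indicates the lesson generalizes rather than being memorized.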