Tong Guo

In robotics, 3D bounding boxes for objects can be annotated without LiDAR by using stereo cameras, or monocular cameras combined with depth estimation techniques. These methods rely on visual data captured by cameras to infer an object's depth and position.
For example, imagine taking two photos of the same object from slightly different angles, like how our eyes see the world. By comparing these images, we can estimate the distance (depth) of the object from the camera, similar to how we perceive depth with two eyes. This approach is called stereo vision. It calculates the disparity between the two images to determine depth, which is then used to annotate 3D positions in the camera’s XYZ coordinate system.
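The disparity-to-depth relation above can be sketched in a few lines. This is a minimal illustration assuming a rectified stereo pair with a known focal length (in pixels) and baseline (in meters); the function name and the example values are hypothetical, not from any specific rig.

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth of a point from a rectified stereo pair: Z = f * B / d.

    disparity_px: horizontal pixel shift of the point between the two images
    focal_px:     focal length in pixels
    baseline_m:   distance between the two camera centers in meters
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a visible point")
    return focal_px * baseline_m / disparity_px

# Example: a point shifted 40 px between views, f = 800 px, baseline = 0.5 m
print(depth_from_disparity(40.0, 800.0, 0.5))  # 10.0 (meters)
```

Note how depth is inversely proportional to disparity: nearby objects shift a lot between the two views, distant objects barely at all, which mirrors how our two eyes perceive depth.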
Alternatively, for monocular cameras (a single lens), depth estimation relies on machine learning models trained to predict the distance based on object size, perspective, and texture in the image. For instance, if you take a photo of a car, the model identifies its position based on known shapes and dimensions, estimating how far it is.
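Learned monocular models combine many cues, but the known-size cue mentioned above reduces to simple pinhole geometry and is worth seeing in isolation: an object of real height H that appears h pixels tall at focal length f (pixels) lies at roughly Z = f * H / h. This sketch shows only that geometric cue; real models learn far richer relationships, and the numbers here are illustrative.

```python
def distance_from_known_height(real_height_m: float, pixel_height: float, focal_px: float) -> float:
    """Estimate distance to an object of known real-world height.

    Under a pinhole camera model, apparent size shrinks linearly with
    distance, so Z = f * H / h.
    """
    if pixel_height <= 0:
        raise ValueError("pixel height must be positive")
    return focal_px * real_height_m / pixel_height

# A car roughly 1.5 m tall spanning 100 px in an image with f = 1000 px:
print(distance_from_known_height(1.5, 100.0, 1000.0))  # 15.0 (meters)
```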
To create 3D bounding boxes, we mark the corners of the object in the image, calculate the dimensions (length, width, height) in meters, and align them with the 3D coordinate system of the camera. This process often requires camera calibration to map pixel coordinates to real-world measurements accurately.
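The calibration step described above, mapping pixel coordinates to real-world measurements, can be sketched as back-projection through the camera intrinsics: given a pixel (u, v), its estimated depth Z, the focal lengths (fx, fy), and the principal point (cx, cy), the camera-frame position is X = (u - cx) * Z / fx and Y = (v - cy) * Z / fy. The intrinsic values in the example are hypothetical.

```python
def pixel_to_camera_xyz(u: float, v: float, z: float,
                        fx: float, fy: float,
                        cx: float, cy: float) -> tuple:
    """Back-project a pixel with known depth into the camera's XYZ frame.

    (fx, fy): focal lengths in pixels; (cx, cy): principal point.
    Returns (X, Y, Z) in meters. Applying this to each annotated corner,
    then taking differences along each axis, yields the box dimensions.
    """
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# A box corner at pixel (840, 300), 10 m deep, fx = fy = 800,
# principal point at the image center (640, 360):
print(pixel_to_camera_xyz(840.0, 300.0, 10.0, 800.0, 800.0, 640.0, 360.0))
# (2.5, -0.75, 10.0)
```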
In summary, stereo vision mimics human binocular depth perception, while monocular depth estimation relies on learned visual cues to annotate 3D boxes. Both methods make it possible to measure object dimensions and positions without LiDAR.