I am implementing a monocular visual odometry algorithm for my research project, and I need to use the scale factor in my processing. Does anyone have an idea how I can calculate it from the video frames?
The classic way of doing this is to use an additional sensor, e.g. an IMU; see the article "Visual-Inertial Monocular SLAM with Map Reuse".
If you cannot add an IMU, you could insert an object of known size into the scene that you are filming. Then you can compute the scale from the 3D point cloud.
See the second part of my previous answer: insert an object of known size into the scene that you are filming with the 2D camera. In the reconstructed 3D point cloud, identify the object and pick, e.g., two of its corner points. You know the real-world distance between these corner points, so you can compute the scale from their distance in the point cloud.
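A minimal sketch of that computation in Python, assuming you have already located the two corner points in the reconstruction (the coordinates and the 0.50 m distance below are purely hypothetical):

```python
import numpy as np

def scale_from_known_object(p1, p2, real_distance_m):
    """Metric scale factor from two reconstructed 3D points whose
    real-world separation is known.

    p1, p2          -- corner points in the scale-ambiguous point cloud
    real_distance_m -- measured distance between them, in meters
    """
    reconstructed_distance = np.linalg.norm(np.asarray(p2) - np.asarray(p1))
    return real_distance_m / reconstructed_distance

# Hypothetical corners of the reference object, known to be 0.50 m apart:
corner_a = [0.12, 0.34, 1.10]
corner_b = [0.12, 0.09, 1.10]
scale = scale_from_known_object(corner_a, corner_b, 0.50)
print(f"scale factor: {scale:.3f} meters per map unit")

# Multiply the point cloud and the estimated trajectory by this scale
# to bring them into metric units.
```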
Do you know the size of that target object? And do you know how closely the ROI/bounding box aligns with the object?
For example, if you have a cupboard with a height of 2 meters and you can rely on your DNN to exactly identify the ROI/bounding box of the cupboard, such that the bounding box also has a height of 2 meters, then you can extract the exact scale factor.
If your DNN is less precise and gives you a bounding box corresponding to a height of 2.20 meters, then the scale factor you derive from the bounding-box height will be off by roughly 10%.
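To make the numbers concrete, here is a small sketch of how a bounding-box height error propagates into the scale factor (all values are hypothetical, and it assumes the box height has already been measured in the reconstruction's unitless coordinates):

```python
def scale_from_bbox(bbox_height_units, real_height_m):
    # Scale factor = known metric height / reconstructed (unitless) height.
    return real_height_m / bbox_height_units

real_height_m = 2.0  # the cupboard's true height in meters

# Perfect detection: the box spans exactly the cupboard,
# say 1.0 unit tall in the reconstruction.
exact = scale_from_bbox(1.0, real_height_m)   # -> 2.0 m per unit

# Imprecise detection: the box is 10% too tall (corresponds to 2.20 m),
# i.e. 1.1 units in the reconstruction.
loose = scale_from_bbox(1.1, real_height_m)   # -> ~1.818 m per unit

print(abs(loose - exact) / exact)  # ~0.091: the 10% box error propagates
                                   # almost one-to-one into the scale
```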
Christoph Bachhuber, thank you very much for the detailed answer. The ROIs fit very closely to the actual height and width of the target objects. I will try the latter approach and check the results.