Since you work with semantics of visual objects, can you use text embedding to create a vector representation of objects instead of voxels?
If you use that concept, you will be able of handling 3D video streaming too in the future. This approach scales better.
Depending on the application, you may have to handle multiple imaging modalities. Fusing then into a 3D image is hard but the concept of meaning stays the same.
Since you work with semantics of visual objects, can you use text embedding to create a vector representation of objects instead of voxels?
If you use that concept, you will be able of handling 3D video streaming too in the future. This approach scales better.
Depending on the application, you may have to handle multiple imaging modalities. Fusing then into a 3D image is hard but the concept of meaning stays the same.