How do we typically choose between convolutional networks and vision-language models for supervised learning tasks on images and videos?
What design considerations do we need to make?
When selecting a model for supervised learning, it's important to consider factors such as the data type and the desired outputs. For image and video tasks focused on straightforward labeling and prediction, CNNs and their many variants are typically effective because they extract spatial features efficiently. However, for tasks requiring richer outputs or multimodal understanding, where both visual and textual information are essential, vision-language models (VLMs) such as CLIP or BLIP are often more suitable. (Note that pure vision backbones like the Swin Transformer are not VLMs themselves, though they are often used as image encoders inside multimodal systems.)
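To make the "CNNs extract spatial features efficiently" point concrete, here is a minimal sketch of the core CNN operation: sliding a small kernel over an image. The loop-based `conv2d` and the hand-picked Sobel-style kernel are illustrative assumptions, not part of any particular framework; real CNNs learn their kernels and use optimized library implementations.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: the basic building block of a CNN layer."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Weighted sum over a local window: a purely spatial, local operation.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image with a vertical edge: left columns dark (0), right columns bright (1).
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# Sobel-like vertical edge detector (a hand-crafted stand-in for a learned filter).
kernel = np.array([[-1., 0., 1.],
                   [-2., 0., 2.],
                   [-1., 0., 1.]])

response = conv2d(image, kernel)
print(response)  # strong responses exactly at the edge location
```

The same local filter is reused at every position, which is why convolutional layers need relatively few parameters and map well onto simple labeling tasks.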
In some cases, particularly for experimentation or network surgery, we may design complex models with many parameters even for simple tasks. In practice, however, existing well-optimized models often outperform newly created, complex models in these scenarios.
For handling image and video data, I personally believe CNNs are the best choice when the labels are straightforward and the data is simpler. On the other hand, when working with paired text and images/videos ("bag of words + images/videos"), or when more detailed predictions and textual descriptions are needed, VLMs provide a more effective approach.
CNNs can typically be trained and deployed more efficiently than VLMs, especially for large-scale datasets. This is due to their simpler architecture and the way convolutions map onto parallel processing hardware: convolution cost grows linearly with input size, whereas the self-attention layers in transformer-based VLMs grow quadratically with the number of tokens.
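A rough back-of-the-envelope comparison illustrates the scaling difference. The cost formulas below are simplified counts of multiply-accumulate operations (ignoring bias terms, heads, and other details), assumed purely for illustration:

```python
def conv_cost(n_pixels, channels, k=3):
    """Approximate MACs for one conv layer: linear in the number of pixels."""
    return n_pixels * channels * channels * k * k

def attn_map_cost(n_tokens, d):
    """Approximate MACs for the attention map alone: quadratic in token count."""
    return n_tokens * n_tokens * d

# Quadruple the input size and compare how each cost grows.
conv_growth = conv_cost(4 * 196, 64) / conv_cost(196, 64)          # 4.0
attn_growth = attn_map_cost(4 * 196, 768) / attn_map_cost(196, 768)  # 16.0
print(conv_growth, attn_growth)
```

Quadrupling the input makes the convolution 4x more expensive but the attention map 16x more expensive, which is one concrete reason CNNs remain attractive for large-scale, high-resolution supervised pipelines.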