GPT-2 is a generative language model that can be adapted for classification, and CLIP is a neural network trained on a large set of image-text pairs, which is what lets it do zero-shot image classification. T5 is a text-to-text language model used for a range of natural language processing tasks, and BERT is an encoder language model that is usually fine-tuned for classification rather than used zero-shot.
Lastly, there's ViT, which is mainly used for image classification, though ViT-style models have also been applied to audio by treating spectrograms as images. So, if you want to use these models for speech classification, you would typically fine-tune them on some labeled data, although the result is then no longer strictly zero-shot for the classes you trained on.
Ryan Torres: Not sure how speech can be fed directly to NLP models like T5 or BERT unless we use speech-to-text (STT). If we use STT, transcription errors may add noise to the input.
Zero-shot speech classification is the task of classifying audio data into categories without requiring labeled training data for each category. Deep learning models have shown promising results on this task. Here are some deep learning approaches that are relevant to zero-shot speech classification:
Cross-modal deep clustering (XDC): XDC is a self-supervised method that uses cluster assignments from one modality as training targets for another. It was introduced for audio and video rather than speech and text, so applying it here means adapting the idea: project speech and text inputs into a common embedding space, then group each utterance with the category whose text representation lies closest to it.
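To make the "common embedding space" idea concrete, here is a minimal sketch of the final classification step only. It assumes a speech encoder and a text encoder already map into the same space; the 3-dimensional vectors, the label names, and the function names below are made up for illustration and do not come from XDC or any specific library.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(audio_emb, label_embs, labels):
    """Return the label whose text embedding is closest to the audio embedding."""
    sims = l2_normalize(label_embs) @ l2_normalize(audio_emb)
    return labels[int(np.argmax(sims))], sims

# Toy embeddings standing in for real encoder outputs (hypothetical values).
labels = ["greeting", "question", "command"]
label_embs = np.array([[1.0, 0.1, 0.0],
                       [0.0, 1.0, 0.2],
                       [0.1, 0.0, 1.0]])
audio_emb = np.array([0.2, 0.9, 0.1])  # hypothetical embedding of one utterance

predicted, sims = zero_shot_classify(audio_emb, label_embs, labels)
print(predicted, sims)
```

Because the labels are only ever represented by their text embeddings, a new category can be added by embedding its name or description, with no audio examples for it.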
Zero-shot classification via generative models: This approach pairs a deep generative model, such as a variational autoencoder (VAE) or a generative adversarial network (GAN), with a classifier. The generative model is trained on labeled examples from seen classes, typically conditioned on a class description such as attributes or a text embedding. For a new class it generates synthetic examples from the class description alone, and a classifier trained on those synthetic examples then handles the new classes.
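The generate-then-classify flow can be sketched as follows. This is not a VAE or GAN: a linear least-squares map plus Gaussian noise stands in for the generative model so the example stays short and runnable. Conditioning the generator on a class description vector is an assumption, and the class names, dimensions, and values are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical class descriptions (e.g. attribute or text embeddings), 2-D.
seen_attrs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
unseen_attrs = {"whisper": np.array([2.0, 0.0]), "shout": np.array([0.0, 2.0])}

# Toy ground truth, used only to simulate mean audio features (4-D) per seen class.
true_map = rng.normal(size=(2, 4))
seen_means = seen_attrs @ true_map

# Stand-in "generator": least-squares map from class description to mean feature.
W, *_ = np.linalg.lstsq(seen_attrs, seen_means, rcond=None)

def synthesize(attr, n=50, noise=0.1):
    """Sample n synthetic feature vectors for a class from its description."""
    return attr @ W + noise * rng.normal(size=(n, W.shape[1]))

# Fit a nearest-centroid classifier on synthetic features of the unseen classes.
centroids = {name: synthesize(a).mean(axis=0) for name, a in unseen_attrs.items()}

def classify(feature):
    """Assign a feature vector to the unseen class with the nearest centroid."""
    return min(centroids, key=lambda name: np.linalg.norm(feature - centroids[name]))

# A simulated real utterance from the unseen class "shout".
test_feature = unseen_attrs["shout"] @ true_map + 0.1 * rng.normal(size=4)
print(classify(test_feature))
```

In a real system the linear map would be replaced by a trained conditional VAE or GAN and the nearest-centroid rule by a stronger classifier; the data flow stays the same.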
Meta-learning for few-shot and zero-shot classification: Meta-learning, also known as learning to learn, trains a model across many small tasks so that it learns how to learn from few examples. It has been shown to be effective for zero-shot classification by producing a model that can adapt quickly to new classes without additional training data.
Hybrid models: Hybrid models combine deep learning with other techniques such as probabilistic modeling, knowledge graphs, and expert knowledge to perform zero-shot classification. These models often require additional resources and expertise to develop but have been shown to be effective for specific zero-shot classification tasks.
The field of zero-shot speech classification is evolving quickly, and new models appear regularly. It's worth keeping up with recent research and evaluating candidate models on your specific task.