Given a huge dataset of audio signals, say 10 s or 60 s long (containing songs, recitations, political speeches, etc.), are there traditional ML or deep learning techniques to cluster those audio files automatically?
Deep-learning-based audio classification typically involves converting the files into spectrograms, feeding them into a CNN followed by a linear classifier, and producing predictions about the class to which each sound belongs.
Here you can find some articles about deep-learning audio classification, with code:
Sound Classification using Deep Learning, by Mike Smales: https://mikesmales.medium.com/sound-classification-using-deep-learning-8bc2aa1990b7
Audio Deep Learning Made Simple: Sound Classification, Step-by-Step, by Ketan Doshi: https://towardsdatascience.com/audio-deep-learning-made-simple-sound-classification-step-by-step-cebc936bbe5
Audio Data Analysis Using Deep Learning with Python, by Nagesh Singh Chauhan: https://www.kdnuggets.com/2020/02/audio-data-analysis-deep-learning-python-part-1.html
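To make the first step of that pipeline concrete, here is a minimal sketch of turning a waveform into a log-magnitude spectrogram using only NumPy. It is not taken from any of the articles above: the function name `log_spectrogram`, the frame length, the hop size and the synthetic 440 Hz test tone are all assumptions chosen for illustration, and the CNN itself is left out. In practice you would decode real audio files and probably use a mel-scaled spectrogram from an audio library.

```python
import numpy as np

def log_spectrogram(signal, frame_len=1024, hop=512, eps=1e-10):
    """Return a log-magnitude spectrogram of shape (n_bins, n_frames).

    frame_len and hop are illustrative defaults, not recommended settings.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        # Zero-pad very short signals so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - signal.size))
    n_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping frames and apply a Hann window.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Magnitude of the one-sided FFT of every frame.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Convert to decibels; eps avoids log(0). Transpose to (bins, frames).
    return 20.0 * np.log10(magnitude + eps).T

# Synthetic stand-in for a decoded audio file: 1 s of a 440 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

spec = log_spectrogram(tone)
# Frequency bin with the highest average energy, converted to Hz.
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 1024
print(spec.shape, peak_hz)
```

The resulting 2-D array can be treated like a single-channel image, which is why CNNs are a natural fit for the next step.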
I have to second Tim Ziemer. I am currently using a slightly modified version of the COMSAR/Apollon code to cluster popular music, and the results are very satisfying.
Thus I would also recommend Kohonen self-organizing maps (SOMs) trained on a plausible set of MIR features.
Note, though, that this technique only reveals underlying structure, which still requires interpretation; it does not provide a classification.
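For readers unfamiliar with the idea, below is a minimal sketch of a Kohonen map in plain NumPy. This is not the COMSAR/Apollon code; it is a generic textbook-style SOM, and the grid size, learning-rate and neighbourhood schedules, and the two synthetic feature groups (standing in for standardized MIR feature vectors) are all assumptions made for illustration.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Return the (row, col) of the map node whose weight vector is closest to x."""
    dist2 = ((weights - x) ** 2).sum(axis=-1)
    row, col = np.unravel_index(dist2.argmin(), dist2.shape)
    return int(row), int(col)

def train_som(features, grid=(6, 6), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small SOM; all hyperparameters here are illustrative guesses."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.normal(size=(rows, cols, features.shape[1]))
    # Grid coordinates of every node, used for neighbourhood distances.
    coords = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for step in range(n_iter):
        frac = step / n_iter
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.5      # neighbourhood radius shrinks
        x = features[rng.integers(len(features))]
        bmu = np.array(best_matching_unit(weights, x))
        # Gaussian neighbourhood centred on the best-matching unit.
        grid_dist2 = ((coords - bmu) ** 2).sum(axis=-1)
        influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        # Pull each node towards x, weighted by its neighbourhood influence.
        weights += lr * influence[..., None] * (x - weights)
    return weights

# Hypothetical data: two well-separated groups of 4-dimensional feature vectors.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=-3.0, scale=0.3, size=(50, 4))
group_b = rng.normal(loc=3.0, scale=0.3, size=(50, 4))
features = np.vstack([group_a, group_b])

som = train_som(features)
nodes_a = {best_matching_unit(som, x) for x in group_a}
nodes_b = {best_matching_unit(som, x) for x in group_b}
print(sorted(nodes_a), sorted(nodes_b))
```

In this toy setup the two groups should land on different regions of the map, but the map itself carries no labels: deciding what each region means (speech, song, recitation) is the interpretation step mentioned above.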