I've tried working with traditional methods using the parameters ZC (zero crossings), STE (short-time energy), and AC (autocorrelation), but I'm getting unexpected results, and I think it may be because the thresholds are not set optimally.
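For context, the traditional approach the question refers to can be sketched with plain NumPy: frame the signal, compute short-time energy (STE) and zero-crossing rate (ZC) per frame, and compare against fixed thresholds. The frame sizes and threshold values below are illustrative assumptions, not recommended settings; in practice they must be tuned to the recording conditions, which is exactly the difficulty raised above.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def vad_ste_zcr(x, frame_len=400, hop=160, energy_thresh=0.01, zcr_thresh=0.25):
    """Threshold-based VAD sketch: flag a frame as speech when its
    short-time energy is high AND its zero-crossing rate is low
    (voiced speech tends to be energetic with few zero crossings).
    Both thresholds are illustrative and must be tuned per recording."""
    frames = frame_signal(x, frame_len, hop)
    ste = np.mean(frames ** 2, axis=1)  # short-time energy per frame
    # fraction of sample pairs whose sign changes = zero-crossing rate
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return (ste > energy_thresh) & (zcr < zcr_thresh)

# Synthetic example: 1 s of low-level noise followed by 1 s of a 200 Hz tone
np.random.seed(0)
fs = 16000
t = np.arange(fs) / fs
noise = 0.001 * np.random.randn(fs)      # "silence" segment
tone = 0.5 * np.sin(2 * np.pi * 200 * t)  # stand-in for a voiced segment
decisions = vad_ste_zcr(np.concatenate([noise, tone]))
```

On this clean synthetic signal the rule works; the point of the answer below is that with real stationary and non-stationary noise, no fixed pair of thresholds behaves well everywhere.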
Traditional methods for speech activity detection work reasonably well only under unrealistic conditions, where little noise affects the wanted signal. In more realistic cases, speech is often mixed with both stationary and non-stationary noise, making traditional solutions inadequate. For this reason, there is no single optimal solution that works well in all possible situations. Personally, I think that modern deep neural networks fed with Mel-Frequency Cepstral Coefficients (MFCCs) and pitch features could work quite well in general. I suggest you increase as much as possible both the context window (i.e., the number of frames fed to the neural network) and the amount of material used to train the network. In [1] I used a deep-neural-network-based solution to classify sounds, with interesting and promising results.
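The context-window idea in the answer above can be sketched as follows: given per-frame features (e.g. MFCCs plus a pitch value per frame), stack each frame with its neighbours so the network sees several frames at once. The feature dimensions and window sizes here are illustrative assumptions; the feature extraction itself (MFCC/pitch computation) is not shown.

```python
import numpy as np

def add_context(features, left=5, right=5):
    """Stack each frame with `left` previous and `right` following frames,
    so the network input covers a context window of left + 1 + right frames.
    `features` is (n_frames, n_feats); edge frames are handled by repeating
    the first/last frame. Returns (n_frames, (left + 1 + right) * n_feats)."""
    n_frames, n_feats = features.shape
    padded = np.concatenate([
        np.repeat(features[:1], left, axis=0),   # pad the start
        features,
        np.repeat(features[-1:], right, axis=0),  # pad the end
    ])
    # one shifted view of the sequence per position in the window
    windows = [padded[i : i + n_frames] for i in range(left + 1 + right)]
    return np.concatenate(windows, axis=1)

# Example: 100 frames of 13 MFCCs + 1 pitch value = 14 features per frame
feats = np.random.randn(100, 14)
ctx = add_context(feats, left=5, right=5)
print(ctx.shape)  # (100, 154): an 11-frame window of 14 features each
```

Widening `left` and `right` enlarges the context window, which is one of the two knobs the answer suggests turning (the other being the amount of training data).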
Mirco Ravanelli
https://sites.google.com/site/mircoravanelli/
[1] M. Ravanelli, B. Elizalde, K. Ni, G. Friedland, "Audio Concept Classification with Hierarchical Deep Neural Networks", in Proceedings of the European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, 2014.