Your question is quite general; the answer really depends on the problem, your data set, and how the data are represented. You should explain more so that others can help you in the right way. Good luck.
The cross validation comes after the classifier is selected. I meant to ask about some initial tests that could be performed on the data itself to guide me in selecting a particular type of classifier.
Dear Negar,
I wanted to know, very generally, whether there is a certain set of rules that would help me decide on a classifier for any given dataset.
As far as I know, there is no well-defined rule for such a task. In general, it depends on the kind of data and on the number of samples and features. For instance, I would recommend naive Bayes or a linear SVM for text classification/categorization. For datasets with numerical attributes, I would suggest a linear SVM, neural networks, or logistic regression if the number of features is much greater than the number of samples. On the other hand, I would recommend neural networks or an SVM with an RBF or polynomial kernel if the number of samples is not too large but is greater than the number of features. Otherwise, if the number of samples is huge, I would suggest neural networks or a linear SVM, and so on. Obviously, there are other options for each scenario besides those I have mentioned.
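To make these rules of thumb concrete, here is a minimal sketch assuming Python with scikit-learn; the cut-offs and the particular estimators below are only illustrative assumptions on my part, not fixed recommendations:

```python
# Purely illustrative: pick a candidate estimator from dataset shape alone.
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression

def suggest_classifier(n_samples, n_features, is_text=False):
    """Return one candidate estimator based only on the dataset's shape."""
    if is_text:
        return MultinomialNB()                    # naive Bayes (or LinearSVC) for text
    if n_features > n_samples:
        return LogisticRegression(max_iter=1000)  # many features, few samples: stay linear
    if n_samples < 10_000:
        return SVC(kernel="rbf")                  # moderate n: a non-linear kernel is affordable
    return LinearSVC()                            # huge n: linear SVM scales better

print(suggest_classifier(n_samples=500, n_features=20))   # chooses the RBF-kernel SVC
```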
I have read about the evaluation techniques that you have mentioned. I would like to know more about meta-learning, or other methods in which the data itself picks the classifier with the least effort on the user's side. It would be kind if you could suggest some papers on this.
Dear Tiago,
Thanks for such an explanatory answer with examples. It would be appreciated if you could suggest some papers that explain the selection of a classifier based on the data set (some sort of review paper).
I would say the choice should at least depend on the number of samples and features that you have. If you only have a few samples, don't use a complex classifier that will overfit your data.
You could also try looking at low-dimensional representations of your data to see how the data is distributed, whether there are any clusters, outliers, etc., and decide accordingly. This won't give you the "best" classifier, but at least you could motivate your choices for using certain techniques.
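As an aside, a quick way to see the overfitting risk on a small sample is to compare training accuracy with cross-validated accuracy. A toy sketch, assuming scikit-learn and entirely synthetic placeholder data:

```python
# A toy check of how a flexible model overfits a small sample (synthetic data).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                               # only 30 samples, 10 features
y = (X[:, 0] + 0.5 * rng.normal(size=30) > 0).astype(int)   # noisy labels

for clf in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    train_acc = clf.fit(X, y).score(X, y)
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()
    print(type(clf).__name__, f"train={train_acc:.2f}", f"cv={cv_acc:.2f}")
# A large gap between train and cv accuracy is the overfitting signal.
```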
Thanks for pointing out that the dimensionality of the data often restricts the visualization process. Is it good practice to visualize higher-dimensional data by splitting it into lower-dimensional views?
I have attached a sample plot for a two-class problem depicting the distribution of four different features. What inferences can be drawn about the choice of classifier from the scatter plots?
I did not mean univariate distributions (a scatterplot per feature), but multivariate distributions. In your case, you could make several 2D or 3D plots (where each axis is a feature), or try to reduce the dimensionality of all your features to 2D or 3D by using PCA, multidimensional scaling or other methods.
You could perhaps be interested in this toolbox: http://www.37steps.com/prtools/ ; it lets you do feature selection/extraction, train classifiers, and visualize the classifier on top of your data (only in 2D).
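If you work in Python rather than MATLAB, a minimal PCA-based version of this idea might look as follows (scikit-learn and matplotlib assumed; the data below are only a synthetic placeholder for your own feature matrix and labels):

```python
# Project a feature matrix to 2D with PCA and colour points by class label,
# to eyeball class separation before committing to a classifier.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data; replace with your real feature matrix X and label vector y.
X, y = make_classification(n_samples=100, n_features=20, n_informative=4,
                           n_classes=2, random_state=0)

X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
plt.scatter(X2[:, 0], X2[:, 1], c=y, cmap="coolwarm", edgecolor="k")
plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.title("2D PCA projection")
plt.show()
# Roughly linearly separable clusters suggest a linear classifier may suffice;
# heavily overlapping or curved structure points towards non-linear models.
```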
There are two possible strategies here. a) Use what you know about the underlying process that produced your data. If your data have a dynamical component, i.e. the characteristics of the data also depend on time (e.g. in speech I can say something "veeery" slowly or "very" fast; the characteristics are the same, but the classifier must cope with this dynamic time warping), then you should use a dynamic classifier such as a Hidden Markov Model. If this is not the case, then you can try static classifiers such as neural networks or SVMs.
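A rough sketch of the contrast, assuming the hmmlearn and scikit-learn packages and purely synthetic data (all names and numbers below are placeholders):

```python
# Dynamic classifier: one HMM per class, the class whose HMM gives the highest
# log-likelihood wins. Static classifier: an SVM on fixed-length feature vectors.
import numpy as np
from hmmlearn import hmm
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Dynamic data: variable-length sequences of 12-dimensional feature frames.
seqs = {0: [rng.normal(0.0, 1.0, size=(rng.integers(20, 40), 12)) for _ in range(10)],
        1: [rng.normal(0.7, 1.0, size=(rng.integers(20, 40), 12)) for _ in range(10)]}

models = {}
for label, s in seqs.items():
    model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
    model.fit(np.vstack(s), lengths=[len(x) for x in s])   # concatenated frames
    models[label] = model

test_seq = rng.normal(0.7, 1.0, size=(30, 12))             # an unseen sequence
print("HMM prediction:", max(models, key=lambda k: models[k].score(test_seq)))

# Static data: one fixed-length feature vector per sample, an SVM is enough.
X = rng.normal(size=(60, 12)); y = rng.integers(0, 2, size=60)
print("SVM prediction:", SVC(kernel="rbf").fit(X, y).predict(X[:1]))
```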
But as far as I know, there is no statistical test around to decide this question. One method that is often put forward in this discussion is to compare different classifiers on the same features and then use the one with the best performance. But this requires knowledge of the optimal parameter settings for all investigated classifiers, which one would normally only determine after a suitable classifier has been selected.
For static classification tasks you can also use the tool WEKA. It is a data-mining tool, but it also includes facilities for data pre-processing, classification, regression, clustering, association rules, and visualization (http://www.cs.waikato.ac.nz/ml/weka/).
The issue of model selection for classification problems is well studied in several fields and you should be able to find good references on this topic.
The most frequently followed path is the following:
1) Define your metrics of success. In effect, you mention "...the best classifier..." but you have not told us what "best" means for you. Usually this is related to some measure of classification accuracy, but that is not necessarily always the case. You could be interested in model interpretability, computation speed, etc. Even if predictive accuracy is your goal, there are several possible metrics, some more adequate than others depending on the goals of your task. So, basically, decide on the evaluation metrics first.
2) Decide on an experimental methodology for obtaining reliable estimates of the selected metrics. The "best" methodology may depend on the size of your data sample. Generally accepted heuristics are: i) use many repetitions of bootstrap if your data set is small; ii) use k-fold cross validation for average-sized data sets; iii) use holdout for very large data sets. Obviously, there is subjectivity regarding the size of your data set, which will be strongly related to the computational power you have at hand.
3) Decide on a reasonable set of models and model variants.
4) Run the experiment and analyse the results using some statistical significance test (a code sketch illustrating these steps follows after this reply).
To help you with step 3, you can either resort to your experience or eventually use some sort of meta-learning approach that may provide you with a good candidate set.
Most data mining / data analysis tools have ways of automating several of these steps and making your life easier.
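Here is a minimal sketch of steps 2) and 3), with accuracy as the step 1) metric, assuming scikit-learn; the data set and the candidate list are only placeholders of my own choosing:

```python
# Estimate the chosen metric (accuracy) with 10-fold cross validation for each
# candidate model; the per-fold scores are what the step 4) significance test uses.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)        # step 2
candidates = {                                                         # step 3
    "linear SVM": SVC(kernel="linear"),
    "RBF SVM": SVC(kernel="rbf"),
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
scores = {name: cross_val_score(clf, X, y, cv=cv, scoring="accuracy")  # step 1 metric
          for name, clf in candidates.items()}
for name, s in scores.items():
    print(f"{name:20s} {s.mean():.3f} +/- {s.std():.3f}")
# Step 4 would compare these per-fold scores with a paired significance test.
```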
Now I have a better picture of the problem at hand, and the solution also seems clearer. The metric here is classification accuracy (as the problem concerns the diagnosis of a pathological condition). The data size is fairly moderate, consisting of a feature vector of 48 elements for each subject, and the complete study covers 40 such subjects. SVM and random forest provide impressive accuracy values (>90%). The only thing I was looking for was a method to further validate these results via some other channel, which would add more weight to my work.
It will be helpful if you provide some insight regarding the statistical significance tests that should be used in this case.
I would strongly recommend that you read the following paper, which is probably the most highly regarded current work on this topic in the area of machine learning / data mining:
J. Demsar (2006): Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7, pp. 1--30.
You may find a copy of the article online via a quick Google search.
For your particular case, with such small numbers and a single data set, I would probably advise you to use ~200 repetitions of bootstrap and then the Wilcoxon signed-rank test to perform the statistical significance tests. Still, read the above paper; it is a good source of guidance and knowledge on this subject.
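For illustration only, a rough sketch of this recipe, assuming scikit-learn and SciPy; the synthetic data merely mimic the 40-subject by 48-feature shape mentioned above, and the two models are stand-ins for your own:

```python
# ~200 bootstrap repetitions of two classifiers, scored on out-of-bag samples,
# followed by a paired Wilcoxon signed-rank test on the per-repetition accuracies.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=40, n_features=48, n_informative=6, random_state=0)
rng = np.random.default_rng(0)
models = {"SVM": SVC(kernel="rbf"), "RF": RandomForestClassifier(random_state=0)}
acc = {name: [] for name in models}

for _ in range(200):                               # ~200 bootstrap repetitions
    idx = rng.integers(0, len(y), size=len(y))     # sample with replacement
    oob = np.setdiff1d(np.arange(len(y)), idx)     # out-of-bag samples for testing
    if oob.size == 0 or len(np.unique(y[idx])) < 2:
        continue                                   # skip degenerate resamples
    for name, clf in models.items():
        clf.fit(X[idx], y[idx])
        acc[name].append(clf.score(X[oob], y[oob]))

for name, a in acc.items():
    print(f"{name}: {np.mean(a):.3f} +/- {np.std(a):.3f}")
print(wilcoxon(acc["SVM"], acc["RF"], zero_method="zsplit"))
```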
In-depth studies have shown that some of the classical statistical evaluation methods for assessing the authenticity and robustness of a model fail to capture its intrinsic quality. For classification models, measures such as AUC, ROC, F-measure, geometric mean, SSE, RSS, sensitivity (Sen) and specificity (Sp) can be of use.
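A small sketch computing several of these measures for a held-out test set, assuming scikit-learn; the labels and classifier scores below are toy placeholders:

```python
# Compute AUC, F-measure, sensitivity, specificity and geometric mean.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])                     # placeholder labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.45, 0.2])  # classifier scores
y_pred = (y_score >= 0.5).astype(int)                           # hard decisions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                       # Sen: recall on the positive class
specificity = tn / (tn + fp)                       # Sp:  recall on the negative class
g_mean = np.sqrt(sensitivity * specificity)        # geometric mean of Sen and Sp

print("AUC        ", roc_auc_score(y_true, y_score))
print("F-measure  ", f1_score(y_true, y_pred))
print("Sensitivity", sensitivity)
print("Specificity", specificity)
print("G-mean     ", g_mean)
```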