We used standard techniques, but the results are not good. We also tried SVM and GA, with little improvement. There are some papers suggesting similar methods that we have not tried yet. What would be your suggestions?
If you have a small data set I would definitely try k-fold cross-validation or similar. If you train on the data only once you are likely to miss part of it, since the set is small, and your reported accuracy will not reflect the actual accuracy. You will also not want a large feature vector with small data sets, so choose the features that provide the most discriminative information.
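A minimal sketch of what that could look like with scikit-learn (the classifier and the tiny feature matrix here are placeholders; with very few samples, leave-one-out is simply k-fold with k = N):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score

# X: (n_samples, n_features) feature matrix, y: class labels -- placeholders here
X = np.random.rand(15, 4)
y = np.array([0, 1] * 7 + [0])

clf = LogisticRegression(max_iter=1000)

# Stratified k-fold keeps the class ratio in every fold
scores_kfold = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5))

# With very small N, leave-one-out tests on every sample exactly once
scores_loo = cross_val_score(clf, X, y, cv=LeaveOneOut())

print("5-fold accuracy: %.2f +/- %.2f" % (scores_kfold.mean(), scores_kfold.std()))
print("leave-one-out accuracy: %.2f" % scores_loo.mean())
```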
Learning on such small datasets is usually rather hard, and it will in any case be difficult to distinguish between genuinely good learning results and overfitting.
You may nevertheless try if you have a strong prior model (maybe you only need to fit one or two parameters a little better), or if you are happy with only rejecting a few very unlikely options.
In general I would guess that something simple like naive Bayes could be tried (remember to use your prior or a Laplace estimate). If you specify your application it may be possible to give more appropriate hints.
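For illustration, a hedged naive Bayes sketch in scikit-learn (GaussianNB with an explicit prior for continuous features; MultinomialNB with alpha=1.0 is the Laplace, add-one, estimate for count features; all data below is a placeholder):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import cross_val_score, LeaveOneOut

X = np.random.rand(12, 3)          # placeholder continuous features
y = np.array([0, 1] * 6)           # placeholder labels

# Continuous features: Gaussian NB with the class prior supplied explicitly
gnb = GaussianNB(priors=[0.5, 0.5])
print(cross_val_score(gnb, X, y, cv=LeaveOneOut()).mean())

# Count features: Multinomial NB, alpha=1.0 is Laplace (add-one) smoothing
counts = np.random.randint(0, 5, size=(12, 3))
mnb = MultinomialNB(alpha=1.0)
print(cross_val_score(mnb, counts, y, cv=LeaveOneOut()).mean())
```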
With extremely small data sets I’ve got good results using ensemble methods such as stacked generalization, with an SVM as the meta-learner and very dissimilar base models stacked underneath. See also: Polikar R., “Ensemble based systems in decision making,” IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21-45, 2006.
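A rough sketch of that kind of stacking with scikit-learn's StackingClassifier; the particular base models here are only an example of "very dissimilar" learners, not the setup from the reference:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=30, n_features=5, random_state=0)  # placeholder data

# Dissimilar base models; their out-of-fold predictions feed the SVM meta-learner
stack = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("tree", DecisionTreeClassifier(max_depth=2)),
    ],
    final_estimator=SVC(kernel="linear"),
    cv=3,  # internal CV used to build the meta-features
)

print(cross_val_score(stack, X, y, cv=3).mean())
```

The point of choosing dissimilar base learners is that their errors are less correlated, which is what the meta-learner exploits.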
In my opinion, you should perform cross-validation to maximize your learning set size (N). In general, the bigger N is, the better the generalization you get.
Bilal, I have 11 features whose patterns are recognized jointly, e.g. by stacked generalization, i.e. using multiple base models for ALL features and then a single meta-model for evaluating the value of the base models' predictions.
One field where people are somewhat forced to try to do something with a very small number of samples (and a very large number of features) is gene expression profiling using technologies like microarrays. I have some papers on the subject.
What we proposed (and saw working on real-world data) is to use a linear SVM (you don't have enough data for kernel-based methods) together with cross-validation, as suggested by Cherif (a minimal sketch is shown after the reference below). See:
I am also, at this moment, trying to develop a modified version of the SVM that forces it to use all the data (and not only the initial support vectors), which should work better in the case you describe. Unfortunately it is not written up (or even properly coded) yet.
Article A feature selection approach for identification of signature...
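As a minimal sketch of that combination (linear SVM plus cross-validation) in scikit-learn; the data and the regularization constant are placeholders, not values from the paper:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, LeaveOneOut

X = np.random.rand(18, 11)          # placeholder: 18 samples x 11 features
y = np.array([0, 1] * 9)

# Feature scaling matters for SVMs; a small C regularizes more, which helps with few samples
clf = make_pipeline(StandardScaler(), LinearSVC(C=0.1, max_iter=10000))

scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy: %.2f" % scores.mean())
```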
It is usually tricky to do a good job with machine learning when the training data set is small. However, I have personal experience with something that worked well, with a small training set, for the industrial project published in the following link. A regression was performed to interpret cast iron pipe thicknesses using only 60 training vectors (not less than 20, but 60 is still small) associated with 20 targets, with three vectors per target. It worked pretty well both with neural networks and with Gaussian Process regression. We stuck with the Gaussian Process, which follows a Bayesian framework, to avoid the black-box scenario of neural networks. The bottom line is that Gaussian Processes can be powerful with small training sets, and they might be worth trying if you haven't done so already. They can be used for both regression and classification.
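As a hedged illustration (not the setup from that project), Gaussian Process regression with scikit-learn on a tiny training set looks roughly like this; the kernel choice and the data are placeholders:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# Placeholder: 20 training vectors, 3 features each, one continuous target
rng = np.random.RandomState(0)
X_train = rng.rand(20, 3)
y_train = X_train.sum(axis=1) + 0.05 * rng.randn(20)

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(X_train, y_train)

# The predictive std is the Bayesian by-product that softens the black-box feel
X_new = rng.rand(5, 3)
mean, std = gpr.predict(X_new, return_std=True)
print(np.c_[mean, std])
```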
However, if nothing works well, it might be possible to train a model with the available small training set and test it with a few data points with known ground truth. This will give an indication of how well the model performs. If the model works well, the small testing set (with known ground truth) can thereafter be added to the training set to make it larger for subsequent trials. Thus, by testing with small testing sets with known ground truth, it may be possible to build a larger training set by folding in the testing data on which the model works well. That may be naive, but it is probably an effective way of growing the training set incrementally; a rough sketch is given below.
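A hedged sketch of that incremental procedure in Python; the 1-NN classifier and the acceptance threshold are assumptions for illustration, not part of the original suggestion:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def grow_training_set(X_train, y_train, X_test, y_test, min_accuracy=0.8):
    """Train, evaluate on a small labelled test batch, and fold the batch
    into the training set only if the model already handles it well."""
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    if clf.score(X_test, y_test) >= min_accuracy:
        X_train = np.vstack([X_train, X_test])
        y_train = np.concatenate([y_train, y_test])
    return X_train, y_train
```

Called repeatedly as new labelled batches arrive, this grows the training set only with data the current model is already consistent with.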
As a general thought: if your data is too sparse to capture the existing class borders and the distribution of the classes within feature space, there is no way to train a classifier.
It is only a vague suggestion, but there is an over-sampling algorithm called SMOTE that creates synthetic copies of existing data points and positions them close to the original point (not at the same location, but interpolated towards its nearest same-class neighbours) while keeping the class of the original.
The idea behind the heuristic is: data points that are close in feature space often have the same label/class.
This is normally used to balance skewed class distributions and thereby increase classifier performance. Depending on your data and use case, you may achieve better predictions with SMOTE-augmented training data; a small sketch follows the link below.
Here is a link: http://www.jair.org/media/953/live-953-2037-jair.pdf
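A small sketch using the imbalanced-learn implementation of the algorithm from that paper (the `imbalanced-learn` package is an assumed extra dependency; k_neighbors must stay below the minority class size):

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

# Placeholder imbalanced data: 15 majority vs 5 minority samples
rng = np.random.RandomState(0)
X = rng.rand(20, 4)
y = np.array([0] * 15 + [1] * 5)

# Synthetic minority samples are interpolated between a point and its neighbours
sm = SMOTE(k_neighbors=3, random_state=0)
X_res, y_res = sm.fit_resample(X, y)

print(Counter(y), "->", Counter(y_res))   # e.g. {0: 15, 1: 5} -> {0: 15, 1: 15}
```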
The best bet is to simply use the nearest neighbour or the k-NN algorithm. When the data size is very small, I am quite sure that NN or k-NN will perform better than training-data-intensive classifiers such as artificial neural networks or SVMs. That is the greatness of the nearest neighbour classifier. You can actually start with a single training sample per class.
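A minimal 1-NN sketch in scikit-learn (placeholder data; with one training sample per class, n_neighbors has to stay at 1):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder: a single training sample per class is already enough for 1-NN
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([0, 1])

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

print(knn.predict([[0.2, 0.1], [0.9, 0.8]]))   # -> [0 1]
```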
One first observation is that with that small a number of samples you most likely cannot afford to have 11 features, due to the curse of dimensionality and specifically the aspect of it known as the Hughes effect.
I would propose using a dimensionality reduction method like PCA and keeping only the first two components. If you can also verify that most of the variance is captured there, then you are in good shape.
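For illustration, reducing 11 features to the first two principal components with scikit-learn (placeholder data; explained_variance_ratio_ is the quantity to check):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(18, 11)                      # placeholder: 18 samples x 11 features

X_std = StandardScaler().fit_transform(X)       # PCA is scale-sensitive
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

# If this sums to, say, > 0.8, most of the variance lives in the first two components
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())
```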
For similar reasons it is not very likely that you can afford most of the ensemble methods either, unless we are talking about something that explicitly tries to reduce the variance component of the error (as opposed to the bias), such as bagging. This might be a good idea, but how well it works really depends on your dataset.
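If you do try bagging, a hedged scikit-learn sketch with a deliberately weak, low-variance base learner could look like this (the data and all parameters are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=30, n_features=5, random_state=0)  # placeholder

# Bagging shallow trees mostly attacks the variance component of the error
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=1),
                        n_estimators=25, random_state=0)

print(cross_val_score(bag, X, y, cv=3).mean())
```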
As previous answers have indicated, a linear SVM might be a good first choice (again, it is unlikely that you can afford the additional degrees of freedom offered by the RBF or other non-linear kernels).
Regarding oversampling: if you have a class imbalance problem then by all means use SMOTE, it works really well for that. Also, if you know something about the ratio of the classes in the general population, you can use it to set this ratio. If you are in neither of these cases I would not use it, since these are the things it is mainly designed to do (and it does them well).
If you have any prior information about the problem then this, as pointed out in other answers, needs to be captured. It is especially important when your training sample is as small as it is here.
In any case you need to check the stability of your model. Bootstrapping the training set and seeing how much the results vary is a reasonable first step. If you see a lot of variation, then I would propose visualizing the class separation boundary in the feature space of the first two PCA components. This can be done by asking your classifier (after training is done) to classify a grid of data points at small intervals (for example using meshgrid in Matlab) and painting the two classes with different colors. In general, if your separation boundaries are not reasonably smooth, it is very likely that you are over-fitting. If you know the classes and the properties of the problem, you can check whether these are satisfied (for example, check that you are not creating artificially complex boundaries just to accommodate some outliers). Good luck!
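A rough Python sketch of both checks (bootstrap resampling of the training set, then painting a classifier's decision regions over a grid of the two PCA components; the data, classifier, and matplotlib dependency are all placeholders/assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.rand(20, 11)                      # placeholder data
y = np.array([0, 1] * 10)

X2 = PCA(n_components=2).fit_transform(X)

# 1) Stability: refit on bootstrap resamples and watch how much the accuracy varies
scores = []
for i in range(30):
    Xb, yb = resample(X2, y, random_state=i)
    scores.append(SVC(kernel="linear").fit(Xb, yb).score(X2, y))
print("bootstrap score spread:", np.std(scores))

# 2) Boundary: classify a fine grid of points and colour the two regions
clf = SVC(kernel="linear").fit(X2, y)
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min(), X2[:, 0].max(), 200),
                     np.linspace(X2[:, 1].min(), X2[:, 1].max(), 200))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X2[:, 0], X2[:, 1], c=y)
plt.show()
```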
When you have n = 10, it is better to perform some statistical tests on the data before applying any machine learning algorithms.
Check the Cronbach's alpha value.
Even if you create a prediction model based on n = 20, it won't be reliable. Your accuracy may be 98 percent, but the confidence with which the model makes its decisions will be very low.
Therefore, on unseen data, it may well give wrong decisions.
The probability of a "decision by chance" will be high.
Assuming that n+ < 20 refers to positive training examples, and that you can obtain n- >> 20, that is, a massive number of negative samples, then you may wish to look into hard-negative mining techniques, which have had some success in computer vision. The general idea is fairly simple and might be adaptable to other domains.
P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan
Object Detection with Discriminatively Trained Part Based Models
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, Sep. 2010
Section 4.4 Data-mining hard examples, SVM version
I have used this technique with a single +ve sample and 1 billion -ve samples, and it worked quite well.
edit:
You can think of it as basically learning what is "not" the thing of interest in this extreme case!
I should also reference this for that case:
http://www.cs.cmu.edu/~tmalisie/projects/iccv11/
"This paper proposes a conceptually simple but surprisingly powerful method which combines the effectiveness of a discriminative object detector with the explicit correspondence offered by a nearest-neighbor approach. The method is based on training a separate linear SVM classifier for every exemplar in the training set. Each of these Exemplar-SVMs is thus defined by a single positive instance and millions of negatives."
Peter, see the Group Method of Data Handling (GMDH). It is a set of methods designed to handle small data samples. 20 years ago I was experimenting with MIA GMDH on a Nuclear dataset (n = 10) and it worked very well. Frank Lemke implemented many GMDH-based methods in his KnowledgeMiner software. It is especially efficient for multivariate, short, and noisy data samples.