It would be very helpful if you could give more information about the problem you are facing. I assume there is a concrete task behind your question.
I can only assume your task is classification. How many classes do you have? How imbalanced is the dataset? Is the distribution of classes in your training data the same as in the target application (test data)? How long are the feature vectors? Do you want to use a specific learning method? Does your data set cause you any specific problems?
For example, with two-class SVMs and neural networks you rarely have to do anything special for unbalanced training sets. Usually it is sufficient to calibrate the final classifier properly (set a proper decision threshold).
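To make the threshold-calibration point concrete, here is a minimal sketch (my own illustration, assuming scikit-learn and a synthetic imbalanced set): train a plain classifier, then pick the decision threshold on held-out data instead of using the default 0.5.

```python
# Sketch: calibrate the decision threshold of a binary classifier trained on
# an imbalanced set, instead of resampling the data. All names and numbers
# here are illustrative assumptions, not from the original question.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

# Compare the default 0.5 threshold with a threshold chosen to maximize F1
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_te, probs >= t) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(f"default F1: {f1_score(y_te, probs >= 0.5):.3f}")
print(f"best threshold {best_t:.2f}, F1: {max(scores):.3f}")
```

In practice you would pick the threshold on a validation split (not the test set) and optimize whatever metric matters for your application.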
Regards.
PS: I can't help it, I have to respond to the answer of Indrajit Mandal. Indrajit, do you think your answer is helpful in any way? I would strongly question the value of what you wrote.
How large is the dimension? For very high-dimensional data, PCA or Random Projection (RP) can be used to reduce the dimension. RP is a data-independent transformation and has shown a good ability to densify the data space. For imbalanced data sets, SMOTE is well known as a way to address the issue.
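As a rough sketch of combining the two ideas (my own example, assuming scikit-learn; in practice you would use imbalanced-learn's SMOTE, but here a minimal SMOTE-style interpolation is written by hand to keep the snippet self-contained):

```python
# Sketch: (1) data-independent Random Projection to reduce dimension,
# (2) SMOTE-style oversampling of the minority class by interpolating
# between a sample and one of its k nearest minority-class neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X_major = rng.normal(0, 1, size=(500, 1000))   # majority class, 1000-dim
X_minor = rng.normal(2, 1, size=(20, 1000))    # minority class

# 1) Random Projection: the projection matrix does not depend on the data
rp = GaussianRandomProjection(n_components=50, random_state=0)
X_all = rp.fit_transform(np.vstack([X_major, X_minor]))
X_minor_rp = X_all[len(X_major):]

# 2) SMOTE-style synthesis: new point = x_i + gap * (x_j - x_i)
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minor_rp)
_, idx = nn.kneighbors(X_minor_rp)
synthetic = []
for i in range(len(X_minor_rp)):
    j = idx[i, rng.integers(1, k + 1)]   # skip position 0 (the point itself)
    gap = rng.random()
    synthetic.append(X_minor_rp[i] + gap * (X_minor_rp[j] - X_minor_rp[i]))
synthetic = np.asarray(synthetic)
print(synthetic.shape)  # (20, 50)
```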
There are a number of methods to reduce dimensionality, such as SVD (Singular Value Decomposition), PCA (Principal Component Analysis), and ICA (Independent Component Analysis). Select the method based on your application; for text data, SVD is commonly used.
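For the text case mentioned above, a small sketch (assuming scikit-learn; the toy documents are my own) of SVD applied to a TF-IDF matrix, i.e. the usual latent semantic analysis setup:

```python
# Sketch: truncated SVD on a sparse TF-IDF document-term matrix.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "churn prediction with imbalanced data",
    "dimensionality reduction for text data",
    "svd and pca reduce feature dimension",
    "customer churn and class imbalance",
]
tfidf = TfidfVectorizer().fit_transform(docs)        # sparse doc-term matrix
svd = TruncatedSVD(n_components=2, random_state=0)   # keep 2 latent dimensions
reduced = svd.fit_transform(tfidf)
print(reduced.shape)  # (4, 2)
```

TruncatedSVD works directly on sparse matrices, which is why it is preferred over plain PCA for text.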
You may be interested in the paper "Learning from Imbalanced Data" from He and Garcia. In particular, one of the issues over which they focus is "the combination of imbalanced data and the small sample size problem" (small sample size being the major problem of high dimensional input). For example, they mention a few techniques for dealing with both problems in Section 3.4.
@Michal Hradis: it seems to me that Mandal is making a large number of such comments, together with downvoting a lot of answers that contradict him. Maybe we should signal this to the RG staff? (Sorry for going off-topic.)
Thanks to all. Well, we are facing both problems, high dimensionality and imbalance of examples, in our churn-prediction problem (Michal: our dataset has 50,000 samples and 200 features, with a minority class of 3.7%; it is a binary-class problem).
As mentioned by Natalia, we had used SMOTE earlier (data preprocessing). But we have not tried any special algorithms customized for imbalanced examples, such as the SVM variants reported for this purpose. Similarly, we did use PSO and mRMR for dimensionality reduction. But the problem is that once you decide to do both dimensionality reduction and balancing, which one should be performed first? They are going to affect each other. In my opinion, data balancing should be performed first. What do you suggest?
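One way to sketch the "balance first, then reduce" order, done inside the training split only so that neither step sees test data. This is my own illustration under loud assumptions: random oversampling stands in for SMOTE, and SelectKBest stands in for mRMR/PSO, purely to keep the example self-contained with scikit-learn.

```python
# Sketch: (1) balance the training split, (2) fit dimensionality reduction on
# the balanced training data, (3) apply the SAME transform to the test data.
# Class ratio mimics the 3.7% minority mentioned in the thread; everything
# else (dataset, oversampler, selector) is a stand-in assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=200, n_informative=20,
                           weights=[0.963, 0.037], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) balance the training split (random oversampling of the minority class)
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_tr == 1)
extra = rng.choice(minority, size=np.sum(y_tr == 0) - len(minority))
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

# 2) fit feature selection on the balanced data only
selector = SelectKBest(f_classif, k=30).fit(X_bal, y_bal)
X_bal_red = selector.transform(X_bal)
X_te_red = selector.transform(X_te)   # same transform applied to test data
print(X_bal_red.shape[1], X_te_red.shape[1])  # 30 30
```

The key point either way round is to keep both steps inside the training fold; reversing the order only requires swapping steps 1 and 2, so both variants can be compared by cross-validation.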