Suppose I have a data set where the vectors are described by 10 features. If I do PCA and keep all principal components (i.e. no dimensionality reduction), is the classification accuracy expected to be equivalent to that obtained when I do not use PCA?
PCA is basically a dimensionality reduction technique: you choose the top k principal components. PCA is not used for classification purposes. If your goal is classification, then choose Linear Discriminant Analysis (LDA), or a support vector machine (SVM) if the data is linearly separable; otherwise you can choose a non-linear SVM for non-linear data.
I suppose you are asking: if I do PCA and use all the PCs in a classification algorithm, will I get equivalent results? It depends on your data, especially its distribution. I suppose you already scale; if that is the case, then you should preprocess your data first to bring it closer to a normal distribution, and then do the centering and scaling.
If the differences in units are meaningful for your classification and you have no outliers in your data (outliers have a noticeable influence on the axes of the PCs), then your classification results should be similar.
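A minimal sketch of that preprocessing order in Python/scikit-learn, assuming PowerTransformer as the "bring to normal" step (the specific method is my assumption, not stated above) and a synthetic data set:

```python
# One possible reading of the advice above: transform the features towards
# normality, then centre and scale, before running PCA.
# PowerTransformer is an assumed choice; the poster does not name a method.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Yeo-Johnson transform towards normality; standardize=True also centres and scales.
prep = make_pipeline(PowerTransformer(method="yeo-johnson", standardize=True),
                     PCA(n_components=None))
X_pcs = prep.fit_transform(X)
print(X_pcs.shape)  # (300, 10): all principal components are kept
```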
Thanks for your reactions. I understand that PCA is a dimensionality reduction technique; however, nothing keeps us from using all the principal components. In fact, I was experimenting on a data set with 18 features. I normalise the feature vectors using the L2 norm; I do not apply a z-transform. When I transform the data using all principal components, I again obtain a data set with 18 features. The classification results that I obtain with these transformed features (using Naive Bayes) are significantly higher than those when I do not transform the data. Am I missing something? I was expecting to achieve the same results.
George, applying PCA will attempt to de-correlate the components of your data. In cases where your data is highly correlated before PCA, this may improve the independence of the components. Naïve Bayes is often applied per component (although you do not provide such details right now) and does assume independence of its inputs (that is the naïve part, right?). So after PCA the features will be closer to independent, and thus you might get better results. I guess in effect you perform a whitening operation, as described in https://en.wikipedia.org/wiki/Whitening_transformation. Best regards, Klamer
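A minimal sketch of this effect, assuming scikit-learn, a synthetic data set with deliberately redundant (correlated) features, and Gaussian Naive Bayes; setting whiten=True makes the PCA step an explicit whitening transformation in the sense of the link above:

```python
# Sketch: GaussianNB on correlated synthetic data, with and without a full
# (no-dimensionality-reduction) PCA/whitening step beforehand.
# The data set and its correlation structure are made up for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=0)  # redundant => correlated features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nb_raw = GaussianNB().fit(X_tr, y_tr)
nb_pca = make_pipeline(PCA(n_components=10, whiten=True), GaussianNB()).fit(X_tr, y_tr)

print("NB on raw features      :", nb_raw.score(X_te, y_te))
print("NB after full PCA/whiten:", nb_pca.score(X_te, y_te))
```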
Each PC found by PCA is a normalized linear combination of the n (= 10 in your case) initial predictors, where normalized means that the L2 norm of the linear coefficients is equal to one. So, assuming that the number of observations is greater than 10 and that the rank of the predictor matrix equals 10, using {PC1, ..., PC10} instead of {X1, ..., X10} as the predictor matrix against the same output variable will give you the same hypothesis function (= prediction on the test set) if the solution of the optimization problem behind your model is invariant to this kind of linear transformation of the predictor space.

For example, if you were using a logistic regression model you would get the same prediction and, as a consequence, the same accuracy. On the other hand, if you were using a non-linear model like an SVM, you would get a different prediction and, in general, a different accuracy.

The mathematical proof of this is a bit long and complex, so you can find here an R simulation where, with the same {Xtrain, Ytrain} and {Xtest, Ytest}, you get the same prediction on Xtest (and hence the same accuracy = 0.45) both using the initial predictors {Xtrain, Xtest} and their PCs {PC.train, PC.test}. On the contrary, the same does not hold for the SVM (the accuracy from the initial predictors equals 0.5, while using the PCs as predictors it equals 0.45). You can easily change {Xtrain, Ytrain} and {Xtest, Ytest} just by changing the initial seed (here set to 333), and you will find that the same pattern holds.
Example: a "linear" model (logistic regression) vs. a non linear model (SVM)
It is now clear to me why the Naive Bayes classifier is likely to work better after rotating the axes (i.e. decorrelating) with PCA. Moreover, classifiers that are based on comparing the pairwise distances of samples should not be affected when we rotate the axes, because the pairwise distances remain exactly the same. I confirmed this when I used KNN. However, when I used an SVM with a linear kernel, the classification rate improved after I rotated the axes with PCA. My understanding is that the first step of an SVM with a linear kernel is to compute the pairwise similarities using the dot product, which is a linear operation. What is causing such an improvement then?
It only behaves in the same way when the classifier is based on some distance measure, such as KNN with the Euclidean distance. In the case of Naive Bayes (NB), for instance, by rotating the axes (the role of PCA) the features become uncorrelated. This satisfies the basic independence assumption of NB, and as a result NB performs much better.
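A quick numerical check of the distance-preservation point (synthetic data assumed): a full PCA rotation leaves all pairwise Euclidean distances unchanged, which is why KNN with that metric is unaffected.

```python
# Check that a full PCA (all components kept) preserves pairwise Euclidean
# distances; the data here is random and purely illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

X_rot = PCA(n_components=10).fit_transform(X)  # centering + rotation, nothing dropped

D_raw = pairwise_distances(X)
D_rot = pairwise_distances(X_rot)
print("max |distance difference|:", np.abs(D_raw - D_rot).max())  # ~0 up to floating-point error
```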
I applied PCA before training an SVM+RBF classification model. I used all PCs and it substantially increased the performance (accuracy, sensitivity, specificity, ...). I also asked myself the same question: "Does using all components make sense?" My experience says "yes".
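For reference, a minimal sketch of that kind of pipeline (all principal components kept, then an RBF-kernel SVM) with cross-validated accuracy and sensitivity; the data set and settings are illustrative assumptions, not the original experiment:

```python
# PCA keeping every component, followed by an RBF-kernel SVM, evaluated with
# 5-fold cross-validation; recall is reported as the sensitivity.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=18, random_state=1)

clf = make_pipeline(PCA(n_components=None), SVC(kernel="rbf"))
scores = cross_validate(clf, X, y, cv=5, scoring=["accuracy", "recall"])
print("accuracy   :", scores["test_accuracy"].mean())
print("sensitivity:", scores["test_recall"].mean())
```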