200 x 1 is a weird dimension, what type of data is this?
PCA is a data dimensionality reduction technique, meaning you can represent your dataset with significantly less data. In most cases, data is M x N dimensional, where M and N are the number of samples and features, respectively. Usually, the data can be expressed with fewer features and remain essentially unchanged. Alternatively, in our use case, we actually look to reduce the number of samples.
Here's my favorite example:
Imagine you had pictures of 100 people's faces. That might take up 1 GB of data. But you can apply PCA and maybe find that the first 6 components look like this:
Then you can take combinations of these six images, add them together, and reproduce almost perfectly any face in the original 100! For example, maybe someone's face is 50% of the first PCA image, 20% of the second PCA image, and so on.
Thus, you just need to store these 6 faces and the coefficients (i.e. 0.5, 0.2, etc.) used to sum the 6 images to reproduce anyone's face.
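As a rough illustration (not anyone's actual code here), a minimal sketch of this idea using scikit-learn's PCA; the `faces` array and its random contents are hypothetical stand-ins for real images:

```python
# A minimal sketch of the face example above, assuming a hypothetical
# array `faces` of shape (100, H*W): 100 flattened grayscale face images.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
faces = rng.random((100, 64 * 64))   # placeholder for real face data

pca = PCA(n_components=6)
coeffs = pca.fit_transform(faces)    # (100, 6): six coefficients per face

# Reconstruct any face as the mean face plus a weighted sum of the
# six component images.
face_0 = pca.mean_ + coeffs[0] @ pca.components_
print(coeffs[0])                     # e.g. the "0.5, 0.2, ..." weights
```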
Anyway, I guess I got off track, but PCA is about variation in a dataset. Along a single vector it doesn't have this interpretation anymore; it would probably just tell you where the most variation is among your 200 points (i.e. point 155 vs. 156?). I really don't know.
In fluid mechanics, and specifically in turbulence, PCA is called Proper Orthogonal Decomposition (POD). Indeed, a 200 x 1 vector is a strange one. If we think about flow fields, and we image the flow using one of the velocimetry methods, then we get, say, 50 realizations of a 10 x 20 (rows x cols) field of vectors, and we end up with a 200 x 50 data matrix: 200 rows for the flattened vector length and 50 columns for the uncorrelated realizations. POD is then used to identify the principal components, or POD modes, which are the most energetic flow patterns in these realizations. One can get up to 50 modes, typically ordered such that the first one is the strongest (highest energy content). You can see our open source particle image velocimetry project, www.openpiv.net, which also has a POD toolbox that will give you a quick entry point to the fluid dynamics use of PCA or POD.
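For illustration only, here is a minimal snapshot-POD sketch along those lines; the variable names and random data are made up, standing in for 50 flattened 10 x 20 fields:

```python
# A rough sketch of snapshot POD on a 200 x 50 data matrix
# (200 = 10*20 grid points, 50 realizations; random placeholder data).
import numpy as np

rng = np.random.default_rng(1)
snapshots = rng.random((200, 50))

mean_flow = snapshots.mean(axis=1, keepdims=True)
fluctuations = snapshots - mean_flow

# SVD of the fluctuation matrix: the columns of U are the POD modes,
# and the squared singular values rank the modes by energy content.
U, s, Vt = np.linalg.svd(fluctuations, full_matrices=False)
energy = s**2 / np.sum(s**2)
print(energy[:5])                        # fraction of energy per mode
first_mode = U[:, 0].reshape(10, 20)     # most energetic flow pattern
```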
The maximum number of principal components of an MxN matrix is min(M,N).
For your 200x1 dataset, you can extract at most one principal component, not three, and that principal component will be the dataset itself.
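A quick way to convince yourself of the min(M,N) limit, as a rough sketch with random placeholder data:

```python
# Check the min(M, N) limit with numpy's SVD (illustrative only).
import numpy as np

x = np.random.default_rng(2).random((200, 1))  # a 200 x 1 "dataset"
xc = x - x.mean(axis=0)                        # center the data
U, s, Vt = np.linalg.svd(xc, full_matrices=False)
print(s.shape)  # (1,): one singular value, hence one principal component
```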
Something is wrong with your assumptions. How can hyperspectral images have a dimension of 200x1? I would expect them to be e.g. 128x128x16 if you have 16 images of 128 by 128 pixels each.
If your data is MxN, then you can run PCA. If your data is MxNxP you can try other types of factor analysis such as PARAFAC, or concatenate MxN into one (M*N) vector and still run PCA.
Either way, all M, N and P must be equal to or larger than 3 to extract 3 components.
What are the dimensions of your data? Do I understand you correctly that you want to go from MxNxP (where P is the number of images in your stack) to MxNx3, i.e. turn one hyperspectral image into one RGB image?
As I mentioned, this is not a straightforward exercise because a PCA does not expect three-dimensional data. A PCA does not transform MxNxP to MxNx3, but e.g. MxN to Mx3.
Have a look at the article I linked below. It states: "In order to apply conventional PCA to a hypercube, it is necessary to 'unfold' the hypercube into a two-dimensional matrix in which each row represents the spectrum of 1 pixel. PCA can be applied to decompose the unfolded hypercube into eigenvectors (or scores) and eigenvalues (Figure 3(a)). A scores matrix may be obtained by transforming the original data into the directions defined by the eigenvectors (Figure 3(b)). The scores matrix can then be re-folded into a scores cube, such that each plane of the cube represents a principal component, known as a principal component scores image (Figure 3(c)). After unfolding of the masked hypercube, PCA was applied to hyperspectral data for each sample using the princomp function in Matlab..."
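As a rough sketch of that unfold, PCA, refold procedure (using scikit-learn's PCA rather than Matlab's princomp, and a made-up 128 x 128 x 16 hypercube):

```python
# Unfold a hypercube to (M*N) x P, run PCA, refold the scores.
import numpy as np
from sklearn.decomposition import PCA

M, N, P = 128, 128, 16
hypercube = np.random.default_rng(3).random((M, N, P))  # placeholder data

unfolded = hypercube.reshape(M * N, P)  # each row: the spectrum of one pixel
scores = PCA(n_components=3).fit_transform(unfolded)    # (M*N, 3)
scores_cube = scores.reshape(M, N, 3)   # one score image per component
```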
Haris, you mentioned that you want to get the first three eigenvalues from an MxNxP data set. This can be done with the following steps:
1. Create vector images:
reshape the data into P column vectors of (M*N)x1.
2. Normalize the vector images:
this removes any common features, so that each plane (MxN) is left with only unique features. To do so, subtract the average of the P vectors of (M*N)x1 from each (M*N)x1 vector.
3. Train the recognizer:
reduce the dimensionality of the training set by using the covariance matrix C = AᵀA (a small PxP matrix) instead of AAᵀ (a huge (M*N)x(M*N) matrix); the two share the same nonzero eigenvalues.
4. Select the k best eigenvectors [in your case k = 3], where k < P (see the sketch after these steps).
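Here is a minimal sketch of steps 1-4 in Python; the variable names and random image stack are mine, placeholders for real data:

```python
# Steps 1-4 with the small-covariance trick (placeholder data).
import numpy as np

M, N, P = 64, 64, 10
images = np.random.default_rng(4).random((M, N, P))  # stand-in image stack

# 1. Reshape into P column vectors of length M*N.
A = images.reshape(M * N, P)

# 2. Subtract the mean vector so only unique features remain.
A = A - A.mean(axis=1, keepdims=True)

# 3. Work with the small P x P matrix A^T A instead of the huge
#    (M*N) x (M*N) matrix A A^T; they share the same nonzero eigenvalues.
C_small = A.T @ A
eigvals, eigvecs = np.linalg.eigh(C_small)   # ascending eigenvalue order

# 4. Keep the k largest-eigenvalue components (here k = 3) and map
#    them back to image space via u_i = A v_i, then normalize.
k = 3
top = eigvecs[:, ::-1][:, :k]
eigenimages = A @ top
eigenimages /= np.linalg.norm(eigenimages, axis=0)
print(eigvals[::-1][:k])                     # the 3 largest eigenvalues
```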
I hope the following links will help you achieve your target.