Kmeans clustering: is PCA + KMeans useful or not?

David Morse

Hello Titas,

I think the answer depends on what you mean by "better" results (and on your specific research goal or question). If you first run PCA, you will have created a set of up to k linearly independent composite scores for each case, where k is the number of original variables (not the number of clusters). Unless you rotate the PCA solution, the first component will have the most variance associated with it, and it will likely be the most influential in how clusters are formed, especially if your similarity/proximity matrix is based on distance. Standardizing the component scores will solve the immediate problem of distortion due to unequal SDs, but the fact remains that, upon extraction, the first component (in PCA, or the first factor in common factor analysis) always accounts for the most variance in the data set.

You'll have to judge whether that makes sense in light of what you're trying to accomplish.
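To make the point about unequal SDs concrete, here is a minimal sketch using only NumPy. The data set, the mixing matrix, and all variable names are made up for illustration; PCA is computed through an SVD of the centered data, and "standardizing" is taken to mean rescaling each component score to unit standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 cases, 4 correlated variables (illustrative only).
latent = rng.normal(size=(200, 2))
mixing = np.array([[3.0, 2.0, 1.0, 0.5],
                   [0.0, 0.5, 1.0, 0.2]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 4))

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                    # unstandardized component scores
explained = S**2 / np.sum(S**2)       # share of total variance per component

# Standardize: rescale every component score to unit standard deviation.
scores_std = scores / scores.std(axis=0, ddof=1)

print("variance share per component:", np.round(explained, 3))
print("SD of raw scores:            ", np.round(scores.std(axis=0, ddof=1), 3))
print("SD of standardized scores:   ", np.round(scores_std.std(axis=0, ddof=1), 3))
```

With all components retained, Euclidean distances between the unstandardized scores are the same as between the centered original cases, so the first component dominates a distance-based clustering to the extent that it dominates the variance. After standardizing, every component carries equal weight in the distance. If you use scikit-learn, `PCA(whiten=True)` applies the same kind of rescaling.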

Good luck with your work.

Cristian Ramos-Vera

In agreement with David, see:

https://stats.stackexchange.com/questions/183236/what-is-the-relation-between-k-means-clustering-and-pca

http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf

https://www.researchgate.net/post/Which_would_you_use_first_K-Means_Clustering_or_Principal_Component_Analysis

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.162&rep=rep1&type=pdf

Conference paper: Principal Component Analysis and Effective K-Means Clustering.

Tarek Abd El-Hafeez

The first question that you should ask is whether or not you need to apply a dimensionality reduction technique. If you have very few features compared to the number of samples, you probably do not need to reduce the number of features. On the other hand, if the number of features is larger than the number of samples, then you will be dealing with the “curse of dimensionality”, and your k-means algorithm will not produce good results. In this case, you do want to reduce the number of features that you have.

There are several techniques you could use for dimensionality reduction. For example, you could use feature selection, where you select the features that you think are the most relevant for the problem at hand. Another approach is to use Principal Component Analysis (PCA), where you transform your data into a new coordinate system in which all the components are orthogonal to each other. The components are sorted from the one that describes the highest variance in the data to the one that describes the lowest. You would select a subset of the principal components as the features in your model, chosen so that they capture a majority of the variance.

Note that the running time of the k-means algorithm grows with the number of data points and the number of features in your data set, so it can be slow on large, high-dimensional data. In summary, it wouldn’t hurt to apply PCA before you apply a k-means algorithm.

ref: https://www.quora.com/Should-I-use-PCA-as-a-preprocessing-step-to-k-means-clustering
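The workflow described above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data are synthetic (three groups in 50 dimensions, with the group structure confined to two features), all features are assumed to share a common scale, and the choices of 2 components and 3 clusters are hypothetical values that you would have to pick for your own data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Hypothetical data: 3 groups of 100 points in 50 dimensions.
# Only the first two features separate the groups; the rest are noise.
centers = np.zeros((3, 50))
centers[1, 0] = 8.0
centers[2, 1] = 8.0
true_labels = np.repeat(np.arange(3), 100)
X = centers[true_labels] + rng.normal(size=(300, 50))

# Reduce to 2 principal components, then cluster the component scores.
pipeline = make_pipeline(
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

# Share of total variance kept by the retained components.
kept = pipeline.named_steps["pca"].explained_variance_ratio_.sum()
# Agreement with the known groups (only possible here because the data are synthetic).
ari_pca = adjusted_rand_score(true_labels, labels)

print("variance kept by 2 components:", round(float(kept), 3))
print("adjusted Rand index:", round(float(ari_pca), 3))
```

On real data there are no known group labels, so you would judge the result with an internal criterion or domain knowledge instead, and it is worth comparing against k-means on the unreduced features to see whether PCA changes the clusters at all.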

Titas De

Thanks, David Morse. Could you please explain what you meant by standardizing the component scores?

Thank you Cristian Ramos-Vera and Tarek Abd El-Hafeez