Clustering algorithms can indeed struggle with high-dimensional datasets. However, there are several ways this can be addressed.
Have you tried feature selection, feature extraction, or dimensionality reduction techniques? Principal component analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) can be a good start.
High-dimensional data are complex in nature, so it is difficult to form clusters. Dimensionality reduction techniques such as PCA can project the data into a lower-dimensional space and remove unwanted features from the high-dimensional data. Correlation analysis can be used to check whether two features are strongly correlated and therefore redundant.
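As a minimal sketch of the two ideas above, assuming scikit-learn and NumPy are available and using a synthetic dataset (the sizes and the 90% variance threshold are illustrative, not prescriptive):

```python
# Sketch: check feature correlations to spot redundancy, then project the
# data into a lower-dimensional space with PCA before clustering.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                    # 200 samples, 50 features
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)   # make feature 1 redundant

# The correlation matrix reveals redundant (highly correlated) feature pairs
corr = np.corrcoef(X, rowvar=False)
print(corr[0, 1])                                 # close to 1.0: redundant pair

# Keep enough principal components to retain 90% of the variance
pca = PCA(n_components=0.90)
X_low = pca.fit_transform(X)
print(X.shape, "->", X_low.shape)                 # fewer columns after PCA
```

The reduced matrix `X_low` can then be fed to any standard clustering algorithm in place of the raw features.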
How clustering struggles with high-dimensional datasets
Clustering algorithms struggle with high-dimensional datasets due to the "curse of dimensionality." As the number of dimensions increases, the data becomes increasingly sparse, and the distance between points becomes less meaningful.
In high-dimensional spaces, the volume of the space increases exponentially with the number of dimensions, making it difficult to identify meaningful patterns or clusters. This means that the clustering algorithm may not be able to accurately identify clusters, or it may identify spurious clusters due to the noise or random fluctuations in the data.
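This loss of meaningful distances can be demonstrated numerically. The sketch below (NumPy only, synthetic uniform data, sample sizes chosen for illustration) computes the "relative contrast" between the farthest and nearest point from a query: as the dimension grows, the contrast shrinks, so distance-based clustering loses its ability to discriminate.

```python
# Sketch of distance concentration: in high dimensions, all points end up
# roughly equidistant from a query point, so nearest vs. farthest neighbour
# becomes nearly meaningless for distance-based clustering.
import numpy as np

rng = np.random.default_rng(42)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                 # 500 uniform points in d dims
    q = rng.random(d)                        # a random query point
    dists = np.linalg.norm(X - q, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast = {contrast:.3f}")
```

The printed contrast drops sharply as `d` increases, which is exactly why raw Euclidean distances become unreliable for clustering in high dimensions.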
In addition, high-dimensional data often suffer from the problem of feature redundancy, where some of the features may be highly correlated or redundant, and provide little or no additional information. This can lead to bias in the clustering results and a loss of interpretability.
To overcome these challenges, various techniques have been developed, including feature selection and dimensionality reduction techniques like principal component analysis (PCA), t-SNE, or autoencoders. These techniques can help reduce the dimensionality of the data and remove irrelevant or redundant features, thereby improving the accuracy and interpretability of clustering algorithms.
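A small end-to-end sketch of this reduce-then-cluster workflow, assuming scikit-learn and using synthetic blob data (the number of components, clusters, and samples are illustrative assumptions):

```python
# Sketch: PCA to a handful of components, then k-means on the reduced data,
# scored against the known generating labels with the adjusted Rand index.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# 3 well-separated clusters hidden in 100 noisy dimensions
X, y = make_blobs(n_samples=300, n_features=100, centers=3, random_state=0)

X_low = PCA(n_components=10).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_low)

# 1.0 means the recovered clusters match the ground truth exactly
print(adjusted_rand_score(y, labels))
```

On real data the ground-truth labels are of course unavailable, so internal measures such as the silhouette score would be used instead.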
Clustering algorithms may face challenges on high-dimensional datasets because of noise, unreliable distance metrics, and difficulties in visualizing and interpreting the results. To overcome this, dimensionality reduction techniques, such as feature selection or feature extraction, can be applied to reduce the number of features and improve the performance of clustering algorithms. Additionally, carefully selecting clustering algorithms that are robust to high dimensionality can mitigate these struggles.
Dear Fatemeh, thanks. The big question is how clustering struggles, and yes, PCA can reduce those problems. Why does clustering struggle? I think feature redundancy and random fluctuations are the major causes, and PCA helps overcome the sparsity that high dimensionality creates.
Clustering often struggles for many possible reasons, and investigating how those problems arise is itself a research question. Rather than focusing only on remedies, it would be better to ask how the struggle begins from the moment we start clustering.