You can use the clara algorithm; it is implemented in R (package cluster). It works by clustering a random sample from the dataset and then assigning all elements in the dataset to those clusters. I have never used it on a matrix with 80,000 rows, but it works quite well for 50,000.
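A minimal sketch of how that might look, assuming your data is already in a numeric matrix (the matrix below is just a random placeholder):

```r
# clara() clusters a set of random subsamples around k medoids, then assigns
# every row of x to the nearest medoid, so it scales to large matrices.
library(cluster)

set.seed(1)
x <- matrix(rnorm(80000 * 5), ncol = 5)   # placeholder for your 80,000-row data

fit <- clara(x, k = 4, samples = 50, sampsize = 1000)
table(fit$clustering)   # cluster sizes
fit$medoids             # one representative row per cluster
```

Increasing samples and sampsize trades speed for more stable medoids.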
I agree with Toni: use the k-means clustering algorithm described in this paper: https://www.cs.umd.edu/~mount/Projects/KMeans/pami02.pdf. I clustered 30k random micro-architectural points from my compiler design space, using low power, high performance and high intensity as the factors, into k=4 classes (among which the Pareto-efficient points were identified).
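The paper above describes a particular efficient k-means implementation; as a rough illustration of the same workflow using base R's kmeans() instead, with made-up column names for the three factors:

```r
# placeholder design-space data: one row per micro-architectural point
design <- data.frame(power       = runif(30000),
                     performance = runif(30000),
                     intensity   = runif(30000))

feats <- scale(design)                        # put the factors on a common scale
km <- kmeans(feats, centers = 4, nstart = 25) # several random restarts
design$class <- km$cluster                    # the k = 4 classes
```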
I suggest you add at least one more constraint so that you can make it multi-objective.
If you want to take a look at my constraints, feel free to check the paper at: https://www.researchgate.net/publication/259820777_A_Framework_for_Compiler_Level_Statistical_Analysis_over_Customized_VLIW_Architecture
cheers.
Thanks for the answers... I have tried k-means, but there I need to specify the number of clusters a priori. What I am interested in is clustering the data in an unsupervised way and seeing how it bins into different sub-clusters...
@Srinivas: but if your clustering criterion (objective) has only one objective (e.g. using less RAM), there is no need to go through all this pain, because you can simply sort the points by RAM usage and that would be the result.
As I suggested, make it multi-objective and then cluster. Here you could set the number of clusters to, say, 4, standing for instance for low RAM / low power (LL), high RAM / high power (HH), and so on.
P.S.: You could then filter the Pareto points as well, giving you the multi-objective optimal points within each and every cluster.
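A hypothetical sketch of both steps, assuming two objectives to be minimized (ram and power are made-up column names):

```r
set.seed(1)
pts <- data.frame(ram   = runif(1000),    # placeholder objective values
                  power = runif(1000))

# cluster on the scaled objectives, then label each cluster LL/LH/HL/HH
# according to whether its centre is below or above average on each objective
# (labels are not guaranteed to be unique; this is only a sketch)
km  <- kmeans(scale(pts), centers = 4, nstart = 25)
lab <- apply(km$centers, 1, function(ctr)
  paste0(ifelse(ctr["ram"] < 0, "L", "H"), ifelse(ctr["power"] < 0, "L", "H")))
pts$class <- lab[km$cluster]

# TRUE for rows not dominated by any other row (lower is better on both objectives)
is_pareto <- function(obj) {
  apply(obj, 1, function(p)
    !any(apply(obj, 1, function(q) all(q <= p) && any(q < p))))
}

# keep only the Pareto-optimal points within each cluster
pareto_by_cluster <- lapply(split(pts, pts$class), function(d)
  d[is_pareto(as.matrix(d[, c("ram", "power")])), ])
```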
k-means-lite can handle 100,000 points (and much larger datasets) in fractions of a second on today's typical laptop. It is more efficient than several known 'efficient' algorithms: its running time is constant in the number of data points N. It is also very easy to understand and implement.
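The core scaling trick behind approaches like this (and clara above) is to do the expensive clustering on a small fixed-size sample and then assign every point to the nearest resulting centre. A rough sketch of that idea in R (not necessarily the exact k-means-lite procedure):

```r
set.seed(1)
x <- matrix(rnorm(100000 * 5), ncol = 5)   # placeholder 100,000-point dataset
k <- 4

# cluster a small fixed-size sample, so this step does not grow with N
km <- kmeans(x[sample(nrow(x), 2000), ], centers = k, nstart = 10)

# assign every point to the nearest of the k centres found on the sample
nearest_centre <- function(x, centres) {
  d2 <- sapply(seq_len(nrow(centres)), function(j)
    rowSums(sweep(x, 2, centres[j, ])^2))   # squared distance to centre j
  max.col(-d2)                              # index of the closest centre
}
labels <- nearest_centre(x, km$centers)
```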
You still have to supply the number of clusters k. However, since it is very fast, you can adopt common approaches such as the elbow or silhouette methods, which involve multiple runs of the clustering algorithm.
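For example, a minimal model-selection loop using base R's kmeans() and silhouette() from the cluster package, run on a subsample to keep the silhouette computation cheap (the data here is a random placeholder):

```r
library(cluster)

set.seed(1)
x   <- matrix(rnorm(100000 * 5), ncol = 5)  # placeholder dataset
sub <- x[sample(nrow(x), 2000), ]           # subsample for silhouette()

# average silhouette width for k = 2..8; pick the k that maximises it
sil_width <- sapply(2:8, function(k) {
  km <- kmeans(sub, centers = k, nstart = 10)
  mean(silhouette(km$cluster, dist(sub))[, "sil_width"])
})
best_k <- (2:8)[which.max(sil_width)]
```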