Test for consistent clustering results on different datasets

Tanvi Banerjee

Not sure if I understood correctly, but if you are experimenting with clustering techniques, the standard way to validate your results is to compute several validity measures, such as the Dunn index, the Xie-Beni index, the Davies-Bouldin index, the partition coefficient, etc. The key is to try different indices and settle on the one that identifies the clusters you expect to see, or the one that makes the most sense for your problem. Does that help answer your question at all?
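As a concrete illustration of one of these indices, here is a minimal sketch of the Davies-Bouldin index in plain Python. It assumes Euclidean distance and clusters given as lists of points; the function names and the toy data are made up for the example. Lower values indicate tighter, better separated clusters.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Assumes at least two clusters and no two clusters sharing a centroid.
    """
    cents = [centroid(c) for c in clusters]
    # Scatter: average distance of each point to its own centroid.
    scatter = [
        sum(euclidean(p, m) for p in c) / len(c)
        for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case (largest) similarity ratio against any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Hypothetical toy data: the same four points grouped two different ways.
tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
mixed = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]

print(davies_bouldin(tight))  # small value: compact, well separated
print(davies_bouldin(mixed))  # large value: spread out, overlapping
```

Running the same index on several candidate clusterings of one dataset gives a quick way to rank them, though different indices can disagree, which is why trying more than one is worthwhile.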

Billy Kidwell

Let me add a little bit more information.

I am clustering software bugs. I perform the clustering on all of the known bugs from version 1.0 of the software. I get a set of clusters and their descriptive features.

Then I perform clustering on version 2.0. If the clusters are capturing meaningful groups, I would expect a similar set of clusters, with a similar distribution of bugs.

That is why I ask about consistency: is the clustering of the data consistent across two different datasets from a similar source? I have not seen any papers that tackle this problem, but I suspect they exist, perhaps under different circumstances.

Well, the ideal clustering technique would yield similar results for similar datasets (i.e. from a similar source), so if your results differ between versions, your clustering technique may not be the right one. Which ones did you implement? Agglomerative? K-means? Fuzzy k-means? Also, what is the dimensionality of your feature set?

My results look fairly good, the clusters seem to match. My challenge is in how to compare these and perform a hypothesis test that demonstrates that they are similar.

I am using the repeated bisection method in CLUTO with the I1 and I2 criterion functions. Using I2 is similar to k-means. I1 works in a similar way, but it maximizes the pairwise similarity of the points in each cluster, rather than maximizing their similarity to the centroid.
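To make the difference between the two criteria concrete, here is a small sketch in plain Python. It is not CLUTO code; it assumes cosine similarity on unit-length vectors and uses the formulas as I understand them from the CLUTO documentation, where D_r is the sum of the vectors in cluster r and n_r is its size: I1 = sum of ||D_r||^2 / n_r, and I2 = sum of ||D_r||. The function names and toy vectors are hypothetical.

```python
import math


def composite(cluster):
    """Sum of all (unit-length) vectors in a cluster, component by component."""
    return [sum(col) for col in zip(*cluster)]


def squared_norm(v):
    return sum(x * x for x in v)


def i1(clusters):
    """Sum over clusters of the size-weighted average pairwise cosine similarity.

    For unit vectors, the sum of all pairwise cosines in a cluster equals
    ||D_r||^2, so each cluster contributes ||D_r||^2 / n_r.
    """
    return sum(squared_norm(composite(c)) / len(c) for c in clusters)


def i2(clusters):
    """Sum over all points of the cosine similarity to their cluster centroid.

    For unit vectors this reduces to the sum of ||D_r|| over clusters.
    """
    return sum(math.sqrt(squared_norm(composite(c))) for c in clusters)


# Hypothetical unit vectors: two identical documents and one orthogonal one.
a, b, c = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]

good = [[a, b], [c]]  # identical vectors grouped together
bad = [[a, c], [b]]   # orthogonal vectors forced into one cluster

print(i1(good), i2(good))
print(i1(bad), i2(bad))
```

Both criteria prefer the first partition here, but they penalize the bad grouping by different amounts, which is one reason the two functions can lead to somewhat different clusterings on real data.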

I have ~110 features (it varies by 2 or 3 between the versions of the software). The first dataset is 2800 instances. Subsequent versions are around 5000-7000 instances.

Here was my best guess at a way to show that the clusters were consistent.

I ran the clustering separately for three versions of the software. Then I combined all of the instances into one large dataset and clustered again. I calculated expected results by adding up, for each cluster, the instances from each of the three versions, matching clusters by their descriptive features. There was one cluster in each of the versions that did not show up in one of the others, so I increased k by one for the combined dataset. Then I did a chi-square comparison of the combined results against the expected results obtained by adding up the per-version counts.
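The comparison described above can be sketched as follows. This is only an illustration in plain Python: the cluster labels and counts are invented, and it assumes the clusters have already been matched across versions by their descriptive features.

```python
def expected_counts(per_version_counts):
    """Add up cluster sizes across versions.

    Each item maps a cluster label to the number of bugs in that cluster;
    a cluster missing from a version simply contributes zero.
    """
    labels = sorted({label for v in per_version_counts for label in v})
    return {
        label: sum(v.get(label, 0) for v in per_version_counts)
        for label in labels
    }


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over clusters."""
    return sum(
        (observed.get(label, 0) - expected[label]) ** 2 / expected[label]
        for label in expected
    )


# Hypothetical per-version cluster sizes (labels A-D are placeholders).
v1 = {"A": 1200, "B": 900, "C": 700}
v2 = {"A": 2000, "B": 1600, "C": 1000, "D": 400}
v3 = {"A": 2500, "B": 2000, "D": 1500}

# Hypothetical cluster sizes from clustering the combined dataset.
combined = {"A": 5650, "B": 4560, "C": 1690, "D": 1900}

expected = expected_counts([v1, v2, v3])
stat = chi_square_statistic(combined, expected)
dof = len(expected) - 1

print(expected)
print(round(stat, 4), dof)

# With 3 degrees of freedom the 5% critical value is about 7.815.
print("differs at the 5% level" if stat > 7.815 else "no significant difference")
```

Note that a small statistic only means the test fails to detect a difference between the combined and summed distributions; on its own it is not positive evidence that the clusterings are the same.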

The chi-square result was good, but I am unsure whether that proves anything.