Cross validation of unsupervised classification, how to do it?

Sebastian Zaunseder

Hi Antonio,

To me, your approach of deriving quality parameters from the folds sounds good. What you will get is an idea of how reliable your validity index is in a statistical sense.

However, a problem might arise with too few instances or highly imbalanced cluster sizes: I could imagine the procedure becoming troublesome if there is a high probability that some of your folds do not contain all clusters.
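To get a feeling for how often this happens, one could count, for a given labelling, which clusters are absent from each fold. The following is only a minimal sketch: the function name `missing_clusters_per_fold` and the 97/3 toy labelling are made up for illustration, and the fold scheme (shuffled k-fold from scikit-learn) is an assumption, not something stated in the question.

```python
import numpy as np
from sklearn.model_selection import KFold


def missing_clusters_per_fold(labels, n_splits=5, seed=0):
    """For each fold, list the cluster labels absent from its held-out part."""
    labels = np.asarray(labels)
    all_clusters = set(np.unique(labels).tolist())
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    missing = []
    # KFold only needs the number of samples, so a one-column view is enough.
    for _, test_idx in folds.split(labels.reshape(-1, 1)):
        present = set(np.unique(labels[test_idx]).tolist())
        missing.append(sorted(all_clusters - present))
    return missing


# Hypothetical, highly imbalanced labelling: 97 units in cluster 0, 3 in cluster 1.
demo_labels = np.array([0] * 97 + [1] * 3)
demo_missing = missing_clusters_per_fold(demo_labels, n_splits=5)
print(demo_missing)
```

With only 3 units in the small cluster and 5 folds, at least two folds cannot contain it, whatever the shuffling.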

Greetings, Sebastian

Antonio Irpino

Thanks, Sebastian. I certainly agree with the choice of a validity index.

Let's say that I choose k-means and the Silhouette score, and let's consider highly imbalanced cluster sizes (in general, we know neither the sizes of the clusters nor, quite often, their number).

If not all clusters are contained in each fold, this need not affect the Silhouette score. Indeed, I take my fold and compute the Silhouette score of each "test" unit with respect to the solution obtained on the "train" units.
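One possible reading of this procedure, sketched in Python with scikit-learn. The assumptions are mine: each test unit is assigned to the nearest training centroid, and its silhouette is computed from its mean distances to the training members of each cluster. The function name `out_of_fold_silhouette` and the three-blob toy data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from sklearn.model_selection import KFold


def out_of_fold_silhouette(X, n_clusters=3, n_splits=5, seed=0):
    """Mean silhouette of the held-out units, scored against the training clustering."""
    fold_scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_train)
        test_labels = km.predict(X_test)  # nearest training centroid
        dist = pairwise_distances(X_test, X_train)
        # Mean distance from each test unit to the training members of each cluster.
        mean_dist = np.column_stack(
            [dist[:, km.labels_ == c].mean(axis=1) for c in range(n_clusters)]
        )
        rows = np.arange(len(X_test))
        a = mean_dist[rows, test_labels]          # cohesion: own cluster
        mean_dist[rows, test_labels] = np.inf     # mask own cluster
        b = mean_dist.min(axis=1)                 # separation: closest other cluster
        fold_scores.append(((b - a) / np.maximum(a, b)).mean())
    return np.array(fold_scores)


# Toy data: three well-separated Gaussian blobs (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(size=(60, 2)) for c in centers])
scores = out_of_fold_silhouette(X_demo)
print(scores.round(3))
```

The spread of the per-fold scores gives a rough idea of how stable the index is. Note that the score depends only on the clusters found in the training part, so a cluster missing from the test part does not by itself break the computation.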

What do you think?

Venkatesh Gauri Shankar

Cross-validation is a family of techniques for measuring the generalization capacity of a model (for example, a regression) and for guarding against over-fitting and other limitations. It can also be used to compare several statistical models, which can then be combined, for instance by ensemble averaging, to obtain a better regression.

Conventional validation works with a single partition of the sample data into one training set and one testing set: the training set is used to train the model, and the testing set is used to measure the generalization capacity of the trained model. In contrast, cross-validation works with multiple such partitions of the sample data to get more insight into the generalization capacity of the model.

Leave-k-out cross-validation leaves out k observations at each step, whereas k-fold cross-validation holds out, at each step, one of the k subsamples into which the original sample is randomly partitioned.
The differences between Monte Carlo cross-validation (repeated random sub-sampling validation) and k-fold cross-validation are discussed here.
For linear regression, a closed-form expression for the (leave-one-out) cross-validation error is available.
Cross-validation with data that are not independent requires a scheme that respects the dependency; for example, cross-validation for time series partitions the data into successive time segments.
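The schemes mentioned above can be compared side by side on a toy index set. This is only an illustrative sketch using scikit-learn's splitters; the sample size, the number of splits and the test fraction are arbitrary choices of mine.

```python
import numpy as np
from sklearn.model_selection import KFold, LeavePOut, ShuffleSplit, TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 toy observations, assumed ordered in time

splitters = {
    "k-fold (k=4)": KFold(n_splits=4, shuffle=True, random_state=0),
    "leave-2-out": LeavePOut(p=2),
    "Monte Carlo": ShuffleSplit(n_splits=4, test_size=0.25, random_state=0),
    "time series": TimeSeriesSplit(n_splits=4),
}

# Collect the held-out indices produced by each scheme.
test_sets = {
    name: [test_idx.tolist() for _, test_idx in cv.split(X_toy)]
    for name, cv in splitters.items()
}

for name, sets in test_sets.items():
    print(f"{name}: {len(sets)} splits, first test set {sets[0]}")
```

In k-fold every unit is held out exactly once, Monte Carlo test sets are drawn independently and may overlap, leave-2-out enumerates every pair, and the time-series splitter always trains on observations that precede the test segment.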