Clearly, the answer depends on which class of unsupervised algorithms you are referring to.
For example, dimensionality reduction techniques are generally evaluated by computing the reconstruction error. You can do this with techniques similar to those used for supervised algorithms, e.g. by using a holdout test set or by applying a k-fold cross-validation procedure.
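A minimal sketch of that idea (my own illustration, assuming PCA as the reduction technique and the iris data as a stand-in dataset): fit on the training folds, then measure reconstruction error on the held-out fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold

X = load_iris().data
errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pca = PCA(n_components=2).fit(X[train_idx])
    # Project the held-out fold and map it back to the original space.
    X_rec = pca.inverse_transform(pca.transform(X[test_idx]))
    errors.append(np.mean((X[test_idx] - X_rec) ** 2))

print("mean held-out reconstruction error:", np.mean(errors))
```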
Clustering algorithms are more difficult to evaluate. Internal metrics [1] use only information about the computed clusters to evaluate whether they are compact and well-separated (this is also what the answer by A.G. Ramakrishnan mentions). In addition, there are external metrics that perform statistical testing on the structure of your data [1].
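As a concrete internal metric, here is a minimal sketch using the silhouette coefficient (the choice of k-means and synthetic blobs is mine, not prescribed above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Values close to 1 indicate compact, well-separated clusters.
print("silhouette:", silhouette_score(X, labels))
```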
Density estimation is also rather difficult to evaluate, but there is a wide range of techniques, mostly used for model tuning [2], e.g. cross-validation procedures.
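For instance, a minimal sketch of cross-validated model tuning for kernel density estimation (assuming a Gaussian kernel and a grid search over the bandwidth; scikit-learn's `KernelDensity.score` is the held-out log-likelihood used by the search):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))

grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.logspace(-1, 0, 10)},
                    cv=5)
grid.fit(X)
print("bandwidth selected by cross-validation:", grid.best_params_["bandwidth"])
```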
In addition, unsupervised strategies are sometimes used in the context of a more complex workflow, in which an extrinsic performance function can be defined. For example, if clustering is used to create meaningful classes (e.g. clustering documents), it is possible to create an external dataset by hand-labelling and test the accuracy (the so-called gold standard). Similarly, if dimensionality reduction is used as a pre-processing step in a supervised learning procedure, the accuracy of the latter can be used as a proxy performance measure for the dimensionality reduction technique.
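A minimal sketch of both extrinsic setups (the dataset, k-means, PCA with 30 components, and logistic regression are my own illustrative choices): (a) comparing clusters to hand-labelled gold-standard classes, (b) using downstream classification accuracy as a proxy for the dimensionality reduction step.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)

# (a) external (gold-standard) evaluation of a clustering
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
print("ARI vs. gold standard:", adjusted_rand_score(y, clusters))

# (b) downstream accuracy as a proxy measure for dimensionality reduction
pipe = make_pipeline(PCA(n_components=30), LogisticRegression(max_iter=1000))
print("accuracy with PCA pre-processing:", cross_val_score(pipe, X, y, cv=5).mean())
```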
[1] Halkidi, Maria, Yannis Batistakis, and Michalis Vazirgiannis. "On clustering validation techniques." Journal of Intelligent Information Systems 17.2-3 (2001): 107-145.
[2] Hall, Peter, Jeff Racine, and Qi Li. "Cross-validation and the estimation of conditional probability densities." Journal of the American Statistical Association 99.468 (2004).
Basically, you are clustering the data in the feature space. So, you can look at the intra-cluster variance and the inter-cluster variance. For example, you can separate characters of different colours in a document image (even a camera-captured one) using unsupervised methods, and you know how to evaluate those results.
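A minimal sketch of those two quantities for a k-means result (my own illustration on synthetic data, not from the answer above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Intra-cluster: average squared distance of points to their own centroid.
intra = np.mean(np.sum((X - km.cluster_centers_[km.labels_]) ** 2, axis=1))

# Inter-cluster: average squared distance between pairs of centroids.
centroids = km.cluster_centers_
inter = np.mean([np.sum((a - b) ** 2)
                 for i, a in enumerate(centroids)
                 for b in centroids[i + 1:]])

print("intra-cluster variance:", intra, " inter-cluster variance:", inter)
```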
First, you must know whether the data set consists of labelled (supervised) data or not. In many studies, supervised learning is applied to unsupervised data and vice versa.
Here is a thesis suggesting that cross-validation is a valuable tool for unsupervised learning...
I found it here http://udini.proquest.com/view/cross-validation-for-unsupervised-pqid:1904931481/ and the full text is available here http://arxiv.org/pdf/0909.3052.pdf
You can simply evaluate its accuracy on classification problems.
Let k be the number of distinct classes in the classification problem. Apply your clustering algorithm to find k clusters and assign a label to each cluster (based on the most frequent class within it). Then use these labels as the predicted classes and evaluate the performance, as in the sketch below.
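A minimal sketch of this majority-label mapping (k-means on the digits dataset is my own illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
k = len(np.unique(y))                      # number of distinct classes
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Label each cluster with its most frequent true class, then score accuracy.
y_pred = np.empty_like(y)
for c in range(k):
    mask = clusters == c
    y_pred[mask] = np.bincount(y[mask]).argmax()

print("clustering accuracy:", accuracy_score(y, y_pred))
```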
I have to formally validate an association rule algorithm. The algorithm produces rules such as "A, B, C -> D, E" and so on. Are there any tools or instruments that I can use? Or what is the best approach to do that?
Yes, I wanted the answer to that too. I'm trying to build a validation model for a retail dataset (it only contains the items bought by multiple customers in individual transactions, about 90,000 transactions in total), but I am very confused about where to start.
All I know is this:
1) Divide the dataset into training, validation, and test sets.
2) Apply the model on training, and later test it on the test dataset. What is the validation dataset used for?
3) How do I create a model? I was told I have to create Apriori and Eclat testing models, but how? (See the sketch after this list for one way to check mined rules on a held-out split.)
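A minimal sketch of one possible validation setup (my own illustration, not a prescribed method): split the transactions, mine or choose a rule on the training part (e.g. with Apriori), then measure its support and confidence on the held-out part. The toy transactions and the rule {bread} -> {milk} are placeholders for your retail data.

```python
import random

transactions = [
    {"bread", "milk"}, {"bread", "butter"}, {"milk", "butter", "bread"},
    {"beer", "chips"}, {"bread", "milk", "butter"}, {"beer", "bread"},
]  # in practice: the ~90,000 retail transactions

random.seed(0)
random.shuffle(transactions)
split = int(0.7 * len(transactions))
train, test = transactions[:split], transactions[split:]

def support(itemset, data):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in data) / len(data)

def confidence(antecedent, consequent, data):
    return support(antecedent | consequent, data) / support(antecedent, data)

# Suppose the rule {bread} -> {milk} was mined from `train`;
# validate it on the unseen transactions.
rule_a, rule_c = {"bread"}, {"milk"}
print("held-out support:", support(rule_a | rule_c, test))
print("held-out confidence:", confidence(rule_a, rule_c, test))
```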
This is really an important question when using latent class (profile) analysis to separate a population. The following paper utilized an interesting technique for the validation.
Please take a look at the educational material provided by RapidMiner. Additionally, the recent book Data Science by Vijay Kotu and Bala Deshpande provides excellent illustrations of how to solve this.