For one of my studies, I designed an unsupervised predictive clustering model, and I am now looking for modification and post-processing steps that would let me use that clustering model for classification in a reliable way.
Unsupervised models are used when the outcome (or class label) of each sample is not available in your data. If you want to use your method to perform a classification task, you need those labels in order to assess how good the method is. If this is the case, i.e. class labels are available, I recommend that you test and compare your method with other well-known supervised machine learning models.
Daniel Urda, thank you so much, Daniel. Yes, class labels are available: the dataset has brain EEG recordings from patients with different brain disorders. I built that model for another clustering task and it worked really well, so I wanted to try it again (with some modification or post-processing to convert it into a classification model) for the classification of brain disorders.
You can use your clustering method on data with labels removed and then check its efficiency by counting how many samples labeled with the same class went to the same clusters. The trick here is that you cannot directly use the precision, recall, etc. metrics that you usually use to check the efficiency of classification, because cluster IDs are arbitrary and do not correspond to class names. Common metrics for clustering evaluation include Rand, Jaccard, and B-cubed. Here, in paragraph "5.3 Evaluating clusters", I suggest using the F-measure: Preprint A Linguistic Model of Classifying Community Pages in a Socia...
You can see in the formula how different it is from the F-measure used to analyze classification. Importantly, you will not be able to directly compare the efficiency of your clustering method with that of classification methods.
But if by classification you don't mean a Machine Learning method, but just that you want to use your clusters as a basis for terminological research - for example, that these clusters relate to some expert-defined classes of disorders, then you need to compare your own clustering method to other clustering (sic!) methods, not ML classification ones.
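To make the pair-counting idea behind the Rand and Jaccard metrics concrete, here is a minimal sketch in plain Python. It is not the F-measure from the preprint, only the standard pair-based definitions; the class names "A"/"B" and the cluster IDs are made-up illustration data.

```python
from itertools import combinations

def pair_counts(true_labels, cluster_ids):
    """Count sample pairs by whether they share a class and/or a cluster."""
    same_both = same_class_only = same_cluster_only = diff_both = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_ids[i] == cluster_ids[j]
        if same_class and same_cluster:
            same_both += 1
        elif same_class:
            same_class_only += 1
        elif same_cluster:
            same_cluster_only += 1
        else:
            diff_both += 1
    return same_both, same_class_only, same_cluster_only, diff_both

def rand_index(true_labels, cluster_ids):
    """Fraction of pairs on which the classes and the clusters agree."""
    a, b, c, d = pair_counts(true_labels, cluster_ids)
    total = a + b + c + d
    return (a + d) / total if total else 1.0

def jaccard_index(true_labels, cluster_ids):
    """Like Rand, but ignores pairs that are separated in both partitions."""
    a, b, c, _ = pair_counts(true_labels, cluster_ids)
    return a / (a + b + c) if (a + b + c) else 1.0

# Made-up example: five samples, two classes, two clusters.
labels = ["A", "A", "B", "B", "B"]
clusters = [0, 0, 1, 1, 0]
print(rand_index(labels, clusters), jaccard_index(labels, clusters))
```

Note that neither function needs to know which cluster "means" which class, which is exactly why these metrics suit clustering output.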
Hello. Unsupervised clustering methods create groups of instances that have similarities. If you do not have the classes associated with the data set, you can use clustering methods to find related instances. A specialist can then verify the groups and define labels (classes) for them.
After that, you can use supervised methods to learn from your new labeled data set. Good luck!
It just came to me that if your clustering method requires you to specify the number of clusters in the output, then you can (sic!) compare its result to a classifier. A classifier assigns every item to a class in a given set of classes. If your clustering method always returns the same number of clusters as there are classes in the classifier, then you can check which cluster has the largest number of similar items (items belonging to the same class). It will be the so-called "best cluster" for that class. Then you can compare your best clusters to the classes returned by the classifier.
Also, maybe, if your clustering method does not yet require you to specify the number of clusters, it would be a good idea to introduce this feature, as it would move your method closer to classification. IMHO.
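One simple variant of this "best cluster" idea can be sketched as follows: give each cluster the class that most of its members carry, then score the result like a classifier. This is only an illustration under assumptions; the function names and the toy data are hypothetical, and other matching schemes exist.

```python
from collections import Counter, defaultdict

def map_clusters_to_classes(cluster_ids, true_labels):
    """Assign each cluster the majority class among its members."""
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0]
            for cluster, counts in members.items()}

def accuracy_after_mapping(cluster_ids, true_labels):
    """Turn cluster IDs into class predictions and compute plain accuracy."""
    mapping = map_clusters_to_classes(cluster_ids, true_labels)
    predicted = [mapping[c] for c in cluster_ids]
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return mapping, correct / len(true_labels)

# Made-up example: two clusters, two classes, six samples.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["A", "A", "B", "B", "B", "A"]
mapping, accuracy = accuracy_after_mapping(clusters, labels)
print(mapping, accuracy)
```

Two caveats: the mapping should be learned on one part of the data and scored on another, otherwise the accuracy is optimistic, and nothing here stops two clusters from mapping to the same class.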
In clustering, the main goal is to group the data points in a data set into disjoint sets. The first clustering algorithm to implement is usually k-means, which is one of the most widely used clustering algorithms. To scale up k-means, one can learn about the general MapReduce framework for parallelizing and distributing computations, and then see how the iterations of k-means can use this framework.
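As a rough illustration of how one k-means iteration splits into a map step and a reduce step, here is a single-process sketch in plain Python. It does not use any real MapReduce framework; the function names and toy points are hypothetical, and in a distributed setting the map step would run on data shards in parallel.

```python
from collections import defaultdict

def closest_centroid(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))

def map_step(points, centroids):
    """Map: emit (centroid index, (point, 1)) for every point."""
    for point in points:
        yield closest_centroid(point, centroids), (point, 1)

def reduce_step(pairs, centroids):
    """Reduce: sum points and counts per key, then average into new centroids."""
    sums = {}
    counts = defaultdict(int)
    for k, (point, n) in pairs:
        if k in sums:
            sums[k] = [s + p for s, p in zip(sums[k], point)]
        else:
            sums[k] = list(point)
        counts[k] += n
    # A centroid that received no points keeps its previous position.
    return [tuple(s / counts[k] for s in sums[k]) if k in sums else centroids[k]
            for k in range(len(centroids))]

def kmeans(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds."""
    for _ in range(iterations):
        centroids = reduce_step(map_step(points, centroids), centroids)
    return centroids

# Made-up example: two obvious groups in 2-D.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
final_centroids = kmeans(points, [(0, 0), (10, 10)])
print(final_centroids)
```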
For supervised learning we need to have a labeled data set. If not, it is good to run unsupervised learning algorithms to automatically label the unlabeled data. Once the data is labeled using clustering algorithms, it is possible to use supervised learning algorithms. To link the two tasks, a simple script can be written that connects the output of clustering to the input of the classification task.
It's very simple. If you have an efficient, reliable clustering algorithm, apply it to the whole data set and split it into clusters, so that each cluster represents a separate class. After that, train your classifier on those cluster labels using a suitable training algorithm.
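The cluster-then-classify step can be sketched like this. The answers above do not say which classifier to train, so a nearest-centroid classifier is swapped in here purely as an assumption; the cluster IDs are taken as the (hypothetical) output of whatever clustering method was used.

```python
from collections import defaultdict

def train_nearest_centroid(samples, cluster_ids):
    """Treat cluster IDs as class labels and store the mean of each class."""
    groups = defaultdict(list)
    for sample, cluster in zip(samples, cluster_ids):
        groups[cluster].append(sample)
    return {cluster: tuple(sum(column) / len(rows) for column in zip(*rows))
            for cluster, rows in groups.items()}

def predict_class(model, sample):
    """Return the class whose stored mean is closest to the new sample."""
    return min(model,
               key=lambda c: sum((s - m) ** 2 for s, m in zip(sample, model[c])))

# Made-up feature vectors and the cluster IDs a clustering step assigned them.
samples = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]
cluster_ids = [0, 0, 1, 1]

model = train_nearest_centroid(samples, cluster_ids)
print(predict_class(model, (2.0, 2.0)), predict_class(model, (9.0, 9.0)))
```

The classifier can only be as good as the clusters it was trained on, so it is worth checking the clusters against any true labels first.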
An efficient approach to classifying the clusters is to use a multinomial Naive Bayes classifier.
In this framework, the feature vectors for training the classifier are the frequencies with which certain events (clusters) have been generated by a multinomial (p_1, ..., p_n), where p_i is the probability that event i occurs. A feature vector x = (x_1, ..., x_n) is a histogram, with x_i counting the number of times event i was observed.
So, this way, you train the classifier on some observations and let the trained classifier make decisions about the rest of the unlabeled data.
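A minimal sketch of a multinomial Naive Bayes classifier over such histograms might look like this. It is written from the textbook definition with Laplace smoothing (the alpha parameter is an assumption, not something stated above), and the histograms and class names are made up.

```python
import math
from collections import defaultdict

def train_multinomial_nb(histograms, labels, alpha=1.0):
    """Estimate a log prior and smoothed log event probabilities per class."""
    n_features = len(histograms[0])
    class_counts = defaultdict(int)
    feature_totals = defaultdict(lambda: [0] * n_features)
    for x, y in zip(histograms, labels):
        class_counts[y] += 1
        feature_totals[y] = [t + xi for t, xi in zip(feature_totals[y], x)]
    model = {}
    for y, totals in feature_totals.items():
        denominator = sum(totals) + alpha * n_features
        log_prior = math.log(class_counts[y] / len(labels))
        log_probs = [math.log((t + alpha) / denominator) for t in totals]
        model[y] = (log_prior, log_probs)
    return model

def predict_nb(model, x):
    """Pick the class maximizing log prior + sum_i x_i * log p_i."""
    def score(y):
        log_prior, log_probs = model[y]
        return log_prior + sum(xi * lp for xi, lp in zip(x, log_probs))
    return max(model, key=score)

# Made-up training histograms: counts of three events per observation.
histograms = [[5, 1, 0], [4, 2, 0], [0, 1, 5], [1, 0, 6]]
labels = ["A", "A", "B", "B"]

nb_model = train_multinomial_nb(histograms, labels)
print(predict_nb(nb_model, [3, 1, 0]), predict_nb(nb_model, [0, 0, 4]))
```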
Take a look at this blog post; I think it covers an interesting perspective on the question asked: https://medium.com/datadriveninvestor/can-all-classification-problems-be-solved-by-unsupervised-clustering-3a9f3e1f72c0
Let me know if you are in agreement with the author.