Can I cluster documents to label them as a first step? Then, as a second step, can I use the labelled documents to train a classifier such as SVM, kNN, etc.?
Classification and clustering are two pattern-identification methods used in machine learning. Although the techniques have some similarities, they differ in that classification assigns objects to predefined classes, while clustering groups objects by the characteristics they have in common and that distinguish them from other groups of objects. These groups are known as "clusters".
I ask because the classification will be based on the labels produced by the clustering step and, as we all know, clustering depends on the data distribution.
I mean, is there any difference between human labelling and using clustering for labelling?
Usually, after grouping the instances/records with a clustering algorithm, a manual labeling step is still needed. In this sense, all instances of a specific cluster receive the same label (class).
Once you have fully labeled data, you can train a classifier, e.g., an SVM. Then evaluate its prediction quality using, for instance, stratified 10-fold cross-validation. This lets you assess the SVM's performance with different evaluation metrics such as accuracy, the Kappa statistic, the ROC curve, etc.
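The evaluation step could look roughly like the sketch below, again with scikit-learn. The corpus is synthetic (word combinations generated only so that each class has enough documents for 10 folds), and a linear SVM on TF-IDF features is one possible choice, not the only one. Only accuracy and Cohen's kappa are computed here.

```python
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic labeled corpus: every 3-word combination gives 10 documents per class.
cat_words = ["cat", "kitten", "purr", "whiskers", "meow"]
dog_words = ["dog", "puppy", "bark", "fetch", "leash"]
cat_docs = [" ".join(c) for c in combinations(cat_words, 3)]
dog_docs = [" ".join(c) for c in combinations(dog_words, 3)]
docs = cat_docs + dog_docs
labels = ["cat"] * len(cat_docs) + ["dog"] * len(dog_docs)

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so no information leaks from the held-out fold.
model = make_pipeline(TfidfVectorizer(), LinearSVC())

# Stratified 10-fold CV: each fold preserves the class proportions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
predicted = cross_val_predict(model, docs, labels, cv=cv)

accuracy = accuracy_score(labels, predicted)
kappa = cohen_kappa_score(labels, predicted)
print(f"accuracy={accuracy:.2f}, kappa={kappa:.2f}")
```

On real data the labels would come from your manual labeling, and the scores would tell you how well the classifier reproduces it.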
Yes, you are right. The manual labeling needs to be done at a very early stage. Then the machine learning algorithm has to learn the human labeling strategy.
LDA is an unsupervised learning algorithm that can be used for topic modeling. You can extract a collection of topics from texts using LDA, then label these topics manually using domain knowledge.
Automatic labeling depends on manual labeling, which must be done first. We then train machine learning algorithms to learn our labeling so that they can label future data automatically.
Manual labeling depends on how you use your knowledge to label your data. For instance, you can label your data as either 'cat' or 'dog' because you know that these categories are present in your data.
Yes, topic modeling techniques (e.g., the classical Latent Dirichlet Allocation (LDA) algorithm) allow you to extract text-related topics. Once you have these topics, you can understand them and assign appropriate labels manually to each topic.
Automatic labeling depends on manual labeling (i.e., the training set), which the machine learning algorithms need to learn (try several algorithms to find the one with the best predictive capability). Thus, make sure your manual labeling is as accurate as possible.
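One way to try several algorithms is to score each of them with the same cross-validation split and compare the means. The sketch below does this with scikit-learn for a linear SVM, kNN and multinomial naive Bayes; the candidate list, the synthetic corpus and the 5-fold setting are all assumptions chosen for illustration.

```python
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for a manually labeled training set.
cat_words = ["cat", "kitten", "purr", "whiskers", "meow"]
dog_words = ["dog", "puppy", "bark", "fetch", "leash"]
docs = [" ".join(c) for c in combinations(cat_words, 3)]
docs += [" ".join(c) for c in combinations(dog_words, 3)]
labels = ["cat"] * 10 + ["dog"] * 10

# Candidate classifiers, all fed the same TF-IDF features.
candidates = {
    "SVM": LinearSVC(),
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "NaiveBayes": MultinomialNB(),
}

# Use the same folds for every candidate so the comparison is fair.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, classifier in candidates.items():
    model = make_pipeline(TfidfVectorizer(), classifier)
    scores = cross_val_score(model, docs, labels, cv=cv, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean accuracy = {results[name]:.2f}")

# Pick the candidate with the highest mean accuracy.
best = max(results, key=results.get)
print("best:", best)
```

Whichever model wins, it can only be as good as the labels it was trained on, which is the point about labeling accuracy.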