I have a dataset that consists of one text column (this is an NLP problem), four categorical feature columns, and one target column with almost 90,000 categories. The dataset also has millions of rows, and I have to use PySpark. I have researched multi-label classification and become familiar with approaches such as Binary Relevance, Classifier Chains, and Label Powerset. My company wants me to do multi-label classification in order to predict the 90,000 target categories.
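
For concreteness, here is a minimal sketch of the feature side in PySpark (all column names are placeholders for my actual schema, and it assumes Spark 3.x ML):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import (Tokenizer, HashingTF, IDF,
                                StringIndexer, OneHotEncoder, VectorAssembler)

spark = SparkSession.builder.appName("multilabel-features").getOrCreate()

# Placeholder column names; the real dataset has one text column
# and four categorical feature columns.
text_col = "text"
cat_cols = ["cat1", "cat2", "cat3", "cat4"]

# Text column -> TF-IDF vector.
tokenizer = Tokenizer(inputCol=text_col, outputCol="tokens")
tf = HashingTF(inputCol="tokens", outputCol="tf", numFeatures=1 << 18)
idf = IDF(inputCol="tf", outputCol="text_features")

# Categorical columns -> index -> one-hot vectors.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
            for c in cat_cols]
encoder = OneHotEncoder(inputCols=[c + "_idx" for c in cat_cols],
                        outputCols=[c + "_ohe" for c in cat_cols])

# Combine everything into a single feature vector.
assembler = VectorAssembler(inputCols=["text_features"] + [c + "_ohe" for c in cat_cols],
                            outputCol="features")

pipeline = Pipeline(stages=[tokenizer, tf, idf] + indexers + [encoder, assembler])
# model = pipeline.fit(df)          # df: DataFrame with the text and categorical columns
# features = model.transform(df)    # adds the assembled "features" column
```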

However, the multi-label classification examples I have seen only have four or five target categories. To me, predicting 90,000 target categories with methods like Binary Relevance, Classifier Chains, or Label Powerset seems impractical, if not impossible, once you consider the maintenance of 90,000 ML models, or of one model that tries to predict thousands of target categories at once.
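
To illustrate the scaling problem: Binary Relevance trains one binary classifier per label, so in PySpark it comes down to a loop like the sketch below, which with ~90,000 labels would produce ~90,000 fitted models. (The label names are hypothetical, and I assume a `features` DataFrame like the one from the pipeline above, with one 0/1 indicator column per label.)

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import functions as F

# Binary Relevance: one independent binary classifier per label.
# Shown for three placeholder labels; the real problem has ~90,000.
labels = ["label_a", "label_b", "label_c"]

models = {}
for label in labels:
    # Binary target: 1.0 if the row carries this label, else 0.0.
    train = features.withColumn("y", F.col(label).cast("double"))
    lr = LogisticRegression(featuresCol="features", labelCol="y")
    models[label] = lr.fit(train)  # one fitted model per label to store and maintain
```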

What is a suitable number of categories for multi-label classification?

How should I approach this problem?
