I am adopting a dimensionality reduction method for cross-lingual information retrieval and classification. I am mainly interested in data sets with many classes, written in widely spoken languages.
Look at the data sets used in the Cross-Language Evaluation Forum (CLEF) and the Text Retrieval Conference (TREC). NTCIR had some data for Asian languages, if you are also interested in those.
Thank you very much for your answers. I have decided to use the Reuters Multilingual corpus with machine translation; I believe it will be suitable for the research that I plan to conduct.