Nowadays there are plenty of core technologies for text classification (TC). Among all the machine learning approaches, which one would you suggest for training models for a new language and a vertical domain (like Sports, Politics, or Economy)?
I apologize for this not being an expert reply, but I am asking the same questions as you!
I'm currently exploring the use of an RNN: first training it as a language model on a large, representative data set, then reusing that pre-trained model as the backbone of a classifier. I am following the new fast.ai course for this inspiration: http://forums.fast.ai/t/welcome-to-part-1-v2/5787 - see the lesson 4 video for an idea of what's involved.
The GitHub repo is here: https://github.com/fastai/fastai
See LanguageModelData in nlp.py - this is what is used in lesson 4 - and lm_rnn.py, where the key class RNN_Encoder is explained like this:
"A custom RNN encoder network that uses - an embedding matrix to encode input, - a stack of LSTM layers to drive the network, and - variational dropouts in the embedding and LSTM layers The architecture for this network was inspired by the work done in "Regularizing and Optimizing LSTM Language Models". (https://arxiv.org/pdf/1708.02182.pdf)
Also see this paper explaining the idea: "Universal Language Model Fine-tuning for Text Classification" (https://arxiv.org/abs/1801.06146).
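Roughly, the shape of that two-stage idea looks like this in plain PyTorch (just a sketch, not the fastai implementation; the class names, dimensions, and checkpoint file are all hypothetical):

```python
import torch
import torch.nn as nn

# Stage 1 backbone: embedding + LSTM stack, first trained as a language
# model (predicting the next token) on a large unlabeled corpus.
class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        return self.lstm(self.embedding(tokens))

# Stage 2: reuse the pre-trained encoder under a small classification head.
class Classifier(nn.Module):
    def __init__(self, encoder, hidden_dim, num_classes):
        super().__init__()
        self.encoder = encoder                 # weights come from LM pre-training
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):
        outputs, _ = self.encoder(tokens)
        return self.head(outputs[:, -1])       # classify from the last hidden state

encoder = Encoder(vocab_size=30000)
# encoder.load_state_dict(torch.load("lm_encoder.pt"))  # hypothetical LM checkpoint
model = Classifier(encoder, hidden_dim=512, num_classes=3)
logits = model(torch.randint(0, 30000, (8, 40)))       # 8 documents, 40 tokens each
```

The appeal of this setup for a new language or vertical domain is that stage 1 needs only unlabeled text, which is usually far easier to collect than labeled examples.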
If you get a great answer from somewhere else please post it here - I would love more information to help me in my endeavours as well :)
The state of the art for most text classification applications relies on embedding your text in real-valued vectors; see the paper "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013).
The gensim package is popular for training word vectors on data: https://radimrehurek.com/gensim/models/word2vec.html
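Training vectors with gensim looks roughly like this (a minimal sketch; the toy corpus is a stand-in for your own data, and parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# Each document is a list of tokens; a real corpus would be far larger.
corpus = [
    ["the", "match", "ended", "in", "a", "draw"],
    ["the", "government", "announced", "a", "new", "budget"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep even rare words in this toy corpus
    workers=4,
)

vector = model.wv["match"]                        # 100-dim vector for "match"
similar = model.wv.most_similar("match", topn=3)  # nearest words by cosine
```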
This method relies on having a rich, diverse collection of words and contexts, which your data may not provide on its own. Thus it's popular to initialize your embedding matrix from pre-trained word vectors like word2vec or fastText; in some cases these work out of the box, in some you'll want to continue training the vectors on your dataset, and in others it's better to train on your data alone.
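Initializing an embedding matrix from pre-trained vectors might look like this (a sketch assuming gensim and PyTorch; the file name and vocabulary are hypothetical placeholders):

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models import KeyedVectors

# Load pre-trained vectors; the file name here is a hypothetical placeholder.
pretrained = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

vocab = ["the", "match", "budget", "unseenword"]  # your corpus vocabulary
dim = pretrained.vector_size

# Copy rows for words the pre-trained model knows; random init for the rest.
matrix = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
for i, word in enumerate(vocab):
    if word in pretrained:
        matrix[i] = pretrained[word]

# freeze=False lets the vectors keep training on your own data.
embedding = nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=False)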
The great thing about embedding methods is that they don't care about language: you can create an embedding for any language, or really for any sequential data, and it endows the discrete tokens with a sort of 'meaning'.
Once you have richer features from your embedding matrix, you can feed them into a classifier. This can be as simple as softmax regression, which assigns probabilities to discrete classes, or as complex as an RNN/LSTM, which can ultimately do the same but is built for sequential data.
The choices you make here depend more heavily on what specific problem you're trying to solve, but here are a few examples:
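For instance, a simple but often strong baseline averages each document's word vectors and trains softmax regression on top (a minimal sketch with toy data, not from the original answer; gensim and scikit-learn assumed):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy labeled corpus; a real data set would have thousands of documents.
docs = [
    ["the", "match", "ended", "in", "a", "draw"],
    ["the", "team", "won", "the", "final"],
    ["the", "government", "announced", "a", "new", "budget"],
    ["parliament", "debated", "the", "new", "tax"],
]
labels = ["sports", "sports", "politics", "politics"]

wv = Word2Vec(docs, vector_size=50, window=3, min_count=1).wv

def doc_vector(tokens):
    """Average the vectors of the tokens the embedding model knows."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

X = np.stack([doc_vector(d) for d in docs])
clf = LogisticRegression()  # softmax regression over the document vectors
clf.fit(X, labels)
print(clf.predict([doc_vector(["the", "budget", "debate"])]))
```

If your documents are long or word order matters a lot for your labels, swapping the averaging step for an RNN/LSTM encoder is the natural next step up in complexity.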