I want to represent characters or morphemes from words into vectors (like td idf or similar) so that when I cluster them they are together and some measure of distance is able to detect similarity.
If you want to represent morphemes or words as vectors, I recommend using Word2Vec.
Word2Vec
Word2vec is a family of algorithms commonly used to learn word embeddings.
How does it work? These are good starting points:
- https://www.quora.com/How-does-word2vec-work
- https://en.wikipedia.org/wiki/Word2vec
There is also a good paper on this by Mikolov et al., published at NIPS 2013 (https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf).
There are also several really nice libraries for this:
- Using DeepLearning4J (https://deeplearning4j.org/word2vec)
- Using TensorFlow (https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html)
- Original Google version (https://code.google.com/archive/p/word2vec/)
Roughly speaking, Word2vec comes in two main model architectures: Skip-Gram and Continuous Bag of Words (CBOW).
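To give an intuition for Skip-Gram: it trains on (center word, context word) pairs taken from a sliding window over the text. This is just a sketch of the pair-extraction step, not the full training algorithm; the function name `skipgram_pairs` and the window size are my own choices for illustration:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs, Skip-Gram style.

    For each position i, every other token within `window`
    positions of i becomes a context word for tokens[i].
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"], window=1)
# With window=1, "cat" pairs with its immediate neighbours "the" and "sat"
```

CBOW is the mirror image: it predicts the center word from the surrounding context words rather than the other way around.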
In addition to TF and IDF, several other weighting schemes can be applied, such as binary word occurrence in the document (i.e., the value is 1 if the word appears in the document and 0 otherwise). Another method uses the raw word counts in the document.
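Both of these simpler schemes are easy to sketch with the standard library; the helper names `binary_vector` and `count_vector` and the toy vocabulary here are just for illustration:

```python
from collections import Counter

def count_vector(doc_tokens, vocab):
    """Raw count of each vocabulary word in the document."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocab]

def binary_vector(doc_tokens, vocab):
    """1 if the vocabulary word appears in the document, else 0."""
    present = set(doc_tokens)
    return [1 if w in present else 0 for w in vocab]

vocab = ["cat", "dog", "sat"]
doc = ["cat", "sat", "cat"]
print(count_vector(doc, vocab))   # [2, 0, 1]
print(binary_vector(doc, vocab))  # [1, 0, 1]
```

Once every document is mapped to such a vector over a shared vocabulary, you can cluster them with any distance measure (e.g., cosine or Euclidean), which is what you asked for.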