Hello, I am trying to find a dataset of canonical suffix forms for part-of-speech tags. For example, -ed -> VBD (past tense), -ing -> VBG. Is there any dataset like this available?
Hello, I didn't find any existing dataset like this, but you can create one manually using "Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision)" by Beatrice Santorini.
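A hand-built table could look roughly like the Python sketch below. The suffix list is my own illustrative selection, not a mapping taken from the guidelines, and `SUFFIX_TAGS` and `candidate_tags` are made-up names. Suffixes are ambiguous (for instance -ed can mark VBD or VBN, and -s can mark NNS or VBZ), so each suffix maps to a list of candidate Penn Treebank tags rather than a single one.

```python
# Hypothetical hand-built suffix table; the tags are Penn Treebank tags.
# Each suffix maps to the tags it commonly signals, since suffixes are ambiguous.
SUFFIX_TAGS = {
    "ing": ["VBG"],
    "est": ["JJS"],
    "ed": ["VBD", "VBN"],
    "er": ["JJR"],
    "ly": ["RB"],
    "s": ["NNS", "VBZ"],
}


def candidate_tags(word):
    """Return candidate tags for the longest matching suffix, or [] if none match."""
    word = word.lower()
    # Try longer suffixes first so "est" wins over "s" where both could apply.
    for suffix in sorted(SUFFIX_TAGS, key=len, reverse=True):
        # Require at least two characters of stem so very short words do not match.
        if len(word) > len(suffix) + 1 and word.endswith(suffix):
            return SUFFIX_TAGS[suffix]
    return []


print(candidate_tags("walked"))
print(candidate_tags("running"))
```

A lookup like this only proposes candidates. Words such as "morning" or "tired" will still match a suffix without carrying the corresponding tag, so the table needs checking against tagged data.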
I don't know of anything out there, but I would suggest creating a corpus, or using an existing one such as the Brown corpus if you don't have the time. Keep an untagged copy of the corpus alongside a tagged one. Run some code on the untagged copy to extract candidate suffixes (I have some dirty R code here that at least works: https://github.com/marcjoneslang/corpus_affix). Then check the suffixes you found against the tagged corpus, which should hopefully give you what you want.
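To make the checking step concrete, here is a minimal Python sketch (not a port of the R code linked above). It counts how often each word ending co-occurs with each tag in a tagged word list and reports the dominant tag per ending. The `TAGGED` sample is made-up data with Penn Treebank tags, and `suffix_tag_counts`, `best_tag`, `max_len`, and `min_count` are hypothetical names chosen for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged sample (hypothetical data, Penn Treebank tags).
TAGGED = [
    ("walked", "VBD"), ("jumped", "VBD"), ("tired", "JJ"),
    ("running", "VBG"), ("singing", "VBG"), ("morning", "NN"),
    ("quickly", "RB"), ("slowly", "RB"),
]


def suffix_tag_counts(tagged_words, max_len=3):
    """Count how often each word ending of 1..max_len characters occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        word = word.lower()
        for n in range(1, max_len + 1):
            if len(word) > n:  # leave at least one character of stem
                counts[word[-n:]][tag] += 1
    return counts


def best_tag(counts, suffix, min_count=2):
    """Return (most frequent tag, its share of occurrences), or None if the suffix is too rare."""
    tag_counts = counts.get(suffix)
    if not tag_counts or sum(tag_counts.values()) < min_count:
        return None
    total = sum(tag_counts.values())
    tag, freq = tag_counts.most_common(1)[0]
    return tag, freq / total


counts = suffix_tag_counts(TAGGED)
print(best_tag(counts, "ing"))
print(best_tag(counts, "ed"))
print(best_tag(counts, "ly"))
```

On a real corpus you would replace `TAGGED` with the corpus's (word, tag) pairs and raise `min_count` so rare endings are ignored. Note that the Brown corpus uses its own tagset, so its tags would need mapping if you want Penn Treebank labels such as VBD and VBG.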