I want to do a Natural Language Processing project for any language other than English. Any ideas are highly appreciated. I am also struggling to find input data (a corpus), so please suggest some sources.
As a sample project: record a foreign-language lesson, transcribe the interactions, and classify the data into linguistic and nonlinguistic learning input. Each type of input has many small subcategories. The nonlinguistic type may be broken into eye contact, gestures, direct pointing, positive face, negative face, and nonlinguistic sounds, e.g. claps and laughter. The linguistic type may include sentences, phrases, broken sentences, words, interjections, cultural rituals, linguistic cues, intonation, stress, and pitch.
Once you find out what you actually want to study, tape ten lessons and you will have enough data for a short written project. The corpus may include nonlinguistic input as well as linguistic input.
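To make the classification step concrete, here is a minimal sketch of how annotated transcript rows could be tallied by input type. The label names, the (speaker, label, note) row layout, and the sample rows are all assumptions made up for illustration; adapt them to whatever annotation scheme you settle on.

```python
from collections import Counter

# Hypothetical label sets, loosely based on the subcategories listed above.
NONLINGUISTIC = {
    "eye_contact", "gesture", "pointing",
    "positive_face", "negative_face", "nonlinguistic_sound",
}
LINGUISTIC = {
    "sentence", "phrase", "broken_sentence", "word", "interjection",
    "cultural_ritual", "linguistic_cue", "intonation", "stress", "pitch",
}


def input_type(label):
    """Map a subcategory label to its top-level input type."""
    if label in LINGUISTIC:
        return "linguistic"
    if label in NONLINGUISTIC:
        return "nonlinguistic"
    raise ValueError(f"unknown label: {label}")


def summarize(annotations):
    """Count annotated events per input type and per subcategory.

    annotations: iterable of (speaker, label, note) tuples.
    """
    by_type = Counter()
    by_label = Counter()
    for _speaker, label, _note in annotations:
        by_type[input_type(label)] += 1
        by_label[label] += 1
    return by_type, by_label


# Made-up sample rows, standing in for one transcribed lesson.
sample = [
    ("teacher", "sentence", "greets the class"),
    ("teacher", "gesture", "waves hand"),
    ("student", "word", "one-word answer"),
    ("student", "nonlinguistic_sound", "laughter"),
]

by_type, by_label = summarize(sample)
print(dict(by_type))
print(dict(by_label))
```

Running the same tally over each of the ten taped lessons would give you comparable counts per lesson, which is usually enough to start a short written analysis.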
Thank you for your response. It would be helpful if you could provide more details on the input and output. Where can I get such inputs? Are there any links or sites available?
You can tape your own language interactions. You can scan your texts for words. There are also readily available corpora, some of them free. Perhaps your supervisor can help; otherwise, try a Google search. Start with a keyword linked to your interests or to the skill set you have.
Where to look for existing corpora depends a lot on what type of NLP project you want to conduct. There are many corpora out there for a variety of languages. VoxForge provides community-built and community-maintained audio corpora (http://www.voxforge.org/). Some of the NLP competitions provide corpora (e.g., http://www.festvox.org/blizzard/). Data from Google Books could be used as a corpus for a selected language (https://developers.google.com/books/). Twitter has a way for you to get tweets (https://dev.twitter.com/rest/reference/get/search/tweets). It should be possible to get tweets in languages other than English.
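For the Google Books suggestion, here is a rough sketch of how a language-restricted search request could be built. It assumes the public Books API v1 volumes endpoint with its `q`, `langRestrict`, and `maxResults` query parameters; the helper name `books_query_url` and the sample query are made up for illustration. Check the API documentation for quotas and for how much text a result actually returns before relying on it as a corpus source.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Google Books API v1 volume search.
BOOKS_ENDPOINT = "https://www.googleapis.com/books/v1/volumes"


def books_query_url(query, lang, max_results=20):
    """Build a volume-search URL restricted to one language.

    query: free-text search terms
    lang: two-letter language code, e.g. "fr" for French
    """
    params = {"q": query, "langRestrict": lang, "maxResults": max_results}
    return BOOKS_ENDPOINT + "?" + urlencode(params)


# Hypothetical example: French-language volumes matching "histoire".
url = books_query_url("histoire", "fr")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the response is JSON describing the matching volumes.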
As long as you are not interested in English, you have many common languages to choose from. Try to apply your research to a language that you know well. For example, you can record a conversation between two people and classify the speech in that conversation according to discourse analysis for that language. I can help with this if you want topics, because I studied discourse analysis for the English language, and most languages largely share the same rules.
You can check HAAD: Human Annotated Arabic Dataset of Book Reviews for Aspect Based Sentiment Analysis. It was annotated following the SemEval annotation guidelines, and baseline results have already been computed for the dataset.
It is available for download from: https://github.com/msmadi/HAAD