How to get corpus for NLP Projects?

Jyh Wee Sew

As a sample, record a foreign language lesson, transcribe the interactions, and classify the data into linguistic and nonlinguistic learning input. Each type of input has many small subcategories. The nonlinguistic type may be broken down into eye contact, gestures, direct pointing, positive face, negative face, and nonlinguistic sounds (e.g., claps, laughter). The linguistic type may include sentences, phrases, broken sentences, words, interjections, cultural rituals, linguistic cues, intonation, stress, and pitch.

Once you find out what you actually like to study, tape ten lessons and you will have enough data for a short written project. A corpus may consist of nonlinguistic input as well as linguistic input.
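The classification described above could be recorded in a small structure like the one below. This is only a sketch: the category names come from the answer, but the tag spellings, the `Annotation` record, and the sample events are all hypothetical and made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# Category names follow the answer above; the tag spellings are hypothetical.
NONLINGUISTIC = {
    "eye_contact", "gesture", "direct_pointing",
    "positive_face", "negative_face", "nonlinguistic_sound",
}
LINGUISTIC = {
    "sentence", "phrase", "broken_sentence", "word", "interjection",
    "cultural_ritual", "linguistic_cue", "intonation", "stress", "pitch",
}


@dataclass
class Annotation:
    """One coded event from a transcribed lesson (hypothetical record layout)."""
    lesson: int
    speaker: str
    category: str
    content: str


def input_type(category: str) -> str:
    """Map a subcategory tag to its top-level input type."""
    if category in LINGUISTIC:
        return "linguistic"
    if category in NONLINGUISTIC:
        return "nonlinguistic"
    raise ValueError(f"unknown category: {category}")


def tally(annotations):
    """Count events per input type and per subcategory."""
    by_type = Counter(input_type(a.category) for a in annotations)
    by_category = Counter(a.category for a in annotations)
    return by_type, by_category


# Made-up sample events, only to show the shape of the data.
sample = [
    Annotation(1, "teacher", "sentence", "Please open your books."),
    Annotation(1, "teacher", "direct_pointing", "points at the board"),
    Annotation(1, "student", "broken_sentence", "I go... yesterday"),
    Annotation(1, "class", "nonlinguistic_sound", "laughter"),
]
by_type, by_category = tally(sample)
print(by_type)
print(by_category)
```

Counting per category across the ten taped lessons would give a first quantitative overview before any closer analysis.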

Sruti Sahani

Thank you for your response. It would be helpful if you could provide more details on the inputs and outputs. Where can I get such inputs? Are there any links or sites available?

You can tape your own language interactions. You can scan your texts for words. There are also readily available corpus sets, some of them free. Perhaps your supervisor can help; otherwise, try a Google search. Start with a word that links to your interest or to a skill set you have.
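Scanning a text for words can be done with the Python standard library alone. The following is a minimal sketch, assuming plain text input; the tokenization rule (runs of letters, with optional internal apostrophes) and the function name `word_frequencies` are illustrative choices, not a standard.

```python
import re
from collections import Counter

# Runs of Unicode letters, optionally joined by straight apostrophes ("don't").
# This rule is a simplification and will not suit every language.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def word_frequencies(text: str) -> Counter:
    """Lowercase the text and count how often each word occurs."""
    return Counter(WORD_PATTERN.findall(text.lower()))


freqs = word_frequencies("The cat saw the other cat. Don't stop!")
print(freqs.most_common(3))
```

For real projects, a language-specific tokenizer is usually a better choice than a single regular expression.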

Thank you so much.

Carrie Demmans Epp

Where to look for existing corpora depends a lot on what type of NLP project you want to conduct. There are many corpora out there for a variety of languages. VoxForge provides community-built and community-maintained audio corpora (http://www.voxforge.org/). Some of the NLP competitions provide corpora (e.g., http://www.festvox.org/blizzard/). Data from Google Books could be used as a corpus for a selected language (https://developers.google.com/books/). Twitter has a way for you to get tweets (https://dev.twitter.com/rest/reference/get/search/tweets). It should be possible to get tweets in languages other than English.

Laurent Romary

More and more corpora are available for French at https://www.ortolang.fr.

Khalid Qenawy

As long as you are not interested in English, you have many common languages to choose from. Try to apply your research to a language that you know well. For example, you can record a conversation between two people and classify the speech in it according to discourse analysis of that language. I can help with this if you want topics, because I studied discourse analysis of the English language, and most languages share largely the same rules.

Miral Patel

You can check the Morpho Challenge, FIRE, and CLEF websites. These sites provide good repositories of data sets and corpora for a number of languages.

Karima Meftouh

You will find what you are looking for at this URL: several corpora of different sizes that you can use for different NLP applications.

http://opus.lingfil.uu.se/index.php
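Parallel corpora are often distributed as two plain-text files in which line N of one file is the translation of line N of the other. Assuming you have downloaded a corpus in such a line-aligned form, a minimal sketch for pairing the sentences could look like this; the function name `read_parallel` and the file names in the comment are hypothetical.

```python
from itertools import zip_longest


def read_parallel(source_lines, target_lines):
    """Pair line-aligned sentences, skipping pairs where either side is empty."""
    pairs = []
    for n, (src, tgt) in enumerate(zip_longest(source_lines, target_lines), start=1):
        if src is None or tgt is None:
            # One side ran out of lines first, so the alignment cannot be trusted.
            raise ValueError(f"inputs differ in length at line {n}")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# In-memory demo. With real files (hypothetical names) you would pass file
# objects instead, e.g. open("corpus.en", encoding="utf-8") and
# open("corpus.fr", encoding="utf-8").
english = ["Good morning.\n", "\n", "Thank you.\n"]
french = ["Bonjour.\n", "\n", "Merci.\n"]
pairs = read_parallel(english, french)
print(pairs)
```

Raising an error on a length mismatch is a deliberate choice here: silently truncating would hide a broken alignment.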

Omar Qawasmeh

You can check HAAD: Human Annotated Arabic Data set of Book Reviews for Aspect Based Sentiment Analysis. It was annotated following the SemEval annotation guidelines, and baseline results have already been computed for the data set.

It is available to download from: https://github.com/msmadi/HAAD

Abraha Girmay Hagos

Many African languages remain understudied in this area in various respects, for example Tigrigna, spoken in the Tigray region of Ethiopia.