How to create a corpus for a Natural Language Processing task?

Hazem Tarash

https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html

Sayyed Usman Ahmed

Thanks, I will surely check and let you know.

Robert Kapitan

Hi Sayyed,

There are online collections of texts, some of them already annotated (e.g., Enron).

If you want to create a new set you can crawl documents using sets of keywords, for example.
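A rough sketch of the keyword idea, assuming the pages have already been downloaded as plain text. The document IDs, texts, and keywords below are hypothetical, and the crawling step itself is left out:

```python
import re

def matches_keywords(text, keywords):
    """Return True if any keyword occurs in the text as a whole word (case-insensitive)."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def build_corpus(documents, keywords):
    """Keep only the documents that mention at least one keyword."""
    return {
        doc_id: text
        for doc_id, text in documents.items()
        if matches_keywords(text, keywords)
    }

# Hypothetical documents standing in for pages returned by a crawler.
documents = {
    "doc1": "The energy market saw heavy trading this week.",
    "doc2": "Our cat sleeps on the sofa all day.",
    "doc3": "Trading desks reported record gas volumes.",
}
corpus = build_corpus(documents, ["trading", "gas"])
print(sorted(corpus))
```

Whole-word matching is one possible choice here; substring or stemmed matching would pull in more documents at the cost of more noise.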

Robert

Thanks for the reply.

Po Sang Yu

For beginners, you can try exploring your data with Voyant Tools. Just upload your document and see what happens.

https://voyant-tools.org/

For me, NLP depends on your target. I mostly work on semantic networks and frequency analysis, so tagging is not needed. But if you want to try sentiment analysis or similar things, you may need to learn R or Python.
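For the frequency-analysis case, a minimal Python sketch might look like this. It uses only the standard library; the tokenizer is a deliberately simple assumption (lowercase letters and apostrophes), and the sample texts are made up:

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(texts):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

# Made-up sample texts for illustration.
texts = [
    "The corpus is small.",
    "The corpus is clean and the format is plain.",
]
freq = word_frequencies(texts)
print(freq.most_common(3))
```

No tagging is involved, which matches the point above: raw counts only need tokenized text.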

Paul

Great help, Paul. Do you have any documentation for this tool that gives complete information about it? It looks wonderful.

https://voyant-tools.org/docs/#!/guide/start

Po Sang Yu, grateful to you.

Tullio Rizzini

Human language is oro-gestural. Read my works.

Simona M Ignat

I am doing this from scratch, and a human-based linguistic corpus should be tailored to the task(s). Processing the data takes a few stages. These could include removing real-world identifying marks to assure the privacy of subjects (according to GDPR regulations), codification, annotation, etc. A corpus (the word is singular in Latin) in general has a qualitative model of processing; corpora (plural) could use quantitative or mixed methods.
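As an illustration of the privacy stage only, here is a small Python sketch that masks e-mail addresses and phone-like numbers. The two patterns and the placeholder tags are assumptions for the example; pattern matching like this misses names and many other identifiers, so it is not enough on its own for GDPR purposes:

```python
import re

# Simplified patterns, assumed for illustration; real data needs broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def pseudonymize(text):
    """Replace e-mail addresses and phone-like numbers with placeholder tags."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text

sample = "Contact jane.doe@example.org or call +44 20 7946 0958."
print(pseudonymize(sample))
```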

Ge Lan

Interesting question: would you please elaborate on your NLP task so I can provide more suggestions?

In general, you need to think about (1) corpus size, which should be large enough, (2) the representativeness of the corpus design, and (3) using a machine-readable format.
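On point (3), one common machine-readable option is JSON Lines, with one document per line. The sketch below assumes that format and uses hypothetical field names (`id`, `text`, `source`); an in-memory buffer stands in for a real file:

```python
import io
import json

def write_jsonl(records, handle):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(handle):
    """Read the records back, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]

# Hypothetical records; the field names are assumptions for this example.
records = [
    {"id": "doc1", "text": "First document.", "source": "example"},
    {"id": "doc2", "text": "Zweites Dokument mit Umlauten: äöü.", "source": "example"},
]

buffer = io.StringIO()  # stands in for open("corpus.jsonl", "w", encoding="utf-8")
write_jsonl(records, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
print(len(restored))
```

Keeping metadata such as the source next to each text also helps when checking representativeness later.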

Musliadi Kh

I'm also doing research on this, and I'm trying to implement a Python library model, such as Stemming English
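To show what stemming does, here is a toy suffix-stripping function in plain Python. It is not the Porter algorithm and not any library's implementation; the suffix list and minimum stem length are assumptions chosen for the example. For real work, a tested stemmer such as NLTK's `PorterStemmer` is the safer choice:

```python
# Toy suffix list, checked in order; an assumption for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["Crawling", "crawled", "crawls", "text"]
stems = [naive_stem(w) for w in words]
print(stems)
```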