From the NLP point of view, what are the non-obvious main functionalities of a text extraction mechanism?

Arturo Geigel

Marcio,

Other problems I have faced, to supplement the ones you have already given, are:

Quotes within text
Abbreviations such as e.g. and i.e., and how to deal with them
Footnotes

In some cases these are relevant and need to be taken into account for context; in others they are worthless. Since I am dealing with a particular type of document, I have several rules in place that decide whether to include them in the parsing step or exclude them altogether.

Another challenge I have faced when parsing engineering texts is whether or not to include measurements in a parse. Some of them give tolerances, and leaving those out takes the text out of context; in other cases they are just example measurements that can be obtained.
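
As a rough illustration of the measurement problem, the sketch below flags measurements and records whether each one carries a tolerance, so that a later rule can decide to keep or drop it. The pattern, the short unit list, and the name `find_measurements` are assumptions made for this example, not part of any particular parser.

```python
import re

# Hypothetical pattern: a number, an optional +/- tolerance, and a unit
# taken from a small illustrative list (extend it for a real corpus).
MEASUREMENT = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?:(?:±|\+/-)\s*(?P<tol>\d+(?:\.\d+)?)\s*)?"
    r"(?P<unit>mm|cm|m|kg|g|V|A)\b"
)

def find_measurements(text):
    """Return (value, tolerance or None, unit) for each measurement found."""
    return [
        (m.group("value"), m.group("tol"), m.group("unit"))
        for m in MEASUREMENT.finditer(text)
    ]
```

A measurement whose tolerance is `None` is a candidate for the "just an example" case, although that decision still depends on the document type.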

Also, depending on the field, knowing when to isolate segments can be really tricky. Scientific text is less challenging in this respect than novels, in which the structure itself may even shift as the text progresses.

Dariush Saberi

Hi,

Regarding the text-cleaning part of your question, we can make a list that includes your example and Arturo's:

1. Line feeds ("\n") can be invisible, and they break the parsing of the sentence.

2. Different types of brackets, you name them: ( ), [ ], { }. Of course, how to treat them depends on the text.

3. Encoding: sometimes there are characters in the text for which the encoding should be set beforehand. Even one single comma (,) can stop a whole system.

4. Some programming languages (e.g. Java, PHP) are sometimes confused by single quotes. It is better to use a regex-like style for them.

5. A mixture of the above points is also possible, for example a foreign name inside an English text with its own encoding and a single quote inside it.

6. My mentor does not agree that a really long sentence can confuse the parser, but I have seen it happen :) The question, though, is how long a "really long" sentence is, which I cannot answer.

So, in order to start the text extraction, I suggest the following could be beneficial:

1. Visually inspect a sample of the text to see if there are unwanted items such as comments in brackets and the like.

2. Set the encoding correctly or neutralize the text, if possible, e.g. by converting it to plain text in Notepad, Gedit, Nano, etc. (We are not talking about big data.)

3. Step #2 does not remove line feeds. If the system needs to process a single sentence at a time, there may still be line feeds left to remove.

4. Replace single quotes, etc. with their regex equivalents.
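
The encoding, line-feed, and quote steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the function name `clean_text` is hypothetical, UTF-8 is only a default guess for the encoding, and mapping typographic quotes to plain ASCII quotes is one possible reading of the quote-replacement step, not necessarily the exact replacement meant above.

```python
import re
import unicodedata

# Assumed mapping: typographic quotes to their plain ASCII counterparts.
QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})

def clean_text(raw: bytes, encoding: str = "utf-8") -> str:
    # Decode explicitly; undecodable bytes become U+FFFD instead of raising.
    text = raw.decode(encoding, errors="replace")
    # NFC composes characters so that accented letters compare consistently.
    text = unicodedata.normalize("NFC", text)
    # Normalize curly quotes before any quote-sensitive processing.
    text = text.translate(QUOTES)
    # Collapse line feeds and other whitespace runs into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Collapsing every line feed is only appropriate when line breaks carry no meaning; for text where paragraphs matter, the whitespace step would need to preserve blank lines.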

Marcio Ferreira Moreno,

Another problem is assuming well-constructed text. An example is checking for quotes where the text (either through OCR or user omission) does not include the closing or opening quote. This gets extremely interesting in CSVs!
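
One cheap sanity check for the missing-quote case is to flag lines that contain an odd number of quote characters. The sketch below is only a heuristic, and the function name `unbalanced_quote_lines` is made up for this example.

```python
def unbalanced_quote_lines(text, quote='"'):
    """Return 1-based numbers of lines with an odd number of quote characters."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.count(quote) % 2 == 1
    ]
```

A quoted CSV field may legitimately span several lines, so a flagged line is a candidate for inspection rather than a confirmed error.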

Another example is an abbreviation at the end of a sentence where the user did not capitalize the first word of the next sentence.
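
The abbreviation case cannot be settled by a regular expression alone, but the ambiguous spots can at least be located for review or for a later rule. In the sketch below, the abbreviation list and the name `flag_candidates` are illustrative assumptions.

```python
import re

# Illustrative list only; a real one would be domain-specific.
ABBREVIATIONS = ("etc.", "e.g.", "i.e.")

CANDIDATE = re.compile(
    r"\b(%s)\s+([a-z]\w*)" % "|".join(re.escape(a) for a in ABBREVIATIONS)
)

def flag_candidates(text):
    """Return (offset, abbreviation, next_word) for each abbreviation
    that is followed by a lowercase word."""
    return [(m.start(), m.group(1), m.group(2)) for m in CANDIDATE.finditer(text)]
```

Note that "e.g." followed by a lowercase word is usually not a sentence boundary, while "etc." followed by one may be, so the flagged positions still need context to resolve.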

While these are errors in the text, the point I learned is not to assume well-behaved text. While the problem is obvious, its solution is not.