How can machine learning techniques be used to extract tables from scanned document images?

A. N K Zaman

http://www.mathworks.com/help/vision/ref/ocr.html

Gerro J. Prinsloo

Dear Sabari

This is a very good idea, as one often finds research articles with interesting tables in .pdf format or as scanned documents that have to be retyped in LaTeX or Excel.

One can convert a scanned PDF to Excel using OCR software of this type:

https://www.cogniview.com/pdf-to-excel/pdf2xl-ocr

However, I assume you want to use ML to create your own conversion software. This is an excellent idea and I will be the first to use your solution.

For ANN-based OCR of content in table format, this is a potential solution:

http://arxiv.org/pdf/1302.1700.pdf

regards

Gerro

Gerson Flavio Mendes de Lima

Optical Character Recognition systems enable several applications, e.g. automatic character recognition in printed texts. For the success of such systems, reliable segmentation is an essential stage. This chapter presents two approaches to segmentation: the SLPTEO for segmentation of text lines and words, and SCORC for character segmentation. The first is applied to printed texts, but can be also applied to handwritten texts. The second handles printed overlapping and touching characters, working directly on grayscale images. Experimental results show great robustness of the methods presented.

SLPTEO e SCORC: Abordagens para Segmentação de Linhas, Palavras e Caracteres em Textos Impressos. Available from: https://www.researchgate.net/publication/256088532_SLPTEO_e_SCORC_Abordagens_para_Segmentao_de_Linhas_Palavras_e_Caracteres_em_Textos_Impressos [accessed Jun 16, 2015].

Salman Ghanvi

Interesting.

I think the shape of the table can be recovered using cluster analysis before running OCR on the individual clusters.
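A minimal sketch of one way to read this suggestion, not a definitive implementation. It assumes that word or text-block bounding boxes `(x, y, w, h)` have already been found by some earlier step, that the page is deskewed, and that columns are left-aligned. The function names and the `max_gap` threshold are hypothetical. Box positions are clustered along each axis with a simple gap rule, and each box is assigned to a (row, column) cluster that could then be passed to OCR.

```python
def cluster_1d(values, max_gap):
    """Single-linkage clustering of numbers: start a new cluster whenever
    the gap between sorted neighbours exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def group_boxes_into_grid(boxes, max_gap=15):
    """Assign each (x, y, w, h) box to a (row, column) cluster.

    Assumption: left edges line up within a column and top edges line up
    within a row, to within max_gap pixels.
    """
    x_clusters = cluster_1d([b[0] for b in boxes], max_gap)
    y_clusters = cluster_1d([b[1] for b in boxes], max_gap)

    def index_of(value, clusters):
        # Each cluster is sorted, so its first and last items are its bounds.
        for i, cluster in enumerate(clusters):
            if cluster[0] <= value <= cluster[-1]:
                return i
        raise ValueError("value not found in any cluster")

    grid = {}
    for box in boxes:
        key = (index_of(box[1], y_clusters), index_of(box[0], x_clusters))
        grid.setdefault(key, []).append(box)
    return grid
```

Right-aligned or centred columns would need a different coordinate (right edge or centre), and a general-purpose clustering method could replace the gap rule.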

Mohammad Sabokrou

If you insist on using ML, I think you can (1) recognise the cells of the table using line detection algorithms, and then (2) OCR the text in each cell using a method such as a neural network (NN).
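Step (1) can be sketched as follows for the simplest case. This is only an illustration under stated assumptions: the image is already binarised, deskewed, and cropped to the table, and the ruling lines run across most of the crop. It uses projection profiles rather than a Hough transform, and the function names and `min_fraction` threshold are hypothetical. Step (2), running an OCR engine or NN classifier on each cell crop, is not shown.

```python
import numpy as np


def _run_centers(mask):
    """Collapse each run of consecutive True values into its centre index."""
    centers, start = [], None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            centers.append((start + i - 1) // 2)
            start = None
    if start is not None:
        centers.append((start + len(mask) - 1) // 2)
    return centers


def find_ruling_lines(ink, min_fraction=0.8):
    """ink: 2D boolean array, True where a pixel is dark.

    A row (column) counts as part of a ruling line when at least
    min_fraction of its pixels are dark.
    Returns (row_indices, col_indices) of the detected lines.
    """
    ink = np.asarray(ink, dtype=bool)
    rows = _run_centers(ink.mean(axis=1) >= min_fraction)
    cols = _run_centers(ink.mean(axis=0) >= min_fraction)
    return rows, cols


def cell_boxes(rows, cols):
    """(top, bottom, left, right) for every cell between consecutive lines."""
    return [(t, b, l, r)
            for t, b in zip(rows, rows[1:])
            for l, r in zip(cols, cols[1:])]
```

Each returned box can be used to crop the page, for example `page[t:b, l:r]`, before recognition. Tables with partial or missing borders will not be found by this sketch.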

Ralf Vandenhouten

Dear Sabari Nathan,

the most significant properties of a line being a part of a table are

a) that it consists of non-continuous text flow, i.e. having sequences of characters alternating with intervals of empty space, and

b) that the intensity distribution (distribution of black and white pixels) of the line has a strong correlation with the distribution of the line above and the line below.

Both properties can be measured easily and can be used as a feature vector for classification in order to decide if the line is part of a table or not, e.g. using thresholds or any kind of machine learning technique.

If you are also looking for tables that do NOT cover the whole width of the sheet, you have to apply the above method to smaller sections of the lines.
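The two properties above can be measured roughly as follows. This is a sketch under assumptions, not a tested recipe: the page has already been binarised and split into text-line strips of equal width, property (a) is taken as the number of wide empty gaps inside the line, and property (b) as the Pearson correlation of column-wise ink profiles with the neighbouring lines. All names and the `min_gap` value are hypothetical.

```python
import numpy as np


def column_profile(strip):
    """Fraction of dark pixels in each pixel column of a text-line strip."""
    return np.asarray(strip, dtype=bool).mean(axis=0)


def count_wide_gaps(profile, min_gap=20):
    """Property (a): runs of empty columns, at least min_gap wide, that lie
    between the first and last inked column of the line."""
    empty = profile == 0
    inked = np.flatnonzero(~empty)
    if inked.size == 0:
        return 0
    gaps, run = 0, 0
    for is_empty in empty[inked[0]:inked[-1] + 1]:
        if is_empty:
            run += 1
        else:
            if run >= min_gap:
                gaps += 1
            run = 0
    return gaps


def profile_correlation(p, q):
    """Property (b): Pearson correlation of two profiles (0.0 if one is flat)."""
    if p.std() == 0 or q.std() == 0:
        return 0.0
    return float(np.corrcoef(p, q)[0, 1])


def line_features(strips, i, min_gap=20):
    """Feature vector [wide gaps, corr. with line above, corr. with line below]."""
    profiles = [column_profile(s) for s in strips]
    above = profile_correlation(profiles[i], profiles[i - 1]) if i > 0 else 0.0
    below = (profile_correlation(profiles[i], profiles[i + 1])
             if i + 1 < len(strips) else 0.0)
    return [count_wide_gaps(profiles[i], min_gap), above, below]
```

The resulting vectors can be compared against thresholds or passed to any classifier. For tables narrower than the page, the same functions can be applied to horizontal sections of each strip.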

Good luck!

Pratap Muthukrishnan

I am new to ML, but I am also looking for ML-based options for optical table recognition. I found these interesting articles, though I have yet to start working with them. Hope this helps.

https://blog.goodaudience.com/table-detection-using-deep-learning-7182918d778
http://code.activestate.com/recipes/580635-how-to-parse-a-table-in-a-pdf-document/
https://github.com/WZBSocialScienceCenter/pdftabextract
https://github.com/ulikoehler/OTR
http://stanford.edu/~mgorkove/cgi-bin/rpython_tutorials/Using_Python_to_Extract_Tables_From_PDFs.php