We are trying to build a dataset from publicly accessible government data.

Unfortunately, all the data has been published in PDF format. To complicate matters, the documents are all in Arabic.

So far, we have used the Python PDFMiner package to extract the text to a file, but the results are somewhat inconsistent and very difficult to parse.
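For context, here is a minimal sketch of the kind of extraction and cleanup step involved. It assumes the pdfminer.six fork (its `extract_text` helper in `pdfminer.high_level`) and assumes that part of the inconsistency comes from Arabic presentation forms, tatweel, and bidi control marks in the extracted text, which may not be the whole story. The function names `normalize_arabic` and `extract_pdf_text` are hypothetical, chosen only for illustration.

```python
import re
import unicodedata

# Bidi control marks and tatweel (kashida) that often leak into extracted text.
_NOISE = re.compile("[\u200e\u200f\u202a-\u202e\u0640]")


def normalize_arabic(text: str) -> str:
    """Fold presentation forms to base letters and drop layout-only characters."""
    # NFKC maps positional glyph forms (initial, medial, final) and
    # ligatures back to the ordinary Arabic letters.
    text = unicodedata.normalize("NFKC", text)
    text = _NOISE.sub("", text)
    # Collapse runs of spaces and tabs; line breaks are kept for later parsing.
    return re.sub(r"[ \t]+", " ", text).strip()


def extract_pdf_text(path: str) -> str:
    """Extract text from one PDF and normalize it (assumes pdfminer.six is installed)."""
    from pdfminer.high_level import extract_text  # imported lazily

    return normalize_arabic(extract_text(path))
```

This does not attempt to fix reading order: if the PDF stores text in visual (left-to-right) order, the extracted lines may still come out reversed and would need separate handling.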

Any suggestions would be greatly appreciated.
