Most digital versions of published papers use PostScript or PDF as their document format, which helps ensure that the document looks as intended on various platforms and in print. One drawback is that the text is just a large array of glyphs to be placed on a page; semantics, such as whether a piece of text is a heading, a caption, text in a figure, or a reference, are lost.

This causes problems when you want to make automated use of a paper, e.g., to extract the references it contains, enable full-text search, or generate BibTeX entries. For example, for inclusion in the ACM Digital Library you have to provide the LaTeX source and BibTeX file so that they can obtain the proper metadata in the first place. Google Scholar or RG seem to invest a lot of effort in figuring out which papers are yours, who is cited by whom, etc.
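If the metadata were available in structured form alongside the PDF, tasks like generating a BibTeX entry would become trivial. A minimal sketch in Python (the metadata dict and the `to_bibtex` helper are purely illustrative, not any existing library's API):

```python
# Hypothetical structured metadata for one paper -- exactly the kind of
# information that a plain PDF's glyph stream does not carry.
metadata = {
    "key": "doe2020example",
    "author": "Jane Doe and John Smith",
    "title": "An Example Paper",
    "journal": "Journal of Examples",
    "year": "2020",
}

def to_bibtex(entry):
    """Format a metadata dict as a BibTeX @article entry."""
    fields = ",\n".join(
        f"  {k} = {{{v}}}" for k, v in entry.items() if k != "key"
    )
    return f"@article{{{entry['key']},\n{fields}\n}}"

print(to_bibtex(metadata))
```

Today, by contrast, tools have to reverse-engineer this information from the rendered text, with all the errors that entails.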

There has been much research on topics like the Semantic Web; are there already solutions to this problem that could be used for research documents at scale? To improve the situation, why not introduce a new data format (or simply embed this metadata in the PDF) and make this information mandatory at publication time?
