How to parse medical research papers' main text body? #1075

bounaux · 2023-12-31T22:31:45Z

Hi,

Any suggestions for parsing medical research papers? For example, if I download this research paper, https://www.nature.com/articles/s41421-022-00392-4#Sec1, I'd like to strip out the header/footer, author names, and other non-essential text, and keep only the text body. Any suggestions for how to do this, please? I'd like to find a generalized way to parse PDFs like this and minimize the amount of hard-coding if possible. Thank you!

petermr · 2024-01-02T21:46:35Z

This is a hard problem in general. I have been working on it for years and am close to having a first pass at a general solution which I hope to post here.
However, the link that you give is to HTML text, not a PDF. (There IS a link to the PDF, but it needs downloading.)
In general it is much easier to parse and mine the HTML than the PDF.
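
As a rough illustration of why the HTML route is easier, here is a minimal sketch using only the Python standard library. It keeps the text of `<p>` elements and skips anything nested in tags that usually hold page furniture. The `SKIP_TAGS` set and the function name `extract_body_paragraphs` are assumptions made for this example; real publisher pages will likely need site-specific adjustments.

```python
from html.parser import HTMLParser

# Tags whose content is treated as page furniture.
# This set is an assumption for illustration; adjust it per site.
SKIP_TAGS = {"header", "footer", "nav", "aside", "script", "style", "figure"}


class BodyTextParser(HTMLParser):
    """Collect the text of <p> elements that are not inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside any SKIP_TAGS element
        self.in_paragraph = False
        self.current = []          # text fragments of the paragraph being read
        self.paragraphs = []       # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.current = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.current.append(data)


def extract_body_paragraphs(html):
    """Return a list of paragraph strings found outside SKIP_TAGS."""
    parser = BodyTextParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Feeding it the downloaded page source returns a list of paragraph strings; author lists and reference sections that are marked up as plain `<p>` elements would still need a per-site filter.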

If you want to mine the PDF then consider:

PDF has no headers, no authors, no tables. It effectively only has chars, graphics paths, hyperlinks/annots and bitmap images. No words, no tables, no graphs. All those constructs are created by your brain. We have to reassemble these by ML and heuristics.
HTML may have some of these constructs added by NIH in JATS XML. There's no real standard.
There are hundreds of publishers and everyone uses a different layout.

So there is no general solution. Most pdf2html or pdf2txt tools discard the fonts and styles and often the indentation. This destroys a lot of the context.

I use pdfplumber as the primary parser and then build up the document structure and its semantics through heuristics.
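
To make the "parser plus heuristics" idea concrete, here is a minimal sketch, not the code referred to above. It uses pdfplumber's `page.crop` and `page.extract_words(extra_attrs=[...])` to read words together with their font size, drops a fixed band at the top and bottom of each page, and keeps only words set in the most common font size. The helper names, the 8% header/footer fractions and the 0.5 pt tolerance are all assumptions chosen for illustration.

```python
from collections import Counter


def load_words(pdf_path, header_frac=0.08, footer_frac=0.08):
    """Read words with font info, skipping assumed header/footer bands."""
    import pdfplumber  # third-party; imported here so the helpers below run without it

    words = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            x0, top, x1, bottom = page.bbox
            band_top = top + page.height * header_frac
            band_bottom = bottom - page.height * footer_frac
            body = page.crop((x0, band_top, x1, band_bottom))
            for word in body.extract_words(extra_attrs=["fontname", "size"]):
                word["page"] = page_number
                words.append(word)
    return words


def dominant_font_size(words):
    """Font size (rounded to 0.1 pt) that covers the most characters."""
    sizes = Counter()
    for word in words:
        sizes[round(word["size"], 1)] += len(word["text"])
    return sizes.most_common(1)[0][0]


def body_text(words, tolerance=0.5):
    """Keep words whose size is within `tolerance` of the dominant size."""
    body_size = dominant_font_size(words)
    kept = [w["text"] for w in words if abs(w["size"] - body_size) <= tolerance]
    return " ".join(kept)
```

A call such as `body_text(load_words("paper.pdf"))` (the file name is hypothetical) would return the running text as one string. The font-size filter is only a heuristic: it will also drop legitimate body content set in a different size, such as block quotes or small-print methods sections.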

You might want to look at GROBID (a Java-based ML tool based on scholarly literature), which may solve some of your problems.
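
For anyone trying GROBID, here is a hedged sketch of one way to use it from Python. It assumes a GROBID server is already running locally on its default port 8070 and uses the REST endpoint `/api/processFulltextDocument`, which returns TEI XML. The function names `fetch_tei` and `body_sections` are made up for this example, and the parsing assumes the usual TEI layout of `text/body/div` with `head` and `p` children.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}


def fetch_tei(pdf_path, base_url="http://localhost:8070"):
    """Send a PDF to a running GROBID server and return the TEI XML as bytes."""
    import requests  # third-party; only needed for the network call

    with open(pdf_path, "rb") as handle:
        response = requests.post(
            f"{base_url}/api/processFulltextDocument",
            files={"input": handle},
            timeout=120,
        )
    response.raise_for_status()
    return response.content


def body_sections(tei_xml):
    """Return [(section title, [paragraph, ...]), ...] from the TEI body."""
    root = ET.fromstring(tei_xml)
    sections = []
    for div in root.findall("./tei:text/tei:body/tei:div", TEI):
        head = div.find("tei:head", TEI)
        title = "".join(head.itertext()).strip() if head is not None else ""
        paragraphs = [
            " ".join("".join(p.itertext()).split())  # collapse whitespace
            for p in div.findall("tei:p", TEI)
        ]
        sections.append((title, paragraphs))
    return sections
```

Because the header, authors and references live in other parts of the TEI tree, reading only `text/body` leaves them out without any per-publisher rules.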

bounaux · 2024-01-02T23:41:33Z

Thx Peter for the note. Look forward to your new ideas. Yeah, the link I posted was the link to the document; I ended up just using web scraping instead of downloading the PDF itself. I found it to be easier than parsing the PDF version. But there are a few docs I have here that don't have a web version, so I'd love to eventually process them in a cleaner way.

petermr · 2024-01-04T07:40:02Z

I'm happy to discuss what you want and to offer our very alpha code pyami at https://github.com/petermr/pyamihtml, which uses pdfplumber. I am concentrating on the running text, and the code does quite well for simple PDFs. If, for example, you are looking for phrases in the text (to classify the documents) then it would be useful. If you want to extract the bibliography (references) then there are already other tools that do this. So do you know what you want? And are there others in the PDFPlumber community who are doing text extraction?

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

2 replies

junoriosity Dec 16, 2024

@bounaux @petermr What is the state of your discussion?

Do we have any good solution to extract text etc. from a research paper? 🙂

petermr Dec 18, 2024

"good" - no
"effortless" - no
"adequate" - depends on the source and the amount of effort we want to put in. If we restrict ourselves to text (i.e. not math or chemistry) we can probably get a per-publisher answer on a crowd-sourced basis. This means (say) creating a template for:

headers, footers and margins. It would be fairly easy to create an interactive tool to help do this, but it needs an enthusiast.
giving roles for different font sizes and indentations. This is publisher-specific (look at the splash page of an Elsevier journal). But once it's done it's probably good for several years.
lists: what are the bullet sizes and indentations?
double columns.
floats/boxes.

This would make a good Master's project for someone who likes puzzles.
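
The per-publisher template idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `PublisherTemplate` class, its field names and the role labels are hypothetical, and the word dictionaries are assumed to carry the keys `text`, `x0`, `top`, `bottom` and `size`, as pdfplumber's `extract_words(extra_attrs=["size"])` produces. Lists and floats/boxes are not handled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PublisherTemplate:
    """Hypothetical per-publisher layout description."""
    name: str
    header_height: float                  # points from page top treated as header
    footer_height: float                  # points from page bottom treated as footer
    column_split: Optional[float] = None  # x position between two columns, if any
    font_roles: dict = field(default_factory=dict)  # rounded font size -> role


def classify(words, page_height, template):
    """Return (column, top, x0, role, text) for words outside header/footer."""
    rows = []
    for word in words:
        if word["top"] < template.header_height:
            continue  # inside the header band
        if word["bottom"] > page_height - template.footer_height:
            continue  # inside the footer band
        role = template.font_roles.get(round(word["size"], 1), "unknown")
        if template.column_split is None or word["x0"] < template.column_split:
            column = 0
        else:
            column = 1
        rows.append((column, word["top"], word["x0"], role, word["text"]))
    return rows


def text_for_role(words, page_height, template, role="body"):
    """Join words of one role in reading order: left column first, top to bottom."""
    rows = sorted(classify(words, page_height, template))
    return " ".join(text for _, _, _, r, text in rows if r == role)
```

A crowd-sourced collection would then amount to one such template per publisher (or per journal layout), which matches the point above that the effort is up-front and the result should stay valid for several years.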
