How to parse medical research papers' main text body? #1075
This is a hard problem in general. I have been working on it for years and am close to having a first pass at a general solution, which I hope to post here.
For now there is no general solution. If you want to mine the PDF, bear in mind that most pdf2html or pdf2txt tools discard the fonts and styles, and often the indentation. This destroys a lot of the context. You might want to look at GROBID (a Java-based ML tool for scholarly literature), which may solve some of your problems.
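To illustrate why font information matters, here is a minimal sketch (not the approach of any tool mentioned above) that keeps only the characters set in the most common font size, on the assumption that body text dominates the page while titles, footnotes and captions use other sizes. The helper names `dominant_font_size`, `is_body_char` and `extract_body_text` and the `tolerance` value are hypothetical; the sketch relies on pdfplumber exposing a `size` on each character and on `Page.filter`.

```python
from collections import Counter


def dominant_font_size(chars, ndigits=1):
    """Return the (rounded) font size used by the most non-space characters."""
    sizes = Counter(
        round(c["size"], ndigits) for c in chars if c["text"].strip()
    )
    if not sizes:
        return None
    return sizes.most_common(1)[0][0]


def is_body_char(obj, body_size, tolerance=0.5):
    """Keep non-char objects; keep chars whose size is close to body_size."""
    if obj.get("object_type") != "char":
        return True
    return abs(obj["size"] - body_size) <= tolerance


def extract_body_text(path, tolerance=0.5):
    """Extract text from a PDF using only body-sized characters.

    Assumption: the body font size is the most frequent one in the file.
    """
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        all_chars = [c for page in pdf.pages for c in page.chars]
        body_size = dominant_font_size(all_chars)
        if body_size is None:
            return ""
        texts = []
        for page in pdf.pages:
            kept = page.filter(
                lambda obj: is_body_char(obj, body_size, tolerance)
            )
            texts.append(kept.extract_text() or "")
    return "\n".join(texts)
```

This will misfire on papers where references or long tables outnumber the running text, so treat the size threshold as something to tune per document set.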
Thanks, Peter, for the note. I look forward to your new ideas. Yes, the list I posted was the link to the document; I ended up just using web scraping instead of downloading the PDF itself, which I found easier than parsing the PDF version. But there are a few docs I have here that don't have a web version, so I'd love to eventually process them in a cleaner way.
I'm happy to discuss what you want and to offer our very alpha code pyami
at https://github.com/petermr/pyamihtml which uses pdfplumber. I am
concentrating on the running text - and the code does quite well for simple
PDFs. If, for example, you are looking for phrases in the text (to classify
the documents) then it would be useful. If you want to extract the bibliography
(references) then there are already other tools that do this.
So do you know what you want?
And are there others in the PDFPlumber community who are doing text
extraction?
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
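
As a rough illustration of looking for phrases in the running text to classify documents, here is a minimal sketch. It is not pyami's code; the function names, the labels and the phrase lists are all hypothetical, and the only pdfplumber calls assumed are `pdfplumber.open` and `Page.extract_text`.

```python
import re


def normalize(text):
    """Lower-case and collapse whitespace so phrases match across line breaks."""
    return re.sub(r"\s+", " ", text).strip().lower()


def count_phrases(text, phrases):
    """Count whole-word occurrences of each phrase in the text."""
    norm = normalize(text)
    counts = {}
    for phrase in phrases:
        pattern = r"\b" + re.escape(normalize(phrase)) + r"\b"
        counts[phrase] = len(re.findall(pattern, norm))
    return counts


def classify(text, labelled_phrases):
    """Return the label whose phrases occur most often, or None if none occur.

    labelled_phrases maps a label to a list of phrases (hypothetical data).
    """
    scores = {
        label: sum(count_phrases(text, phrases).values())
        for label, phrases in labelled_phrases.items()
    }
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None
    return best


def pdf_text(path):
    """Concatenate the extracted text of every page of a PDF."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

A call such as `classify(pdf_text("paper.pdf"), {"clinical": ["randomized controlled trial"]})` would then return `"clinical"` or `None`. Words hyphenated across line breaks are not rejoined here, so some phrases will be missed.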
Hi,
Any suggestions for parsing medical research papers? For example, if I download this research paper, https://www.nature.com/articles/s41421-022-00392-4#Sec1, I'd like to strip out the header/footer, author names, and other non-essential text, and keep only the text body. Any suggestions for how to do this, please? I'd like to find a generalized way to parse PDFs like this and minimize the amount of hard-coding if possible. Thank you!
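One way to approach the header/footer part without hard-coding coordinates is to drop lines that repeat at the top or bottom of most pages. The sketch below is only a heuristic under that assumption: the function names, `edge` and `min_share` are hypothetical, and it does nothing about author names or other front matter. It uses `pdfplumber.open` and `Page.extract_text` to get the lines.

```python
import re
from collections import Counter


def _key(line):
    """Normalize a line; digits are masked so page numbers compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def _is_edge(index, n_lines, edge):
    """True if the line is among the first or last `edge` lines of a page."""
    return index < edge or index >= n_lines - edge


def strip_running_lines(pages, edge=2, min_share=0.6):
    """Remove lines that recur at page edges on most pages.

    pages is a list of pages, each a list of text lines.
    """
    counts = Counter()
    for lines in pages:
        keys = {
            _key(line)
            for i, line in enumerate(lines)
            if line.strip() and _is_edge(i, len(lines), edge)
        }
        counts.update(keys)  # count each key at most once per page
    threshold = max(2, min_share * len(pages))
    running = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        cleaned.append([
            line
            for i, line in enumerate(lines)
            if not (_is_edge(i, len(lines), edge) and _key(line) in running)
        ])
    return cleaned


def page_lines(path):
    """Return the extracted text of each page of a PDF as a list of lines."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        return [(page.extract_text() or "").splitlines() for page in pdf.pages]
```

Calling `strip_running_lines(page_lines("paper.pdf"))` gives the per-page lines with recurring journal names and "Page N of M" style footers removed. Headers that alternate between odd and even pages may need a lower `min_share`.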