A list of resources on scholarly data analysis, including datasets, papers, and code for bibliometrics, citation analysis, and other scholarly commons resources. Available online at https://shubhanshu.com/awesome-scholarly-data-analysis/
- Awesome Scholarly Data Analysis
- Table of Contents
- Datasets
- Tools
- Publication Venues
- Summer Schools
- Associations & Community
- Contributions
Table of contents generated with markdown-toc
- Arnet Miner
- Microsoft Academic Graph
- Open Academic Graph - MAG + AMiner
- Semantic Scholar Corpus
- CiteSeer
- PubMed
- CORA datasets for citation string parsing
- Humanities and multilingual citation string parsing: Flux-CiM and ICONIP (see the Neural ParsCit paper for details)
- Citation string parsing data for the social sciences, covering English and German citations - comparison with Grobid and CERMINE
- CrossRef DOI URLs
- DBLP Citation dataset
- NBER Patent Citations
- Scopus Citation Database
- Papers, patents, and grants from Indiana University
- Small Network Data - Mark Newman's Lab
- The Koblenz Network Collection
- Google Scholar citation relations
- Open citations project
- Wikicite Project
- Economic Papers
- ArXiv data dump
- Complete ACL anthology as bibtex file
- ACL Anthology Reference Corpus
- Astrophysics data system (ADS) - All physics papers
- CORE 37M full text open access papers
- Inspire database for high energy physics articles
- Scholarly Data of workshops and conferences in RDF triplets
- The Collection of Computer Science Bibliographies
- OpenCitations corpus
- COCI DOI-to-DOI citation data
- DOAJ API (Directory of Open Access Journals)
- ROAD (Directory of Open Access Scholarly Resources)
- Sherpa/Romeo (Publisher copyright policies & self-archiving)
- OpenAPC (fees paid for open access journal articles)
- OSF API (Open Science Framework)
- Digital tools for researchers
- Fatcat - versioned, publicly-editable catalog of research publications
- Microsoft Academic Knowledge Graph - RDF dump
- arXiv CS citation in context
- arXiv fulltext + citations dataset
- Self-citation analysis data based on PubMed Central subset (2002-2005)
- Unpaywalled Corpus - PDF to 23M DOIs Data Schema
- A dataset of publication records for Nobel laureates - paper
- 126+ Million literature-dataset and dataset-dataset links between 12+ Million objects - About the data
- Citation data from the ACL Anthology, manually annotated as uses, motivation, future, extends, compare or contrast, and background
- iCite - NIH Open Citation Collection
- MEDLINE/PubMed Baseline Repository (MBR) - All MEDLINE abstracts and paper metadata in XML
- Mathematics Genealogy Project
- Academic Tree - Cross discipline academic genealogies
- MPACT project - Library Sciences
- PhDTree
- Chemistry Genealogy - curated at UIUC
- Notre Dame Genealogy Project
- UIUC Chemistry, Chemical Engineering, and Biochemistry
- Software Engineering Academic Genealogy
- Other lists of genealogy projects
- Wikipedia - Computer Science Genealogy
- Wikipedia - Theoretical Physicists Genealogy
- Wikipedia - Chemists Genealogy
- SCIENTIFIC GENEALOGY MASTER LIST - Scientists Associated with Concepts in Chemistry & Physics
- Economic Genealogy Text Format
- Temporal profiles of PubMed authors
- ORCID data dump
- National Library of Medicine Profiles
- UIUC Professors database - Publications, Affiliations
- Author Profiles of scholarly authors in Wikipedia
- Career Transitions of CS students
- Author name gender and ethnicity dataset based on PubMed
- MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide
- Conceptual novelty scores for PubMed articles
- Dataset of 100,000 top scientists providing standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator
- Canadian PhD career survey - Science report
- INSPIRE dataset
- Lee Giles dataset
- Cleaner version of Lee Giles dataset
- DBLP Korean Authors
- Arnet Miner
- Arnet Miner - Manual Name Disambiguation data 210 authors
- DBLP Name disambiguation dataset
- rexa-coref-data
- Deduped author names on IEEE Vis papers 1990-2018
- Author-ity dataset for PubMed 2009
- ACL Anthology dataset
- Open Access Theses and Dissertations
- The Networked Digital Library of Theses and Dissertations (NDLTD)
- PhD Dissertations in the Area of Software Engineering
- ProQuest Dissertations & Theses Global
- Citation Parsing
- Citation Parsing in humanities
- Sentences tagged for Drug Disease pairs
- Document Summarization and citation span identification
- ACL Anthology human summaries for 1000 papers
- Keyphrase Extraction
- Related Work Summarization
- Biomedical NLP annotated datasets
- Chemical compound and drug name recognition task
- Semantic Scholar Dataset
- ScienceIE
- ACL RD TEC 2.0 also at @CLARIN
- SEPID Corpus - Segmented ACL ARC 1.0
- PubMed Central Open Access - BioC
- PubMed Fulltext - protein-protein and genetic interactions
- BioNLP - Argo
- Biomedical NLP - Stav
- GENIA - BioNLP 2011
- Genia Treebank used for SciSpacy training - SciSpacy link
- Full GENIA corpus
- Anatomical Entity Mention (AnEM) corpus
- CellFinder - Entity detection
- Multi-Level Event Extraction (MLEE)
- Biomedical sentence simplification
- PubMed - Colorado Richly Annotated Full-Text
- Biomedical NER datasets related publication
- BioVerbNet
- Lunar and Planetary Science abstracts for NER and Relations
- ACM data affiliations
- ACM - DBLP database entry matching
- Colorado Richly Annotated Full-Text - PubMed abstract annotated with entities mapped to 10 biomedical ontology terms.
- CLEF datasets for multilingual Biomedical NLP+IE
- MedMentions - UMLS entities in PubMed
- Coleridge Initiative - Rich text competition
- SciERC - scientific entities, their relations, and coreference clusters for 500 AI conf abstracts
- PubMed200k_RCT - Label abstract sentences into Objective, Background, Method, Results, Conclusions
- NER, Parsing, Classification datasets from SciBert
- ACA Wiki - Paper summaries of more than 1600 papers
- SemEval-2018 task 7 Semantic Relation Extraction and Classification in Scientific Papers
- A Compendium of Free, Public Biomedical Text Mining Tools Available on the Web
- Medical Information Extraction from PubMed abstracts
- Corpus of 40 scientific papers manually annotated with multiple scientific discourse facets
- PharmaCoNER: Pharmacological Substances, Compounds and proteins and Named Entity Recognition track - Train - Dev - Test - Background Test set
- Bacteria Biotope (BB) Task - NER, NEL, Relation, KB Extraction
- Entity/relation recognition and GOF/LOF mutated gene text identification task based on the Active Gene Annotation Corpus
- The Regulatory Network of Plant Seed Development (SeeDev) Task - NER, Relation
- TalkSumm - Summary of papers via alignment to talks
- SeminalSurveyDBLP - Classification of seminal or survey papers
- A Dataset of Peer Reviews (PeerRead)
- CiteTracked: A Longitudinal Dataset of Peer Reviews and Citations
- Supp.ai - PubMed supplement-drug interactions and supplement-supplement interactions
- GENETAG - More recent versions Publication and Download 2005
- MedTag: A Collection of Biomedical Annotations - Download
- Open Biomedical corpora
- Biomedical Abstract Meaning Representation corpus based on PubMed Fulltext - Also see other NLM curated biomedical resources
- SciGraph Springer Nature
- Medical Subject Headings maintained by the National Library of Medicine of the United States
- Computer Science Ontology maintained by Scholarly Knowledge: Modeling, Mining and Sense Making
- Physics Subject Headings maintained by American Physical Society (APS)
- Open Biological and Biomedical Ontology (OBO) maintained by the OBO Foundry
- ACM Computing Classification System maintained by the Association for Computing Machinery
- Physics and Astronomy Classification Scheme (PACS) maintained by the American Institute of Physics (AIP); discontinued in 2010 and replaced by Physics Subject Headings
- Mathematics Subject Classification (MSC) maintained by Mathematical Reviews and zbMATH
- Journal of Economic Literature (JEL) maintained by the American Economic Association
- STW Thesaurus for Economics maintained by ZBW - Leibniz Information Centre for Economics
- Australian and New Zealand Standard Research Classification (ANZSRC) maintained by the Australian Bureau of Statistics; it consists of 3 sub-classification schemes:
- Fields of Research (FoR) classification
- Research Fields, Courses and Disciplines (RFCD) classification
- Socio-Economic Objective (SEO) classification
- Library of Congress Classification (LCC) maintained by Library of Congress
- Fields of Study (FoS) maintained by Microsoft Academic
- Altmetrics API
- Dimensions.ai API - documentation, example
- Core Conference Rankings
- China Computer Federation Conference Rankings
- Google Scholar
- Semantic Scholar
- Microsoft Academic Graph
- AceMap
- GitXiv
- ACL Anthology
- NIPS papers
- Abel tools for PubMed data
- infolis: linking research data and publications
- Metrics toolkit
- Rcrossref (R library)
- Rscopus (R library)
- Scholar (R library)
- Bibliometrix (R library)
- CITAN (R library)
- BibeR (a web-based tool for bibliometric analysis in scientific literature)
- scihub.py (Python library)
- SoPaper (Python library)
- CiteSeer tools
- Novelty quantification in PubMed articles
- TidyPMC - R based PMC XML parser
- ContentMine - getpapers
- rcoreoa - CORE API R client
- Biomedical - BioSentVec Embeddings
- Biomedical embeddings - CambridgeLTL
- NIH scientific paper pre-processing
- SciSpacy - spaCy models for Biomedical NLP from AllenAI
- Multitask Biomedical NER
- SciBERT - BERT LM for Biomedical and CS papers
- CERMINE
- Grobid
- EXCITE (Extraction of Citations from PDF Documents)
- Science-Parse
- unarXive (Citation in context from arXiv)
- Frontiers in Research Metrics and Analytics
- Scientometrics
- Journal of Informetrics
- Quantitative Science Studies (Open Access)
- Science, Technology and Human Values
- Social Studies of Science
- Science and Public Policy
- Joint Conference on Digital Libraries (JCDL)
- International Conference on Theory and Practice of Digital Libraries (TPDL)
- European Semantic Web Conference (ESWC), Research of Research Track
- STI Conference series (Science and Technology Indicators, e.g., 2018)
- ISSI Conference series (International Conference on Scientometrics & Informetrics, e.g., 2019)
- SIGMET - Metrics workshop
- International Workshop on Mining Scientific Publications
- Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination (SAVE-SD)
- Workshop on Reframing Research (RefResh)
- Enabling Open Semantic Science (SemSci)
- International Society for Informetrics and Scientometrics (ISSI)
- European Network of Indicator Designers (ENID)
- 4S (Society for Social Studies of Science)
- SIG/MET - Special Interest Group for the measurement of information production and use
The following people have contributed to the items on this list.
- Shubhanshu Mishra - Maintainer of the list.
- Angelo Antonio Salatino
- Philipp Zumstein
- Ali (Aliakbar Akbaritabar)