Notebooks and code snippets aimed at learning about / exploring the OpenAlex datasets.
The first exploration focuses on Topics & Keywords.
The topics classification in OpenAlex consists of several thousand categories organised into a 4-level hierarchy.
The gist of it is:
Works in OpenAlex are tagged with Topics using an automated system that takes into account the available information about the work, including title, abstract, source (journal) name, and citations. There are around 4,500 Topics. Topics are grouped into subfields, which are grouped into fields, which are grouped into top-level domains. This is shown in the diagram below, along with the counts for each.
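As a rough illustration of the 4-level hierarchy described above, here is a minimal sketch that nests a handful of topic records and counts the entries at each level. The sample records are made up; they only assume that each topic carries `domain`, `field` and `subfield` objects with a `display_name`, as OpenAlex topic records do. The helper names `build_hierarchy` and `level_counts` are hypothetical and not taken from the notebooks.

```python
from collections import defaultdict

# Made-up sample records, shaped like OpenAlex topic objects
# (only the fields used here; all names are placeholders).
SAMPLE_TOPICS = [
    {"display_name": "Topic A", "subfield": {"display_name": "Subfield X"},
     "field": {"display_name": "Field 1"}, "domain": {"display_name": "Domain I"}},
    {"display_name": "Topic B", "subfield": {"display_name": "Subfield X"},
     "field": {"display_name": "Field 1"}, "domain": {"display_name": "Domain I"}},
    {"display_name": "Topic C", "subfield": {"display_name": "Subfield Y"},
     "field": {"display_name": "Field 2"}, "domain": {"display_name": "Domain I"}},
    {"display_name": "Topic D", "subfield": {"display_name": "Subfield Z"},
     "field": {"display_name": "Field 3"}, "domain": {"display_name": "Domain II"}},
]


def build_hierarchy(topics):
    """Nest topics as domain -> field -> subfield -> [topic names]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for topic in topics:
        domain = topic["domain"]["display_name"]
        field = topic["field"]["display_name"]
        subfield = topic["subfield"]["display_name"]
        tree[domain][field][subfield].append(topic["display_name"])
    return tree


def level_counts(tree):
    """Count how many entries exist at each level of the nested tree."""
    fields = [f for domain in tree.values() for f in domain.values()]
    subfields = [s for f in fields for s in f.values()]
    return {
        "domains": len(tree),
        "fields": len(fields),
        "subfields": len(subfields),
        "topics": sum(len(names) for names in subfields),
    }


hierarchy = build_hierarchy(SAMPLE_TOPICS)
counts = level_counts(hierarchy)
print(counts)
```

Run against the full topics dataset instead of the sample list, the same counting should reproduce the per-level totals mentioned in the quote.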
Our team put together a new implementation of keywords based on our Topics. There are currently over 26,000 keywords and we expect to add more as time goes on. [...] With our new topics system that was developed in coordination with CWTS, we came out with a list of 10 keywords for each topic. In order to assign keywords to works, we took the topics assigned to that work (at most 3 topics), pulled the keywords associated with those topics (at most 30 keywords, for now) and then determined the similarity of the keyword to the title/abstract using embeddings (and the BGE M3-Embedding model).
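The keyword assignment described above can be sketched roughly as follows. This is not the OpenAlex implementation: the `embed` function is a toy bag-of-words stand-in for the BGE M3-Embedding model, and the topic ids, keyword lists and the `rank_keywords` helper are all hypothetical. It only shows the shape of the procedure: take at most 3 topics, pool their keywords, and score each keyword against the title/abstract.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_keywords(text, work_topics, topic_keywords, max_topics=3):
    """Pool the keywords of the first `max_topics` topics and rank them
    by similarity to the work's title/abstract text."""
    candidates = []
    for topic in work_topics[:max_topics]:
        for keyword in topic_keywords.get(topic, []):
            if keyword not in candidates:  # skip keywords shared by topics
                candidates.append(keyword)
    text_vec = embed(text)
    scored = [(kw, cosine(embed(kw), text_vec)) for kw in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic ids and keyword lists, for illustration only.
TOPIC_KEYWORDS = {
    "T1": ["protein structure", "gene expression"],
    "T2": ["deep learning", "computer vision"],
}
ranked = rank_keywords(
    "Deep learning for protein structure prediction",
    work_topics=["T1", "T2"],
    topic_keywords=TOPIC_KEYWORDS,
)
for keyword, score in ranked:
    print(f"{score:.3f}  {keyword}")
```

How many of the ranked keywords end up attached to a work, and with what cutoff, is not stated in the quote, so the sketch stops at the ranking.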
For more details, see
- the official data documentation
- the white paper OpenAlex: End-to-End Process for Topic Classification
The notebook 2024-09-topics-explore.ipynb pulls the topics dataset and turns it into a FoamTree visualization.
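For reference, FoamTree expects its data as nested groups, each with a `label`, an optional `weight`, and optional child `groups`. Below is a minimal sketch of turning a nested topic hierarchy into that shape. The input dictionary is made up, the helper `to_foamtree_groups` is hypothetical, and weighting each group by its number of topics is an assumption; the notebook may well weight by something else (e.g. works count).

```python
import json


def to_foamtree_groups(node):
    """Recursively convert nested dicts (ending in lists of topic names)
    into FoamTree-style group dictionaries."""
    if isinstance(node, dict):
        groups = []
        for label, child in node.items():
            children = to_foamtree_groups(child)
            groups.append({
                "label": label,
                # Assumption: a group's weight is the number of topics below it.
                "weight": sum(g["weight"] for g in children),
                "groups": children,
            })
        return groups
    # Leaf level: a plain list of topic names, each with weight 1.
    return [{"label": name, "weight": 1} for name in node]


# Made-up hierarchy: domain -> field -> subfield -> [topics].
sample_hierarchy = {
    "Domain I": {
        "Field 1": {"Subfield X": ["Topic A", "Topic B"]},
        "Field 2": {"Subfield Y": ["Topic C"]},
    },
}

data_object = {"groups": to_foamtree_groups(sample_hierarchy)}
foamtree_json = json.dumps(data_object, indent=2)
print(foamtree_json)
```

The resulting JSON can then be passed as the `dataObject` option when the FoamTree instance is created in the page.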
SKOS provides a standard way to represent knowledge organization systems using the Resource Description Framework (RDF). Encoding the topics hierarchy in RDF makes it possible to process it with the many tools developed for Knowledge Graph applications.
The notebook 2024-09-skos.ipynb loads the topics dataset and generates a SKOS ontology: openalex-topics-rdf.ttl.
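To give an idea of what such a SKOS encoding can look like, here is a minimal sketch that writes Turtle by hand using only the standard library. It is not the notebook's code (which may use an RDF library and different properties): the `https://example.org/` namespace, the slug-based URIs, the choice of `skos:prefLabel` and the helper names are all assumptions made for illustration. Each level of the hierarchy becomes a `skos:Concept` linked to its parent with `skos:broader`.

```python
from urllib.parse import quote

SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
BASE = "https://example.org/openalex-topics/"  # hypothetical namespace


def literal(text):
    """Format a label as an English Turtle string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@en'


def slug(label):
    """Turn a label into a URI-safe path segment."""
    return quote(label.lower().replace(" ", "-"), safe="-")


def concept(uri, label, broader=None):
    """Serialise one skos:Concept, optionally linked to a broader concept."""
    lines = [f"<{uri}> a skos:Concept ;",
             f"    skos:prefLabel {literal(label)} ;"]
    if broader is not None:
        lines.append(f"    skos:broader <{broader}> ;")
    lines.append(f"    skos:inScheme <{BASE}scheme> .")
    return "\n".join(lines)


def to_skos(rows):
    """rows: iterable of (domain, field, subfield, topic) label tuples."""
    blocks = [SKOS_PREFIX, f"<{BASE}scheme> a skos:ConceptScheme ."]
    seen = set()
    for row in rows:
        parent, path = None, []
        for label in row:
            path.append(slug(label))
            uri = BASE + "/".join(path)
            if uri not in seen:  # emit each concept only once
                seen.add(uri)
                blocks.append(concept(uri, label, parent))
            parent = uri
    return "\n\n".join(blocks) + "\n"


# Made-up rows, for illustration only.
ttl = to_skos([
    ("Domain I", "Field 1", "Subfield X", "Topic A"),
    ("Domain I", "Field 1", "Subfield X", "Topic B"),
])
print(ttl)
```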
Two sample visualizations of the ontology have been generated (using Ontospy):
The command used is: ontospy gendocs src/data/openalex-topics-rdf.ttl --preflabel label --theme united
- OpenAlex https://docs.openalex.org/
- Python API https://github.com/J535D165/pyalex
- Foam Tree https://get.carrotsearch.com/foamtree/latest/demos/
- Ontospy https://lambdamusic.github.io/Ontospy/
- src folder includes Jupyter notebooks with all data extraction logic
- build folder is used to generate outputs
- put results into docs for publishing via GitHub Pages, i.e. on https://lambdamusic.github.io/openalex-hacks/
This project is mainly a hack and can contain errors.
I am not affiliated with the OpenAlex project.