This is the benchmark, code, and configuration accompanying the EMNLP-Findings 2023 paper A Benchmark for Semi-Inductive Link Prediction in Knowledge Graphs. The main branch holds code/information about the benchmark itself. The following branches hold code and configuration for the separate models evaluated in the study.
- KGT5 & KGT5-context
- ComplEx + Bias + FoldIn
- DistMult ERAvg
- DistMult ERAvg + Mention/Description
- HittER
mkdir data
cd data
curl -O https://madata.bib.uni-mannheim.de/424/2/wikidata5m-si.tar.gz
tar -zxvf wikidata5m-si.tar.gz
All files are tab separated.
- entity_ids.del
- maps ids used in all files to Wikidata IDs
- first column entity id, second column Wikidata entity id
- entity_mentions.del
- maps entity ids to entity mentions
- entity_desc.del
- maps entity ids to entity descriptions
- relation_ids.del
- maps relation ids to Wikidata relation ids
- first column relation id, second column Wikidata relation id
- relation_mentions.del
- maps relation ids to relation mentions
- train.del
- contains training triples in the form of subject, relation, object
- valid.del
- contains transductive validation triples in the form of subject, relation, object
- test.del
- contains transductive test triples in the form of subject, relation, object
- all_entity_ids.del
- contains ids from entity_ids.del and additionally all ids of unseen entities
- all_entity_mentions.del
- contains mentions from entity_mentions.del and additionally all mentions of unseen entities
- all_entity_desc.del
- contains descriptions from entity_desc.del and additionally all descriptions of unseen entities
- valid_pool.del
- contains all triples used for semi-inductive validation
- columns
- 1: unseen entity id
- 2: slot of unseen entity (0: unseen entity is in subject slot, 1: unseen entity in object slot)
- 3-5: validation triple
- 3: subject
- 4: relation
- 5: object
- use prepare_few_shot.py to create all semi-inductive tasks from this file
- test_pool.del
- contains all triples used for semi-inductive testing
- columns
- 1: unseen entity id
- 2: slot of unseen entity (0: unseen entity is in subject slot, 1: unseen entity in object slot)
- 3-5: test triple
- 3: subject
- 4: relation
- 5: object
- tab separated
- use prepare_few_shot.py to create all semi-inductive tasks from this file
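To illustrate the five-column layout of valid_pool.del and test_pool.del described above, here is a minimal sketch that parses a single row. The names `PoolRow` and `parse_pool_row` are hypothetical and not part of the repository, and the sketch assumes that all ids are stored as integers; the example row is made up.

```python
from typing import NamedTuple, Tuple


class PoolRow(NamedTuple):
    unseen_entity: int            # column 1: id of the unseen entity
    unseen_slot: int              # column 2: 0 = subject slot, 1 = object slot
    triple: Tuple[int, int, int]  # columns 3-5: (subject, relation, object)


def parse_pool_row(line: str) -> PoolRow:
    """Parse one tab-separated row of a pool file (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    # Assumption: every column holds an integer id.
    unseen, slot, s, p, o = (int(f) for f in fields)
    return PoolRow(unseen, slot, (s, p, o))


# Made-up example row: unseen entity 42 sits in the object slot of (7, 3, 42).
row = parse_pool_row("42\t1\t7\t3\t42\n")
```

For real use, prepare_few_shot.py remains the intended entry point; this sketch only makes the column description concrete.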
- use the file prepare_few_shot.py
- create a few_shot_set_creator object with the following parameters
- dataset_name: (str) name of the dataset - default: wikidata5m_v3_semi_inductive
- use_inverse: (bool) whether to use inverse relations - default: False
- if True: for all triples where the unseen entity is in the object slot, increase the relation id by num-relations and invert the triple
- split: (str) which split to use - default: valid
- context_selection: (str) which context_selection technique to use - default: most_common - options: most_common, least_common, random
few_shot_set_creator = FewShotSetCreator(
    dataset_name="wikidata5m_v3_semi_inductive",
    use_inverse=True,
    split="test"
)
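The effect of the inverse-relation option can be sketched as follows. This is not the repository's implementation: the function name `to_inverse` is hypothetical, and the sketch assumes that "invert the triple" means swapping subject and object while shifting the relation id by the number of relations.

```python
from typing import Tuple

Triple = Tuple[int, int, int]  # (subject, relation, object)


def to_inverse(triple: Triple, unseen_in_object: bool, num_relations: int) -> Triple:
    """Move the unseen entity to the subject slot via an inverse relation.

    Hypothetical helper illustrating the description of the inverse option.
    """
    s, p, o = triple
    if not unseen_in_object:
        # Unseen entity is already the subject: leave the triple unchanged.
        return triple
    # Swap subject and object and shift the relation id into the inverse range.
    return (o, p + num_relations, s)


# Made-up example: unseen entity 42 is the object, and there are 100 relations.
example = to_inverse((7, 3, 42), unseen_in_object=True, num_relations=100)
```

Under these assumptions, every triple ends up with the unseen entity in the subject slot, so a model only has to answer queries in one direction.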
- generate the data using the few_shot_set_creator
- num_shots: (int) the number of shots to use (between 0 and 10)
data = few_shot_set_creator.create_few_shot_dataset(num_shots=5)
- evaluation is performed in direction unseen to seen
- output format looks like this
[
  {
    "unseen_entity": <id of unseen entity>,
    "unseen_slot": <slot of unseen entity: 0 for head/subject, 2 for tail/object>,
    "triple": <[s, p, o]>,
    "context": <[unseen_entity_id, unseen_entity_slot, s, p, o]>
  },
  ...
]
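Since evaluation runs from the unseen to the seen entity, one way to turn an entry of this output into a query is sketched below. The function `query_and_target` and the keys of the dictionary it returns are hypothetical, and the example entry is made up. The sketch treats slot 0 as the subject and any other slot value as the object.

```python
from typing import Any, Dict


def query_and_target(example: Dict[str, Any]) -> Dict[str, Any]:
    """Split an entry into the given (unseen) side and the (seen) entity to predict.

    Hypothetical helper; slot 0 means the unseen entity is the subject,
    any other value is treated as the object slot.
    """
    s, p, o = example["triple"]
    if example["unseen_slot"] == 0:
        # Unseen entity is the subject: predict the seen object.
        return {"unseen": s, "relation": p, "target": o, "target_slot": "object"}
    # Unseen entity is the object: predict the seen subject.
    return {"unseen": o, "relation": p, "target": s, "target_slot": "subject"}


# Made-up entry: unseen entity 42 is the object of the triple (7, 3, 42).
entry = {"unseen_entity": 42, "unseen_slot": 2, "triple": [7, 3, 42], "context": []}
query = query_and_target(entry)
```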
- to create a similar benchmark based on other graphs, use the file create_semi_inductive_dataset.py
- this file was used to create wikidata5m-si based on wikidata5m
- if you use the proposed benchmark, the provided code, or insights presented in the paper, please cite:
@inproceedings{kochsiek2023benchmark,
  title={A Benchmark for Semi-Inductive Link Prediction in Knowledge Graphs},
  author={Kochsiek, Adrian and Gemulla, Rainer},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},
  year={2023}
}