Closes #42 #494

alisoncallahan · 2022-04-20T16:40:25Z

Finished the data loader for the source schema only, because the BigBio KB schema does not currently support all features that exist in the source data (per conversation with @jason-fries).

Name: RadGraph
Description: This dataset is derived from radiology reports and is designed for named entity recognition and relation extraction.
Paper: https://doi.org/10.13026/hm87-5p47
Data: https://physionet.org/content/radgraph/1.0.0/

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.


alisoncallahan · 2022-04-27T05:14:42Z

@jason-fries @ruisi-su this is ready for review. Per guidance from Jason, this version only generates examples for the source schema, because RadGraph records cannot be represented properly using the KB schema as is. It will therefore not pass the tests.

hakunanatasha · 2022-04-27T05:32:43Z

@alisoncallahan is this a local dataset? Can you give us a printout of the following commands?

from datasets import load_dataset
x = load_dataset("biodatasets/radgraph/radgraph.py")
print(x["train"]["entities"][-1])
print(x["train"]["relations"][-1])

alisoncallahan · 2022-04-27T05:40:25Z

@hakunanatasha yes, it is a local dataset because RadGraph is provided by PhysioNet, which requires user registration and vetting.

In the source schema, relations are nested in entities. The output of print(x["train"]["entities"][-1]) is:

[{'entity_id': '1', 'tokens': 'lungs', 'label': 'ANAT-DP', 'start_ix': 24, 'end_ix': 24, 'labeler': '', 'relations': []}, {'entity_id': '2', 'tokens': 'clear', 'label': 'OBS-DP', 'start_ix': 26, 'end_ix': 26, 'labeler': '', 'relations': [{'relation_id': '9667', 'type': 'located_at', 'arg': '1'}]}, {'entity_id': '3', 'tokens': 'Cardiomediastinal', 'label': 'ANAT-DP', 'start_ix': 28, 'end_ix': 28, 'labeler': '', 'relations': []}, {'entity_id': '4', 'tokens': 'hilar', 'label': 'ANAT-DP', 'start_ix': 30, 'end_ix': 30, 'labeler': '', 'relations': []}, {'entity_id': '5', 'tokens': 'contours', 'label': 'ANAT-DP', 'start_ix': 31, 'end_ix': 31, 'labeler': '', 'relations': [{'relation_id': '9668', 'type': 'modify', 'arg': '3'}, {'relation_id': '9669', 'type': 'modify', 'arg': '4'}]}, {'entity_id': '6', 'tokens': 'normal', 'label': 'OBS-DP', 'start_ix': 33, 'end_ix': 33, 'labeler': '', 'relations': [{'relation_id': '9670', 'type': 'located_at', 'arg': '3'}, {'relation_id': '9671', 'type': 'located_at', 'arg': '4'}]}, {'entity_id': '7', 'tokens': 'pleural', 'label': 'ANAT-DP', 'start_ix': 38, 'end_ix': 38, 'labeler': '', 'relations': []}, {'entity_id': '8', 'tokens': 'effusions', 'label': 'OBS-DA', 'start_ix': 39, 'end_ix': 39, 'labeler': '', 'relations': [{'relation_id': '9672', 'type': 'located_at', 'arg': '7'}]}, {'entity_id': '9', 'tokens': 'pneumothorax', 'label': 'OBS-DA', 'start_ix': 41, 'end_ix': 41, 'labeler': '', 'relations': []}, {'entity_id': '10', 'tokens': 'acute', 'label': 'OBS-DA', 'start_ix': 46, 'end_ix': 46, 'labeler': '', 'relations': [{'relation_id': '9673', 'type': 'modify', 'arg': '12'}]}, {'entity_id': '11', 'tokens': 'cardiopulmonary', 'label': 'ANAT-DP', 'start_ix': 47, 'end_ix': 47, 'labeler': '', 'relations': []}, {'entity_id': '12', 'tokens': 'process', 'label': 'OBS-DA', 'start_ix': 48, 'end_ix': 48, 'labeler': '', 'relations': [{'relation_id': '9674', 'type': 'located_at', 'arg': '11'}]}]
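For anyone who wants a flat relation list (closer to what `x["train"]["relations"][-1]` would have printed), here is a minimal sketch of how the nested relations could be pulled out of the entities. This is not part of the dataloader in this PR. The function name `flatten_relations` and the output keys `head` and `tail` are hypothetical, and it assumes the entity that holds a relation is its source and `arg` is its target.

```python
from typing import Dict, List


def flatten_relations(entities: List[Dict]) -> List[Dict]:
    """Collect the relations nested under each entity into one flat list.

    Assumption: the entity carrying the relation is treated as the source
    ("head") and the entity referenced by "arg" as the target ("tail").
    """
    flat = []
    for entity in entities:
        for rel in entity["relations"]:
            flat.append(
                {
                    "relation_id": rel["relation_id"],
                    "type": rel["type"],
                    "head": entity["entity_id"],
                    "tail": rel["arg"],
                }
            )
    return flat


# First five entities of the record printed above, copied verbatim.
entities = [
    {"entity_id": "1", "tokens": "lungs", "label": "ANAT-DP", "start_ix": 24,
     "end_ix": 24, "labeler": "", "relations": []},
    {"entity_id": "2", "tokens": "clear", "label": "OBS-DP", "start_ix": 26,
     "end_ix": 26, "labeler": "",
     "relations": [{"relation_id": "9667", "type": "located_at", "arg": "1"}]},
    {"entity_id": "3", "tokens": "Cardiomediastinal", "label": "ANAT-DP",
     "start_ix": 28, "end_ix": 28, "labeler": "", "relations": []},
    {"entity_id": "4", "tokens": "hilar", "label": "ANAT-DP", "start_ix": 30,
     "end_ix": 30, "labeler": "", "relations": []},
    {"entity_id": "5", "tokens": "contours", "label": "ANAT-DP", "start_ix": 31,
     "end_ix": 31, "labeler": "",
     "relations": [{"relation_id": "9668", "type": "modify", "arg": "3"},
                   {"relation_id": "9669", "type": "modify", "arg": "4"}]},
]

relations = flatten_relations(entities)
for relation in relations:
    print(relation)
```

Each nested relation becomes one row that names both of its endpoints, so the relations can be inspected without walking the entity list.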

First commit of RadGraph dataset loader. Needs fixes to source schema to properly represent unique keys.

947385b

alisoncallahan requested review from hakunanatasha, jason-fries, sunnnymskang, ruisi-su, galtay, leonweber, sg-wbi and debajyotidatta as code owners April 20, 2022 16:40

Updated description

10371c1

ruisi-su self-assigned this Apr 21, 2022

alisoncallahan added 2 commits April 21, 2022 15:53

fixes to source schema

89ad7b0

updated radgraph data loader to complete source schema and _generate_examples for source schema

0e645e2

hakunanatasha self-assigned this Apr 27, 2022

jason-fries added the schema improvements Suggested improvements to the BigBio schemas label May 31, 2022
