Closes #714 #721

Open
wants to merge 4 commits into
base: main

shamikbose
Contributor

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • If my dataset is local, I have provided the output of the unit tests in the PR (please copy and paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
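
For the metadata item in the checklist above, a quick local sanity check could be sketched as follows. This is not part of the BigBio test suite: `REQUIRED_VARS` and `missing_metadata` are hypothetical names, and the variable list is simply copied from the checklist.

```python
import ast
from typing import List

# Names copied from the checklist above (assumption: the template has not changed).
REQUIRED_VARS = [
    "_CITATION", "_DATASETNAME", "_DESCRIPTION", "_HOMEPAGE", "_LICENSE",
    "_URLs", "_SUPPORTED_TASKS", "_SOURCE_VERSION", "_BIGBIO_VERSION",
]


def missing_metadata(source: str) -> List[str]:
    """Return the required module-level names that `source` never assigns."""
    assigned = set()
    for node in ast.parse(source).body:  # top-level statements only
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets = [node.target]
        else:
            continue
        for target in targets:
            if isinstance(target, ast.Name):
                assigned.add(target.id)
    return [name for name in REQUIRED_VARS if name not in assigned]
```

Calling `missing_metadata(open(path).read())` on a dataloader script returns an empty list when every variable is assigned at module level. It only checks that the names exist, not that their values are sensible.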

First pass at supporting bigbio_kb schema properly. WIP
Passes all tests
@mariosaenger requested a review from @phlobo on October 28, 2024, 10:18
@mariosaenger
Collaborator

@phlobo I transferred the bug fix to the hub implementation. Please have a look. Thanks!

Collaborator

@phlobo left a comment

@mariosaenger some minor issues, could you take a quick look, please?

"""

_HOMEPAGE = "http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004"

_LICENSE = 'Creative Commons Attribution 3.0 Unported'
_LICENSE = "CC_BY_3p0"
Collaborator

Not sure this is correct. The data archives contain a LICENSE file with "GENIA Project License for Annotated Corpora".

The data came from the GENIA version 3.02 corpus (Kim et al., 2003).
This was formed from a controlled search on MEDLINE using the MeSH terms human, blood cells and transcription factors.
From this search 2,000 abstracts were selected and hand annotated according to a small taxonomy of 48 classes based on
a chemical classification. Among the classes, 36 terminal classes were used to annotate the GENIA corpus.
"""

_HOMEPAGE = "http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004"
Collaborator

The link does not work. If we can't find another one, maybe just link to the ACL Anthology?

document["passages"] = [
{
"id": next(uid),
"type": "",
Collaborator

Passage type should not be empty imho. I guess it is "sentence" in this dataset?
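
If the passages are indeed sentences, the fix could look roughly like the sketch below. It assumes one passage per sentence and that a bigbio_kb passage carries the fields `id`, `type`, `text`, and `offsets`. `make_sentence_passage` and its parameters are hypothetical and not taken from this PR, and the example sentence is made up.

```python
import itertools


def make_sentence_passage(uid, sentence, start=0):
    """Build one passage dict with a non-empty type (assumed: "sentence")."""
    return {
        "id": next(uid),
        "type": "sentence",  # assumption from the review comment, not confirmed
        "text": [sentence],
        "offsets": [[start, start + len(sentence)]],  # character span of the sentence
    }


# String ids "0", "1", ... drawn from a shared counter.
uid = map(str, itertools.count(start=0, step=1))
passage = make_sentence_passage(uid, "IL-2 gene expression requires NF-kappa B.")
```

Whether "sentence" is the right label still needs to be confirmed against the dataset itself.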

3 participants