Create a dataset loader for CRAFT #60

Closed
hakunanatasha opened this issue Jan 21, 2022 · 13 comments
Labels: CC BY 3.0 Licence, CoNLL Format, Coreference Task, English Language, High Priority, NER Task

Comments

@hakunanatasha
Collaborator

Colorado Richly Annotated Full-Text (CRAFT) Corpus

https://github.com/UCDenver-ccp/CRAFT

@uzaymacar

#self-assign

@hakunanatasha
Collaborator Author

Hi @uzaymacar, can you let us know if you are still working on this so we can update our project board? Please just let us know the status by Friday, April 8. No worries if you are not finished but intend to keep working on this. Please either ping me here at @hakunanatasha or ping the Discord admins (with @admins).

@uzaymacar

Hey @hakunanatasha, yes I am still working on this! I am planning to follow up with a PR by mid-next week.

@hakunanatasha
Collaborator Author

@uzaymacar awesome! Feel free to ping me here, via your PR, or on the discord for help! I'm looking forward to your submission 🌸

@davidkartchner
Contributor

#self-assign

@shamikbose
Contributor

#self-assign

@shamikbose
Contributor

@jason-fries There are multiple versions of this. I'm using 5.0.0, which is the latest one.

@jason-fries
Member

SGTM -- just make certain the versioning is reflected in the data loader metadata.
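
One way the version could be carried in the loader metadata is sketched below. This is only an illustration: the `_SOURCE_VERSION` constant name, the `LoaderConfig` class, and the `build_configs` helper are hypothetical stand-ins chosen so the snippet runs without the project's library installed, not the loader's actual code.

```python
from dataclasses import dataclass
from typing import List

# Assumption: CRAFT 5.0.0 is the release being loaded, as stated in the thread.
_SOURCE_VERSION = "5.0.0"


@dataclass(frozen=True)
class LoaderConfig:
    """Hypothetical stand-in for the loader's real config class."""

    name: str
    version: str
    schema: str


def build_configs(dataset_name: str = "craft") -> List[LoaderConfig]:
    """Build the configs, stamping each one with the source version."""
    return [
        LoaderConfig(
            name=f"{dataset_name}_source",
            version=_SOURCE_VERSION,
            schema="source",
        )
    ]


for config in build_configs():
    print(config.name, config.version)
```

Keeping the version in a single constant means a later switch to another CRAFT release only needs one change.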

@shamikbose
Contributor

Hi @jason-fries @galtay @ruisi-su
I think I'm starting to understand the CRAFT dataset. I have a few questions:

  1. From what I can understand, this dataset supports Tasks.COREF and Tasks.NER. Please let me know if there are other tasks it supports.
  2. Corefs are somewhat tricky. There are multiple annotations of the same thing. How should that be handled? Here's an example:
        <annotation annotator="Annotator" id="1" type="identity">
            <class id="IDENTITY chain" label="IDENTITY chain"/>
            <span end="71" id="11532192-2" start="65">strain</span>
        </annotation>
        <annotation annotator="CCP Colorado Computational Pharmacology, UC Denver" id="11532192SHM_Instance_150000" type="identity">
            <class id="Noun Phrase" label="Noun Phrase"/>
            <span end="71" id="11532192-3" start="65">strain</span>
        </annotation>
  3. The NER seems to be pretty straightforward, but just to clarify, the covered types are as follows:

    • CHEBI
    • CL
    • GO_BP
    • GO_CC
    • GO_MF
    • MONDO
    • MOP
    • NCBITaxon
    • PR
    • SO
    • UBERON
  4. There are also structural annotations, but I'm not sure which task they would map to in the bigbio schema. Do they need to be implemented?
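
For the duplicate coreference annotations raised in the second question, a minimal sketch of how overlapping annotations could be surfaced is to group them by span offsets. It does not decide how duplicates should be resolved. The `<annotations>` wrapper element and the `group_by_span` helper are assumptions added for illustration; only the two `<annotation>` elements come from the example in the comment.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two annotations quoted in the comment, wrapped in an assumed root element.
SAMPLE = """<annotations>
  <annotation annotator="Annotator" id="1" type="identity">
    <class id="IDENTITY chain" label="IDENTITY chain"/>
    <span end="71" id="11532192-2" start="65">strain</span>
  </annotation>
  <annotation annotator="CCP Colorado Computational Pharmacology, UC Denver" id="11532192SHM_Instance_150000" type="identity">
    <class id="Noun Phrase" label="Noun Phrase"/>
    <span end="71" id="11532192-3" start="65">strain</span>
  </annotation>
</annotations>"""


def group_by_span(xml_text):
    """Map (start, end, text) to the class labels annotated on that span."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for annotation in root.iter("annotation"):
        label = annotation.find("class").get("label")
        for span in annotation.iter("span"):
            key = (int(span.get("start")), int(span.get("end")), span.text)
            groups[key].append(label)
    return dict(groups)


groups = group_by_span(SAMPLE)
# Spans carrying more than one annotation are the candidates for a warning.
duplicates = {span: labels for span, labels in groups.items() if len(labels) > 1}
print(duplicates)
# {(65, 71, 'strain'): ['IDENTITY chain', 'Noun Phrase']}
```

A loader could log a warning for each entry in `duplicates` rather than silently dropping one of the annotations.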

@shamikbose
Contributor

@ruisi-su This is implemented as a local dataset in #681, since download_and_extract() doesn't seem to work properly with the archive containing the dataset.
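
With a local dataset, the user supplies the directory of the already extracted archive instead of the loader downloading it. A rough sketch of the kind of check this implies follows; `resolve_data_dir` is a hypothetical helper written for illustration, not a function from #681 or from any library.

```python
import os


def resolve_data_dir(data_dir):
    """Validate a user-supplied local dataset directory (hypothetical helper)."""
    if data_dir is None:
        # Nothing is downloaded for a local dataset, so a path is required.
        raise ValueError(
            "This is a local dataset; pass the extracted CRAFT directory as data_dir."
        )
    if not os.path.isdir(data_dir):
        raise FileNotFoundError(f"data_dir does not exist: {data_dir}")
    return os.path.abspath(data_dir)


# Usage: the current working directory always exists, so this call succeeds.
print(resolve_data_dir(os.getcwd()))
```

Failing early with a clear message avoids harder-to-read errors later, when the loader tries to open annotation files under a missing directory.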

@mariosaenger
Collaborator

@shamikbose Are you still working on that?

@shamikbose
Contributor

@mariosaenger This is already implemented as a local dataset in #681. It's awaiting review.

phlobo added a commit that referenced this issue Dec 9, 2024
* Initial commit

Issues with `download_and_extract()`

* Use this version of craft.py to debug

Updated to show the code which was being run earlier.
Format is "CRAFT-5.0.0\concept-annotation\key\key"

* Removed print statements out of shame

* Implemented as a local dataset

- Passes all tests
- Warnings logged for multiple annotations

* Can be loaded with `load_datasets()`. Passes all tests

General changes:
- Updated paths to use `os.path.join()` to make it platform-agnostic
MONDO specific changes:
- Specific ways to read annotations
- Specific ways to find corresponding annotations

* Update craft.py

* Update craft.py

_PUBMED set to True

* refactor: Refactor and improve implementation of CRAFT to hub-style integration

* Fix license key

---------

Co-authored-by: Mario Sänger <[email protected]>
Co-authored-by: Florian Borchert <[email protected]>
@phlobo
Collaborator

phlobo commented Dec 9, 2024

Closed by #681

@phlobo phlobo closed this as completed Dec 9, 2024