Closes #67 - Add Monero #516

napsternxg · 2022-04-25T02:00:42Z

Fixes #67 - Add Monero

If the following information is NOT present in the issue, please populate:

Name: MoNERo
Description: MoNERo: a Biomedical Gold Standard Corpus for the Romanian Language for part of speech tagging and named entity recognition.
Paper: https://www.racai.ro/en/tools/text/
Data: https://github.com/bigscience-workshop/biomedical/files/8550757/MoNERo.tar.gz

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Tested via data loading.

hakunanatasha · 2022-04-27T05:06:20Z

@napsternxg The loader passes all the unit tests and loads fine, but I noticed something when I run the following:

from datasets import load_dataset
x = load_dataset("biodatasets/monero/monero.py", name="monero_bigbio_kb")["train"]["entities"][-1]

I find that the entities are all empty. Is this intended?

hakunanatasha · 2022-04-27T05:07:31Z

@napsternxg I also made a small change at the end of the file (near the main call).

napsternxg · 2022-04-30T04:18:14Z

Hi @hakunanatasha thanks. Let me have a look at this. I will address this by early next week.

napsternxg · 2022-05-05T05:57:30Z

Hi @hakunanatasha, I checked the entities and they are present. A document with no entities yields an empty list, and the last document happens to be one of those.
A better way to check is to look at several documents at once:

from datasets import load_dataset
data = load_dataset("biodatasets/monero/monero.py", name="monero_bigbio_kb")
data["train"]["entities"][-5:]

This outputs:

[[],
 [],
 [],
 [{'id': 'docid-4982-E0',
   'type': 'DISO',
   'text': ['hemipareză spastică'],
   'offsets': [[109, 128]],
   'normalized': []}],
 []]

This means that, of the last five documents, only the second-to-last contains an entity.
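To see how common empty documents are across a whole split, a small helper along these lines may be useful. It is only a sketch: `summarize_entities` is a hypothetical name, not part of the dataloader, and the sample below simply reuses the five entity lists printed above.

```python
from typing import Any, Dict, List

Entity = Dict[str, Any]


def summarize_entities(entities_per_doc: List[List[Entity]]) -> Dict[str, int]:
    """Count documents with and without entities, plus the total entity count."""
    docs_with = sum(1 for ents in entities_per_doc if ents)
    return {
        "documents": len(entities_per_doc),
        "documents_with_entities": docs_with,
        "documents_without_entities": len(entities_per_doc) - docs_with,
        "entities": sum(len(ents) for ents in entities_per_doc),
    }


# The entity lists of the last five documents, copied from the output above.
last_five = [
    [],
    [],
    [],
    [
        {
            "id": "docid-4982-E0",
            "type": "DISO",
            "text": ["hemipareză spastică"],
            "offsets": [[109, 128]],
            "normalized": [],
        }
    ],
    [],
]

summary = summarize_entities(last_five)
print(summary)
```

On the loaded dataset, the same helper could be called as `summarize_entities(data["train"]["entities"])`.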

I also added a fix for the entity offsets.
I think this PR is ready to merge.
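For the offsets, a quick consistency check along these lines can confirm that each span reproduces the entity text. This is a sketch under assumptions: `find_offset_mismatches` is a hypothetical helper, the offsets are assumed to be character positions into a single document string, and the document text below is made-up filler built only so that the mention sits at position 109.

```python
from typing import Any, Dict, List


def find_offset_mismatches(text: str, entities: List[Dict[str, Any]]) -> List[str]:
    """Return the ids of entities whose offsets do not reproduce their text."""
    mismatched = []
    for entity in entities:
        # Slice the document once per (start, end) pair and compare with the stored spans.
        spans = [text[start:end] for start, end in entity["offsets"]]
        if spans != entity["text"]:
            mismatched.append(entity["id"])
    return mismatched


# Hypothetical document text: 109 filler characters followed by the mention.
doc_text = "x" * 109 + "hemipareză spastică" + "."

entity = {
    "id": "docid-4982-E0",
    "type": "DISO",
    "text": ["hemipareză spastică"],
    "offsets": [[109, 128]],
    "normalized": [],
}

# Offsets that line up with the text produce no mismatches.
good = find_offset_mismatches(doc_text, [entity])

# Shifting the span by one character is reported as a mismatch.
shifted = dict(entity, offsets=[[110, 129]])
bad = find_offset_mismatches(doc_text, [shifted])

print(good, bad)
```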

mariosaenger · 2024-10-27T10:50:41Z

@phlobo I revised this dataset. Please have a look at it.

phlobo

LGTM!

napsternxg and others added 2 commits April 11, 2022 16:25

Fixes bigscience-workshop#67 - Add monero

ee234c5

Working setup for Monero.

26f7986

Tested via data loading.

napsternxg requested review from hakunanatasha, jason-fries, sunnnymskang, ruisi-su, galtay, leonweber, sg-wbi and debajyotidatta as code owners April 25, 2022 02:00

napsternxg mentioned this pull request Apr 25, 2022

Create a dataset loader for MoNERo #67

Closed

hakunanatasha self-assigned this Apr 27, 2022

fix: remove main call

9cee97b

Fixed entity offset

4591ef0

sg-wbi changed the title ~~Fixes #67 - Add Monero~~ Closes #67 - Add Monero May 9, 2022

mariosaenger assigned mariosaenger and unassigned hakunanatasha Oct 26, 2024

Mario Sänger added 3 commits October 27, 2024 10:36

Merge branch 'main' into monero

34ea4a1

refactor: Refactor and improve implementation of MoNERo to hub-style integration

b845500

style: Fix code formatting

974c531

mariosaenger requested a review from phlobo October 27, 2024 10:50

phlobo approved these changes Dec 9, 2024

phlobo merged commit 0435fcd into bigscience-workshop:main Dec 9, 2024
