Closes #261 #518
@MFreidank happy to help you run it; do you know if the full dataset is an aggregate of the individual years? If the individual years pass via `--subset_id` in the unit tests, then this is fine. I noticed this also requires a package that is not default
Thank you for offering your help. Yes, essentially the full dataset would be the aggregate over all individual years. Regarding Please let me know if any of the above is unclear.
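The aggregation described in this reply could look roughly like the sketch below. It is only an illustration, not the code in this PR: `aggregate_years`, `toy_loader`, and the `pmid` field are hypothetical names, and the real per-year loading logic is assumed to be supplied by the caller.

```python
from itertools import chain
from typing import Callable, Dict, Iterable, Iterator, Tuple

Example = Dict[str, str]


def _keyed(year: int, examples: Iterable[Example]) -> Iterator[Tuple[str, Example]]:
    """Prefix each example index with its year so keys stay unique across years."""
    for index, example in enumerate(examples):
        yield f"{year}_{index}", example


def aggregate_years(
    years: Iterable[int],
    load_year: Callable[[int], Iterable[Example]],
) -> Iterator[Tuple[str, Example]]:
    """Lazily chain the per-year examples into one stream (the "full" dataset)."""
    return chain.from_iterable(_keyed(year, load_year(year)) for year in years)


def toy_loader(year: int) -> Iterable[Example]:
    """Hypothetical stand-in for reading one year's file."""
    return [{"pmid": f"{year}-a"}, {"pmid": f"{year}-b"}]


for key, example in aggregate_years([2013, 2022], toy_loader):
    print(key, example)
```

Because the years are chained lazily, only one year's examples need to be in flight at a time, which matters when individual files are very large.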
@MFreidank good answers - 8h is a bit tough. Let me see if my machine can handle it. An on-the-fly json parser is overkill; your rationale is plenty enough to warrant a new package in the requirements!
Hi @hakunanatasha - any updates?
This PR implements a dataloader for BioASQ Task A (text task) in an attempt to close issue #261. I've attempted to stay close to `examples/bioasq_task_b.py` wherever possible. Please let me know if any changes are required.
Tagging @jason-fries as we discussed previously on the issue thread and I noticed he made some recent changes to `bioasq_task_b` that I also tried to match with my PR.
I have been able to confirm that my data loader works across years and also got unit test runs for individual years to pass (I tried 2022 and 2013). However, it's hard for me to do a single clean unit test run across all configurations, as dataset sizes are very large (>>10 GB for some of the files) and individual tests take a very long time to run on the machine I have access to.
Could someone help with testing?
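For anyone helping with testing, one way to split the work is to run the test suite once per year subset. The sketch below only builds the command lines. It assumes the `--subset_id` flag mentioned in the conversation; the script path is the placeholder from the checklist and the subset ids are hypothetical, so both need to be replaced with the real values.

```python
import shlex
from typing import List


def build_test_commands(script_path: str, subset_ids: List[str]) -> List[List[str]]:
    """Return one test invocation (as an argument list) per subset id."""
    return [
        ["python", "-m", "tests.test_bigbio", script_path, "--subset_id", subset_id]
        for subset_id in subset_ids
    ]


# Placeholder path from the checklist, not the real dataloader location.
SCRIPT = "biodatasets/my_dataset/my_dataset.py"
# Hypothetical subset ids; the actual naming scheme is defined by the dataloader.
SUBSET_IDS = [f"my_dataset_{year}" for year in (2013, 2022)]

commands = build_test_commands(SCRIPT, SUBSET_IDS)
for command in commands:
    # Print rather than execute; subprocess.run(command, check=True) from the
    # repository root would execute a single run.
    print(shlex.join(command))
```

Running the subsets separately lets several people (or machines) share the long-running configurations.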
If the following information is NOT present in the issue, please populate:
Checkbox
- Create the dataloader script `biodatasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_BIGBIO_VERSION` variables.
- Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `BigBioConfig` for the source schema and one for a bigbio schema.
- Confirm dataloader script works with `datasets.load_dataset` function.
- Confirm that your dataloader script passes the test suite run with `python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py`.
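The structural items in the checklist can be turned into a quick self-check before running the full test suite. The sketch below is an illustration only and does not replace `tests.test_bigbio`. The helper `checklist_problems` is hypothetical, and it assumes each config object exposes a `schema` attribute whose value is `"source"` or starts with `"bigbio"`.

```python
from types import SimpleNamespace
from typing import Any, Dict, List

REQUIRED_VARIABLES = [
    "_CITATION", "_DATASETNAME", "_DESCRIPTION", "_HOMEPAGE", "_LICENSE",
    "_URLs", "_SUPPORTED_TASKS", "_SOURCE_VERSION", "_BIGBIO_VERSION",
]
REQUIRED_METHODS = ["_info", "_split_generators", "_generate_examples"]


def checklist_problems(module_namespace: Dict[str, Any], builder_cls: type) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    for name in REQUIRED_VARIABLES:
        if name not in module_namespace:
            problems.append(f"missing module variable {name}")
    for name in REQUIRED_METHODS:
        if not callable(getattr(builder_cls, name, None)):
            problems.append(f"missing method {name}()")
    configs = getattr(builder_cls, "BUILDER_CONFIGS", None)
    if not isinstance(configs, list) or not configs:
        problems.append("BUILDER_CONFIGS must be a non-empty list")
    else:
        # Assumption: every config carries a `schema` attribute.
        schemas = [str(getattr(config, "schema", "")) for config in configs]
        if "source" not in schemas:
            problems.append("no config for the source schema")
        if not any(schema.startswith("bigbio") for schema in schemas):
            problems.append("no config for a bigbio schema")
    return problems


class ToyBuilder:
    """Stand-in for a dataloader class; SimpleNamespace mimics config objects."""
    BUILDER_CONFIGS = [
        SimpleNamespace(schema="source"),
        SimpleNamespace(schema="bigbio_text"),
    ]

    def _info(self): ...
    def _split_generators(self, dl_manager): ...
    def _generate_examples(self, filepath): ...


toy_namespace = {name: "placeholder" for name in REQUIRED_VARIABLES}
print(checklist_problems(toy_namespace, ToyBuilder))  # prints []
```

For a real dataloader, `vars(module)` of the imported script would be passed as `module_namespace`, together with the dataset class it defines.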