This repository contains the training data for Task 3 in TASS 2018. The files and folders are organized as follows:
- `trial` contains relevant files for the trial phase.
- `training` contains relevant files for the training phase.
- `training_example` contains an example training phase, as it appears on the website.
- `test` contains relevant files for the test phase.
- `develop` contains additional training files useful for fine-tuning or model selection.
- `score_training.py` is a Python 3 script that provides an evaluation useful for the training phase (see below).
- `score_test.py` is a Python 3 script that provides the exact same evaluation as used in the Codalab competition (see below).
In the trial phase, we are releasing an example input file and all the relevant output files, which participants can use to understand the competition workflow and the file formats. All the relevant files are located in the `trial` folder. In this folder you'll find three subfolders:
- `input` contains the input files that you will receive for each evaluation scenario. The purpose of this folder is to illustrate how participants should expect the input files to be structured in the test phase.
  - `scenario1-ABC` contains the input files for the scenario 1 evaluation.
    - `input_trial.txt` is plain text.
  - `scenario2-BC` contains the input files for the scenario 2 evaluation.
    - `input_trial.txt` is plain text.
    - `output_A_trial.txt` is the gold output for task A.
  - `scenario3-C` contains the input files for the scenario 3 evaluation.
    - `input_trial.txt` is plain text.
    - `output_A_trial.txt` is the gold output for task A.
    - `output_B_trial.txt` is the gold output for task B.
- `submit` contains the output files that you should submit for each evaluation scenario. The purpose of this folder is to illustrate how participants should submit their output to Codalab in the test phase.
  - `scenario1-ABC` contains the files to be submitted for the scenario 1 evaluation.
    - `output_A_trial.txt` is the expected output file for task A in scenario 1.
    - `output_B_trial.txt` is the expected output file for task B in scenario 1.
    - `output_C_trial.txt` is the expected output file for task C in scenario 1.
  - `scenario2-BC` contains the files to be submitted for the scenario 2 evaluation.
    - `output_B_trial.txt` is the expected output file for task B in scenario 2.
    - `output_C_trial.txt` is the expected output file for task C in scenario 2.
  - `scenario3-C` contains the files to be submitted for the scenario 3 evaluation.
    - `output_C_trial.txt` is the expected output file for task C in scenario 3.
- `gold` contains the reference files used by the competition evaluator to compare against and score the submitted files. These are just the plain text and outputs for each of the tasks.
During the training phase, the `training` folder will contain the reference files needed to train a model. Inside this folder you will find three subfolders:
- `input` contains all the `input_*.txt` files with plain text.
- `gold` contains all the reference `output_*.txt` files for the three tasks, which you should use to train your models.
- `submit` is an empty folder where you are expected to place your own `output_*.txt` files in order to use the supplied evaluation script.
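The folder layout above can be walked with a few lines of Python. The following is only a sketch: it assumes, based on the trial file names (`input_trial.txt`, `output_A_trial.txt`, and so on), that each `input_<name>.txt` has matching `output_A_<name>.txt`, `output_B_<name>.txt` and `output_C_<name>.txt` files in `gold`. The function name `pair_training_files` is hypothetical and not part of this repository.

```python
from pathlib import Path


def pair_training_files(training_dir="training"):
    """Map each input file to its gold outputs for tasks A, B and C.

    Assumption (not confirmed by this README): `input_<name>.txt` is matched
    by `output_A_<name>.txt`, `output_B_<name>.txt` and `output_C_<name>.txt`.
    """
    root = Path(training_dir)
    pairs = {}
    for input_path in sorted((root / "input").glob("input_*.txt")):
        name = input_path.stem[len("input_"):]
        gold = {
            task: root / "gold" / f"output_{task}_{name}.txt"
            for task in "ABC"
        }
        missing = [str(path) for path in gold.values() if not path.exists()]
        if missing:
            raise FileNotFoundError(f"missing gold files: {missing}")
        pairs[name] = {"input": input_path, **gold}
    return pairs
```

Calling `pair_training_files("training")` from the repository root would return one entry per input file, which can then be fed to whatever loader your model uses.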
The current version of the training dataset contains a total of 559 sentences and 5673 annotations. More details are provided in the next tables:

Entity | 3276 |
---|---|
Action | 849 |
Concept | 2427 |

Relation | 1012 |
---|---|
Is-a | 434 |
Part-of | 149 |
Property-of | 399 |
Same-as | 30 |

Roles | 1385 |
---|---|
Subjects | 599 |
Targets | 786 |
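
As a quick sanity check, the per-type counts in the tables add up to the group totals, and the three group totals add up to the 5673 annotations reported above. The snippet below only re-adds the numbers from the tables; the dictionary layout is an illustrative choice.

```python
# Counts copied from the tables above.
counts = {
    "Entity": {"Action": 849, "Concept": 2427},
    "Relation": {"Is-a": 434, "Part-of": 149, "Property-of": 399, "Same-as": 30},
    "Roles": {"Subjects": 599, "Targets": 786},
}

# Sum each group, then sum the groups.
totals = {group: sum(parts.values()) for group, parts in counts.items()}
grand_total = sum(totals.values())

print(totals)       # {'Entity': 3276, 'Relation': 1012, 'Roles': 1385}
print(grand_total)  # 5673
```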
The file `score_training.py` performs an automatic evaluation of your output files against the gold files. You can use this script to validate your technique(s). The metrics reported are exactly the same ones that will be used in the final evaluation. This script simply evaluates each pair of gold/dev files separately and outputs detailed information about all the mistakes. This file's output corresponds to the Development evaluation... sections in each of the subtasks.
To run it simply use:
python3 score_training.py [training-folder]
If the optional argument `training-folder` is provided, the files are looked for in that folder instead of the default `training`. You can use this option to test different variants or to see the evaluation for the example files, by running:
python3 score_training.py training_example
During the Training phase you are expected to submit the outputs your technique produces on the training set to Codalab, to test the workflow and get used to the formats. Please make sure to submit these files on your own at least once. To prepare a submission, zip the contents of the `submit` folder into a `.zip` file and send it through Codalab's interface. The `.zip` file should contain only the three `scenario*` folders with their respective content, as illustrated below:
    submission.zip/
        scenario1-ABC/
            output_A_*.txt
            output_B_*.txt
            output_C_*.txt
        scenario2-BC/
            output_B_*.txt
            output_C_*.txt
        scenario3-C/
            output_C_*.txt
Make sure not to mistakenly zip the `submit` folder itself, but only its contents.
For simplicity, there is a `Makefile` that creates the right `.zip` file. Running `make` inside the project's root folder should work.
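If `make` is not available, the same packaging rule (scenario folders at the root of the archive, never the `submit` folder itself) can be sketched in Python. This is not the `Makefile`'s actual recipe, which is not shown here; the function name `zip_submit_contents` and the default output name `submission.zip` are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def zip_submit_contents(submit_dir="submit", zip_path="submission.zip"):
    """Zip only the scenario* folders inside submit_dir.

    Archive names are relative to submit_dir, so the archive root holds
    scenario1-ABC/, scenario2-BC/ and scenario3-C/ directly.
    """
    submit = Path(submit_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for scenario in sorted(submit.glob("scenario*")):
            if not scenario.is_dir():
                continue
            for path in sorted(scenario.rglob("*")):
                if path.is_file():
                    # Relative name: "scenario1-ABC/output_A_x.txt", no "submit/" prefix.
                    archive.write(path, arcname=path.relative_to(submit).as_posix())
    return zip_path
```

Listing the result with `zipfile.ZipFile("submission.zip").namelist()` is a cheap way to confirm that no entry starts with `submit/` before uploading.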
Inside the `baseline` folder you will find a naive implementation of the whole process. This implementation simply counts the number of occurrences of all concepts, actions and relations, and uses these statistics to match the exact same occurrences. Hence, it can be used as a minimal baseline of the expected score in each evaluation scenario.
To run it, `cd` into the `baseline` folder and execute:
python3 main.py
(!) BEWARE that running this script will overwrite your `submit` folder with its output.
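To give a feel for what a counting baseline looks like, here is a minimal sketch of the idea for single-word entities only. It is not the code in `baseline/main.py`: the function names, the in-memory `(phrase, label)` format, and the toy sentences are all made up for illustration and do not reflect the corpus file format.

```python
from collections import Counter, defaultdict


def train_counts(annotated_phrases):
    """Count how often each phrase was tagged with each label."""
    counts = defaultdict(Counter)
    for phrase, label in annotated_phrases:
        counts[phrase.lower()][label] += 1
    return counts


def tag_sentence(sentence, counts):
    """Tag every word seen in training with its most frequent label."""
    tagged = []
    for token in sentence.lower().split():
        word = token.strip(".,;:()")
        if word in counts:
            label = counts[word].most_common(1)[0][0]
            tagged.append((word, label))
    return tagged


# Toy data, invented for this example (not taken from the corpus).
training = [
    ("asma", "Concept"),
    ("afecta", "Action"),
    ("asma", "Concept"),
    ("pulmones", "Concept"),
]
counts = train_counts(training)
print(tag_sentence("El asma afecta los pulmones.", counts))
# [('asma', 'Concept'), ('afecta', 'Action'), ('pulmones', 'Concept')]
```

Words never seen in training are left untagged, which is why a baseline of this kind can only recover exact matches.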
An additional 285 sentences are included in the `develop` folder. These sentences are also fully tagged, and are meant to be used for model selection and parameter tuning. We encourage participants to try different models, algorithms, and parameter settings. Each of these variants should be trained on the training corpus only, and its performance then measured on the development corpus, to select the best variant. This separation ensures a fair comparison among participants. Furthermore, comparing different models on a development corpus that is independent from the training corpus helps reduce the risk of overfitting, and will give you a more accurate estimate of the actual performance of your models.
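The selection procedure described above can be sketched as a small loop. Everything here is a placeholder: `variants` maps a name to a hypothetical fitting function, and `evaluate` stands for whatever scoring you use (for example, a wrapper around the supplied scoring script). The only point is that fitting sees the training data alone, and the development data is used solely for scoring.

```python
def select_best_variant(variants, train_data, dev_data, evaluate):
    """Fit each variant on train_data only, score it on dev_data, return the best.

    variants: dict mapping a name to a function fit(train_data) -> model
    evaluate: function evaluate(model, dev_data) -> score (higher is better)
    """
    scores = {}
    for name, fit in variants.items():
        model = fit(train_data)  # the development data is never used for fitting
        scores[name] = evaluate(model, dev_data)
    best = max(scores, key=scores.get)
    return best, scores
```

Returning the full `scores` dictionary alongside the winner makes it easy to see how close the variants were to each other.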
The file `score_test.py` performs the final evaluation exactly as described in the competition rules, i.e., according to the three evaluation scenarios presented. It assumes the reference files are in a `gold` subdirectory and the files to be submitted are in the `submit` folder, according to the folder structure presented there. This file's output is the one actually used in Codalab to rank competitors.
This script will output a file `score.txt` that contains the calculated metrics described in the Overall evaluation... section of the competition rules.
(!) The testing phase is already open in Codalab. This phase is blind reviewed, hence you won't be able to see your results until May 28th when all results will be published.
The `test` folder contains the test files divided into the corresponding scenarios, following the same structure as presented in the trial folder. For each evaluation scenario there is a single input file. Each file contains 100 sentences (300 sentences in total), randomly selected from the original corpus. None of these sentences has been published before in either the training or the development corpus. However, the test corpus has been built with care to guarantee a certain level of overlap (in terms of concepts and relations) with the training and development corpora, while also including brand-new concepts and relation tuples that do not appear in the training set.
- `input` contains the relevant input files:
  - `scenario1-ABC` contains only the `input_scenario1.txt` file.
  - `scenario2-BC` will contain input files and the corresponding `output_A_scenario2.txt` files.
  - `scenario3-C` will contain input files and the corresponding `output_A_scenario3.txt` files, and also the corresponding `output_B_scenario3.txt` files.
- `submit` will contain empty subfolders where you should place your output files:
  - `scenario1-ABC`, where you should place the `output_A_scenario1.txt`, `output_B_scenario1.txt` and `output_C_scenario1.txt` files for the scenario 1 evaluation.
  - `scenario2-BC`, where you should place the `output_B_scenario2.txt` and `output_C_scenario2.txt` files for the scenario 2 evaluation.
  - `scenario3-C`, where you should place the `output_C_scenario3.txt` files for the scenario 3 evaluation.
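Before zipping, it can help to check that every expected output file is in place. The sketch below uses the file names listed above and the `test/submit` location; the `EXPECTED` table and the function name `missing_outputs` are illustrative and not part of the repository.

```python
from pathlib import Path

# Expected output files per scenario, as listed in this README.
EXPECTED = {
    "scenario1-ABC": [
        "output_A_scenario1.txt",
        "output_B_scenario1.txt",
        "output_C_scenario1.txt",
    ],
    "scenario2-BC": ["output_B_scenario2.txt", "output_C_scenario2.txt"],
    "scenario3-C": ["output_C_scenario3.txt"],
}


def missing_outputs(submit_dir="test/submit"):
    """Return the relative paths of expected output files that are absent."""
    submit = Path(submit_dir)
    return [
        f"{scenario}/{name}"
        for scenario, names in EXPECTED.items()
        for name in names
        if not (submit / scenario / name).is_file()
    ]
```

An empty list from `missing_outputs()` means all six files are present; it says nothing about whether their contents are well-formed.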
During the Test phase you are expected to submit the final outputs to Codalab for grading. To prepare a submission, zip the contents of the `submit` folder into a `.zip` file and send it through Codalab's interface. The `.zip` file should contain only the three `scenario*` folders with their respective content, as illustrated below:
    test.zip/
        scenario1-ABC/
            output_A_scenario1.txt
            output_B_scenario1.txt
            output_C_scenario1.txt
        scenario2-BC/
            output_B_scenario2.txt
            output_C_scenario2.txt
        scenario3-C/
            output_C_scenario3.txt
This file is a sample `.zip` with exactly the trial output in the exact format that should be uploaded to Codalab.
Make sure not to mistakenly zip the `submit` folder itself, but only its contents.
For your convenience, there is a `Makefile` in the root project folder to help you prepare the submission data. Just run `make` and the content of the `test/submit` folder will be correctly zipped in the structure that Codalab is expecting.
If you use this data for research purposes, please don't forget to cite our paper:
Alejandro Piad-Morffis, Yoan Gutiérrez, Rafael Muñoz, A corpus to support eHealth Knowledge Discovery technologies, Journal of Biomedical Informatics, 2019, 103172, ISSN 1532-0464, https://doi.org/10.1016/j.jbi.2019.103172.
Here is a BibTeX entry:
@article{ehealthkd,
abstract = {This paper presents and describes eHealth-KD corpus. The corpus is a collection of 1173 Spanish health-related sentences manually annotated with a general semantic structure that captures most of the content, without resorting to domain-specific labels. The semantic representation is first defined and illustrated with example sentences from the corpus. Next, the paper summarizes the process of annotation and provides key metrics of the corpus. Finally, three baseline implementations, which are supported by machine learning models, were designed to consider the complexity of learning the corpus semantics. The resulting corpus was used as an evaluation scenario in TASS 2018 [1] and the findings obtained by participants are discussed. The eHealth-KD corpus provides the first step in the design of a general-purpose semantic framework that can be used to extract knowledge from a variety of domains.},
author = {Piad-Morffis, Alejandro and Guti{\'{e}}rrez, Yoan and Mu{\~{n}}oz, Rafael},
doi = {https://doi.org/10.1016/j.jbi.2019.103172},
issn = {1532-0464},
journal = {Journal of Biomedical Informatics},
keywords = { Knowledge Discovery, Spanish, Subject-Verb-Object, eHealth,Corpus},
pages = {103172},
title = {{A corpus to support eHealth Knowledge Discovery technologies}},
url = {http://www.sciencedirect.com/science/article/pii/S1532046419300905},
year = {2019}
}
Copyright (c) 2018 University of Alicante & University of Havana.
TASS-2018 Task3 eHealth KD Corpus is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://github.com/tass18-task3/data.
Corpus data has been gathered from MedlinePlus.gov and manually post-processed.
Government information at NLM Web sites is in the public domain. Public domain information may be freely distributed and copied, but it is requested that in any subsequent use the National Library of Medicine (NLM) be given appropriate acknowledgement. When using NLM Web sites, you may encounter documents, illustrations, photographs, or other information resources contributed or licensed by private individuals, companies, or organizations that may be protected by U.S. and foreign copyright laws. Transmission or reproduction of protected items beyond that allowed by fair use as defined in the copyright laws requires the written permission of the copyright owners. Specific NLM Web sites containing protected information provide additional notification of conditions associated with its use.