Development data for TASS-2018 Task 3: eHealth Knowledge Discovery

This repository contains the training data for Task 3 of TASS 2018. The files and folders are organized as follows:

  • trial contains relevant files for the trial phase.
  • training contains relevant files for the training phase.
  • training_example contains an example training phase, as it appears in the website.
  • test contains relevant files for the test phase.
  • develop contains additional training files useful for fine-tuning or model selection.
  • score_training.py is a Python 3 script that provides an evaluation useful for the training phase (see below).
  • score_test.py is a Python 3 script that provides the exact same evaluation as used in the Codalab competition (see below).

Trial phase

In the trial phase, we are releasing an example input file and all the relevant output files, which participants can use to understand the competition workflow and the file formats. All the relevant files are located in the trial folder. In this folder you'll find three subfolders:

  • input contains the input files that you will receive for each evaluation scenario. The purpose of this folder is to illustrate how participants should expect the input files to be structured in the test phase.

  • submit contains the output files that you should submit for each evaluation scenario. The purpose of this folder is to illustrate how participants should submit their output to Codalab in the test phase.

  • gold contains the reference files used by the competition evaluator to compare against and score the submitted files. These are just the plain text and outputs for each of the tasks.

Training phase

During the training phase the folder training will contain the reference files that are needed to train a model. Inside this folder you will find three sub-folders:

  • input contains all the input_*.txt files with plain text.
  • gold contains all the reference output_*.txt files for the three tasks, that you should use to train your models.
  • submit is an empty folder where you are expected to place your own output_*.txt files so that the supplied evaluation script can score them.

Training data statistics

The current version of the training dataset contains a total of 559 sentences and 5673 annotations. The breakdown by annotation type is as follows:

  • Entity: 3276
    • Action: 849
    • Concept: 2427
  • Relation: 1012
    • Is-a: 434
    • Part-of: 149
    • Property-of: 399
    • Same-as: 30
  • Roles: 1385
    • Subjects: 599
    • Targets: 786
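
As a quick sanity check on these figures, the short sketch below adds up the subtypes and confirms that the three groups account for the 5673 annotations stated above. The grouping of subtypes under Entity, Relation and Roles is an assumption inferred from the fact that the numbers add up; the variable names are illustrative only.

```python
# Annotation counts copied from the statistics above.
# Assumption: the subtypes are grouped under the total they add up to.
counts = {
    "Entity": {"Action": 849, "Concept": 2427},
    "Relation": {"Is-a": 434, "Part-of": 149, "Property-of": 399, "Same-as": 30},
    "Roles": {"Subjects": 599, "Targets": 786},
}

# Sum the subtypes of each group, then sum the groups.
group_totals = {group: sum(parts.values()) for group, parts in counts.items()}
total_annotations = sum(group_totals.values())

print(group_totals)       # {'Entity': 3276, 'Relation': 1012, 'Roles': 1385}
print(total_annotations)  # 5673
```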

Training evaluation

The file score_training.py performs an automatic evaluation of your output files against the gold files. You can use this script to validate your technique(s). The metrics reported are exactly the same ones that will be used in the final evaluation. The script evaluates each pair of gold/dev files separately and prints detailed information about every mistake. Its output corresponds to the Development evaluation... sections in each of the subtasks.

To run it simply use:

python3 score_training.py [training-folder]

If the optional argument training-folder is provided, the files are looked for in that folder instead of the default training folder. You can use this option to test different variants or to see the evaluation for the example files, by running:

python3 score_training.py training_example

Training submissions

During the Training phase you are expected to submit the outputs your technique produces on the training set to Codalab, to test the workflow and get used to the formats. Please make sure to try submitting these files on your own at least once. To prepare a submission, zip the contents of the submit folder into a .zip file and send it through Codalab's interface. The zip file should contain only the three scenario* folders with their respective content, as illustrated below:

submission.zip/
    scenario1-ABC/
        output_A_*.txt
        output_B_*.txt
        output_C_*.txt
    scenario2-BC/
        output_B_*.txt
        output_C_*.txt
    scenario3-C/
        output_C_*.txt

Make sure not to mistakenly zip the submit folder itself, but only its contents.

For simplicity, there is a Makefile that creates the right zip. Just running make inside the project's root folder should work.
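
If make is not available, the same layout can be produced by hand. The following is a minimal sketch, not the Makefile's actual recipe: the function name zip_folder_contents is hypothetical, and it simply stores every file with a path relative to the submit folder, so that the archive starts at the scenario* folders rather than at submit itself.

```python
import zipfile
from pathlib import Path


def zip_folder_contents(folder, zip_path):
    """Zip everything inside `folder`, storing paths relative to it.

    Hypothetical helper: the archive entries start at the scenario*
    folders, never at the enclosing folder itself.
    """
    folder = Path(folder)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            # Skip directories, and skip the archive if it lives inside `folder`.
            if path.is_file() and path.resolve() != zip_path.resolve():
                archive.write(path, arcname=path.relative_to(folder).as_posix())
    return zip_path
```

For example, zip_folder_contents("training/submit", "submission.zip") would be a typical call, assuming the script is run from the project's root folder.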

Baseline implementation

Inside the baseline folder you will find a naive implementation of the whole process. This implementation simply counts the number of occurrences of all concepts, actions and relations, and uses these statistics to match the exact same occurrences. Hence, it can be used as a minimal baseline of the expected score in each evaluation scenario.

To run it, cd into the baseline folder and execute:

python3 main.py

(!) BEWARE that running this script will overwrite your submit folder with its output.
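
To give a feel for the counting idea described above, here is a minimal sketch for entity labels only. It is not the code in the baseline folder: the (phrase, label) input shape, the function names and the toy Spanish sentence are all assumptions made for illustration, and the real input and output files are not parsed here.

```python
import re
from collections import Counter, defaultdict


def train_counts(annotations):
    """Count how often each phrase was annotated with each label.

    `annotations` is an assumed iterable of (phrase, label) pairs,
    not the repository's real file format.
    """
    counts = defaultdict(Counter)
    for phrase, label in annotations:
        counts[phrase.lower()][label] += 1
    return counts


def predict(sentence, counts):
    """Tag every known phrase that occurs verbatim in `sentence`.

    Returns sorted (start, end, label) tuples, using the label that was
    most frequent for that phrase in the training counts.
    """
    lowered = sentence.lower()
    found = []
    for phrase, label_counts in counts.items():
        label = label_counts.most_common(1)[0][0]
        # Whole-word match, so a short phrase is not found inside a longer word.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, lowered):
            found.append((match.start(), match.end(), label))
    return sorted(found)


# Made-up toy data, for illustration only.
counts = train_counts([("asma", "Concept"), ("afecta", "Action"), ("asma", "Concept")])
print(predict("El asma afecta las vías respiratorias.", counts))
```

A phrase never seen in training is never tagged, which is why a baseline of this kind only gives a lower bound on the expected score.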

Development corpus

An additional 285 sentences are included in the develop folder. These sentences are also fully tagged, and are meant to be used for model selection and parameter tuning. We encourage participants to try different models, algorithms, and parameter settings. Each of these variants should be trained on the training corpus only, and its performance then measured on the development corpus, to select the best variant. This separation ensures a fair comparison among participants. Furthermore, comparing different models on a development corpus that is independent from the training corpus helps reduce the risk of overfitting, and will give you a more accurate estimate of the actual performance of your models.
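
The selection procedure described above can be sketched as a small loop. Everything here is illustrative: select_best_variant, fit and score are hypothetical names, and the toy "model" (a numeric cutoff) merely stands in for whatever system you train. The only point is that fitting sees the training data alone, while the comparison uses the development data alone.

```python
def select_best_variant(variants, train_data, dev_data, fit, score):
    """Fit each variant on the training data and pick the best dev score."""
    best_name, best_score = None, float("-inf")
    for name, params in variants.items():
        model = fit(train_data, **params)   # training corpus only
        dev_score = score(model, dev_data)  # development corpus only
        if dev_score > best_score:
            best_name, best_score = name, dev_score
    return best_name, best_score


# Toy stand-ins so the sketch runs: a "model" is a cutoff on a number.
def fit(train_data, margin):
    positives = [x for x, label in train_data if label]
    return min(positives) - margin


def score(model, data):
    # Accuracy of the rule "x >= cutoff means positive".
    return sum((x >= model) == label for x, label in data) / len(data)


train = [(1, False), (2, False), (5, True), (6, True)]
dev = [(3, False), (4.5, True), (7, True)]
variants = {"tight": {"margin": 0}, "loose": {"margin": 1}}

print(select_best_variant(variants, train, dev, fit, score))  # ('loose', 1.0)
```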

Training score

The file score_test.py performs the final evaluation exactly as described in the competition rules, i.e., according to the three evaluation scenarios presented. It assumes the reference files are in a gold subdirectory and the files to be submitted are in the submit folder, according to the folder structure presented there. This file's output is the one actually used in Codalab to rank competitors.

This script will output a file score.txt that contains the calculated metrics described in the Overall evaluation... section of the competition rules.

Testing phase

(!) The testing phase is already open in Codalab. This phase is blind reviewed, hence you won't be able to see your results until May 28th when all results will be published.

The test folder contains the test files divided into the corresponding scenarios, following the same structure as presented in the trial folder. For each evaluation scenario there is a single input file. Each file contains 100 sentences (300 sentences in total), randomly selected from the original corpus. None of these sentences has been published before in either the training or the development corpus. However, the test corpus has been built with care to guarantee a certain level of overlap (in terms of concepts and relations) with the training and development corpora, while also including brand new concepts and relation tuples that do not appear in the training set.

  • input contains the relevant input files:
    • scenario1-ABC contains only the input_scenario1.txt.
    • scenario2-BC will contain input files and the corresponding output_A_scenario2.txt files.
    • scenario3-C will contain input files, the corresponding output_A_scenario3.txt files and also the corresponding output_B_scenario3.txt files.
  • submit will contain empty subfolders where you should place your output files:
    • scenario1-ABC where you should place the output_A_scenario1.txt, output_B_scenario1.txt and output_C_scenario1.txt files for the scenario 1 evaluation.
    • scenario2-BC where you should place the output_B_scenario2.txt and output_C_scenario2.txt files for the scenario 2 evaluation.
    • scenario3-C where you should place the output_C_scenario3.txt files for the scenario 3 evaluation.
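
Before zipping, it can help to confirm that every expected file is in place. The sketch below is one possible check, under the assumption that the file names are exactly those listed above; the helper name missing_outputs is hypothetical and is not part of the supplied scripts.

```python
from pathlib import Path

# Expected output files per scenario, taken from the list above.
EXPECTED = {
    "scenario1-ABC": [
        "output_A_scenario1.txt",
        "output_B_scenario1.txt",
        "output_C_scenario1.txt",
    ],
    "scenario2-BC": ["output_B_scenario2.txt", "output_C_scenario2.txt"],
    "scenario3-C": ["output_C_scenario3.txt"],
}


def missing_outputs(submit_dir):
    """Return the relative paths of expected output files that are absent."""
    submit_dir = Path(submit_dir)
    return [
        f"{scenario}/{name}"
        for scenario, names in EXPECTED.items()
        for name in names
        if not (submit_dir / scenario / name).is_file()
    ]
```

Calling missing_outputs("test/submit") from the project's root folder returns an empty list when all six files are present.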

Submissions

During the Test phase you are expected to submit the final outputs to Codalab for grading. To prepare a submission, zip the contents of the submit folder into a .zip file and send it through Codalab's interface. The zip file should contain only the three scenario* folders with their respective content, as illustrated below:

test.zip/
    scenario1-ABC/
        output_A_scenario1.txt
        output_B_scenario1.txt
        output_C_scenario1.txt
    scenario2-BC/
        output_B_scenario2.txt
        output_C_scenario2.txt
    scenario3-C/
        output_C_scenario3.txt

This file is a sample .zip with exactly the trial output in the exact format that should be uploaded to Codalab.

Make sure not to mistakenly zip the submit folder itself, but only its contents.

For your convenience, there is a Makefile in the root project folder to help you prepare the submission data. Just run make and the content of the test/submit folder will be correctly zipped in the structure that Codalab is expecting.

Citation

If you use this data for research purposes, please don't forget to cite our paper:

A corpus to support eHealth Knowledge Discovery technologies

Alejandro Piad-Morffis, Yoan Gutiérrez, Rafael Muñoz,
A corpus to support eHealth Knowledge Discovery technologies,
Journal of Biomedical Informatics,
2019,
103172,
ISSN 1532-0464,
https://doi.org/10.1016/j.jbi.2019.103172.

Here is a BibTeX entry:

@article{ehealthkd,
abstract = {This paper presents and describes eHealth-KD corpus. The corpus is a collection of 1173 Spanish health-related sentences manually annotated with a general semantic structure that captures most of the content, without resorting to domain-specific labels. The semantic representation is first defined and illustrated with example sentences from the corpus. Next, the paper summarizes the process of annotation and provides key metrics of the corpus. Finally, three baseline implementations, which are supported by machine learning models, were designed to consider the complexity of learning the corpus semantics. The resulting corpus was used as an evaluation scenario in TASS 2018 [1] and the findings obtained by participants are discussed. The eHealth-KD corpus provides the first step in the design of a general-purpose semantic framework that can be used to extract knowledge from a variety of domains.},
author = {Piad-Morffis, Alejandro and Guti{\'{e}}rrez, Yoan and Mu{\~{n}}oz, Rafael},
doi = {10.1016/j.jbi.2019.103172},
issn = {1532-0464},
journal = {Journal of Biomedical Informatics},
keywords = {Knowledge Discovery, Spanish, Subject-Verb-Object, eHealth, Corpus},
pages = {103172},
title = {{A corpus to support eHealth Knowledge Discovery technologies}},
url = {http://www.sciencedirect.com/science/article/pii/S1532046419300905},
year = {2019}
}

License

Creative Commons License
Copyright (c) 2018 University of Alicante & University of Havana.
TASS-2018 Task3 eHealth KD Corpus is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://github.com/tass18-task3/data.

External resources from MedlinePlus

Corpus data has been gathered from MedlinePlus.gov and manually post-processed.

NLM Copyright Information

Government information at NLM Web sites is in the public domain. Public domain information may be freely distributed and copied, but it is requested that in any subsequent use the National Library of Medicine (NLM) be given appropriate acknowledgement. When using NLM Web sites, you may encounter documents, illustrations, photographs, or other information resources contributed or licensed by private individuals, companies, or organizations that may be protected by U.S. and foreign copyright laws. Transmission or reproduction of protected items beyond that allowed by fair use as defined in the copyright laws requires the written permission of the copyright owners. Specific NLM Web sites containing protected information provide additional notification of conditions associated with its use.