Annotation Data Format
This page describes the annotation format used by brat.
For the BioNLP 2009 Shared Task on Event Extraction, a plain-text stand-off format was introduced in place of an XML-based format. brat derives its own format from it and aims to remain backwards compatible.
This specification exists to avoid the case where the format is defined by the implementation (I am looking at you, MediaWiki). When an implementation differs from what is described in this specification, the implementation is wrong, not the specification. Suggestions for additions are of course welcome, provided they are motivated. Try to weave such changes into this page and keep the version section up-to-date.
This section specifies the format.
Identifiers are used to reference one annotation line from another and have the following format:
([A-Za-z]|#)([0-9]+)(.*)
The id consists of three groups:
- Type specifier: Has semantic implications for the interpretation of the id
- Number: A running number used to differentiate between ids of the same type
- Tail: A free-text tail that is to have no semantic implications for the id
NOTE: Why do we even allow the tail in the first place? Shouldn't we discourage using the identifier as a place to store information? The same goes for the hash: why was it necessary to have ids for comments?
Not all annotations have identifiers; for this special case the wild-card identifier * is used (see Equivalence for an example of such an annotation).
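The identifier rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not brat's own parser: the names `ID_PATTERN`, `WILDCARD_ID` and `parse_id` are hypothetical, and the sketch assumes the pattern must match the whole identifier.

```python
import re

# Pattern copied from the specification:
# type specifier, running number, free-text tail.
ID_PATTERN = re.compile(r"([A-Za-z]|#)([0-9]+)(.*)")

# Wild-card identifier used by annotations that have no id of their own.
WILDCARD_ID = "*"


def parse_id(identifier):
    """Split an identifier into (type_specifier, number, tail).

    Returns None for the wild-card identifier, which has no groups.
    Raises ValueError if the identifier does not match the pattern.
    """
    if identifier == WILDCARD_ID:
        return None
    # Assumption: the whole identifier must match, not just a prefix.
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a valid identifier: {identifier!r}")
    type_specifier, number, tail = match.groups()
    return type_specifier, int(number), tail
```

For example, `parse_id("T12")` yields the type specifier `"T"`, the number `12` and an empty tail, while `parse_id("*")` returns `None`.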
A text-bound annotation marks a span of text and assigns a type to it.
${ID}\t${TYPE} ${START} ${END}\t${TEXT}${COMMENT}
The following restrictions apply to a text-bound annotation:
- Must have an id (${ID}) with a leading T
- Must have a type (${TYPE}) which may contain any non-space character
- Must have both the marked span (${START} and ${END}) and the text contained in that span (${TEXT})
- May have a comment (${COMMENT}) trailing after the ${TEXT} segment that
A difference from the BioNLP'09 ST format is that the ${TEXT} component is mandatory for brat to function properly; it also enables more extensive sanity checking than if it were left out.
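The sanity check that the mandatory ${TEXT} component makes possible might look like the following sketch. It rests on several assumptions that this page does not state: the line carries no trailing ${COMMENT}, ${START} and ${END} are character offsets into the document, and ${END} is exclusive. The names `TextBound`, `parse_text_bound` and `span_matches` are hypothetical.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str
    ann_type: str
    start: int
    end: int
    text: str


def parse_text_bound(line):
    """Parse one text-bound annotation line (assumed to have no comment)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    ann_id, type_and_span, text = fields
    if not ann_id.startswith("T"):
        raise ValueError(f"text-bound id must start with 'T': {ann_id!r}")
    parts = type_and_span.split(" ")
    if len(parts) != 3 or not parts[0]:
        raise ValueError(f"expected 'TYPE START END', got {type_and_span!r}")
    ann_type, start, end = parts
    return TextBound(ann_id, ann_type, int(start), int(end), text)


def span_matches(annotation, document):
    """Check that the stored text equals the text found at the span."""
    # Assumption: END is exclusive, so the span is document[start:end].
    return document[annotation.start:annotation.end] == annotation.text
```

With the document `"Protein kinase C is active."`, the line `"T1\tProtein 0 16\tProtein kinase C"` parses and passes `span_matches`; changing either offset makes the check fail, which is the kind of error the stored text is meant to catch.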
TBD
${ID}\t${TYPE} [${ARGUMENT}:${PARTICIPANT}...]
An equivalence signifies logical equivalence between annotations or entities.
*\tEquiv [${MEMBER}...]
- Must have at least one member ${MEMBER} which is an id of another annotation
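A minimal sketch of reading an equivalence line follows. It assumes that the square brackets in the format above are notation for a repeated element rather than literal characters, and that members are separated by whitespace; neither detail is stated on this page. The function name `parse_equiv` is hypothetical.

```python
def parse_equiv(line):
    """Return the list of member ids from an equivalence line."""
    fields = line.rstrip("\n").split("\t")
    # Equivalences carry the wild-card identifier instead of an id.
    if len(fields) != 2 or fields[0] != "*":
        raise ValueError("equivalence must start with '*' followed by a tab")
    tokens = fields[1].split()
    if not tokens or tokens[0] != "Equiv":
        raise ValueError("expected the 'Equiv' keyword after the tab")
    members = tokens[1:]
    if not members:
        raise ValueError("equivalence must have at least one member")
    return members
```

Under these assumptions, `parse_equiv("*\tEquiv T1 T2")` returns the two member ids, and a line with no members is rejected.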
A modifier annotation applies a binary modification, such as speculation or negation, to another annotation that has an id.
${ID}\t${TYPE} ${TARGET}
- Must have an id (${ID}) with a leading M
- Must have a type (${TYPE}) which may contain any non-space character
- Must have a target (${TARGET}) which is a valid id for another annotation
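The three restrictions on modifiers can be checked with a sketch like the one below. The function name `parse_modifier` and its `known_ids` parameter are hypothetical; the sketch treats a target as valid when its id appears in a set of ids already collected from the same annotation file, which is one possible reading of "valid id" and not something this page defines.

```python
def parse_modifier(line, known_ids):
    """Parse a modifier line and return (mod_id, mod_type, target)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-separated fields, got {len(fields)}")
    mod_id, body = fields
    if not mod_id.startswith("M"):
        raise ValueError(f"modifier id must start with 'M': {mod_id!r}")
    parts = body.split(" ")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'TYPE TARGET', got {body!r}")
    mod_type, target = parts
    # Assumption: a target is valid if some other annotation has that id.
    if target not in known_ids:
        raise ValueError(f"unknown target id: {target!r}")
    return mod_id, mod_type, target
```

For instance, `parse_modifier("M1\tNegation E1", {"T1", "E1"})` succeeds, while the same line with a `known_ids` set that lacks `E1` raises `ValueError`.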
TBD