emorynlp/character-identification

Character Identification

Character Identification is an entity linking task that identifies the global entity of each personal mention in multiparty dialogue. Let a mention be a nominal referring to a person (e.g., she, mom, Judy), and an entity be a character in a dialogue. The goal is to assign each mention to its entity, who may or may not participate in the dialogue. In the following example, the mention "mom" does not refer to any of the speakers; nonetheless, it clearly refers to a specific person, Judy Geller, who could appear in some other dialogue. Identifying such mentions as real characters requires cross-document entity resolution, which makes this task challenging.

Character Identification Example

This task is a part of the Character Mining project led by the Emory NLP research group.

Dataset

All personal mentions are annotated with their global entities. In the above example, the first mention "I" is annotated with its global entity, Ross Geller, the second mention "mom" is annotated with Judy Geller, and so on. Mention detection is first performed automatically and then corrected manually. The entity annotation is mostly crowdsourced, although many of the annotations were corrected manually by experts.

Statistics

For each season, episodes 1 to 19 are used for training (TRN), episodes 20 and 21 for development (DEV), and episodes 22 through the end of the season for evaluation (TST).

Dataset Episodes Scenes Utterances Tokens Speakers Mentions Entities
TRN 76 987 18,789 262,650 265 36,385 628
DEV 8 122 2,142 28,523 48 3,932 102
TST 13 192 3,597 50,232 91 7,050 165
Total 97 1,301 24,528 341,405 331 47,367 781
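
The split rule stated above the table can be sketched as a small helper. This is only an illustration, not code from this repository: the function names are hypothetical, and it assumes that the episode number is the `eNN` field of an utterance ID shaped like the `s01_e01_c01_u039` ID shown in the Annotation section.

```python
import re

# Assumed ID layout: s<season>_e<episode>_c<scene>_u<utterance>
_UTTERANCE_ID = re.compile(r"s(\d+)_e(\d+)_c(\d+)_u(\d+)")


def split_for_episode(episode: int) -> str:
    """Map an episode number to TRN, DEV, or TST (hypothetical helper)."""
    if episode < 1:
        raise ValueError(f"episode numbers start at 1, got {episode}")
    if episode <= 19:
        return "TRN"  # episodes 1 to 19
    if episode <= 21:
        return "DEV"  # episodes 20 and 21
    return "TST"      # episode 22 through the end of the season


def split_for_utterance(utterance_id: str) -> str:
    """Read the episode number out of an utterance ID and map it to a split."""
    match = _UTTERANCE_ID.fullmatch(utterance_id)
    if match is None:
        raise ValueError(f"unexpected utterance ID: {utterance_id!r}")
    return split_for_episode(int(match.group(2)))


if __name__ == "__main__":
    print(split_for_utterance("s01_e01_c01_u039"))
```

Because the rule depends only on the episode number, the same thresholds apply to every season.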

Annotation

Each utterance is split into sentences, and the personal mentions in every sentence are annotated with their entities. In the example below, the utterance consists of one sentence containing four mentions. The first three mentions, I, mom and dad, are singular and refer to Ross Geller, Judy Geller and Jack Geller, respectively. The last mention, they, is plural and refers to both Judy Geller and Jack Geller.

{
  "utterance_id": "s01_e01_c01_u039",
  "speakers": ["Ross Geller"],
  "transcript": "I told mom and dad last night, they seemed to take it pretty well.",
  "tokens": [
    ["I", "told", "mom", "and", "dad", "last", "night", ",", "they", "seemed", "to", "take", "it", "pretty", "well", "."]
  ],
  "character_entities": [
    [[0, 1, "Ross Geller"], [2, 3, "Judy Geller"], [4, 5, "Jack Geller"], [8, 9, "Jack Geller", "Judy Geller"]]
  ]
}

Each mention is annotated by the following scheme:

[begin_index, end_index, entity(, entity)*]
  • begin_index: int - the beginning token index of the mention (inclusive).
  • end_index: int - the ending token index of the mention (exclusive).
  • entity: str - the label of the entity.
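
As a minimal sketch of how this scheme can be read, the snippet below pairs each sentence's tokens with its mention annotations and slices out the mention text using the inclusive begin index and exclusive end index. The record is the example utterance from above; the function name `extract_mentions` is hypothetical and not part of this repository.

```python
import json

# The example utterance from the Annotation section.
record = json.loads("""
{
  "utterance_id": "s01_e01_c01_u039",
  "speakers": ["Ross Geller"],
  "transcript": "I told mom and dad last night, they seemed to take it pretty well.",
  "tokens": [
    ["I", "told", "mom", "and", "dad", "last", "night", ",", "they",
     "seemed", "to", "take", "it", "pretty", "well", "."]
  ],
  "character_entities": [
    [[0, 1, "Ross Geller"], [2, 3, "Judy Geller"], [4, 5, "Jack Geller"],
     [8, 9, "Jack Geller", "Judy Geller"]]
  ]
}
""")


def extract_mentions(utterance):
    """Return (mention_text, entities) pairs for one utterance."""
    mentions = []
    # "tokens" and "character_entities" are parallel lists, one item per sentence.
    for sent_tokens, sent_entities in zip(
        utterance["tokens"], utterance["character_entities"]
    ):
        # Each annotation is [begin_index, end_index, entity(, entity)*].
        for begin, end, *entities in sent_entities:
            text = " ".join(sent_tokens[begin:end])  # end index is exclusive
            mentions.append((text, entities))
    return mentions


if __name__ == "__main__":
    for text, entities in extract_mentions(record):
        print(f"{text!r} -> {entities}")
```

A plural mention such as they simply carries more than one entity label, so the starred unpacking (`*entities`) handles singular and plural mentions the same way.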

Citation

References

Shared Task

Contact