Using Named Entity Recognition as a Discovery Tool
(For a good, broad overview of Named Entity Recognition (NER), please see Wikipedia.)
Named Entity Recognition comprises two tasks:
- Discovering names (Name Detection) in a source document
- Classifying those names by the kind of entity to which they refer (people, places, organizations, dates, etc.)
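To make the split between the two tasks concrete, here is a minimal sketch in Python. It is an illustration only: the `Entity` record, the capitalization regex, and the tiny `GAZETTEER` lookup table are all assumptions made for this example, not part of any particular NER tool.

```python
import re
from dataclasses import dataclass

@dataclass
class Entity:
    text: str               # the name as it appears in the source
    start: int              # character offset where the name begins
    end: int                # character offset just past the name
    label: str = "UNKNOWN"  # filled in by the classification step

# Assumption: a name is one or more capitalized words separated by single spaces.
NAME_RE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")

# Hypothetical lookup table standing in for a real classifier.
GAZETTEER = {"Herman Melville": "PERSON", "New Bedford": "PLACE"}

def detect(text):
    """Task 1: find candidate names and record where they occur."""
    return [Entity(m.group(), m.start(), m.end()) for m in NAME_RE.finditer(text)]

def classify(entities, gazetteer):
    """Task 2: assign each detected name an entity type."""
    for entity in entities:
        entity.label = gazetteer.get(entity.text, "UNKNOWN")
    return entities

entities = classify(detect("Herman Melville sailed from New Bedford in 1841."), GAZETTEER)
for entity in entities:
    print(entity.text, entity.start, entity.end, entity.label)
```

Keeping detection and classification as separate functions mirrors the two-task description: a name can be detected correctly and still be left as `UNKNOWN` when the classifier has no evidence for its type.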
In Named Entity Linking (NEL), a third task is added: associating a named entity with a referent in some authority database.
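The linking step can be sketched as a lookup against an authority file. Everything in the example below is hypothetical: the `AUTHORITY` records, the `auth:` identifiers, and the `link_entity` helper are placeholders invented for illustration, and a real authority database would be far larger and would need better matching than exact comparison of normalized strings.

```python
# Hypothetical authority records; the identifiers are placeholders,
# not entries in any real authority database.
AUTHORITY = {
    "auth:0001": {
        "label": "Herman Melville",
        "type": "person",
        "variants": {"herman melville", "melville, herman"},
    },
    "auth:0002": {
        "label": "New Bedford",
        "type": "place",
        "variants": {"new bedford"},
    },
}

def normalize(name):
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())

def link_entity(name, entity_type, authority=AUTHORITY):
    """Return the identifier of the single matching record, or None.

    A name that matches no record, or more than one, is left unlinked.
    """
    key = normalize(name)
    matches = [
        identifier
        for identifier, record in authority.items()
        if record["type"] == entity_type and key in record["variants"]
    ]
    return matches[0] if len(matches) == 1 else None

print(link_entity("Melville, Herman", "person"))
print(link_entity("Nantucket", "place"))
```

Passing the entity type alongside the name shows why classification comes before linking: the type narrows the set of authority records that can match.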
In a typical pipeline, one program segments a source text into tokens (words, spaces, punctuation marks, etc.), and a second program groups those tokens (singly or in contiguous runs) into lexical items, based on orthography (such as capitalization) and syntax. This segmentation task can be improved with machine learning, in which a model is trained to recognize particular patterns as names.
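A rule-based version of this two-stage pipeline might look like the following sketch. The regular expression, the function names, and the sentence-start rule are assumptions chosen for illustration. The example uses capitalization only, with no syntactic parsing and no machine learning.

```python
import re

# Stage 1: a word, a run of whitespace, or a single punctuation mark
# each becomes one token.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

def detect_names(tokens):
    """Stage 2: group contiguous capitalized words into candidate names."""
    names = []
    run = []              # capitalized words collected so far
    run_at_start = False  # did the current run begin a sentence?
    sentence_start = True

    def flush():
        # A lone capitalized word at the start of a sentence is ambiguous
        # (it may be capitalized only because of its position), so drop it.
        if run and not (len(run) == 1 and run_at_start):
            names.append(" ".join(run))
        run.clear()

    for tok in tokens:
        if tok.isspace():
            continue  # whitespace does not interrupt a run
        if tok[0].isupper():
            if not run:
                run_at_start = sentence_start
            run.append(tok)
        else:
            flush()
        sentence_start = tok in {".", "!", "?"}
    flush()
    return names

text = "In 1841 Herman Melville sailed from New Bedford. He never returned."
print(detect_names(tokenize(text)))
# prints ['Herman Melville', 'New Bedford']
```

The heuristic is deliberately crude: it would miss a single-word name that opens a sentence, and it would absorb a sentence-initial word such as "The" into the name that follows it. Gaps of this kind are where a trained model can improve on fixed rules.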
The NER task is complicated when the source text is derived from uncorrected OCR. Basic tokenization is often hampered, as are orthographic pattern-matching and syntactic parsing.
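The effect is easy to reproduce with a toy example. The "noisy" string below is invented for illustration and is not real OCR output. It imitates three common kinds of damage: a character confusion ("m" read as "ni"), a spurious space inside a word, and a lost capital letter.

```python
import re

WORD_RE = re.compile(r"\w+")

def capitalized_words(text):
    """Return the word tokens that begin with an uppercase letter."""
    return [word for word in WORD_RE.findall(text) if word[0].isupper()]

clean = "Herman Melville sailed from New Bedford."
noisy = "Hernian Mel ville sailed from new Bedf0rd."  # invented OCR-style errors

print(capitalized_words(clean))
# prints ['Herman', 'Melville', 'New', 'Bedford']
print(capitalized_words(noisy))
# prints ['Hernian', 'Mel', 'Bedf0rd']
```

In the noisy version the tokenizer splits one surname into two tokens, the capitalization cue for "New" is gone, and none of the surviving strings would match a clean list of known names.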