
Using Named Entity Recognition as a Discovery Tool

Cliff Wulfman edited this page Mar 7, 2022 · 3 revisions

(For a good, broad overview of Named Entity Recognition (NER), please see Wikipedia.)

Named Entity Recognition comprises two tasks:

  • Discovering names (Name Detection) in a source document
  • Classifying those names by the kind of entity to which they refer (people, places, organizations, dates, etc.)

In Named Entity Linking (NEL), a third task is added: associating a named entity with a referent in some authority database.
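
As a rough illustration of how the three tasks relate, the sketch below models a detected name, its class, and an optional link to an authority record. The `Entity` class, the `link` function, the `AUTHORITY` table, and its identifiers are all hypothetical placeholders, not part of any real authority database or NER library.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical authority table: normalized name -> made-up identifier.
AUTHORITY = {
    "james joyce": "authority:person/0001",
    "dublin": "authority:place/0001",
}


@dataclass
class Entity:
    text: str                  # the name as found in the source (detection)
    label: str                 # the kind of entity (classification)
    ref: Optional[str] = None  # authority identifier, if any (linking)


def link(entity: Entity) -> Entity:
    """Look up the entity's normalized name in the toy authority table."""
    key = " ".join(entity.text.lower().split())
    return Entity(entity.text, entity.label, AUTHORITY.get(key))
```

A name with no match in the table keeps `ref` as `None`, which mirrors the common case in which a detected entity has no record in the chosen authority.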

In a typical pipeline, one program segments a source text into tokens (words, spaces, punctuation marks, etc.), and a second program groups tokens (or contiguous runs of tokens) into lexical items, based on orthography (such as capitalization) and syntax. This grouping step can be improved with machine learning, in which an algorithm is trained to recognize particular patterns as names.
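
The two stages can be sketched in a few lines of Python. This is a deliberately naive illustration that uses only a regular expression and a capitalization rule; the function names `tokenize` and `candidate_names` are assumptions made for the example, and real NER tools rely on syntax and trained models as well.

```python
import re

# A token is a run of word characters, a run of whitespace,
# or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text):
    """Stage 1: split the source text into tokens."""
    return TOKEN_RE.findall(text)


def candidate_names(tokens):
    """Stage 2: group contiguous capitalized words into candidate names."""
    names, current = [], []
    for tok in tokens:
        if tok.isspace():
            continue  # whitespace does not end a run of capitalized words
        if tok[0].isupper():
            current.append(tok)
        else:
            # A lowercase word, number, or punctuation mark ends the run.
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


print(candidate_names(tokenize("Ezra Pound met James Joyce in Paris.")))
# prints ['Ezra Pound', 'James Joyce', 'Paris']
```

A rule this simple also flags any capitalized sentence-initial word, which is one reason orthography alone is not enough and why syntax and learned patterns are brought in.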

The NER task is complicated when the source text is derived from uncorrected OCR. Basic tokenization is often hampered, as are orthographic pattern-matching and syntactic parsing.
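
A small sketch can make the problem concrete. The noisy string below is invented for illustration (a split word, a digit substituted for a letter, and a lost capital); it is not taken from any real OCR output, and `capitalized_words` is a hypothetical helper.

```python
import re


def capitalized_words(text):
    """Return alphabetic tokens that begin with an uppercase letter."""
    # [^\W\d_]+ matches runs of letters only (no digits or underscores).
    words = re.findall(r"[^\W\d_]+", text)
    return [w for w in words if w[0].isupper()]


clean = "James Joyce lived in Trieste."
noisy = "Jam es J0yce lived in trieste."  # invented OCR-style errors

print(capitalized_words(clean))  # prints ['James', 'Joyce', 'Trieste']
print(capitalized_words(noisy))  # prints ['Jam', 'J']
```

In the noisy version the tokenizer splits both names into fragments and the place name loses its capital, so a capitalization rule finds none of the three names intact.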
