feat(docs): update README (#48)

pelias · Jun 11, 2019 · 1e54a28
1 parent 4aa9e3e
commit 1e54a28
Showing 1 changed file with 178 additions and 81 deletions.
diff --git a/README.md b/README.md
@@ -1,111 +1,208 @@
-| WARNING: This repo is a work in progress! |
-| --- |
+# A natural language classification engine for geocoding
 
-An address parsing engine designed for geocoding.
+This library contains primitive 'building blocks' which can be composed together to produce a powerful and flexible natural language parser.
 
-Strategic goals:
-- Seperate unit, housenumber, road from 'everything else'
-- Does not require a corpus of 'real places' to operate
-- Do not attempt to classify administrative areas
-- Provide multiple solutions in the case of ambiguous parses
-- Basic typo correction
-- Honour delimiters
-- Extensible to handle queries such as 'pizza near new york'
-- Record offsets to the original token positions in the input text
-- Pluggable classifiers
-- Support for partially complete 'autocomplete' tokens 
+The project was designed and built to work with the [Pelias geocoder](https://github.com/pelias/pelias), so it comes bundled with a parser called `AddressParser` which can be included in other npm projects independently of Pelias.
 
-#### CLI
+It is also possible to modify the configuration of `AddressParser`, the dictionaries or the semantics. You can also easily create a completely new parser to suit your own domain.
+
+[![NPM](https://nodei.co/npm/pelias-parser.png?downloads=true&stars=true)](https://nodei.co/npm/pelias-parser)
+
+[![Gitter](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/pelias/pelias)
+
+## AddressParser Example
+
+```
+30 w 26 st nyc 10010
+
+(0.95) ➜ [
+  { housenumber: '30' },
+  { street: 'w 26 st' },
+  { locality: 'nyc' },
+  { postcode: '10010' }
+]
+```
+
+## Application Interfaces
+
+You can access the library via three different interfaces:
+- all parts of the codebase are available in `javascript` via `npm`
+- on the `command line` via the `node bin/cli.js` script
+- through a `web service` via the `node server/http.js` script
+
+> the web service provides an interactive demo at the URL `/parser/parse`
+
+## Quick Start
+
+A quick and easy way to get started with the library is to use the command-line interface:
 
 ```
 node bin/cli.js West 26th Street, New York, NYC, 10010
 ```
 
 ![cli](./docs/cli.png)
 
-#### Server
+---
 
-```bash
-$ PORT=6100 npm run server;
-```
+# Architecture Description
 
-![demo](./docs/demo.png)
-![api](./docs/api.png)
+Please refer to the CLI screenshot above for a visual reference.
 
-### open browser
+## Tokenization
 
-the server should now be running and you should be able to access the http API:
+Tokenization is the process of splitting text into individual words.
 
-```bash
-http://localhost:6100/
-```
+The splitting process used by the engine maintains token positions, so it's able to 'remember' where each character was in the original input text.
+
+> Tokenization is coloured `blue` on the command-line.
+
+### Span
+
+The most primitive element is called a `span`. It is essentially just a single string of text with some metadata attached.
 
-try the following paths:
+The terms `word`, `phrase` and `section` (explained below) are all just ways of using a `span`.
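As a rough illustration, a `span` can be pictured as a slice of the input together with the offsets it was taken from. This is a minimal sketch only: the `makeSpan` helper and the `body`, `start` and `end` field names are assumptions made for this example and are not taken from the library.

```javascript
// Hypothetical span: a slice of the input plus the offsets it came from.
// The field names (body, start, end) are illustrative assumptions.
function makeSpan (input, start, end) {
  return { body: input.slice(start, end), start, end }
}

const input = '30 w 26 st nyc 10010'
const span = makeSpan(input, 5, 10)

console.log(span) // { body: '26 st', start: 5, end: 10 }
```

Because the offsets travel with the text, any later stage can map a result back to the exact characters of the original query.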
+
+### Section Boundaries
+
+Some parsers like [libpostal](https://github.com/openvenues/libpostal) ignore characters such as `comma`, `tab`, `newline` and `quote`.
+
+While it's unrealistic to expect commas to always be present, it's very useful to record their positions when they are.
+
+These boundary positions help to avoid parsing errors, such as the query `Main St, East Village` being parsed as `Main St East` in `Village`.
+
+Once sections are established there is no 'bleeding' of information between sections, avoiding the issue above.
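The idea of recording boundary positions could be sketched as follows. This is not the library's implementation: `splitSections` is a hypothetical helper, and the set of boundary characters is simply the one named in the text above (comma, tab, newline and quote).

```javascript
// Hypothetical helper: split the input into sections on boundary characters,
// keeping the offsets of each section in the original text.
function splitSections (input) {
  const sections = []
  const boundary = /[,\t\n"]/g
  let start = 0
  let match

  // Trim surrounding whitespace but keep offsets relative to the input.
  const push = (from, to) => {
    const raw = input.slice(from, to)
    const body = raw.trim()
    if (body.length === 0) { return }
    const offset = from + raw.indexOf(body)
    sections.push({ body, start: offset, end: offset + body.length })
  }

  while ((match = boundary.exec(input)) !== null) {
    push(start, match.index)
    start = match.index + 1
  }
  push(start, input.length)
  return sections
}

console.log(splitSections('Main St, East Village'))
// [ { body: 'Main St', start: 0, end: 7 },
//   { body: 'East Village', start: 9, end: 21 } ]
```

Here `East` can never be attached to `Main St`, because the two words belong to different sections.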
+
+### Word Splitting
+
+Each section is then split into individual `words`. By default, whitespace is simply treated as the word boundary.
+
+As per the `section`, the original token positions are maintained.
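A minimal sketch of whitespace splitting that keeps the original offsets might look like this. The `splitWords` helper and the shape of the `section` object are assumptions for illustration only.

```javascript
// Hypothetical helper: split a section into words on whitespace, keeping
// offsets relative to the original input (not relative to the section).
function splitWords (section) {
  const words = []
  const pattern = /\S+/g
  let match
  while ((match = pattern.exec(section.body)) !== null) {
    const start = section.start + match.index
    words.push({ body: match[0], start, end: start + match[0].length })
  }
  return words
}

console.log(splitWords({ body: 'East Village', start: 9, end: 21 }))
// [ { body: 'East', start: 9, end: 13 },
//   { body: 'Village', start: 14, end: 21 } ]
```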
+
+### Phrase Generation
+
+Many terms such as 'New York City' span multiple words; these multi-word tokens are called `phrases`.
+
+In order to be able to classify `phrase` terms, permutations of adjacent words are generated.
+
+Phrase generation is performed per-section, so it will not generate a `phrase` which contains words from more than one `section`.
+
+Phrase generation is controlled by a configuration which specifies things like the minimum and maximum number of words allowed in a `phrase`.
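To make the idea concrete, the sketch below generates every run of adjacent words within a single section, bounded by a minimum and maximum length. The `generatePhrases` name and the `min` and `max` option names are hypothetical; the real configuration keys may differ.

```javascript
// Hypothetical helper: build phrases from adjacent words of ONE section.
// `min` and `max` bound the number of words per phrase (assumed option names).
function generatePhrases (words, { min = 1, max = 4 } = {}) {
  const phrases = []
  for (let i = 0; i < words.length; i++) {
    const longest = Math.min(max, words.length - i)
    for (let len = longest; len >= min; len--) {
      phrases.push(words.slice(i, i + len).join(' '))
    }
  }
  return phrases
}

console.log(generatePhrases(['new', 'york', 'city'], { min: 1, max: 2 }))
// [ 'new york', 'new', 'york city', 'york', 'city' ]
```

Calling the helper once per section is what keeps a phrase from containing words from more than one section.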
+
+### Token Graph
+
+A graph is used to associate `word`, `phrase` and `section` elements to each other.
+
+The graph is free-form, so it's easy to add a new relationship between terms in the future, as required.
+
+Graph Example:
 
 ```javascript
-/demo
-/parser/parse?text=12 main st
+// find the next word in this section
+word.findOne('next')
+
+// find all words in this phrase
+phrase.findAll('child')
 ```
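One way to picture such a free-form graph is a map from relationship names to lists of related elements. In the sketch below only the `findOne` and `findAll` method names come from the example above; the `GraphNode` class, the `add` method and the storage layout are assumptions for illustration.

```javascript
// Hypothetical graph node: relationships are free-form string labels.
class GraphNode {
  constructor (body) {
    this.body = body
    this.edges = new Map()
  }

  // Record that `node` is related to this node via `relationship`.
  add (relationship, node) {
    if (!this.edges.has(relationship)) { this.edges.set(relationship, []) }
    this.edges.get(relationship).push(node)
  }

  findAll (relationship) {
    return this.edges.get(relationship) || []
  }

  findOne (relationship) {
    return this.findAll(relationship)[0] || null
  }
}

const west = new GraphNode('west')
const street = new GraphNode('street')
const phrase = new GraphNode('west street')

west.add('next', street)
phrase.add('child', west)
phrase.add('child', street)

console.log(west.findOne('next').body) // 'street'
console.log(phrase.findAll('child').map(n => n.body)) // [ 'west', 'street' ]
```

Since a relationship is just a string key, adding a new kind of edge requires no change to the node structure.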
 
-### Architecture overview
+## Classification
 
-#### 1. start with an input:
-```
-30 West 26th Street, New York, NYC, 10010
-```
+Classification is the process of establishing that a `word` or `phrase` represents a 'concept' (such as a street name).
 
-#### 2. split tokens in to logical groups:
-```
-[
-  "30 West 26th Street",
-  "New York",
-  "NYC",
-  "10010"
-]
-```
+Classification can be based on:
+- Dictionary matching (usually with normalization applied)
+- Pattern matching (such as regular expressions)
+- Composite matching (such as relative positioning)
+- External API calls (such as calling other services)
+- Other semantic matching techniques
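The first two techniques in the list can be sketched in a few lines. This is a toy example and not one of the bundled classifiers: the `classifyWord` function, the tiny dictionary, the label names and the regular expressions (including the US-style five-digit postcode) are all assumptions chosen for illustration.

```javascript
// Toy dictionary; real dictionaries would be far larger.
const streetSuffixes = new Set(['st', 'street', 'ave', 'avenue', 'rd', 'road'])

// Hypothetical classifier: returns every label that could apply to a word.
function classifyWord (word) {
  const labels = []

  // Dictionary matching, with normalization (lowercase, strip trailing dot).
  const normalized = word.toLowerCase().replace(/\.$/, '')
  if (streetSuffixes.has(normalized)) { labels.push('street_suffix') }

  // Pattern matching with regular expressions.
  if (/^\d{5}$/.test(word)) { labels.push('postcode') }
  if (/^\d+[a-z]?$/i.test(word)) { labels.push('housenumber') }

  return labels
}

console.log(classifyWord('St.')) // [ 'street_suffix' ]
console.log(classifyWord('10010')) // [ 'postcode', 'housenumber' ]
console.log(classifyWord('30')) // [ 'housenumber' ]
```

Note that `10010` receives two labels. Classification only records what a token could be; choosing between the candidates is left to a later stage.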
 
-#### 3. tokenize groups:
-```
-[
-  [ "30", "west", "26th", "street" ],
-  [ "new", "york" ],
-  [ "nyc" ],
-  [ "10010" ]
-]
-```
+> Classification is coloured `green` and `red` on the command-line.
 
-#### 4. generate phrase permutations:
-```
-[
-  [
-    "30 west 26th street",
-    "30 west 26th",
-    "30 west",
-    "30",
-    "west 26th street",
-    "west 26th",
-    "west",
-    "26th street",
-    "26th"
-  ],
-  [
-    "new york",
-    "new",
-    "york"
-  ],
-  [ "nyc" ],
-  [ "10010" ]
-]
+### Classifier Types
+
+The library comes with three generic classifiers which can be extended in order to create a new `classifier`:
+
+- WordClassifier
+- PhraseClassifier
+- SectionClassifier
+
+### Classifiers
+
+The library comes bundled with a range of classifiers out of the box.
+
+You can find them in the `/classifier` directory; dictionary-based classifiers usually store their data in the `/resources` directory.
+
+Example of some of the included classifiers:
+
+```javascript
+// word classifiers
+HouseNumberClassifier
+PostcodeClassifier
+StreetPrefixClassifier
+StreetSuffixClassifier
+CompoundStreetClassifier
+DirectionalClassifier
+OrdinalClassifier
+StopWordClassifier
+
+// phrase classifiers
+IntersectionClassifier
+PersonClassifier
+GivenNameClassifier
+SurnameClassifier
+PersonalSuffixClassifier
+PersonalTitleClassifier
+ChainClassifier
+PlaceClassifier
+WhosOnFirstClassifier
 ```
 
-#### 5. run classifiers against all phrases and record potential classes per phrase
+## Solvers
+
+Solving is the final process, where `solutions` are generated based on all the classifications that have been made.
+
+Each parse can contain multiple `solutions`. Each solution is assigned a `confidence` score, and solutions are displayed sorted from highest to lowest score.
+
+The core of this process is the `ExclusiveCartesianSolver` module.
+
+This `solver` generates all the possible permutations of the different classifications while taking care to:
+- ensure the same `span` position is not used more than once
+- ensure that the same `classification` is not used more than once
+
+After the `ExclusiveCartesianSolver` has run there are additional solvers which can:
+- filter the `solutions` to remove inconsistencies
+- add new `solutions` to provide additional functionality (such as intersections)
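The two constraints can be illustrated with a small recursive search. This sketch is not the `ExclusiveCartesianSolver` itself: the `solve` and `overlaps` functions and the shape of the classification objects are assumptions, and confidence scoring is left out entirely.

```javascript
// Two spans overlap when their character ranges intersect.
function overlaps (a, b) {
  return a.start < b.end && b.start < a.end
}

// Hypothetical solver: enumerate every non-empty combination of
// classifications in which no span position and no label is reused.
function solve (classifications) {
  const solutions = []

  const extend = (index, chosen) => {
    if (index === classifications.length) {
      if (chosen.length > 0) { solutions.push(chosen) }
      return
    }
    const candidate = classifications[index]
    const compatible = chosen.every(c =>
      c.label !== candidate.label && !overlaps(c, candidate)
    )
    if (compatible) { extend(index + 1, chosen.concat([candidate])) }
    extend(index + 1, chosen)
  }

  extend(0, [])
  return solutions
}

// Classifications for the input '30 10010' (offsets into that string).
const solutions = solve([
  { label: 'housenumber', start: 0, end: 2 },
  { label: 'postcode', start: 3, end: 8 },
  { label: 'housenumber', start: 3, end: 8 }
])

console.log(solutions.length) // 4
console.log(solutions[0].map(c => c.label)) // [ 'housenumber', 'postcode' ]
```

The combination that labels both `30` and `10010` as a house number is never produced, and neither is one that assigns two labels to `10010`.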
+
+### Solution Masks
+
+It is possible to produce a simple `mask` for any generated solution. This is useful for comparing the `solution` to the original text:
+
+```
+VVV VVVV NN SSSSSSS AAAAAA PPPPP
+Foo Cafe 10 Main St London 10010 Earth
 ```
-'10010' -> postcode
-'west 26th street' -> street
-'26th street' -> street
-'street' -> street_postfix
+
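A mask of this kind could be produced by writing one letter per classified character over a blank copy of the input. In this sketch the `solutionMask` function and the letter mapping are assumptions inferred from the example above, not taken from the library.

```javascript
// Assumed mapping from classification label to mask letter.
const LETTERS = { venue: 'V', housenumber: 'N', street: 'S', locality: 'A', postcode: 'P' }

// Hypothetical helper: paint each classified range with its letter,
// leaving unclassified characters as spaces.
function solutionMask (input, solution) {
  const mask = new Array(input.length).fill(' ')
  for (const { label, start, end } of solution) {
    mask.fill(LETTERS[label] || '?', start, end)
  }
  return mask.join('')
}

const text = '10 Main St London'
const masked = solutionMask(text, [
  { label: 'housenumber', start: 0, end: 2 },
  { label: 'street', start: 3, end: 10 },
  { label: 'locality', start: 11, end: 17 }
])

console.log(masked) // 'NN SSSSSSS AAAAAA'
console.log(text)   // '10 Main St London'
```

Printing the mask above the input lines the letters up with the characters they describe.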
+# Contributing
+
+Please fork and pull request against upstream master on a feature branch. Pretty please, provide unit tests.
+
+## Unit tests
+
+You can run the unit test suite using the command:
+
+```bash
+$ npm test
 ```
 
-#### 6. generate solutions
 
-Given the classifications for each phrase, compute an array of potential parses for the input, a confidence score can also be provided.
+### Continuous Integration
+
+Travis tests every release against all supported Node.js versions.
+
+[![Build Status](https://travis-ci.org/pelias/parser.png?branch=master)](https://travis-ci.org/pelias/parser)
+
+
+### Versioning
+
+We rely on semantic-release and Greenkeeper to maintain our module and dependency versions.
+
+[![Greenkeeper badge](https://badges.greenkeeper.io/pelias/parser.svg)](https://greenkeeper.io/)