Redactor

Developer: Biswas Nandamuri

Redactor is a Python-based utility for redacting sensitive information using natural language processing tools such as SpaCy and NLTK.

The project's Python code follows the PEP 8 style guide.

This utility uses a number of open source packages and tools:

  • SpaCy - Industrial-strength Natural Language Processing (NLP) in Python.
  • en_core_web_md - SpaCy's English pipeline optimized for CPU.
  • nltk - A suite of open source tools, data sets, and tutorials for Natural Language Processing research.
  • Pyap - Python address detector and parser.
  • SpaCy-Wordnet - SpaCy and Nltk's wordnet Annotator.
  • Pytest - Testing framework that supports complex functional testing.
  • Pytest-cov - Coverage plugin for pytest.
  • autopep8 - Tool that automatically formats Python code to conform to the PEP 8 style guide.

Run on local system

  1. Clone this repository and move into the folder.
    $ git clone https://github.com/Biswas-N/Redactor.git
    $ cd Redactor
  2. Install dependencies using Pipenv.
    $ pipenv install
  3. Run the utility tool.
    $ make

    Note: The project includes a Makefile with commonly used commands. Running make executes the following command:
    $ pipenv run python redactor.py --input '*.txt' --names --dates --phones --genders --address --concept 'war' --concept 'dog' --output 'files/' --stats 'process.log'

  4. The redacted files are stored in the files/ folder with a .redacted extension.
  5. Finally, the stats for the redaction process are stored in process.log.
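
To make the flags in the command above easier to read, here is a minimal sketch of how such a command line could be parsed with Python's argparse. It is an illustration only: the function name build_parser and the help strings are assumptions, and the actual redactor.py may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the Makefile command."""
    parser = argparse.ArgumentParser(prog="redactor.py")
    parser.add_argument("--input", required=True,
                        help="glob pattern of the files to redact, e.g. '*.txt'")
    # One on/off switch per entity type shown in the command.
    for flag in ("names", "dates", "phones", "genders", "address"):
        parser.add_argument(f"--{flag}", action="store_true",
                            help=f"redact {flag}")
    # --concept may be repeated; each use appends one more value to the list.
    parser.add_argument("--concept", action="append", default=None,
                        help="concept word whose related sentences are redacted")
    parser.add_argument("--output", required=True,
                        help="folder that receives the redacted files")
    parser.add_argument("--stats", required=True,
                        help="file that receives the redaction statistics")
    return parser


# Parse the same arguments the Makefile passes (given here as an explicit list).
args = build_parser().parse_args([
    "--input", "*.txt", "--names", "--dates", "--phones", "--genders",
    "--address", "--concept", "war", "--concept", "dog",
    "--output", "files/", "--stats", "process.log",
])
concepts = args.concept or []
print(args.input, concepts, args.output, args.stats)
```

Repeating --concept collects every value into one list, which is why the command can pass both 'war' and 'dog'.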

Documentation

The documentation about code structure and extraction algorithm can be found here.

Testing

This utility is tested using pytest.

Documentation about the tests can be found here. Use the commands below to run the tests on your local system.

  1. Install dev-dependencies.
    $ pipenv install --dev
  2. Run tests using Makefile.
    $ make test
  3. Run test coverage.
    $ make cov

Bugs/Assumptions

  • Names of people, organizations, geopolitical entities, nationalities, and religious or political groups are all treated as names and are therefore redacted when the --names flag is used.
  • This tool depends on SpaCy's en_core_web_md model and WordNet, so its accuracy and performance depend directly on the accuracy and performance of that model and of WordNet.
  • This tool's accuracy and performance are enhanced by using regular expressions alongside SpaCy and WordNet, but the regular expressions do not cover every form of the entities (names, phones, genders, dates and addresses). Some information may therefore go unredacted if it is not recognized by the SpaCy model, not present in WordNet, and not matched by the included regular expressions.
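
The points above can be illustrated with a small standard-library sketch that merges model-detected entity spans with regular-expression matches before redacting. Everything named here is an assumption made for illustration: NAME_LABELS lists the SpaCy entity labels that correspond to the categories in the first bullet, PHONE_RE is a hypothetical US-style phone pattern rather than the project's own, and the block character used for redaction is a guess. The entity spans are passed in as plain tuples so the example runs without a SpaCy model.

```python
import re

# SpaCy entity labels matching the categories treated as names (assumed mapping).
NAME_LABELS = {"PERSON", "ORG", "GPE", "NORP"}

# Hypothetical pattern for common US-style phone numbers; not the project's own.
PHONE_RE = re.compile(
    r"(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)

BLOCK = "\u2588"  # assumed redaction character


def collect_spans(text, entities):
    """Return (start, end) spans from model entities plus regex matches.

    `entities` is an iterable of (start, end, label) tuples, standing in for
    what an NER model would report.
    """
    spans = [(start, end) for start, end, label in entities
             if label in NAME_LABELS]
    # The regex catches phone numbers the model may have missed.
    spans += [match.span() for match in PHONE_RE.finditer(text)]
    return spans


def redact(text, spans):
    """Overwrite every span with block characters, preserving text length."""
    chars = list(text)
    for start, end in spans:
        chars[start:end] = BLOCK * (end - start)
    return "".join(chars)


text = "Alice called 405-555-0199 on Monday."
entities = [(0, 5, "PERSON"), (29, 35, "DATE")]  # hand-written example spans
redacted = redact(text, collect_spans(text, entities))
print(redacted)
```

In this sketch the DATE span is left alone because only name labels are selected, and anything that neither the supplied spans nor the pattern cover stays visible, which is the limitation the last bullet describes.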