lvapeab/sentence-selectioNN

Neural Networks for Data Selection

This repository contains the code for the paper "Neural Networks Classifier for Data Selection in Statistical Machine Translation".

The code is built upon our fork of Keras (version 1.2) and has been tested with the Theano backend.

Features

  • Neural network-based sentence classifiers, at either the monolingual or the bilingual level.

  • BLSTM and CNN classifiers. Easy to extend.

  • Support for pretrained GloVe or Word2Vec word vectors (binary or text formats).

  • Iterative semi-supervised selection of the top/bottom-scoring sentences from an out-of-domain corpus.
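The iterative selection loop can be pictured with the following minimal sketch. It is not the repository's implementation: `train_scorer` is a hypothetical vocabulary-overlap stand-in for the neural classifier, and the names `iterative_selection`, `k` and `iterations` are assumptions made for illustration. At each round, the `k` top-scoring pool sentences are selected and added as positive examples, the `k` bottom-scoring ones are added as negative examples, and the scorer is rebuilt.

```python
from typing import Callable, List, Sequence


def train_scorer(positives: Sequence[str], negatives: Sequence[str]) -> Callable[[str], float]:
    """Hypothetical stand-in for the neural classifier (vocabulary overlap)."""
    pos_vocab = {tok for sent in positives for tok in sent.split()}
    neg_vocab = {tok for sent in negatives for tok in sent.split()}

    def score(sentence: str) -> float:
        tokens = sentence.split()
        if not tokens:
            return 0.0
        pos_hits = sum(tok in pos_vocab for tok in tokens)
        neg_hits = sum(tok in neg_vocab for tok in tokens)
        return (pos_hits - neg_hits) / len(tokens)

    return score


def iterative_selection(in_domain: Sequence[str], pool: Sequence[str],
                        seed_negatives: Sequence[str], k: int, iterations: int) -> List[str]:
    """Select sentences from an out-of-domain pool over several rounds."""
    if k < 1:
        raise ValueError("k must be at least 1")
    positives = list(in_domain)
    negatives = list(seed_negatives)
    remaining = list(pool)
    selected: List[str] = []
    for _ in range(iterations):
        if len(remaining) < 2 * k:
            break  # not enough sentences left for a top and a bottom slice
        score = train_scorer(positives, negatives)
        ranked = sorted(remaining, key=score, reverse=True)
        top, bottom = ranked[:k], ranked[len(ranked) - k:]
        selected.extend(top)      # top-scoring sentences are the selection
        positives.extend(top)     # ...and become new positive examples
        negatives.extend(bottom)  # bottom-scoring ones become negatives
        remaining = ranked[k:len(ranked) - k]
    return selected


in_domain = ["the patient received a dose", "the dose was reduced"]
pool = [
    "the patient was given a dose",
    "stocks fell sharply today",
    "a reduced dose helped the patient",
    "markets closed lower today",
]
selected = iterative_selection(in_domain, pool, ["stocks markets today"], k=1, iterations=2)
```

In practice the scorer would be one of the neural classifiers trained on the current positive and negative sets; only the ranking-and-update structure is the point of this sketch.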

Installation

Provided that you have pip installed, run:

git clone https://github.com/lvapeab/sentence-selectioNN
cd sentence-selectioNN
pip install -r requirements.txt

This installs the packages required to run this library.

sentence-selectioNN requires the following libraries:

Instructions:

Assuming you have a corpus:

  1. Check the inputs/outputs of your model in data_engine/prepare_data.py.

  2. If you want to use pretrained word vectors, run the preprocessing script that matches their format (binary or text) for GloVe or Word2Vec vectors.

  3. Set a model configuration in config.py.

  4. Train:

python main.py
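For step 2, it may help to see what a text-format vector file contains. The sketch below is not one of the repository's preprocessing scripts; `load_text_vectors` is a hypothetical helper that reads lines of the form `word v1 v2 ...` into a dictionary. It assumes the usual conventions: Word2Vec text files start with a `<vocab_size> <dim>` header line, while GloVe text files have no header.

```python
import io
from typing import Dict, List, Optional, TextIO


def load_text_vectors(handle: TextIO) -> Dict[str, List[float]]:
    """Read text-format word vectors (hypothetical helper, for illustration)."""
    vectors: Dict[str, List[float]] = {}
    dim: Optional[int] = None
    for line_no, line in enumerate(handle, start=1):
        parts = line.rstrip().split(" ")
        # Skip the "<vocab_size> <dim>" header used by Word2Vec text files.
        if line_no == 1 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # blank or malformed line
        word, values = parts[0], [float(x) for x in parts[1:]]
        if dim is None:
            dim = len(values)
        elif len(values) != dim:
            raise ValueError(f"line {line_no}: expected {dim} values, got {len(values)}")
        vectors[word] = values
    return vectors


# In-memory examples; a real file would be opened with open(path, encoding="utf-8").
w2v = load_text_vectors(io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 1 2 3\n"))
glove = load_text_vectors(io.StringIO("the 0.5 -0.5\n"))
```

Binary Word2Vec files use a different layout and are not handled by this sketch.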

Architecture

We support two network architectures, BLSTM and CNN, each at the monolingual or the bilingual level.

[Figure: NN_Classifier]

Please see the paper for a more detailed description of the model.
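To illustrate the difference between the monolingual and bilingual levels, here is a small pure-Python forward pass of a CNN-style sentence classifier. It is a sketch under stated assumptions, not the model from the paper or the Keras code in this repository: the convolution with max-over-time pooling, the sigmoid output, and the idea of concatenating source and target features for the bilingual case are assumptions made for illustration, and all names (`conv_max_pool`, `sigmoid_output`) are hypothetical. Weights are random, so the output probabilities are meaningless.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def conv_max_pool(embeddings: Sequence[Vector], filters: Sequence[Sequence[Vector]]) -> Vector:
    """Convolve each filter over word windows, apply ReLU, max-pool over time.

    embeddings: one vector per word (must be non-empty).
    filters[f][j]: weights applied to the j-th word of a window.
    """
    width = len(filters[0])
    dim = len(embeddings[0])
    # Pad sentences shorter than the filter width with zero vectors.
    padded = list(embeddings) + [[0.0] * dim] * max(0, width - len(embeddings))
    features = []
    for filt in filters:
        activations = []
        for start in range(len(padded) - width + 1):
            window = padded[start:start + width]
            total = sum(w * x
                        for w_vec, x_vec in zip(filt, window)
                        for w, x in zip(w_vec, x_vec))
            activations.append(max(0.0, total))  # ReLU
        features.append(max(activations))        # max over time
    return features


def sigmoid_output(features: Vector, weights: Vector, bias: float) -> float:
    """Probability that the input belongs to the in-domain class."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)


def rand_vec(n: int) -> Vector:
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


dim, width, n_filters = 4, 2, 3
src = [rand_vec(dim) for _ in range(5)]  # source sentence, 5 words
tgt = [rand_vec(dim) for _ in range(6)]  # target sentence, 6 words
src_filters = [[rand_vec(dim) for _ in range(width)] for _ in range(n_filters)]
tgt_filters = [[rand_vec(dim) for _ in range(width)] for _ in range(n_filters)]

# Monolingual level: only the source sentence is encoded.
mono_features = conv_max_pool(src, src_filters)
p_mono = sigmoid_output(mono_features, rand_vec(len(mono_features)), 0.0)

# Bilingual level (assumed here): both sides are encoded and concatenated.
bi_features = mono_features + conv_max_pool(tgt, tgt_filters)
p_bi = sigmoid_output(bi_features, rand_vec(len(bi_features)), 0.0)
```

A BLSTM variant would replace `conv_max_pool` with a recurrent encoder while keeping the same monolingual/bilingual input structure.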

Citation

If you use this code for any purpose, please cite the following paper:

Peris Á., Chinea-Rios M., Casacuberta F.
Neural Networks Classifier for Data Selection in Statistical Machine Translation.
In The Prague Bulletin of Mathematical Linguistics No. 108, pp. 283–294. 2017.

Contact

Álvaro Peris (web page): [email protected]