This repo provides step-by-step tutorials for training models with Stanza - the official Python NLP library by the Stanford NLP Group. All neural processors in Stanza, including the tokenzier, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer, the dependency parser, and the named entity tagger, can be trained with your own data.
This repo is meant to complement our training documentation, by providing runnable scripts coupled with toy data that makes it much easier for users to get started with model training. To train models with your own data, you should be able to simply replace the provided toy data with your own data in the same format, and start using them with Stanza right after training.
Warning: This repo is fully tested on Linux. Due to syntax differences between macOS and Linux (e.g., the
declare -A
in thescripts/treebank_to_shorthand.sh
is not supported by macOS), you need to rewrite some files to run on macOS. The command lines given here will not work on Windows.
This repo is designed for and tested on stanza 1.4.0. Earlier versions will not fully work with these commands.
To reiterate, this is only tested on Linux. In order to run on
Windows, there is a source scripts/config.sh
line in the initial
setup below. Theoretically, if you manually set those variables in
the shell, or if you add those variables to the environment using the
control panel, the rest of the scripts might work.
Run the following commands at the command line.
Stanza only supports python3
. You can install all dependencies needed by training Stanza models with:
pip install -r requirements.txt
Next, set up the folders and scripts needed for training with:
git clone [email protected]:stanfordnlp/stanza-train.git
cd stanza-train
git clone [email protected]:stanfordnlp/stanza.git
cp config/config.sh stanza/scripts/config.sh
cp config/xpos_vocab_factory.py stanza/stanza/models/pos/xpos_vocab_factory.py
cd stanza
source scripts/config.sh
The config.sh
script is used to set environment variables (e.g., data path, word vector path, etc.) needed by training and testing Stanza models.
The xpos_vocab_factory.py
script is used to build XPOS vocabulary file for our provided UD_English-TEST
toy treebank. Compared with the original file in the downloaded Stanza repo, we only add the shorthand name of the toy treebank (en_test
) to the script, so that it can be recognized during training. If you want to use another dataset other than UD_English-TEST
after running this tutorial, you can add the shorthand of your treebank in the same way. In case you're curious, here's how we built this file.
Here we provide instructions for training each processor currently supported by Stanza, using the toy data in this repo as example datasets. Model performance will be printed during training. As our provided toy data only contain several sentences for demonstration purpose, you should be able to get 100% accuracy at the end of training.
The tokenize
processor segments the text into tokens and sentences. All downstream processors which generate annotations at the token or sentence level depends on the output from this processor.
Training the tokenize
processor currently requires the Universal Dependencies treebank data in both plain text and the conllu
format, as you can find in our provided toy examples here. To train the tokenize
processor with this toy data, run the following command:
python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-TEST
python3 -m stanza.utils.training.run_tokenizer UD_English-TEST --step 500
Note that since this toy data is very small in scale, we are restricting the training with a very small step
parameter. To train on your own data, you can either set a larger step
parameter, or use the default parameter value.
The Universal Dependencies grammar defines syntatic relations between syntactic words, which, for many languages (e.g., French), are different from raw tokens as segmented from the text. For these languages, the mwt
processor expands the multi-word tokens (MWT) recognized by the tokenize
processor into multiple syntactic words, paving the ways for downstream annotations.
Note: The mwt processor is not needed and cannot be trained for languages that do not have multi-word tokens (MWT), such as English or Chinese.
Like the tokenize
processor, training the mwt
processor requires UD data, in the format like our provided toy examples here. You can run the following command to train the mwt
processor:
python3 -m stanza.utils.datasets.prepare_mwt_treebank UD_English-TEST
python3 -m stanza.utils.training.run_mwt UD_English-TEST --num_epoch 2
Note: Running the above command with the toy data will yield a message saying that zero training data can be found for MWT training. This is normal since MWT is not needed for English. The training should work when you replace the provided data with data in languages that support MWT (e.g., German, French, etc.).
The lemma
processor predicts lemmas for all words in an input sentence. Training the lemma
processor requires data files in the conllu
format. With the toy examples, you can train the lemma
processor with the following command:
python3 -m stanza.utils.datasets.prepare_lemma_treebank UD_English-TEST
python3 -m stanza.utils.training.run_lemma UD_English-TEST --num_epoch 2
The pos
processor annotates words with three types of syntactic information simultaneously: the Universal POS (UPOS) tags, and treebank-specific POS (XPOS) tags, and universal morphological features (UFeats).
Training the pos
processor usually requires UD data in the conllu
format and pretrained word vectors. For demo purpose, we provide an example word vector file here. With the toy data and word vector file, you can train the pos
processor with:
python3 -m stanza.utils.datasets.prepare_pos_treebank UD_English-TEST
python3 -m stanza.utils.training.run_pos UD_English-TEST --max_steps 500
The depparse
processor implements a dependency parser that predicts syntactic relations between words in a sentence. Training the depparse
processor requires data files in the conllu
format, and a pretrained word vector file. With the toy data and word vector file, you can train the depparse
processor with:
python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_English-TEST
python3 -m stanza.utils.training.run_depparse UD_English-TEST --max_steps 500
Note that the gold
parameter here tells the scripts to use the "gold" human-annotated POS tags in the training of the parser.
The ner
processor recognizes named entities in the input text. Training the ner
processor requires column training data in either BIO
or BIOES
format. See this wikipedia page for an introduction of the formats. We provide toy examples here in the BIO format. For better performance a pretrained word vector file is also recommended. With the toy data and word vector file, you can train the ner
processor with:
python3 -m stanza.utils.training.run_ner en_sample --max_steps 500 --word_emb_dim 5
Note that for demo purpose we are restricting the word vector dimension to be 5 with the word_emb_dim
parameter. You should change it to match the dimension of your own word vectors.
The performance of the ner
processor can be significantly improved by using contextualized string embeddings (i.e., a character-level language model), as was shown in this COLING 2018 paper. To enable this in your NER model, you'll need to first train two character-level language models for your language (named as charlm
module in Stanza), and then use these trained charlm
models in your NER training.
Training charlm
requires a large amount of raw text, such as text from news articles or wikipedia pages, in plain text files. We provide toy data for training charlm
here. With the toy data, you can run the following command to train two charlm
models, one in the forward direction of the text and another in the backward direction, respectively:
python3 -m stanza.utils.training.run_charlm en_TEST --forward --epochs 2 --cutoff 0 --batch_size 2
python3 -m stanza.utils.training.run_charlm en_TEST --backward --epochs 2 --cutoff 0 --batch_size 2
Running these commands will result in two model files in the saved_models/charlm
directory, with the prefix en_test
.
Note: For details on why two models are needed and how they are used in the NER tagger, please refer to this COLING 2018 paper.
Training contextualized ner
models requires BIO-format data, pretrained word vectors, and the pretrained charlm
models obtained in the last step. You can run the following command to train the ner
processor:
python3 -m stanza.utils.training.run_ner en_sample --max_steps 500 --word_emb_dim 5 --charlm test
Note that the charlm
here instructs the training script to look for the character language model files with the prefix of en_test
.
Initializing a processor with your own trained model only requires the path for the model file. Here we provide an example to initialize the tokenize
processor with a model file saved at saved_models/tokenize/en_test_tokenizer.pt
:
>>> import stanza
>>> nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_model_path='saved_models/tokenize/en_test_tokenizer.pt')
After training your own models, we welcome you to contribute your models so that it can be used by the community. To do this, you can start by creating a GitHub issue. Please help us understand your model by clearly describing your dataset, model performance, your contact information, and why you think your model would benefit the whole community. We will integrate your models into our official repository once we are able to verify its quality and usability.