Experimental code on multi-label dialect identification, developed for the paper Dialect and Variant Identification as a Multi-Label Classification Task: A Proposal Based on Near-Duplicate Analysis (Bernier-Colborne, Goutte, and Léger; VarDial 2023).
We provide this code for the purpose of reproducing the experiments we conducted on the FreCDo dataset. It is licensed under GPL 3.0, as it depends on a library licensed under an earlier version of the GPL.
The scripts below require Python (tested with version 3.9.12), and the following libraries (tested versions are in brackets):
- NumPy (v 1.22.3)
- SciPy (v 1.10.0)
- scikit-learn (v 1.2.1)
- PyTorch (v 1.12.0)
- Transformers (v 4.20.1)
- Datasets (v 2.3.2)
- Levenshtein (v 0.20.9)
- tqdm (v 4.64.0)
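One way to install the tested versions with pip (PyPI package names; an illustration, not an official requirements file):
pip install numpy==1.22.3 scipy==1.10.0 scikit-learn==1.2.1 torch==1.12.0 transformers==4.20.1 datasets==2.3.2 levenshtein==0.20.9 tqdm==4.64.0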
The following commands assume that the text files containing the data are split into texts and labels, e.g.:
data/
    train.txt
    train.labels
    dev.txt
    dev.labels
    test.txt
    test.labels
This is the format produced by make_dataset.py (see below), but to apply these commands to the original version of the FreCDo dataset, you will first have to split the train and dev sets into separate files for texts and labels.
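If you need to do that split yourself, a minimal sketch follows; it assumes each line of the original files holds a label and a text separated by a tab, which you should verify against the actual FreCDo files (the helper is ours, not part of this repository):

# Hypothetical helper (not part of this repository) to split a combined
# file into the text/label format above. It assumes one example per line,
# with the label and the text separated by a tab; verify the field order
# and separator against the actual FreCDo files.
import sys

def split_file(combined_path, text_path, label_path):
    with open(combined_path, encoding="utf-8") as fin, \
         open(text_path, "w", encoding="utf-8") as ftxt, \
         open(label_path, "w", encoding="utf-8") as flab:
        for line in fin:
            label, text = line.rstrip("\n").split("\t", 1)
            ftxt.write(text + "\n")
            flab.write(label + "\n")

if __name__ == "__main__":
    split_file(sys.argv[1], sys.argv[2], sys.argv[3])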
All the scripts mentioned below have their own internal documentation, so run python <script-name> -h for more details on usage.
To analyse exact duplicates in the data, use:
python count_dups.py data.txt data.labels
python show_dups.py data.txt data.labels
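For a rough idea of what this analysis involves (an illustration, not the scripts' actual implementation), exact-duplicate analysis amounts to grouping identical texts and checking which labels they occur with:

# Illustration only (not count_dups.py): group identical texts and report
# how many are duplicated, and how many of those carry conflicting labels.
from collections import defaultdict

def summarize_exact_dups(text_path, label_path):
    by_text = defaultdict(list)  # text -> labels it occurs with
    with open(text_path, encoding="utf-8") as ft, \
         open(label_path, encoding="utf-8") as fl:
        for text, label in zip(ft, fl):
            by_text[text.strip()].append(label.strip())
    dups = {t: labs for t, labs in by_text.items() if len(labs) > 1}
    conflicting = sum(1 for labs in dups.values() if len(set(labs)) > 1)
    print(f"{len(dups)} duplicated texts, {conflicting} with conflicting labels")

summarize_exact_dups("data.txt", "data.labels")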
To analyse near-duplicates in the data using the Levenshtein edit ratio as similarity measure, with a cutoff at 0.8, use:
python make_sim_matrix.py data.txt sim.pkl -c 0.8 -b 1024 -p loky
python count_near_dups.py sim.pkl data.txt data.labels -m 0.8 -w log.txt -n token
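The similarity measure is the edit ratio provided by the Levenshtein library; as a minimal sketch of the near-duplicate criterion at cutoff 0.8 (an illustration, not make_sim_matrix.py itself):

# Illustration of the near-duplicate criterion (not make_sim_matrix.py):
# two texts are near-duplicates when their Levenshtein edit ratio
# reaches the cutoff.
import Levenshtein

def is_near_dup(a, b, cutoff=0.8):
    return Levenshtein.ratio(a, b) >= cutoff

print(is_near_dup("le dialecte du Québec", "le dialecte de Québec"))  # True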
where sim.pkl will contain the result of the first command.
To make a random split from the original split of the FreCDo dataset, optionally combine labels of (near) duplicates, and produce various representations of the resulting data, use:
python make_dataset.py original-data.txt original-data.labels sim.pkl dir_modified_data -m 0.8 -t 0.85 -d 0.05
where original-data.txt and original-data.labels should contain the complete source data, and dir_modified_data will contain the result.
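As a rough sketch of the random-split part only (the mapping of -t and -d to train/dev fractions is an assumption on our part; check the script's -h output for the actual semantics):

# Rough sketch of a random train/dev/test split (not make_dataset.py;
# the mapping of -t/-d to these fractions is an assumption).
import random

def random_split(examples, train_frac=0.85, dev_frac=0.05, seed=42):
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    n_train = int(train_frac * len(idx))
    n_dev = int(dev_frac * len(idx))
    train = [examples[i] for i in idx[:n_train]]
    dev = [examples[i] for i in idx[n_train:n_train + n_dev]]
    test = [examples[i] for i in idx[n_train + n_dev:]]
    return train, dev, test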
To finetune a CamemBERT model and evaluate it, use one of the following (for single-label and multi-label classification respectively):
python finetune_single.py train.txt train.labels dev.txt dev.labels dir_checkpoint --freeze_embeddings --freeze_encoder_upto 10
python finetune_multi.py train.txt train.labels dev.txt dev.labels dir_checkpoint --freeze_embeddings --freeze_encoder_upto 10
where dir_checkpoint will contain the resulting model, the training logs, etc.
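For a sense of how the single- and multi-label settings differ in Transformers, and of what the freezing flags suggest, here is a hedged sketch (camembert-base, num_labels=4, and the freezing semantics are assumptions; the scripts' actual configuration may differ):

# Sketch of the single- vs multi-label setup in Transformers
# (camembert-base and num_labels=4 are illustrative assumptions;
# the scripts' actual configuration may differ).
from transformers import AutoModelForSequenceClassification

# Single-label: softmax over classes, cross-entropy loss.
single = AutoModelForSequenceClassification.from_pretrained(
    "camembert-base", num_labels=4)

# Multi-label: one sigmoid per class, BCE-with-logits loss.
multi = AutoModelForSequenceClassification.from_pretrained(
    "camembert-base", num_labels=4,
    problem_type="multi_label_classification")

# Freezing in the spirit of the flags above (assumed semantics:
# embeddings plus encoder layers 0 through 10 are frozen).
for p in multi.roberta.embeddings.parameters():
    p.requires_grad = False
for layer in multi.roberta.encoder.layer[:11]:
    for p in layer.parameters():
        p.requires_grad = False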
To generate predictions with a finetuned model and evaluate them, use:
python predict.py dir_checkpoint/checkpoint/best_model dir_checkpoint/checkpoint/tokenizer test.txt pred.labels
python evaluate.py pred.labels test.labels multi
where pred.labels will contain the predicted labels output by the first command.
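As a minimal sketch of multi-label scoring with scikit-learn (the space-separated label format and the metric choice are assumptions here; evaluate.py defines the official metrics):

# Minimal multi-label scoring sketch (not evaluate.py). Assumes each
# line of a .labels file holds one or more labels separated by spaces.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

def read_labels(path):
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]

gold = read_labels("test.labels")
pred = read_labels("pred.labels")
mlb = MultiLabelBinarizer().fit(gold + pred)
y_true, y_pred = mlb.transform(gold), mlb.transform(pred)
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))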
All files in this repository are Copyright (C) 2023 National Research Council Canada.
This software is licensed under GPL version 3. It relies on the Levenshtein library, which is licensed under GPL version 2 (or any later version). License compatibility of all Python dependencies was confirmed with licensecheck 2023.1.3.