Source code for replicating the experiments in the paper "Authorship Obfuscation in Multilingual Machine-Generated Text Detection", accepted to EMNLP 2024 Findings.
If you use the data, code, or information in this repository, please cite the following paper (also available on arXiv).
@misc{macko2024authorshipobfuscationmultilingualmachinegenerated,
title={Authorship Obfuscation in Multilingual Machine-Generated Text Detection},
author={Dominik Macko and Robert Moro and Adaku Uchendu and Ivan Srba and Jason Samuel Lucas and Michiharu Yamashita and Nafis Irtiza Tripto and Dongwon Lee and Jakub Simko and Maria Bielikova},
year={2024},
eprint={2401.07867},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2401.07867},
}
Each external obfuscator has its own requirements and dependencies, which need to be installed from the corresponding repository. At a minimum, install the dependencies of the original MULTITuDE benchmark, on which our study is built.
First, download the original unobfuscated data from MULTITuDE and save it as 'dataset/multitude.csv.gz'. The obfuscated data can then be generated with the provided scripts:
- For backtranslation, use the provided 01_backtranslation.py (to select the m2m100 or nllb-200 model, uncomment the corresponding model_name in the source code).
- For paraphrasing by ChatGPT, use the provided 01_chatgpt_paraphrase.py.
- For paraphrasing by PEGASUS-paraphrase, use the provided 01_pegasus_paraphrase.py.
- For paraphrasing by DIPPER, use the source code provided in the original DIPPER repository, while applying the settings mentioned in the paper.
- For text edits by GPTZzzs, use the source code provided in the original GPTZzzs repository.
- For text edits by GPTZero-Bypasser, use the source code provided in the original GPTZero-Bypasser repository.
- For text edits by a generic HomoglyphAttack, use the provided 01_homoglyphattack.py.
- For text edits by ALISON, use the source code provided in the original ALISON repository, while applying the settings mentioned in the paper.
- For text edits by DFTFooler, use the source code provided in the original DFTFooler repository, while applying the settings mentioned in the paper.
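To give a feel for what a generic homoglyph attack does, here is a minimal Python sketch. It is not the code in 01_homoglyphattack.py: the character table `HOMOGLYPHS`, the function name `homoglyph_attack`, and the `rate` and `seed` parameters are all assumptions made for illustration.

```python
import random

# Hypothetical, deliberately small table of Latin -> Cyrillic look-alikes.
# The provided script may use a different and larger mapping.
HOMOGLYPHS = {
    "a": "\u0430",
    "c": "\u0441",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
    "x": "\u0445",
    "y": "\u0443",
}


def homoglyph_attack(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap a fraction `rate` of the replaceable characters for look-alikes."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    # Positions of characters that have a look-alike in the table.
    positions = [i for i, ch in enumerate(text) if ch in HOMOGLYPHS]
    chosen = set(rng.sample(positions, round(len(positions) * rate)))
    return "".join(
        HOMOGLYPHS[ch] if i in chosen else ch for i, ch in enumerate(text)
    )


if __name__ == "__main__":
    print(homoglyph_attack("a peaceful example", rate=0.5))
```

The output looks nearly identical to the input on screen, but the underlying code points differ, so a tokenizer sees different tokens.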
After the obfuscated data are generated, run the provided 02_text_quality.ipynb Google Colab notebook to calculate and analyze various automated text similarity metrics between the original and obfuscated data.
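As an illustration of the kind of comparison involved, the sketch below computes two simple character-level similarity scores between an original and an obfuscated text using only the Python standard library. The metrics actually used in 02_text_quality.ipynb are not listed here, so treat these as stand-ins; the function names `levenshtein` and `similarity_report` are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity_report(original: str, obfuscated: str) -> dict:
    """Return a few illustrative similarity scores for one text pair."""
    distance = levenshtein(original, obfuscated)
    longest = max(len(original), len(obfuscated), 1)
    return {
        "edit_distance": distance,
        "normalized_edit_similarity": 1 - distance / longest,
        "sequence_ratio": SequenceMatcher(None, original, obfuscated).ratio(),
    }


if __name__ == "__main__":
    print(similarity_report("The cat sat on the mat.", "A cat was sitting on the mat."))
```

Character-level scores like these are sensitive to small edits such as homoglyph swaps, while paraphrasing is usually better assessed with meaning-based metrics.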
To fine-tune a base model for the machine-generated text detection task, use the provided 03_finetune_detector.py script.
To run statistical methods, use the IMGTB framework. To run pre-trained and fine-tuned methods, use the provided 04_test_detector.py script. Since the Longformer Detector requires a special pre-processing step, use the provided 04_test_longformer.py instead.
To analyze the results, use the provided 05_results_analysis.ipynb Google Colab notebook.
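One quantity such an analysis might report is how often obfuscation makes a detector miss a machine-generated text it had previously caught. The sketch below is a hedged illustration, not the notebook's code: the function name `attack_success_rate`, the label convention (1 = machine-generated, 0 = human-written), and the exact definition are assumptions.

```python
from typing import Sequence


def attack_success_rate(
    labels: Sequence[int],
    preds_original: Sequence[int],
    preds_obfuscated: Sequence[int],
) -> float:
    """Share of correctly detected machine texts that are missed after obfuscation.

    Assumed convention: 1 = machine-generated, 0 = human-written.
    """
    if not (len(labels) == len(preds_original) == len(preds_obfuscated)):
        raise ValueError("all inputs must have the same length")
    # Machine-generated texts the detector caught before obfuscation.
    detected = [
        i
        for i, (y, p) in enumerate(zip(labels, preds_original))
        if y == 1 and p == 1
    ]
    if not detected:
        return 0.0
    # Of those, count the ones labeled human-written after obfuscation.
    evaded = sum(1 for i in detected if preds_obfuscated[i] == 0)
    return evaded / len(detected)


if __name__ == "__main__":
    labels = [1, 1, 1, 0]
    before = [1, 1, 0, 0]
    after = [0, 1, 0, 0]
    print(attack_success_rate(labels, before, after))
```

Restricting the denominator to texts that were detected before obfuscation keeps the score from being inflated by texts the detector already missed.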