This cross-platform sample tool detects exact and near duplicates of code. It is a fork of Near Duplicate Code Detector and mainly adds a convenient shell script to automate the deduplication process for Java datasets.
Requirements:
- .NET Core 2.1 for parsing code; an appropriate runtime for each of the languages that need to be tokenized is also required.
- Java 1.8 for tokenizing Java code
- Python for removing the detected duplicates from the dataset.
Duplicate removal consists of tokenizing the code, detecting duplicates, copying the dataset, and then removing the duplicates from the copy. This results in a deduplicated copy while the original dataset stays untouched.
A convenient shell script is provided for this; just run the command below.
NOTE: The script works only for Java and must be run from the location of the shell script, i.e. the project root.
sh deduplicate.sh target/project/path output/path/ > clone_removal.log 2> error.log
- After the script finishes, you can find the deduplicated dataset under output/path/
You can optionally omit the argument and specify the target path in the shell script by changing the DEFAULT_TARGET_PROJECT_PATH value. The same goes for the output path.
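For reference, here is a minimal Python sketch of the stages the script automates, mirroring the commands documented below; all paths are placeholders and the real deduplicate.sh is the authoritative sequence.

import shutil
import subprocess

target = "target/project/path"   # dataset to deduplicate (placeholder)
output = "output/path/"          # destination of the deduplicated copy (placeholder, must not exist yet)

# 1. Tokenize the Java sources into token files.
subprocess.run(["java", "-jar",
                "tokenizers/java/target/javatokenizer-1.0-SNAPSHOT.jar",
                target, "./tokenized", "true"], check=True)

# 2. Detect near duplicates over the tokenized files.
subprocess.run(["dotnet", "run", "path/to/DuplicateCodeDetector.csproj",
                "--project", "path/to/DuplicateCodeDetector/",
                "--dir=./tokenized"], check=True)

# 3. Copy the dataset, then strip the detected duplicates from the copy.
shutil.copytree(target, output)
subprocess.run(["python", "deduplicate.py", "--project", output,
                "--duplicates_data", "DuplicateCodeDetector/CloneDetectorCli.json"], check=True)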
To tokenize Java code, run:
java -jar tokenizers/java/target/javatokenizer-1.0-SNAPSHOT.jar /path/to/target/project/ ./output true
The last boolean is for granularity:
- true - look only at identifier tokens
- false - look at identifier names + all tokens, including things like ";" and operators like "<", "||", etc.
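To make the difference concrete, here is an illustrative (not tokenizer-exact) comparison for the Java statement int count = oldCount + 1;

# Purely illustrative; the exact token stream depends on the tokenizer implementation.
identifiers_only = ["count", "oldCount"]                        # granularity = true
all_tokens = ["int", "count", "=", "oldCount", "+", "1", ";"]   # granularity = false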
To run the near-duplicate detection, run:
$ dotnet run /path/to/DuplicateCodeDetector.csproj --project /path/to/DuplicateCodeDetector/ --dir=<folder>
This will use all the .gz files in the <folder> generated by the tokenizer and output a DuplicateCodeDetector/CloneDetectorCli.json file with the groups of detected duplicates. Invoke --help for more options.
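As a quick way to inspect the result, the following sketch assumes the output file is a JSON array in which each element is one group of duplicate file identifiers; verify this against your actual output before relying on it.

import json

with open("DuplicateCodeDetector/CloneDetectorCli.json") as f:
    groups = json.load(f)                 # assumed: list of groups, each a list of file identifiers

for group in groups:
    representative, *duplicates = group   # one file kept per group
    print(representative, "has", len(duplicates), "near duplicates")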
The input data should be one or more .jsonl.gz files. These are compressed JSONL files where each line has a single JSON entry of the form:
{
"filename": "unique identifier of file, such as a path or a unique id",
"tokens" : ["list", "of", "tokens", "in", "file"]
}
Alternative formats can be accepted by providing the --tokens-field and --id-fields options.
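If you need to produce this format yourself, for example for a language without a bundled tokenizer, a minimal sketch looks like this (file names and tokens are purely illustrative):

import gzip
import json

entries = [
    {"filename": "src/Foo.java", "tokens": ["class", "Foo", "{", "}"]},
    {"filename": "src/Bar.java", "tokens": ["class", "Bar", "{", "}"]},
]

# Write one JSON object per line into a gzip-compressed JSONL file.
with gzip.open("my_corpus.jsonl.gz", "wt", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")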
The tokenizers folder in this repository contains tokenizers for C#, Java, JavaScript and Python. Please feel free to contribute tokenizers for other languages too.
Once the code is tokenized and clones are detected, a removal script can be run:
python deduplicate.py --project project/to/deduplicate --duplicates_data data/generated/by/duplicate/detection
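Conceptually, removal keeps one file from each duplicate group and deletes the rest from the dataset copy. The sketch below shows that idea under the assumption that the duplicates data is a JSON list of groups of relative file paths; the actual deduplicate.py may differ in its details and arguments.

import json
import os

project_root = "project/to/deduplicate"       # the copy being deduplicated (placeholder)
with open("path/to/duplicates.json") as f:    # file produced by the duplicate detection (placeholder)
    groups = json.load(f)                     # assumed: list of lists of relative paths

for group in groups:
    for relative_path in group[1:]:           # keep the first file in each group
        full_path = os.path.join(project_root, relative_path)
        if os.path.exists(full_path):
            os.remove(full_path)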
The main part of this repository (tokenization and duplicate detection) works as described in, and was originally implemented for, the following paper:
@inproceedings{allamanis2019adverse,
title={The adverse effects of code duplication in machine learning models of code},
author={Allamanis, Miltiadis},
booktitle={Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software},
pages={143--153},
year={2019}
}
This project welcomes contributions and suggestions.