Near-Duplicate Code Remover

This cross-platform sample tool detects exact and near-duplicate code. It is a fork of the Near Duplicate Code Detector and mainly adds a convenient shell script that automates the deduplication process for Java datasets.

Requirements:

  • .NET Core 2.1 for parsing code; an appropriate runtime for each language that needs to be tokenized is also required.
  • Java 1.8 for tokenizing Java code.
  • Python for removing the detected duplicates from the code.

Duplicate Detection and Removal

Duplicate removal consists of tokenizing the code, detecting duplicates, copying the dataset, and then removing the duplicates from the copy. This yields a deduplicated copy while leaving the original dataset untouched. A convenient shell script is provided for this; just run:
NOTE: Works only for Java and must be run from the location of the shell script, i.e. the project root.

sh deduplicate.sh target/project/path output/path/ > clone_removal.log 2> error.log
  • After the script finishes, you can find the deduplicated dataset under output/path/. You can optionally omit the argument and instead specify the path in the shell script by changing the DEFAULT_TARGET_PROJECT_PATH value. The same goes for the output path.

Running Tokenizer

For Java, run:

java -jar tokenizers/java/target/javatokenizer-1.0-SNAPSHOT.jar /path/to/target/project/ ./output true

The last boolean argument controls granularity:

  • true - look only at identifier tokens
  • false - look at identifier names + all tokens, including things like ";" and operators like "<", "||", etc.
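
To make the two granularity settings concrete, here is a rough Python illustration of what each would emit for a small snippet. The regex split below is only an approximation for demonstration; the real tokenization is done by the Java jar above.

```python
import re

# Hypothetical example input; not taken from the repository.
CODE = "if (a < b) { count = count + 1; }"

# All tokens: words plus each punctuation/operator character.
ALL_TOKENS = re.findall(r"\w+|[^\s\w]", CODE)

# Identifier-like tokens only (granularity=true drops ";", "<", etc.).
IDENTIFIERS = [t for t in ALL_TOKENS if re.fullmatch(r"[A-Za-z_]\w*", t)]

print(IDENTIFIERS)  # roughly what granularity=true would consider
print(ALL_TOKENS)   # roughly what granularity=false would consider
```

Identifier-only comparison is more robust to formatting and operator changes, while the all-tokens setting is stricter about structural similarity.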

Duplicate Detection

To run the near-duplicate detection:

$ dotnet run /path/to/DuplicateCodeDetector.csproj --project /path/to/DuplicateCodeDetector/ --dir=<folder>

This will use all the .gz files in the <folder> generated by the tokenizer and output a DuplicateCodeDetector/CloneDetectorCli.json file with the groups of detected duplicates. Invoke --help for more options.

Input Data

The input data should be one or more .jsonl.gz files. These are compressed JSONL files where each line has a single JSON entry of the form

{
    "filename": "unique identifier of file, such as a path or a unique id",
    "tokens" : ["list", "of", "tokens", "in", "file"]
}
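
Such files can be produced with standard-library tools alone. The sketch below writes and reads back a .jsonl.gz file in the expected format; the file names and token lists are made up for illustration.

```python
import gzip
import json
import os
import tempfile

# Hypothetical entries in the required {"filename": ..., "tokens": ...} shape.
entries = [
    {"filename": "src/Foo.java", "tokens": ["class", "Foo", "{", "}"]},
    {"filename": "src/Bar.java", "tokens": ["class", "Bar", "{", "}"]},
]

path = os.path.join(tempfile.gettempdir(), "corpus.jsonl.gz")

# Write one JSON object per line into a gzip-compressed file.
with gzip.open(path, "wt", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

# Read it back to confirm the format round-trips.
with gzip.open(path, "rt", encoding="utf-8") as f:
    decoded = [json.loads(line) for line in f]
```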

Alternative formats can be accepted by providing the --tokens-field and --id-fields options.

The tokenizers folder in this repository contains tokenizers for C#, Java, JavaScript and Python. Please feel free to contribute tokenizers for other languages too.

Duplicate Removal

Once the code is tokenized and clones are detected, the removal script can be run:

python deduplicate.py --project project/to/deduplicate --duplicates_data data/generated/by/duplicate/detection
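
The core of the removal step is deciding which files to delete. Assuming the detector's output is a list of duplicate groups, each a list of file identifiers (the exact schema may differ; this is only a sketch, not the repository's deduplicate.py), the logic is to keep one representative per group:

```python
def files_to_remove(duplicate_groups):
    """Keep the first file of every duplicate group; mark the rest for removal."""
    doomed = []
    for group in duplicate_groups:
        doomed.extend(group[1:])  # retain one representative per group
    return doomed

# Hypothetical groups as the detector might report them.
groups = [["A.java", "A_copy.java"], ["B.java", "B_v2.java", "B_old.java"]]
print(files_to_remove(groups))
```

Because the shell script operates on a copy of the dataset, deleting these files is safe; the original project is never modified.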

Original Paper

The main part of this repository (tokenization and duplicate detection) works as described and originally implemented for the following paper:

@inproceedings{allamanis2019adverse,
  title={The adverse effects of code duplication in machine learning models of code},
  author={Allamanis, Miltiadis},
  booktitle={Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software},
  pages={143--153},
  year={2019}
}

Contributing

This project welcomes contributions and suggestions.