PatchScope – A Modular Tool for Annotating and Analyzing Contributions

Annotates files and lines of diffs (patches) with their purpose and type, and performs statistical analysis on the generated annotation data.

Note: this project was previously called 'python-diff-annotator', and its Python package was called 'diffannotator' rather than 'patchscope'. Some references to the older names remain, for example in directory names used in some Jupyter Notebooks.

You can find an early version of the project documentation at https://ncusi.github.io/PatchScope/, but it is currently incomplete and unpolished.

Installation

Use the package manager pip to install patchscope.

To avoid dependency conflicts, it is strongly recommended to create a virtual environment first, activate it, and install patchscope into this environment. See also "Virtual environment" subsection below.

To install the most recent version, use

python -m pip install patchscope@git+https://github.com/ncusi/PatchScope#egg=main

or (assuming that you can clone the repository with SSH)

python -m pip install patchscope@git+ssh://git@github.com/ncusi/PatchScope.git#egg=main
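After installation, one quick way to confirm that the package is visible to the active interpreter is to query its metadata. The sketch below uses only the standard library; the helper name installed_version is made up for this illustration, and the distribution name patchscope is taken from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "patchscope") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("patchscope is not installed in this environment")
    else:
        print(f"patchscope {found} is installed")
```

If this reports that the package is missing, the most likely cause is that the virtual environment it was installed into is not the one currently activated.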

Usage

This tool integrates four key components:

extracting patches from a version control system or user-provided folders
- as a separate step with diff-generate, or integrated into the annotation step with diff-annotate
applying specified annotation rules to selected patches
- using diff-annotate, which generates one JSON data file per patch
generating configurable reports or summaries
- with various subcommands of diff-gather-stats; each summary is saved as a single JSON file
advanced visualization with a web application (dashboard)
- which you can run with panel serve, see the description below

Running scripts

This package installs scripts (currently three) that you can run to generate patches, annotate them, and extract their statistics. Every script name starts with the diff-* prefix.

Each script and subcommand supports the --help option.

diff-generate: generates patches (*.patch and *.diff files) from a given repository, in a format suitable for later analysis; this step is not strictly necessary;

Usage: diff-generate [OPTIONS] REPO_PATH [REVISION_RANGE...]
(where REVISION_RANGE... is passed as arguments to the git log command)
diff-annotate: annotates an existing dataset (patch files in subdirectories), or a selected subset of commits (the changes in those commits) in the given repository;

Usage: diff-annotate [OPTIONS] COMMAND [ARGS]...
- diff-annotate patch [OPTIONS] PATCH_FILE RESULT_JSON: annotate a single PATCH_FILE, writing results to RESULT_JSON,
- diff-annotate dataset [OPTIONS] DATASETS...: annotate all bugs in provided DATASETS,
- diff-annotate from-repo [OPTIONS] REPO_PATH [REVISION_RANGE...]: create annotation data for commits from a local Git repository (with REVISION_RANGE... passed as arguments to the git log command);
diff-gather-stats: compute various statistics and metrics from patch annotation data generated by the diff-annotate script;

Usage: diff-gather-stats [OPTIONS] COMMAND [ARGS]...
- diff-gather-stats purpose-counter [--output JSON_FILE] DATASETS...: calculate count of purposes from all bugs in provided datasets,
- diff-gather-stats purpose-per-file [OPTIONS] RESULT_JSON DATASETS...: calculate per-file count of purposes from all bugs in provided datasets,
- diff-gather-stats lines-stats [OPTIONS] OUTPUT_FILE DATASETS...: calculate per-bug and per-file count of line types in provided datasets,
- diff-gather-stats timeline [OPTIONS] OUTPUT_FILE DATASETS...: calculate timeline of bugs with per-bug count of different types of lines;
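
The subcommand signatures listed above can be turned into argument lists and run from Python. The sketch below is only an illustration: the helper names and all file and directory names are hypothetical, and it uses only the positional arguments and the --output option that appear in the usage lines above. Run each script with --help to see the options it actually supports.

```python
import shlex
import shutil
import subprocess
from typing import List, Sequence


def annotate_patch_cmd(patch_file: str, result_json: str) -> List[str]:
    """argv for: diff-annotate patch PATCH_FILE RESULT_JSON"""
    return ["diff-annotate", "patch", patch_file, result_json]


def purpose_counter_cmd(output_json: str, datasets: Sequence[str]) -> List[str]:
    """argv for: diff-gather-stats purpose-counter [--output JSON_FILE] DATASETS..."""
    return ["diff-gather-stats", "purpose-counter", "--output", output_json, *datasets]


def run_if_available(argv: Sequence[str]) -> bool:
    """Run the command only if its script is on PATH; return True if it exited with 0."""
    if shutil.which(argv[0]) is None:
        print("not found, would run:", shlex.join(argv))
        return False
    return subprocess.run(list(argv), check=False).returncode == 0


if __name__ == "__main__":
    # Hypothetical file and directory names, for illustration only.
    run_if_available(annotate_patch_cmd("example.patch", "example.json"))
    run_if_available(purpose_counter_cmd("purposes.json", ["annotated_dataset"]))
```

Building the argument list separately from running it makes it easy to print or log the exact command before it is executed.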

Running web app (dashboard)

This package also includes a web dashboard, created using the Panel framework. You need to install additional dependencies to run it, for example with pip install --editable .[web] (if you are running this project in editable mode).

Currently, it includes two web apps, namely Contributors Graph and Author Statistics. You can run each app with panel serve:

panel serve src/diffinsights_web/apps/contributors.py
panel serve src/diffinsights_web/apps/author.py

By default, this makes those apps available at http://localhost:5006/contributors and http://localhost:5006/author, respectively.

You can also run both of them at once with

panel serve src/diffinsights_web/apps/*.py
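
The two URLs quoted above suggest that each app is served under a route named after its script file, without the .py extension. The following sketch computes the expected default URLs on that assumption; the function name default_app_urls is hypothetical, and the host and port are simply the defaults mentioned above.

```python
from pathlib import Path
from typing import Dict, Iterable


def default_app_urls(
    app_files: Iterable[str], host: str = "localhost", port: int = 5006
) -> Dict[str, str]:
    """Map each app script to its assumed default URL (file stem used as the route)."""
    return {
        str(path): f"http://{host}:{port}/{Path(path).stem}"
        for path in app_files
    }


if __name__ == "__main__":
    apps = [
        "src/diffinsights_web/apps/contributors.py",
        "src/diffinsights_web/apps/author.py",
    ]
    for script, url in default_app_urls(apps).items():
        print(f"{script} -> {url}")
```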

See the basic demo on Heroku:

Examples and demos

This repository also includes some examples demonstrating how this project works, and what it can be used for.

First time setup (for generating examples)

You can set up the environment for using this project, following the recommended practices (described in the "Development" section of this document), by running the examples-init.bash Bash script, and following its instructions.

Note that this script assumes that it is run on Linux, or a Linux-like system. For other operating systems, you are probably better off following the steps described in this document manually.

The script begins with a configuration section; you can change its parameters to better fit your environment:

DVCSTORE_DIR - directory with local dvc remote
PYTHON - Python 3.x executable (before activating virtual environment)

This project uses the DVC (Data Version Control) tool to track and version annotation and metrics data. DVC lets you store large files and large directories outside of the Git repository while keeping them under version control. They can be stored locally, or in the cloud.

The examples-init.bash script also configures local DVC storage (see the next subsection).

Data pipeline with DVC

To provide reproducibility, and to make it possible to version data files separately from the code, this project uses the DVC (Data Version Control) tool for its examples.

DVC pipelines are versioned using Git, and allow you to better organize projects and reproduce complete workflows and results at will. The pipeline is defined in the dvc.yaml file.

You can re-run the whole pipeline, after installing DVC, with the dvc repro command. It runs only those pipeline stages that need it, which it determines by checking whether the stage dependencies (defined in dvc.yaml) have changed. The results are saved in the DVC cache, and you can push them to a DVC remote with dvc push, if you have one configured (the examples-init.bash script from the previous subsection configures a DVC remote with storage on the local filesystem).
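
The idea of re-running only the stages whose dependencies changed can be illustrated with a small content-hash check. This is a simplified sketch of the general technique, not DVC's actual implementation; the function names and the in-memory record (standing in for what a lock file would store) are assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def file_digest(path: Path) -> str:
    """Hash the file contents, so that edits are detected regardless of timestamps."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_deps(deps: Iterable[Path]) -> Dict[str, str]:
    """Remember the current digest of every dependency (a stand-in for a lock file)."""
    return {str(dep): file_digest(dep) for dep in deps}


def stage_is_stale(deps: Iterable[Path], recorded: Dict[str, str]) -> bool:
    """A stage needs re-running if a dependency is unrecorded or its digest differs."""
    return any(recorded.get(str(dep)) != file_digest(dep) for dep in deps)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dep = Path(tmp) / "input.txt"
        dep.write_text("first version")
        lock = record_deps([dep])
        print("stale after recording:", stage_is_stale([dep], lock))
        dep.write_text("second version")
        print("stale after editing:", stage_is_stale([dep], lock))
```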

Downloading data from DAGsHub

DAGsHub is a platform for AI and ML developers that lets you manage and collaborate on your data, models, and experiments alongside your code. Among other things, it can be used as a DVC remote.

Here is the fragment from .dvc/config that defines the "dagshub" DVC remote used to store data:

['remote "dagshub"']
    url = s3://dvc
    endpointurl = https://dagshub.com/ncusi/PatchScope.s3
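
For scripting purposes, a fragment like this can be read with Python's standard configparser module, as sketched below. This is an assumption-laden illustration: DVC itself may parse its configuration differently, the helper name read_remote is hypothetical, and the fragment is embedded as a string rather than read from .dvc/config.

```python
import configparser
from typing import Dict

DVC_CONFIG_FRAGMENT = """\
['remote "dagshub"']
    url = s3://dvc
    endpointurl = https://dagshub.com/ncusi/PatchScope.s3
"""


def read_remote(config_text: str, name: str) -> Dict[str, str]:
    """Return the settings of the DVC remote called `name` as a plain dict."""
    # Interpolation is disabled so that '%' characters in URLs are kept verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # configparser keeps the quotes of the section header exactly as written.
    section = f"'remote \"{name}\"'"
    return dict(parser[section])


if __name__ == "__main__":
    for key, value in read_remote(DVC_CONFIG_FRAGMENT, "dagshub").items():
        print(f"{key} = {value}")
```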

Note that you need to install support for S3 for DVC to use this remote, see Example: installing DVC with support for Amazon S3 storage.

You can then download all data with dvc pull -r dagshub.

Alternatively, you can download data via the DagsHub web interface, from https://dagshub.com/ncusi/PatchScope:

data/examples/ includes annotations and statistics for a few example repositories:
- hellogitworld (only statistics, repository archived 2020),
- qtile,
- tensorflow (limited to top-2 non-bot authors),
- linux (years 2021-2023)
data/experiments/ includes various pieces of data computed when comparing automatic annotations generated by PatchScope with manual annotations from BugsInPy subset of HaPy-Bugs dataset, and with manual annotations from Herbold et al. "A fine-grained data set and analysis of tangling in bug fixing commits" available in SmartSHARK.

Jupyter Notebooks

The notebooks/ directory contains Jupyter Notebooks with data exploration, data analysis, etc. See notebooks/README.md for details.

Development

Virtual environment

To avoid dependency conflicts, it is strongly recommended to create a virtual environment, for example with:

python -m venv .venv

This needs to be done only once, from the top directory of the project. For each session, you should activate the environment:

source .venv/bin/activate

Using a virtual environment, either directly as shown above or through pipx, might be required if you cannot install system packages and the system Python is configured as externally managed, in which case pip fails with:

error: externally-managed-environment

× This environment is externally managed
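
Before installing anything, it can help to check whether the current interpreter already runs inside a virtual environment. A minimal sketch, relying on the standard-library behaviour that sys.prefix and sys.base_prefix differ inside a venv; the function name is made up for this example.

```python
import sys


def in_virtual_environment() -> bool:
    """True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print(f"virtual environment active: {sys.prefix}")
    else:
        print("no virtual environment active; consider running: python -m venv .venv")
```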

Installing the package in editable mode

To install the project in editable mode (from the top directory of this repo):

python -m pip install -e .

To also be able to run tests, use:

python -m pip install --editable .[dev]

Running tests

This project uses the pytest framework. Note that pytest requires Python 3.8+ or PyPy3.

To run tests, use the following command:

pytest

or

python -m pytest

Roadmap

See TODO.md.

Related projects

Here are some related projects that can also be used to extract development statistics from a project or a repository.

Command line and terminal interface tools:

git-quick-stats is a simple and efficient way to access various statistics in a git repository
git-stats provides local git statistics, including GitHub-like contributions calendars
git_dash.sh is a command-line shell script for generating a Git metrics dashboard directly in your terminal
heatwave visualizes your git commits with a heat map in the terminal, similar to how GitHub's heat map looks
statscat is a CLI tool to get statistics of all your git repositories
hxtools by Jan Engelhardt is a collection of small tools and scripts, which include git-author-stat (commit author statistics of a git repository), git-blame-stat (per-line author statistics), and git-revert-stats (reverting statistics)
git-fame (in Python) and git-fame-rb (in Ruby) are command-line tools to pretty-print Git repository collaborators sorted by contributions
git-of-theseus is a set of scripts to analyze how a Git repo grows over time.
- See The half-life of code & the ship of Theseus by Erik Bernhardsson (2016).
GitHub Linguist can also be used from the command line, using the github-linguist executable to generate repository's languages stats (the language breakdown by percentage and file size), also for selected revision
git-metrics tool is a set of util scripts to scrape data from git repositories to help teams improve (metrics such as lead time and open branches)

Tools to generate HTML dashboard, or providing an interactive web application:

GitStats is an open source GitHub contribution analyzer, providing a live dashboard
- note that gitstats.me no longer works (the domain is parked for sale)
repostat is a Git repository analyser and HTML-report generator with NVD3-driven interactive metrics visualisations
- note that the demo site https://repostat.imfast.io/ no longer works
- NVD3.js is an attempt to build re-usable charts and chart components for d3.js
Repositorch is a Git repository analysis engine written in C#; it recommends using Docker Compose to install (Repositorch on Docker Hub)
- no demo site, but there is a "How to use Repositorch" video on YouTube
cregit is a tool for helping to find and analyse code credits (unify identities, find contribution by token, extract metadata into a SQLite database, etc.)
Githru is an interactive visual analytics system that enables developers to effectively understand the context of development history through the interactive exploration of Git and GitHub metadata (demo). It uses novel techniques (paper) (graph reconstruction, clustering, and Context-Preserving Squash Merge (CSM) methods) to abstract a large-scale Git commit graph.
Assayo is a dashboard providing visualization and analysis of git commit statistics. Requires exporting data from Git. Has a homepage with demo. Its use is described in The visualization and analysis of git commit statistics for IT team leaders.

Visualizations for a specific repository:

A Git history visualization page by Jeff Palmer shows "An Interactive Development History" of Git: project and contributor statistics, relative cumulative contributions by contributor, and aggregated commits by contributor by month with milestone annotations. Jeff wrote an associated blog post about how he created the visualization.
gitdm (the "git data miner") is the tool that Greg KH and Jonathan Corbet have used to create statistics on where kernel patches come from. Written in Python. Original at git://git.lwn.net/gitdm.git

Web applications that demonstrate some MSR tool:

GitHub offers GitHub Insights for repositories (see for example Contributors to qtile/qtile). This includes the following subpages:
- Pulse (with configurable period of 1 month, 1 week, 3 days, 24 hours) shows information about pull requests and issues, and summary of changes as text (N authors pushed X commits to master, and Y to all branches. On master, M files were changed and there had been A additions and D deletions).
- Contributions per week to master, excluding merge commits {as smoothed (!) line/area plot}, for whole project, and for up to 100 authors (with configurable period of all, last month, last 3 months, last 6 months, last 12 months, last 24 months; with configurable type of contributions: commits, additions, deletions). For each author we also have summary of their contributions as text (N commits, A ++, D --).
- Commits shows two plots: bar plot of commits per week over time for the last year {without any explanation, except for information shown on mouse hover}, and line plot with days of the week on x-axis {no explanation, no information on hover (!)}. No configuration.
- Code frequency over the history of the project: additions and deletions per week (where additions use green solid lines, and deletions use red dashed lines and are plotted upside-down). No configuration.
- other pages related to GitHub specifically, or the project as whole but not its history (like Community Standards, Dependency graph, Forks, or Action Usage Metrics).
GitHub also offers Developer Overview, which among others includes the following chart:
- N contributions in last year / in YYYY, showing a heatmap using a 5-color discrete colormap, with a year's worth of weeks on the x-axis, and day of the week (Sun to Sat) on the y-axis. You can switch between the years with a "radio button" (though there is no 'last year' entry). Contributions are timestamped according to Coordinated Universal Time (UTC) rather than the contributor's local time zone.
Assayo has a homepage with a demo where you can provide the output of a given Git CLI command run in your repo to create a demo for your repo; you can also view a demo with mock data. Written in JavaScript with React.
Githru has an interactive demo, where you can select one of the following two GitHub repositories to visualize: vuejs/vue and realm/realm-java. Written in JavaScript with React, D3, dagre.
GitVision, a 3D repository graph visualization tool, has a live demo with visualization for more than 20 repositories (ranging from tiny to large), and where you can visualize your own repository by uploading the result of running the GitVision script. The demo is written in JavaScript using Vue and deployed with Vite.
GitBug-Java, a reproducible Java benchmark of recent bugs (tool accompanying the GitBug-Java: A Reproducible Java Benchmark of Recent Bugs paper (on arXiv)), has a web app visualizing the dataset. No source code is available for the web app; it seems to be written in JavaScript using Angular, with the help of Chart.js and diff2html.
Defects4J Dissection is an open-source web app that presents data to help researchers and practitioners better understand the Defects4J bug dataset. Includes a table view (the default) and charts. It is the open-science appendix of the "Dissection of a bug dataset: anatomy of 395 patches from Defects4J" paper. Written in Python and JavaScript, under the MIT license.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT
