- cleanup
- make it a separate package
- directory structure
- 'pyproject.toml' (and optionally 'setup.py')
- 'README.md'
- 'LICENSE' (MIT License)
- maybe 'MANIFEST.in'
- separate repository on GitHub
- move `__version__` to `__init__.py` or `config.py`
  (see also "version at runtime" in setuptools_scm docs and "Single-sourcing the package version" in Python Packaging User Guide)
- use `packaging.version.Version` (as a `key` function for sorting) to find cases where `__version__`
  is newer than the installed version (in which case we are for sure in editable install mode)
- add `docs/` directory (for man pages, and maybe API documentation)
- use MkDocs or Material for MkDocs for general documentation
- generate API documentation using mkdocstrings
- generate documentation for scripts using mkdocs-typer (typer is used for parsing CLI arguments)
- maybe generate manpages from MkDocs with mkdocs-manpage (at least for scripts)
- maybe include gallery of examples with mkdocs-gallery
- maybe CLI demos with Asciinema, or one of the alternatives, like shelldemo, Terminalizer, ttyrec (and possibly also ttygif)
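The editable-install heuristic above could be sketched as follows (a sketch only; the function name is hypothetical, and the actual package name would be the project's own):

```python
from importlib.metadata import version as installed_version

from packaging.version import Version


def is_probably_editable(package_name: str, dunder_version: str) -> bool:
    """Heuristic: if the in-tree __version__ is newer than what the
    installed distribution metadata reports, we are most likely running
    from an editable install (pip install -e)."""
    # Version implements PEP 440 ordering, so e.g. "0.2.0" > "0.2.dev1"
    return Version(dunder_version) > Version(installed_version(package_name))
```

`packaging.version.Version` also works as a `key=` function when sorting version strings, which plain string comparison gets wrong (e.g. `"0.10" < "0.9"` as strings).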
- maybe use build tool like Poetry, Hatch, PDM, Rye, uv, Flit,...
- maybe use in HaPy-Bug (python_bug_dataset) via a GitHub URL
- 3-4 scripts (their names may change in the future) - see `[project.scripts]`
  section in `pyproject.toml`
  - `diff-generate` (from `generate_patches.py`)
  - `diff-annotate` (from `annotate.py`)
  - `diff-gather-stats` (from `gather_data.py`)
  - `diff-augment` - augment JSON files with data from Git or from GitHub
- make it possible to use each script's features from Python
  (see for example the `process_single_bug()` function in `annotate.py`)
  and document such use (in `docs/`, in function docstring, in file docstring, in doctests)
- improve common handling of command line parameters for all scripts
  - maybe make it possible to use configuration files to set parameters for CLI
    (similarly to Hydra) with typer-config (e.g. `my-typer-app --config config.yml`)
  - maybe implement options common to all scripts, like `--version`,
    putting their implementation in `__init__.py`,
    and make use of "Options Anywhere" and "Dependency Injection" capabilities that typer-tools adds
  - maybe implement `--log-file` (defaults to '.log', supports '-' for stderr) and `--log-level`
    options, the latter with the help of click-loglevel and Typer support for Click custom type
- add logging, save logged information to a `*.log` file
  (or `*.err` and `*.messages` files): currently uses the logging module from the standard library
  - limit information logged to console to ERROR or higher, or CRITICAL only
- maybe consider using colorlog for colored log output on the console
- maybe allow structured JSON logging (e.g. if log file name ends with *.json) with python-json-logger
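python-json-logger is one option for JSON logs; the core idea can also be sketched with only the standard library (a minimal stand-in, not the actual implementation — field names here are arbitrary choices):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Minimal JSON formatter: emits one JSON object per log line,
    so a *.json log file becomes valid JSON Lines."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),  # message with %-args interpolated
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)
```

Such a formatter could be attached conditionally, e.g. `handler.setFormatter(JsonFormatter())` when the configured log file name ends with `.json`.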
- use `logger.exception()` in exception handlers, in place of `logger.error()`
- maybe consider using alternative tools:
  - loguru (possibly with pytest-loguru,
    or see "Replacing `caplog` fixture from `pytest` library" in the loguru documentation)
  - structlog (possibly with pytest-structlog plugin, or use structlog tools for testing)
- ...
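The console-vs-file split and the `logger.exception()` point above can be sketched with the standard library (handler names and the logger name are illustrative, not the project's actual setup):

```python
import logging

logger = logging.getLogger("diffannotator")  # hypothetical logger name
logger.setLevel(logging.DEBUG)

file_handler = logging.FileHandler("annotate.log")  # everything goes to the file
file_handler.setLevel(logging.DEBUG)

console_handler = logging.StreamHandler()  # console shows only serious problems
console_handler.setLevel(logging.ERROR)

logger.addHandler(file_handler)
logger.addHandler(console_handler)

try:
    1 / 0
except ZeroDivisionError:
    # logger.exception() logs at ERROR level *and* appends the current
    # traceback, which a plain logger.error() call would not include
    logger.exception("failed to process patch")
```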
This script can be used to generate patches (*.patch and *.diff files)
from a given repository, in the format suitable for later analysis:
annotating with `diff-annotate`, and computing statistics with `diff-gather-stats`.
However, you can also create annotations directly from the repository
with the `diff-annotate from-repo` subcommand.
- improvements and new features for `generate_patches.py`
  - configure what and where to output
    - `--use-fanout` (e.g. save result in 'c0/dcf39b046d1b4ff6de14ac99ad9a1b10487512.diff'
      instead of in '0001-Create-.gitignore-file.patch');
      NOTE: this required switching from using `git format-patch` to using `git log -p`,
      and currently does not save the commit message.
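The fanout naming scheme above (first two hex digits of the commit id become a subdirectory, as in Git's object store) could be computed like this (a sketch; the helper name is hypothetical):

```python
from pathlib import Path


def fanout_path(commit_sha: str, suffix: str = ".diff") -> Path:
    """Split a commit id Git-object-store style:
    'c0dcf39b…' -> Path('c0/dcf39b….diff')."""
    return Path(commit_sha[:2]) / (commit_sha[2:] + suffix)


# fanout_path("c0dcf39b046d1b4ff6de14ac99ad9a1b10487512")
# -> Path("c0/dcf39b046d1b4ff6de14ac99ad9a1b10487512.diff")
```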
This script can be used to annotate an existing dataset (patch files in subdirectories), or a selected subset of commits (of changes in commits) in a given repository.
The result of annotation is saved in JSON files, one per patch / commit.
- improvements and new features for `annotate.py`
  - subcommands
    - `patch` - annotate a given single patch file
    - `dataset` - annotate all patches in a given dataset (directory with directories with patches)
    - `from-repo` - annotate changesets of given selected revisions in a given Git repository
- parse whole pre-image and post-image files
  (only via Git currently; or via GitHub / GitLab / ...)
- configurable file type
  - global option `--ext-to-language` (the API it uses already existed)
  - global option `--filename-to-language` (using new API)
  - global option `--glob-to-language` (using new API)
  - global option `--pattern-to-purpose` (using new API)
  - (optionally?) use `wcmatch.pathlib` to be able to use `**` in patterns
    (with `globmatch` and `pathlib.GLOBSTAR`)
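Options like `--ext-to-language` or `--purpose-to-annotation` need to turn repeated `KEY:VALUE` command-line arguments into a mapping; one way to parse them (a sketch only — the scripts' actual option handling may differ):

```python
def parse_mapping(values: list[str]) -> dict[str, str]:
    """Turn ['.cmake:CMake', '.h:C'] into {'.cmake': 'CMake', '.h': 'C'}.

    Splits on the *last* ':' so keys may themselves contain colons.
    """
    result: dict[str, str] = {}
    for value in values:
        key, sep, mapped = value.rpartition(":")
        if not sep:
            raise ValueError(f"expected KEY:VALUE, got {value!r}")
        result[key] = mapped
    return result
```

Typer can feed such a parser from a repeated option declared as `list[str]`.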
- option to limit analyzing changes to only "production code" changes,
  for example with `--production-code-only`, or `--file-purpose-limit`, etc.
- support .gitattributes overrides of GitHub Linguist
- optionally use Python clone of github/linguist, namely retanoj/linguist, installed from GitHub,
  with `--use-pylinguist` (note: install requires libmagic-dev and libicu-dev libraries)
  - make it use newer version of `languages.yml` by default
  - maybe use `Language.detect(name=file_name, data=file_contents)`, or `FileBlob(file_name).language.name`
    (deprecated) to detect language based on file contents if extension is not enough to determine it
- optionally use Python wrapper around github/linguist,
  namely scivision/linguist-python, with `--use-ghlinguist`
  (e.g. via RbCall, or via rython, or other technique)
- configurable line annotation based on file purpose
  - `PURPOSE_TO_ANNOTATION` global variable
  - global option `--purpose-to-annotation` in `annotate.py` script
    - do not modify the global variable `PURPOSE_TO_ANNOTATION`,
      reuse the code from `diff-gather-stats timeline --purpose-to-annotation`
- configurable line annotation based on tokens
- separate commit metadata, diff metadata (patch size and spread metrics), and changes/diff (parsed), instead of having them intermixed together (in "v2" format).
- computing patch/diff size and spread, following
  "Dissection of a bug dataset: Anatomy of 395 patches from Defects4J"
  (and extending it) - independent implementation
  - patch size: counting added ('+'), removed ('-'), and modified ('!') lines,
    with simplified changed lines detection:
    "Lines are considered modified when sequences of removed lines are straight followed by added lines (or vice versa). Thus, to count each modified line, a pair of added and removed lines is needed."
  - patch spreading - counting number of chunks / groups:
    "A chunk is a sequence of continuous changes in a file, consisting of the combination of addition, removal, and modification of lines."
  - patch spreading - sum of spreading of chunks:
    "number of lines interleaving chunks in a patch", per file
    (counting inter-hunk distances)
  - patch spreading - number of modified source files
  - patch spreading - number of modified classes (not planned)
  - patch spreading - number of modified methods [and functions] (not planned)
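The simplified modified-line detection quoted above (pairing a run of removed lines that is immediately followed by a run of added lines, or vice versa) can be sketched over the `+`/`-` prefixes of a unified-diff hunk; this is an independent illustration, not the project's actual implementation:

```python
def count_changed_lines(hunk_lines: list[str]) -> dict[str, int]:
    """Count added ('+'), removed ('-'), and modified ('!') lines in a hunk.

    A run of removed lines directly followed by a run of added lines
    (or vice versa) pairs up into modified lines: one '!' per pair,
    with the unpaired remainder staying '+' or '-'.
    """
    counts = {"+": 0, "-": 0, "!": 0}
    i = 0
    while i < len(hunk_lines):
        kind = hunk_lines[i][:1]
        if kind not in ("+", "-"):  # context line, hunk header, etc.
            i += 1
            continue
        # measure the run of same-kind change lines ...
        j = i
        while j < len(hunk_lines) and hunk_lines[j][:1] == kind:
            j += 1
        # ... and the directly following run of the opposite kind
        other = "+" if kind == "-" else "-"
        k = j
        while k < len(hunk_lines) and hunk_lines[k][:1] == other:
            k += 1
        first_run, second_run = j - i, k - j
        paired = min(first_run, second_run)
        counts["!"] += paired
        counts[kind] += first_run - paired
        counts[other] += second_run - paired
        i = k
    return counts
```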
- check the Python (and JavaScript) code used by work mentioned above, available at
  https://github.com/program-repair/defects4j-dissection,
  and maybe use it (copy, or import from PyPI/GitHub, or include as submodule and import):
  it calls the `defects4j` binary from https://github.com/rjust/defects4j
  (Java code, Ant build system, with Perl wrappers - for Java code only)
- find out which lines were modified, and not only their count,
  with some kind of fuzzy matching between lines (RapidFuzz, thefuzz,
  maybe regex and orc, maybe `SequenceMatcher` and `get_close_matches`
  from difflib, or maybe the context diff algorithm)
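Of the fuzzy-matching options listed, `difflib.SequenceMatcher` is standard-library; a greedy sketch of pairing removed lines with their most similar added counterpart (illustrative only — the threshold and greedy strategy are assumptions):

```python
from difflib import SequenceMatcher


def match_modified_lines(removed: list[str], added: list[str],
                         cutoff: float = 0.6) -> list[tuple[str, str]]:
    """Greedily pair each removed line with the most similar added line,
    ignoring candidates below the similarity cutoff."""
    pairs: list[tuple[str, str]] = []
    remaining = list(added)
    for old_line in removed:
        best, best_ratio = None, cutoff
        for new_line in remaining:
            ratio = SequenceMatcher(None, old_line, new_line).ratio()
            if ratio > best_ratio:
                best, best_ratio = new_line, ratio
        if best is not None:
            pairs.append((old_line, best))
            remaining.remove(best)  # each added line is paired at most once
    return pairs
```

`difflib.get_close_matches()` offers a similar one-against-many lookup with a `cutoff` parameter.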
- retrieving and adding commit metadata
- from Git repository - for 'from-repo'
- from *.message files - for 'dataset' (see BugsInPy, HaPy-Bugs)
- from `git log -p` generated *.diff files - for 'dataset'
- from `git format-patch` generated *.patch/*.diff files - for 'dataset'
- from Git (or GitHub) repository provided via CLI option - for 'dataset'
- configuration file (*.toml, *.yaml, *.json, *.ini, *.cfg, or *.py);
  maybe using Hydra (see "Using Typer and Hydra together"), maybe using typer-config
  (e.g. `my-typer-app --config config.yml`), maybe using Dynaconf,
  maybe using the configparser standard library (see also: files read by the rcfile package,
  or better use platformdirs or appdirs)
- documentation on how to use API, and change behavior
- configure output format (and what to output)
- for `from-repo` subcommand: `--use-fanout`
  (e.g. save in 'c0/dcf39b046d1b4ff6de14ac99ad9a1b10487512.json',
  instead of in 'c0dcf39b046d1b4ff6de14ac99ad9a1b10487512.json')
- for `dataset` subcommand: `--uses-fanout`
  to process the result of generating patches with `--use-fanout`
- for `from-repo` and `dataset`: `--output-file=<filename>`
  to save everything into a single JSON or JSON Lines file
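Writing everything to a single JSON Lines file could look like this stdlib-only sketch (one JSON object per annotated patch, one object per line; the function name and data shape are assumptions):

```python
import json
from pathlib import Path
from typing import Iterable


def save_as_json_lines(annotations: Iterable[dict], output_file: Path) -> None:
    """Write one JSON object per line (JSON Lines), so results can be
    streamed, appended to, and read back without loading the whole file."""
    with output_file.open("w", encoding="utf-8") as fp:
        for annotation in annotations:
            fp.write(json.dumps(annotation) + "\n")
```

A JSON Lines file also plays well with `pandas.read_json(..., lines=True)` for the statistics step.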
- maybe configuration options
- maybe configuration callbacks (in Python), like in git-filter-repo
  - `AnnotatedPatchedFile.line_callback` static field
  - global option `--line-callback` in `annotate.py` script
- maybe generate skeleton, like a framework, like in Scrapy
- maybe provide an API to generate processing pipeline, like in SciKit-Learn
This script and its subcommands can compute various statistics and metrics
from patch annotation data generated by the `diff-annotate` script.
It saves extracted insights in a single file; currently only JSON is supported.
Different subcommands use different schemas and save different data.
- improvements and new features for `gather_data.py`
  - docstring for `common()` function
  - `purpose-counter` subcommand
    - rename to `dataset-summary` (and include other metrics)
    - draw Venn diagram of patches that contain added, removed and/or modified lines,
      like on Fig. 1 of "Dissection of a bug dataset: Anatomy of 395 patches from Defects4J"
- draw Venn / Euler diagram, or upsetplot, of patches that (according to unidiff) contain added and/or removed lines; see above
- table or DataFrame with descriptive statistics for patch size and spreading,
like on Table 1 in "Dissection..."
- patch size: # Added lines, # Removed lines, # Modified lines, Patch size
- patch spreading: # Chunks (Groups), Spreading, # Files,
  # Classes, # Methods
- statistics: min, 25%, 50%, 75%, 90%, 95%, max
- if missing, table or DataFrame with statistics of dataset and patch size: # Patches/Bugs/Commits, # Files, # Lines (however the last one is determined: sum of '+' and '-' lines, max, average,...) like in the first third of the table on Fig. 1(b) "dataset characteristics" in unpublished "HaPy-Bug – Human Annotated Python Bug Resolution Dataset" paper
- maybe number of patches/bugs/commits for each project,
like on Table 1 in "BugsInPy:..."
- maybe augmented with project data:
LoC (lines of code, e.g. via
SLOCCount in Perl,
loccount in Go,
pygount in Python with Pygments),
Test LoC, # Tests, # Stars
(but see "The Fault in Our Stars: An Analysis of GitHub Stars as an Importance Metric for Web Source Code")
- maybe exponential fit, half life in years, to % of commit still present in code base over time (KM estimate of survival function), like Git-of-Theseus ("The half-life of code & the ship of Theseus", see Half-life by repository section)
- maybe with Timeframe, # Bugs, # Commits, like on Table 3 in Herbold et al.
- statistics of assigned line labels over all data (automatic, human consensus),
like in Table 4 in Herbold et al.:
- labels in rows (bug fix, test, documentation, refactoring,..., no consensus, total),
- all changes, production code, other code in columns
- number of lines, % of lines (% of lines is also used in second third of table in Fig. 1(b), "line annotations", in "HaPy Bug - ..." unpublished paper)
- robust statistics of assigned line labels over all data (automatic,...)
like in table in Fig. 2(a) in Herbold et al.:
- labels in rows (bug fix, test, documentation, refactoring,..., no consensus, total),
- overall (all changes), production code in columns
- subdivided into median, MAD (Median Absolute Deviation from median), CI (Confidence Interval), >0 count
- histogram of bug fixing lines percentage per commit (overall, production code) like in Fig. 2(b,c) in Herbold et al.
- boxplot, (or boxenplot, violin plot, or scatterplot, or beeswarm plot) of percentages of line labels per commit (overall, production code) like in Fig. 2(b,c) in Herbold et al. and in Fig. 1(d) in "HaPy Bug - ..." - "distribution of number of line types divided by all changes made in the bugfix"
- maybe hexgrid colormap showing relationship between the number of lines changed
  in production code files and the percentage of bug fixing lines
  and lines without consensus, like in Fig. 9 in Herbold et al. The plot has
  - percentage of bugfixing lines (or lines without consensus) on X axis (0.0..1.0),
  - # Lines changed on Y axis using logscale (10^0..10^4),
  - and log10(# Commits) or log10(# Issues) as the hue / color (10^0..10^3, mostly),
  - with the regression line for a linear relationship between the variables overlaid,
    and the r-value, i.e. Pearson's correlation coefficient
- maybe the table of observed label combinations;
  the Table 8 in the appendix of Herbold et al.
  is for lines without consensus, but we may put lines in a single commit / patch;
  instead of the table, an UpSet Chart ("UpSet: Visualizing Intersecting Sets")
  may be used (using the `upsetplot` library/package, or the older `pyupset` for Python)
- add `--output` option - currently supports only the JSON format
  - support for `-` as file name for printing to stdout
- `purpose-per-file` subcommand
  - table, horizontal bar plot, or pie chart - of % of file purposes in the dataset,
    like bar plot in left part of Fig. 1(c) "percentage of lines by annotated file type"
    in "HaPy Bug - ..." unpublished paper
  - composition of different line labels for different file types,
    using horizontal stacked bar plot of percentages, or many pie charts,
    like the stacked bar plot on the right part of Fig. 1(c)
    "breakdown of line types by file type" in "HaPy Bug - ...";
    though note that for some file types all lines are considered to be a specific type,
    and that this plot might be more interesting for human-generated line types,
    rather than for line types generated by the `diff-annotate` tool
- `lines-stats` subcommand
  - fix handling of `'commit_metadata'` field (skip it)
- `timeline` subcommand
  - maybe create pandas.DataFrame and save as Parquet, Feather, HDF5, or pickle
  - maybe resample / groupby (see `notebooks/`)
  - print information about results of `--purpose-to-annotation`
  - include information about patch size and spread metrics
- store only basename of the dataset in *.json output, not the full path
- global option `--output-format` (json, maybe jsonlines, csv, parquet,...)
- global options `--bugsinpy-layout`, `--from-repo-layout`, `--uses-fanout`
  (mutually exclusive), configuring where the script searches for annotation data;
  print errors if there is a mismatch of expectations vs reality (if detectable)
(mutually exclusive), configuring where the script searches for annotation data; print errors if there is a mismatch of expectations vs reality (if detectable) - option or subcommand to output flow diagram
(here the flow could be from file purpose to line type,
or from directory structure (with different steps) to line type or file purpose)
using:- Mermaid diagramming language (optionally wrapped in Markdown block)
- Plotly (for Python)
plotly.graph_objects.Sankey()
/plotly.express.parallel_categories()
(orplotly.graph_objects.Parcats()
), or
HoloViewsholoviews.Sankey()
- with Bokeh and matplotlib backends, or
pySankey - which uses matplotlib, but is limited to simple two divisions flow diagram
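For the Mermaid option, emitting the diagram is just string building; a sketch that renders purpose-to-line-type counts as a flowchart with labeled edges (the `{(source, target): count}` data shape is a hypothetical choice; Mermaid's newer `sankey-beta` diagram type would be another target syntax):

```python
def to_mermaid(flows: dict[tuple[str, str], int]) -> str:
    """Render {(source, target): count} flows as a Mermaid flowchart,
    with the count as the edge label; sorted for stable output."""
    lines = ["flowchart LR"]
    for (source, target), count in sorted(flows.items()):
        lines.append(f"    {source} -->|{count}| {target}")
    return "\n".join(lines)
```

The result can be pasted into any Mermaid renderer, or wrapped in a ```` ```mermaid ```` block in Markdown.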
- option or subcommand to generate ASCII-art chart in terminal;
  perhaps using Rich (used by typer by default) or Textual, or just Colorama -
  perhaps with tabulate or termtables. Possibilities:
- pure Python: horizontal bar, created by repeating a character N times,
  like in "How to Create Stunning Graphs in the Terminal with Python"
- terminalplot - only XY plot with '*', minimalistic
- asciichartpy - only XY plot, somewhat configurable, uses Node.js asciichart
- uniplot - XY plots using Unicode, fast, uses NumPy
- termplot - XY plots and histograms, somewhat flexible
- termplotlib - XY plots (using gnuplot), horizontal and vertical histograms
- termgraph - candle stick graphs drawn using Unicode box drawing characters, with Colorama used for colors
- plotille - XY plots, scatter plots, histograms and heatmaps in the terminal using braille dots
- termcharts - bar, pie, and doughnut charts, with Rich compatibility
- plotext - scatter, line, bar, histogram and date-time plots (including candlestick), with support for error bars and confusion matrices
- matplotlib-sixel - a matplotlib backend which outputs sixel graphics onto the terminal
  (`matplotlib.use('module://matplotlib-sixel')`)
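The pure-Python possibility from the list above (a horizontal bar built by repeating a character) is small enough to inline; a sketch:

```python
def bar_chart(data: dict[str, int], width: int = 40, char: str = "#") -> str:
    """Render a horizontal bar chart by repeating `char`,
    scaling the largest value to `width` characters."""
    longest_label = max(len(label) for label in data)
    largest = max(data.values())
    rows = []
    for label, value in data.items():
        # at least one char for any non-zero value, none for zero
        bar = char * max(1, round(width * value / largest)) if value else ""
        rows.append(f"{label.rjust(longest_label)} | {bar} {value}")
    return "\n".join(rows)
```

With a Unicode block character (e.g. `char="█"`) this matches the look of most of the terminal-plotting libraries listed, at a fraction of the dependency cost.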