diff --git a/chapters/ch01_introduction.tex b/chapters/ch01_introduction.tex index 14805aa..9d0d6b1 100755 --- a/chapters/ch01_introduction.tex +++ b/chapters/ch01_introduction.tex @@ -189,7 +189,7 @@ \subsection{SBFL in Action} from three integers. The program contains a bug on line 6 where the wrong maximum value is detected. The figure also shows seven different test cases that send various inputs to the function and check whether the actual output matches the -expected. The results of each test is found on the last row of the table. +expected. The results of each test are found on the last row of the table. Additionally, the large dots under the Input Tests column illustrate the concept of code coverage. For each line of code and test input, a dot in the cell means that the line was executed when this input was passed. On the rightmost column @@ -238,7 +238,7 @@ \section{Main Aims} packages such as Pytest and Coverage.py can be used to collect test suite data in order to calculate suspiciousness. AFLuent runs as a Pytest plugin and is integrated with the command line interface of Pytest, this feature increases -it's accessibility by allowing developers to easily integrate it into their +its accessibility by allowing developers to easily integrate it into their development environment. Following the implementation of AFLuent, this research evaluates the @@ -274,7 +274,7 @@ \section{Research Questions} available literature on SBFL is analyzed and the most popular and cited formulas are included in the implementation of AFLuent. Answering this question also requires that each approach is evaluated through an experiment section. -Since this research question is includes two separate sections, it's further split into +Since this research question includes two separate sections, it's further split into smaller sub-questions discussed below.
\begin{center} @@ -301,7 +301,7 @@ \section{Research Questions} To ensure correctness and effectiveness, the implemented formulas in AFLuent are evaluated through experiments that measure their accuracy in sorting suspicious -statements and blocks. More specifically, the formulas will be assessed in in +statements and blocks. More specifically, the formulas will be assessed in the context of Python projects that use the Pytest unit testing framework. More details on this research question can be found in the evaluation section. @@ -323,7 +323,7 @@ \section{Research Questions} In addition to ensuring a smooth user experience while utilizing AFLuent functionalities, setup process and usage of the tool AFLuent is simplified to facilitate installation. Clear and descriptive documentation is also a crucial -step in making AFLuent accessible available for new users. +step in making AFLuent accessible and available for new users. \section{Thesis Outline} \label{sec:outline} @@ -335,6 +335,6 @@ \section{Thesis Outline} standards. The different tools used to build and test AFLuent are also discussed in the methods sections. Following that, the evaluation section describes the steps taken to evaluate AFLuent by testing the tool and -collecting data regarding it's output. The evaluation section also includes an +collecting data regarding its output. The evaluation section also includes an analysis of the results of the evaluation and various plots and charts that show the findings. diff --git a/chapters/ch02_relatedwork.tex b/chapters/ch02_relatedwork.tex index cd97adf..b14aa1c 100755 --- a/chapters/ch02_relatedwork.tex +++ b/chapters/ch02_relatedwork.tex @@ -6,7 +6,7 @@ \chapter{Related Work} to facilitate debugging and increase developer efficiency. Considering that AFLuent relies on many concepts developed by this literature, this section will explore and discuss how past work shapes AFLuent. Several sections are created -to for specific area of literature. 
+for specific areas of literature. \section{Automated Fault Localization} \label{sec:AFLlit} @@ -29,9 +29,9 @@ \section{Automated Fault Localization} majority of found papers are focused on Spectrum-Based Fault Localization (SBFL). Overall this research provides a great starting point to find and compare the different types and approaches of AFL. -Another benefit of this resources is that +Another benefit of these resources is that Wong et al. \cite{wong2016survey} expands on the types of SBFL -and reviews key literature that contributes show the benefits and drawbacks of +and reviews key literature that helps show the benefits and drawbacks of each approach. Another insightful survey paper is by Idrees Sarhan et. al \cite{sarhan2022Challenges} @@ -50,7 +50,7 @@ \subsubsection{Similarity Coefficient Based Technique} One of the most relevant SBFL techniques described by Wong et al. \cite{wong2016survey} is similarity coefficient based ones. Generally, these approaches seek to quantify how close ``the execution pattern of a statement is to the -failure pattern of all test cases'', where the the closer they are the more +failure pattern of all test cases'', where the closer they are the more likely that this statement to contain the error. In order to create a measurement of closeness, several equations have been developed and evaluated by past literature. Figure \ref{fig:sbfl_eq} shows some of the equations reviewed by @@ -90,14 +90,14 @@ \subsubsection{Tarantula} what causes the numerator to grow larger. This means that an increase in failed tests that cover the element cause an increase in suspiciousness. Additionally, a decrease in the number of failing tests that do not cover the element also -increase suspiciousness. Considering these two points, Tarantula gives a better +increases suspiciousness.
Considering these two points, Tarantula gives a better indicator of suspiciousness when there are fewer failures in tests covering elements not under inspection. In addition to the logical analysis of the equation previous works provide an empirical evaluation of Tarantula in comparison to other formulas. Jones et al. \cite{Jones2005TarantulaEval} compares the effectiveness and efficiency of Tarantula to techniques such as Set Union, Set Intersection, and Nearest Neighbor. The results demonstrate that Tarantula -outperform the other Techniques where it provided a better guidance to the +outperformed the other techniques where it provided better guidance to the developer. Using Tarantula a developer would need to manually inspect fewer elements of the program compared to when using other approaches. @@ -109,7 +109,7 @@ \subsubsection{Tarantula} exist, the first one based on the number of failed tests covering the element, and then the suspiciousness scores. The empirical results in Debroy et al. \cite{debroy2010grouping} show a statistically significant improvement provided -by this grouping technique where the developer need to review less elements and +by this grouping technique where the developer needs to review fewer elements and more faults are accurately detected. While Debroy et al. only applied the grouping technique to Tarantula and a neural network-based approach, it could be extended to include other similarity coefficient based techniques. @@ -124,7 +124,7 @@ \subsubsection{Ochiai} \label{subsubsec:ochiai_lit} Ochiai is another similarity coefficient formula for SBFL that uses code -coverage information and test output to produce as suspiciousness score. +coverage information and test output to produce a suspiciousness score. Originally used in computing genetic similarity in molecular biology and evaluated in Abreu et al. \cite{Abreu2006Ochiai}, the equation for this approach is shown in fog.\ref{fig:ochiaiEquation}.
Similar to Tarantula, the number of @@ -133,12 +133,12 @@ \subsubsection{Ochiai} of tests that cover the element, unlike Tarantula, however, it does not consider successful tests that do not cover the element. Papers such as \cite{Abreu2006Ochiai,ABREU20091780} also evaluate the -performance of Ochiai in comparison to other such as Tarantula, AMPLE, and +performance of Ochiai in comparison to others such as Tarantula, AMPLE, and Jaccard. Another evaluation of Ochiai is done by Le et al. \cite{le2013theory} where it was found to have a statistically significant improvement when compared to Tarantula. The paper demonstrates that on average developers only need to inspect 21.02\% of the source code before finding the fault. -AFLuent includes and implementation and evaluation of Ochiai to +AFLuent includes an implementation and evaluation of Ochiai to validate that it performs as expected compared to the Tarantula technique. Additionally, considering that Ochiai is considered a fairly accurate and effective formula to detect faults, AFLuent takes advantage of the performance @@ -170,7 +170,7 @@ \subsubsection{DStar} information of a program to locate and rank faults. The equation for this approach can be found in figure \ref{fig:dstarEquation}. Wong et al. \cite{Wong2014DStar} introduce and extensively evaluate this approach in a 2014 -paper that demonstrate it's effectiveness compared to other formulas. In the +paper that demonstrates its effectiveness compared to other formulas. In the process of constructing D*, the paper lists the factors involved in determining suspiciousness of an element.
The principles are as follows: \begin{enumerate} @@ -185,34 +185,34 @@ \subsubsection{DStar} \end{enumerate} Considering that multiplying \(\textbf{N$_{CF}$}\) by a constant to increase its -weight will not affect the ranking of statements, he authors argue that -rasing \(\textbf{N$_{CF}$}\) to a value * greater than +weight will not affect the ranking of statements, the authors argue that +raising \(\textbf{N$_{CF}$}\) to a value * greater than or equal to 1 would be more appropriate in increasing the weight of this variable. The study continues by illustrating how increasing the value of * produces more clear rankings that facilitate the debugging process by requiring -the developer to examine less elements in bot the best and worst case. However, -the authors also point out that this benefit of increasing teh value of * levels +the developer to examine fewer elements in both the best and worst case. However, +the authors also point out that this benefit of increasing the value of * levels off at a certain point depending on the size of the program under analysis. The paper concludes by reviewing performance results showing that D* is more effective than the previously discussed formulas (Tarantula, Ochiai, and Ochiai2). With that in mind, D* offers the latest and most effective formula to calculate suspiciousness compared to all others included in this research. AFLuent implements D* to validate this step up in effectiveness in the context of -Python projects and gives the user the ability to use t. +Python projects and gives the user the ability to use it. \subsection{Combining Approaches} \label{subsec:combining_approaches} -While AFLuent only relies SBFL approaches in its implementations, it's +While AFLuent only relies on SBFL approaches in its implementations, it's useful to explore other methodologies that could assist in the debugging -process. This creates a guide for potential extention of AFLuent and +process.
This creates a guide for potential extension of AFLuent and provides a way to fill in the shortcomings of AFLuent. Xuan et al. explores the possibility of combining several SBFL metrics of fault localization and introducing a machine learning model to assist with the ranking \cite{Xuan2014Combine}. While AFLuent does not support this approach, Xuan et al. shows some promising results that could potentially uncover performance improvements in fault localization. There are many tricky aspects of this -research, especially that it suggests to train a machine learning model to +research, especially since it suggests training a machine learning model to assist with ranking. Depending on the data used to train the model, the results could be very different. Overall, while AFLuent does not use machine learning, this research provides a great idea for future work and improvements. @@ -236,7 +236,8 @@ \subsection{Acknowledging Problems} \label{subsec:acknowledging_problems} With the multitude of approaches and formulas to use in SBFL, various criticisms -are brought up for each proposed research. In a survey study, Wong et al. +are brought up for each proposed research. Some research even suggests that SBFL +and AFL in general are not effective for all developers \cite{parnin}. In a survey study, Wong et al. \cite{wong2016survey} identifies a series of issues and concerns surrounding SBFL in general. The main one being the central problem of giving failed and successful tests accurate weights in order to produce a meaningful @@ -251,22 +252,22 @@ \subsection{Acknowledging Problems} One of the brought up concerns of SBFL is the inclusion of passed program spectra in calculating suspiciousness of an element. Xie et al.
\cite{xie2010isolating} argue that while a failed program test case does -indicate the presence of an error a passed program spectra/test data, ``is not +indicate the presence of an error, a passed program spectra/test data ``is not guaranteed to be absolutely free of any faulty statement''. With that in mind, -passed tests information alone do not give reliable results on an element +passed test information alone does not give reliable results on an element's suspiciousness. The proposed approach to mitigate this problem is to organize program entities into two main groups, those who have been ``activated'' at least once by a failed program spectra, and ``clean'' ones, which have not at all. The research continues by experimenting with this approach and presenting results that showed some signs of improvement on existing SBFL formulas. Overall, this research provides a way to address inaccuracies with AFLuent and -assists in expending the project beyond simple calculations based on formulas. +assists in expanding the project beyond simple calculations based on formulas. Another concern with the use of SBFL to debug programs is the possibility of having equal suspiciousness scores assigned to multiple statements. These ties hinder the debugging process and present the developer with a dilemma. Which element should be inspected first? they're equally suspicious! This problem -becomes more significant when only one of the tied elements actually contain the +becomes more significant when only one of the tied elements actually contains the fault. A study by Xu et al. \cite{xu2011ties} recognizes this problem and expands on the different outcomes. In the best case, the developer picks the statement containing the fault as their first choice and finds the error right @@ -314,7 +315,7 @@ \section{Existing Tools} program spectra and calculates suspiciousness scores using Tarantula, Ochiai, and DStar approaches.
Overall, CharmFL has many similarities with AFLuent, but it's also less accessible considering that it's a PyCharm plugin which is not -used by every developer. Overall, the implementation of CharmFL provides and +used by every developer. The implementation of CharmFL provides inspiration for AFLuent and encourages improvements where CharmFL may fall short. @@ -335,14 +336,14 @@ \section{Usability and Accessibility} and verbosity of output messages from the tool. Instead of simply displaying the ranked scores of statements, it would be more user friendly to explain the meaning of the output to guide the user into beginning the debugging process. -Kohn \cite{kohn2019error} explores the experience of beginner with Python errors +Kohn \cite{kohn2019error} explores the experience of beginners with Python errors with different severity and various Python interpreter error output. The results confirm that more clear error messages tend to have a higher percentage of students finding and fixing the error. This connection between error output and the ability for beginner developers to fix faults is very crucial in the case of -AFLuent. And while a user survey is out of scope of this research, it's Kohn +AFLuent. And while a user survey is out of scope of this research, Kohn provides encouragement to account for the different use cases in AFLuent and -attempt to provide a clear output that describes the fault and guides the +to attempt to provide a clear output that describes the fault and guides the developer for the next step. Another aspiration of AFLuent is to assist beginners in debugging their code in @@ -350,7 +351,7 @@ \section{Usability and Accessibility} identifying popular python errors in Python among beginners, cause of faults can more quickly be pointed out after statement ranking has been produced. These steps require additional analysis of the suspicious statements by analyzing -their syntax to identify potential cause.
The goal of AFLuent would then become +their syntax to identify potential causes. The goal of AFLuent would then become more than simply locating the fault, but also giving an educated guess regarding the reason behind the error. Cosman et al. \cite{cosman2020pablo} create a tool named PABLO that uses a trained classifier to identify common bugs and faults in diff --git a/chapters/ch03_method.tex b/chapters/ch03_method.tex index 0eaa074..b99b6ef 100755 --- a/chapters/ch03_method.tex +++ b/chapters/ch03_method.tex @@ -4,14 +4,14 @@ \chapter{Method of Approach} This chapter describes the implementation of AFLuent and the experiment setup and execution process. More specifically, the reasoning behind design decisions and the result are the main focus. Additionally, charts and diagrams are used to -demonstrate the the algorithms, structure, and flow of execution. +demonstrate the algorithms, structure, and flow of execution. \section{Development Environment and Toolset} \label{sec:DevEnviron} In order to begin discussing how AFLuent is implemented, a ground-up overview of the tools used and their roles is necessary to establish definitions and facilitate -the understanding of how dependencies they are connected. By being a Python +the understanding of how dependencies are connected. By being a Python package AFLuent can rely on a wide variety of helpful and popular tools. Some of the most important tools and dependencies are discussed below. @@ -19,7 +19,7 @@ \subsection{Poetry} \label{subsec:poetry} Poetry is a Python virtual environment management tool that allows developers to -set up an isolated environment for their projects. Furthermore, it manges the +set up an isolated environment for their projects. Furthermore, it manages the installation of Python dependencies on the virtualenv and updates them when necessary. 
Poetry has a crucial role in the implementation of AFLuent since it's used to make the development process simpler, its role also goes beyond @@ -43,10 +43,10 @@ \subsection{Coverage.py} \label{subsec:coverage} Spectrum-based fault localization requires data on code coverage and test -results in order to calculate and rank suspicious of elements in the code. +results in order to calculate and rank the suspicious elements in the code. Coverage.py\cite{coverage_py_website} is a Python tool that provides an easy to use application programming interface to collect that data. The tool also provides various -configuration for the user to skip certain files or directories from being +configurations for the user to skip certain files or directories from being considered. AFLuent relies on this tool to calculate what's known as per-test coverage. This data describes the lines of code covered by a single test case and organized in an accessible way to find out the number of passing and failing @@ -69,7 +69,7 @@ \subsection{Radon} \subsection{Libcst} \label{subsec:libcst} -In order to provide additional methods to breaking ties between element +In order to provide additional methods to break ties between element rankings, Libcst is used to create an abstract syntax tree of the code in question. This approach allows AFLuent to detect error prone syntax and formulate a score to use to break ties between lines if the need arises. @@ -92,7 +92,7 @@ \section{AFLuent as a Pytest Plugin} packaged hooks that fit in the workflow of Pytest. AFLuent makes use of five different hooks to implement automated fault localization. Figure \ref{fig:pytest_flow} shows a general overview of the steps changed in the -workflow of Pytest. Additionally, the section below describes in details how each +workflow of Pytest. Additionally, the section below describes in detail how each step was modified. \section{Installing AFLuent} @@ -103,7 +103,7 @@ \section{Installing AFLuent} complications. 
To achieve that, AFLuent is published to the Python Package Index (PyPI), which makes it installable through the \code{pip install afluent} command. Once this command runs successfully, AFLuent is automatically -integrated with Pytest as a plugin and will be ran with every pytest session +integrated with Pytest as a plugin and will be run with every pytest session when the user specifies. AFLuent's dependency on Coverage.py creates a small but avoidable conflict. In the case that AFluent and another plugin that utilizes Coverage.py is active in the same Pytest session, various errors might @@ -119,7 +119,7 @@ \subsection{Adding Command-Line Arguments} \label{subsec:pytest_cli} Pytest already supports a multitude of command line arguments that allow the -user to pass configuration that change how the test suite is executed and +user to pass configurations that change how the test suite is executed and reported. Similarly, AFLuent requires user passed arguments to complete a variety of tasks. The hook \code{pytest\_addoption} allows adding new arguments in a fashion similar to the \code{argparse} Python library. @@ -156,7 +156,7 @@ \subsection{Adding Command-Line Arguments} \end{itemize} \item The types of file reports to create after the Pytest session is over \begin{itemize} - \item \code{---report}: accepts \code{json} or \code{csv} and generate + \item \code{---report}: accepts \code{json} or \code{csv} and generates reports with the passed format. \item \code{---per-test-report}: requires that a per-test coverage report is produced. This report is only generated in JSON format. @@ -174,7 +174,7 @@ \subsection{Adding Command-Line Arguments} \subsection{Activating AFLuent} \label{subsec:activate_afluent} -After arguments are passed, the next steps parses through some of them to check +After arguments are passed, the next steps parse through some of them to check if AFLuent was enabled and to validate some of their values. 
The Pytest hook \code{pytest\_cmdline\_main} gives access to the collected configuration. In this hook, checks are conducted to see if there are other active plugins that @@ -195,7 +195,7 @@ \subsection{Calculating Per-test Coverage and Test Result} \begin{enumerate} \item \code{cov.start()} begins recording coverage - \item the hook yields back control to Pytest which calls the individual test case + \item The hook yields back control to Pytest which calls the individual test case \item \code{cov.stop()} stops recording coverage \item The collected data is then organized in a simpler structure defined as the program spectra @@ -212,7 +212,7 @@ \subsection{Reporting Results} The last step in AFLuent execution as part of Pytest is to report the fault localization outcome. The \code{pytest\_sessionfinish} hook is used to detect the -exit code of the session and display output on the console accordingly. An exist +exit code of the session and display output on the console accordingly. An exit code of 0 means that all tests have passed and there is no need to perform fault localization, therefore, a message would display that to the user before finishing the Pytest run. On the other hand an exit code of 1, would indicate @@ -268,7 +268,7 @@ \subsubsection{Division by Zero: Tarantula} to acknowledge that the outcomes reached here only apply to code that has been covered by the test suite. Faulty lines, which are not covered by any test case will not be investigated since there is no data to calculate their -suspiciousness. Using values plugged in to Figure \ref{fig:tarantulaEquation}, the +suspiciousness. Using values plugged into Figure \ref{fig:tarantulaEquation}, the examples in Figure \ref{fig:taran_div_by_zero_1} and Figure \ref{fig:taran_div_by_zero_2} show the two possible cases where division by zero might occur in the Tarantula equation. 
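The guarded Tarantula computation described in this hunk can be sketched in Python as follows. This is a minimal illustration of the two division-by-zero cases discussed in the text; the function and parameter names are assumptions for the sketch, not AFLuent's actual API:

```python
def tarantula(passed_cover, failed_cover, total_passed, total_failed):
    """Tarantula suspiciousness with the division-by-zero guards
    described above (illustrative names, not AFLuent's API)."""
    if total_failed == 0:
        return 0.0  # no failed tests at all: the element is not suspicious
    if total_passed == 0:
        return 1.0  # no passing tests at all: maximum suspiciousness
    fail_ratio = failed_cover / total_failed
    pass_ratio = passed_cover / total_passed
    if fail_ratio + pass_ratio == 0:
        return 0.0  # line covered by no test in either category
    return fail_ratio / (fail_ratio + pass_ratio)
```

The two early returns correspond to the two cases illustrated in the referenced figures: zero total failed tests yields a score of 0, and zero total passing tests yields the maximum score of 1.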
@@ -293,16 +293,16 @@ \subsubsection{Division by Zero: Tarantula} \end{center} \end{figure} -Figure \ref{fig:taran_div_by_zero_1} shows the case where there is a total of zero +Figure \ref{fig:taran_div_by_zero_1} shows the case where there are zero total failed test cases, covering AND not covering the element, in which a score of 0 is assigned to the element. On the other hand, -Figure \ref{fig:taran_div_by_zero_2} shows the case where there is a total of zero +Figure \ref{fig:taran_div_by_zero_2} shows the case where there are zero total passing test cases, in which a maximum score of 1 is assigned to the element. \subsubsection{Division by Zero: Ochiai} \label{subsubsec:div_by_zero_ochiai} -Division by zero occurs in the Ochiai formula when there is not test coverage +Division by zero occurs in the Ochiai formula when there is no test coverage information for a line or when the total number of failed tests is zero. The latter case indicates that the line is not suspicious since it did not cause any failures, therefore, zero is returned as the suspiciousness score. @@ -332,9 +332,9 @@ \subsubsection{Division by Zero: Ochiai} \end{center} \end{figure} -Figure \ref{fig:ochiai_div_by_zero_1} is an example of when there is zero total +Figure \ref{fig:ochiai_div_by_zero_1} is an example of when there are zero total failed test cases, resulting in a zero suspiciousness score. -However Fig\ref{fig:ochiai_div_by_zero_2} shows an example where there is no +However, Figure \ref{fig:ochiai_div_by_zero_2} shows an example where there are no failed or successful tests that cover the line. This scenario does not occur in AFLuent, which only looks at lines that have some coverage data through passing or failing tests. Therefore, it wasn't necessary to handle this possibility. @@ -419,7 +419,7 @@ \subsubsection{Division by Zero: DStar} In the DStar equation, division by zero takes place only in one case.
When the number of passing test cases that cover the line AND the number of failed test cases that do not cover the line are both zero, the denominator evaluates to -zero. This translates to the following: if there is no passing tests executing +zero. This translates to the following: if there are no passing tests executing this line and no failing test executing other lines only, then this line should have the maximum suspiciousness score possible. However, since DStar has a numerator raised to a power set by the user, it has no numerical upper limit on @@ -443,7 +443,7 @@ \subsection{ProjFile Object} ProjFile objects are designed to contain attributes that describe whole files. Additionally, they support functionality that apply to these files. The most -important attributes of these objects is the \code{lines} instance variable, +important attribute of these objects is the \code{lines} instance variable, which stores a dictionary of contents of the file. Specifically, the keys in this dictionary are line numbers in the file and the values are the Line objects discussed previously. In addition to storing this data, ProjFile implements an @@ -492,11 +492,11 @@ \subsection{Objects Overview} \end{center} \end{figure} -Fig\ref{fig:oop_structure} provides a visual simplification of the different +Figure \ref{fig:oop_structure} provides a visual simplification of the different components of AFluent and an overview of their roles in the functioning of the tool. Overall the nested structure creates several layers that facilitate development by isolating the different components and hiding unnecessary -information from other objects in the hierarchy. By following this structures, +information from other objects in the hierarchy. By following this structure, unit tests can be written much easier and debugging becomes a simpler task.
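The nested Line/ProjFile structure described in these hunks can be sketched roughly as below. The class names and the `lines` dictionary keyed by line number mirror the text, but the remaining fields and the helper method are illustrative assumptions, not AFLuent's exact implementation:

```python
class Line:
    """One source line's per-test coverage record (fields are assumptions)."""

    def __init__(self, number):
        self.number = number
        self.passed_by = set()  # ids of passing tests that cover this line
        self.failed_by = set()  # ids of failing tests that cover this line
        self.score = 0.0        # suspiciousness, filled in later


class ProjFile:
    """A whole file: a dictionary of Line objects keyed by line number."""

    def __init__(self, path):
        self.path = path
        self.lines = {}  # {line_number: Line}, as described in the text

    def register(self, line_number, test_id, passed):
        # Hypothetical helper: record that `test_id` covered this line.
        line = self.lines.setdefault(line_number, Line(line_number))
        (line.passed_by if passed else line.failed_by).add(test_id)
```

Isolating per-line data behind these small objects is what makes the unit testing and debugging benefits mentioned above plausible: each layer can be exercised on its own.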
\section{AFLuent's Output} @@ -522,10 +522,10 @@ \subsubsection{Success Messages} \label{subsubsec:success_message} This message is produced in the case that the test suite passes with no -error or failures. Using bright green highlighted message with bold white +error or failures. Using bright green highlighted text with bold white letters, the message displays: \code{All tests passed, no need to diagnose using -AFLuent}. Fig\ref{fig:success_message} demonstrates the success message output -when AFLuent is ran on the project's test suite. +AFLuent}. Figure \ref{fig:success_message} demonstrates the success message output +when AFLuent is run on the project's test suite. \begin{figure}[!htb] \begin{center} @@ -580,7 +580,7 @@ \subsubsection{Warning Messages} Python environment but not enabled by the user through the \code{---afl} or \code{---afl-debug} flags. This message serves as a reminder to the user to enable the plugin if they're interested in utilizing fault localization. The -message is shown in Fig\ref{fig:warning_message_2}. +message is shown in Figure \ref{fig:warning_message_2}. \begin{figure}[!htb] \begin{center} @@ -645,7 +645,7 @@ \subsection{Console Report} suspiciousness score possible (usually 1), the color indicates that these elements are extremely likely to be the ones causing the fault. For the remaining results, the top 20\% are highlighted using orange to show that they -are risky of being faulty. The remaining non-zero elements are highlighted using +are at risk of being faulty. The remaining non-zero elements are highlighted using yellow. Figure \ref{fig:report_2} shows an example of how safe statements are displayed in the report. @@ -685,7 +685,7 @@ \subsection{Random} \label{subsec:tiebreak_random} Random tie breaking is the baseline approach for dealing with ties in -suspiciousness scores. It's the method to compare other approaches to in order
Other approaches are compared to random tie breaking in order to detect if there are any improvements. In random tie breaking, statements are ranked in a descending order by the chosen suspiciousness score first, however, the order between tied elements is random. This could lead to different rankings @@ -698,7 +698,7 @@ \subsection{Cyclomatic Complexity} complexity as a secondary score to consider when sorting. This score is proposed by McCabe \cite{cyclomatic_complexity} and can be easily calculated using the Radon library. It measures the number of available paths that execution -cold go through in a function. Since this type of score only applies to whole +could go through in a function. Since this type of score only applies to whole functions and not to individual elements, lines inherit the cyclomatic complexity score of the function they live in when being ranked.Essentially, if a function has a cyclomatic complexity score of 7, then all statements within @@ -714,7 +714,7 @@ \subsection{Mutant Density: Logical Set} density \cite{Parsai_2020}. More specifically, this score indicates how error prone the statement is by calculating the number of all possible mutants. For example, a statement that has many mathematical and logical operators to perform a calculation is -more error prone that a statement that only has one or two of these operations. +more error prone than a statement that only has one or two of these operations. With that in mind, AFLuent uses this information to break ties between statements that have the same suspiciousness score but are syntactically different. @@ -756,22 +756,22 @@ \subsection{Mutant Density: Enhanced Set} \subsection{Mutant Density: Enhanced Set} \label{subsec:tiebreak_mutant_density_enhanced} -This approach to tie breaking also uses mutant density evaluate how error prone +This approach to tie breaking also uses mutant density to evaluate how error prone a statement is.
However, it seeks to provide a more holistic metric that also
-considers how error prone the constructs that an a statement is nested in. For
+considers how error prone the constructs that a statement is nested in are. For
example, a statement inside a multi-level if statement, which is also nested in
a loop, is more error prone than a statement which is outside these constructs.
Since there is more room for errors in loops and if statement conditions, a
statement nested in them takes on this risk of error. In order to measure this
-score, the list of mutant used in the logical set was extended to include
+score, the list of mutants used in the logical set was extended to include
additional constructs that might contain the error. Additionally, the
tiebreaker looks through and scores each block that the statement is nested in.
-Table\ref{table:enhanced_set_mutants} shows the additional mutants that are
-looked for in the enhanced set. Additionally, Table\ref{table:construct_scoring}
+Table \ref{table:enhanced_set_mutants} shows the additional mutants that are
+looked for in the enhanced set. Additionally, Table \ref{table:construct_scoring}
discusses how a score is assigned to each construct while parsing through the
syntax tree and assessing how error prone a statement is. All of these
approaches are used to calculate a score that gets used in breaking ties of
-suspicious statements while ranking. Figure\ref{fig:enhanced_score_equation}
+suspicious statements while ranking. Figure \ref{fig:enhanced_score_equation}
shows how each construct score is used in generating the final score of a statement.
\begin{table}[!htb]
@@ -864,7 +864,7 @@ \subsection{Tiebreaking Overview}
Using all the tie breaking approaches discussed previously, this subsection
provides an overview and an example that demonstrates tie breaking on a sample
-program. Table\ref{table:scoring_examples} shows a sample program that implement
+program. 
Table \ref{table:scoring_examples} shows a sample program that implements two functions with some conditional logic and simple mathematical operations. It also contains three columns with each scoring approach. Starting with the cyclomatic scores, one can see that all statements in a function contain the same score. diff --git a/chapters/ch04_experiments.tex b/chapters/ch04_experiments.tex index a7d480d..616866d 100755 --- a/chapters/ch04_experiments.tex +++ b/chapters/ch04_experiments.tex @@ -14,7 +14,7 @@ \subsection{Approach Overview} In order to evaluate AFLuent, several prerequisites are needed that enable collecting data for analysis. The primary requirement is a collection of Python -programs, which are susceptible of becoming faulty. Additionally, this +programs, which are susceptible to becoming faulty. Additionally, this collection's complexity must be comparable to code typically written by novice developers, which AFLuent targets. Another crucial requirement before evaluation can begin is a test suite for the selected code python code. The test suite must @@ -40,8 +40,8 @@ \subsection{Research Questions} \label{subsec:research_questions_eval} Before discussing the evaluation process, it's important to clearly state the -questions to answer. Previously, Section\ref{sec:researchq} brought up few -research question concerning the implementation and evaluation of AFL in Python. +questions to answer. Previously, Section \ref{sec:researchq} brought up a few +research questions concerning the implementation and evaluation of AFL in Python. Related Work and Methods section addressed \hyperref[para:RQ1.1]{\emph{RQ1.1}} as well as \hyperref[para:RQ2]{\emph{RQ2}}, however, the answer \hyperref[para:RQ1.2]{\emph{RQ1.2}} remains unclear. 
While \hyperref[para:RQ1.2]{\emph{RQ1.2}} generally involved the efficiency and accuracy of AFLuent, there was no mention
@@ -50,7 +50,7 @@ \subsection{Research Questions}
equation used, Tarantula, Ochiai, Ochiai2, and Dstar, there are four tie
breaking approaches. Each equation-tiebreaker pair will be evaluated on the same
dataset, where the resulting rankings will be used to produce a score to assess how
-close the produced ranking are to localizing the fault correctly. In addition to
+close the produced rankings are to localizing the fault correctly. In addition to
this score, the time taken to run each approach will be recorded to compare
their time overhead.
@@ -89,7 +89,7 @@ \subsubsection{Filtering Sample}
expedite generating a test suite using Pynguin, several projects that use
external packages were removed. These packages include \code{scikit-learn},
\code{Tensorflow}, \code{Matplotlib}, \code{Sympy}, and \code{PIL}.
- Generating data and creating unit tests for projects that using theses
+ Generating data and creating unit tests for projects that use these
packages can be difficult because they are time consuming or simply cannot
be tested due to their graphical output.
\item Functions that do not take input or return no results: some implemented
@@ -105,7 +105,7 @@ \subsubsection{Filtering Sample}
perform. Therefore, these functions were removed. In some instances where
the documentation explicitly stated the type of input for the function, type
hints were manually added to avoid the removal of the function. This was
- especially frequent in sorting function, in which the input was specified as
+ especially frequent in sorting functions, in which the input was specified as
integer values. 
\item Code snippets under \code{if \_\_name\_\_ == "\_\_main\_\_"}: In most
instances the code under this if statement either ran the doctest tests, or
@@ -136,9 +136,9 @@ \subsubsection{Filtering Sample}
Following the initial phase of filtering the codebase, general statistics and
observations were recorded for the remaining sample thus far.
Table\ref{table:remaining_projects} shows the name of the remaining project.
-Additionally, SLOCcount was used to calculate the number of non-comment line of
+Additionally, SLOCcount was used to calculate the number of non-comment lines of
code included in the sample. The output from the tool is shown in
-Fig\ref{fig:SLOCcount_phase1}. Overall, there was 12199 lines of code remaining
+Figure \ref{fig:SLOCcount_phase1}. Overall, there were 12199 lines of code remaining
in the sample prior to automatically generating tests using Pynguin.
\begin{figure}[!htb]
@@ -186,7 +186,7 @@ \subsubsection{Filtering Sample}
\subsubsection{Generating a Test Suite}
\label{subsubsec:generating_test_suite}
-Once initial filtering of the codebase was completed, Pynguin was ran to
+Once initial filtering of the codebase was completed, Pynguin was run to
generate tests. However, additional issues came up in this process that required
additional filtering to be done. The new content was filtered as follows:
\begin{itemize}
@@ -197,7 +197,7 @@ \subsubsection{Generating a Test Suite}
getting timed out. Finally, there were 249 modules where tests were
successfully generated. Modules with failed and timed out runs were removed
due to the lack of tests that cover them.
- \item Some generated tests were faulty and caused errors when ran, or
+ \item Some generated tests were faulty and caused errors when run, or
indeterminately failed making them flaky. Those tests were removed in some
instances or fixed when possible. 
\item Since AFLuent relies on a thorough test suite with high coverage in @@ -245,7 +245,7 @@ \subsubsection{Generating a Test Suite} \begin{center} \includegraphics[width=15.5cm]{cyclomatic_complexity.png} \caption{\label{fig:cyclomatic_complexity_of_sample} Cyclomatic - Complextiy of Sample} + Complexity of Sample} \end{center} \end{figure} @@ -254,7 +254,7 @@ \subsubsection{Generating a Test Suite} Figure\ref{fig:SLOCcount_phase2}. While the codebase was significantly reduced in size, the resulting test suite contains 1105 test cases with 99\% coverage. In order to get a better understanding of the remaining sample, additional data -was collected on the cyclomatic complexity the functions in the code. This +was collected on the cyclomatic complexity of the functions in the code. This information would give some ideas on the structure of the code regarding if statements, loops and other constructs. The cyclomatic complexity of the remaining 516 functions was calculated and plotted as shown on Figure @@ -279,7 +279,7 @@ \subsection{Data Collection} an automated approach to collect this data. Some existing tools such as \emph{mutmut} \cite{mutmut} already perform similar steps for mutation testing and it could be easily repurposed to generate the bugs/mutants and run -AFLuent after inserting a mutant into the codebase. Futhermore, \emph{mutmut} +AFLuent after inserting a mutant into the codebase. Furthermore, \emph{mutmut} supports a hook function that facilitates collecting results after each run and before the next mutant is applied. Lastly the test suite run command can be modified to run AFLuent in evaluation mode and collect fault localization data. @@ -290,7 +290,7 @@ \subsection{Data Collection} approach-tiebreaker combination. Note that a value of 3 was used for \code{*} in the DStar equation. 
\item Per-test coverage report - \item Timing report containing the time taken to run the test suite befor + \item Timing report containing the time taken to run the test suite before fault localization and the time to perform fault localization and get rankings. \item Information about the mutant such as the file and line number it was @@ -344,7 +344,7 @@ \subsection{Results} only include lines from that project. For example, when the \emph{EXAM} score is calculated for a fault in the \code{bit\_manipulation} project, the percentage of lines belonging to that project that a developer must read before -finding the fault is the score. The lower the exam score for an approach, to +finding the fault is the score. The lower the exam score for an approach, the more effective it is because the developer would have to analyze less lines before finding the fault. @@ -355,7 +355,7 @@ \subsection{Results} Additionally, heat maps are used to compare each pair to the other fifteen after completing statistical tests. -\subsubsection{Data Vizualization} +\subsubsection{Data Visualization} \label{subsubsec:data_vizualization} To compare suspiciousness score equations when the tie breaking approach is the @@ -422,7 +422,7 @@ \subsubsection{Data Vizualization} so. But, further statistical analysis is needed to come to that conclusion. When analyzing Figure \ref{fig:logical_tiebreak_boxplot}, there is a noticeable -visual difference to the previous plots. Specifically, the the median +visual difference to the previous plots. Specifically, the median \emph{EXAM} score for all approaches using logical tie breaking is smaller when compared to random and cyclomatic tie breaking. While this implies an improvement, it's still unclear if it's statistically significant. In addition @@ -433,7 +433,7 @@ \subsubsection{Data Vizualization} still appears to be the worst performing while DStar has the lower scores overall. 
-Lastly, \emph{EXAM} score while using enhanced tie breaking are shown in Figure
+Lastly, \emph{EXAM} scores while using enhanced tie breaking are shown in Figure
\ref{fig:enhanced_tiebreak_boxplot}. When comparing the results of this figure
to those in Figure \ref{fig:logical_tiebreak_boxplot}, the median values as well
as maximums appear to be higher. This indicates that enhanced tie breaking might
@@ -452,7 +452,7 @@ \subsubsection{Data Vizualization}
In addition to the box plots that showcase the medians, and the quartiles, it's
worth looking at the average exam score for each approach. Figure
\ref{fig:averages_barplot} plots the \emph{EXAM} score averages for all
-categories. This barplot further demonstrates the low performance of Tarantula,
+categories and lists the value of the mean at the top of each bar. This bar plot further demonstrates the low performance of Tarantula,
in which it has the highest averages out of all other equations. Based on the
averages in the plot, the best performing approach is DStar with logical tie
breaking with an average of 18.27\% \emph{EXAM} score. It's followed by Ochiai
@@ -492,8 +492,9 @@ \subsubsection{Statistical Tests: Mann-Whitney}
statistically different. However, it does not conclude which approach had the higher
or lower exam scores. Figure \ref{fig:two_sided_mw_test} plots a heat map of the resulting
\emph{p-value}s from the two sided Mann-Whitney test. It compares each
-approach to the fifteen remaining ones.
-
+approach to the fifteen remaining ones. \emph{P-value}s for each comparison are
+included in each square, where squares with a lower \emph{p-value} are
+darker than those with a larger value.
\begin{figure}[!htb]
\begin{center}
\includegraphics[width=15.5cm]{two_sided_mw_test.png}
\caption{\label{fig:two_sided_mw_test} Two Sided Mann Whitney Test Results}
\end{center}
\end{figure}
@@ -508,9 +509,9 @@ \subsubsection{Statistical Tests: Mann-Whitney}
show that these sets are different from all non-Tarantula approaches to a very
statistically significant level. 
Therefore, the null hypothesis is rejected -here. There are only few exceptions, where Tarantula Enhanced and Tarantula +here. There are only a few exceptions, where Tarantula Enhanced and Tarantula Logical are not different from the random and cyclomatic variants of Ochiai, -Ochiai2, and DStar, so the null hypotheses is accepted for these exceptions. +Ochiai2, and DStar, so the null hypothesis is accepted for these exceptions. These cases represent an interesting case where using the logical and enhanced tiebreakers was a significant enough improvement to bring Tarantula in line with the random and cyclomatic variants of other equations. @@ -530,7 +531,7 @@ \subsubsection{Statistical Tests: Mann-Whitney} level. Therefore, the null hypothesis is accepted for these combinations. While the results of this test are helpful to show if statistically significant -differences exists between the different approaches, it does not clarify which +differences exist between the different approaches, it does not clarify which one is better than the other. To have a better understanding of that relationship, a one sided \code{less} Mann-Whitney test is performed with the hypotheses are as follows: @@ -549,7 +550,7 @@ \subsubsection{Statistical Tests: Mann-Whitney} In contrast to the two-sided test, this one gives a more clear idea on which approaches has a lower \emph{EXAM} scores when compared to others. A \emph{p-value} less than or equal to 0.05 suggests that the approach on the x-axis is -significantly more effective than it's counterpart on the y-axis. On the other +significantly more effective than its counterpart on the y-axis. On the other hand, a \emph{p-value} greater than 0.95 indicates the opposite, where the approach on the y-axis is significantly more effective than the one on the x-axis. 
Figure \ref{fig:one_sided_mw_test} plots a heat map of the resulting \emph{p-value}s
@@ -573,13 +574,13 @@ \subsubsection{Statistical Tests: Mann-Whitney}
by others, however, it was less frequently.
As for other equations, the Random and Cyclomatic variants of Ochiai, Ochiai2,
-and DStar all had similar results between each other, where non of them
+and DStar all had similar results between each other, where none of them
outperformed another on a significant level. Therefore, for these approaches,
the null hypotheses is accepted. This result matches the outcome of the
two-sided test where these approaches were very similar.
Some notable large differences in performance can be seen in the logical variant
-of Ochiai, Ochiai2, and DStar. Theses three significantly outperformed every
+of Ochiai, Ochiai2, and DStar. These three significantly outperformed every
random and cyclomatic variant of all other equations. On the other hand, the
enhanced variant of these slightly outperformed the cyclomatic and random ones,
but not on a statistically significant degree. This result suggests that the
@@ -590,14 +591,14 @@ \subsubsection{Statistical Tests: Mann-Whitney}
One of the surprising outcomes of this data is the performance of enhanced tie
breaking. Despite the fact that it adds on the mutant density metric provided by
-logical, and considers wider possibility while generating it's score, it did not
+logical, and considers a wider range of possibilities while generating its score, it did not
outperform the logical tie breaker. In fact, the \emph{p-value} was leaning more
to the other outcome, but not in any significant way. 
\subsubsection{Statistical Tests: Cohen's D Effect Size}
\label{subsubsec:statistical_test_cohen}
-While the Mann-Whitney test checks if the distributions of the approaches is
+While the Mann-Whitney test checks if the distributions of the approaches are
different to others on a statistically significant level, it does not give an
idea of the magnitude of this difference. In order to get that information, the
non-parametric Cohen's D effect size is calculated for each pair of approaches
@@ -624,8 +625,8 @@ \subsubsection{Statistical Tests: Cohen's D Effect Size}
Ochiai2 logical with DStar logical, the values 0.0013 and 0.0021 respectively
show that DStar had an improvement over the other two. This improvement is very
small and, as established in the previous test, not statistically significant.
-As for the comparison between Ochiai logical and Ochiai2 logical, and even
-smaller value indicate a very small difference in favor of Ochiai logical.
+As for the comparison between Ochiai logical and Ochiai2 logical, an even
+smaller value indicates a very small difference in favor of Ochiai logical.
Again, this difference is not statistically significant to reach a conclusion
of which is the better approach. Overall, the performed statistical tests
allowed the filtering of 16 different approaches to determine the top 3 ones that
@@ -639,7 +640,7 @@ \subsubsection{Time Efficiency}
execute tests without AFLuent, (II) Time to execute tests with AFLuent enabled,
(III) Time to locate faults and perform tie breaking using all
equation-tiebreaker combinations. Generally all these times were calculated when
-all 1105 test cases in the project's suite were ran.
+all 1105 test cases in the project's suite were run.
\begin{figure}[!htb]
\begin{center}
@@ -650,7 +651,7 @@ \subsubsection{Time Efficiency}
Figure \ref{fig:test_timings} compares the time taken to execute all test
functions with and without using AFLuent. 
The baseline includes
-4000 data points where the tests were ran without AFLuent. On the other hand
+4000 data points where the tests were run without AFLuent. On the other hand
7613 data points are plotted for when AFLuent generated a timing report. It's
important to mention that every point in the boxplot on the right represents the
combined time to run all equation-tiebreaker combinations after one another.
@@ -676,8 +677,8 @@
\end{center}
\end{figure}
-Another useful time metric for developers hoping to use AFLuent is the time the
-took takes to locate faults. And while
+Another useful time metric for developers hoping to use AFLuent is the time it
+took to locate faults. And while
comprehensive time data for each equation and tie breaking approach was not
collected, some conclusions can be drawn from the most time consuming case.
Figure \ref{fig:localization_timings} shows the time taken to generate a
@@ -685,8 +686,8 @@
suspiciousness scores using all equations and to retrieve and evaluate tie
breaking values using all approaches. In general, this is the worst possible
scenario for AFLuent localization time. While the median time is quite large,
-around 115 second, so is the test suite. For every run, AFLuent generates and parses
-abstract syntax trees using libcst and calculates cyclomatic complexity fo all
+around 115 seconds, so is the test suite. For every run, AFLuent generates and parses
+abstract syntax trees using LibCST and calculates cyclomatic complexity for all
files covered in the suite. In the case of this evaluation, this includes all
files in the codebase. Further discussion on how this time can be minimize is
found in Section \ref{sec:future_work}: Future Work. 
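For readers who want the metrics from this chapter in executable form, the sketch below implements three of the four suspiciousness equations (Tarantula, Ochiai, and DStar with \code{*} = 3, as used in this evaluation) together with the \emph{EXAM} score. The function names and signatures are illustrative only and do not reflect AFLuent's internal API; \code{ef} and \code{ep} denote the counts of failing and passing tests that cover a given line.

```python
import math

def tarantula(ef, ep, total_failed, total_passed):
    """Ratio of the line's failing-test coverage to its overall coverage."""
    fail_ratio = ef / total_failed if total_failed else 0.0
    pass_ratio = ep / total_passed if total_passed else 0.0
    denominator = fail_ratio + pass_ratio
    return fail_ratio / denominator if denominator else 0.0

def ochiai(ef, ep, total_failed):
    """Highest when only failing tests cover the line."""
    denominator = math.sqrt(total_failed * (ef + ep))
    return ef / denominator if denominator else 0.0

def dstar(ef, ep, total_failed, star=3):
    """DStar with the exponent set to 3, matching this evaluation."""
    denominator = ep + (total_failed - ef)
    return ef ** star / denominator if denominator else float("inf")

def exam_score(ranking, faulty_line, total_lines):
    """Percentage of the project's lines inspected before reaching the fault."""
    position = ranking.index(faulty_line) + 1
    return 100 * position / total_lines
```

As a quick sanity check, a line covered by every failing test and by no passing test receives the maximum Ochiai score of 1, while the \emph{EXAM} score simply reports the line's ranked position as a percentage of the project's lines.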
diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex
index 6fac222..5f5adff 100755
--- a/chapters/conclusion.tex
+++ b/chapters/conclusion.tex
@@ -15,7 +15,7 @@ \section{Summary of Results}
best and worst performances. One observation that stood out is the
significant underperformance of Tarantula when compared to all other
equations. Additionally, the statistically significant edge that logical tie
-breaking achieve to outperform it's random and cyclomatic counterparts was
+breaking achieves over its random and cyclomatic counterparts was
surprising. Overall, Ochiai, Ochiai2 and DStar had very strong performances
when the same tie breaker was used across all of them, however, the data did not
suggest that one outperformed the other significantly. Additional experiments may
@@ -58,16 +58,16 @@ \subsubsection{Effectiveness}
\subsubsection{Efficiency}
One of the concerning outcomes of this study is the long time AFLuent takes to
-produce fault localization output. Developer usually want fast and optimized
+produce fault localization output. Developers usually want fast and optimized
tools, which could render AFLuent unusable in the eyes of many. And while
AFLuent's reliance on several tools reduce the ability to control the time it
takes to run, there are some measures that could mitigate this problem.
Throughout manual testing of the tool, it was generally observed that the most
time consuming feature of AFLuent involves generating the tie breaker datasets.
-In oder for AFLuent to adequately understand the code under test, it must
-generate abstract syntax tree for every file covered in the test suite. Since
+In order for AFLuent to adequately understand the code under test, it must
+generate an abstract syntax tree for every file covered in the test suite. 
Since the debugging process usually involves making small changes at a time and -rerunning the tests, a time improvement is possible be caching all generated +rerunning the tests, a time improvement is possible by caching all generated syntax trees and only re-generating ones for files that have been edited since the last run. This solution will not reduce the runtime for the first time but it could have a significant effect on the runs that follow. Overall, this change @@ -104,7 +104,7 @@ \subsubsection{Evaluation} \item Students can be split into groups to complete different assignments \item Chosen assignments should allow for test driven development \end{itemize} - \item Allow some groups of students to use AFLuent and assess it's + \item Allow some groups of students to use AFLuent and assess its effectiveness in helping them locate faults \item Collect direct feedback regarding the students' experience while using AFLuent \end{enumerate} @@ -131,7 +131,7 @@ \section{Ethical Implications} negatively impact their development skills by taking away the experience of manually analyzing the code and understanding its expected behavior and why failures occur. This could be especially harmful if AFLuent was overused in -educational environments since students could utilize it's functionality without +educational environments since students could utilize its functionality without fully understanding how to fix the code themselves, and thus negatively impact their learning process. 
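The caching improvement proposed in the conclusion can be sketched in a few lines. This is a minimal illustration using the standard-library \code{ast} module with a cache keyed on a content hash; AFLuent itself parses with LibCST, and the cache layout here is an assumption for demonstration rather than a description of the tool.

```python
import ast
import hashlib
from pathlib import Path

# Hypothetical cache: maps a file path to (content digest, parsed tree).
_tree_cache = {}

def parse_with_cache(path):
    """Re-parse a file only when its contents changed since the previous run."""
    source = Path(path).read_text()
    digest = hashlib.sha256(source.encode()).hexdigest()
    cached = _tree_cache.get(path)
    if cached is not None and cached[0] == digest:
        return cached[1]  # file unchanged: reuse the stored syntax tree
    tree = ast.parse(source)
    _tree_cache[path] = (digest, tree)
    return tree
```

Hashing the contents rather than trusting modification times keeps the cache correct even when files are restored or touched without edits, at the cost of reading every file on each run.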
diff --git a/preamble/bibliography.bib b/preamble/bibliography.bib
index d05aaa0..ce47482 100755
--- a/preamble/bibliography.bib
+++ b/preamble/bibliography.bib
@@ -259,4 +259,21 @@ @inproceedings{Parsai_2020
author = {Ali Parsai and Serge Demeyer},
title = {Mutant Density},
booktitle = {Proceedings of the {IEEE}/{ACM} 42nd International Conference on Software Engineering Workshops}
+}
+
+@inproceedings{parnin,
+author = {Parnin, Chris and Orso, Alessandro},
+title = {Are Automated Debugging Techniques Actually Helping Programmers?},
+year = {2011},
+isbn = {9781450305624},
+publisher = {Association for Computing Machinery},
+address = {New York, NY, USA},
+url = {https://doi.org/10.1145/2001420.2001445},
+doi = {10.1145/2001420.2001445},
+booktitle = {Proceedings of the 2011 International Symposium on Software Testing and Analysis},
+pages = {199--209},
+numpages = {11},
+keywords = {user studies, statistical debugging},
+location = {Toronto, Ontario, Canada},
+series = {ISSTA '11}
}
\ No newline at end of file
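For reference, the one-sided \code{less} Mann-Whitney comparison used in the experiments chapter can also be sketched in plain Python. This version uses midranks and a normal approximation without tie correction, purely for illustration; the thesis evaluation relied on a statistics library rather than code like this.

```python
import math

def _midranks(values):
    """Assign 1-based ranks, giving tied observations the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    return ranks

def mann_whitney_less(x, y):
    """Approximate p-value for H1: values in x tend to be smaller than values in y."""
    n1, n2 = len(x), len(y)
    ranks = _midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u_statistic = rank_sum_x - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_statistic - mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z
```

A resulting \emph{p-value} at or below 0.05 supports the claim that the first sample's \emph{EXAM} scores are systematically lower, matching the interpretation applied to the one-sided heat map in the experiments chapter.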