
parse_data: Fix compatibility with R 4.4 #588

Merged: 3 commits into r-lib:main on Nov 19, 2024

Conversation

@kyleam (Contributor) commented Nov 15, 2024

This series resolves two issues with calculating package coverage under R 4.4: a substantial slowdown and an error when applying exclusions (#576, #579).

In addition, there are cases demonstrating both of these issues at https://github.com/kyleam/covr-r44-issues.

Here's a summary of timings (seconds, wall time) to give a sense of how severe the slowdown can be:

package       R 4.3.3   R 4.4.2   R 4.4.2 + patch
data.table        254      2290               235
diffobj           231      1126               151
R.utils           759     22197               599

("patch" refers to a run with covr patched with the equivalent of the changes proposed here.)


Both the slowdown and the error bisect to R's commit 990fe3d8a1 (getParseData() - for first-file objects ..., 2023-06-14), and the associated bug report was actually connected to a covr PR (gh-154).

As discussed in the third commit's message, I think covr is getting tripped up by the new getParseData returning parse data for the entire package.

Example:

Under R 4.4.1, getParseData returns the same result for nanoduration and nanoival, which are defined in separate files. The parse data appears to cover all files of the package.

d1 <- getParseData(nanotime::nanoduration)
d2 <- getParseData(nanotime::nanoival)

basename(attr(d1, "srcfile")[["filename"]])
#> [1] "nanoduration.R"
basename(attr(d2, "srcfile")[["filename"]])
#> [1] "nanoival.R"

attr(d1, "srcfile") <- NULL
attr(d2, "srcfile") <- NULL

identical(d1, d2)
#> [1] TRUE

I haven't worked out the details of how that leads to the NA values that trigger the exclusions error.

My proposed solution is to reorder the utils::getParseData and covr:::get_parse_data calls so that get_parse_data is used for R 4.4 or greater, just as it is for R 4.3 or below. As a result, the get_tokens call returns the expected per-file parse data.

The utils::getParseData call is retained as the fallback. That's needed when not calculating package coverage (e.g., when calling file_coverage).
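
In sketch form, the reordering looks roughly like this (the helper name, signature, and structure below are illustrative only, not the exact covr internals):

get_tokens_sketch <- function(x) {
  # Prefer covr's per-file parser; it was already the primary path for
  # package coverage before R 4.4.0.
  pd <- covr:::get_parse_data(x)
  if (is.null(pd)) {
    # Fall back to utils::getParseData for the non-package entry points,
    # e.g. file_coverage().
    pd <- utils::getParseData(x)
  }
  pd
}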

Putting get_parse_data in front of the getParseData call exposes two issues in split_on_line_directives. Those are resolved in the first two commits.

Closes #576
Closes #579

cc: @dgkf @jimhester @krlmlr @mcol @MichaelChirico

@CLAassistant commented Nov 15, 2024

CLA assistant check
All committers have signed the CLA.

@MichaelChirico (Contributor)

I don't know the internals well enough to comment further than: looks promising!!! thank you 🙌

@jimhester (Member)

Your investigation makes sense, thanks. Do you think you could add a test for the split_on_line_directives cases that are fixed with this PR? Otherwise this looks good to me; appreciate the work!

get_parse_data extracts lines from the input srcfile object and feeds
them to split_on_line_directives, which expects the lines to be a
concatenation of all the package R files, separated by #line
directives.

With how get_parse_data is currently called, that expectation is met.
get_parse_data is called only if utils::getParseData returns NULL, and
getParseData doesn't return NULL for any of the cases where the input
does _not_ have line directives (i.e. entry points other than
package_coverage).

An upcoming commit is going to move the get_parse_data call in front
of the getParseData call, so update split_on_line_directives to detect
the "no directives" case.

Without this guard, the mapply call in split_on_line_directives would
error under R versions before 4.2; with R 4.2 or later,
split_on_line_directives returns an empty result.
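
The guard itself is small; here is a hedged sketch (the real helper's matching logic and return convention may differ):

split_on_line_directives_sketch <- function(lines) {
  directive_idx <- grep("^#line ", lines)
  if (length(directive_idx) == 0) {
    # No #line directives: the input is a plain source file rather than the
    # concatenated-package form, so there is nothing to split here.
    return(NULL)
  }
  # ... split lines at each directive and name each piece by its file ...
}
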
split_on_line_directives breaks the input at #line directives and
returns a named list of lines for each file.

For a package with a single file under R/, there is one directive.
The bounds calculation is still correct for that case.  However, the
return value is incorrectly a matrix rather than a list because the
mapply call simplifies the result.

At this point, this bug is mostly [*] unexposed because this code path
is only triggered if utils::getParseData returns NULL, and it should
always return a non-NULL result for the single-file package case.  The
next commit will reorder things, exposing the bug.

Tell mapply to not simplify the result.

[*] The simplification to a matrix could also happen for multi-file
    packages in the unlikely event that all files have the same number
    of lines.
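
A standalone illustration of the pitfall (plain R, not covr code):

# A single result of length > 1 is simplified to a matrix ...
mapply(seq, 1, 3)
# ... and so are multiple results that happen to have equal lengths.
mapply(seq, c(1, 11), c(3, 13))
# SIMPLIFY = FALSE keeps the list shape that downstream code expects.
mapply(seq, 1, 3, SIMPLIFY = FALSE)
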
utils::getParseData has a longstanding bug: for an installed package,
parse data is available only for the last file [1].  To work around
that, the get_tokens helper first calls getParseData and then falls
back to custom logic that extracts the concatenated source lines,
splits them on #line directives, and calls getParseData on each file's
lines.
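
Roughly, the per-file step of that fallback amounts to re-parsing each file's lines (an illustrative sketch; split_lines stands in for the output of split_on_line_directives, and the real code also carries srcfile details and caching):

per_file_parse_data <- lapply(split_lines, function(file_lines) {
  # Re-parse this file's lines with source references attached so that
  # utils::getParseData can report tokens for just this file.
  exprs <- parse(text = file_lines, keep.source = TRUE)
  utils::getParseData(exprs)
})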

The getParseData bug was fixed in R 4.4.0 (r84538).  Unfortunately
that change causes at least two issues (for some subset of packages):
a substantial performance regression [2] and an error when applying
exclusions [3].

Under R 4.4, getParseData always returns non-NULL as a result of that
change when calculating package coverage (in other words, the
get_parse_data fallback is _not_ triggered).  The slowdown is
partially due to the parse data no longer being cached across
get_tokens calls.  Another relevant aspect, for both the slowdown and
the error applying exclusions, is likely that the new getParseData
returns data for the entire package rather than the per-file parse
data the downstream covr code expects.

One solution would be to adapt covr's caching and handling of the
getParseData result when running under R 4.4.0 or later.  Instead, go
with a simpler and more minimal fix: reorder the calls so that the
get_parse_data call, which we know has been the primary code path for
package coverage before R 4.4.0, is the first one tried.  Leave
getParseData as the fallback to handle the non-package coverage cases.

[1] r-lib#154
    https://bugs.r-project.org/show_bug.cgi?id=16756

[2] As an extreme case, calling package_coverage on R.utils goes from
    under 15 minutes to over 6 hours.

[3] nanotime (v0.3.10) and diffobj (v0.3.5) are two examples of
    packages that hit this error.

Closes r-lib#576
Closes r-lib#579
Re: r-lib#567

@kyleam (Contributor, Author) commented Nov 15, 2024

@jimhester thanks for taking a look!

do you think you could add a test for the split_on_line_directives cases that are fixed with this PR?

Sure, I've pushed an update. (The issues are triggered by existing tests, but I agree that a direct/explicit test makes sense.)

@mcol (Contributor) commented Nov 15, 2024

I've tried the patch and it brings the time for running coverage on Luminescence back to 7 minutes, which is roughly what we get on 4.3.3 (as opposed to 80 minutes with the CRAN version under 4.4.1). 🎉

@jimhester (Member)

Thanks again for the work and thank you @mcol for reporting on the improved runtime.

jimhester merged commit c27deac into r-lib:main on Nov 19, 2024
12 checks passed