
parse_data: Fix compatibility with R 4.4 #588

Merged: 3 commits into r-lib:main on Nov 19, 2024

Conversation

@kyleam (Contributor) commented Nov 15, 2024

This series resolves two issues with calculating package coverage under R 4.4: a substantial slowdown and an error when applying exclusions (#576, #579).

In addition, there are cases demonstrating both of these issues at https://github.com/kyleam/covr-r44-issues.

Here's a summary of timings (seconds, wall time) to give a sense of how severe the slowdown can be:

package       R 4.3.3   R 4.4.2   R 4.4.2 + patch
data.table        254      2290               235
diffobj           231      1126               151
R.utils           759     22197               599

("patch" refers to a run with covr patched with the equivalent of the changes proposed here.)


Both the slowdown and the error bisect to R's commit 990fe3d8a1 (getParseData() - for first-file objects ..., 2023-06-14), and the associated bug report was actually connected to a covr PR (gh-154).

As discussed in the third commit's message, I think covr is getting tripped up by the new getParseData returning parse data for the entire package.

Example:

Under R 4.4.1, getParseData returns the same result for nanoduration and nanoival, which are defined in separate files. The parse data appears to cover all files of the package.

d1 <- getParseData(nanotime::nanoduration)
d2 <- getParseData(nanotime::nanoival)

basename(attr(d1, "srcfile")[["filename"]])
#> [1] "nanoduration.R"
basename(attr(d2, "srcfile")[["filename"]])
#> [1] "nanoival.R"

attr(d1, "srcfile") <- NULL
attr(d2, "srcfile") <- NULL

identical(d1, d2)
#> [1] TRUE

I haven't worked out the details of how that leads to the NA values that trigger the exclusions error.

My proposed solution is to reorder the utils::getParseData and covr:::get_parse_data calls so that get_parse_data is used for R 4.4 or greater, just as it is for R 4.3 or below. As a result, the get_tokens call returns the expected per-file parse data.

The utils::getParseData call is retained as the fallback. That's needed when not calculating package coverage (e.g., when calling file_coverage).
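
In sketch form, the reordering looks roughly like this (the helper name, signature, and structure below are illustrative only, not the exact covr internals):

get_tokens_sketch <- function(x) {
  # Prefer covr's per-file parser; it was already the primary path for
  # package coverage before R 4.4.0.
  pd <- covr:::get_parse_data(x)
  if (is.null(pd)) {
    # Fall back to utils::getParseData for the non-package entry points,
    # e.g. file_coverage().
    pd <- utils::getParseData(x)
  }
  pd
}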

Putting get_parse_data in front of the getParseData call exposes two issues in split_on_line_directives. Those are resolved in the first two commits.

Closes #576
Closes #579

cc: @dgkf @jimhester @krlmlr @mcol @MichaelChirico

@CLAassistant commented Nov 15, 2024

CLA assistant check
All committers have signed the CLA.

@MichaelChirico (Contributor)

I don't know the internals well enough to comment further than: looks promising!!! thank you 🙌

@jimhester (Member)

Your investigation makes sense, thanks. Do you think you could add a test for the split_on_line_directives cases that are fixed with this PR? Otherwise this looks good to me; appreciate the work!

get_parse_data extracts lines from the input srcfile object and feeds
them to split_on_line_directives, which expects the lines to be a
concatenation of all the package R files, separated by #line
directives.

With how get_parse_data is currently called, that expectation is met.
get_parse_data is called only if utils::getParseData returns NULL, and
getParseData doesn't return NULL for any of the cases where the input
does _not_ have line directives (i.e. entry points other than
package_coverage).

An upcoming commit is going to move the get_parse_data call in front
of the getParseData call, so update split_on_line_directives to detect
the "no directives" case.

Without this guard, the mapply call in split_on_line_directives would
error under R versions before 4.2; with R 4.2 or later,
split_on_line_directives returns an empty result.
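
The guard itself is small; here is a hedged sketch (the real helper's matching logic and return convention may differ):

split_on_line_directives_sketch <- function(lines) {
  directive_idx <- grep("^#line ", lines)
  if (length(directive_idx) == 0) {
    # No #line directives: the input is a plain source file rather than the
    # concatenated-package form, so there is nothing to split here.
    return(NULL)
  }
  # ... split lines at each directive and name each piece by its file ...
}
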
split_on_line_directives breaks the input at #line directives and
returns a named list of lines for each file.

For a package with a single file under R/, there is one directive.
The bounds calculation is still correct for that case.  However, the
return value is incorrectly a matrix rather than a list because the
mapply call simplifies the result.

At this point, this bug is mostly [*] unexposed because this code path
is only triggered if utils::getParseData returns NULL, and it should
always return a non-NULL result for the single-file package case.  The
next commit will reorder things, exposing the bug.

Tell mapply to not simplify the result.

[*] The simplification to a matrix could also happen for multi-file
    packages in the unlikely event that all files have the same number
    of lines.
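
A standalone illustration of the pitfall (plain R, not covr code):

# A single result of length > 1 is simplified to a matrix ...
mapply(seq, 1, 3)
# ... and so are multiple results that happen to have equal lengths.
mapply(seq, c(1, 11), c(3, 13))
# SIMPLIFY = FALSE keeps the list shape that downstream code expects.
mapply(seq, 1, 3, SIMPLIFY = FALSE)
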
utils::getParseData has a longstanding bug: for an installed package,
parse data is available only for the last file [1].  To work around
that, the get_tokens helper first calls getParseData and then falls
back to custom logic that extracts the concatenated source lines,
splits them on #line directives, and calls getParseData on each file's
lines.
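
Roughly, the per-file step of that fallback amounts to re-parsing each file's lines (an illustrative sketch; split_lines stands in for the output of split_on_line_directives, and the real code also carries srcfile details and caching):

per_file_parse_data <- lapply(split_lines, function(file_lines) {
  # Re-parse this file's lines with source references attached so that
  # utils::getParseData can report tokens for just this file.
  exprs <- parse(text = file_lines, keep.source = TRUE)
  utils::getParseData(exprs)
})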

The getParseData bug was fixed in R 4.4.0 (r84538).  Unfortunately
that change causes at least two issues (for some subset of packages):
a substantial performance regression [2] and an error when applying
exclusions [3].

Under R 4.4, getParseData always returns non-NULL as a result of that
change when calculating package coverage (in other words, the
get_parse_data fallback is _not_ triggered).  The slowdown is
partially due to the parse data no longer being cached across
get_tokens calls.  Another relevant aspect, for both the slowdown and
the error applying exclusions, is likely that the new getParseData
returns data for the entire package rather than the per-file parse
data the downstream covr code expects.

One solution would be to adapt covr's caching and handling of the
getParseData result when running under R 4.4.0 or later.  Instead, go
with a simpler and more minimal fix: reorder the calls so that the
get_parse_data call, which we know has been the primary code path for
package coverage before R 4.4.0, is the first one tried.  Leave
getParseData as the fallback to handle the non-package coverage cases.

[1] r-lib#154
    https://bugs.r-project.org/show_bug.cgi?id=16756

[2] As an extreme case, calling package_coverage on R.utils goes from
    under 15 minutes to over 6 hours.

[3] nanotime (v0.3.10) and diffobj (v0.3.5) are two examples of
    packages that hit this error.

Closes r-lib#576
Closes r-lib#579
Re: r-lib#567

@kyleam (Contributor, Author) commented Nov 15, 2024

@jimhester thanks for taking a look!

do you think you could add a test for the split_on_line_directives cases that are fixed with this PR?

Sure, I've pushed an update. (The issues are triggered by existing tests, but I agree that a direct/explicit test makes sense.)

@mcol (Contributor) commented Nov 15, 2024

I've tried the patch and it brings the time for running coverage on Luminescence back to 7 minutes, which is roughly what we get on 4.3.3 (as opposed to 80 minutes with the CRAN version under 4.4.1). 🎉

@jimhester (Member)

Thanks again for the work and thank you @mcol for reporting on the improved runtime.

jimhester merged commit c27deac into r-lib:main on Nov 19, 2024
12 checks passed