Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ScanCodeSummarizer #999

Closed

Conversation

qtomlinson
Copy link
Collaborator

@qtomlinson qtomlinson commented Jul 17, 2023

Fix a few cases where Scancode results provide license information.

  1. In addition to content.summary.packages[0].declared_license, check content.summary.packages[0].license_expression
  2. Concatenate the license only when its score >= 80, to be consistent with summarizing license info for files.
  3. Add META/INF directory as license location for maven packages.

See commit messages for detailed explanation and test cases

@qtomlinson
Copy link
Collaborator Author

@mpcen ready for review

@qtomlinson qtomlinson marked this pull request as ready for review September 19, 2023 21:39
In scancode, packages[0].license_expression contains license information.
This later becomes 'declared_license_expression'.
Details see aboutcode-org/scancode-toolkit@ab677c6#diff-47cc909d82dee95ebbb1a3d3a8ed519ae75684072c8f4867b90056d66863f964

Based on documentation, 'declared_license_expression' is the 'primary
license expression as determined from the declaration(s) of the authors
of the package'.
See see https://www.nexb.com/scancode-license-clarity-scoring/

When the existing logic fails to normalize, try to derive
license information from packages[0].license_expression.
In some cases, CD (scancode summarizer) goes through the root files to
derive license information.  file.licenses is an array of licenses with
varying matching scores.  Currently, after the license file is found,
we concat all the license expressions as declared license (even though
some of them have low matching scores).  However, when the licenses are
tallied from the files, those licenses with score lower than 80 are
filtered out.  This inconsistency causes the declared license does not
show up in the discovered license (>80 score) in some cases.

Refactor the logic, so that only license score above 80 is considered
for declared license.

Test case:
https://clearlydefined.io/definitions/composer/packagist/colinmollenhour/cache-backend-redis/1.14.4
In one of the workflow, scancode summarizer goes through the root files
to look for license.  The filtering for root files is based on case
sensitive file path matching.  In maven packaging, META/INF directory
is used often.  Add this to the license location to search.

Test case:
https://clearlydefined.io/definitions/maven/mavencentral/org.flywaydb/flyway-core/9.20.0
https://dev.clearlydefined.io/definitions/maven/mavencentral/org.flywaydb/flyway-core/7.7.2

Task: clearlydefined#846
@qtomlinson
Copy link
Collaborator Author

@mpcen @jeffwilcox rebased and ready for review. Any feedback would be very much appreciated!

@@ -178,6 +189,10 @@ class ScanCodeSummarizer {
.filter(e => e)
}

_isMatchedLicense(license) {
return license.score >= 80
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this threshold documented/defined for the community understanding - is that a scancode native value that means "definitely the match"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[I read the more detailed commit messages]

Are there any situations where this >= 80 threshold might produce an inaccurate resulting license that we could point to a risk on this?

Copy link
Collaborator Author

@qtomlinson qtomlinson Oct 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback!
The exiting logic in scancode summarizer includes licenses with lower scores in the declared license, but lists only licenses scoring above 80 for the files. For example, the declared licenses for https://clearlydefined.io/definitions/composer/packagist/colinmollenhour/cache-backend-redis/1.14.4 are BSD-3-Clause AND GPL-2.0-only. However discovered license is only BSD-3-Clause (GPL-2.0-only is filtered out due to lower score). This inconsistency causes confusion. My attempt here is to make declared license and discovered license consistent and remove licenses scoring less than 80 from the declared license.

As to what is the appropriate score threshold to avoid risk of incorrect license:

  • in scancode data with versions prior to 3.0.0, licenses with score >= 90 are considered as declared license (_getLicenseByFileName)
  • Since version 3.0.0 of scancode, is_license_text flag from scancode is used. The “is_license_text” flag is set to true for files that contain mostly license texts and notices. Based on documentation, "mostly means over 90% of the content of the file."
    According to this documentation (prior to v32 update, 30.1.0 is currently used in the crawler), the license objects for each detected license in the files has "is_license_text" attribute as well. I will adapt my solution to use the is_license_text flag on the detected license instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon further research, there are four flags determined at each license match level: "is_license_text", "is_license_notice", "is_license_tag", "is_license_reference". This is also the order of precedence. Details see https://scancode-results-analyzer-fork-ayan.readthedocs.io/en/latest/how-analysis-is-performed/cases-incorrect-scans.html. is_license_text alone is not enough to determine whether it is the correct license match. Also see more discussion at aboutcode-org/scancode-toolkit#3082

I will refactor and reuse the logic for scancode prior to 3.0.0 to determine declared license by scoring >= 90. This way is more consistent within CD's implementation. @jeffwilcox What do you think?

Copy link
Collaborator Author

@qtomlinson qtomlinson Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any situations where this >= 80 threshold might produce an inaccurate resulting license that we could point to a risk on this?

@jeffwilcox Without any fixes, there is currently a risk of inaccurate resulting license. Details see issue. Scancode file licenses (as discovered) are always filtered based on score (>= 80). For declared licenses, there is a filtering based on score (>=90) in place for ScanCode data with versions prior to 3.0.0. This filtering is missing for later versioned ScanCode data, resulting license matched with lower scores showing up in "declared", but filtered out in "discovered". Including the score filtering (>=90) will ensure more accurate license information, and also consistency in handling different versions of ScanCode data.

@qtomlinson
Copy link
Collaborator Author

@elrayle for review

@pombredanne
Copy link
Member

Feel free to pull me and @AyanSinhaMahapatra in. We maintain ScanCode.
Note that filtering based on license score is likely to miss several detected licenses.

@AyanSinhaMahapatra
Copy link

where is this threshold documented/defined for the community understanding - is that a scancode native value that means "definitely the match"?

See https://scancode-toolkit.readthedocs.io/en/stable/reference/license-detection-reference.html for reference, we now have LicenseDetection objects (with tons of upgrades in license detection heuristics) which have a detection_log optionally which has info on license match accuracy. See https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/detection.py#L98 for possible values, there is some docs/docstrings but scattered, I'll improve on the docs here too. Note that this is similar to https://scancode-results-analyzer-fork-ayan.readthedocs.io/en/latest/how-analysis-is-performed/selecting-incorrect-unique.html#file-regions-with-incorrect-scans but has many improvements, as this was a prototype, which was improved a lot before being merged in scancode proper.

Also note that we do much better package and license summarization on a codebase-level https://scancode-toolkit.readthedocs.io/en/stable/cli-reference/scan-options-post.html#summary-option based on legalese/key files which would be useful also.

@qtomlinson
Copy link
Collaborator Author

qtomlinson commented Nov 7, 2023

@pombredanne @AyanSinhaMahapatra It is great to have feedback from the experts! Thanks for the insight and pointers on the new ScanCode data format. The references are extremely helpful for our ScanCode upgrade. I have included the related comments in the upgrade issue.

The current issue at hand deals with ScanCode output data from version 30.1.0. How to make declared license consistent with discovered licenses (issue)?

Options:

  1. rescan the package when ScanCode is upgraded, or
  2. see whether the harvested data has enough information where a license can be correctly derived without triggering a re-harvest.
    Option2 is the cheaper one, and is being explored here.

With the introduction of is_license_text on the file level, file.is_license_text is used in CD to detect a license file. When there are multiple license matches for a license file, historically, licenses[i].score has been used as a filter (>=80). For example, for colinmollenhour-Cm_Cache_Backend_Redis-6661e3f/LICENSE, there are two license matches:
- BSD-3-Clause (score 95.71) and
- GPL 2.0 (score 11.72).
After filtering based on score, only BSD-3-Clause is displayed as the license for the file, and later in discovered licenses, in clearlydefined UI,

For the declared license, the filtering based on score is missing for recent versions of ScanCode data, resulting both BSD-3-Clause and GPL 2.0 are declared but only BSD-3-Clause is discovered. This is confusing to end users.

Without resorting to filtering based on a score threshold,

  • using licenses[i].matched_rule[j].is_license_text was also explored. It is found that in some cases, when licenseFile.is_license_text === true, licenses[i].matched_rule[j].is_license_text can be false for all matched licenses (is_license_notice matched, see test/fixtures/scancode/3.0.2/commons-clause.json).
  • using licenses[i].matched_rule[j].is_license_notice is not good either. For colinmollenhour-Cm_Cache_Backend_Redis-6661e3f/LICENSE, GPL 2.0 is a match by is_license_notice === true with score of 11.72.

It seems that score is a consistent measure of license match in data versioned <= 30.1.0. That is why filtering based on score >=90 (existing filter value for older versioned ScanCode data) for declared license is proposed, with filtering score >=80 as discovered license already in place.

Any alternatives?

@qtomlinson
Copy link
Collaborator Author

qtomlinson commented Nov 8, 2023

Note that filtering based on license score is likely to miss several detected licenses.

Yes. Filtering (for consistency) does increase the likelihood of NOASSERTION here. Those cases need curation or re-harvested using the updated ScanCode. Adjusting filtering threshold can be one way to reduce those cases. Filter for "discovered" license is set to be 80. What is a sensible threshold for "declared"? 90 is used for older versioned ScanCode data (prior to 3.0.0).

@qtomlinson qtomlinson changed the title Fix NoAssertion license cases Fix ScanCodeSummarizer Nov 8, 2023
if (!declaredLicense || declaredLicense === 'NOASSERTION') {
declaredLicense = this._getDeclaredLicense(scancodeVersion, harvested, coordinates)
if (!isDeclaredLicense(declaredLicense)) {
declaredLicense = this._readLicenseExpression(harvested) || declaredLicense
Copy link
Collaborator Author

@qtomlinson qtomlinson Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest ScanCode format, "the codebase level packages has been also updated: license_expression -> declared_license_expression, also with it’s SPDX version, declared_license -> extracted_license_statement" from doc. Here in CD with 30.1.0 ScanCode data, declared_license is used to parse package level license information. In addition, package level license_expression (->declared_license_expression in the new format) can be utilized to provide license information. @pombredanne @AyanSinhaMahapatra Let me know if you have any feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants