-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ScanCodeSummarizer #999
Conversation
@mpcen ready for review |
In scancode, packages[0].license_expression contains license information. This later becomes 'declared_license_expression'. Details see aboutcode-org/scancode-toolkit@ab677c6#diff-47cc909d82dee95ebbb1a3d3a8ed519ae75684072c8f4867b90056d66863f964 Based on documentation, 'declared_license_expression' is the 'primary license expression as determined from the declaration(s) of the authors of the package'. See see https://www.nexb.com/scancode-license-clarity-scoring/ When the existing logic fails to normalize, try to derive license information from packages[0].license_expression.
In some cases, CD (scancode summarizer) goes through the root files to derive license information. file.licenses is an array of licenses with varying matching scores. Currently, after the license file is found, we concat all the license expressions as declared license (even though some of them have low matching scores). However, when the licenses are tallied from the files, those licenses with score lower than 80 are filtered out. This inconsistency causes the declared license does not show up in the discovered license (>80 score) in some cases. Refactor the logic, so that only license score above 80 is considered for declared license. Test case: https://clearlydefined.io/definitions/composer/packagist/colinmollenhour/cache-backend-redis/1.14.4
In one of the workflow, scancode summarizer goes through the root files to look for license. The filtering for root files is based on case sensitive file path matching. In maven packaging, META/INF directory is used often. Add this to the license location to search. Test case: https://clearlydefined.io/definitions/maven/mavencentral/org.flywaydb/flyway-core/9.20.0 https://dev.clearlydefined.io/definitions/maven/mavencentral/org.flywaydb/flyway-core/7.7.2 Task: clearlydefined#846
e1e549f
to
5a68d1f
Compare
@mpcen @jeffwilcox rebased and ready for review. Any feedback would be very much appreciated! |
providers/summary/scancode.js
Outdated
@@ -178,6 +189,10 @@ class ScanCodeSummarizer { | |||
.filter(e => e) | |||
} | |||
|
|||
_isMatchedLicense(license) { | |||
return license.score >= 80 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this threshold documented/defined for the community understanding - is that a scancode native value that means "definitely the match"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[I read the more detailed commit messages]
Are there any situations where this >= 80 threshold might produce an inaccurate resulting license that we could point to a risk on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback!
The exiting logic in scancode summarizer includes licenses with lower scores in the declared license, but lists only licenses scoring above 80 for the files. For example, the declared licenses for https://clearlydefined.io/definitions/composer/packagist/colinmollenhour/cache-backend-redis/1.14.4 are BSD-3-Clause AND GPL-2.0-only. However discovered license is only BSD-3-Clause (GPL-2.0-only is filtered out due to lower score). This inconsistency causes confusion. My attempt here is to make declared license and discovered license consistent and remove licenses scoring less than 80 from the declared license.
As to what is the appropriate score threshold to avoid risk of incorrect license:
- in scancode data with versions prior to 3.0.0, licenses with score >= 90 are considered as declared license (
_getLicenseByFileName
) - Since version 3.0.0 of scancode, is_license_text flag from scancode is used. The “is_license_text” flag is set to true for files that contain mostly license texts and notices. Based on documentation, "mostly means over 90% of the content of the file."
According to this documentation (prior to v32 update, 30.1.0 is currently used in the crawler), the license objects for each detected license in the files has "is_license_text" attribute as well. I will adapt my solution to use the is_license_text flag on the detected license instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further research, there are four flags determined at each license match level: "is_license_text", "is_license_notice", "is_license_tag", "is_license_reference". This is also the order of precedence. Details see https://scancode-results-analyzer-fork-ayan.readthedocs.io/en/latest/how-analysis-is-performed/cases-incorrect-scans.html. is_license_text alone is not enough to determine whether it is the correct license match. Also see more discussion at aboutcode-org/scancode-toolkit#3082
I will refactor and reuse the logic for scancode prior to 3.0.0 to determine declared license by scoring >= 90. This way is more consistent within CD's implementation. @jeffwilcox What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any situations where this >= 80 threshold might produce an inaccurate resulting license that we could point to a risk on this?
@jeffwilcox Without any fixes, there is currently a risk of inaccurate resulting license. Details see issue. Scancode file licenses (as discovered) are always filtered based on score (>= 80). For declared licenses, there is a filtering based on score (>=90) in place for ScanCode data with versions prior to 3.0.0. This filtering is missing for later versioned ScanCode data, resulting license matched with lower scores showing up in "declared", but filtered out in "discovered". Including the score filtering (>=90) will ensure more accurate license information, and also consistency in handling different versions of ScanCode data.
@elrayle for review |
Feel free to pull me and @AyanSinhaMahapatra in. We maintain ScanCode. |
See https://scancode-toolkit.readthedocs.io/en/stable/reference/license-detection-reference.html for reference, we now have LicenseDetection objects (with tons of upgrades in license detection heuristics) which have a Also note that we do much better package and license summarization on a codebase-level https://scancode-toolkit.readthedocs.io/en/stable/cli-reference/scan-options-post.html#summary-option based on legalese/key files which would be useful also. |
@pombredanne @AyanSinhaMahapatra It is great to have feedback from the experts! Thanks for the insight and pointers on the new ScanCode data format. The references are extremely helpful for our ScanCode upgrade. I have included the related comments in the upgrade issue. The current issue at hand deals with ScanCode output data from version 30.1.0. How to make declared license consistent with discovered licenses (issue)? Options:
With the introduction of is_license_text on the file level, file.is_license_text is used in CD to detect a license file. When there are multiple license matches for a license file, historically, licenses[i].score has been used as a filter (>=80). For example, for colinmollenhour-Cm_Cache_Backend_Redis-6661e3f/LICENSE, there are two license matches: For the declared license, the filtering based on score is missing for recent versions of ScanCode data, resulting both BSD-3-Clause and GPL 2.0 are declared but only BSD-3-Clause is discovered. This is confusing to end users. Without resorting to filtering based on a score threshold,
It seems that score is a consistent measure of license match in data versioned <= 30.1.0. That is why filtering based on score >=90 (existing filter value for older versioned ScanCode data) for declared license is proposed, with filtering score >=80 as discovered license already in place. Any alternatives? |
Yes. Filtering (for consistency) does increase the likelihood of NOASSERTION here. Those cases need curation or re-harvested using the updated ScanCode. Adjusting filtering threshold can be one way to reduce those cases. Filter for "discovered" license is set to be 80. What is a sensible threshold for "declared"? 90 is used for older versioned ScanCode data (prior to 3.0.0). |
if (!declaredLicense || declaredLicense === 'NOASSERTION') { | ||
declaredLicense = this._getDeclaredLicense(scancodeVersion, harvested, coordinates) | ||
if (!isDeclaredLicense(declaredLicense)) { | ||
declaredLicense = this._readLicenseExpression(harvested) || declaredLicense |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the latest ScanCode format, "the codebase level packages has been also updated: license_expression -> declared_license_expression, also with it’s SPDX version, declared_license -> extracted_license_statement" from doc. Here in CD with 30.1.0 ScanCode data, declared_license is used to parse package level license information. In addition, package level license_expression (->declared_license_expression in the new format) can be utilized to provide license information. @pombredanne @AyanSinhaMahapatra Let me know if you have any feedback.
Fix a few cases where Scancode results provide license information.
See commit messages for detailed explanation and test cases