Update rules with required phrases automatically #3924

Open
wants to merge 82 commits into base: develop

Conversation

AyanSinhaMahapatra
Member

AyanSinhaMahapatra commented Sep 17, 2024

This is a continuation of #3254 with added required phrases in license rules after review and further manual curations. Also contains improvements in required phrase collection and marking.

Reference: #2637 #2878

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in a uniquely-named feature branch and have no merge conflicts 📁

Add a script which can add required phrases in already existing rules
automatically from required phrases already present in other rules and
license field names. This can be done one license expression at a time.

Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
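
The idea in this commit message can be illustrated with a minimal sketch. It is not the script added by this PR: the function name `add_required_phrase_markers` and the matching strategy (case-insensitive, whole-word, longest phrase first) are assumptions made for illustration. It takes a rule text and a list of phrases already known from other rules or license fields, and wraps each occurrence in `{{ }}` markers while leaving already-marked spans untouched.

```python
import re

# Matches an existing {{ ... }} required phrase span (kept as-is).
MARKED = re.compile(r"(\{\{.*?\}\})", re.DOTALL)


def add_required_phrase_markers(rule_text, phrases):
    """
    Return rule_text with each known phrase wrapped in {{ }} markers.
    Hypothetical helper: longer phrases are applied first and text that
    is already inside markers is never modified.
    """
    text = rule_text
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.split()
        if not words:
            continue
        # Whole-word match, tolerant of any whitespace between words.
        pattern = re.compile(
            r"(?<!\w)" + r"\s+".join(re.escape(w) for w in words) + r"(?!\w)",
            re.IGNORECASE,
        )
        # re.split with one capturing group puts marked spans at odd indexes.
        parts = MARKED.split(text)
        for i in range(0, len(parts), 2):
            parts[i] = pattern.sub(lambda m: "{{" + m.group(0) + "}}", parts[i])
        text = "".join(parts)
    return text
```

Applying longer phrases first keeps a short phrase such as "MIT" from splitting a longer one such as "MIT License" that covers the same words.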
AyanSinhaMahapatra changed the title from "Update rules with required phrases auto" to "Update rules with required phrases automatically" on Sep 17, 2024
AyanSinhaMahapatra force-pushed the update-rules-with-required-phrases-auto branch from ea221d4 to 518116d on September 18, 2024 13:43
@pombredanne
Member

I am pushing shortly a few updates:

  • decouple the creation of new rules from updating existing rules in a separate CLI
  • ensure we skip more rules in the whole process: any rule that cannot be matched approximately (not only tiny rules), and also false positives
  • ensure that no rule gets a required phrase addition that would break in the middle of a URL, email, or copyright. This is done by checking that no required phrase injection changes the set of ignorables of a rule, for instance by making a URL no longer a proper URL.
  • extend the flag for "skipping" the collection of required phrases so that it skips a rule from both required phrases collection AND injection. This allows handling exceptions more easily.
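
The ignorables check described in the list above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: the regular expressions, `ignorables()` and `injection_preserves_ignorables()` are hypothetical names, and only URLs and emails are covered (copyright statements are left out for brevity).

```python
import re

# Simplified detectors, assumed for this sketch. Braces end a URL so that
# a {{ or }} marker landing inside a URL visibly changes what is detected.
URL = re.compile(r"https?://[^\s<>\"'{}]+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def ignorables(text):
    """Return the set of URLs and emails detected in text."""
    return set(URL.findall(text)) | set(EMAIL.findall(text))


def injection_preserves_ignorables(original_text, marked_text):
    """
    Return True if adding {{ }} markers left the detected URLs and emails
    unchanged, that is, no marker was injected in the middle of one.
    """
    return ignorables(original_text) == ignorables(marked_text)


original = "See http://www.apache.org/licenses/LICENSE-2.0 for the Apache License"
broken = "See http://www.{{apache}}.org/licenses/LICENSE-2.0 for the Apache License"
safe = "See http://www.apache.org/licenses/LICENSE-2.0 for the {{Apache License}}"

print(injection_preserves_ignorables(original, broken))
print(injection_preserves_ignorables(original, safe))
```

Comparing the sets before and after injection avoids having to reason about marker positions directly: any marker that lands inside a URL or email changes what the detectors find.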

Do not damage rules with URLs

Signed-off-by: Philippe Ombredanne <[email protected]>
Ensure that the leading /usr is not broken with {{ required phrase }}
markers.

Signed-off-by: Philippe Ombredanne <[email protected]>
Ensure that /usr paths are not broken with {{ required phrase }}
markers.

Signed-off-by: Philippe Ombredanne <[email protected]>
Ensure that URLs are not broken with {{ required phrase }} markers.

Signed-off-by: Philippe Ombredanne <[email protected]>
This is code that belongs to required_phrase.py, not to tokenize.py

Signed-off-by: Philippe Ombredanne <[email protected]>
This creates many false positives.

Signed-off-by: Philippe Ombredanne <[email protected]>
This helps with required phrases handling, addition and generation

* Add new Rule.source attribute to track the "source" of a license rule
  like when adding a new required phrase to a rule
* Add new Rule.is_tiny computed attribute to track tiny, very small
  rules
* Add new Rule.is_approx_matchable property to tell apart rules that can
  only be matched exactly
* Add new Rule.is_generic for rules that contain "generic" licenses
* Support required_phrases-related fields in Rule.validate()
* Update index.py accordingly

Signed-off-by: Philippe Ombredanne <[email protected]>
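
As a rough illustration of how such attributes could drive the skipping logic discussed in this PR, here is a hedged sketch. `RuleSketch` is not the real `Rule` class: the token threshold, the `exact_only` and `skip_required_phrases` fields, and `can_receive_required_phrases()` are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical threshold: the real cutoff for a "tiny" rule is not given here.
TINY_RULE_TOKENS = 6


@dataclass
class RuleSketch:
    text: str
    source: str = ""  # where the rule or its latest change came from
    is_generic: bool = False
    is_false_positive: bool = False
    exact_only: bool = False  # hypothetical: rule flagged for exact matching only
    skip_required_phrases: bool = False  # hypothetical name for the skip flag

    @property
    def is_tiny(self):
        # Assumed definition: fewer whitespace-separated tokens than the threshold.
        return len(self.text.split()) < TINY_RULE_TOKENS

    @property
    def is_approx_matchable(self):
        # Assumed definition: tiny or exact-only rules cannot be matched approximately.
        return not (self.is_tiny or self.exact_only)


def can_receive_required_phrases(rule):
    """Return True if the rule should take part in collection and injection."""
    return (
        rule.is_approx_matchable
        and not rule.is_false_positive
        and not rule.skip_required_phrases
    )
```

Keeping the eligibility test in one function means collection and injection can share the same skip decision.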
filter_invalid_matches_to_single_word_gibberish() also considers
license_clues as eligible for gibberish filtering.

Signed-off-by: Philippe Ombredanne <[email protected]>
This is a more correct result

Signed-off-by: Philippe Ombredanne <[email protected]>
is_candidate_false_positive() now also considers
license_clues as eligible for false positive filtering.

Signed-off-by: Philippe Ombredanne <[email protected]>
Adjust with latest license code detection for required phrases

Signed-off-by: Philippe Ombredanne <[email protected]>
This is after merging the latest develop and taking into account updates
with required phrases.

Signed-off-by: Philippe Ombredanne <[email protected]>
Seen in Debian copyright files

Signed-off-by: Philippe Ombredanne <[email protected]>
Only process stopwords for "is_continuous" rules

Signed-off-by: Philippe Ombredanne <[email protected]>
Some rules now have an "is_required_phrase" flag

Signed-off-by: Philippe Ombredanne <[email protected]>
2 participants