ScanCode contains too many data files #3049

Open
pombredanne opened this issue Aug 16, 2022 · 6 comments

Comments

@pombredanne
Member

The src/licensedcode/data directory contains 68K+ files, 64K of them just for the rules.
These rule files are barely used at runtime because they are baked into the index in a compressed form, and that index is what is used at runtime. The same applies to the license files, which are fully included in the index in object form.

The files themselves are only needed when the index is rebuilt.
Another issue is that handling so many files makes any filesystem operation (unbearably) slow, including at development time and at installation time.

It also creates side issues such as #2427 (comment) and linkedin/shiv#224

I suggest some of these to fix the issue:

Combining either all three actions or just the last two should make this workable for development, installation and runtime.

@mjherzog
Member

The first action seems to be the simplest solution.

@AyanSinhaMahapatra
Member

@pombredanne :

YAML or YAML front matter https://jekyllrb.com/docs/front-matter/

I think YAML front matter would be much better. Plain YAML has a lot of issues with text that contains whitespace and symbols, as we were discussing, and the plain YAML files are causing a lot of YAML read errors.
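
To illustrate why front matter sidesteps the quoting problems, here is a minimal sketch of splitting a rule file into its metadata block and its text. It assumes Jekyll-style `---` delimiter lines. The function name `split_front_matter` and the sample field values are illustrative only, not the actual ScanCode loader. The metadata string could then be handed to a YAML parser such as PyYAML's `yaml.safe_load`, while the rule text never goes through YAML at all.

```python
def split_front_matter(text):
    """Split a rule file's text into (front_matter, body).

    Assumption: the metadata block is delimited by lines containing
    only '---', as in Jekyll front matter. Returns ('', text) when no
    complete block is found.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    # No closing delimiter: treat the whole file as rule text.
    return "", text


# Hypothetical rule file content, for illustration only.
sample = """---
license_expression: mit
is_license_notice: yes
---
Licensed under the MIT license: see the --- LICENSE --- file.
"""

front_matter, body = split_front_matter(sample)
```

Because the body is taken verbatim after the closing delimiter, colons, dashes and leading whitespace in the rule text need no escaping.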

AyanSinhaMahapatra added a commit that referenced this issue Aug 24, 2022
* Delete all .yml files for rules
* Modify .RULE files to contain their data as YAML frontmatter

Reference: #3049
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
AyanSinhaMahapatra added a commit that referenced this issue Aug 29, 2022
This commit ends the process of merging .RULE and .yml files into
a single .RULE file which has YAML frontmatter storing the rule
metadata present in the .yml file previously.

This renaming and merging has been done to preserve line-history
for both the files.

Reference: #3049
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
@AyanSinhaMahapatra
Member

we can split the files in multiple sub-directories to limit the number of files to some sensible number (say under 5K per dir)

Are we going ahead with this?

If so, the options are:

  • Generate some kind of content-based hash (no need for nested dirs, just one level of directories, so maybe just one character of the hash is required)
  • Give the major license keys a folder of their own (the various gpls, apache, lgpl and so on)
  • Make directories based on characters from the license key (if there are a lot of them under one character, we can use option 2 or the later characters)
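
A rough sketch of how the first two options could map a rule file name to a sub-directory. Everything here is hypothetical: the function names, the `major_keys` tuple and the example file names are made up for illustration. The hash variant also hashes the file name rather than the file content, which is a deliberate swap so that editing a rule would not move it to another directory.

```python
import hashlib
from pathlib import PurePosixPath


def shard_by_hash(filename, width=1):
    """Place a file under a directory named after the first hex
    character(s) of the SHA-1 of its name (one character = 16 dirs)."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    return PurePosixPath(digest[:width]) / filename


def shard_by_key(filename, major_keys=("gpl", "lgpl", "apache"), width=1):
    """Give major license keys their own directory; otherwise fall back
    to the leading character(s) of the file name."""
    # Check longer keys first so a longer key wins over a shorter prefix.
    for key in sorted(major_keys, key=len, reverse=True):
        if filename.startswith(key):
            return PurePosixPath(key) / filename
    return PurePosixPath(filename[:width]) / filename


hashed = shard_by_hash("mit_3.RULE")
by_key_major = shard_by_key("gpl-2.0_12.RULE")
by_key_other = shard_by_key("mit_3.RULE")
```

With one hex character the hash variant spreads files over 16 directories, which would put roughly 64K rules at around 4K per directory if the hash distributes evenly. The key-based variant keeps related rules together but gives uneven directory sizes.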

@pombredanne
Member Author

IMHO let's wait a bit for this

@milahu

milahu commented May 1, 2024

licensedcode/data/rules could be compressed from about 150 MB to 3 MB

$ du -sh scancode-toolkit-32.1.0/lib/python3.11/site-packages/licensedcode/data/rules/
149M    scancode-toolkit-32.1.0/lib/python3.11/site-packages/licensedcode/data/rules/

$ ls -U scancode-toolkit-32.1.0/lib/python3.11/site-packages/licensedcode/data/rules/ | wc -l
34859

$ tar cf scancode-toolkit-rules.tar  scancode-toolkit-32.1.0/lib/python3.11/site-packages/licensedcode/data/rules/
$ zstd -19 scancode-toolkit-rules.tar -o scancode-toolkit-rules.tar.19.zst

$ du -sh scancode-toolkit-rules.tar*
64M     scancode-toolkit-rules.tar
3.0M    scancode-toolkit-rules.tar.19.zst
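
For reference, a minimal Python sketch of the same idea: pack the rule files into a single compressed archive and read one rule back without unpacking to disk. It uses xz through the standard `tarfile` module instead of zstd, since zstd is not in the standard library of the Python 3.11 shown in the paths above. The helper names and the demo file names are hypothetical, and this is not how ScanCode currently loads rules.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_rules(rules_dir, archive_path):
    """Pack every file directly under rules_dir into one xz-compressed tar."""
    with tarfile.open(str(archive_path), "w:xz") as archive:
        for path in sorted(Path(rules_dir).iterdir()):
            if path.is_file():
                archive.add(str(path), arcname=path.name)


def read_rule(archive_path, name):
    """Read one member's text from the archive without extracting to disk."""
    with tarfile.open(str(archive_path), "r:xz") as archive:
        member = archive.extractfile(name)
        return member.read().decode("utf-8")


# Demo with made-up rule files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    rules_dir = Path(tmp) / "rules"
    rules_dir.mkdir()
    for i in range(200):
        (rules_dir / f"example_{i}.RULE").write_text(
            "Licensed under the Example License, version 1.\n", encoding="utf-8"
        )
    archive_path = Path(tmp) / "rules.tar.xz"
    pack_rules(rules_dir, archive_path)
    packed_size = archive_path.stat().st_size
    first_rule = read_rule(archive_path, "example_0.RULE")
```

Reading a single member from a compressed tar still means decompressing the stream up to that member, so this layout suits the index rebuild case, where all rules are read once, better than random access.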

@pombredanne
Member Author

@milahu Thanks!
See also #3761, where we are building smaller wheels. This may help too.
