[WIP] CC: One PR to rule them all -- feature parity with old API #59

Merged
merged 59 commits into from
May 14, 2021

Conversation

adamjanovsky
Collaborator

@adamjanovsky adamjanovsky commented Apr 19, 2021

Description

Reaches feature parity with the old API with one exception -- analysis. The analyses will be replicated anyway and coded on-the-fly.

This PR implements

  • Refactor folder structure
  • Compute further heuristics, see below
  • Get rid of src attribute serialization in CommonCriteriaCert class
  • Refactor cc_download_tests: they fail quite often, so either allow for that or just get rid of them
  • Fully delete old API
  • CLI for new API CC
  • Keywords json segment should be a flat dictionary of matched_string: absolute_frequency
  • Fully download and parse the maintenance reports
  • Add yaml configuration for CC framework
  • Tests for computation of heuristics and extraction of stuff from txt files.
  • Existing contents of the dataset should be moved if root_dir changes (property.setter)
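
The flat keyword segment from the list above could look roughly like the following. This is only a minimal sketch, assuming keywords are found by regular expressions in the extracted text; the function name `count_keyword_matches` and the sample patterns are hypothetical and not taken from the sec-certs codebase.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def count_keyword_matches(text: str, patterns: Iterable[str]) -> Dict[str, int]:
    """Return a flat {matched_string: absolute_frequency} dictionary."""
    counts: Counter = Counter()
    for pattern in patterns:
        # Count every literal match, keyed by the matched string itself.
        for match in re.finditer(pattern, text):
            counts[match.group(0)] += 1
    return dict(counts)


# Hypothetical sample input, for illustration only.
sample_text = "The TOE uses AES-256 and SHA-256. AES-256 keys are generated on-chip."
sample_patterns = [r"AES-\d+", r"SHA-\d+"]
keywords = count_keyword_matches(sample_text, sample_patterns)
```

Keying by the matched string rather than by the pattern keeps the segment flat, so a consumer does not need to know which rule produced a given match.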

Heuristics to compute

After a short discussion with @J08nY, I'll be computing the following:

  • List of manufacturers -- and unify their naming across the whole dataset
  • Certification lab from pdf frontpage scan of the certificate report
  • Pair certificate with its protection profiles -- use constant PP json for now.
  • certificate ID

@adamjanovsky adamjanovsky self-assigned this Apr 19, 2021
@adamjanovsky adamjanovsky changed the title CC: One PR to rule them all -- feature parity with old API [WIP] CC: One PR to rule them all -- feature parity with old API Apr 19, 2021
@adamjanovsky adamjanovsky changed the base branch from master to dev April 19, 2021 15:56
@adamjanovsky
Collaborator Author

adamjanovsky commented Apr 19, 2021

@J08nY which fields from "processed" in the old json do you actually use? This segment contains some redundant information, e.g. the validity period in years, which can easily be computed from not_valid_before and not_valid_after. So I'm curious to know which of the following are directly used on the webpage:

  • List of manufacturers (some parsing of manufacturer field of the certificate)
  • Highest security level
  • Certification lab from pdf frontpage scan
  • Protection profile ID if available
  • certificate ID if available
  • Direct / indirect references -- dunno what these really are, produced by analysis I suppose?
  • lifespan in years
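
To illustrate why the validity period counts as redundant, it can be derived from the two dates with a single subtraction. A minimal sketch, assuming both fields are already parsed into `datetime.date` objects; the function name `validity_period_years` and the 365.25-day average year are choices made for this example only.

```python
from datetime import date


def validity_period_years(not_valid_before: date, not_valid_after: date) -> float:
    """Approximate lifespan in years, using an average year of 365.25 days."""
    return (not_valid_after - not_valid_before).days / 365.25


# Hypothetical dates, for illustration only.
lifespan = validity_period_years(date(2016, 1, 1), date(2021, 1, 1))
```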

@J08nY
Member

J08nY commented Apr 19, 2021

So I only rendered a few things from the processed field that I found, and I didn't even know some of these existed, so they are not that important w.r.t. the website render. What I do use is the certificate ID field from processed, which serves as the certificate ID for the certificate. This obviously has a huge effect on the network of certificates, so we need a good heuristic for it.

Even though I don't really use some of the others (only display the cert lab and cert lifetime length here: https://github.com/crocs-muni/sec-certs/blob/page/sec_certs/templates/cc/entry.html.jinja2#L121) I feel like they have value in the JSON export if the heuristics used to compute them are interesting. So I maybe wouldn't include something that can be computed as a simple subtraction of dates/years but I would include something that is computed non-trivially.

@adamjanovsky
Collaborator Author

adamjanovsky commented May 11, 2021

I've decided not to include the analogies of old-api fields:

  • cert['processed']['cc_manufacturer_list']
  • cert['processed']['cc_manufacturer_simple']

I manually went through approximately 200 certificates, and only a single one had any non-trivial content in those fields. Parsing them is tricky, may trigger false positives if done carelessly, and simply does not add any valuable information.

I've noticed that the old API is used to draw some dot plots. Are they of any use, @petrs?

I created a new issue, #73, for that, in case we ever encounter a sensible use case for such functionality.

@adamjanovsky
Collaborator Author

adamjanovsky commented May 14, 2021

@J08nY I was thinking about your request to process maintenance updates along with CC certificates. It turned out to be quite tricky and makes the resulting JSON messy. I ended up considering the following solution:

  • The maintenance reports will be treated as a separate dataset that will sit inside the certs/maintenance folder of the CC dataset
  • The json of the maintenance dataset will reside inside certs/maintenance, together with a reasonable folder structure for the parsed pdfs and txt files that correspond to the maintenance updates
  • The json of the maintenance dataset will merely contain a list of processed maintenances, where for each maintenance we collect:
    • Data from pdfs, txts, title, date, links to the documents
    • unambiguous reference to the related CC certificate
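
The proposal above could be sketched as one record per maintenance update, serialized as a plain list. This is only an illustration under stated assumptions: the class name `MaintenanceReport`, all of its field names, and the example values are hypothetical and do not reflect the actual sec-certs schema or the contents of the demo archive.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class MaintenanceReport:
    # All field names are assumptions made for illustration.
    related_cert_id: str  # unambiguous reference to the related CC certificate
    title: str
    maintenance_date: date
    document_links: List[str] = field(default_factory=list)
    pdf_data: Dict[str, str] = field(default_factory=dict)
    txt_data: Dict[str, str] = field(default_factory=dict)


def dump_maintenance_dataset(reports: List[MaintenanceReport]) -> str:
    """Serialize the dataset as a JSON list of maintenance records."""
    records = []
    for report in reports:
        record = asdict(report)
        # Dates are not JSON-serializable, so store them as ISO strings.
        record["maintenance_date"] = report.maintenance_date.isoformat()
        records.append(record)
    return json.dumps(records, indent=2)


# Hypothetical record, for illustration only.
example = MaintenanceReport(
    related_cert_id="hypothetical-cert-id",
    title="Example maintenance update",
    maintenance_date=date(2021, 5, 14),
    document_links=["https://example.com/maintenance-report.pdf"],
)
serialized = dump_maintenance_dataset([example])
```

Keeping the certificate reference inside each record means the CC json itself stays unchanged, and a consumer can join the two sources on that reference.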

I prepared a small demo at ajanovsky.cz/test_maintenance.zip

That way, we can still process all certificates without making the CC json too messy. Do you mind having two different sources for your database on the web?

I can explain on the phone if I'm not clear enough.

@adamjanovsky adamjanovsky marked this pull request as ready for review May 14, 2021 12:39
2 participants