[WIP] CC: One PR to rule them all -- feature parity with old API #59

adamjanovsky · 2021-04-19T12:30:07Z

Description

Reaches feature parity with the old API with one exception -- analysis. The analyses will be replicated anyway and coded on-the-fly.

This PR implements

Heuristics to compute

After a short discussion with @J08nY, I'll be computing the following:

~~List of manufacturers -- and unify their naming across whole dataset~~
Certification lab from pdf frontpage scan of the certificate report
Pair certificate with its protection profiles -- use constant PP json for now.
certificate ID

adamjanovsky · 2021-04-19T16:21:10Z

@J08nY which fields from "processed" in the old JSON exactly do you use? This segment contains some redundant information, e.g. the validity period in years, which can easily be computed from not_valid_before and not_valid_after. So I'm curious which of the following are directly used on the webpage?

List of manufacturers (some parsing of manufacturer field of the certificate)
Highest security level
Certification lab from pdf frontpage scan
Protection profile ID if available
certificate ID if available
Direct / indirect references -- dunno what these really are, produced by analysis I suppose?
lifespan in years

J08nY · 2021-04-19T16:32:58Z

So I only rendered a few things from the processed field that I found, and I didn't even know some of these existed, so they are not that important w.r.t. the website render. What I do use is the certificate ID field from processed, which serves as the certificate ID for the certificate. This obviously has a huge effect on the network of certificates, so we need a good heuristic for it.

Even though I don't really use some of the others (only display the cert lab and cert lifetime length here: https://github.com/crocs-muni/sec-certs/blob/page/sec_certs/templates/cc/entry.html.jinja2#L121) I feel like they have value in the JSON export if the heuristics used to compute them are interesting. So I maybe wouldn't include something that can be computed as a simple subtraction of dates/years but I would include something that is computed non-trivially.
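For reference, the "simple subtraction of dates" case could look like the minimal sketch below. It assumes not_valid_before and not_valid_after are available as Python date objects; the function name and the 365.25-day year are illustrative choices, not something taken from the codebase.

```python
from datetime import date
from typing import Optional


def validity_period_years(
    not_valid_before: Optional[date], not_valid_after: Optional[date]
) -> Optional[float]:
    """Return the validity period in years, or None if either date is missing.

    Hypothetical helper: the name and the rounding are assumptions.
    """
    if not_valid_before is None or not_valid_after is None:
        return None
    # Subtracting two dates yields a timedelta; convert its days to years.
    return round((not_valid_after - not_valid_before).days / 365.25, 2)
```

Since this is derivable from two fields that are already exported, it is the kind of value that can be recomputed by the consumer rather than stored in the JSON.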


adamjanovsky · 2021-05-11T11:53:40Z

I've decided not to include the analogues of these old API fields:

cert['processed']['cc_manufacturer_list']
cert['processed']['cc_manufacturer_simple']

I manually went through approximately 200 certificates, and only a single one had any non-trivial content in those fields. Parsing them is tricky, may trigger false positives if done carelessly, and simply does not add any valuable information.

I've noticed that the old API is used to draw some dot plots. Are they of any use, @petrs?

I created a new issue, #73, for that, in case we ever encounter a sensible use case for such functionality.

adamjanovsky · 2021-05-14T12:19:05Z

@J08nY I was thinking about your request to process maintenance updates along with CC certificates. It turned out to be quite tricky and makes the resulting JSON messy. I ended up considering the following solution:

The maintenance reports will be treated as a separate dataset that will sit inside the certs/maintenance folder of the CC dataset
The JSON of the maintenance dataset will reside inside certs/maintenance, together with a reasonable folder structure for the parsed PDFs and txt files that correspond to the maintenance updates
The JSON of the maintenance dataset will merely contain a list of processed maintenances, where for each maintenance we collect:
- Data from the PDFs and txt files, title, date, links to the documents
- An unambiguous reference to the related CC certificate
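To make the proposed layout concrete, here is a rough sketch of what one record in the maintenance JSON could look like. All field names, the class name, and the example values are hypothetical placeholders chosen for illustration; they are not taken from the demo archive or the actual implementation.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class MaintenanceUpdate:
    """One processed maintenance update (hypothetical schema)."""

    related_cert_id: str  # unambiguous reference to the related CC certificate
    title: str
    date: str  # ISO 8601 date string, e.g. "2021-05-14"
    report_link: str  # link to the maintenance report document
    st_link: str  # link to the security target document
    pdf_data: dict = field(default_factory=dict)  # data extracted from pdfs
    txt_data: dict = field(default_factory=dict)  # data extracted from txt files


def dump_maintenances(updates: List[MaintenanceUpdate]) -> str:
    """Serialize the dataset as a plain JSON list of maintenance records."""
    return json.dumps([asdict(u) for u in updates], indent=2, sort_keys=True)


example = MaintenanceUpdate(
    related_cert_id="CERT-0001",  # placeholder identifier
    title="Example maintenance update",
    date="2021-05-14",
    report_link="https://example.org/report.pdf",
    st_link="https://example.org/st.pdf",
)
serialized = dump_maintenances([example])
```

Keeping the reference to the parent certificate inside each record lets the two datasets be joined later without nesting the maintenance data into the CC JSON.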

I prepared a small demo at ajanovsky.cz/test_maintenance.zip

That way, we can still process all certificates without making the CC JSON too messy. Do you mind having two different sources for your database on the web?

I can explain on the phone if I'm not clear enough.

dummy folders for directory refactoring

4bc9be7

adamjanovsky self-assigned this Apr 19, 2021

adamjanovsky force-pushed the cc-feature-parity branch from 978f18d to 4bc9be7 Compare April 19, 2021 12:50

adamjanovsky added 4 commits April 19, 2021 14:57

refactor folder structure

4f646e3

delete print statement in tests

6354f6a

fix import in dataset

a15acdd

delete print statement in tests

b410def

adamjanovsky changed the title ~~CC: One PR to rule them all -- feature parity with old API~~ [WIP] CC: One PR to rule them all -- feature parity with old API Apr 19, 2021

adamjanovsky changed the base branch from master to dev April 19, 2021 15:56

adamjanovsky added 16 commits April 20, 2021 09:04

collect certification lab in heuristics

c13a2ab

compute heuristics cert lab

eb08912

Extraction of cert_id, revocation of some tests

a2ffac0

- Cert_ids are now extracted
- cert_id from the frontpage is always preferred
- If no frontpage cert_id is found, the most frequently occurring keyword is preferred
- Some tests were revoked, as heuristics should be computed after pdf processing
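The preference order described in this commit message can be sketched roughly as follows. This is an illustration only: the function name, the argument names, and the shape of the keyword counts (a mapping from candidate ID to number of occurrences) are assumptions, not the code from the commit.

```python
from collections import Counter
from typing import Dict, Optional


def choose_cert_id(
    frontpage_cert_id: Optional[str], keyword_counts: Dict[str, int]
) -> Optional[str]:
    """Pick a certificate ID (hypothetical helper).

    1. A cert_id found on the frontpage is always preferred.
    2. Otherwise, fall back to the most frequently occurring keyword candidate.
    3. If there are no candidates at all, return None.
    """
    if frontpage_cert_id:
        return frontpage_cert_id
    if not keyword_counts:
        return None
    # most_common(1) returns a list with one (candidate, count) pair.
    return Counter(keyword_counts).most_common(1)[0][0]
```

Both inputs come from pdf processing, which is consistent with the note that heuristics should be computed only after that step has run.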

increase max cpe matches constant

b6d4be2

dont compute heuristics immediately

34eca2f

basically rebase of dev

1f69f17

adjust cpe matching constant

5fc379b

functions to import/export for label-studio

ce8eae0

merge dev onto cc-feature-parity

f684f70

rename test -> tests

4c92744

delete stale tests of download cc

1746ba5

isolate function for downloading csv/html resources

a5a6c4f

new cc download tests

4baef42

fix serialization in toy dataset test

796a02e

cc links http -> https

f76db3c

cc implement protection profiles dataset

f812921

adamjanovsky added 2 commits May 11, 2021 15:16

New API: delete src attribute of CC cert

79e1233

improve config handling

7676b82

adamjanovsky added 10 commits May 13, 2021 16:35

change badge link

cac770c

delete old API

a910f54

add test for txt processing

033f3a9

method for downloading json dset from web

e28592b

CC CLI

535a271

improvements in config handling CLI

4aa4770

setting root dir properly copies whole dataset contents

9b4e3c0

computing heuristics now updates state of dataset

93a6c47

exception handling of dataset copy function

d6a52af

flatten keywords dictionary

1f98b35

adamjanovsky added 3 commits May 14, 2021 14:30

implements maintenance update processing

8880e1e

some constraints on maintenance updates

c8a330a

fix tests: delete maintenance update

dcb44d6

adamjanovsky marked this pull request as ready for review May 14, 2021 12:39

adamjanovsky added 10 commits May 14, 2021 14:49

cli: error on no input and no output

b1dcb36

adds maintenances action to cli

c019ccb

add url to latest dataset snapshot

897f667

add python 3.10 to versions

d44ca1f

add basic notebook for exploration

83d25a1

update readme

b5573af

trying to workout the dockerfile for binder

dddceeb

merge dev into cc-feature-parity

41bb82d

working on broken merge

70fba15

fix broken merge

7551936

adamjanovsky merged commit 996186b into dev May 14, 2021

adamjanovsky deleted the cc-feature-parity branch May 14, 2021 16:44

This was referenced May 14, 2021

Refactoring: Split large files that cover multiple classes #51

Closed

Write tests for cc data extraction from pdf, txt files #33

Closed

Add mybinder.org example for simple CC data analysis #78

Closed
