[WIP] CC: One PR to rule them all -- feature parity with old API #59

Merged: 59 commits, May 14, 2021
Commits
4bc9be7
dummy folders for directory refactoring
adamjanovsky Apr 19, 2021
4f646e3
refactor folder structure
adamjanovsky Apr 19, 2021
6354f6a
delete print statement in tests
adamjanovsky Apr 19, 2021
a15acdd
fix import in dataset
adamjanovsky Apr 19, 2021
b410def
delete print statement in tests
adamjanovsky Apr 19, 2021
c13a2ab
collect certification lab in heuristics
adamjanovsky Apr 20, 2021
eb08912
compute heuristics cert lab
adamjanovsky Apr 20, 2021
a2ffac0
Extraction of cert_id, revocation of some tests
adamjanovsky Apr 20, 2021
b6d4be2
increase max cpe matches constant
adamjanovsky Apr 20, 2021
34eca2f
dont compute heuristics immediately
adamjanovsky Apr 20, 2021
1f69f17
basically rebase of dev
adamjanovsky Apr 20, 2021
5fc379b
adjust cpe matching constant
adamjanovsky Apr 20, 2021
ce8eae0
functions to import/export for label-studio
adamjanovsky Apr 23, 2021
f684f70
merge dev onto cc-feature-parity
adamjanovsky May 10, 2021
4c92744
rename test -> tests
adamjanovsky May 10, 2021
1746ba5
delete stale tests of download cc
adamjanovsky May 10, 2021
a5a6c4f
isolate function for downloading csv/html resources
adamjanovsky May 10, 2021
4baef42
new cc download tests
adamjanovsky May 10, 2021
796a02e
fix serialization in toy dataset test
adamjanovsky May 10, 2021
f76db3c
cc links http -> https
adamjanovsky May 11, 2021
f812921
cc implement protection profiles dataset
adamjanovsky May 11, 2021
79e1233
New API: delete src attribute of CC cert
adamjanovsky May 11, 2021
7676b82
improve config handling
adamjanovsky May 11, 2021
3c2e643
new dot notation of config
adamjanovsky May 11, 2021
a117f34
get rid of constants in favor of config
adamjanovsky May 11, 2021
9d3455e
fix test path problems
adamjanovsky May 11, 2021
55c1d3f
test improvements
adamjanovsky May 13, 2021
51e7908
override __bool__ for PdfData class
adamjanovsky May 13, 2021
c25601e
fix in __bool__ for pdfData class
adamjanovsky May 13, 2021
114a3cf
implement heuristics tests
adamjanovsky May 13, 2021
134ee9b
add simple PP test
adamjanovsky May 13, 2021
3815728
simplify json search process
adamjanovsky May 13, 2021
be9385b
__file__ -> inspect.getfile()
adamjanovsky May 13, 2021
70f46dc
inspect() -> sys.module[package].__file__
adamjanovsky May 13, 2021
6f64ea6
workflow dispatch for tests
adamjanovsky May 13, 2021
18a283f
now trying with __path__
adamjanovsky May 13, 2021
cac770c
change badge link
adamjanovsky May 13, 2021
a910f54
delete old API
adamjanovsky May 13, 2021
033f3a9
add test for txt processing
adamjanovsky May 13, 2021
e28592b
method for downloading json dset from web
adamjanovsky May 13, 2021
535a271
CC CLI
adamjanovsky May 13, 2021
4aa4770
improvements in config handling CLI
adamjanovsky May 13, 2021
9b4e3c0
setting root dir properly copies whole dataset contents
adamjanovsky May 14, 2021
93a6c47
computing heuristics now updates state of dataset
adamjanovsky May 14, 2021
d6a52af
exception handling of dataset copy function
adamjanovsky May 14, 2021
1f98b35
flatten keywords dictionary
adamjanovsky May 14, 2021
8880e1e
implements maintenance update processing
adamjanovsky May 14, 2021
c8a330a
some constraints on maintenance updates
adamjanovsky May 14, 2021
dcb44d6
fix tests: delete maintenance update
adamjanovsky May 14, 2021
b1dcb36
cli: error on no input and no output
adamjanovsky May 14, 2021
c019ccb
adds maintenances action to cli
adamjanovsky May 14, 2021
897f667
add url to latest dataset snapshot
adamjanovsky May 14, 2021
d44ca1f
add python 3.10 to versions
adamjanovsky May 14, 2021
83d25a1
add basic notebook for exploration
adamjanovsky May 14, 2021
b5573af
update readme
adamjanovsky May 14, 2021
dddceeb
trying to workout the dockerfile for binder
adamjanovsky May 14, 2021
41bb82d
merge dev into cc-feature-parity
adamjanovsky May 14, 2021
70fba15
working on broken merge
adamjanovsky May 14, 2021
7551936
fix broken merge
adamjanovsky May 14, 2021
9 changes: 6 additions & 3 deletions .github/workflows/GA_CI.yml → .github/workflows/tests.yml
@@ -1,5 +1,8 @@
name: tests
on: [push]
on:
push:
workflow_dispatch:



jobs:
@@ -14,9 +17,9 @@ jobs:
python-version: '3.8'
- name: Install python dependencies
run: pip install -r requirements.txt
- name : Install pytest and run scripts like Travis does
- name : Install pytest and run tests
run: |
pip install pytest
pip install pytest-cov
pip install ".[dev,test]"
pytest test
python3 -m unittest discover tests
14 changes: 13 additions & 1 deletion Dockerfile
@@ -1,5 +1,16 @@
FROM ubuntu

ARG NB_USER
ARG NB_UID
ENV USER ${NB_USER}
ENV HOME /home/${NB_USER}

RUN adduser --disabled-password \
--gecos "Default user" \
--uid ${NB_UID} \
${NB_USER}
WORKDIR ${HOME}

#installing dependencies
RUN apt-get update
RUN apt-get install python3 -y
@@ -25,9 +36,10 @@ ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN cp /opt/sec-certs/requirements.txt .
RUN pip install wheel
RUN pip install -r requirements.txt
RUN pip install --no-cache notebook
#just to be sure that pdftotext is in $PATH
ENV PATH /usr/bin:${PATH}


# Run the application:
CMD ["python3", "/opt/sec-certs/examples/cc_oop_demo.py"]
CMD ["python3", "/opt/sec-certs/cc_cli.py"]
98 changes: 61 additions & 37 deletions README.md
@@ -7,13 +7,14 @@ This project is developed by the [Centre for Research On Cryptography and Securi
[![Website](https://img.shields.io/website?down_color=red&down_message=offline&style=flat-square&up_color=SpringGreen&up_message=online&url=https%3A%2F%2Fseccerts.org)](https://seccerts.org)
[![PyPI](https://img.shields.io/pypi/v/sec-certs?style=flat-square)](https://pypi.org/project/sec-certs/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sec-certs?label=Python%20versions&style=flat-square)](https://pypi.org/project/sec-certs/)
[![GitHub Workflow Status](https://img.shields.io/github/workflow/status/crocs-muni/sec-certs/tests?style=flat-square)](https://github.com/crocs-muni/sec-certs/actions/workflows/GA_CI.yml)
[![GitHub Workflow Status](https://img.shields.io/github/workflow/status/crocs-muni/sec-certs/tests?style=flat-square)](https://github.com/crocs-muni/sec-certs/actions/workflows/tests.yml)
[![GitHub Workflow Status](https://img.shields.io/github/workflow/status/crocs-muni/sec-certs/Docker%20Image%20CI?label=Docker%20build&style=flat-square)](https://hub.docker.com/repository/docker/seccerts/sec-certs)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/crocs-muni/sec-certs/cc-feature-parity?filepath=notebooks%2Fcc_data_exploration.ipynb)

## Installation (CC)

The tool requires several Python packages as well as the `pdftotext` binary somewhere on the `PATH`.
The tool requires `Python >=3.8` and the `pdftotext` binary somewhere on the `PATH`.

The stable release is published on [PyPI](https://pypi.org/project/sec-certs/) as well as on [DockerHub](https://hub.docker.com/repository/docker/seccerts/sec-certs). You can install it with:

```
@@ -26,52 +27,75 @@ or
docker pull seccerts/sec-certs
```

Alternatively, you can setup the tool for development in a virtual environment, e.g.:
Install Python virtual environment (if not yet):
```
python3 -m pip install --upgrade pip
pip install virtualenv
```
Setup new local one named 'virt' :
Alternatively, you can set up the tool for development in a virtual environment:

```
python3 -m venv virt
. virt/bin/activate
python3 -m venv venv
source venv/bin/activate
pip install -e .
```

## Examples
## Usage

There are two main steps in exploring the world of Common Criteria certificates:

1. Processing all the certificates
2. Data exploration

For the first step, we currently provide a CLI and an already processed fresh snapshot. For the second step, we provide a simple API that can be used directly inside our Jupyter notebook or locally on your machine.

### Explore data with MyBinder Jupyter notebook

Most probably, you don't want to process a fresh snapshot of Common Criteria certificates yourself. Instead, you can use our results and explore them using an [online Jupyter notebook](https://mybinder.org/v2/gh/crocs-muni/sec-certs/cc-feature-parity?filepath=notebooks%2Fcc_data_exploration.ipynb).

### Explore the latest snapshot locally

In Python, run

```python
from sec_certs.dataset.common_criteria import CCDataset
import pandas as pd

dset = CCDataset.from_web_latest()  # now you can inspect the object, certificates are held in dset.certs
df = dset.to_pandas()  # or you can transform the object into a Pandas dataframe
dset.to_json('./latest_cc_snapshot.json')  # you may want to store the snapshot as JSON, so that you don't have to download it again
dset = CCDataset.from_json('./latest_cc_snapshot.json')  # you can now load your stored dataset again
```
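
The download-once, reload-later pattern in the snippet above can be wrapped in a small helper. This is only a sketch: `load_or_fetch` and its parameters are hypothetical names that are not part of `sec_certs`, and the commented usage assumes `CCDataset.from_web_latest`, `to_json` and `from_json` behave as shown in the README snippet.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_fetch(
    cache_path: Path,
    fetch: Callable[[], T],
    load: Callable[[Path], T],
    save: Callable[[T, Path], None],
) -> T:
    """Return the cached object if cache_path exists; otherwise fetch it and cache it."""
    if cache_path.exists():
        return load(cache_path)
    obj = fetch()
    save(obj, cache_path)
    return obj


# Hypothetical usage with the calls from the README (requires sec_certs to be installed):
# dset = load_or_fetch(
#     Path("./latest_cc_snapshot.json"),
#     fetch=CCDataset.from_web_latest,
#     load=CCDataset.from_json,
#     save=lambda d, p: d.to_json(p),
# )
```

Passing the fetch, load and save steps as callables keeps the helper independent of the dataset class, so it can be exercised without network access.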

### Process CC data with Python

Some examples are documented in [examples](https://github.com/crocs-muni/sec-certs/blob/master/examples/)
If you wish to fully process the Common Criteria (CC) data by yourself, you can do that as follows. Running

## Old API
```
cc-cli all --output ./cc_dataset
```

will fully process the Common Criteria dataset, which can take up to 6 hours to finish. You can also select only some of the tasks to run. Calling `cc-cli --help` yields

The following steps will do a full extraction and analysis of CC certificates:
```
Usage: cc_cli.py [OPTIONS] [all|build|download|convert|analyze|maintenances]...

1. Make a directory in which the certificates will be downloaded and processing will take place.
The contents of the directory are under the control of the tool, and **may be overwritten**!
2. Run `python process_certificates.py --fresh --do-download-meta <dir>` to download certificate metadata from the Common Criteria portal.
3. Run `python process_certificates.py --fresh --do-extraction-meta <dir>` to extract metadata from the downloaded Common Criteria pages.
4. Run `python process_certificates.py --fresh --do-download-certs <dir>` to download the certificate and security target PDF files. This
step takes time as there is quite a lot of files. It also takes up a lot of space (around 5GB). It is done in parallel
and the number of threads can be changed with the `-t/--threads` switch (the default is 4).
5. Run `python process_certificates.py --fresh --do-pdftotext <dir>` to convert the PDF files to text.
6. Run `python process_certificates.py --fresh --do-extraction <dir>` to extract information from the certificates and security targets.
7. Run `python process_certificates.py --fresh --do-pairing <dir>`.
8. Run `python process_certificates.py --fresh --do-processing <dir>` to run various heuristics which will create post-processed section
`processed` for every certificate (results are stored in `certificate_data_complete_processed.json`).
9. Run `python process_certificates.py --fresh --do-analysis <dir>` to perform analysis of certificates (various graphs, statistics...).
10. Open, look and enjoy graphs like `num_certs_in_years.png` or `num_certs_eal_in_years.png`. For `certid_graph.dot.pdf`
and other large graphs use Chrome to display as Adobe Acrobat Reader will fail to show whole graph.
Specify actions, sequence of one or more strings from the following list:
[all, build, download, convert, analyze] If 'all' is specified, all
actions run against the dataset. Otherwise, only selected actions will run
in the correct order.

Options:
-o, --output DIRECTORY Path where the output of the experiment will be
stored. May overwrite existing content.

## Extending the analysis
-c, --config FILE Path to your own config yaml file that will override
the default one.

The analysis can be extended in several ways:
1. Additional keywords can be extracted from PDF files (modify `cert_rules.py`)
2. Data from `certificate_data_complete.json` can be analyzed in a novel way - this is why this project was conceived in the first place.
3. Help to fix problems in data extraction - some PDF files are corrupted, there are many typos even in certificate IDs...
-i, --input FILE If set, the actions will be performed on a CC
dataset loaded from JSON from the input path.

-s, --silent If set, will not print to stdout
--help Show this message and exit.
```
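
The help text says that selected actions "will run in the correct order" regardless of how they are typed. One way to get that behaviour is sketched below. `PIPELINE_ORDER`, `ALL_EXPANSION` and `order_actions` are hypothetical names chosen for illustration; the ordering and the expansion of `all` mirror what `cc_cli.py` in this PR appears to do, which is an assumption about intent rather than documented behaviour.

```python
from typing import Iterable, List

# Order in which cc_cli.py checks the actions (assumed to be the intended pipeline order).
PIPELINE_ORDER = ("build", "download", "convert", "analyze", "maintenances")
# In cc_cli.py, 'all' expands to these four actions only.
ALL_EXPANSION = ("build", "download", "convert", "analyze")


def order_actions(requested: Iterable[str]) -> List[str]:
    """Return the requested actions, de-duplicated and sorted into pipeline order."""
    requested_set = {action.lower() for action in requested}
    unknown = requested_set - set(PIPELINE_ORDER) - {"all"}
    if unknown:
        raise ValueError(f"Unknown actions: {sorted(unknown)}")
    if "all" in requested_set:
        # As in cc_cli.py, 'all' replaces whatever else was requested.
        return list(ALL_EXPANSION)
    return [action for action in PIPELINE_ORDER if action in requested_set]


print(order_actions(["convert", "build"]))  # ['build', 'convert']
```

Filtering a fixed tuple, rather than sorting the user input, keeps the order explicit and makes duplicates harmless.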

## How to run the application with a Docker container
### Process CC data with Docker

1. Pull the image from the DockerHub repository: `docker pull seccerts/sec-certs`
2. Run `docker run --volume ./processed_data:/opt/sec-certs/examples/debug_dataset -it seccerts/sec-certs`
105 changes: 105 additions & 0 deletions cc_cli.py
@@ -0,0 +1,105 @@
#!/usr/bin/env python3
from typing import Optional, List
import click
from pathlib import Path
import logging
import sys
from datetime import datetime

from sec_certs.configuration import config
from sec_certs.dataset.common_criteria import CCDataset

logger = logging.getLogger(__name__)


@click.command()
@click.argument('actions', required=True, nargs=-1, type=click.Choice(['all', 'build', 'download', 'convert', 'analyze', 'maintenances'], case_sensitive=False))
@click.option('-o', '--output', type=click.Path(file_okay=False, dir_okay=True, writable=True, readable=True),
              help='Path where the output of the experiment will be stored. May overwrite existing content.')
@click.option('-c', '--config', 'configpath', default=None, type=click.Path(file_okay=True, dir_okay=False, writable=True, readable=True),
              help='Path to your own config yaml file that will override the default one.')
@click.option('-i', '--input', 'inputpath', type=click.Path(file_okay=True, dir_okay=False, writable=True, readable=True),
              help='If set, the actions will be performed on a CC dataset loaded from JSON from the input path.')
@click.option('-s', '--silent', is_flag=True, help='If set, will not print to stdout')
def main(configpath: Optional[str], actions: List[str], inputpath: Optional[Path], output: Optional[Path], silent: bool):
    """
    Specify actions, sequence of one or more strings from the following list: [all, build, download, convert, analyze, maintenances]
    If 'all' is specified, the build, download, convert and analyze actions run against the dataset. Otherwise, only selected actions will run in the correct order.
    """
    file_handler = logging.FileHandler(config.log_filepath)
    stream_handler = logging.StreamHandler(sys.stderr)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    file_handler.setFormatter(formatter)
    stream_handler.setFormatter(formatter)
    handlers = [file_handler]

    if output:
        output = Path(output)

    if not inputpath and not output:
        print('Error: You did not specify path to load the dataset from, nor did you specify where dataset can be stored.')
        sys.exit(1)

    if not silent:
        handlers.append(stream_handler)

    logging.basicConfig(level=logging.INFO, handlers=handlers)
    start = datetime.now()

    if configpath:
        try:
            config.load(Path(configpath))
        except FileNotFoundError:
            print('Error: Bad path to configuration file')
            sys.exit(1)
        except ValueError as e:
            print(f'Error: Bad format of configuration file: {e}')
            sys.exit(1)

    # 'all' expands to the four main pipeline actions; 'maintenances' must be requested explicitly.
    actions = {'build', 'download', 'convert', 'analyze'} if 'all' in actions else set(actions)

    if inputpath and 'build' not in actions:
        dset: CCDataset = CCDataset.from_json(Path(inputpath))
        if output:
            print('Warning: you provided both input and output paths. The dataset from input path will get copied to output path.')
            dset.root_dir = output

    if inputpath and 'build' in actions:
        print(f'Warning: you wanted to build a dataset but you provided one in JSON -- that will be ignored. New one will be constructed at: {output}')

    if 'build' in actions:
        dset: CCDataset = CCDataset(certs={}, root_dir=output, name='CommonCriteria_dataset', description=f'Full CommonCriteria dataset snapshot {datetime.now().date()}')
        dset.get_certs_from_web()
    elif not inputpath:
        print('Error: If you do not provide input parameter, you must use \'build\' action to build dataset first.')
        sys.exit(1)

    if 'download' in actions:
        if not dset.state.meta_sources_parsed:
            print('Error: You want to download all pdfs, but the data from commoncriteria.org was not parsed. You must use \'build\' action first.')
            sys.exit(1)
        dset.download_all_pdfs()

    if 'convert' in actions:
        if not dset.state.pdfs_downloaded:
            print('Error: You want to convert pdfs -> txt, but the pdfs were not downloaded. You must use \'download\' action first.')
            sys.exit(1)
        dset.convert_all_pdfs()

    if 'analyze' in actions:
        if not dset.state.pdfs_converted:
            print('Error: You want to process txt documents of certificates, but pdfs were not converted. You must use \'convert\' action first.')
            sys.exit(1)
        dset.extract_data()
        dset.compute_heuristics()

    if 'maintenances' in actions:
        if not dset.state.meta_sources_parsed:
            print('Error: You want to process maintenance updates, but the data from commoncriteria.org was not parsed. You must use \'build\' action first.')
            sys.exit(1)

    end = datetime.now()
    # end - start is a timedelta; convert it explicitly so that the unit in the message is correct.
    logger.info(f'The computation took {(end - start).total_seconds()} seconds.')


if __name__ == '__main__':
    main()
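
The script above guards each action with a check on `dset.state`. The same gating can be written as a lookup table, sketched below under stated assumptions: `DatasetState`, `PRECONDITIONS` and `missing_prerequisite` are hypothetical names, and only the three flag names (`meta_sources_parsed`, `pdfs_downloaded`, `pdfs_converted`) come from the script itself.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class DatasetState:
    # Hypothetical stand-in for dset.state; the flag names mirror those read in cc_cli.py.
    meta_sources_parsed: bool = False
    pdfs_downloaded: bool = False
    pdfs_converted: bool = False


# action -> (flag that must be set, action that sets it)
PRECONDITIONS: Dict[str, Tuple[str, str]] = {
    "download": ("meta_sources_parsed", "build"),
    "convert": ("pdfs_downloaded", "download"),
    "analyze": ("pdfs_converted", "convert"),
    "maintenances": ("meta_sources_parsed", "build"),
}


def missing_prerequisite(action: str, state: DatasetState) -> Optional[str]:
    """Return the action that must run first, or None if `action` may proceed."""
    requirement = PRECONDITIONS.get(action)
    if requirement is None:
        return None  # e.g. 'build' has no precondition
    flag, prerequisite = requirement
    return None if getattr(state, flag) else prerequisite


state = DatasetState(meta_sources_parsed=True)
print(missing_prerequisite("download", state))  # None
print(missing_prerequisite("convert", state))   # download
```

A table like this keeps the error message and the required flag next to each other, which may be easier to extend than one `if` block per action.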
6 changes: 3 additions & 3 deletions examples/cc_cpe_labeling.py
@@ -2,14 +2,14 @@
import logging
from pathlib import Path

from sec_certs.dataset import CCDataset
import sec_certs.constants as constants
from sec_certs.dataset.common_criteria import CCDataset
from sec_certs.configuration import config

logger = logging.getLogger(__name__)


def main():
file_handler = logging.FileHandler(constants.LOGS_FILENAME)
file_handler = logging.FileHandler(config.log_filepath)
stream_handler = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
file_handler.setFormatter(formatter)
21 changes: 11 additions & 10 deletions examples/cc_oop_demo.py
@@ -1,17 +1,15 @@
from sec_certs.dataset import CCDataset
from sec_certs.serialization import CustomJSONEncoder, CustomJSONDecoder
import sec_certs.constants as constants
from pathlib import Path
from datetime import datetime
import logging
import json
import pandas as pd

from sec_certs.dataset.common_criteria import CCDataset
from sec_certs.configuration import config

logger = logging.getLogger(__name__)


def main():
file_handler = logging.FileHandler(constants.LOGS_FILENAME)
file_handler = logging.FileHandler(config.log_filepath)
stream_handler = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
file_handler.setFormatter(formatter)
@@ -28,24 +26,27 @@ def main():
# explicitly dump to json
dset.to_json(dset.json_path)

# Retrieve protection profile IDs
dset.process_protection_profiles()

# Load dataset from JSON
new_dset = CCDataset.from_json('./debug_dataset/cc_full_dataset.json')
assert dset == new_dset

# Download pdfs and update json
dset.download_all_pdfs(update_json=True)
dset.download_all_pdfs()

# Convert pdfs to text and update json
dset.convert_all_pdfs(update_json=True)
dset.convert_all_pdfs()

# Extract data from txt files and update json
dset.extract_data(update_json=True)
dset.extract_data()

# transform to pandas DataFrame
df = dset.to_pandas()

# Compute heuristics on the dataset
dset.compute_heuristics(update_json=True)
dset.compute_heuristics()

# Manually verify CPE findings and compute related cves
# dset.manually_verify_cpe_matches(update_json=True)
4 changes: 2 additions & 2 deletions examples/fips_oop_demo.py
@@ -2,8 +2,8 @@
from datetime import datetime
import logging
import click

from sec_certs.dataset import FIPSDataset, FIPSAlgorithmDataset
from sec_certs.dataset.fips import FIPSDataset
from sec_certs.dataset.fips_algorithm import FIPSAlgorithmDataset
from sec_certs.configuration import config

