crocs-muni · adamjanovsky · May 14, 2021 · Apr 19, 2021
diff --git a/.github/workflows/GA_CI.yml → .github/workflows/tests.yml b/.github/workflows/GA_CI.yml → .github/workflows/tests.yml
@@ -1,5 +1,8 @@
 name: tests
-on: [push]
+on:
+  push:
+  workflow_dispatch:
+
 
 
 jobs:
@@ -14,9 +17,9 @@ jobs:
           python-version: '3.8'
       - name: Install python dependencies
         run: pip install -r requirements.txt
-      - name : Install pytest and run scripts like Travis does
+      - name : Install pytest and run tests
         run: |
           pip install pytest
           pip install pytest-cov
           pip install ".[dev,test]"
-          pytest test
+          python3 -m unittest discover tests
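Note on the new test command: `python3 -m unittest discover tests` collects modules under `tests/` whose file names match `test*.py` and runs every `unittest.TestCase` subclass found in them. A minimal sketch of such a module is shown below; the helper `normalize_cert_id` and the test class are hypothetical illustrations and are not part of sec-certs.

```python
import unittest


def normalize_cert_id(raw: str) -> str:
    """Hypothetical helper: trim whitespace and upper-case a certificate ID."""
    return raw.strip().upper()


class TestNormalizeCertId(unittest.TestCase):
    """Discovered automatically when saved as e.g. tests/test_example.py (assumed name)."""

    def test_strips_and_uppercases(self):
        self.assertEqual(normalize_cert_id('  bsi-dsz-cc-0001 '), 'BSI-DSZ-CC-0001')

    def test_is_idempotent(self):
        once = normalize_cert_id('anssi-cc-2020/01')
        self.assertEqual(normalize_cert_id(once), once)
```

Tests written in this style run under both `unittest` and `pytest`, so the change of runner does not require rewriting them.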
diff --git a/Dockerfile b/Dockerfile
@@ -1,5 +1,16 @@
 FROM ubuntu
 
+ARG NB_USER
+ARG NB_UID
+ENV USER ${NB_USER}
+ENV HOME /home/${NB_USER}
+
+RUN adduser --disabled-password \
+    --gecos "Default user" \
+    --uid ${NB_UID} \
+    ${NB_USER}
+WORKDIR ${HOME}
+
 #installing dependencies
 RUN apt-get update
 RUN apt-get install python3 -y
@@ -25,9 +36,10 @@ ENV PATH="$VIRTUAL_ENV/bin:$PATH"
 RUN cp /opt/sec-certs/requirements.txt .
 RUN pip install wheel
 RUN pip install -r requirements.txt
+RUN pip install --no-cache notebook
 #just to be sure that pdftotext is in $PATH
 ENV PATH /usr/bin/pdftotext:${PATH}
 
 
 # Run the application:
-CMD ["python3", "/opt/sec-certs/examples/cc_oop_demo.py"]
+CMD ["python3", "/opt/sec-certs/cc_cli.py"]
diff --git a/README.md b/README.md
@@ -7,13 +7,14 @@ This project is developed by the [Centre for Research On Cryptography and Securi
 [![Website](https://img.shields.io/website?down_color=red&down_message=offline&style=flat-square&up_color=SpringGreen&up_message=online&url=https%3A%2F%2Fseccerts.org)](https://seccerts.org)
 [![PyPI](https://img.shields.io/pypi/v/sec-certs?style=flat-square)](https://pypi.org/project/sec-certs/)
 [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sec-certs?label=Python%20versions&style=flat-square)](https://pypi.org/project/sec-certs/)
-[![GitHub Workflow Status](https://img.shields.io/github/workflow/status/crocs-muni/sec-certs/tests?style=flat-square)](https://github.com/crocs-muni/sec-certs/actions/workflows/GA_CI.yml)
+[![GitHub Workflow Status](https://img.shields.io/github/workflow/status/crocs-muni/sec-certs/tests?style=flat-square)](https://github.com/crocs-muni/sec-certs/actions/workflows/tests.yml)
 [![GitHub Workflow Status](https://img.shields.io/github/workflow/status/crocs-muni/sec-certs/Docker%20Image%20CI?label=Docker%20build&style=flat-square)](https://hub.docker.com/repository/docker/seccerts/sec-certs)
+[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/crocs-muni/sec-certs/cc-feature-parity?filepath=notebooks%2Fcc_data_exploration.ipynb)
 
 ## Installation (CC)
 
-The tool requires several Python packages as well as the `pdftotext` binary somewhere on the `PATH`.
-[
+The tool requires `Python >=3.8` and `pdftotext` binary somewhere on the `PATH`.
+
 The stable release is published on [PyPi](https://pypi.org/project/sec-certs/) as well as on [DockerHub](https://hub.docker.com/repository/docker/seccerts/sec-certs), you can install it with:
 
 ```
@@ -26,52 +27,75 @@ or
 docker pull seccerts/sec-certs
 ```
 
-Alternatively, you can setup the tool for development in a virtual environment, e.g.:
-Install Python virtual environment (if not yet):
-```
-python3 -m pip install --upgrade pip
-pip install virtualenv  
-```
-Setup new local one named 'virt' :
+Alternatively, you can set up the tool for development in a virtual environment:
+
 ```
-python3 -m venv virt
-. virt/bin/activate
+python3 -m venv venv
+source venv/bin/activate
 pip install -e .
 ```
 
-## Examples
+## Usage
+
+There are two main steps in exploring the world of Common Criteria certificates:
+
+1. Processing all the certificates
+2. Data exploration
+
+For the first step, we currently provide a CLI and an already processed, fresh snapshot. For the second step, we provide a simple API that can be used directly inside our Jupyter notebook or locally on your machine.
+
+### Explore data with MyBinder Jupyter notebook
+
+Most probably, you don't want to process a fresh snapshot of Common Criteria certificates yourself. Instead, you can use our results and explore them using the [online Jupyter notebook](https://mybinder.org/v2/gh/crocs-muni/sec-certs/cc-feature-parity?filepath=notebooks%2Fcc_data_exploration.ipynb).
+
+### Explore the latest snapshot locally
+
+In Python, run
+
+```python
+from sec_certs.dataset.common_criteria import CCDataset
+import pandas as pd
+
+dset = CCDataset.from_web_latest()  # Inspect the object; certificates are held in dset.certs
+df = dset.to_pandas()  # Or transform the object into a pandas DataFrame
+# Store the snapshot as JSON so that you don't have to download it again
+dset.to_json('./latest_cc_snapshot.json')
+dset = CCDataset.from_json('./latest_cc_snapshot.json')  # Load your stored dataset again
+```
+
+### Process CC data with Python
 
-Some examples are documented in [examples](https://github.com/crocs-muni/sec-certs/blob/master/examples/)
+If you wish to fully process the Common Criteria (CC) data by yourself, you can do that as follows. Running
 
-## Old API
+```
+cc-cli all --output ./cc_dataset
+```
+
+will fully process the Common Criteria dataset, which can take up to 6 hours to finish. You can also select only some tasks to run. Calling `cc-cli --help` yields
 
-The following steps will do a full extraction and analysis of CC certificates:
+```
+Usage: cc_cli.py [OPTIONS] [all|build|download|convert|analyze|maintenances]...
 
- 1. Make a directory in which the certificates will be downloaded and processing will take place.
-    The contents of the directory are under the control of the tool, and **may be overwritten**!
- 2. Run `python process_certificates.py --fresh --do-download-meta <dir>` to download certificate metadata from the Common Criteria portal.
- 3. Run `python process_certificates.py --fresh --do-extraction-meta <dir>` to extract metadata from the downloaded Common Criteria pages.
- 4. Run `python process_certificates.py --fresh --do-download-certs <dir>` to download the certificate and security target PDF files. This
-    step takes time as there is quite a lot of files. It also takes up a lot of space (around 5GB). It is done in parallel
-    and the number of threads can be changed with the `-t/--threads` switch (the default is 4).
- 5. Run `python process_certificates.py --fresh --do-pdftotext <dir>` to convert the PDF files to text.
- 6. Run `python process_certificates.py --fresh --do-extraction <dir>` to extract information from the certificates and security targets.
- 7. Run `python process_certificates.py --fresh --do-pairing <dir>`.
- 8. Run `python process_certificates.py --fresh --do-processing <dir>` to run various heuristics which will create post-processed section
-   `processed` for every certificate (results are stored in `certificate_data_complete_processed.json`).
- 9. Run `python process_certificates.py --fresh --do-analysis <dir>` to perform analysis of certificates (various graphs, statistics...).
- 10. Open, look and enjoy graphs like `num_certs_in_years.png` or `num_certs_eal_in_years.png`. For `certid_graph.dot.pdf` 
-     and other large graphs use Chrome to display as Adobe Acrobat Reader will fail to show whole graph. 
+  Specify actions, sequence of one or more strings from the following list:
+  [all, build, download, convert, analyze] If 'all' is specified, all
+  actions run against the dataset. Otherwise, only selected actions will run
+  in the correct order.
 
+Options:
+  -o, --output DIRECTORY  Path where the output of the experiment will be
+                          stored. May overwrite existing content.
 
-## Extending the analysis
+  -c, --config FILE       Path to your own config yaml file that will override
+                          the default one.
 
-The analysis can be extended in several ways:
- 1. Additional keywords can be extracted from PDF files (modify `cert_rules.py`)
- 2. Data from `certificate_data_complete.json` can be analyzed in a novel way - this is why this project was concieved at the first place.
- 3. Help to fix problems in data extraction - some PDF files are corrupted, there are many typos even in certificate IDs...
+  -i, --input FILE        If set, the actions will be performed on a CC
+                          dataset loaded from JSON from the input path.
+
+  -s, --silent            If set, will not print to stdout
+  --help                  Show this message and exit.
+```
 
-## How to run the application with a Docker container
+### Process CC data with Docker
 
  1. pull the image from the DockerHub repository : `docker pull seccerts/sec-certs`
  2. run `docker run --volume ./processed_data:/opt/sec-certs/examples/debug_dataset -it seccerts/sec-certs`

diff --git a/cc_cli.py b/cc_cli.py
@@ -0,0 +1,105 @@
+#!/usr/bin/env python3
+from typing import Optional, List
+import click
+from pathlib import Path
+import logging
+import sys
+from datetime import datetime
+
+from sec_certs.configuration import config
+from sec_certs.dataset.common_criteria import CCDataset
+
+logger = logging.getLogger(__name__)
+
+
+@click.command()
+@click.argument('actions', required=True, nargs=-1, type=click.Choice(['all', 'build', 'download', 'convert', 'analyze', 'maintenances'], case_sensitive=False))
+@click.option('-o', '--output', type=click.Path(file_okay=False, dir_okay=True, writable=True, readable=True),
+              help='Path where the output of the experiment will be stored. May overwrite existing content.')
+@click.option('-c', '--config', 'configpath', default=None, type=click.Path(file_okay=True, dir_okay=False, writable=True, readable=True),
+              help='Path to your own config yaml file that will override the default one.')
+@click.option('-i', '--input', 'inputpath', type=click.Path(file_okay=True, dir_okay=False, writable=True, readable=True),
+              help='If set, the actions will be performed on a CC dataset loaded from JSON from the input path.')
+@click.option('-s', '--silent', is_flag=True, help='If set, will not print to stdout')
+def main(configpath: Optional[str], actions: List[str], inputpath: Optional[Path], output: Optional[Path], silent: bool):
+    """
+    Specify actions, sequence of one or more strings from the following list: [all, build, download, convert, analyze]
+    If 'all' is specified, all actions run against the dataset. Otherwise, only selected actions will run in the correct order.
+    """
+    file_handler = logging.FileHandler(config.log_filepath)
+    stream_handler = logging.StreamHandler(sys.stderr)
+    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
+    file_handler.setFormatter(formatter)
+    stream_handler.setFormatter(formatter)
+    handlers = [file_handler]
+
+    if output:
+        output = Path(output)
+
+    if not inputpath and not output:
+        print('Error: You did not specify path to load the dataset from, nor did you specify where dataset can be stored.')
+        sys.exit(1)
+
+    if not silent:
+        handlers.append(stream_handler)
+
+    logging.basicConfig(level=logging.INFO, handlers=handlers)
+    start = datetime.now()
+
+    if configpath:
+        try:
+            config.load(Path(configpath))
+        except FileNotFoundError:
+            print('Error: Bad path to configuration file')
+            sys.exit(1)
+        except ValueError as e:
+            print(f'Error: Bad format of configuration file: {e}')
+            sys.exit(1)
+
+    actions = {'build', 'download', 'convert', 'analyze'} if 'all' in actions else set(actions)
+
+    if inputpath and 'build' not in actions:
+        dset: CCDataset = CCDataset.from_json(Path(inputpath))
+        if output:
+            print('Warning: you provided both input and output paths. The dataset from input path will get copied to output path.')
+            dset.root_dir = output
+
+    if inputpath and 'build' in actions:
+        print(f'Warning: you wanted to build a dataset but you provided one in JSON -- that will be ignored. New one will be constructed at: {output}')
+
+    if 'build' in actions:
+        dset: CCDataset = CCDataset(certs={}, root_dir=output, name='CommonCriteria_dataset', description=f'Full CommonCriteria dataset snapshot {datetime.now().date()}')
+        dset.get_certs_from_web()
+    elif 'build' not in actions and not inputpath:
+        print('Error: If you do not provide input parameter, you must use \'build\' action to build dataset first.')
+        sys.exit(1)
+
+    if 'download' in actions:
+        if not dset.state.meta_sources_parsed:
+            print('Error: You want to download all pdfs, but the data from commoncriteria.org was not parsed. You must use \'build\' action first.')
+            sys.exit(1)
+        dset.download_all_pdfs()
+
+    if 'convert' in actions:
+        if not dset.state.pdfs_downloaded:
+            print('Error: You want to convert pdfs -> txt, but the pdfs were not downloaded. You must use \'download\' action first.')
+            sys.exit(1)
+        dset.convert_all_pdfs()
+
+    if 'analyze' in actions:
+        if not dset.state.pdfs_converted:
+            print('Error: You want to process txt documents of certificates, but pdfs were not converted. You must use \'convert\' action first.')
+            sys.exit(1)
+        dset.extract_data()
+        dset.compute_heuristics()
+
+    if 'maintenances' in actions:
+        if not dset.state.meta_sources_parsed:
+            print('Error: You want to process maintenance updates, but the data from commoncriteria.org was not parsed. You must use \'build\' action first.')
+            sys.exit(1)
+
+    end = datetime.now()
+    logger.info(f'The computation took {(end - start).total_seconds()} seconds.')
+
+
+if __name__ == '__main__':
+    main()
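Note on the action handling in `main()`: the chain of `if` blocks runs the selected actions in a fixed order and refuses an action whose prerequisite state flag is not set. The sketch below models that behaviour in isolation, assuming that each step sets the flag the next one checks. `State`, `plan`, and the two lookup tables are hypothetical names introduced for illustration, and the `maintenances` action is left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Canonical pipeline order; mirrors the sequence of `if` blocks in main().
PIPELINE = ['build', 'download', 'convert', 'analyze']

# Simplified model (an assumption): the flag each action needs and the flag it sets.
REQUIRES = {'download': 'meta_sources_parsed', 'convert': 'pdfs_downloaded', 'analyze': 'pdfs_converted'}
PROVIDES = {'build': 'meta_sources_parsed', 'download': 'pdfs_downloaded', 'convert': 'pdfs_converted'}


@dataclass
class State:
    """Stand-in for the dataset's state flags checked by the CLI."""
    meta_sources_parsed: bool = False
    pdfs_downloaded: bool = False
    pdfs_converted: bool = False


def plan(actions: Iterable[str], state: State) -> List[str]:
    """Return the requested actions in pipeline order, or raise if a prerequisite is missing."""
    requested = set(actions)
    if 'all' in requested:
        requested = set(PIPELINE)
    ordered = [action for action in PIPELINE if action in requested]
    for action in ordered:
        needed = REQUIRES.get(action)
        if needed and not getattr(state, needed):
            raise ValueError(f'{action!r} requires {needed}')
        provided = PROVIDES.get(action)
        if provided:
            setattr(state, provided, True)
    return ordered
```

Validating the whole plan before any step runs would let the CLI report a missing prerequisite up front instead of after hours of downloading; whether that fits the tool is a design choice for the authors.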
diff --git a/examples/cc_cpe_labeling.py b/examples/cc_cpe_labeling.py
@@ -2,14 +2,14 @@
 import logging
 from pathlib import Path
 
-from sec_certs.dataset import CCDataset
-import sec_certs.constants as constants
+from sec_certs.dataset.common_criteria import CCDataset
+from sec_certs.configuration import config
 
 logger = logging.getLogger(__name__)
 
 
 def main():
-    file_handler = logging.FileHandler(constants.LOGS_FILENAME)
+    file_handler = logging.FileHandler(config.log_filepath)
     stream_handler = logging.StreamHandler()
     formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
     file_handler.setFormatter(formatter)
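Note on the logging setup: the same file-plus-stream handler code appears in this example and in `cc_cli.py`, where the stream handler is skipped when `--silent` is set. A minimal sketch of a shared helper follows; `build_handlers` and its parameters are hypothetical and not part of sec-certs.

```python
import logging
import sys
from pathlib import Path
from typing import List


def build_handlers(log_path: Path, silent: bool = False) -> List[logging.Handler]:
    """Always log to a file; additionally log to stderr unless `silent` is set."""
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handlers: List[logging.Handler] = [logging.FileHandler(log_path)]
    if not silent:
        handlers.append(logging.StreamHandler(sys.stderr))
    for handler in handlers:
        handler.setFormatter(formatter)
    return handlers
```

A caller would then pass the result to `logging.basicConfig(level=logging.INFO, handlers=...)`, as the existing code already does with its handler list.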

diff --git a/examples/cc_oop_demo.py b/examples/cc_oop_demo.py
@@ -1,17 +1,15 @@
-from sec_certs.dataset import CCDataset
-from sec_certs.serialization import CustomJSONEncoder, CustomJSONDecoder
-import sec_certs.constants as constants
 from pathlib import Path
 from datetime import datetime
 import logging
-import json
-import pandas as pd
+
+from sec_certs.dataset.common_criteria import CCDataset
+from sec_certs.configuration import config
 
 logger = logging.getLogger(__name__)
 
 
 def main():
-    file_handler = logging.FileHandler(constants.LOGS_FILENAME)
+    file_handler = logging.FileHandler(config.log_filepath)
     stream_handler = logging.StreamHandler()
     formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
     file_handler.setFormatter(formatter)
@@ -28,24 +26,27 @@ def main():
     # explicitly dump to json
     dset.to_json(dset.json_path)
 
+    # Retrieve protection profile IDs
+    dset.process_protection_profiles()
+
     # Load dataset from JSON
     new_dset = CCDataset.from_json('./debug_dataset/cc_full_dataset.json')
     assert dset == new_dset
 
     # Download pdfs and update json
-    dset.download_all_pdfs(update_json=True)
+    dset.download_all_pdfs()
 
     # Convert pdfs to text and update json
-    dset.convert_all_pdfs(update_json=True)
+    dset.convert_all_pdfs()
 
     # Extract data from txt files and update json
-    dset.extract_data(update_json=True)
+    dset.extract_data()
 
     # transform to pandas DataFrame
     df = dset.to_pandas()
 
     # Compute heuristics on the dataset
-    dset.compute_heuristics(update_json=True)
+    dset.compute_heuristics()
 
     # Manually verify CPE findings and compute related cves
     # dset.manually_verify_cpe_matches(update_json=True)

diff --git a/examples/fips_oop_demo.py b/examples/fips_oop_demo.py
@@ -2,8 +2,8 @@
 from datetime import datetime
 import logging
 import click
-
-from sec_certs.dataset import FIPSDataset, FIPSAlgorithmDataset
+from sec_certs.dataset.fips import FIPSDataset
+from sec_certs.dataset.fips_algorithm import FIPSAlgorithmDataset
 from sec_certs.configuration import config