This repository has been archived by the owner on Sep 10, 2020. It is now read-only.

Add initial machine learning pipeline #57

Merged 27 commits on Nov 30, 2016
Commits:
44804a2
Specify evaluation type in YAML file
redshiftzero Sep 21, 2016
8dec954
Refactor engine creation and add initial DatasetLoader class
redshiftzero Sep 21, 2016
b221b39
Add generic function for getting feature importances of trained model
redshiftzero Sep 21, 2016
c2fb467
Add function for plotting feature importances
redshiftzero Sep 21, 2016
260ba7c
Add function for ROC plot
redshiftzero Sep 21, 2016
d4e306b
Add machine learning classifier and CV code, add Experiment() classif…
redshiftzero Sep 21, 2016
f47349f
Add function to load closed world dataset
redshiftzero Sep 21, 2016
43ea65b
Add scikit-learn classifiers and hyperparameters
redshiftzero Sep 22, 2016
42eba49
Add database setup for model storage
redshiftzero Sep 22, 2016
e25c950
Add model storage to database.py
redshiftzero Sep 22, 2016
c578a79
Add observed_fraction to open world validation
redshiftzero Sep 23, 2016
efe2732
Add requirements for machine learning codes
redshiftzero Sep 24, 2016
819d6a2
Add little guide describing how to integrate a new classifier into th…
redshiftzero Oct 8, 2016
e1ecf56
Add feature scaling option to attack setup
redshiftzero Oct 10, 2016
0d1db54
Add figures to documentation directory
redshiftzero Oct 10, 2016
42e272c
Add writeup/readme on how pipeline works
redshiftzero Oct 10, 2016
c365bbd
Add tests for custom evaluation code
redshiftzero Oct 11, 2016
286529f
Add a ton of evaluation metrics to the models table
redshiftzero Oct 11, 2016
01dd6b9
Feedback from code review
redshiftzero Nov 16, 2016
0955916
Move requirements
redshiftzero Nov 17, 2016
2c2b442
Add Ansible configuration of models schema
redshiftzero Nov 17, 2016
4977ac3
Add new pip requirements from pip-compile.sh
redshiftzero Nov 23, 2016
ce1e0ad
Point database base class to use the PGPASSFILE
redshiftzero Nov 28, 2016
fe28cfe
Label column should always be is_sd
redshiftzero Nov 28, 2016
d0d8896
Database classes should use test keyword
redshiftzero Nov 28, 2016
4fc9d54
Add test keyword in database class
redshiftzero Nov 28, 2016
5a47434
Replace get_dict() with __dict__
redshiftzero Nov 28, 2016
9 changes: 9 additions & 0 deletions CONTRIB.md
@@ -0,0 +1,9 @@
# Adding Features

Writing a new feature can be done by adding a method in `fpsd/features.py`.

# Adding Classifiers

A new classifier can be added by adding a new stanza to the `Experiment._get_model_object()` method in `classify.py`. This method must return an object that defines `fit()` (for model training) and `predict_proba()` (for scoring the test set), since that is the interface scikit-learn uses for its classifier objects.

To get the code to use the classifier, add a string corresponding to its name to your attack YAML file under `models` and add any hyperparameters and options (e.g. number of rounds for Wa-kNN) under `parameters`.
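
As a rough illustration, a new stanza might look like the following. This is a hypothetical sketch: the actual body of `Experiment._get_model_object()` is not shown in this pull request, and the argument names here are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier


def _get_model_object(self, model_type, hyperparameters):
    # Hypothetical stanza: map a model name from the attack YAML to an
    # object exposing fit() and predict_proba(), as scikit-learn does.
    if model_type == "RandomForest":
        return RandomForestClassifier(**hyperparameters)
    # ... stanzas for the other supported models ...
    raise ValueError("Unsupported model type: {}".format(model_type))
```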
85 changes: 85 additions & 0 deletions docs/Pipeline.md
@@ -0,0 +1,85 @@
# Web Fingerprinting Analysis Pipeline

![](images/pipeline.png)

## Feature Generation

Our feature generation code is primarily in SQL. It takes the data the crawlers dump into the `raw` schema, generates all of the Tor-traffic features from Wang et al. 2014, and stores the results in the `features` schema:

![](images/feature_generation.png)

Run this step with:

```./features.py```

## Machine Learning

This step:

* takes the features in the database,
* trains a series of binary classifiers,
* evaluates how well each classifier performs,
* and then saves performance metrics in the database and pickles the trained model objects for future scoring.

Run this step with:

```./attack.py -c my_attack_file.yaml```

### Attack Setup

The machine learning part of the code takes a YAML file (by default `attack.yaml`) as input to specify details of the models that should be generated. Here are the options that are currently implemented:

* `world`: specifies what kind of cross validation should be performed.
    * `type`: `closed` or `open`
    * `observed_fraction`: specifies the fraction of the world that is "observed" (measured by the adversary) for open world validation.

* `num_kfolds`: value of k for k-fold cross-validation

* `feature_scaling`: this option will take the features and [rescale](https://en.wikipedia.org/wiki/Feature_scaling) them to [zero mean and unit standard deviation](https://en.wikipedia.org/wiki/Standard_score). For some classifiers, primarily those based on decision trees, this should not improve performance, but for many others, e.g. SVM, it is necessary. See also [scikit-learn's documentation](http://scikit-learn.org/stable/modules/preprocessing.html). A minimal scaling sketch follows this list.

* `models`: a list of types of binary classifiers that should be trained

* `parameters`: this option specifies the range of hyperparameters that should be used for each classifier type

For more details and a complete example, see `attack.yaml`.
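
For example, the rescaling described under `feature_scaling` can be done with scikit-learn's `StandardScaler`; the snippet below is a minimal illustration, not the pipeline's actual implementation, and the arrays are made-up data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply the same
# transformation to the test features to avoid leaking test-set statistics.
x_train = np.array([[100.0, 2.0], [200.0, 4.0], [300.0, 6.0]])
x_test = np.array([[150.0, 3.0]])

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # zero mean, unit std per feature
x_test_scaled = scaler.transform(x_test)
```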

### Model Training and Evaluation

When this step of the pipeline runs, it will:

* get the features from the database,
* split the data into train/test sets,
* generate a series of experiments covering every possible combination of preprocessing option, model type, and hyperparameter set,
* for each experiment (a sketch of this loop follows the list):
    * for every train/test split, it will:
        * train on the training set,
        * evaluate on the testing set,
        * save the metrics in the database in `models.undefended_frontpage_folds`,
        * pickle the trained model and save it for future scoring
    * average the metrics for the folds from that experiment and save them in the database in `models.undefended_frontpage_attacks`.
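
The following sketch shows the general shape of that per-experiment loop using generic scikit-learn pieces; the function and variable names are illustrative and are not the pipeline's actual classes or schema-writing code.

```python
import pickle

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def train_eval_all_folds_sketch(model, x, y, num_kfolds=10):
    """Illustrative k-fold loop: train, evaluate, and pickle each fold's model."""
    fold_aucs = []
    folds = StratifiedKFold(n_splits=num_kfolds, shuffle=True)
    for fold, (train_idx, test_idx) in enumerate(folds.split(x, y)):
        model.fit(x[train_idx], y[train_idx])
        scores = model.predict_proba(x[test_idx])[:, 1]
        fold_aucs.append(roc_auc_score(y[test_idx], scores))
        # Per-fold metrics would go to models.undefended_frontpage_folds,
        # and the trained model would be pickled for future scoring.
        with open("model_fold_{}.pkl".format(fold), "wb") as f:
            pickle.dump(model, f)
    # The averaged metrics would go to models.undefended_frontpage_attacks.
    return sum(fold_aucs) / len(fold_aucs)
```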

### Evaluation and Output in Model Schema

The following information and evaluation metrics are stored in the database table `models.undefended_frontpage_folds`:

* `auc`: [Area under the ROC curve](http://people.inf.elte.hu/kiss/12dwhdm/roc.pdf)
* `tpr`: true positive rate (array over scikit-learn's default thresholds)
* `fpr`: false positive rate (array over scikit-learn's default thresholds)
* `precision_at_k` for `k=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]`: "Fraction of SecureDrop users correctly identified in the top k percent of the testing set"
* `recall_at_k` for `k=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]`: "Number of SecureDrop users captured by flagging the top k percent of the testing set"
* `f1_at_k` for `k=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]`

The same metrics are then computed over all folds and saved in `models.undefended_frontpage_attacks`, in addition to:

* `world_type`
* `train_class_balance`
* `base_rate` (test class balance)
* `observed_world_size` if in open world validation
* `model_type`
* `hyperparameters` in json format

The `model_timestamp` and `fold_timestamp` are saved as identifiers in `models.undefended_frontpage_folds` and the `model_timestamp` is saved in `models.undefended_frontpage_attacks`.
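
As an illustration of the `*_at_k` metrics, a minimal way to compute precision at k percent is shown below; this is a sketch, not the pipeline's own evaluation code.

```python
import numpy as np


def precision_at_k(y_true, y_scores, k_percent):
    """Fraction of true SecureDrop users among the examples ranked in the
    top k percent by predicted score (illustrative implementation)."""
    n_top = max(1, int(round(len(y_scores) * k_percent / 100.0)))
    top_idx = np.argsort(y_scores)[::-1][:n_top]  # highest scores first
    return float(np.mean(np.asarray(y_true)[top_idx]))
```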

## Model Selection

This is currently done manually by selecting the top `auc` model in the database.
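
A hedged sketch of that manual step, assuming a psycopg2 connection and the table/column names described above (the connection details are placeholders):

```python
import psycopg2

# Placeholder connection; in practice credentials come from the usual
# PostgreSQL configuration (e.g. a PGPASSFILE).
conn = psycopg2.connect(dbname="fpsd")
with conn.cursor() as cur:
    cur.execute(
        "SELECT model_timestamp, model_type, hyperparameters, auc "
        "FROM models.undefended_frontpage_attacks "
        "ORDER BY auc DESC LIMIT 1")
    best_model = cur.fetchone()
conn.close()
```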
Contributor:

I think in WTF-PAD they correctly point out that precision and recall are more important measures than TPR and FPR. They considered F1 to be the most important metric, but I personally believe that F0.5 is closer to the most informative single metric we can look at, given our specific problem. There is a function `sklearn.metrics.fbeta_score` that will compute F_beta.

Contributor Author:

It's true that precision, recall, and F1 score are better metrics, but for this first pass we are using AUC since it's independent of the class balance in the testing set (see Figure 5 in this paper). I filed #62 to implement precision, recall, and F1 storage. I'm not following why F_{0.5} would be a better metric, though?

Contributor:

Let's discuss in-person.
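
As a side note on the `sklearn.metrics.fbeta_score` function mentioned in this thread, a minimal usage sketch with made-up labels:

```python
from sklearn.metrics import fbeta_score

y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1]

# beta < 1 weights precision more heavily than recall; beta=0.5 gives F0.5.
print(fbeta_score(y_true, y_pred, beta=0.5))
```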

Binary file added docs/images/feature_generation.png
Binary file added docs/images/pipeline.png
79 changes: 79 additions & 0 deletions fpsd/attack.py
@@ -0,0 +1,79 @@
#!/usr/bin/env python3.5
import argparse
import datetime
from itertools import product
import pdb
import pickle
import yaml

import classify, database


def run(options):
"""Takes an attack file, gets the features, and runs all experiments
and saves the output in the database.

Args:
options [dict]: attack setup file
"""
Contributor:

The last code I wrote (see the docstring beginning """Return an :obj:collections.OrderedDict from ``dict_str``.), I wrote with the documentation style described in #53. I'm not absolutely set on one particular style, but we should be consistent. If you want to make other suggestions, do so in #53 and we can discuss, but if you also like the python-gnupg style, then you should make the appropriate changes here.

Contributor Author:

I was using Google-style docstrings, but I don't have strong feelings one way or the other, so the docstring format you like there is fine with me, and from now on I will follow it. However, I wrote most of the existing docstrings in this branch before that issue was filed, so in the interest of time I would rather not rewrite them at this stage.

Contributor:

Sgtm.


    with open(options, 'r') as f:
        options = yaml.load(f)

    db = database.DatasetLoader(test=False)

    df = db.load_world(options["world"]["type"])

    df = classify.imputation(df)
    x = df.drop(['exampleid', 'is_sd'], axis=1).values
    y = df['is_sd'].astype(int).values

    for experiment in generate_experiments(options):
        experiment.train_eval_all_folds(x, y)


def generate_experiments(options):
"""Takes an attack file and generates all the experiments that
should be run

Args:
options [dict]: attack setup file

Returns:
all_experiments [list]: list of Experiment objects
"""

all_experiments = []

for model in options["models"]:
model_hyperparameters = options["parameters"][model]

parameter_names = sorted(model_hyperparameters)
parameter_values = [model_hyperparameters[p] for p in parameter_names]

# Compute Cartesian product of hyperparameter lists
all_params = product(*parameter_values)
Contributor:

At first I thought this was the multiplicative product; a note specifying it's the Cartesian product would help readability.

Contributor Author:

Done


        for param in all_params:
            parameters = {name: value for name, value
                          in zip(parameter_names, param)}

            timestamp = datetime.datetime.now().isoformat()

            all_experiments.append(classify.Experiment(
                model_timestamp=timestamp,
                world=options["world"],
                model_type=model,
                hyperparameters=parameters,
                feature_scaling=options["feature_scaling"]))
    return all_experiments


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--config", dest="config", type=str,
                        default="attack.yaml",
                        help="point to attack config/setup file")
    args = parser.parse_args()

    run(args.config)
91 changes: 91 additions & 0 deletions fpsd/attack.yaml
@@ -0,0 +1,91 @@
#################
# Train/Test #
#################

world:
  type: 'closed' # 'closed' or 'open'
  observed_fraction: 0.20 # fraction of the open world that can be measured by the adversary
Contributor:

It might be fun to make this a hyperparameter to play with as well; it would help in threat modeling, though it is computationally costly. It could just be set to [0.20] initially, and if we ever want to investigate later, we can add other values.

Contributor Author:

Yep, we can make this take a list at a later point; I updated issue #64 to keep this suggestion.

num_kfolds: 10



######################
# Preprocessing #
######################

feature_scaling: True # Rescale each feature to mean zero and unit standard deviation



#################
# Classifiers #
#################

# All supported models
# model: ['RandomForest', 'RandomForestBagging', 'RandomForestBoosting', 'ExtraTrees',
# 'AdaBoost', 'LogisticRegression', 'SVM', 'GradientBoostingClassifier',
# 'DecisionTreeClassifier', 'SGDClassifier', 'KNeighborsClassifier']

models: ['RandomForest', 'ExtraTrees', 'DecisionTreeClassifier', 'RandomForestBagging',
         'RandomForestBoosting']
parameters:
  RandomForest:
    n_estimators: [25, 50, 100] #[25, 50] # 100, 1000, 10000, 10
    max_depth: [10, 20, 50, 100] # 50, 100, 5
    max_features: ['sqrt', 'log2', 2, 4, 8, 16, "auto"]
    criterion: ['gini', 'entropy']
    min_samples_split: [2, 5, 10]
  RandomForestBagging:
    n_estimators: [10] # [25, 50, 100, 1000, 10000]
    max_depth: [5] # [10, 20, 50, 100]
    max_features: ['sqrt'] # ['log2', 2, 4, 8, 16, "auto"]
    criterion: ['gini'] # ['entropy']
    min_samples_split: [2] # [5, 10]
    max_samples: [0.5] # [1.0]
    bootstrap: [True]
    bootstrap_features: [False] # [True]
    n_estimators_bag: [10] # [25, 50, 100, 1000, 10000]
    max_features_bag: [2] # [4, 8, 16]
  RandomForestBoosting:
    n_estimators: [100] # [25, 50, 100, 1000, 10000]
    max_depth: [20] # [10, 20, 50, 100]
    max_features: [2] # ['sqrt', 'log2', 2, 4, 8, 16, "auto"]
    criterion: ['gini'] # ['entropy']
    min_samples_split: [2] # [5, 10]
    algorithm: ['SAMME'] # ['SAMME.R']
    learning_rate: [0.01] # [0.1, 1, 10, 100]
    n_estimators_boost: [10] # [25, 50, 100, 1000, 10000]
  ExtraTrees:
    n_estimators: [10] # [25, 50, 100, 1000, 10000]
    max_depth: [3] # 5, 10] # [20, 50, 100]
    max_features: ['log2'] # [4, 8, 16, "auto"]
    criterion: ['gini'] #, 'entropy']
    min_samples_split: [2] #, 5, 10]
  AdaBoost:
    algorithm: ['SAMME', 'SAMME.R']
    n_estimators: [1, 10, 100] # [1000, 10000]
    learning_rate: [0.01, 0.1, 1, 10, 100]
  LogisticRegression:
    C_reg: [0.00001, 0.0001, 0.001, 0.01, 0.1] # [1, 10]
    penalty: ['l1', 'l2']
  SVM:
    C_reg: [0.00001, 0.0001, 0.001, 0.01, 0.1] # [1, 10]
    kernel: ['linear']
  GradientBoostingClassifier:
    n_estimators: [1, 10, 100] # [1000, 10000]
    learning_rate: [0.001, 0.01, 0.05, 0.1, 0.5]
    subsample: [0.1, 0.5, 1.0]
    max_depth: [1, 3, 5, 10, 20] # [50, 100]
  DecisionTreeClassifier:
    criterion: ['gini', 'entropy']
    max_depth: [1, 5, 10, 20] # [50, 100]
    max_features: ['sqrt', 'log2']
    min_samples_split: [2, 5, 10]
  SGDClassifier:
    loss: ['log', 'modified_huber']
    penalty: ['l1', 'l2', 'elasticnet']
  KNeighborsClassifier:
    n_neighbors: [1, 3, 5, 10, 25, 50, 100]
    weights: ['uniform', 'distance']
    algorithm: ['auto', 'kd_tree']
