# Add initial machine learning pipeline #57

@@ -0,0 +1,9 @@

# Adding Features

To write a new feature, add a method in `fpsd/features.py`.
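
A minimal sketch of what such a method might look like; the method name, trace representation, and return convention here are illustrative assumptions, not the actual `fpsd/features.py` API:

```python
# Hypothetical feature method; the real fpsd/features.py API may differ.
def total_incoming_cells(self, trace):
    """Count cells sent from the onion service to the client.

    Assumes `trace` is a list of (timestamp, direction) tuples,
    with direction +1 for outgoing and -1 for incoming cells.
    """
    return sum(1 for _, direction in trace if direction == -1)
```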

# Adding Classifiers

To add a new classifier, add a stanza to the `Experiment._get_model_object()` method in `classify.py`. The stanza must return an object that defines `fit()` (trains the model) and `predict_proba()` (predicts scores on the test set); the pipeline expects these methods because they are what scikit-learn defines on its classifier objects.
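
For instance, a stanza for a hypothetical Gaussian naive Bayes model might look like the following; the method signature and dispatch structure shown here are assumptions, only the `fit()`/`predict_proba()` contract comes from the text above:

```python
from sklearn.naive_bayes import GaussianNB

class Experiment:
    # ...existing attributes and methods...

    def _get_model_object(self, model_type, hyperparameters):
        # Hypothetical stanza: GaussianNB defines both fit() and
        # predict_proba(), so it satisfies the expected interface.
        if model_type == "GaussianNB":
            return GaussianNB(**hyperparameters)
        raise ValueError("unsupported model type: {}".format(model_type))
```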

To have the pipeline use the classifier, add its name as a string to the `models` list in your attack YAML file, and add any hyperparameters and options (e.g. the number of rounds for Wa-kNN) under `parameters`.
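
For the hypothetical `GaussianNB` example above, the YAML additions might look like this (the hyperparameter grid shown is illustrative):

```yaml
models: ['RandomForest', 'GaussianNB']
parameters:
  GaussianNB:
    # Hypothetical hyperparameter grid
    var_smoothing: [1.0e-9, 1.0e-8]
```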

@@ -0,0 +1,85 @@

# Web Fingerprinting Analysis Pipeline

![](images/pipeline.png)

## Feature Generation

Our feature generation code is primarily SQL: it takes the data the crawlers dump into the `raw` schema, generates all the features relevant to Tor traffic from Wang et al. 2014, and stores the results in the `features` schema:

![](images/feature_generation.png)
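
As an illustration, one feature of this kind might be computed along these lines; the table and column names below are assumptions, not the actual schema:

```sql
-- Hypothetical sketch: total incoming cell count per crawled trace.
-- Table and column names are assumptions, not the real schema.
INSERT INTO features.cell_counts (exampleid, total_incoming_cells)
SELECT exampleid, COUNT(*) AS total_incoming_cells
FROM raw.frontpage_traces
WHERE is_incoming
GROUP BY exampleid;
```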

Run this step with:

```./features.py```

## Machine Learning

This step:

* takes the features in the database,
* trains a series of binary classifiers,
* evaluates how well each classifier performs,
* and then saves the performance metrics in the database, as well as pickling the trained model objects (for use in future scoring).

Run this step with:

```./attack.py -c my_attack_file.yaml```

### Attack Setup

The machine learning part of the code takes a YAML file (by default `attack.yaml`) as input, specifying details of the models that should be generated. The currently implemented options are:

* `world`: specifies what kind of cross-validation should be performed.
  * `type`: `closed` or `open`
  * `observed_fraction`: the fraction of the world that is "observed" (measured by the adversary) in open-world validation.

* `num_kfolds`: the value of k for k-fold cross-validation.

* `feature_scaling`: takes the features and [rescales](https://en.wikipedia.org/wiki/Feature_scaling) them to [zero mean and unit standard deviation](https://en.wikipedia.org/wiki/Standard_score); see the sketch below. For some classifiers, primarily those based on decision trees, this should not improve performance, but for many others, e.g. SVM, it is necessary. See also [scikit-learn's documentation](http://scikit-learn.org/stable/modules/preprocessing.html).

* `models`: a list of the types of binary classifiers to train.

* `parameters`: the range of hyperparameters to try for each classifier type.

For more details, see the example `attack.yaml`.
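
The `feature_scaling` option corresponds to standardization; here is a minimal sketch of the transformation using scikit-learn's `StandardScaler` (whether the pipeline uses this exact class internally is an assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Rescale each feature (column) to zero mean and unit standard deviation.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # approximately [1. 1.]
```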

### Model Training and Evaluation

When this step of the pipeline runs, it will (see the schematic sketch after this list):

* get the features from the database,
* split the data into train/test sets,
* generate a series of experiments to be run, trying every possible combination of preprocessing option, model type, and hyperparameter set,
* for each experiment:
  * for every train/test split:
    * train on the training set,
    * evaluate on the testing set,
    * save the metrics in the database in `models.undefended_frontpage_folds`,
    * pickle the trained model and save it for future scoring,
  * average the metrics for the folds from that experiment and save them in the database in `models.undefended_frontpage_attacks`.
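
A schematic of the per-experiment train/evaluate loop, assuming scikit-learn-style cross-validation; the model class, the persistence steps, and the function shape below are stand-ins, not the pipeline's actual code:

```python
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def train_eval_all_folds(x, y, num_kfolds=10):
    """Schematic sketch: train and evaluate one experiment on every fold."""
    fold_aucs = []
    for fold, (train_idx, test_idx) in enumerate(
            StratifiedKFold(n_splits=num_kfolds).split(x, y)):
        model = RandomForestClassifier()  # stand-in for _get_model_object()
        model.fit(x[train_idx], y[train_idx])
        scores = model.predict_proba(x[test_idx])[:, 1]
        fold_aucs.append(roc_auc_score(y[test_idx], scores))
        # Stand-in for saving fold metrics to models.undefended_frontpage_folds,
        # plus pickling the trained model for future scoring:
        with open("model_fold_{}.pkl".format(fold), "wb") as f:
            pickle.dump(model, f)
    # The fold average would go to models.undefended_frontpage_attacks.
    return sum(fold_aucs) / len(fold_aucs)
```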

### Evaluation and Output in Model Schema

The following information and evaluation metrics are stored in the database table `models.undefended_frontpage_folds`:

* `auc`: [area under the ROC curve](http://people.inf.elte.hu/kiss/12dwhdm/roc.pdf)
* `tpr`: true positive rate (an array over the default scikit-learn thresholds)
* `fpr`: false positive rate (an array over the default scikit-learn thresholds)
* `precision_at_k` for `k=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]`: the fraction of SecureDrop users correctly identified in the top k percent of the testing set
* `recall_at_k` for `k=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]`: the number of SecureDrop users captured by flagging the top k percent of the testing set
* `f1_at_k` for `k=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]`
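
A minimal sketch of how a precision-at-k-percent metric can be computed; this is a generic illustration, not necessarily the pipeline's implementation:

```python
import numpy as np

def precision_at_k_percent(y_true, y_scores, k):
    """Precision among the top k percent highest-scored examples."""
    n_top = max(1, int(len(y_scores) * k / 100.0))
    top_idx = np.argsort(y_scores)[::-1][:n_top]
    return np.mean(np.asarray(y_true)[top_idx])

# Example: flag the top 5% of the testing set.
y_true = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
y_scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.4, 0.2, 0.7, 0.3]
print(precision_at_k_percent(y_true, y_scores, 5))
```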

The same metrics are then computed over all folds and saved in `models.undefended_frontpage_attacks`, along with:

* `world_type`
* `train_class_balance`
* `base_rate` (the test class balance)
* `observed_world_size` (open-world validation only)
* `model_type`
* `hyperparameters`, in JSON format

The `model_timestamp` and `fold_timestamp` are saved as identifiers in `models.undefended_frontpage_folds`, and the `model_timestamp` is saved in `models.undefended_frontpage_attacks`.

## Model Selection

Model selection is currently done manually, by choosing the model with the top `auc` in the database.
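
For example, a query along these lines would surface that model; the exact column set is an assumption based on the metrics described above:

```sql
-- Hypothetical selection query over the attacks table.
SELECT model_timestamp, model_type, hyperparameters, auc
FROM models.undefended_frontpage_attacks
ORDER BY auc DESC
LIMIT 1;
```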

@@ -0,0 +1,79 @@

    #!/usr/bin/env python3.5
    import argparse
    import datetime
    from itertools import product

    import yaml

    import classify
    import database


    def run(options):
        """Takes an attack file, gets the features, runs all experiments,
        and saves the output in the database.

        Args:
            options [str]: path to the attack setup (YAML) file
        """

> **Review comment:** The last code I wrote (see `fingerprint-securedrop/fpsd/utils.py`, line 28 in fecb7fa) […]
>
> **Reply:** I was using Google-style docstrings, but I don't have strong feelings one way or the other, so the docstring format you like there is fine with me, and from now on I will follow it. However, I wrote most of the existing docstrings in this branch before that issue was filed, so in the interest of time I would rather not rewrite the Google-style docstrings at this stage.
>
> **Reply:** Sgtm.

        with open(options, 'r') as f:
            options = yaml.safe_load(f)

        db = database.DatasetLoader(test=False)

        df = db.load_world(options["world"]["type"])

        df = classify.imputation(df)
        x = df.drop(['exampleid', 'is_sd'], axis=1).values
        y = df['is_sd'].astype(int).values

        for experiment in generate_experiments(options):
            experiment.train_eval_all_folds(x, y)


    def generate_experiments(options):
        """Takes an attack file and generates all the experiments that
        should be run.

        Args:
            options [dict]: parsed attack setup options

        Returns:
            all_experiments [list]: list of Experiment objects
        """

        all_experiments = []

        for model in options["models"]:
            model_hyperparameters = options["parameters"][model]

            parameter_names = sorted(model_hyperparameters)
            parameter_values = [model_hyperparameters[p] for p in parameter_names]

            # Compute the Cartesian product of the hyperparameter lists
            all_params = product(*parameter_values)

> **Review comment:** At first I thought this was the multiplicative product; a note specifying that it's the Cartesian product would help readability.
>
> **Reply:** Done.

            for param in all_params:
                parameters = {name: value for name, value
                              in zip(parameter_names, param)}

                timestamp = datetime.datetime.now().isoformat()

                all_experiments.append(classify.Experiment(
                    model_timestamp=timestamp,
                    world=options["world"],
                    model_type=model,
                    hyperparameters=parameters,
                    feature_scaling=options["feature_scaling"]))
        return all_experiments


    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("-c", "--config", dest="config", type=str,
                            default="attack.yaml",
                            help="point to attack config/setup file")
        args = parser.parse_args()

        run(args.config)

@@ -0,0 +1,91 @@

    #################
    #  Train/Test   #
    #################

    world:
      type: 'closed'  # 'closed' or 'open'
      observed_fraction: 0.20  # fraction of the open world that can be measured by the adversary

> **Review comment:** It might be fun to make this a hyperparameter to play with as well; it would help in threat modeling. Though it is computationally costly, so it could just be set to […]
>
> **Reply:** Yep, we can make this take a list at a later point; updated issue #64 to keep this suggestion.

    num_kfolds: 10


    ######################
    #   Preprocessing    #
    ######################

    feature_scaling: True  # Rescale each feature to zero mean and unit standard deviation


    #################
    #  Classifiers  #
    #################

    # All supported models:
    # models: ['RandomForest', 'RandomForestBagging', 'RandomForestBoosting', 'ExtraTrees',
    #          'AdaBoost', 'LogisticRegression', 'SVM', 'GradientBoostingClassifier',
    #          'DecisionTreeClassifier', 'SGDClassifier', 'KNeighborsClassifier']

    models: ['RandomForest', 'ExtraTrees', 'DecisionTreeClassifier', 'RandomForestBagging',
             'RandomForestBoosting']
    parameters:
      RandomForest:
        n_estimators: [25, 50, 100]  # [10, 1000, 10000]
        max_depth: [10, 20, 50, 100]  # [5]
        max_features: ['sqrt', 'log2', 2, 4, 8, 16, 'auto']
        criterion: ['gini', 'entropy']
        min_samples_split: [2, 5, 10]
      RandomForestBagging:
        n_estimators: [10]  # [25, 50, 100, 1000, 10000]
        max_depth: [5]  # [10, 20, 50, 100]
        max_features: ['sqrt']  # ['log2', 2, 4, 8, 16, 'auto']
        criterion: ['gini']  # ['entropy']
        min_samples_split: [2]  # [5, 10]
        max_samples: [0.5]  # [1.0]
        bootstrap: [True]
        bootstrap_features: [False]  # [True]
        n_estimators_bag: [10]  # [25, 50, 100, 1000, 10000]
        max_features_bag: [2]  # [4, 8, 16]
      RandomForestBoosting:
        n_estimators: [100]  # [25, 50, 1000, 10000]
        max_depth: [20]  # [10, 50, 100]
        max_features: [2]  # ['sqrt', 'log2', 4, 8, 16, 'auto']
        criterion: ['gini']  # ['entropy']
        min_samples_split: [2]  # [5, 10]
        algorithm: ['SAMME']  # ['SAMME.R']
        learning_rate: [0.01]  # [0.1, 1, 10, 100]
        n_estimators_boost: [10]  # [25, 50, 100, 1000, 10000]
      ExtraTrees:
        n_estimators: [10]  # [25, 50, 100, 1000, 10000]
        max_depth: [3]  # [5, 10, 20, 50, 100]
        max_features: ['log2']  # [4, 8, 16, 'auto']
        criterion: ['gini']  # ['entropy']
        min_samples_split: [2]  # [5, 10]
      AdaBoost:
        algorithm: ['SAMME', 'SAMME.R']
        n_estimators: [1, 10, 100]  # [1000, 10000]
        learning_rate: [0.01, 0.1, 1, 10, 100]
      LogisticRegression:
        C_reg: [0.00001, 0.0001, 0.001, 0.01, 0.1]  # [1, 10]
        penalty: ['l1', 'l2']
      SVM:
        C_reg: [0.00001, 0.0001, 0.001, 0.01, 0.1]  # [1, 10]
        kernel: ['linear']
      GradientBoostingClassifier:
        n_estimators: [1, 10, 100]  # [1000, 10000]
        learning_rate: [0.001, 0.01, 0.05, 0.1, 0.5]
        subsample: [0.1, 0.5, 1.0]
        max_depth: [1, 3, 5, 10, 20]  # [50, 100]
      DecisionTreeClassifier:
        criterion: ['gini', 'entropy']
        max_depth: [1, 5, 10, 20]  # [50, 100]
        max_features: ['sqrt', 'log2']
        min_samples_split: [2, 5, 10]
      SGDClassifier:
        loss: ['log', 'modified_huber']
        penalty: ['l1', 'l2', 'elasticnet']
      KNeighborsClassifier:
        n_neighbors: [1, 3, 5, 10, 25, 50, 100]
        weights: ['uniform', 'distance']
        algorithm: ['auto', 'kd_tree']

> **Review comment:** I think in WTF-PAD they correctly point out that precision and recall are more important measures than TPR and FPR. They considered F1 to be the most important metric, but I personally believe that F0.5 is closer to the most informative single metric we can look at, given our specific problem. There is a function `sklearn.metrics.fbeta_score` that will compute Fβ.
>
> **Reply:** It's true that precision, recall, and F1 score are better metrics, but for this first pass we are using AUC, since it's independent of the class balance in the testing set (see Figure 5 in this paper). I filed #62 to implement precision, recall, and F1 storage. Not following why F0.5 would be a better metric?
>
> **Reply:** Let's discuss in person.
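
For reference, a minimal sketch of the `sklearn.metrics.fbeta_score` usage mentioned above; the inputs are made-up illustrative labels:

```python
from sklearn.metrics import fbeta_score

y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# beta < 1 weights precision more heavily than recall;
# beta > 1 weights recall more heavily.
print(fbeta_score(y_true, y_pred, beta=0.5))
```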