This repository has been archived by the owner on Sep 10, 2020. It is now read-only.

Add initial machine learning pipeline #57

Merged 27 commits on Nov 30, 2016
Commits:
44804a2
Specify evaluation type in YAML file
redshiftzero Sep 21, 2016
8dec954
Refactor engine creation and add initial DatasetLoader class
redshiftzero Sep 21, 2016
b221b39
Add generic function for getting feature importances of trained model
redshiftzero Sep 21, 2016
c2fb467
Add function for plotting feature importances
redshiftzero Sep 21, 2016
260ba7c
Add function for ROC plot
redshiftzero Sep 21, 2016
d4e306b
Add machine learning classifier and CV code, add Experiment() classif…
redshiftzero Sep 21, 2016
f47349f
Add function to load closed world dataset
redshiftzero Sep 21, 2016
43ea65b
Add scikit-learn classifiers and hyperparameters
redshiftzero Sep 22, 2016
42eba49
Add database setup for model storage
redshiftzero Sep 22, 2016
e25c950
Add model storage to database.py
redshiftzero Sep 22, 2016
c578a79
Add observed_fraction to open world validation
redshiftzero Sep 23, 2016
efe2732
Add requirements for machine learning codes
redshiftzero Sep 24, 2016
819d6a2
Add little guide describing how to integrate a new classifier into th…
redshiftzero Oct 8, 2016
e1ecf56
Add feature scaling option to attack setup
redshiftzero Oct 10, 2016
0d1db54
Add figures to documentation directory
redshiftzero Oct 10, 2016
42e272c
Add writeup/readme on how pipeline works
redshiftzero Oct 10, 2016
c365bbd
Add tests for custom evaluation code
redshiftzero Oct 11, 2016
286529f
Add a ton of evaluation metrics to the models table
redshiftzero Oct 11, 2016
01dd6b9
Feedback from code review
redshiftzero Nov 16, 2016
0955916
Move requirements
redshiftzero Nov 17, 2016
2c2b442
Add Ansible configuration of models schema
redshiftzero Nov 17, 2016
4977ac3
Add new pip requirements from pip-compile.sh
redshiftzero Nov 23, 2016
ce1e0ad
Point database base class to use the PGPASSFILE
redshiftzero Nov 28, 2016
fe28cfe
Label column should always be is_sd
redshiftzero Nov 28, 2016
d0d8896
Database classes should use test keyword
redshiftzero Nov 28, 2016
4fc9d54
Add test keyword in database class
redshiftzero Nov 28, 2016
5a47434
Replace get_dict() with __dict__
redshiftzero Nov 28, 2016
9 changes: 9 additions & 0 deletions CONTRIB.md
@@ -0,0 +1,9 @@
# Adding Features

Writing a new feature can be done by adding a method in `fpsd/features.py`.

# Adding Classifiers

A new classifier can be added by adding a new stanza to the `Experiment._get_model_object()` method in `classify.py`. This method must return an object that defines `fit()` (for model training) and `predict_proba()` (for scoring the test set), since that is the interface scikit-learn uses for its classifier objects.

To get the code to use the classifier, add a string corresponding to its name to your attack YAML file under `models` and add any hyperparameters and options (e.g. number of rounds for Wa-kNN) under `parameters`.
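
As a rough illustration, a new stanza might look like the following. This is a hypothetical sketch: the actual body of `Experiment._get_model_object()` is not shown in this pull request, and the argument names here are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier


def _get_model_object(self, model_type, hyperparameters):
    # Hypothetical stanza: map a model name from the attack YAML to an
    # object exposing fit() and predict_proba(), as scikit-learn does.
    if model_type == "RandomForest":
        return RandomForestClassifier(**hyperparameters)
    # ... stanzas for the other supported models ...
    raise ValueError("Unsupported model type: {}".format(model_type))
```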
85 changes: 85 additions & 0 deletions docs/Pipeline.md
@@ -0,0 +1,85 @@
# Web Fingerprinting Analysis Pipeline

![](images/pipeline.png)

## Feature Generation

Our feature generation code is primarily in SQL. It takes the data the crawlers dump into the `raw` schema, generates all of the Tor-traffic features from Wang et al. 2014, and stores the results in the `features` schema:

![](images/feature_generation.png)

Run this step with:

```./features.py```

## Machine Learning

This step:

* takes the features in the database,
* trains a series of binary classifiers,
* evaluates how well each classifier performs,
* and then saves performance metrics in the database and pickles the trained model objects for future scoring.

Run this step with:

```./attack.py -c my_attack_file.yaml```

### Attack Setup

The machine learning part of the code takes a YAML file (by default `attack.yaml`) as input to specify details of the models that should be generated. Here are the options that are currently implemented:

* `world`: specifies what kind of cross validation should be performed.
    * `type`: `closed` or `open`
    * `observed_fraction`: specifies the fraction of the world that is "observed" (measured by the adversary) for open world validation.

* `num_kfolds`: value of k for k-fold cross-validation

* `feature_scaling`: this option will take the features and [rescale](https://en.wikipedia.org/wiki/Feature_scaling) them to [zero mean and unit standard deviation](https://en.wikipedia.org/wiki/Standard_score). For some classifiers, primarily those based on decision trees, this should not improve performance, but for many others, e.g. SVM, it is necessary. See also [scikit-learn's documentation](http://scikit-learn.org/stable/modules/preprocessing.html). A minimal scaling sketch follows this list.

* `models`: a list of types of binary classifiers that should be trained

* `parameters`: this option specifies the range of hyperparameters that should be used for each classifier type

For more details and a complete example, see `attack.yaml`.
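
For example, the rescaling described under `feature_scaling` can be done with scikit-learn's `StandardScaler`; the snippet below is a minimal illustration, not the pipeline's actual implementation, and the arrays are made-up data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply the same
# transformation to the test features to avoid leaking test-set statistics.
x_train = np.array([[100.0, 2.0], [200.0, 4.0], [300.0, 6.0]])
x_test = np.array([[150.0, 3.0]])

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # zero mean, unit std per feature
x_test_scaled = scaler.transform(x_test)
```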

### Model Training and Evaluation

When this step of the pipeline runs, it will:

* get the features from the database,
* split the data into train/test sets,
* generate a series of experiments covering every possible combination of preprocessing option, model type, and hyperparameter set,
* for each experiment (a sketch of this loop follows the list):
    * for every train/test split, it will:
        * train on the training set,
        * evaluate on the testing set,
        * save the metrics in the database in `models.undefended_frontpage_folds`,
        * pickle the trained model and save it for future scoring
    * average the metrics for the folds from that experiment and save them in the database in `models.undefended_frontpage_attacks`.
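
The following sketch shows the general shape of that per-experiment loop using generic scikit-learn pieces; the function and variable names are illustrative and are not the pipeline's actual classes or schema-writing code.

```python
import pickle

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def train_eval_all_folds_sketch(model, x, y, num_kfolds=10):
    """Illustrative k-fold loop: train, evaluate, and pickle each fold's model."""
    fold_aucs = []
    folds = StratifiedKFold(n_splits=num_kfolds, shuffle=True)
    for fold, (train_idx, test_idx) in enumerate(folds.split(x, y)):
        model.fit(x[train_idx], y[train_idx])
        scores = model.predict_proba(x[test_idx])[:, 1]
        fold_aucs.append(roc_auc_score(y[test_idx], scores))
        # Per-fold metrics would go to models.undefended_frontpage_folds,
        # and the trained model would be pickled for future scoring.
        with open("model_fold_{}.pkl".format(fold), "wb") as f:
            pickle.dump(model, f)
    # The averaged metrics would go to models.undefended_frontpage_attacks.
    return sum(fold_aucs) / len(fold_aucs)
```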

### Evaluation and Output in Model Schema

The following information and evaluation metrics are stored in the database table `models.undefended_frontpage_folds`:

* `auc`: [Area under the ROC curve](http://people.inf.elte.hu/kiss/12dwhdm/roc.pdf)
* `tpr`: true positive rate (array over scikit-learn's default thresholds)
* `fpr`: false positive rate (array over scikit-learn's default thresholds)
* `precision_at_k` for `k=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]`: "Fraction of SecureDrop users correctly identified in the top k percent of the testing set"
* `recall_at_k` for `k=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]`: "Number of SecureDrop users captured by flagging the top k percent of the testing set"
* `f1_at_k` for `k=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]`

The same metrics are then computed over all folds and saved in `models.undefended_frontpage_attacks`, in addition to:

* `world_type`
* `train_class_balance`
* `base_rate` (test class balance)
* `observed_world_size` if in open world validation
* `model_type`
* `hyperparameters` in json format

The `model_timestamp` and `fold_timestamp` are saved as identifiers in `models.undefended_frontpage_folds` and the `model_timestamp` is saved in `models.undefended_frontpage_attacks`.
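
As an illustration of the `*_at_k` metrics, a minimal way to compute precision at k percent is shown below; this is a sketch, not the pipeline's own evaluation code.

```python
import numpy as np


def precision_at_k(y_true, y_scores, k_percent):
    """Fraction of true SecureDrop users among the examples ranked in the
    top k percent by predicted score (illustrative implementation)."""
    n_top = max(1, int(round(len(y_scores) * k_percent / 100.0)))
    top_idx = np.argsort(y_scores)[::-1][:n_top]  # highest scores first
    return float(np.mean(np.asarray(y_true)[top_idx]))
```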

## Model Selection

This is currently done manually by selecting the top `auc` model in the database.
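
A hedged sketch of that manual step, assuming a psycopg2 connection and the table/column names described above (the connection details are placeholders):

```python
import psycopg2

# Placeholder connection; in practice credentials come from the usual
# PostgreSQL configuration (e.g. a PGPASSFILE).
conn = psycopg2.connect(dbname="fpsd")
with conn.cursor() as cur:
    cur.execute(
        "SELECT model_timestamp, model_type, hyperparameters, auc "
        "FROM models.undefended_frontpage_attacks "
        "ORDER BY auc DESC LIMIT 1")
    best_model = cur.fetchone()
conn.close()
```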
Contributor:

I think in WTF-PAD they correctly point out that precision and recall are more important measures than TPR and FPR. They considered F1 to be the most important metric, but I personally believe that F0.5 is closer to the most informative single metric we can look at, given our specific problem. There is a function `sklearn.metrics.fbeta_score` that will compute F_beta.

Contributor Author:

It's true that precision, recall, and F1 score are better metrics, but for this first pass we are using AUC since it's independent of the class balance in the testing set (see Figure 5 in this paper). I filed #62 to implement precision, recall, and F1 storage. I'm not following why F_{0.5} would be a better metric, though?

Contributor:

Let's discuss in-person.
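
As a side note on the `sklearn.metrics.fbeta_score` function mentioned in this thread, a minimal usage sketch with made-up labels:

```python
from sklearn.metrics import fbeta_score

y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1]

# beta < 1 weights precision more heavily than recall; beta=0.5 gives F0.5.
print(fbeta_score(y_true, y_pred, beta=0.5))
```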

Binary file added docs/images/feature_generation.png
Binary file added docs/images/pipeline.png
79 changes: 79 additions & 0 deletions fpsd/attack.py
@@ -0,0 +1,79 @@
#!/usr/bin/env python3.5
import argparse
import datetime
from itertools import product
import pdb
import pickle
import yaml

import classify, database


def run(options):
"""Takes an attack file, gets the features, and runs all experiments
and saves the output in the database.

Args:
options [dict]: attack setup file
"""
Contributor:

The last code I wrote (see the docstring beginning """Return an :obj:collections.OrderedDict from ``dict_str``.), I wrote with the documentation style described in #53. I'm not absolutely set on one particular style, but we should be consistent. If you want to make other suggestions, do so in #53 and we can discuss, but if you also like the python-gnupg style, then you should make the appropriate changes here.

Contributor Author:

I was using Google-style docstrings, but I don't have strong feelings one way or the other, so the docstring format you like there is fine with me, and from now on I will follow it. However, I wrote most of the existing docstrings in this branch before that issue was filed, so in the interest of time I would rather not rewrite them at this stage.

Contributor:

Sgtm.


    with open(options, 'r') as f:
        options = yaml.load(f)

    db = database.DatasetLoader(test=False)

    df = db.load_world(options["world"]["type"])

    df = classify.imputation(df)
    x = df.drop(['exampleid', 'is_sd'], axis=1).values
    y = df['is_sd'].astype(int).values

    for experiment in generate_experiments(options):
        experiment.train_eval_all_folds(x, y)


def generate_experiments(options):
"""Takes an attack file and generates all the experiments that
should be run

Args:
options [dict]: attack setup file

Returns:
all_experiments [list]: list of Experiment objects
"""

all_experiments = []

for model in options["models"]:
model_hyperparameters = options["parameters"][model]

parameter_names = sorted(model_hyperparameters)
parameter_values = [model_hyperparameters[p] for p in parameter_names]

# Compute Cartesian product of hyperparameter lists
all_params = product(*parameter_values)
Contributor:

At first I thought this was the multiplicative product; a note specifying it's the Cartesian product would help readability.

Contributor Author:

Done


        for param in all_params:
            parameters = {name: value for name, value
                          in zip(parameter_names, param)}

            timestamp = datetime.datetime.now().isoformat()

            all_experiments.append(classify.Experiment(
                model_timestamp=timestamp,
                world=options["world"],
                model_type=model,
                hyperparameters=parameters,
                feature_scaling=options["feature_scaling"]))
    return all_experiments


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--config", dest="config", type=str,
                        default="attack.yaml",
                        help="point to attack config/setup file")
    args = parser.parse_args()

    run(args.config)
91 changes: 91 additions & 0 deletions fpsd/attack.yaml
@@ -0,0 +1,91 @@
#################
# Train/Test #
#################

world:
  type: 'closed' # 'closed' or 'open'
  observed_fraction: 0.20 # fraction of the open world that can be measured by the adversary
Contributor:

It might be fun to make this a hyperparameter to play with as well; it would help in threat modeling, though it is computationally costly. It could just be set to [0.20] initially, and if we ever want to investigate later, we can add other values.

Contributor Author:

Yep, we can make this take a list at a later point; I updated issue #64 to keep this suggestion.

num_kfolds: 10



######################
# Preprocessing #
######################

feature_scaling: True # Rescale each feature to mean zero and unit standard deviation



#################
# Classifiers #
#################

# All supported models
# model: ['RandomForest', 'RandomForestBagging', 'RandomForestBoosting', 'ExtraTrees',
# 'AdaBoost', 'LogisticRegression', 'SVM', 'GradientBoostingClassifier',
# 'DecisionTreeClassifier', 'SGDClassifier', 'KNeighborsClassifier']

models: ['RandomForest', 'ExtraTrees', 'DecisionTreeClassifier', 'RandomForestBagging',
         'RandomForestBoosting']
parameters:
  RandomForest:
    n_estimators: [25, 50, 100] #[25, 50] # 100, 1000, 10000, 10
    max_depth: [10, 20, 50, 100] # 50, 100, 5
    max_features: ['sqrt', 'log2', 2, 4, 8, 16, "auto"]
    criterion: ['gini', 'entropy']
    min_samples_split: [2, 5, 10]
  RandomForestBagging:
    n_estimators: [10] # [25, 50, 100, 1000, 10000]
    max_depth: [5] # [10, 20, 50, 100]
    max_features: ['sqrt'] # ['log2', 2, 4, 8, 16, "auto"]
    criterion: ['gini'] # ['entropy']
    min_samples_split: [2] # [5, 10]
    max_samples: [0.5] # [1.0]
    bootstrap: [True]
    bootstrap_features: [False] # [True]
    n_estimators_bag: [10] # [25, 50, 100, 1000, 10000]
    max_features_bag: [2] # [4, 8, 16]
  RandomForestBoosting:
    n_estimators: [100] # [25, 50, 100, 1000, 10000]
    max_depth: [20] # [10, 20, 50, 100]
    max_features: [2] # ['sqrt', 'log2', 2, 4, 8, 16, "auto"]
    criterion: ['gini'] # ['entropy']
    min_samples_split: [2] # [5, 10]
    algorithm: ['SAMME'] # ['SAMME.R']
    learning_rate: [0.01] # [0.1, 1, 10, 100]
    n_estimators_boost: [10] # [25, 50, 100, 1000, 10000]
  ExtraTrees:
    n_estimators: [10] # [25, 50, 100, 1000, 10000]
    max_depth: [3] # 5, 10] # [20, 50, 100]
    max_features: ['log2'] # [4, 8, 16, "auto"]
    criterion: ['gini'] #, 'entropy']
    min_samples_split: [2] #, 5, 10]
  AdaBoost:
    algorithm: ['SAMME', 'SAMME.R']
    n_estimators: [1, 10, 100] # [1000, 10000]
    learning_rate: [0.01, 0.1, 1, 10, 100]
  LogisticRegression:
    C_reg: [0.00001, 0.0001, 0.001, 0.01, 0.1] # [1, 10]
    penalty: ['l1', 'l2']
  SVM:
    C_reg: [0.00001, 0.0001, 0.001, 0.01, 0.1] # [1, 10]
    kernel: ['linear']
  GradientBoostingClassifier:
    n_estimators: [1, 10, 100] # [1000, 10000]
    learning_rate: [0.001, 0.01, 0.05, 0.1, 0.5]
    subsample: [0.1, 0.5, 1.0]
    max_depth: [1, 3, 5, 10, 20] # [50, 100]
  DecisionTreeClassifier:
    criterion: ['gini', 'entropy']
    max_depth: [1, 5, 10, 20] # [50, 100]
    max_features: ['sqrt', 'log2']
    min_samples_split: [2, 5, 10]
  SGDClassifier:
    loss: ['log', 'modified_huber']
    penalty: ['l1', 'l2', 'elasticnet']
  KNeighborsClassifier:
    n_neighbors: [1, 3, 5, 10, 25, 50, 100]
    weights: ['uniform', 'distance']
    algorithm: ['auto', 'kd_tree']
