Conversation
After changes and a rebase are made, I'll take another look. I'll do my part to help us iterate faster than happened with the features branch.
Run this step with: `python features.py`
This fails because the default Python in Xenial is 2.7. `./features.py` will read the shebang and choose the version appropriately. Be sure to fix the ones below too.
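For context on why the invocation matters, here is a minimal demonstration with a throwaway script standing in for `features.py` (it assumes `python3` is on the PATH):

```python
import os
import stat
import subprocess
import tempfile

# Why "./features.py" works where "python features.py" fails on Xenial:
# executing the file directly makes the kernel honor the shebang, while
# "python features.py" uses whatever "python" resolves to (2.7 there).
script = "#!/usr/bin/env python3\nimport sys\nprint(sys.version_info.major)\n"
fd, path = tempfile.mkstemp(suffix=".py")
os.close(fd)
with open(path, "w") as f:
    f.write(script)
os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)  # make it executable
out = subprocess.run([path], capture_output=True, text=True).stdout.strip()
print(out)  # major version chosen by the shebang, not by "python"
os.remove(path)
```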
Word - changed. Also changed `attack.py` below.
* `num_kfolds`: value of k for k-fold cross-validation
* `feature_scaling`: this option takes the features and rescales them to zero mean and unit standard deviation. For some classifiers, primarily those based on decision trees, this should not improve performance, but for many others, e.g. SVM, it is necessary.
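To make the `feature_scaling` option concrete, standard-score scaling can be sketched in plain numpy (an illustration, not the project's code; `sklearn.preprocessing.StandardScaler` does the same thing):

```python
import numpy as np

# Standard-score (z-score) scaling: subtract each feature column's mean and
# divide by its standard deviation, as the feature_scaling option describes.
def scale_features(X):
    """Rescale each feature column to zero mean and unit std deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_scaled = scale_features(X)
print(X_scaled.mean(axis=0))  # ~[0 0]
print(X_scaled.std(axis=0))   # ~[1 1]
```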
Proposal to help the ML newbies like me understand what's going on here:

`feature_scaling`: setting feature scaling will take the features and compute their standard score, re-scaling them to zero.... See also scikit-learn's documentation.
Added these links!
* `auc`: [Area under the ROC curve](http://people.inf.elte.hu/kiss/12dwhdm/roc.pdf)
* `tpr`: true positive rate (array for default sklearn thresholds)
* `fpr`: false positive rate (array for default sklearn thresholds)
* `precision_at_k` for `k=[0.01, 0.05, 0.1, 0.5, 1, 5, 10]`: "Fraction of SecureDrop users correctly identified in the top k percent"
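For readers new to these metrics, ROC AUC can be sketched in its rank-statistic (Mann-Whitney) form in plain numpy. This is illustrative only; the pipeline presumably uses sklearn's `roc_auc_score`, and this sketch assumes no tied scores (ties would need averaged ranks):

```python
import numpy as np

# Rank-statistic form of ROC AUC: the probability that a randomly chosen
# positive example is scored above a randomly chosen negative example.
def auc_score(y_true, y_scores):
    y_true = np.asarray(y_true)
    order = np.argsort(y_scores)
    ranks = np.empty(len(y_true))
    ranks[order] = np.arange(1, len(y_true) + 1)  # rank 1 = lowest score
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```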
Percent of what? I'll probably figure it out later as I read on in the docs or source, but it would be better if it were clarified here.
Percentage of the testing set; added this explicitly.
Args:
    options [dict]: attack setup file
"""
The last code I wrote (see `fingerprint-securedrop/fpsd/utils.py`, line 28 in fecb7fa) uses this docstring format:

"""Return an :obj:`collections.OrderedDict` from ``dict_str``.
I was using Google-style docstrings, but I don't have strong feelings either way, so the docstring format you like is fine with me and I'll follow it from now on. However, I wrote most of the existing docstrings in this branch before that issue was filed, so in the interest of time I'd rather not rewrite them at this stage.
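For reference, the two docstring styles under discussion look like this, shown on a hypothetical function (names and parameters are illustrative, not from the diff):

```python
# Google style vs. Sphinx/reStructuredText style, side by side.

def run_attack_google(options):
    """Run the attack described by an options dict (Google style).

    Args:
        options (dict): attack setup file contents.

    Returns:
        dict: evaluation metrics keyed by metric name.
    """

def run_attack_sphinx(options):
    """Run the attack described by an options dict (Sphinx style).

    :param dict options: attack setup file contents.
    :returns: evaluation metrics keyed by metric name.
    :rtype: dict
    """
```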
Sgtm.
import matplotlib.pyplot as plt
import numpy as np
import pickle
from scipy import interp
I don't see you using this dependency, and I believe you meant to import the `interpolate` sub-package anyway.
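For context, `scipy.interp` was only ever a re-export of `numpy.interp` (and has since been removed from SciPy), so for the usual ROC-curve-averaging use this import shows up in, numpy suffices. Example ROC points here are invented:

```python
import numpy as np

# Interpolate a ROC curve onto an evenly spaced FPR grid, the common
# pattern for averaging ROC curves across cross-validation folds.
mean_fpr = np.linspace(0, 1, 5)
fpr = np.array([0.0, 0.5, 1.0])   # example ROC points
tpr = np.array([0.0, 0.8, 1.0])
interp_tpr = np.interp(mean_fpr, fpr, tpr)
print(interp_tpr)  # TPR values at the evenly spaced FPRs
```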
Removed this.
return fig

def precision_recall_at_x_proportion(test_labels, test_predictions, x_proportion=0.01,
No idea what this function is doing or why. It needs at least a docstring, and maybe a comment in `get_metrics` as well. Also, can you move this function right below `get_metrics`, since they are related and the two functions above this one are not related to `get_metrics`?
Added a docstring and moved it up in the file.
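A sketch of what a function with this signature plausibly computes, guessed from the name and signature (not the project's actual code): treat the top x proportion of the testing set by predicted score as positive, then report precision and recall at that cutoff.

```python
import numpy as np

def precision_recall_at_x_proportion(test_labels, test_predictions,
                                     x_proportion=0.01):
    """Precision and recall over the top x proportion of predicted scores."""
    labels = np.asarray(test_labels)
    scores = np.asarray(test_predictions)
    cutoff = max(1, int(round(len(labels) * x_proportion)))
    top = np.argsort(scores)[::-1][:cutoff]  # indices of highest scores
    true_positives = labels[top].sum()
    precision = true_positives / cutoff       # fraction of top-k that are hits
    recall = true_positives / labels.sum()    # fraction of all hits recovered
    return precision, recall

labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
precision, recall = precision_recall_at_x_proportion(labels, scores, 0.5)
```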
self.feature_scaling = feature_scaling
self.db = database.ModelStorage()

def get_dict(self):
There's a built-in attribute that does this: `self.__dict__`. I would just rename the object attributes to match `sklearn`'s. It's clearer that way anyway.
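The suggestion in miniature: instance attributes already live in `__dict__`, so a hand-rolled `get_dict()` method is redundant (toy class for illustration):

```python
# Instance attributes are stored in the instance's __dict__ mapping.
class Model(object):
    def __init__(self, feature_scaling):
        self.feature_scaling = feature_scaling

m = Model(True)
print(m.__dict__)  # {'feature_scaling': True}
```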
Done in 5a47434
return engine

class DatasetLoader(object):
`class DatasetLoader(Database)` -- and the classes below too.
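A sketch of the suggested refactor: `DatasetLoader` inherits the connection plumbing from `Database` instead of duplicating it. The `Database` internals here are invented stand-ins, not the project's actual code:

```python
# Base class owns engine creation; subclasses reuse it via inheritance.
class Database(object):
    def __init__(self):
        self.engine = self._create_engine()

    def _create_engine(self):
        return "engine"  # stand-in for sqlalchemy.create_engine(...)

class DatasetLoader(Database):
    def load(self):
        return "loaded via {}".format(self.engine)

print(DatasetLoader().load())  # loaded via engine
```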
Done
options["world_type"], options["model_type"],
options["base_rate"], json.dumps(options["hyperparameters"]),
self.metric_formatter(eval_metrics)))
with safe_session(self.engine) as session:
`with self.safe_session() as session:` -- and below too.
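The suggested `safe_session` method could look roughly like this. The commit/rollback/close shape is an assumption about the helper's behavior, and `session_factory` is an invented name; the real code presumably wraps a SQLAlchemy sessionmaker:

```python
from contextlib import contextmanager

class Database(object):
    def __init__(self, session_factory):
        self.session_factory = session_factory

    @contextmanager
    def safe_session(self):
        """Commit on success, roll back on error, always close."""
        session = self.session_factory()
        try:
            yield session
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()

# Fake session records calls so the flow is visible without a database:
class FakeSession(object):
    def __init__(self):
        self.events = []
    def commit(self):
        self.events.append("commit")
    def rollback(self):
        self.events.append("rollback")
    def close(self):
        self.events.append("close")

db = Database(FakeSession)
with db.safe_session() as session:
    pass
print(session.events)  # ['commit', 'close']
```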
Done
def get_feature_importances(model):

    try:
        return model.feature_importances_
This attribute is cool. If we can identify features that are consistently found not to be useful for our top classifiers, we could use this information to help us implement the feature selection stage described in #63.
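The try/except idiom from the diff, exercised with stand-in model classes (illustrative only; tree-based sklearn models expose `feature_importances_`, others don't):

```python
def get_feature_importances(model):
    try:
        return model.feature_importances_
    except AttributeError:
        return None  # model type doesn't expose importances

class TreeModel(object):    # stands in for e.g. RandomForestClassifier
    feature_importances_ = [0.7, 0.2, 0.1]

class SVMModel(object):     # no feature_importances_ attribute
    pass

print(get_feature_importances(TreeModel()))  # [0.7, 0.2, 0.1]
print(get_feature_importances(SVMModel()))   # None
```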
👍
cdf0dc7 to 07dc7b8
f527a27 to 847e4e6
…e analysis pipeline
The database needs a password, so I'm not sure how this was working before? Either way, it's there now, and it reads from the environment variable $PGPASSFILE. The existing code was always using the test database, which is not what we want. The production or test database can now be selected using this new `test` keyword.
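For readers unfamiliar with $PGPASSFILE: it points at a libpq-style password file whose lines have the form `hostname:port:database:username:password`. A minimal reader sketch; the project's actual parsing may well differ:

```python
import os
import tempfile

def read_pgpass(path=None):
    """Return credentials from the first non-comment line of a pgpass file."""
    path = path or os.environ["PGPASSFILE"]
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                host, port, db, user, password = line.split(":", 4)
                return {"host": host, "port": port, "database": db,
                        "user": user, "password": password}

# Demo with a throwaway file standing in for ~/.pgpass:
with tempfile.NamedTemporaryFile("w", suffix=".pgpass", delete=False) as f:
    f.write("localhost:5432:fpsd:fpsd_user:secret\n")
creds = read_pgpass(f.name)
print(creds["user"])  # fpsd_user
```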
847e4e6 to 5a47434
OK, comments addressed: the ansible-ification of the creation of the models schema and tables is done, Travis builds are passing, and it's rebased on current master. Should be good to go 🌞
What a review process this has been! Thanks for your patience here, @redshiftzero. Given the frequent back-and-forth here, I'm inclined to merge, and we can bite off smaller hunks to discuss in discrete issues going forward.
👍 sounds good - any other outstanding problems we can make issues for and address in smaller PRs?
@redshiftzero has implemented the changes requested
Changes requested during review have been implemented.
This PR adds an initial machine learning pipeline that takes the features in the database, trains a series of binary classifiers, evaluates how well each classifier performs, and then saves a number of relevant performance metrics in the database, as well as pickling the trained model objects (for use in future scoring). The work in this PR corresponds to the latter half of this diagram, from the `features` schema on.

A more complete description of our pipeline is in `docs/pipeline.md`, and a (very) brief description of how specialized classifiers might be integrated is in `CONTRIB.md`.