ATgfe stands for Automated Transparent Genetic Feature Engineering. ATgfe uses a genetic algorithm to engineer new features. The idea is to compose new interpretable features based on interactions between the existing features. The predictive power of the newly constructed features is measured using a pre-defined evaluation metric, which can be custom-designed.
ATgfe applies the following techniques to generate candidate features:

- Simple feature interactions, using the basic arithmetic operators (+, -, *, /):

  `(petalwidth * petallength)`

- Scientific feature interactions, by applying transformation operators (e.g. log, cosine, cube, as well as custom operators that can easily be implemented as user-defined functions):

  `squared(sepalwidth)*(log_10(sepalwidth)/squared(petalwidth))-cube(sepalwidth)`

- Weighted feature interactions, by adding weights to the simple and/or scientific feature interactions:

  `(0.09*exp(petallength)+0.7*sepallength/0.12*exp(petalwidth))+0.9*squared(sepalwidth)`

- Complex feature interactions, by applying groupBy on the categorical features (see the pandas sketch after this list):

  `(0.56*groupByYear0TakeMeanOfFeelslike*0.51*feelslike)+(0.45*temp)`
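The grouped term above reads as "the mean of Feelslike within each Year". A minimal pandas sketch of how such a grouped feature is computed (the `Year` and `Feelslike` columns are hypothetical):

```python
import pandas as pd

# Hypothetical data: 'Year' is categorical, 'Feelslike' is numerical.
df = pd.DataFrame({'Year': [2018, 2018, 2019], 'Feelslike': [20.0, 22.0, 15.0]})

# Mean of Feelslike within each Year, broadcast back to every row.
# This is the building block behind terms like groupByYear0TakeMeanOfFeelslike.
df['groupByYearTakeMeanOfFeelslike'] = df.groupby('Year')['Feelslike'].transform('mean')
```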
ATgfe allows you to deal with non-linear problems by generating new interpretable features from existing features. The generated features can then be used with a linear model, which is inherently explainable. The idea is to explore potential predictive information that can be represented using interactions between existing features.
When compared with non-linear models (e.g. gradient boosting machines, random forests, etc.), ATgfe can achieve comparable results and in some cases outperform them. This is demonstrated in the following examples: BMI, rational difference and IRIS.
Expression | Linear Regression | LightGBM Regressor | Linear Regression + ATgfe
---|---|---|---
BMI = weight/height^2 | | |
Y = (X1 - X2) / (X3 - X4) | | |
Y = (Log10(X1) + Log10(X2)) / X5 | | |
Y = 0.4X2^2 + 2X4 + 2 | | |
Dataset | Logistic Regression | LightGBM Classifier | Logistic Regression + ATgfe
---|---|---|---
IRIS (4 features) | | |
Dataset | Linear Regression | LightGBM Regressor | Linear Regression + ATgfe
---|---|---|---
Concrete (8 features) | | |
Boston (13 features) | | |
- Python ^3.6
- DEAP ^1.3
- Pandas ^0.25.2
- Scipy ^1.3
- Numpy ^1.17
- Sympy ^1.4
```
pip install atgfe
```

To upgrade to the latest version:

```
pip install -U atgfe
```
The examples are grouped under the following two sections:

- Generated examples test ATgfe against hand-crafted non-linear problems where we know there is information that can be captured using feature interactions.
- Toy examples show how to use ATgfe in solving a mix of regression and classification problems from publicly available benchmark datasets.
ATgfe requires column names that are free from special characters and spaces (e.g. @, $, %, #, etc.):

```python
# Example: remove spaces and rename the "(cm)" suffix so that the
# resulting column names contain no spaces or special characters.
def prepare_column_names(columns):
    return [col.replace(' ', '').replace('(cm)', '_cm') for col in columns]

columns = prepare_column_names(df.columns.tolist())
df.columns = columns
```
```python
GeneticFeatureEngineer(
    model,
    x_train: pandas.core.frame.DataFrame,
    y_train: pandas.core.frame.DataFrame,
    numerical_features: List[str],
    number_of_candidate_features: int,
    number_of_interacting_features: int,
    evaluation_metric: Callable[..., Any],
    minimize_metric: bool = True,
    categorical_features: List[str] = None,
    enable_grouping: bool = False,
    sampling_size: int = None,
    cv: int = 10,
    fit_wo_original_columns: bool = False,
    enable_feature_transformation_operations: bool = False,
    enable_weights: bool = False,
    enable_bias: bool = False,
    max_bias: float = 100.0,
    weights_number_of_decimal_places: int = 2,
    shuffle_training_data_every_generation: bool = False,
    cross_validation_in_objective_func: bool = False,
    objective_func_cv: int = 3,
    n_jobs: int = 1,
    verbose: bool = True
)
```
`model`: ATgfe works with any model or pipeline that follows the scikit-learn API (i.e. the model must implement the `fit()` and `predict()` methods).
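For example, a scikit-learn pipeline satisfies this contract (a minimal sketch; the chosen estimator is illustrative):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Any estimator exposing fit() and predict() can be passed as `model`,
# including a full preprocessing-plus-estimator pipeline.
model = make_pipeline(StandardScaler(), LinearRegression())
```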
`x_train`: Training features in a pandas DataFrame.

`y_train`: Training labels in a pandas DataFrame, which also allows multi-target problems to be handled.

`numerical_features`: The list of column names that represent the numerical features.

`number_of_candidate_features`: The maximum number of new features to be generated.

`number_of_interacting_features`: The maximum number of existing features that can be used in constructing a new feature. These features are selected from those passed in the `numerical_features` argument.
`evaluation_metric`: Any of the scikit-learn metrics, or a custom evaluation metric, to be used by the genetic algorithm to evaluate the predictive power of the newly generated features. For example:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Root mean squared error as a custom evaluation metric
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
```
`minimize_metric`: A boolean flag, which should be set to `True` if the evaluation metric is to be minimized, or `False` if it is to be maximized.

`categorical_features`: The list of column names that represent the categorical features. The parameter `enable_grouping` must be set to `True` for the `categorical_features` to be utilised in grouping.

`enable_grouping`: A boolean flag, which should be set to `True` to construct complex feature interactions that use pandas `groupby`.

`sampling_size`: The exact size of the sampled training dataset. Use this parameter to run the optimization on the specified number of observations from the training data. If `sampling_size` is greater than the number of observations, ATgfe will create a sample with replacement.

`cv`: The number of folds for cross-validation. At every generation of the genetic algorithm, ATgfe evaluates the current best solution using k-fold cross-validation. The default number of folds is 10.
`fit_wo_original_columns`: A boolean flag, which should be set to `True` to fit the model without the original features specified in `numerical_features`. In this case, ATgfe will only use the newly generated features together with any remaining original features in `x_train`.

`enable_feature_transformation_operations`: A boolean flag, which should be set to `True` to enable scientific feature interactions on the `numerical_features`. The pre-defined transformation operators are `np_log()`, `np_log_10()`, `np_exp()`, `squared()` and `cube()`. You can easily remove from or add to this list of transformation operators; see the next section for examples.

`enable_weights`: A boolean flag, which should be set to `True` to enable weighted feature interactions.

`weights_number_of_decimal_places`: The number of decimal places (i.e. the precision) to be applied to the weight values.
`enable_bias`: A boolean flag, which enables the genetic algorithm to add a bias term to the generated expressions. For example: `0.43*log(cement) + 806.8557595548646`

`max_bias`: The bias value will lie between `-max_bias` and `max_bias`. For example, if `max_bias` is 100, the bias will be between -100 and 100.
`shuffle_training_data_every_generation`: A boolean flag; if enabled, the `train_test_split` call in the objective function uses the generation number as its random seed, which can help prevent over-fitting. This option is only available if `cross_validation_in_objective_func` is set to `False`.

`cross_validation_in_objective_func`: A boolean flag; if enabled, `train_test_split` is not used in the objective function. Instead, the genetic algorithm uses cross-validation to evaluate the generated features. The default number of folds is 3 and can be modified using the `objective_func_cv` parameter.

`objective_func_cv`: The number of folds to be used when `cross_validation_in_objective_func` is enabled.

`verbose`: A boolean flag, which should be set to `True` to enable logging.

`n_jobs`: To enable parallel processing, set `n_jobs` to the number of CPUs that you would like to utilise. If `n_jobs` is set to -1, all of the machine's CPUs will be utilised.
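Putting the parameters together, a minimal instantiation might look like the following sketch. The import path and column names are assumptions made for illustration (check the package for the exact module layout); `rmse` is the custom metric defined above:

```python
from sklearn.linear_model import LinearRegression
from atgfe.GeneticFeatureEngineer import GeneticFeatureEngineer  # import path assumed

model = LinearRegression()
gfe = GeneticFeatureEngineer(
    model,
    x_train=x_train,  # pandas DataFrame of training features
    y_train=y_train,  # pandas DataFrame of training labels
    numerical_features=['sepallength_cm', 'sepalwidth_cm'],  # hypothetical column names
    number_of_candidate_features=2,
    number_of_interacting_features=4,
    evaluation_metric=rmse,  # the custom metric defined earlier
    minimize_metric=True,
    enable_feature_transformation_operations=True,
    enable_weights=True,
    n_jobs=-1
)
```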
```python
gfe.fit(
    number_of_generations: int = 100,
    mu: int = 10,
    lambda_: int = 100,
    crossover_probability: float = 0.5,
    mutation_probability: float = 0.2,
    early_stopping_patience: int = 5,
    random_state: int = 77
)
```
`number_of_generations`: The maximum number of generations to be explored by the genetic algorithm.

`mu`: The number of solutions to select for the next generation.

`lambda_`: The number of children to produce at each generation.

`crossover_probability`: The crossover probability.

`mutation_probability`: The mutation probability.

`early_stopping_patience`: The number of generations without improvement in the validation score after which early stopping is triggered.

`random_state`: The random seed, for reproducibility.
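For example, a fit call with illustrative values:

```python
gfe.fit(
    number_of_generations=50,
    mu=10,
    lambda_=100,
    crossover_probability=0.5,
    mutation_probability=0.2,
    early_stopping_patience=5,
    random_state=77
)
```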
```python
X = gfe.transform(X)
```

where `X` is the pandas DataFrame to which you would like to append the generated features.
```python
gfe.get_enabled_transformation_operations()
```

Returns the list of currently enabled transformation operations:

```python
['None', 'np_log', 'np_log_10', 'np_exp', 'squared', 'cube']
```
`gfe.remove_transformation_operation` accepts a string or a list of strings:

```python
gfe.remove_transformation_operation('squared')
gfe.remove_transformation_operation(['np_log_10', 'np_exp'])
```
```python
import numpy as np

# Alias NumPy's square root so it can be registered as an operation
np_sqrt = np.sqrt

# A custom user-defined transformation
def some_func(x):
    return (x * 2) / 3

gfe.add_transformation_operation('sqrt', np_sqrt)
gfe.add_transformation_operation('some_func', some_func)
```
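After registering custom operations, they should appear among the enabled transformation operations (a hedged illustration; given the removals above, the list would look roughly like this):

```python
gfe.get_enabled_transformation_operations()
# e.g. ['None', 'np_log', 'cube', 'sqrt', 'some_func']
```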