ENH Improve wording in stratification notebook (#760)
Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
3 people authored May 17, 2024
1 parent 9a9e74f commit ca7d1d7
Showing 1 changed file with 40 additions and 31 deletions.
71 changes: 40 additions & 31 deletions python_scripts/cross_validation_stratification.py
@@ -36,10 +36,11 @@
model = make_pipeline(StandardScaler(), LogisticRegression())

# %% [markdown]
# Once we created our model, we will use the cross-validation framework to
# evaluate it. We will use the `KFold` cross-validation strategy. We will define
# a dataset with nine samples and repeat the cross-validation three times (i.e.
# `n_splits`).
# Once the model is created, we can evaluate it using cross-validation. We start
# by using the `KFold` strategy.
#
# Let's review how this strategy works. For this purpose, we define a dataset
# with nine samples and split it into three folds (i.e. `n_splits=3`).

# %%
import numpy as np
@@ -51,12 +52,12 @@
print("TRAIN:", train_index, "TEST:", test_index)

# %% [markdown]
# By defining three splits, we will use three samples for testing and six for
# training each time. `KFold` does not shuffle by default. It means that it will
# select the three first samples for the testing set at the first split, then
# the next three samples for the second split, and the three next for the
# last split. In the end, all samples have been used in testing at least once
# among the different splits.
# By defining three splits, we use three samples (one fold) for testing and six
# (two folds) for training each time. `KFold` does not shuffle by default. This
# means that the first three samples are selected for the testing set at the
# first split, the next three samples for the second split, and the last three
# for the third split. In the end, all samples have been used for testing at
# least once among the different splits.
#
# Now, let's apply this strategy to check the generalization performance of our
# model.
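
# %% [markdown]
# A minimal sketch of such an evaluation (reloading the iris dataset into
# `data` and `target` here is an assumption; the notebook defines these objects
# in earlier cells) could be:

# %%
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score

# Evaluate the pipeline with a non-shuffled 3-fold cross-validation.
data, target = load_iris(return_X_y=True, as_frame=True)
cv = KFold(n_splits=3)
test_scores = cross_val_score(model, data, target, cv=cv)
print(f"Test accuracy per fold: {test_scores}")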
@@ -73,8 +74,8 @@

# %% [markdown]
# It is a real surprise that our model cannot correctly classify any sample in
# any cross-validation split. We will now check our target's value to understand
# the issue.
# any cross-validation split. We now check our target's value to understand the
# issue.

# %%
import matplotlib.pyplot as plt
@@ -86,18 +86,17 @@
_ = plt.title("Class value in target y")

# %% [markdown]
# We see that the target vector `target` is ordered. It will have some
# unexpected consequences when using the `KFold` cross-validation. To illustrate
# the consequences, we will show the class count in each fold of the
# cross-validation in the train and test set.
# We see that the target vector `target` is ordered. This has some unexpected
# consequences when using the `KFold` cross-validation. To illustrate these
# consequences, we show the class count in each fold of the cross-validation,
# both in the train and test sets.
#
# Let's compute the class counts for both the training and testing sets using
# the `KFold` cross-validation, and plot this information in a bar plot.
#
# We will iterate given the number of split and check how many samples of each
# are present in the training and testing set. We will store the information
# into two distincts lists; one for the training set and one for the testing
# set.
# We iterate over the splits and check how many samples of each class are
# present in the training and testing sets. We then store this information into
# two distinct lists: one for the training set and one for the testing set.

# %%
import pandas as pd
@@ -114,8 +114,8 @@
test_cv_counts.append(target_test.value_counts())
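
# %% [markdown]
# For reference, a complete version of this counting loop could look like the
# sketch below, reusing the `data`, `target` and `cv` objects assumed to be
# defined in the surrounding cells.

# %%
import pandas as pd

train_cv_counts = []
test_cv_counts = []
for train_index, test_index in cv.split(data, target):
    # Class counts in the training and testing partitions of this split.
    target_train = target.iloc[train_index]
    target_test = target.iloc[test_index]
    train_cv_counts.append(target_train.value_counts())
    test_cv_counts.append(target_test.value_counts())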

# %% [markdown]
# To plot the information on a single figure, we will concatenate the
# information regarding the fold within the same dataset.
# To plot the information on a single figure, we concatenate the per-fold
# counts into a single dataframe.
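
# %% [markdown]
# A sketch of such a concatenation (the fold labels passed as `keys` are an
# illustrative assumption) could be:

# %%
# Concatenate the per-fold Series into a single dataframe (one row per fold).
n_splits = 3
fold_names = [f"Fold #{idx}" for idx in range(n_splits)]
train_cv_counts = pd.concat(train_cv_counts, axis=1, keys=fold_names).T
test_cv_counts = pd.concat(test_cv_counts, axis=1, keys=fold_names).T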

# %%
train_cv_counts = pd.concat(
@@ -138,13 +138,13 @@
train_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Training set")
_ = plt.title("Training set class counts")

# %%
test_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Test set")
_ = plt.title("Test set class counts")

# %% [markdown]
# We can confirm that in each fold, only two of the three classes are present in
@@ -168,7 +168,7 @@
# 90%. Now that we solved our first issue, it would be interesting to check if
# the class frequency in the training and testing set is equal to our original
# set's class frequency. It would ensure that we are training and testing our
# model with a class distribution that we will encounter in production.
# model with a class distribution that we would encounter in production.
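
# %% [markdown]
# A minimal sketch of evaluating the model with a shuffled `KFold` (the
# `random_state` value below is an arbitrary assumption, used only to make the
# shuffling reproducible) could be:

# %%
from sklearn.model_selection import KFold, cross_val_score

# Shuffle the samples before splitting so that each fold contains all classes.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
test_scores = cross_val_score(model, data, target, cv=cv)
print(f"Test accuracy per fold: {test_scores}")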

# %%
train_cv_counts = []
@@ -191,13 +191,13 @@
train_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Training set")
_ = plt.title("Training set class counts\n(with suffling)")

# %%
test_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Test set")
_ = plt.title("Test set class counts\n(with suffling)")

# %% [markdown]
# We see that neither the training nor the testing sets have the same class
@@ -242,18 +242,27 @@
train_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Training set")
_ = plt.title("Training set class counts\n(with stratifying)")

# %%
test_cv_counts.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
plt.ylabel("Count")
_ = plt.title("Test set")
_ = plt.title("Test set class counts\n(with stratifying)")

# %% [markdown]
# In this case, we observe that the class counts in both the train set and the
# test set are very close to the class proportions of the full dataset. The
# small differences are due to the small number of samples in the iris dataset.
#
# In conclusion, it is good practice to use stratification within the
# cross-validation framework when dealing with a classification problem. In
# other words, stratifying is more effective than just shuffling when it comes
# to making sure that the class distributions in all folds are representative of
# the entire dataset. As the training and testing folds then have similar class
# distributions, stratifying leads to a more realistic measure of the model’s
# ability to generalize. This is especially important when the performance
# metrics depend on the proportion of the positive class, as we will see in a
# future notebook.
#
# The interested reader can learn about other stratified cross-validation
# techniques in the [scikit-learn user
# guide](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-with-stratification-based-on-class-labels).
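
# %% [markdown]
# As a final illustration, a minimal sketch of a stratified evaluation with
# `StratifiedKFold` (reusing `model`, `data` and `target` as defined earlier in
# the notebook) could be:

# %%
from sklearn.model_selection import StratifiedKFold, cross_val_score

# StratifiedKFold preserves the class proportions of the full dataset in every
# train/test split, even without shuffling.
cv = StratifiedKFold(n_splits=3)
test_scores = cross_val_score(model, data, target, cv=cv)
print(f"Test accuracy per fold: {test_scores}")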
