-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize apply_replacement #207
base: master
Are you sure you want to change the base?
Conversation
WDYT about using map on the dicts instead? import numpy as np
import pandas as pd
nrows = 1_000_000
ncols = 20
n_unique_vals = 100
df = pd.DataFrame(
np.random.randint(0, n_unique_vals, (nrows, ncols)),
columns=[f'x{i}' for i in range(ncols)],
dtype=str
)
cols_to_replace = [f'x{i}' for i in range(0, 20, 2)]
vec = {col: {str(i): str(i + 1) for i in range(n_unique_vals - 10)} # last 10 values were not seen
for col in cols_to_replace}
# define one of the replacement columns as float
df['x0'] = df['x0'].astype('float')
vec['x0'] = {float(k): float(v) for k, v in vec['x0'].items()}
replace_unseen = -1
def apply_replacements(df, columns, vec, replace_unseen):
def column_categorizer(col: str):
return np.select(
# the original had an and here so I guess it should be &
[df[col].isna() & (df[col].dtype == "float"), ~df[col].isin(vec[col].keys())],
[np.nan, replace_unseen],
df[col].replace(vec[col])
)
return df.assign(**{col: column_categorizer(col) for col in columns})
%time res1 = apply_replacements(df, cols_to_replace, vec, replace_unseen)
# Wall time: 1min 22s
# proposal
def apply_replacements2(df, columns, vec, replace_unseen):
def column_categorizer(col: str):
replaced = df[col].map(vec[col])
unseen = df[col].notnull() & replaced.isnull()
replaced[unseen] = replace_unseen
return replaced
return df.assign(**{col: column_categorizer(col) for col in columns})
%time res2 = apply_replacements2(df, cols_to_replace, vec, replace_unseen)
# Wall time: 3.93 s
pd.testing.assert_frame_equal(res1, res2) |
WOW! What is this magic? How does map works? |
The main difference is that replace only changes the values you provide in the dict, whereas map tries to replace all of them and when there isn't a match it sets the value to null, which in this case I think is helpful for us because we can get the ones that didn't match very easily. |
Codecov Report
@@ Coverage Diff @@
## master #207 +/- ##
==========================================
- Coverage 94.69% 94.29% -0.40%
==========================================
Files 25 34 +9
Lines 1507 2050 +543
Branches 203 269 +66
==========================================
+ Hits 1427 1933 +506
- Misses 48 80 +32
- Partials 32 37 +5
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
replaced[unseen] = replace_unseen | ||
return replaced |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced[unseen] = replace_unseen | |
return replaced | |
return replaced.mask(unseen, replace_unseen) |
replaced[unseen] = replace_unseen | ||
return replaced | ||
def column_categorizer(col: str) -> pd.Series: | ||
return df[col].map(lambda x: vec[col].get(x, replace_unseen), na_action='ignore') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this faster than .apply
? I don't know how map works under the hood, but if it implements a for loop in the backend for lambda function, than its just as bad as apply no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I wasn't benchmarking against the original implementation, this takes about the same time (4.5s original, 3.8s this one). Do you have an example where the original is very slow? That may be a better case to benchmark.
Status
IN DEVELOPMENT
Todo list
Background context
Pandas .apply is incredibly slow to run. Since this function is used in multiple learners, speeding it up should yield tremendous grains in performance. As a side note, we should never introduce new call
.apply
. They are a major source of headacheDescription of the changes proposed in the pull request
Vectorize
apply_replacements