
ENH: Store fitted attributes as dpt and return dpt tensors #65

Merged · 4 commits merged into main · Dec 2, 2022

Conversation

@fcharras (Collaborator) commented Nov 25, 2022:

The compatibility layer for keeping the sklearn tests working is still TODO. Apart from that, I think this would be the final state of KMeansEngine.

@fcharras fcharras changed the base branch from accept_dpnp_dpt_inputs to main November 25, 2022 14:59
@fcharras fcharras changed the base branch from main to accept_dpnp_dpt_inputs November 25, 2022 14:59
@fcharras fcharras force-pushed the store_fitted_attributes_as_dpt branch from c04e512 to bd8308c Compare November 25, 2022 15:03
@fcharras (Collaborator, Author) commented Nov 25, 2022:

I wanted to duplicate the KMeansEngine class with a KMeansEngineDebug class that would be loaded when the engine sklearn_numba_dpex.debug is used rather than bare sklearn_numba_dpex. The difference between debug and vanilla would be that the debug engine converts all fitted attributes and all outputs to numpy arrays. The use case is running the sklearn unit tests against it.
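A minimal sketch of what such a conversion layer could look like, with the conversion function injected so the sketch stays device-free (a real debug engine would presumably pass dpctl.tensor.asnumpy; the class and method names here are hypothetical):

```python
import functools

import numpy as np


def returns_numpy(convert):
    """Decorator converting an engine method's output(s) to numpy arrays.

    `convert` is injected so this sketch stays device-free; a real debug
    engine would pass something like dpctl.tensor.asnumpy instead.
    """

    def wrap(method):
        @functools.wraps(method)
        def inner(*args, **kwargs):
            out = method(*args, **kwargs)
            if isinstance(out, tuple):
                return tuple(convert(o) for o in out)
            return convert(out)

        return inner

    return wrap


class KMeansEngineDebug:  # hypothetical: would subclass KMeansEngine
    @returns_numpy(np.asarray)  # placeholder conversion for the sketch
    def get_labels(self):
        return [0, 1, 1, 0]  # stand-in for a dpt tensor
```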

But with the current plugin design in scikit-learn/scikit-learn#24497 I don't see a way to do this: one module can only implement one engine.

Maybe there's a workaround consisting in exposing several top-level namespaces (e.g. sklearn_numba_dpex and sklearn_numba_dpex_debug) in the same package, using the packages list in setup.py?

Or WDYT about changing the plugin spec to enable a different syntax that allows passing a dotted string to the config_context?

E.g. rewriting https://github.com/scikit-learn/scikit-learn/pull/24497/files#diff-1d31de81e903bd6529fbe68f8009b7113e3b7de4f1465572ef88af4d03a7dc5bR35 such that entry points whose value prefix matches the user-provided string are selected (i.e. using prefix selection rather than str.split(".", 1) to detect the name of the engine).
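The prefix-based selection could look roughly like this (a sketch only; the helper name and entry-point values are illustrative, not the actual scikit-learn code):

```python
def select_engine_entry_points(entry_points, provider_spec):
    """Select entry points whose dotted module path matches provider_spec.

    Matching by prefix (rather than str.split(".", 1)) lets a user pass a
    dotted string such as "sklearn_numba_dpex.debug" to config_context.
    """
    selected = []
    for ep in entry_points:
        # Entry-point values look like "package.module:ClassName".
        module_path = ep.value.split(":", 1)[0]
        if module_path == provider_spec or module_path.startswith(
            provider_spec + "."
        ):
            selected.append(ep)
    return selected
```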

Edit: somewhat related to #21

@jjerphan jjerphan marked this pull request as draft November 28, 2022 11:06
@jjerphan jjerphan changed the title [DRAFT] ENH: Store fitted attributes as dpt and return dpt tensors ENH: Store fitted attributes as dpt and return dpt tensors Nov 28, 2022
@ogrisel (Collaborator) commented Nov 28, 2022:

I think we should allow programmatically registering additional provider names (in addition to the importlib.metadata registry) for testing / debugging / benchmarking purposes:

from sklearn._engine import register_engine
register_engine(
    engine_name="kmeans",
    engine_class="sklearn_numba_dpex.kmeans.engine:KMeansEngineDebug",
    provider_name="sklearn_numba_dpex_debug",
)

In this case we break the assumption that the provider name is the same as the top-level import package of the engine class, so this might impose some refactoring of the engine lookup / import logic so it no longer relies on this assumption.
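A minimal sketch of such a registry, checking programmatic registrations before the entry-point-based lookup (all names here are hypothetical, not the actual sklearn._engine API; the entry-point fallback is elided):

```python
import importlib

# (engine_name, provider_name) -> "module.path:ClassName"
_registered_engines = {}


def register_engine(engine_name, engine_class, provider_name):
    """Programmatically register an engine class path under a provider name."""
    _registered_engines[(engine_name, provider_name)] = engine_class


def resolve_engine(engine_name, provider_name):
    """Look up programmatic registrations first; a real implementation would
    fall back to the importlib.metadata entry-point registry."""
    path = _registered_engines.get((engine_name, provider_name))
    if path is None:
        raise LookupError(f"no engine {engine_name!r} for {provider_name!r}")
    module_path, class_name = path.split(":")
    return getattr(importlib.import_module(module_path), class_name)
```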

Base automatically changed from accept_dpnp_dpt_inputs to main November 29, 2022 12:54
@fcharras (Collaborator, Author) commented:

I agree, although since it requires work on the sklearn side, I think we can still push this PR forward with an environment-variable-based switch in the meantime. I'll add a commit going in this direction.

@fcharras fcharras force-pushed the store_fitted_attributes_as_dpt branch from bd8308c to c9a356d Compare November 30, 2022 13:12
@fcharras fcharras marked this pull request as ready for review November 30, 2022 13:13
@fcharras (Collaborator, Author) commented:

Out of WIP.

@fcharras fcharras requested review from jjerphan and ogrisel November 30, 2022 13:13
@fcharras fcharras force-pushed the store_fitted_attributes_as_dpt branch from c9a356d to ef49d7e Compare November 30, 2022 13:21
@jjerphan (Member) left a comment:


Thanks @fcharras. Here are a few comments.

Review threads (all resolved):
- .github/workflows/run_tests.yaml
- sklearn_numba_dpex/kmeans/engine.py (outdated, ×3)
- sklearn_numba_dpex/kmeans/drivers.py (one outdated, one current)
- sklearn_numba_dpex/kmeans/kernels/utils.py (outdated)
On sklearn_numba_dpex/kmeans/kernels/utils.py:

    initialized with value 1.
    """
    sample_idx = dpex.get_global_id(zero_idx)
    if sample_idx >= n_samples:

@jjerphan (Member) suggested change:

    -if sample_idx >= n_samples:
    +# Early return if result is already False.
    +if sample_idx >= n_samples or result[zero_idx] == zero_idx:

@fcharras (Collaborator, Author) replied Nov 30, 2022:

Note that this adds a read operation and doesn't stop already-scheduled tasks; it just ensures that future tasks will exit early after the read. So on average it's better when the clusterings are different, but not when they are the same. Still a win, since different clusterings should be the most common outcome.
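The pattern under discussion, sketched in plain Python with a sequential loop standing in for the parallel work items (the real code is a numba_dpex kernel; names here follow the snippet above but the harness is illustrative):

```python
import numpy as np


def kernel_body(sample_idx, labels_a, labels_b, result, n_samples):
    # Early return if out of bounds, or if another work item has already
    # recorded a mismatch: the extra read does not cancel queued work items,
    # it only lets them exit before doing any comparison work.
    if sample_idx >= n_samples or result[0] == 0:
        return
    if labels_a[sample_idx] != labels_b[sample_idx]:
        result[0] = 0  # clusterings differ


labels_a = np.array([0, 1, 1, 2])
labels_b = np.array([0, 1, 2, 2])
result = np.ones(1, dtype=np.int32)  # initialized with value 1
for i in range(len(labels_a)):
    kernel_body(i, labels_a, labels_b, result, len(labels_a))
```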

Another review thread on sklearn_numba_dpex/kmeans/kernels/utils.py (outdated, resolved).
Co-authored-by: Julien Jerphanion <[email protected]>
@fcharras (Collaborator, Author) commented Dec 1, 2022:

Comments should be addressed in the last commit, and the suggestions have been applied.

@fcharras fcharras requested a review from jjerphan December 1, 2022 16:24
@jjerphan (Member) left a comment:


LGTM. Thank you, @fcharras!

Review threads (all outdated, resolved):
- .github/workflows/run_tests.yaml
- sklearn_numba_dpex/kmeans/engine.py (×2)
- sklearn_numba_dpex/kmeans/tests/test_kmeans.py
Co-authored-by: Julien Jerphanion <[email protected]>
@fcharras (Collaborator, Author) commented Dec 2, 2022:

So in fact make_get_nb_distinct_clusters_kernel does suffer from a race condition: the unit test sometimes outputs 6, sometimes 7. But it looks more like a bug in numba_dpex atomics than in our code; or maybe I'm just not sure what "loaded atomically" is actually supposed to mean.

@fcharras (Collaborator, Author) commented Dec 2, 2022:

The test was the issue: the correct count is n_clusters = max(labels) + 1, but the test used n_clusters = max(labels).
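The off-by-one, illustrated: labels are zero-based, so the cluster count is the max label plus one, which matches the 6-vs-7 discrepancy observed in the test (the label values here are illustrative):

```python
import numpy as np

labels = np.array([0, 1, 2, 3, 4, 5, 6])  # zero-based cluster labels

n_clusters_buggy = int(labels.max())      # what the test expected: 6
n_clusters = int(labels.max()) + 1        # correct for zero-based labels: 7
```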

@fcharras fcharras merged commit 882cdcf into main Dec 2, 2022
@fcharras fcharras deleted the store_fitted_attributes_as_dpt branch December 2, 2022 09:06