
ENH: Store fitted attributes as dpt and return dpt tensors #65

Merged · 4 commits merged into main · Dec 2, 2022

Conversation

@fcharras (Collaborator) commented Nov 25, 2022:

The compatibility layer for keeping the sklearn tests working is still TODO. Apart from that, I think this would be the final state of KMeansEngine.

@fcharras fcharras changed the base branch from accept_dpnp_dpt_inputs to main November 25, 2022 14:59
@fcharras fcharras changed the base branch from main to accept_dpnp_dpt_inputs November 25, 2022 14:59
@fcharras fcharras force-pushed the store_fitted_attributes_as_dpt branch from c04e512 to bd8308c Compare November 25, 2022 15:03
@fcharras (Collaborator, Author) commented Nov 25, 2022:

I wanted to duplicate the KMeansEngine class with a KMeansEngineDebug class that would be loaded when the engine sklearn_numba_dpex.debug is used rather than bare sklearn_numba_dpex. The difference between debug and vanilla would be that the debug engine converts all fitted attributes and all outputs to numpy arrays. The use case is running the sklearn unit tests against it.
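A minimal sketch of what such a conversion layer could look like, with the conversion function injected so the sketch stays device-free (a real debug engine would presumably pass dpctl.tensor.asnumpy; the class and method names here are hypothetical):

```python
import functools

import numpy as np


def returns_numpy(convert):
    """Decorator converting an engine method's output(s) to numpy arrays.

    `convert` is injected so this sketch stays device-free; a real debug
    engine would pass something like dpctl.tensor.asnumpy instead.
    """

    def wrap(method):
        @functools.wraps(method)
        def inner(*args, **kwargs):
            out = method(*args, **kwargs)
            if isinstance(out, tuple):
                return tuple(convert(o) for o in out)
            return convert(out)

        return inner

    return wrap


class KMeansEngineDebug:  # hypothetical: would subclass KMeansEngine
    @returns_numpy(np.asarray)  # placeholder conversion for the sketch
    def get_labels(self):
        return [0, 1, 1, 0]  # stand-in for a dpt tensor
```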

But with the current plugin design in scikit-learn/scikit-learn#24497 I don't see a way to do this: one module can only implement one engine.

Maybe there's a workaround consisting in exposing several top-level namespaces (e.g. sklearn_numba_dpex and sklearn_numba_dpex_debug) in the same package, using the packages list in setup.py?

Or WDYT about changing the plugin spec to enable a different syntax that allows passing a dotted string to the config_context?

E.g. rewriting https://github.com/scikit-learn/scikit-learn/pull/24497/files#diff-1d31de81e903bd6529fbe68f8009b7113e3b7de4f1465572ef88af4d03a7dc5bR35 such that entry points whose value prefix matches the user-provided string are selected (i.e. using prefix selection rather than str.split(".", 1) to detect the name of the engine).
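The prefix-based selection could look roughly like this (a sketch only; the helper name and entry-point values are illustrative, not the actual scikit-learn code):

```python
def select_engine_entry_points(entry_points, provider_spec):
    """Select entry points whose dotted module path matches provider_spec.

    Matching by prefix (rather than str.split(".", 1)) lets a user pass a
    dotted string such as "sklearn_numba_dpex.debug" to config_context.
    """
    selected = []
    for ep in entry_points:
        # Entry-point values look like "package.module:ClassName".
        module_path = ep.value.split(":", 1)[0]
        if module_path == provider_spec or module_path.startswith(
            provider_spec + "."
        ):
            selected.append(ep)
    return selected
```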

Edit: somewhat related to #21

@jjerphan jjerphan marked this pull request as draft November 28, 2022 11:06
@jjerphan jjerphan changed the title [DRAFT] ENH: Store fitted attributes as dpt and return dpt tensors ENH: Store fitted attributes as dpt and return dpt tensors Nov 28, 2022
@ogrisel (Collaborator) commented Nov 28, 2022:

I think we should allow programmatically registering additional provider names (in addition to the importlib.metadata registry) for testing / debugging / benchmarking purposes:

from sklearn._engine import register_engine
register_engine(
    engine_name="kmeans",
    engine_class="sklearn_numba_dpex.kmeans.engine:KMeansEngineDebug",
    provider_name="sklearn_numba_dpex_debug",
)

In this case we break the assumption that the provider name is the same as the top-level import package of the engine class, so this might impose some refactoring of the engine lookup / import logic so it no longer relies on this assumption.
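A minimal sketch of such a registry, checking programmatic registrations before the entry-point-based lookup (all names here are hypothetical, not the actual sklearn._engine API; the entry-point fallback is elided):

```python
import importlib

# (engine_name, provider_name) -> "module.path:ClassName"
_registered_engines = {}


def register_engine(engine_name, engine_class, provider_name):
    """Programmatically register an engine class path under a provider name."""
    _registered_engines[(engine_name, provider_name)] = engine_class


def resolve_engine(engine_name, provider_name):
    """Look up programmatic registrations first; a real implementation would
    fall back to the importlib.metadata entry-point registry."""
    path = _registered_engines.get((engine_name, provider_name))
    if path is None:
        raise LookupError(f"no engine {engine_name!r} for {provider_name!r}")
    module_path, class_name = path.split(":")
    return getattr(importlib.import_module(module_path), class_name)
```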

Base automatically changed from accept_dpnp_dpt_inputs to main November 29, 2022 12:54
@fcharras (Collaborator, Author) commented:

I agree, although since it requires work on the sklearn side, I think we can still push this PR forward with an environment-variable-based switch in the meantime. I'll add a commit going in this direction.

@fcharras fcharras force-pushed the store_fitted_attributes_as_dpt branch from bd8308c to c9a356d Compare November 30, 2022 13:12
@fcharras fcharras marked this pull request as ready for review November 30, 2022 13:13
@fcharras (Collaborator, Author) commented:

Out of WIP.

@fcharras fcharras requested review from jjerphan and ogrisel November 30, 2022 13:13
@fcharras fcharras force-pushed the store_fitted_attributes_as_dpt branch from c9a356d to ef49d7e Compare November 30, 2022 13:21
@jjerphan (Member) left a comment:


Thanks @fcharras. Here are a few comments.

Review threads (all resolved):
- .github/workflows/run_tests.yaml
- sklearn_numba_dpex/kmeans/engine.py (outdated, ×3)
- sklearn_numba_dpex/kmeans/drivers.py (one outdated, one current)
- sklearn_numba_dpex/kmeans/kernels/utils.py (outdated)
On sklearn_numba_dpex/kmeans/kernels/utils.py:

    initialized with value 1.
    """
    sample_idx = dpex.get_global_id(zero_idx)
    if sample_idx >= n_samples:

@jjerphan (Member) suggested change:

    -if sample_idx >= n_samples:
    +# Early return if result is already False.
    +if sample_idx >= n_samples or result[zero_idx] == zero_idx:

@fcharras (Collaborator, Author) replied Nov 30, 2022:

Note that this adds a read operation and doesn't stop already-scheduled tasks; it just ensures that future tasks will exit early after the read. So on average it's better when the clusterings are different, but not when they are the same. Still a win, since different clusterings should be the most common outcome.
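The pattern under discussion, sketched in plain Python with a sequential loop standing in for the parallel work items (the real code is a numba_dpex kernel; names here follow the snippet above but the harness is illustrative):

```python
import numpy as np


def kernel_body(sample_idx, labels_a, labels_b, result, n_samples):
    # Early return if out of bounds, or if another work item has already
    # recorded a mismatch: the extra read does not cancel queued work items,
    # it only lets them exit before doing any comparison work.
    if sample_idx >= n_samples or result[0] == 0:
        return
    if labels_a[sample_idx] != labels_b[sample_idx]:
        result[0] = 0  # clusterings differ


labels_a = np.array([0, 1, 1, 2])
labels_b = np.array([0, 1, 2, 2])
result = np.ones(1, dtype=np.int32)  # initialized with value 1
for i in range(len(labels_a)):
    kernel_body(i, labels_a, labels_b, result, len(labels_a))
```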

Another review thread on sklearn_numba_dpex/kmeans/kernels/utils.py (outdated, resolved).
Co-authored-by: Julien Jerphanion <[email protected]>
@fcharras (Collaborator, Author) commented Dec 1, 2022:

Comments should be addressed in the last commit, and the suggestions have been applied.

@fcharras fcharras requested a review from jjerphan December 1, 2022 16:24
@jjerphan (Member) left a comment:


LGTM. Thank you, @fcharras!

Review threads (all outdated, resolved):
- .github/workflows/run_tests.yaml
- sklearn_numba_dpex/kmeans/engine.py (×2)
- sklearn_numba_dpex/kmeans/tests/test_kmeans.py
Co-authored-by: Julien Jerphanion <[email protected]>
@fcharras (Collaborator, Author) commented Dec 2, 2022:

So in fact make_get_nb_distinct_clusters_kernel does suffer from a race condition: the unit test sometimes outputs 6, sometimes 7. But it looks more like a bug in numba_dpex atomics than in our code; or maybe I'm just not sure what "loaded atomically" is actually supposed to mean.

@fcharras (Collaborator, Author) commented Dec 2, 2022:

The test was the issue: the correct count is n_clusters = max(labels) + 1, but the test used n_clusters = max(labels).
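The off-by-one, illustrated: labels are zero-based, so the cluster count is the max label plus one, which matches the 6-vs-7 discrepancy observed in the test (the label values here are illustrative):

```python
import numpy as np

labels = np.array([0, 1, 2, 3, 4, 5, 6])  # zero-based cluster labels

n_clusters_buggy = int(labels.max())      # what the test expected: 6
n_clusters = int(labels.max()) + 1        # correct for zero-based labels: 7
```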

@fcharras fcharras merged commit 882cdcf into main Dec 2, 2022
@fcharras fcharras deleted the store_fitted_attributes_as_dpt branch December 2, 2022 09:06