
[BUG]: [MLJ Interface] SRRegressor likely has too broad a target_scitype #390

Closed
ablaom opened this issue Dec 20, 2024 · 9 comments · Fixed by #392
Labels: bug (Something isn't working)


ablaom commented Dec 20, 2024

What happened?

The target_scitype is probably meant to return AbstractVector{Continuous} or maybe AbstractVector{<:Infinite}, but the current setting is very permissive:

using SymbolicRegression
using MLJBase

MLJBase.target_scitype(SRRegressor)
# AbstractVector (alias for AbstractArray{<:Any, 1})

For example, consider this:

X = (x1 = rand(5), x2 = rand(5))
y = MLJBase.coerce([1, 2, 1, 2, 1], OrderedFactor)

scitype(y)
# AbstractVector{OrderedFactor{2}}

M = SRRegressor
@assert scitype(y) <: target_scitype(M)

verbosity = 0
MLJBase.fit(M(), verbosity, X, y)
# ERROR: MethodError: no method matching _loss(::Vector{…}, ::CategoricalArrays.CategoricalVector{…}, ::L2DistLoss)                  
# The function `_loss` exists, but no method is defined for this combination of argument types.

# Closest candidates are:
#   _loss(::AbstractArray{T}, ::AbstractArray{T}, ::LT) where {T, LT<:Union{Function, SupervisedLoss}}                                      
#    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/Znzix/src/LossFunctions.jl:15

Version

1.5.0

Operating System

macOS

Interface

Julia REPL

Relevant log output

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 12 × Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 12 default, 0 interactive, 6 GC (on 12 virtual cores)
Environment:
  JULIA_LTS_PATH = /Applications/Julia-1.10.app/Contents/Resources/julia/bin/julia
  JULIA_PATH = /Applications/Julia-1.11.app/Contents/Resources/julia/bin/julia
  JULIA_EGLOT_PATH = /Applications/Julia-1.7.app/Contents/Resources/julia/bin/julia
  JULIA_NUM_THREADS = 12
  JULIA_NIGHTLY_PATH = /Applications/Julia-1.11.app/Contents/Resources/julia/bin/julia

Extra Info

No response

ablaom added the bug label Dec 20, 2024

MilesCranmer commented Dec 20, 2024

This is technically correct: with TemplateExpression as the expression_type, you can regress on an arbitrary struct according to a user-defined output function (https://ai.damtp.cam.ac.uk/symbolicregression/stable/examples/template_expression/). This does require passing a custom elementwise_loss, though. In that example, the target has element type

struct Force{T}
    x::T
    y::T
    z::T
end
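
For illustration only (a hypothetical sketch, not taken from the docs), an elementwise loss for such a target could simply compare components, and would be passed as elementwise_loss=force_loss alongside the template-specific options:

# Hypothetical sketch: squared Euclidean distance between a predicted Force and a target Force.
# It would be supplied via `elementwise_loss=force_loss`, together with the
# TemplateExpression-specific options (omitted here).
force_loss(pred::Force, target::Force) =
    (pred.x - target.x)^2 + (pred.y - target.y)^2 + (pred.z - target.z)^2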

Maybe I should try to define the target_scitype based on the parameters of SRRegressor - and only permit Any for the target_scitype when both elementwise_loss and expression_type are set to something custom?

@MilesCranmer

What about this?

MMI.metadata_model(
    SRRegressor;
    input_scitype,
    target_scitype=AbstractVector{<:Any},
    supports_weights=true,
    reports_feature_importances=false,
    load_path="SymbolicRegression.MLJInterfaceModule.SRRegressor",
    human_name="Symbolic Regression via Evolutionary Search",
)
# ^ Existing code

# For the standard `Expression` type, we require a continuous target
MMI.target_scitype(::Type{<:SRRegressor{<:Any,<:Any,Expression}}) = AbstractVector{<:Continuous}


ablaom commented Dec 20, 2024

The searchable MLJ model registry basically requires that traits be parameter-independent, so your suggestion doesn't work. This isn't really something we can change, because encoding parameter-dependent traits in the registry would be a big complication. The only thing that gets written into the registry is the trait applied to the union type. (See also here, from another project.)

The main suggestions I can think of are:

  1. Implement fallback loss functions that will always work. If these are sometimes "dummy" loss functions, issue a warning suggesting the user may want to provide a custom one.

  2. Make a more restrictive (parameter-independent) target_scitype declaration, ruling out non-Continuous targets. Users could still provide other targets, but would need to override MLJ warnings with default_scitype_check_level. This isn't a well-tested solution (a minimal user-side sketch follows below).
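
For illustration, here is a minimal user-side sketch of option 2, using the scitype_check_level machinery documented further down this thread:

# Sketch only: with the restrictive trait in place, a user who really wants a
# non-Continuous target can relax or skip the scitype check.
mach = machine(SRRegressor(), X, y; scitype_check_level=0)  # skip checks for this machine
fit!(mach)

# ...or globally, before constructing machines:
default_scitype_check_level(0)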

Can you make one of these work?

You could alternatively split your model into two different models - one for regular targets, and one for unorthodox targets, but that still leaves you with the problem of specifying a loss fallback for the unorthodox model.


MilesCranmer commented Dec 21, 2024

Thanks, makes sense to me. Questions:

  • If I make the target_scitype parameter-dependent, could that still be used inside Julia? It's just that the model registry would be unaware of it?
  • Is there a way I can automatically set default_scitype_check_level when the user provides a custom loss_function and/or expression_type?
    • Then I can restrict to Continuous for the searchable database (where default regressor initialization will work out-of-the-box) while still avoiding warnings for users who know they want to use these branches of the code.
  • Could one option be to provide detailed error messages with how a user can update the parameters to make it work? Like I can just say "To regress elements of type $T, please use the TemplateExpression interface."
    • Or does this hurt automation, like users who want to simply query+train over all models with a matching scitype – I guess you'd prefer things to work out-of-the-box?
    • On the other hand, a user could technically get a SRRegressor to work for any type, with the right expression_type. So would hiding it from the index prevent them from using a model that might actually fit their use-case?

I don't think there's a way to have a dummy loss function here. When I say you can regress anything, I really do mean anything – floats, bools, tuples, named tuples, structs, hierarchical structs, images, symbols, strings, pointers to complicated C++ objects, etc., etc. You just need to pass whatever expression_type and elementwise_loss (doesn't even need to be differentiable), put that in a vector of examples, and you can regress it. (SymbolicRegression.jl is basically evolving code to fit a target)

It's also the case that a user can define a custom subtype of AbstractExpression and pass that as expression_type. So only detecting TemplateExpression wouldn't cover everything, though I suppose it would help.


ablaom commented Dec 21, 2024

Thanks, makes sense to me. Questions:

  • If I make the target_scitype parameter-dependent, could that still be used inside Julia? It's just that the model registry would be unaware of it?

Yes, you can make parameter-dependent target_scitype declarations that I think would be operative in most (all) MLJ workflows, excluding model search based on the registry. However, to get a valid entry into the registry, you would need (without catch-all loss fallbacks) to make the union-type declaration suitably restrictive: target_scitype(::Type{<:SRRegressor}) = AbstractVector{Continuous}, not Any as you have above. You can add specialisations for certain types, but that isn't going to help you, because you want to specialise on what the parameters are not, rather than what they are, right?

  • Is there a way I can automatically set default_scitype_check_level when the user provides a custom loss_function and/or expression_type?

You could. You would have to issue a warning telling the user you're changing a global setting. This sounds like a drastic hack to me.

  • Then I can restrict to Continuous for the searchable database (where default regressor initialization will work out-of-the-box) while still avoiding warnings for users who know they want to use these branches of the code.
  • Could one option be to provide detailed error messages with how a user can update the parameters to make it work? Like I can just say "To regress elements of type $T, please use the TemplateExpression interface."
    • Or does this hurt automation, like users who want to simply query+train over all models with a matching scitype – I guess you'd prefer things to work out-of-the-box?

Yeah, it's good if things work out of the box. (And out-of-the-box functionality is needed if your model is to remain part of MLJ integration tests.)

  • On the other hand, a user could technically get a SRRegressor to work for any type, with the right expression_type. So would hiding it from the index prevent them from using a model that might actually fit their use-case?

Yes, using the more restrictive target scitype in the registry shouldn't prevent the user from getting it to work (by changing the scitype check level or somehow adding appropriate parameter-specializations of target_scitype). But yes, the cost is that the extra functionality is not discoverable in model-registry based search.

I don't think there's a way to have a dummy loss function here. When I say you can regress anything, I really do mean anything – floats, bools, tuples, named tuples, structs, hierarchical structs, images, symbols, strings, pointers to complicated C++ objects, etc., etc. You just need to pass whatever expression_type and elementwise_loss (doesn't even need to be differentiable), put that in a vector of examples, and you can regress it. (SymbolicRegression.jl is basically evolving code to fit a target)

Okay. I can't imagine common use cases for this kind of thing, but I'm happy to take your word for it. What about a misclassification rate, in the case the data is not of Infinite scitype?
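
(Purely as a hypothetical sketch of that idea, not an existing helper in the package: such a fallback could be a zero-one loss passed as elementwise_loss.)

# Hypothetical sketch: a zero-one "misclassification" elementwise loss, usable for any
# target with == defined; it need not be differentiable.
zero_one_loss(pred, target) = pred == target ? 0.0 : 1.0
# e.g. SRRegressor(elementwise_loss=zero_one_loss, ...)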

@MilesCranmer

Ok, I think the best is probably just this:

target_scitype(::Type{<:SRRegressor}) = AbstractVector{Continuous}
target_scitype(::Type{<:SRRegressor{<:Any,<:Any,TemplateExpression}}) = AbstractVector{<:Any}

So it appears in the registry while also not flagging people using TemplateExpression at runtime. I'll push a fix.

I think people defining their own expression types are probably comfortable enough getting their hands dirty that they wouldn't be scared off by mismatched scitype warnings anyways, so I'll avoid trying to include other expression types.

Is there something other than Any I can use to indicate that the TemplateExpression version should work for all scitypes? Or is that the right option?

@MilesCranmer

Let me know if you're happy with #392


ablaom commented Dec 21, 2024

Looks like a good plan. I can look at #392 tomorrow or the next day.

Is there something other than Any I can use to indicate that the TemplateExpression version should work for all scitypes? Or is that the right option?

If you declare target_scitype to be Unknown, the scitype checks are skipped for the default scitype_check_level. Sorry, I should have mentioned that earlier.

Here is the detailed logic:

help?> default_scitype_check_level
search: default_scitype_check_level

  default_scitype_check_level()

  Return the current global default value for scientific type checking when
  constructing machines.

  default_scitype_check_level(i::Integer)

  Set the global default value for scientific type checking to i.

  The effect of the scitype_check_level option in calls of the form
  machine(model, data, scitype_check_level=...) is summarized below:

  scitype_check_level  Inspect scitypes? If Unknown in scitypes If other scitype mismatch
  –––––––––––––––––––– ––––––––––––––––– –––––––––––––––––––––– –––––––––––––––––––––––––
  0                            ×                                                         
  1 (value at startup)         ✓                                         warning         
  2                            ✓                warning                  warning         
  3                            ✓                warning                   error          
  4                            ✓                 error                    error          

  See also machine

@MilesCranmer

Cool! Ok I will do that for the TemplateExpression case, and move back to AbstractVector{<:Continuous} as the general case.
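
Putting the plan together, the resulting declarations would look roughly like this (a sketch only; the actual fix is in #392, and the parameter positions mirror the earlier snippets in this thread):

# Sketch of the agreed plan: restrictive general case, with checks effectively
# skipped for the TemplateExpression case via Unknown.
MMI.target_scitype(::Type{<:SRRegressor}) = AbstractVector{<:Continuous}
MMI.target_scitype(::Type{<:SRRegressor{<:Any,<:Any,TemplateExpression}}) = Unknown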
