
[BUG]: [MLJ Interface] SRRegressor likely has too broad a target_scitype #390

Closed
ablaom opened this issue Dec 20, 2024 · 9 comments · Fixed by #392
Labels: bug (Something isn't working)


ablaom commented Dec 20, 2024

What happened?

The target_scitype is probably meant to return AbstractVector{Continuous} or maybe AbstractVector{<:Infinite}, but the current setting is very permissive:

using SymbolicRegression
using MLJBase

MLJBase.target_scitype(SRRegressor)
# AbstractVector (alias for AbstractArray{<:Any, 1})

For example, consider this:

X = (x1 = rand(5), x2 = rand(5))
y = MLJBase.coerce([1, 2, 1, 2, 1], OrderedFactor)

scitype(y)
# AbstractVector{OrderedFactor{2}}

M = SRRegressor
@assert scitype(y) <: target_scitype(M)

verbosity = 0
MLJBase.fit(M(), verbosity, X, y)
# ERROR: MethodError: no method matching _loss(::Vector{…}, ::CategoricalArrays.CategoricalVector{…}, ::L2DistLoss)                  
# The function `_loss` exists, but no method is defined for this combination of argument types.

# Closest candidates are:
#   _loss(::AbstractArray{T}, ::AbstractArray{T}, ::LT) where {T, LT<:Union{Function, SupervisedLoss}}                                      
#    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/Znzix/src/LossFunctions.jl:15

Version

1.5.0

Operating System

macOS

Interface

Julia REPL

Relevant log output

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 12 × Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 12 default, 0 interactive, 6 GC (on 12 virtual cores)
Environment:
  JULIA_LTS_PATH = /Applications/Julia-1.10.app/Contents/Resources/julia/bin/julia
  JULIA_PATH = /Applications/Julia-1.11.app/Contents/Resources/julia/bin/julia
  JULIA_EGLOT_PATH = /Applications/Julia-1.7.app/Contents/Resources/julia/bin/julia
  JULIA_NUM_THREADS = 12
  JULIA_NIGHTLY_PATH = /Applications/Julia-1.11.app/Contents/Resources/julia/bin/julia

Extra Info

No response

ablaom added the bug label Dec 20, 2024

MilesCranmer commented Dec 20, 2024

This is technically correct: with TemplateExpression as the expression_type, you can regress on an arbitrary struct according to a user-defined output function (https://ai.damtp.cam.ac.uk/symbolicregression/stable/examples/template_expression/). This does require passing a custom elementwise_loss, though. In that example, the target has element type

struct Force{T}
    x::T
    y::T
    z::T
end
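
For illustration only (a hypothetical sketch, not taken from the docs), an elementwise loss for such a target could simply compare components, and would be passed as elementwise_loss=force_loss alongside the template-specific options:

# Hypothetical sketch: squared Euclidean distance between a predicted Force and a target Force.
# It would be supplied via `elementwise_loss=force_loss`, together with the
# TemplateExpression-specific options (omitted here).
force_loss(pred::Force, target::Force) =
    (pred.x - target.x)^2 + (pred.y - target.y)^2 + (pred.z - target.z)^2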

Maybe I should try to define the target_scitype based on the parameters of SRRegressor - and only permit Any for the target_scitype when both elementwise_loss and expression_type are set to something custom?

@MilesCranmer

What about this?

MMI.metadata_model(
    SRRegressor;
    input_scitype,
    target_scitype=AbstractVector{<:Any},
    supports_weights=true,
    reports_feature_importances=false,
    load_path="SymbolicRegression.MLJInterfaceModule.SRRegressor",
    human_name="Symbolic Regression via Evolutionary Search",
)
# ^ Existing code

# For the standard `Expression` type, we require a continuous target
MMI.target_scitype(::Type{<:SRRegressor{<:Any,<:Any,Expression}}) = AbstractVector{<:Continuous}


ablaom commented Dec 20, 2024

The searchable MLJ model registry basically requires that traits be parameter-independent, so your suggestion doesn't work. This isn't really something we can change, because encoding parameter-dependent traits in the registry would be a big complication. The only thing that gets written into the registry is the trait applied to the union type. (See also here, from another project.)

The main suggestions I can think of are:

  1. Implement fallback loss functions that will always work. If these are sometimes "dummy" loss functions, issue a warning suggesting the user may want to provide a custom one.

  2. Make a more restrictive (parameter-independent) target_scitype declaration, ruling out non-Continuous targets. Users could still provide other targets, but would need to override MLJ warnings with default_scitype_check_level. This isn't a well-tested solution (a minimal user-side sketch follows below).
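
For illustration, here is a minimal user-side sketch of option 2, using the scitype_check_level machinery documented further down this thread:

# Sketch only: with the restrictive trait in place, a user who really wants a
# non-Continuous target can relax or skip the scitype check.
mach = machine(SRRegressor(), X, y; scitype_check_level=0)  # skip checks for this machine
fit!(mach)

# ...or globally, before constructing machines:
default_scitype_check_level(0)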

Can you make one of these work?

You could alternatively split your model into two different models - one for regular targets, and one for unorthodox targets, but that still leaves you with the problem of specifying a loss fallback for the unorthodox model.


MilesCranmer commented Dec 21, 2024

Thanks, makes sense to me. Questions:

  • If I make the target_scitype parameter-dependent, could that still be used inside Julia? It's just that the model registry would be unaware of it?
  • Is there a way I can automatically set default_scitype_check_level when the user provides a custom loss_function and/or expression_type?
    • Then I can restrict to Continuous for the searchable database (where default regressor initialization will work out-of-the-box) while still avoiding warnings for users who know they want to use these branches of the code.
  • Could one option be to provide detailed error messages with how a user can update the parameters to make it work? Like I can just say "To regress elements of type $T, please use the TemplateExpression interface."
    • Or does this hurt automation, like users who want to simply query+train over all models with a matching scitype – I guess you'd prefer things to work out-of-the-box?
    • On the other hand, a user could technically get a SRRegressor to work for any type, with the right expression_type. So would hiding it from the index prevent them from using a model that might actually fit their use-case?

I don't think there's a way to have a dummy loss function here. When I say you can regress anything, I really do mean anything – floats, bools, tuples, named tuples, structs, hierarchical structs, images, symbols, strings, pointers to complicated C++ objects, etc., etc. You just need to pass whatever expression_type and elementwise_loss (doesn't even need to be differentiable), put that in a vector of examples, and you can regress it. (SymbolicRegression.jl is basically evolving code to fit a target)

It's also the case that a user can define a custom subtype of AbstractExpression and pass that as expression_type. So only detecting TemplateExpression wouldn't cover everything, though I suppose it would help.


ablaom commented Dec 21, 2024

Thanks, makes sense to me. Questions:

  • If I make the target_scitype parameter-dependent, could that still be used inside Julia? It's just that the model registry would be unaware of it?

Yes, you can make parameter-dependent target_scitype declarations that I think would be operative in most (all) MLJ workflows, excluding model search based on the registry. However, to get a valid entry into the registry, you would need (without catch-all loss fallbacks) to make the union-type declaration suitably restrictive: target_scitype(::Type{<:SRRegressor}) = AbstractVector{Continuous}, not Any as you have above. You can add specialisations for certain types, but that isn't going to help you, because you want to specialise on what the parameters are not, rather than what they are, right?

  • Is there a way I can automatically set default_scitype_check_level when the user provides a custom loss_function and/or expression_type?

You could. You would have to issue a warning telling the user you're changing a global setting. This sounds like a drastic hack to me.

  • Then I can restrict to Continuous for the searchable database (where default regressor initialization will work out-of-the-box) while still avoiding warnings for users who know they want to use these branches of the code.
  • Could one option be to provide detailed error messages with how a user can update the parameters to make it work? Like I can just say "To regress elements of type $T, please use the TemplateExpression interface."
    • Or does this hurt automation, like users who want to simply query+train over all models with a matching scitype – I guess you'd prefer things to work out-of-the-box?

Yeah, it's good if things work out of the box. (And out-of-the-box functionality is needed if your model is to remain part of MLJ integration tests.)

  • On the other hand, a user could technically get a SRRegressor to work for any type, with the right expression_type. So would hiding it from the index prevent them from using a model that might actually fit their use-case?

Yes, using the more restrictive target scitype in the registry shouldn't prevent the user from getting it to work (by changing the scitype check level or somehow adding appropriate parameter-specializations of target_scitype). But yes, the cost is that the extra functionality is not discoverable in model-registry based search.

I don't think there's a way to have a dummy loss function here. When I say you can regress anything, I really do mean anything – floats, bools, tuples, named tuples, structs, hierarchical structs, images, symbols, strings, pointers to complicated C++ objects, etc., etc. You just need to pass whatever expression_type and elementwise_loss (doesn't even need to be differentiable), put that in a vector of examples, and you can regress it. (SymbolicRegression.jl is basically evolving code to fit a target)

Okay. I can't imagine common use cases for this kind of thing, but I'm happy to take your word for it. What about a misclassification rate, in the case the data is not of Infinite scitype?
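
(Purely as a hypothetical sketch of that idea, not an existing helper in the package: such a fallback could be a zero-one loss passed as elementwise_loss.)

# Hypothetical sketch: a zero-one "misclassification" elementwise loss, usable for any
# target with == defined; it need not be differentiable.
zero_one_loss(pred, target) = pred == target ? 0.0 : 1.0
# e.g. SRRegressor(elementwise_loss=zero_one_loss, ...)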

@MilesCranmer

Ok, I think the best is probably just this:

target_scitype(::Type{<:SRRegressor}) = AbstractVector{Continuous}
target_scitype(::Type{<:SRRegressor{<:Any,<:Any,TemplateExpression}}) = AbstractVector{<:Any}

So it appears in the registry while also not flagging people using TemplateExpression at runtime. I'll push a fix.

I think people defining their own expression types are probably comfortable enough getting their hands dirty that they wouldn't be scared off by mismatched scitype warnings anyways, so I'll avoid trying to include other expression types.

Is there something other than Any I can use to indicate that the TemplateExpression version should work for all scitypes? Or is that the right option?

@MilesCranmer

Let me know if you're happy with #392


ablaom commented Dec 21, 2024

Looks like a good plan. I can look at #392 tomorrow or the next day.

Is there something other than Any I can use to indicate that the TemplateExpression version should work for all scitypes? Or is that the right option?

If you declare target_scitype to be Unknown, the scitype checks are skipped for the default scitype_check_level. Sorry, I should have mentioned that earlier.

Here is the detailed logic:

help?> default_scitype_check_level
search: default_scitype_check_level

  default_scitype_check_level()

  Return the current global default value for scientific type checking when
  constructing machines.

  default_scitype_check_level(i::Integer)

  Set the global default value for scientific type checking to i.

  The effect of the scitype_check_level option in calls of the form
  machine(model, data, scitype_check_level=...) is summarized below:

  scitype_check_level  Inspect scitypes? If Unknown in scitypes If other scitype mismatch
  –––––––––––––––––––– ––––––––––––––––– –––––––––––––––––––––– –––––––––––––––––––––––––
  0                            ×                                                         
  1 (value at startup)         ✓                                         warning         
  2                            ✓                warning                  warning         
  3                            ✓                warning                   error          
  4                            ✓                 error                    error          

  See also machine

@MilesCranmer

Cool! Ok I will do that for the TemplateExpression case, and move back to AbstractVector{<:Continuous} as the general case.
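
Putting the plan together, the resulting declarations would look roughly like this (a sketch only; the actual fix is in #392, and the parameter positions mirror the earlier snippets in this thread):

# Sketch of the agreed plan: restrictive general case, with checks effectively
# skipped for the TemplateExpression case via Unknown.
MMI.target_scitype(::Type{<:SRRegressor}) = AbstractVector{<:Continuous}
MMI.target_scitype(::Type{<:SRRegressor{<:Any,<:Any,TemplateExpression}}) = Unknown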
