Should we add procedures to handle non-finite-values or let the user take care of valid inputs? #280

bvenn · 2023-07-11T06:41:04Z

bvenn
Jul 11, 2023
Maintainer

As discussed in #272 (comment) and subsequently #272 (comment), @smoothdeveloper suggested adding procedures to handle nan and infinity input values. In data science, such inputs often cause an algorithm to fail or to return useless nan results.
While I see the benefit of such assistance, I fear the user may lose the feeling for the data.
For experienced data scientists it would be a huge benefit not to have to handle missing values themselves and to just select the appropriate behaviour. But first-time users may not feel the need to fully understand the procedures and may rely on defaults that are not applicable. Often the presence of non-finite values indicates an error that was made in previous steps and that should be taken care of.

For reference, the construct may look like the following:

type HandleNonFiniteValues =
    | PropagatesNonFiniteValues
    | DiscardsNonFiniteValues 
    | FailsOnNonFiniteValues
    | ImputeKNN
    | ImputeRandom
    | TransformPlus1Log
    | ...

let analysisPipeLineA param1 param2 data1 data2 (handleNaN: HandleNonFiniteValues option) =

    // fall back to propagating non-finite values when no option is given
    let _handleNaN = defaultArg handleNaN HandleNonFiniteValues.PropagatesNonFiniteValues

    match _handleNaN with
    | FailsOnNonFiniteValues -> // if data1 or data2 contains nan|infinity then failwithf
    | PropagatesNonFiniteValues -> // proceed with data1 data2
    | DiscardsNonFiniteValues ->
        Seq.zip data1 data2
        |> Seq.filter (fun (a,b) -> (* keep the pair only if a and b are both finite *))
        |> // proceed with filtered data
    | _ -> ...

    // proceed with transformed/imputed/filtered inputs
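
As a rough, runnable illustration of the construct above, the following sketch covers only the three cases whose behaviour is spelled out in the pseudocode (fail, propagate, discard). The helper names `isFinite` and `preparePairs` are hypothetical and not part of FSharp.Stats; the imputation and transformation cases are left out on purpose.

```fsharp
open System

// Reduced copy of the union above: only the three cases modelled in this sketch.
type HandleNonFiniteValues =
    | PropagatesNonFiniteValues
    | DiscardsNonFiniteValues
    | FailsOnNonFiniteValues

// Hypothetical helper: true for everything except nan and +/- infinity.
let isFinite (x: float) = not (Double.IsNaN x || Double.IsInfinity x)

// Hypothetical helper: applies the chosen strategy to two paired inputs and
// returns the pairs the actual analysis would continue with.
let preparePairs (handling: HandleNonFiniteValues) (data1: seq<float>) (data2: seq<float>) =
    let pairs = Seq.zip data1 data2 |> Seq.toArray
    match handling with
    | PropagatesNonFiniteValues -> pairs
    | DiscardsNonFiniteValues ->
        // keep a pair only if both members are finite
        pairs |> Array.filter (fun (a, b) -> isFinite a && isFinite b)
    | FailsOnNonFiniteValues ->
        if pairs |> Array.exists (fun (a, b) -> not (isFinite a && isFinite b)) then
            failwith "Input contains non-finite values."
        pairs

let xs = [ 1.0; nan; 3.0 ]
let ys = [ 2.0; 4.0; infinity ]

// Only the first pair survives, because the other two contain nan or infinity.
let kept = preparePairs DiscardsNonFiniteValues xs ys
printfn "%A" kept
```

Discarding pairwise keeps data1 and data2 aligned, which matters for any method that compares the two inputs element by element.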

Despite my concerns, the proposed change would open up methods that first-time data scientists, for example, might otherwise not use. The PCA, for example, fails if the initial matrix contains rows or columns of zeros: after centering, these rows become nan and the PCA ultimately fails. For Correlation.Matrix.pearsonRowWise, too, an automated nan handling would be great!
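
To make the PCA case concrete, here is a minimal sketch under the assumption that the centering step also scales each row or column by its standard deviation (z-score standardization). The post only says "centering", so that is an assumption, and `zScore` is a hypothetical helper, not the FSharp.Stats implementation.

```fsharp
open System

// Hypothetical z-score helper (population standard deviation).
let zScore (column: float[]) =
    let mean = Array.average column
    let variance = column |> Array.averageBy (fun x -> (x - mean) ** 2.0)
    let sd = sqrt variance
    column |> Array.map (fun x -> (x - mean) / sd)

// A column of zeros has mean 0 and standard deviation 0,
// so every entry becomes 0.0 / 0.0, which is nan.
let zeros = zScore [| 0.0; 0.0; 0.0 |]
printfn "%A" zeros
```

Any later step that consumes these values inherits the nan entries, which is how a single all-zero row or column can invalidate the whole analysis.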

In summary, I think there certainly are applications where non-finite-value handling options are helpful and can boost productivity. The handling options should be carefully selected for each individual analysis method and not be defined globally, to avoid confusion. Instead of adding 'selector option parameters to the functions, it would be beneficial to either add overloads

type Analysis() =
    static member analyse(data1, data2, ?handleNaN: HandleNonFiniteValues) =
        // optional argument; defaults to propagating non-finite values
        let _handleNaN = defaultArg handleNaN HandleNonFiniteValues.PropagatesNonFiniteValues
        ...

or add specialized nan-handling functions next to the default ones:

let myAnalysisNaN param1 param2 (handleNaN: HandleNonFiniteValues) = 
    ...

let myAnalysis param1 param2 = 
    myAnalysisNaN param1 param2 HandleNonFiniteValues.PropagatesNonFiniteValues

smoothdeveloper · 2023-07-11T08:21:13Z

smoothdeveloper
Jul 11, 2023

@bvenn thanks for framing the discussion and giving fuller context from a domain perspective.

Since this is a very large problem to solve, I think a first pass at tackling it would be to provide infrastructure so that the implementation of the library relies on the same idioms and components to cover some common use cases, rather than approaching the problem from its most complex part (how workflows compose, whether the semantics could be a runtime thing, etc., which is how I was initially looking at things).

What I mean is that in many places the filtering, erroring out, or imputing has to be put in place in the implementation details; it would be good to have generic functions with stronger semantics than ad hoc code, or at least to consolidate the logic and idioms.

To achieve this, I think the design approach used in the F#+ library (https://fsprojects.github.io/FSharpPlus/) could be a good one.

In FSharp.Stats, each algorithm takes data in a particular form. The simplest is float array, but there are more complex ones, such as ('id * float array) array.

We should perform a census of all the current "input data" shapes and have static methods with overloads for each type, answering questions such as "has non-finite values" or performing the operations.

When someone implements a new module, or changes the shape of the "input data" for an existing one, there may be no overload for the shape that is to be used. In that case, rather than having an ad hoc implementation in the body, an overload would be added as a static method on those entities.

Then, to make it convenient, there can be one inline function that uses the SRTP technique, which makes the code look generic (despite the plumbing needed with overloaded static methods).

Here is an example of how this technique is used in the implementation of F#+:

https://github.com/fsprojects/FSharpPlus/blob/54611620475751bd7accb4d37155ce9dc3f6aa8f/src/FSharpPlus/Control/Foldable.fs#L100-L117

and how it looks in the client code: https://fsprojects.github.io/FSharpPlus/abstraction-foldable.html (look for the foldMap, foldBack and fold functions, and how they can take various types of input, even types defined outside the library).
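
Applied to the "has non-finite values" question, the overload-plus-SRTP idea could be sketched roughly as follows. This is a sketch that mirrors the F#+ pattern, not existing FSharp.Stats code: `NonFiniteCheck`, `isNonFinite` and `hasNonFinite` are hypothetical names, and only two input shapes are covered.

```fsharp
open System

// Hypothetical helper: true for nan and +/- infinity.
let isNonFinite (x: float) = Double.IsNaN x || Double.IsInfinity x

// Hypothetical carrier type: one static overload per "input data" shape.
type NonFiniteCheck =
    // shape 1: plain float array
    static member HasNonFinite (xs: float[], _: NonFiniteCheck) : bool =
        Array.exists isNonFinite xs

    // shape 2: labelled rows, ('id * float array) array
    static member HasNonFinite (xs: ('id * float[])[], _: NonFiniteCheck) : bool =
        xs |> Array.exists (fun (_, values) -> Array.exists isNonFinite values)

    // SRTP plumbing: the matching overload is selected at compile time.
    static member inline Invoke (source: 'Data) : bool =
        let inline call_2 (a: ^a, b: ^b) =
            ((^a or ^b) : (static member HasNonFinite : _ * _ -> _) b, a)
        let inline call (a: 'a, b: 'b) = call_2 (a, b)
        call (Unchecked.defaultof<NonFiniteCheck>, source)

// The single generic-looking entry point used by the modules.
let inline hasNonFinite source = NonFiniteCheck.Invoke source

let plain = [| 1.0; nan; 3.0 |]
let labelled = [| ("a", [| 1.0; 2.0 |]); ("b", [| 3.0; 4.0 |]) |]

printfn "%b" (hasNonFinite plain)
printfn "%b" (hasNonFinite labelled)
```

Supporting a new input shape then means adding one more `HasNonFinite` overload, while the call sites that use `hasNonFinite` stay unchanged.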

We don't necessarily have to go the full length in using SRTP techniques or making generic functions, though. The first concern is to identify all the concepts that are necessary to cover what is done in the implementation details, and to see whether consolidating the logic in a fashion similar to F#+ helps the design of the library.

I think it doesn't bring any runtime cost compared to the ad hoc implementations that inevitably sprout in each module, but it does bring some overhead in the implementation of each module and in the compile time of the library.

If the concepts that are defined are sound to use outside the library, they could be exposed and allow client code to implement the logic for their own data structures.
