Replies: 1 comment
-
@bvenn thanks for framing the discussion and giving fuller context from the domain perspective.
Since this is a very large problem to solve, I think a first pass would be to provide infrastructure so that the library's implementation relies on the same idioms and components to cover some common use cases, rather than starting from the most complex part (how workflows compose, whether semantics could be a runtime thing, etc., which is how I was initially looking at things). What I mean is that in many places, filtering, erroring out, or imputing has to be put in place in the implementation details. It would be good to have generic functions with stronger semantics than ad hoc code, or at least to consolidate the logic and idioms.
To achieve this, I think the design approach used in the F#+ library (https://fsprojects.github.io/FSharpPlus/) could be a good one.
In FSharp.Stats, each algorithm takes data in a particular form, say the simplest is …
We should perform a census of all the current "input data" shapes and have static methods with overloads for each type, answering questions such as "has non-finite values" or performing the operations. When someone implements a new module, or changes the shape of the "input data" for an existing one, there may be no overload for the shape to be used. In that case, rather than writing an ad hoc implementation in the body, an overload would be added as a static method in those entities.
Then, to make it convenient, there can be one inline function that uses the SRTP (statically resolved type parameters) technique, which makes the code look generic (despite the plumbing needed with overloaded static methods).
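The overload-plus-inline-function idea can be sketched roughly as follows. This is a minimal illustration and not code from FSharp.Stats or F#+: the names `isNonFinite`, `NonFiniteCheck` and `hasNonFinite` are hypothetical, and it uses the simple operator-based form of SRTP dispatch rather than the more elaborate `Invoke` plumbing that F#+ uses.

```fsharp
open System

// Hypothetical helper, not part of FSharp.Stats.
let isNonFinite (x: float) = Double.IsNaN x || Double.IsInfinity x

// Hypothetical dispatch type: one overload per "input data" shape.
// Supporting a new shape means adding one more overload here.
type NonFiniteCheck =
    | NonFiniteCheck
    static member ($) (NonFiniteCheck, xs: float[]) : bool =
        Array.exists isNonFinite xs
    static member ($) (NonFiniteCheck, xs: float list) : bool =
        List.exists isNonFinite xs
    static member ($) (NonFiniteCheck, xs: float[,]) : bool =
        xs |> Seq.cast<float> |> Seq.exists isNonFinite

// One inline entry point; the overload is resolved at compile time via SRTP,
// so the call site looks generic.
let inline hasNonFinite data : bool = NonFiniteCheck $ data

let a = hasNonFinite [| 1.0; nan |]                                // true
let b = hasNonFinite [ 1.0; 2.0 ]                                  // false
let c = hasNonFinite (array2D [ [ 1.0; infinity ]; [ 0.0; 1.0 ] ]) // true
```

Calling `hasNonFinite` on a shape without an overload is a compile-time error, which is the signal to add the missing overload instead of writing ad hoc code in the module body.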
Here is an example of how this technique is used in the implementation of F#+: …
and how it looks in the client code: https://fsprojects.github.io/FSharpPlus/abstraction-foldable.html (looking for …)
We don't necessarily have to go the full length in using SRTP techniques or making generic functions, though. The first concern is to identify all the concepts needed to cover what is done in the implementation details, and to see whether consolidating the logic in a similar fashion to F#+ helps the design of the library.
I think it doesn't bring any runtime cost compared to the ad hoc implementations that inevitably sprout in each module, but it does bring some overhead in implementing a module, and in the compile time of the library. If the concepts that are defined are sound to use outside the library, they could be exposed to allow client code to implement the logic for its own data structures.
-
As discussed in #272 (comment) and subsequently #272 (comment), @smoothdeveloper suggested adding procedures to handle nan and infinity input values. In data science, these inputs often cause an algorithm to fail or to return useless nan results.
While I see the benefit of such assistance, I fear the user may lose the feeling for the data.
For experienced data scientists, it would be a huge benefit not to have to take care of missing values themselves and to just select the appropriate behaviour. But first-time users may not feel the need to fully understand the procedures and may rely on defaults that are not applicable. Often the presence of non-finite values indicates an error made in a previous step, which should be taken care of.
For reference, the construct may look like the following:
Despite my concerns, the proposed change would make methods usable that first-time data scientists might otherwise avoid. The PCA, for example, fails if the initial matrix contains rows or columns of zeros. After centering, these rows become nan, with the PCA ultimately failing. Or for `Correlation.Matrix.pearsonRowWise`, automated nan handling would be great!
In summary, I think there certainly are applications where non-finite-value handling options are helpful and can boost productivity. The handling options should be carefully selected for each individual analysis method and not be defined globally, to avoid confusion. Instead of adding `option 'selector` parameters, it would be beneficial to either add overloads or add specialized nan-handling functions next to the default ones:
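As a hedged illustration of the second option (a specialized function next to the default one), here is a minimal sketch. It is not the snippet from the original post and not the FSharp.Stats API: `pearson` and `pearsonOmitNonFinite` are hypothetical names, and dropping whole pairs is only one of several possible handling strategies.

```fsharp
open System

/// Default version (hypothetical): any nan in the input propagates to the result.
let pearson (xs: float[]) (ys: float[]) : float =
    let mx = Array.average xs
    let my = Array.average ys
    // Sum of products of deviations; Array.fold2 throws if the lengths differ.
    let cov = Array.fold2 (fun acc x y -> acc + (x - mx) * (y - my)) 0.0 xs ys
    let sx = xs |> Array.sumBy (fun x -> (x - mx) * (x - mx))
    let sy = ys |> Array.sumBy (fun y -> (y - my) * (y - my))
    cov / sqrt (sx * sy)

/// Specialized variant (hypothetical): drops every pair in which either value
/// is nan or infinite, then delegates to the default function.
let pearsonOmitNonFinite (xs: float[]) (ys: float[]) : float =
    let isFinite x = not (Double.IsNaN x || Double.IsInfinity x)
    let kept = Array.zip xs ys |> Array.filter (fun (x, y) -> isFinite x && isFinite y)
    pearson (Array.map fst kept) (Array.map snd kept)

let xs = [| 1.0; 2.0; 3.0; nan |]
let ys = [| 2.0; 4.0; 6.0; 8.0 |]

let plain = pearson xs ys                  // nan, because the nan propagates
let cleaned = pearsonOmitNonFinite xs ys   // 1.0, computed from the first three pairs
```

Keeping the two functions separate means the default behaviour stays unchanged, and the user has to opt in to the nan handling by name, which addresses the concern about silently relying on defaults.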