Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR on hold for months #2311

Open
Deathn0t opened this issue Nov 27, 2024 · 6 comments
Open

PR on hold for months #2311

Deathn0t opened this issue Nov 27, 2024 · 6 comments
Labels
question General question about the software under discussion Issue is currently being discussed

Comments

@Deathn0t
Copy link
Contributor

Hello, I submitted a PR in July for a small code change related to the automated detection of datetime (that happened when it should not).

#2150

The PR has not received any feedback/comment. Will this be ignored?

Thank you very much for your help.

@Deathn0t Deathn0t added new Automatic label applied to new issues question General question about the software labels Nov 27, 2024
@sdv-team
Copy link
Contributor

Hi @Deathn0t, it’s great to see your interest in the SDV ecosystem. The SDV software (and its related libraries) is owned and maintained by DataCebo. It is available under the Business Source License for you to browse.

We are a small team supporting a large set of users with enterprise-specific intricacies and reliability needs. This has required us to be deliberate about setting the roadmap for SDV libraries. As a result, we are unable to prioritize reviewing and accepting external pull requests.

Is your PR is meant to address a bug or feature request? We would greatly appreciate it if you could file an issue instead with the overall description of your problem. We can determine whether it’s aligned with our framework. Once discussed, our team typically resolves smaller issues within a few release cycles. We appreciate your understanding.

@sdv-team sdv-team added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Dec 10, 2024
@Deathn0t
Copy link
Contributor Author

I suggest updating the condition that checks for datetime types in ScalarRange.

The current condition resulted in an exception when the range values could be interpreted as dates (e.g., 1900, 2000) even though the column type is int.

So, I suggested that you raise the exception only when the column is a datetime and not both low/high values of the range are interpreted as datetime.

@npatki
Copy link
Contributor

npatki commented Dec 17, 2024

Thanks for describing this @Deathn0t. I have filed a bug for it in #2324 so that we can track the fix. Please check it out and let us know if the error you are describing is different than what's in the bug.

In this issue, I have also provided a simple workaround that can unblock your project for now (casting the column to a string). At the bottom of this issue, I have also provided some additional context that may be useful (copied below)

  • One important thing to consider is whether the ScalarRange or ScalarInequality constraint is absolutely needed. By default, all SDV synthesizers will enforce the min/max values that are observed in the data. If this default is kept, then you do not need to add any constraints.
  • Constraints are currently set up so that they do not have access to the metadata. In Q1, we plan to streamline constraints as part of a larger project, which means updating this part of the SDV workflow. SDV synthesizers will have access to metadata at the time of validating a constraint. So the "correct" approach here will be to validate against the metadata -- i.e. just check that the sdtype is numerical or datetime in the metadata, regardless of how the data is actually stored (int, string, etc.)

Regarding the second point: coincidentally the team is planning to streamline constraints in Q1, so we might actually resolve this bug as part of this project.

@Deathn0t
Copy link
Contributor Author

If I understand #2324 correctly. Then what I describe is different.

I have a column, for example named "x" which contains integers that are not dates. But, the values are interpreted by SDV as dates by the ScalarRange because they look like dates I guess (year more exactly), and a ValueError exception is thrown.

I tracked the line that does this already: https://github.com/sdv-dev/SDV/pull/2150/files#diff-4dc23c81a73c7caca3948f0875ffef02ec33cad02f24834c282eac20b1d13586R1139

@npatki
Copy link
Contributor

npatki commented Dec 20, 2024

Hi @Deathn0t, does this other issue #2328 capture what you are observing? It's helpful for me to understand how to replicate this problem. If this is something that is actively blocking your projects, I have left a workaround in #2328 that you can use in the meantime.

Re code: I am not as familiar with the internals of this part of our code. However, considering both issues together, it seems like we are generally making some assumptions about data types that shouldn't be needed. Refactoring our entire logic for data validation here may make sense. I think the only validation we really need is:

  • The column's data types should match the data types of the parameters (eg. they should all be ints, or all be strings)
  • If the column is listed as sdtype 'datetime', then the parameters should also be interpretable as datetimes (regardless of whether they are ints or strings)

As I mentioned before, the team will soon do a refactor of all constraints in Q1, so this may be something we save for then (should be soon).

BTW: On a separate note, I am curious about why you'd like to apply the ScalarRange constraint to your data. If it is to strictly enforce min/max values, this is something SDV synthesizers already do by default. This is only tangentially related to your issue, but if you are able to share more about why you need the ScalarRange constraint, we'd better be able to help other users who want to use the same. Thanks.

@Deathn0t
Copy link
Contributor Author

Hi @npatki , yes #2328 is exactly corresponding to my issue! That's "great" you managed to reproduce it as a few updates have come up since I opened the PR.

I am using SDV within Hyperparameter optimization (HPO). To transfer the learning from the previous HPO to the new HPO. It is really just fitting an SDV sampler on previous filtered best outcomes from HPO and then using this sampler within a Bayesian optimization code. For this software http://github.com/deephyper/deephyper, described in this paper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question General question about the software under discussion Issue is currently being discussed
Projects
None yet
Development

No branches or pull requests

3 participants