OOM When Reading "Short But Wide" (i.e., >100k columns, <1000 rows) partitioned data #171

Open
CalvinLeather opened this issue Dec 23, 2022 · 0 comments

Comments


CalvinLeather commented Dec 23, 2022

Largely leaving this here for others who may run into similarly weird use cases and the resulting problems.

In genomics, we often have "short but wide" datasets due to the width of genomic information (e.g., 100 people's data for 1M+ locations in their DNA). This library (like many other parquet libraries) seems to have some issues representing the metadata for data of this shape. It's an odd abuse of parquet files, but it is somewhat common in genomics, e.g. https://medium.com/23andme-engineering/genetic-datastore-4b213256db31.

Anyway, to recreate: take 2 or 3 parquet partitions with >100k float- or int-typed columns and ~100 rows each, and call read_parquet(path) on the partitioned directory (rough sketch below). This OOM'd for us on OSX Big Sur (will follow up with more details). pyarrow loaded the same partitions without issue (albeit really slowly). The partitions in our example were each <100 MB (and only a few partitions, on a machine with >32 GB of RAM). Parquet.File(filename) works just fine on a single partition (as stated below, this is an odd use case, so I'm not going to dig much more yet).
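A rough sketch of what such a reproduction could look like in Julia (the write_parquet/DataFrames.jl usage and file layout here are illustrative assumptions for generating test data, not our exact pipeline):

```julia
# Generate a few "short but wide" parquet partitions, then try to read them back.
# Assumes Parquet.jl's write_parquet/read_parquet and DataFrames.jl; sizes are illustrative.
using Parquet, DataFrames

ncols, nrows, nparts = 100_000, 100, 3
dir = mktempdir()

for p in 1:nparts
    # One partition: 100 rows x 100k Float64 columns (auto-named x1, x2, ...).
    df = DataFrame(rand(nrows, ncols), :auto)
    write_parquet(joinpath(dir, "part-$p.parquet"), df)
end

# Opening a single partition works fine:
pf = Parquet.File(joinpath(dir, "part-1.parquet"))

# Reading the partitioned directory is where we saw the OOM:
tbl = read_parquet(dir)
```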

So, I'm pretty sure something about loading the header metadata for data of this width is causing the trouble. I'm not expecting this weird "short but wide" edge case to be accommodated, so I'm probably not going to investigate the source code yet; I'm largely leaving this issue as a signpost for other genomics folks who were attempting to use, e.g., JWAS with this library and loaded in floating-point-encoded genomic data.

Credit to @RyanGannon-Embark, who found this mostly without me; I'm mostly the messenger here, since I'm working on documenting our path forward.
