OSError: Invalid flatbuffers message. #7346

Open
antecede opened this issue Dec 25, 2024 · 1 comment · May be fixed by #7348

Comments


antecede commented Dec 25, 2024

Describe the bug

When load_dataset loads Arrow files that each contain a large number of large 2D arrays (2,000 arrays of shape 1000 × 1152 in this case), it fails with the error OSError: Invalid flatbuffers message.

When only 300 arrays of this size (1000 × 1152) are stored per file, the files load correctly.

Storing 2,000 2D arrays per file produces about 100 files of roughly 5-6 GB each. Storing 300 2D arrays per file produces about 600 files, which is too many.

Steps to reproduce the bug

error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[2], line 4
      1 from datasets import Dataset
      2 from datasets import load_dataset
----> 4 real_dataset = load_dataset("arrow", data_files='tensorData/real_ResidueTensor/*', split="train")#.with_format("torch") # , split="train"
      5 # sim_dataset = load_dataset("arrow", data_files='tensorData/sim_ResidueTensor/*', split="train").with_format("torch")
      6 real_dataset

File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/datasets/load.py:2151, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2148     return builder_instance.as_streaming_dataset(split=split)
   2150 # Download and prepare data
-> 2151 builder_instance.download_and_prepare(
   2152     download_config=download_config,
   2153     download_mode=download_mode,
   2154     verification_mode=verification_mode,
   2155     num_proc=num_proc,
   2156     storage_options=storage_options,
   2157 )
   2159 # Build dataset for splits
   2160 keep_in_memory = (
   2161     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   2162 )

File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/datasets/builder.py:924, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, dl_manager, base_path, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    922 if num_proc is not None:
    923     prepare_split_kwargs["num_proc"] = num_proc
--> 924 self._download_and_prepare(
    925     dl_manager=dl_manager,
    926     verification_mode=verification_mode,
    927     **prepare_split_kwargs,
    928     **download_and_prepare_kwargs,
    929 )
    930 # Sync info
    931 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/datasets/builder.py:978, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    976 split_dict = SplitDict(dataset_name=self.dataset_name)
    977 split_generators_kwargs = self._make_split_generators_kwargs(prepare_split_kwargs)
--> 978 split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
    980 # Checksums verification
    981 if verification_mode == VerificationMode.ALL_CHECKS and dl_manager.record_checksums:

File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/datasets/packaged_modules/arrow/arrow.py:47, in Arrow._split_generators(self, dl_manager)
     45 with open(file, "rb") as f:
     46     try:
---> 47         reader = pa.ipc.open_stream(f)
     48     except pa.lib.ArrowInvalid:
     49         reader = pa.ipc.open_file(f)

File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/ipc.py:190, in open_stream(source, options, memory_pool)
    171 def open_stream(source, *, options=None, memory_pool=None):
    172     """
    173     Create reader for Arrow streaming format.
    174 
   (...)
    188         A reader for the given source
    189     """
--> 190     return RecordBatchStreamReader(source, options=options,
    191                                    memory_pool=memory_pool)

File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/ipc.py:52, in RecordBatchStreamReader.__init__(self, source, options, memory_pool)
     50 def __init__(self, source, *, options=None, memory_pool=None):
     51     options = _ensure_default_ipc_read_options(options)
---> 52     self._open(source, options=options, memory_pool=memory_pool)

File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/ipc.pxi:1006, in pyarrow.lib._RecordBatchStreamReader._open()

File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/miniforge3/envs/esmIne3/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

OSError: Invalid flatbuffers message.

reproduce: The following is only an example. The real 2D matrices are outputs of the ESM large model, and the matrix size is approximate.

import numpy as np
import pyarrow as pa
import pyarrow.feather as feather
from datasets import load_dataset

# 2,000 random 2D arrays standing in for the real ESM outputs
random_arrays_list = [np.random.rand(1000, 1152) for _ in range(2000)]
table = pa.Table.from_pydict({
    'tensor': [tensor.tolist() for tensor in random_arrays_list]
})

# Write all arrays to a single Feather file
feather.write_feather(table, 'test.arrow')

# Loading the file back raises OSError: Invalid flatbuffers message.
dataset = load_dataset("arrow", data_files='test.arrow', split="train")

Expected behavior

load_dataset should load the dataset normally, as feather.read_feather does:

import pyarrow.feather as feather
feather.read_feather('tensorData/real_ResidueTensor/real_tensor_1.arrow')

Plus load_dataset("parquet", data_files='test.arrow', split="train") works fine

Environment info

  • datasets version: 3.2.0
  • Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • huggingface_hub version: 0.26.5
  • PyArrow version: 18.1.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.9.0

lhoestq (Member) commented Jan 2, 2025

Thanks for reporting, it looks like an issue with pyarrow.ipc.open_stream

Can you try installing datasets from this pull request and see if it helps? #7348
