one more comment
lhoestq committed Apr 20, 2022
1 parent f8a8553 commit 7b73fee
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions src/datasets/features/image.py
@@ -80,6 +80,7 @@ def encode_example(self, value: Union[str, dict, np.ndarray, "PIL.Image.Image"])
             # we set "bytes": None to not duplicate the data if they're already available locally
             return {"bytes": None, "path": value.get("path")}
         elif value.get("bytes") is not None or value.get("path") is not None:
+            # store the image bytes, and path is used to infer the image format using the file extension
             return {"bytes": value.get("bytes"), "path": value.get("path")}
         else:
             raise ValueError(
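The dict-handling branches in this hunk can be read as a small standalone function. The following is only a sketch under stated assumptions, not the library's implementation: the name `encode_image_dict` is hypothetical, the condition guarding the first branch is not visible in the hunk and is guessed from its comment ("already available locally"), and the error message is a placeholder because the hunk cuts off at `raise ValueError(`.

```python
import os


def encode_image_dict(value: dict) -> dict:
    """Hypothetical standalone version of the dict branches shown in the hunk."""
    path = value.get("path")
    # ASSUMPTION: the real guard is outside the hunk; its comment suggests
    # a check that the file already exists on the local disk.
    if path is not None and os.path.isfile(path):
        # "bytes" is set to None so the data is not duplicated
        # when the file is already available locally.
        return {"bytes": None, "path": path}
    elif value.get("bytes") is not None or path is not None:
        # Store the image bytes; the path is kept so the image format
        # can be inferred from the file extension.
        return {"bytes": value.get("bytes"), "path": path}
    else:
        # Placeholder message: the original text is truncated in the hunk.
        raise ValueError("expected a dict with a non-null 'bytes' or 'path' entry")
```

In this reading, the commented branch covers inputs such as in-memory bytes paired with a file name that does not exist locally: the bytes carry the data and the extension of the path hints at the format.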

1 comment on commit 7b73fee

@github-actions

Show benchmarks

PyArrow==5.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009928 / 0.011353 (-0.001425) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003894 / 0.011008 (-0.007114) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.031206 / 0.038508 (-0.007302) |
| read_batch_unformated after write_array2d | 0.035579 / 0.023109 (0.012469) |
| read_batch_unformated after write_flattened_sequence | 0.315099 / 0.275898 (0.039201) |
| read_batch_unformated after write_nested_sequence | 0.350021 / 0.323480 (0.026541) |
| read_col_formatted_as_numpy after write_array2d | 0.007877 / 0.007986 (-0.000108) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004823 / 0.004328 (0.000495) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008966 / 0.004250 (0.004716) |
| read_col_unformated after write_array2d | 0.040210 / 0.037052 (0.003157) |
| read_col_unformated after write_flattened_sequence | 0.295379 / 0.258489 (0.036890) |
| read_col_unformated after write_nested_sequence | 0.348010 / 0.293841 (0.054169) |
| read_formatted_as_numpy after write_array2d | 0.032042 / 0.128546 (-0.096504) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009630 / 0.075646 (-0.066016) |
| read_formatted_as_numpy after write_nested_sequence | 0.254056 / 0.419271 (-0.165216) |
| read_unformated after write_array2d | 0.051583 / 0.043533 (0.008050) |
| read_unformated after write_flattened_sequence | 0.310839 / 0.255139 (0.055700) |
| read_unformated after write_nested_sequence | 0.339914 / 0.283200 (0.056714) |
| write_array2d | 0.103136 / 0.141683 (-0.038547) |
| write_flattened_sequence | 1.819530 / 1.452155 (0.367375) |
| write_nested_sequence | 1.811373 / 1.492716 (0.318656) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.226653 / 0.018006 (0.208646) |
| get_batch_of_1024_rows | 0.442383 / 0.000490 (0.441893) |
| get_first_row | 0.007262 / 0.000200 (0.007062) |
| get_last_row | 0.000283 / 0.000054 (0.000228) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.027534 / 0.037411 (-0.009877) |
| shard | 0.103444 / 0.014526 (0.088918) |
| shuffle | 0.113285 / 0.176557 (-0.063271) |
| sort | 0.154323 / 0.737135 (-0.582812) |
| train_test_split | 0.115569 / 0.296338 (-0.180770) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.411300 / 0.215209 (0.196091) |
| read 50000 | 4.134909 / 2.077655 (2.057254) |
| read_batch 50000 10 | 1.798722 / 1.504120 (0.294602) |
| read_batch 50000 100 | 1.644948 / 1.541195 (0.103753) |
| read_batch 50000 1000 | 1.730865 / 1.468490 (0.262375) |
| read_formatted numpy 5000 | 0.438792 / 4.584777 (-4.145985) |
| read_formatted pandas 5000 | 4.551462 / 3.745712 (0.805750) |
| read_formatted tensorflow 5000 | 2.124802 / 5.269862 (-3.145059) |
| read_formatted torch 5000 | 0.923567 / 4.565676 (-3.642109) |
| read_formatted_batch numpy 5000 10 | 0.052974 / 0.424275 (-0.371301) |
| read_formatted_batch numpy 5000 1000 | 0.012070 / 0.007607 (0.004463) |
| shuffled read 5000 | 0.520505 / 0.226044 (0.294460) |
| shuffled read 50000 | 5.219699 / 2.268929 (2.950770) |
| shuffled read_batch 50000 10 | 2.207932 / 55.444624 (-53.236693) |
| shuffled read_batch 50000 100 | 1.864608 / 6.876477 (-5.011868) |
| shuffled read_batch 50000 1000 | 2.002697 / 2.142072 (-0.139376) |
| shuffled read_formatted numpy 5000 | 0.565802 / 4.805227 (-4.239425) |
| shuffled read_formatted_batch numpy 5000 10 | 0.123386 / 6.500664 (-6.377278) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.062124 / 0.075469 (-0.013346) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.617066 / 1.841788 (-0.224721) |
| map fast-tokenizer batched | 13.883624 / 8.074308 (5.809316) |
| map identity | 26.498578 / 10.191392 (16.307185) |
| map identity batched | 0.843939 / 0.680424 (0.163515) |
| map no-op batched | 0.518052 / 0.534201 (-0.016148) |
| map no-op batched numpy | 0.487371 / 0.579283 (-0.091912) |
| map no-op batched pandas | 0.506103 / 0.434364 (0.071739) |
| map no-op batched pytorch | 0.323585 / 0.540337 (-0.216753) |
| map no-op batched tensorflow | 0.328575 / 1.386936 (-1.058362) |
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008146 / 0.011353 (-0.003207) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003837 / 0.011008 (-0.007172) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.029409 / 0.038508 (-0.009100) |
| read_batch_unformated after write_array2d | 0.033857 / 0.023109 (0.010747) |
| read_batch_unformated after write_flattened_sequence | 0.305911 / 0.275898 (0.030013) |
| read_batch_unformated after write_nested_sequence | 0.324950 / 0.323480 (0.001470) |
| read_col_formatted_as_numpy after write_array2d | 0.006180 / 0.007986 (-0.001806) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004701 / 0.004328 (0.000373) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007229 / 0.004250 (0.002978) |
| read_col_unformated after write_array2d | 0.037555 / 0.037052 (0.000502) |
| read_col_unformated after write_flattened_sequence | 0.293456 / 0.258489 (0.034967) |
| read_col_unformated after write_nested_sequence | 0.324952 / 0.293841 (0.031111) |
| read_formatted_as_numpy after write_array2d | 0.031840 / 0.128546 (-0.096707) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009655 / 0.075646 (-0.065992) |
| read_formatted_as_numpy after write_nested_sequence | 0.252677 / 0.419271 (-0.166595) |
| read_unformated after write_array2d | 0.051064 / 0.043533 (0.007532) |
| read_unformated after write_flattened_sequence | 0.296493 / 0.255139 (0.041354) |
| read_unformated after write_nested_sequence | 0.322286 / 0.283200 (0.039086) |
| write_array2d | 0.089016 / 0.141683 (-0.052667) |
| write_flattened_sequence | 1.794496 / 1.452155 (0.342341) |
| write_nested_sequence | 1.831389 / 1.492716 (0.338673) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.272478 / 0.018006 (0.254472) |
| get_batch_of_1024_rows | 0.443112 / 0.000490 (0.442622) |
| get_first_row | 0.021004 / 0.000200 (0.020804) |
| get_last_row | 0.000333 / 0.000054 (0.000279) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.026112 / 0.037411 (-0.011299) |
| shard | 0.104849 / 0.014526 (0.090324) |
| shuffle | 0.112446 / 0.176557 (-0.064110) |
| sort | 0.152904 / 0.737135 (-0.584232) |
| train_test_split | 0.113997 / 0.296338 (-0.182342) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.414926 / 0.215209 (0.199717) |
| read 50000 | 4.164904 / 2.077655 (2.087249) |
| read_batch 50000 10 | 1.780953 / 1.504120 (0.276833) |
| read_batch 50000 100 | 1.570543 / 1.541195 (0.029348) |
| read_batch 50000 1000 | 1.636671 / 1.468490 (0.168181) |
| read_formatted numpy 5000 | 0.438233 / 4.584777 (-4.146544) |
| read_formatted pandas 5000 | 4.654436 / 3.745712 (0.908724) |
| read_formatted tensorflow 5000 | 2.116749 / 5.269862 (-3.153113) |
| read_formatted torch 5000 | 0.946619 / 4.565676 (-3.619057) |
| read_formatted_batch numpy 5000 10 | 0.052670 / 0.424275 (-0.371605) |
| read_formatted_batch numpy 5000 1000 | 0.011981 / 0.007607 (0.004374) |
| shuffled read 5000 | 0.521587 / 0.226044 (0.295542) |
| shuffled read 50000 | 5.203906 / 2.268929 (2.934977) |
| shuffled read_batch 50000 10 | 2.211223 / 55.444624 (-53.233402) |
| shuffled read_batch 50000 100 | 1.847416 / 6.876477 (-5.029061) |
| shuffled read_batch 50000 1000 | 1.936589 / 2.142072 (-0.205483) |
| shuffled read_formatted numpy 5000 | 0.556389 / 4.805227 (-4.248838) |
| shuffled read_formatted_batch numpy 5000 10 | 0.122217 / 6.500664 (-6.378447) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.060145 / 0.075469 (-0.015325) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.624353 / 1.841788 (-0.217435) |
| map fast-tokenizer batched | 13.793956 / 8.074308 (5.719648) |
| map identity | 26.665784 / 10.191392 (16.474392) |
| map identity batched | 0.858613 / 0.680424 (0.178190) |
| map no-op batched | 0.545623 / 0.534201 (0.011422) |
| map no-op batched numpy | 0.508019 / 0.579283 (-0.071264) |
| map no-op batched pandas | 0.518543 / 0.434364 (0.084179) |
| map no-op batched pytorch | 0.334013 / 0.540337 (-0.206324) |
| map no-op batched tensorflow | 0.344348 / 1.386936 (-1.042588) |
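
Every benchmark value above has the form `new / old (diff)`, and the printed numbers are consistent with `diff = new - old` up to rounding in the last digit. As a minimal sketch for reading these values programmatically, the snippet below parses one such entry and derives a relative change. The helper names `parse_cell` and `relative_change` are hypothetical and are not part of the benchmark tooling.

```python
import re
from typing import Tuple

# Matches one benchmark entry such as "0.009928 / 0.011353 (-0.001425)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell: str) -> Tuple[float, float, float]:
    """Return (new, old, diff) from a 'new / old (diff)' entry."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognised benchmark entry: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def relative_change(cell: str) -> float:
    """Change of the new value relative to the old one, e.g. 0.5 means +50%."""
    new, old, _ = parse_cell(cell)
    return (new - old) / old


new, old, diff = parse_cell("0.226653 / 0.018006 (0.208646)")
# The printed diff was rounded from unrounded measurements, so the
# comparison allows a small tolerance in the last digit.
assert abs((new - old) - diff) < 2e-6
```

A relative change is often easier to compare across metrics than the absolute diff, since the baseline values in these tables span several orders of magnitude.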
