-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify num-nulls handling in Statistics and ColumnIndex #449
Conversation
src/main/thrift/parquet.thrift
Outdated
/** | ||
* count of null value in the column | ||
* | ||
* Writers should write this field even if it is zero or in non-null columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Writers should write this field even if it is zero or in non-null columns. | |
* Writers SHOULD always write this field even if it is zero (a.k.a. no null value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might also add the expectation to the reader implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might also add the expectation to the reader implementations?
I'd like to wait other replying the context in ML firstly
Co-authored-by: Gang Wu <[email protected]>
@pitrou @gszadovszky @emkornfield @alamb @etseidl @shangxinli Would you mind take a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mapleFU, I think this is an important clarification. Will you also add this language to ColumnIndex
?
Sure, wait 1h |
Co-authored-by: Ed Seidl <[email protected]>
I'll be asleep 🤣 |
Aha, I may need sometime to check the parquet-java, parquet-rs ColumnIndex impl, so maybe would take some time, it would be finished before you wake up tomorrow. Nice dream! |
@mapleFU, it looks good to me and is a proper change for the actual case. However, the statement |
For column index:
In parquet-rs seems it always write? And in C++/Java If page statistics has null counts, it's bring up to ColumnIndex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just two wording suggestions
Co-authored-by: Antoine Pitrou <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @mapleFU -- this makes sense to me
* Writers SHOULD always write this field even if it is zero (i.e. no null value) | ||
* or the column is not nullable. | ||
* Readers MUST distinguish between null_count not being present and null_count == 0. | ||
* If null_count is not present, readers MUST NOT assume null_count == 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this statement (even though there is an issue with older parquet files written with parquet-rs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What did parquet-rs do? Not write the null count in case it was null? If so, then readers will just not optimize out null handling in this case, which seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JFinis apache/arrow-rs#6257 would support write 0 here
Before this nonthing is writen.
Co-authored-by: Gang Wu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last nit 😄. Thanks again!
Co-authored-by: Ed Seidl <[email protected]>
@gszadovszky Would you mind take a look? Besides, I didn't get any response in the maillist. What should I do before checking in this pr? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mapleFU, I think this PR does not change the way an implementation should handle these values. It only articulates the correct handling.
Also, the PR has been approved by a quite broad audience already. You may safely push it.
Merge now, thanks all! |
* Writers SHOULD always write this field even if it is zero (i.e. no null value) | ||
* or the column is not nullable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should writers write this field if the column isn't nullable? That feels strictly redundant and therefore a waste of memory. If a reader assumes that their could be nulls even though the type is required in the schema, then this reader has a bug. There wouldn't even be a way to write nulls into a required column, as they cannot be represented in this case.
(Also, if you disregard this comment, at least change the wording from not nullable
to not required
, as that's the correct term w.rt. to the schema (there is no nullable
field in the schema, just a repetition type).)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should writers write this field if the column isn't nullable?
I agree it's not costless in this section, however most impl uses this. It's not a standard but it's widely used.
(Also, if you disregard this comment, at least change the wording from not nullable to not required, as that's the correct term w.rt. to the schema (there is no nullable field in the schema, just a repetition type).)
You're right, I'll rechecking this
* Readers MUST distinguish between null_counts not being present | ||
* and null_count being 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null_count => null_counts
@JFinis I totally agree the non-nullable ( including def-level in parquet) type can implicit deducing that not-null, however I found most implemention consumes that and produce that, so I sound it here. |
Rationale for this change
Clarify num-nulls handling. I've mentioned this in maillist: https://lists.apache.org/thread/oqd9lt7p0jtf217hg0667xhk0zvwbgvt
What changes are included in this PR?
Suggest writer to have num_nulls/null_count
Do these changes have PoC implementations?
no