-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-47995: [C++][Parquet] Fix empty string min/max statistics being lost during merge #48717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ing lost during merge Prior to this change, CleanStatistic() for ByteArray rejected statistics where ptr == nullptr. However, empty strings can have ptr == nullptr with len == 0, causing valid statistics to be discarded when the minimum value is an empty string. The fix introduces a sentinel pointer (kNoValueSentinel) distinct from nullptr to mark "no value" in ByteArray statistics. This allows CleanStatistic to distinguish between "no min/max computed" (sentinel) and "min/max is empty string" (nullptr with len=0). FLBA is unchanged since it has fixed length and no "empty" concept.
|
|
| static T DefaultMax() { return T{0, kNoValueSentinel}; } | ||
|
|
||
| static T Min(int type_length, const T& a, const T& b) { | ||
| if (a.ptr == kNoValueSentinel) return b; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this PR. Since it allows ptr to be nullptr for empty string, following Copy may be a UB if min is an empty string with nullptr.
template <>
inline void TypedStatisticsImpl<ByteArrayType>::Copy(const ByteArray& src, ByteArray* dst,
ResizableBuffer* buffer) {
if (dst->ptr == src.ptr) return;
PARQUET_THROW_NOT_OK(buffer->Resize(src.len, false));
std::memcpy(buffer->mutable_data(), src.ptr, src.len);
*dst = ByteArray(src.len, buffer->data());
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, I will make a separate PR to fix that
Rationale for this change
Fixes #47995
When merging ByteArray statistics, empty string min/max values were incorrectly discarded. This happened because
CleanStatistic()rejected statistics whereptr == nullptr, but empty strings can legitimately haveptr == nullptrwithlen == 0.What changes are included in this PR?
Introduces a sentinel pointer (
kNoValueSentinel) distinct fromnullptrto mark "no value" in ByteArray statistics. This allowsCleanStatisticto distinguish between:FLBA is unchanged since it has fixed length and no "empty" concept.
Are these changes tested?
Yes. Added comprehensive tests covering all combinations of:
Are there any user-facing changes?
No API changes. This is a bug fix that preserves empty string statistics correctly during merge operations.