
arrow::compute::concat should merge dictionary type when concatenating list of dictionaries #6888

Open
rluvaton opened this issue Dec 16, 2024 · 4 comments · May be fixed by #6893
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@rluvaton (Contributor)

Describe the bug
When concatenating lists of dictionaries, the resulting dictionary's values array contains duplicates.

To Reproduce

#[cfg(test)]
mod tests {
    use arrow_array::builder::{GenericListBuilder, StringDictionaryBuilder};
    use arrow_array::cast::AsArray;
    use arrow_array::types::Int32Type;
    use arrow_array::{ArrayRef, StringArray};
    use std::sync::Arc;

    #[test]
    fn it_works() {
        let scalars = vec![
            create_list_of_dict(vec![Some("a")]),
            create_list_of_dict(vec![Some("a")]),
            create_list_of_dict(vec![Some("b")]),
        ];
        let arrays = scalars.iter().map(|a| a.as_ref()).collect::<Vec<_>>();
        let concat_res = arrow::compute::concat(arrays.as_slice()).unwrap();

        let list = concat_res.as_list::<i32>();

        let dict = list
            .values()
            .as_dictionary::<Int32Type>()
            .downcast_dict::<StringArray>()
            .unwrap();
        println!("{:?}", dict);

        // concat appends the values arrays back to back, so the duplicates
        // it introduces are adjacent and dedup() will expose them.
        let values = dict.values().iter().collect::<Vec<_>>();
        let mut unique_values = values.clone();
        unique_values.dedup();
        assert_eq!(values, unique_values, "There are duplicates in the value list");
    }

    fn create_list_of_dict(items: Vec<Option<&'static str>>) -> ArrayRef {
        let mut builder =
            GenericListBuilder::<i32, _>::new(StringDictionaryBuilder::<Int32Type>::new());

        for item in items {
            builder.values().append_option(item);
        }

        builder.append(true);

        Arc::new(builder.finish())
    }
}

Expected behavior
When concatenating lists of dictionaries, concat should merge the dictionaries rather than blindly concatenating their values arrays as well.
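What "merge" could look like, as a minimal sketch using plain std types rather than the arrow crate (ToyDict, concat_merged, and the keys/values layout are hypothetical stand-ins for DictionaryArray, not the actual arrow-rs implementation):

```rust
use std::collections::HashMap;

// Toy stand-in for a dictionary array: each entry in `keys` is an
// index into `values`. Names here are hypothetical.
struct ToyDict {
    keys: Vec<usize>,
    values: Vec<String>,
}

// Concatenate two toy dictionaries, merging their value sets so each
// distinct value appears once and keys are remapped accordingly.
fn concat_merged(a: &ToyDict, b: &ToyDict) -> ToyDict {
    let mut values: Vec<String> = Vec::new();
    let mut seen: HashMap<String, usize> = HashMap::new();
    let mut keys = Vec::new();
    for dict in [a, b] {
        for &k in &dict.keys {
            let v = &dict.values[k];
            // Reuse the index of an already-seen value instead of
            // appending a duplicate to `values`.
            let idx = *seen.entry(v.clone()).or_insert_with(|| {
                values.push(v.clone());
                values.len() - 1
            });
            keys.push(idx);
        }
    }
    ToyDict { keys, values }
}

fn main() {
    // Mirrors the reproduction above: ["a", "a"] then ["b"].
    let a = ToyDict { keys: vec![0, 0], values: vec!["a".to_string()] };
    let b = ToyDict { keys: vec![0], values: vec!["b".to_string()] };
    let merged = concat_merged(&a, &b);
    // The merged values array holds each distinct value exactly once.
    assert_eq!(merged.values, vec!["a".to_string(), "b".to_string()]);
    assert_eq!(merged.keys, vec![0, 0, 1]);
    println!("merged values: {:?}", merged.values);
}
```

The same remapping idea generalises to any number of inputs by folding over the list of arrays.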

Additional context
This happened to me when building an aggregate expression in DataFusion, which calls ScalarValue::iter_to_array(results), which uses concat underneath.

According to the spec, duplicate values are valid:

Note that a dictionary is permitted to contain duplicate values or nulls:

From the Arrow spec, Dictionary-encoded Layout.

@rluvaton rluvaton added the bug label Dec 16, 2024
@rluvaton (Contributor, Author)

It's not really a bug

@tustvold (Contributor)

Related #3558

@tustvold tustvold added enhancement Any new improvement worthy of an entry in the changelog and removed bug labels Dec 17, 2024
@rluvaton (Contributor, Author)

rluvaton commented Dec 17, 2024

While fixing this I saw this line:

let values_exceed_length = total_values >= len;

In concat, len is the total number of keys (null and non-null alike, if I understand correctly).

So this line basically says: if number_of_values >= number_of_keys, we should merge (in the case of more than a single dictionary).

Having more values than keys is really unlikely when you merge dictionaries that reference the same value multiple times (which is the whole point of a dictionary).

Can you please explain what the reason behind this check is?

@tustvold (Contributor)

Having more values than keys is really unlikely when you merge dictionaries that reference the same value multiple times (which is the whole point of a dictionary)

See #3558; it can occur because a single dictionary may be shared across multiple arrays, or as a result of previous processing. The intent is to optimise the dictionary only when we know the performance cost will pay off.
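The heuristic described above can be sketched in toy form (should_merge_dictionaries is a hypothetical name, not the actual arrow-rs code, which compares total_values against len inline):

```rust
// Toy sketch of the merge heuristic: only pay the cost of rebuilding
// the dictionaries when the combined values arrays are at least as
// long as the combined key count, i.e. when dictionary encoding is
// not actually saving space as-is.
fn should_merge_dictionaries(value_lens: &[usize], key_lens: &[usize]) -> bool {
    let total_values: usize = value_lens.iter().sum();
    let total_keys: usize = key_lens.iter().sum();
    // A single input dictionary never needs merging.
    value_lens.len() > 1 && total_values >= total_keys
}

fn main() {
    // Values arrays together as long as the key arrays: merging may pay off.
    assert!(should_merge_dictionaries(&[3, 3], &[4, 2]));
    // Values far smaller than keys: the dictionaries already compress
    // well, so skip the expensive merge.
    assert!(!should_merge_dictionaries(&[2, 2], &[100, 100]));
    // A lone dictionary is left alone.
    assert!(!should_merge_dictionaries(&[5], &[3]));
    println!("heuristic checks passed");
}
```

This matches the intent stated above: shared or previously-processed dictionaries can carry more values than their keys reference, and only then is the merge worth its cost.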
