Non-unique keys as outputs of mapreduce when using a combiner #167
This looks like a bug. I am not sure why you are trying the two different reducers. The first argument of the reducer always has length one unless vectorized.reduce is set to TRUE, so for a key k of length one, k[1] does nothing. Back to this problem: the only scenario I can think of is that Hadoop and R order keys differently ... wait a second, we have fixed that already. Maybe some variant of that bug involving a combiner is not as fixed as I would like it to be. I would like you to perform the following experiment (unless you can share the data; in the meantime I will try to repro with synthetic data, no luck so far). The idea is to normalize the length of the keys to the maximum possible length.
Then in the map function:
Does that make sense? Thanks
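For concreteness, the padding step could be sketched as below (`pad.key` and the width value are illustrative, not part of the original exchange):

```r
# Pad every key with trailing spaces so that all keys share the
# maximum possible length; with equal-length keys, Hadoop's byte-wise
# sort order and R's string comparison should agree.
pad.key <- function(k, width) {
  # flag = "-" left-justifies and pads with spaces up to 'width'
  formatC(as.character(k), width = width, flag = "-")
}

pad.key("abc", 8)   # "abc     " (8 characters)
```

The map function would then emit the padded key instead of the raw one.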
Makes sense. Unfortunately I won't be able to share the data (not even the ...)
On Mon, Apr 20, 2015, 12:50, Antonio Piccolboni wrote:
The keys are now all the same length (56 characters), but the counts are still off...
Without a combiner everything works fine.
Since you can't share your data, this is what I propose: let's try to duplicate the problem with variants of this program.
Please let me know if I appear to have captured the essence of the problem you reported.
I have had some trouble with my dev environment, but I am now running this and I don't see any problems so far.
I'm trying to write a mapreduce function that counts the occurrences of the distinct values in a data column.
My code is:
As a result, I'm getting non-unique keys coming out of the reduce phase...
Sample output from the above print commands:
(I know the right answer is 284).
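Since the original snippet is not included above, here is a hypothetical rmr2 job of the general shape described (the names and toy input are made up for illustration, not the poster's actual code):

```r
library(rmr2)

# Toy input: a data frame with one column of repeated values.
input <- to.dfs(data.frame(col = sample(letters[1:5], 1000, replace = TRUE)))

counts <- mapreduce(
  input   = input,
  map     = function(k, v) keyval(as.character(v$col), 1),
  reduce  = function(k, vv) keyval(k, sum(vv)),
  combine = TRUE)                 # reuse the reducer as a combiner

from.dfs(counts)                  # expect one key per distinct value
```

With `combine = TRUE`, correctness relies on the reduce function being associative and on identical keys comparing equal at every stage, which is exactly what the key-length experiment above probes.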
Different ways I tried to make this work:
An interesting thing about the wrong answer: the counts still sum up to what they should, they just don't get "combined" enough. That is, I don't get exact duplicate entries coming back; I get partially aggregated answers.
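The "partially aggregated" symptom can be mimicked in plain R, without Hadoop (the partial counts below are invented, except that they sum to the 284 total mentioned above):

```r
# Two combiner outputs for key "a" that were never merged by the
# reduce phase: the partial counts still sum to the true total.
partial <- c(a = 120, b = 30, a = 164, b = 70)

# A correct reduce would group by key:
correct <- tapply(partial, names(partial), sum)
correct[["a"]]   # 284, the expected total for "a"
```

If the reduce phase sees the two copies of "a" as distinct keys (for example because Hadoop and R disagree on sort order), the two partial sums are emitted separately instead of being merged.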
Not sure if I'm doing anything wrong...
Versions:
And hadoop version: