Doc feedback #8
Comments
I have just started to build some map/reduce jobs with Parkour. I like everything I see about the library, but I have a good knowledge of the Java APIs, and I agree that this really helps. It is also helpful to be pretty versed in Clojure: seqs, reducers, threading macros, etc. What is missing, in my mind, for either the Java API or Parkour, is a sort of cookbook. It would be great to get something like this started, and in the next couple of months I would certainly have some possible recipes worked out. I'm not really sure how to get a cookbook project similar to https://github.com/clojure-cookbook/clojure-cookbook going, however. Publishing is not really my area of expertise. Thanks,
@mjwillson -- It's not the best example, but there is still one test namespace which tests the Hadoop Job API with the lower-level Parkour-Hadoop integration API: https://github.com/damballa/parkour/blob/master/test/parkour/word_count_test.clj Documentation for people new to Hadoop is something I struggle with. It's difficult for me to see the gaps because I've worked with Hadoop for so long (and frequently at such a low level). I've thus tried to punt and largely avoid explaining Hadoop fundamentals, but that has certainly made Parkour less accessible to Clojure programmers without existing Hadoop experience. I'm not sure a full-on cookbook is the answer, but I'd be happy to collaborate on some "if you're new to Hadoop" documentation.
Thanks for the pointer, yeah that helps. Agreed that I wouldn't expect a project like this to teach hadoop from scratch. Perhaps where the gaps lie is in spelling out, in explicit for-dummies terms, how this API translates to and from more canonical boilerplatey usage of the hadoop Java API. A few carefully selected examples might be enough; a whole cookbook would be nice but perhaps not essential. The other thing which could help: where there are particularly awkward quirks in the way the hadoop fundamentals work which are relevant to the design decisions here, perhaps give a bit of background for newbies on the hadoop side of things, not just on what parkour does on top to make it more pleasant. I'll try and revisit and make some more constructive suggestions about where the gaps lie when I'm a bit further along, anyway.
So I have one particular use-case that I'm having difficulty translating into Parkour. I've been poring over the docs and experimenting, but I am as yet unable to see the exact path forward. I have done this in Java M/R, so I will describe it from that perspective. The job is a map-only job where I override run() to control the flow of records from the split all at once.
This is a very common use case in my domain (bioinformatics), as many of the algorithms we want to process data with take the form of command-line tools. Re-implementing those tools as functions that could be used in a mapper is not feasible: there are too many of them, and the algorithms used are research topics in their own right. I'm having difficulty mapping this into Parkour concepts. For example, it's not clear how to return a reducible after performing the above steps. It seems like I need to somehow connect a lazy-seq to a dsink. If you have the time to point me in the right direction, it would be greatly appreciated. Thanks,
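The pattern described above -- consume all of a split's records, stream them through an external command-line tool, and lazily emit the tool's output -- can be sketched independently of Hadoop. This is only an illustrative sketch in Python: `pipe_through` is a hypothetical helper name, not part of Parkour (which is Clojure) or the Hadoop API, and the actual tool command is a placeholder.

```python
import subprocess
import threading

def pipe_through(records, cmd):
    """Lazily stream `records` through an external command-line tool.

    Mirrors the map-only pattern above: a run()-style loop consumes the
    whole split, feeds each record to a subprocess, and yields the
    tool's output records one at a time. `cmd` stands in for whatever
    bioinformatics tool is being wrapped.
    """
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)

    def feed():
        # Write every input record, then signal EOF to the tool.
        for rec in records:
            proc.stdin.write(rec + "\n")
        proc.stdin.close()

    # Feed input on a separate thread so that reading the tool's
    # output on this thread cannot deadlock on full pipe buffers.
    threading.Thread(target=feed, daemon=True).start()
    for line in proc.stdout:
        yield line.rstrip("\n")
    proc.wait()
```

In a real job the generator returned here would play the role of the lazy seq the comment wants to connect to a dsink; the key design point is that input feeding and output reading happen concurrently, so the tool's output can be consumed incrementally rather than buffered whole.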
@chriscnc -- Your Parkour task functions are nearly literally identical to the Mapper/Reducer class methods. If you want more functional code that yields results via the return value, but want to read lazily from your external tool's output, you have two basic options:
Hello
Just a small (and selfish, feel free to ignore!) doc request / bit of feedback.
At the moment the library seems very much geared towards people who've already done hadoop the hard way, understand the pain points and want a higher-level DSL which abstracts over them a bit more.
Which is a perfectly valid aim, but I feel like it could also be a bit more accessible to those starting out with hadoop too, with a little more motivating "big picture" documentation along the lines of: here's how you do things directly with hadoop, here's why that's painful, here's how the higher-level constructs in this library help and how they translate to and from the lower-level stuff which you can read about elsewhere.
The docs already do a good job of outlining this in places, although I'm thinking about the parkour.graph stuff in particular -- here there's (what looks to a newbie like) a fair amount of magic introduced, and it's not quite clear how the chains of parkour.graph calls in the examples translate into mapreduce jobs.
Is there a less magical direct way to set up a single mapreduce job with a given mapper and/or reducer using this library, if I want to walk before I run and do things very explicitly for the sake of understanding the lower level to motivate understanding of some of the higher-level APIs and the pain points they're addressing?
I realise that I could do this directly with the hadoop Java API, and maybe that is the only way to true enlightenment. But I very much like some of the features of this library, like the REPL support, idiomatic-clojureness and ease of testing, and it'd be nice to benefit from these while gradually easing into using more abstractions on top of the native hadoop concepts. Which I'm sure I can -- I just can't see the forest for the trees at the moment.
I'll see how I get on anyway -- keep up the good work!
Cheers
-Matt