
Implement parquet simple record as a source for the druid spark indexer #111

Closed
wants to merge 6 commits

Conversation


@Igosuki Igosuki commented Oct 3, 2017

Probably can't be merged right away.

Works fine with the parquet file I included (I mapped the CSV from the test resources into a parquet file). The strategy is to map parquet records directly to JSON without going through Avro, since loading generic Avro data turned out to be more complicated than it was worth for me.
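For illustration, here is a minimal sketch of that kind of mapping, not the code in this PR: each parquet simple record (a Group) is turned into a plain map, recursing into nested groups, without ever materializing an Avro record. Repeated fields are simplified to their first value here.

import org.apache.parquet.example.data.Group

// Sketch only: convert a parquet Group ("simple record") into a Map,
// recursing into nested groups; repeated fields are truncated to one value.
def groupToMap(g: Group): Map[String, Any] = {
  val schema = g.getType
  (0 until schema.getFieldCount).flatMap { i =>
    val name = schema.getFieldName(i)
    if (g.getFieldRepetitionCount(i) == 0) None                 // field absent in this record
    else if (schema.getType(i).isPrimitive)
      Some(name -> g.getValueToString(i, 0))                    // primitive value as string
    else
      Some(name -> groupToMap(g.getGroup(i, 0)))                // nested record, recurse
  }.toMap
}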

I'm going to run production tests soon (against a few hundred gigabytes of data).

@drcrallen (Contributor)

For future maintainability: there is already a parquet input row parser in Druid (https://github.com/druid-io/druid/blob/master/extensions-contrib/parquet-extensions/src/main/java/io/druid/data/input/parquet/ParquetHadoopInputRowParser.java). Is there any reason it cannot be used?

It would be nice if the row parser (or a factory) could be passed in, eliminating the giant switch statement, which is not really sustainable.

Also, regarding #10, it might make sense to have an RDD passed into the index method in the future, though I don't think such a refactoring is in scope for adding a specific new format.

Getting rid of the giant switch statement should be investigated further, though.
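As a rough illustration of that suggestion (the method name and signature here are hypothetical, not the project's actual API): the indexing code would take the format-specific parsing function as an argument, so a new input format only supplies its own parser instead of adding a case to a switch.

import io.druid.data.input.InputRow
import org.apache.spark.rdd.RDD

// Hypothetical sketch: the caller injects the format-specific parser,
// so no format switch is needed inside the indexing code itself.
def indexRows[T](raw: RDD[T], parse: T => InputRow): RDD[InputRow] =
  raw.map(parse)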

@drcrallen (Contributor)

The test failure looks like a Guava version problem:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 58, localhost, executor driver): java.lang.NoSuchMethodError: com.google.common.io.Files.asByteSource(Ljava/io/File;)Lcom/google/common/io/ByteSource;

Guava version reconciliation is always a huge issue.
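One common way to work around this, sketched below and assuming the project builds a fat jar with sbt-assembly (that plugin is an assumption, not something shown in this PR), is to shade Guava so the version bundled with the extension cannot collide with the older one on the Spark classpath. Pinning or excluding the transitive Guava dependency is the other usual option.

// build.sbt sketch (assumes the sbt-assembly plugin): rename Guava packages
// inside the assembled jar so Spark's older Guava no longer shadows them.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)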

@Igosuki (Author) commented Oct 4, 2017

Looks like the checks have passed.
A few things:

  • This uses the simplest read API from parquet, which I copied over because parquet-tools, which this API belongs to, pulls in a very old version of Guava (a sketch of this read path follows after the list).
  • I didn't add push-down filters as an option, even though they can be very effective when reading parquet.
  • I pinned the Druid version to 0.10.1, but we can revert that to the snapshot.
  • I am unsure how the hadoopFile API behaves with Spark/Parquet over a directory, i.e. whether it splits the ParquetFileReader workload properly or whether it becomes a bottleneck.
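For reference, a sketch of that simple read path, assuming parquet-hadoop's example GroupReadSupport (the file path is illustrative):

import org.apache.hadoop.fs.Path
import org.apache.parquet.example.data.Group
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.hadoop.example.GroupReadSupport

// Sketch only: read parquet records one by one as Groups ("simple records").
val reader: ParquetReader[Group] =
  ParquetReader.builder(new GroupReadSupport(), new Path("/tmp/lineitems.parquet")).build()
Iterator.continually(reader.read()).takeWhile(_ != null).foreach(println)
reader.close()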

@Gauravshah

How would we specify nested columns?
Also, would it respect the partition column?

@Igosuki (Author) commented Oct 4, 2017 via email

@Gauravshah

I feel going down the DataFrame route might be better.
The current way needs two transforms: Parquet to JSON to Map<String, Object>.
With the DataFrame route we could go directly from Parquet to Map<String, Object>.
That would also let us apply other transforms, and it would make use of Spark's parquet pruning features.
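A sketch of what that DataFrame route could look like (the SparkSession setup and the path are assumptions for illustration; this is not code from the PR):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("parquet-to-maps").getOrCreate()

// Spark SQL handles input splits, column pruning and filter push-down for parquet;
// each Row is converted straight to a Map[String, Any], skipping the JSON step.
val maps = spark.read.parquet("hdfs:///path/to/data").rdd
  .map(row => row.getValuesMap[Any](row.schema.fieldNames))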

@Igosuki (Author) commented Oct 4, 2017

Sure, why not, let me have a look

@xanec left a comment

Maybe we should not commit a binary file (i.e., lineitems.parquet)? Could we generate it on the fly instead?
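One possible way to do that, as a sketch only: rebuild the parquet fixture from the committed CSV during test setup. This assumes Spark 2.x's built-in CSV reader is available to the tests; the CSV file name and temp location below are hypothetical.

import org.apache.spark.sql.SparkSession

// Test-setup sketch: regenerate the parquet fixture from the CSV test resource
// instead of committing the binary parquet file. File names are illustrative.
val spark = SparkSession.builder().master("local[*]").appName("build-fixture").getOrCreate()
val tmpDir = java.nio.file.Files.createTempDirectory("parquet-fixture").toString

spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("src/test/resources/lineitems.csv")    // hypothetical name for the CSV fixture
  .write.mode("overwrite")
  .parquet(s"$tmpDir/lineitems.parquet")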

@@ -143,7 +150,8 @@ resolvers += "JitPack.IO" at "https://jitpack.io"
publishMavenStyle := true

//TODO: remove this before moving to druid.io
publishTo := Some("central-local" at "https://metamx.artifactoryonline.com/metamx/libs-releases-local")
//publishTo := Some("central-local" at "https://metamx.artifactoryonline.com/metamx/libs-releases-local")
publishTo := Some(Resolver.file("file", new File(Path.userHome.absolutePath+"/.m2/repository")))

Just putting a note here as a reminder to correct this before merge.

s"[${x.getClass.getCanonicalName}]. " +
"Hoping it can handle string input"
val baseData =
dataSchema.getDelegate.getParser match {

There is a lot of repeated code (i.e., the parquet case and the text case largely duplicate each other). I suggest some refactoring to combine them.

return values.toString();
}

public void prettyPrint() {

May I ask whether prettyPrint is used in production, or only for testing/development? If it is the latter, then perhaps we should not include it in the class.

Also, if we remove it, a lot of the related methods/classes can be removed as well.

@xanec commented Oct 5, 2017

Please correct me if I am wrong, but I think the parser need not be "simple": the parsing can produce a POJO for our own manipulation. Hence, it is up to us to decide what record is produced and how we map it into Druid.

@Igosuki (Author) commented Oct 27, 2017

@xanec You're right. I've had other things to patch on Druid on our end; I'm getting back to this as soon as I can to refactor it.

Commits added to the pull request:

  • Copy over generic parquet read api to prevent importing the whole parquet-tools dependency
  • Revert "Copy over generic parquet read api to prevent importing the whole parquet-tools dependency" (this reverts commit e845195369a59a5aa0121e80b2bae7b3edbab395)
@Igosuki (Author) commented Dec 12, 2017

I rebased on 0.11 to use the transform and flatten APIs, but I had to copy the maps in the parser because the AbstractMaps produced by the ObjectFlatteners API aren't directly serializable by Kryo.
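For illustration, the kind of copy being described (a sketch; `flattened` stands in for the map the flattener returns, and the helper name is hypothetical):

// Sketch: the flattener returns a lazy Map view backed by an anonymous
// AbstractMap that Kryo can't serialize directly, so copy it into a
// concrete HashMap before it crosses a Kryo-serialized boundary.
def copyForKryo(flattened: java.util.Map[String, AnyRef]): java.util.Map[String, AnyRef] =
  new java.util.HashMap[String, AnyRef](flattened)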
