Reading a Parquet file outside of Spark
So, Spark is becoming, if it hasn't already become, the de facto standard for large batch processing. Its big selling point is easy integration with the Hadoop file system and Hadoop's data types; however, I find it to be a bit opaque at times, especially when something goes wrong. Recently I was troubleshooting a Parquet file and wanted to rule out Spark itself as the culprit. That turns out to be non-trivial, especially since most of the documentation I could find on reading Parquet files assumes you want to do it from a Spark job. After a bit of Google/StackOverflow-ing, I found the parquet-mr project. All of its documentation, though, is again focused on how it integrates with other projects like Pig and Hive.
I started with this brief Scala example, but it didn't include the imports or the dependencies.
object ParquetSample {
  def main(args: Array[String]): Unit = {
    val path = new Path("hdfs://hadoop-cluster/path-to-parquet-file")
    val reader = AvroParquetReader.builder[GenericRecord](path).build()
      .asInstanceOf[ParquetReader[GenericRecord]]
    // Pull records until the reader returns null (end of file)
    val iter = Iterator.continually(reader.read).takeWhile(_ != null)
  }
}
This, obviously, doesn't compile, since it needs some imports. The project page includes sample Maven dependencies, but nothing for SBT. I tried translating directly from Maven to SBT:
libraryDependencies ++= Seq(
  "org.apache.parquet" % "parqet-common" % "1.10.0",
  "org.apache.parquet" % "parqet-encoding" % "1.10.0",
  "org.apache.parquet" % "parqet-column" % "1.10.0",
  "org.apache.parquet" % "parqet-hadoop" % "1.10.0"
)
SBT didn't like this at all, though:
sbt.librarymanagement.ResolveException: Error downloading org.apache.parquet:parqet-hadoop:1.10.0
Update: Tomasz Michniewski points out in the comments that there's actually a typo in my .sbt file! Maybe I shouldn't have given up so quickly... however, even after correcting the typo I still run into the same dependency management problems I discuss below.
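For the record, the corrected block is the same four lines with "parquet" spelled properly:

libraryDependencies ++= Seq(
  "org.apache.parquet" % "parquet-common" % "1.10.0",
  "org.apache.parquet" % "parquet-encoding" % "1.10.0",
  "org.apache.parquet" % "parquet-column" % "1.10.0",
  "org.apache.parquet" % "parquet-hadoop" % "1.10.0"
)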
I'm sure I could figure this out eventually, but I also know that I can go super old-school command-line fanatic, abandon the IDE completely, roll up my sleeves and do some manual dependency management.
The first thing I did was download the aforementioned parquet-mr project from mvnrepository and add it to my scalac classpath:
$ scalac -classpath lib/parquet-scala_2.10-1.10.1.jar ParquetSample.scala
Of course, this fails to compile, since it also can't find AvroParquetReader, GenericRecord, or Path.
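For what it's worth, the imports that resolve those symbols (plus ParquetReader, used in the cast) are:

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader

Knowing the packages at least tells you which projects to go hunting in: Avro, Hadoop, and Parquet itself.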
After a brief recursive search for dependencies, I ended up with the following minimal (compile time) dependency list:
- hadoop-hdfs-2.2.0.jar
- hadoop-common-2.2.0.jar
- parquet-hadoop-1.8.1.jar
- parquet-avro-1.8.1.jar
- avro-1.8.2.jar
Side note: this sort of recursive dependency resolution can be dangerous, since you might end up grabbing a version that's later than the correct backward-compatible version. Of course, you could always read through the POM file to find the documented dependencies... but I usually save that as a last resort if nothing is working.
Of course, that's just the compile-time dependencies. Thus begins the longer exercise of discovering the runtime dependencies: run the sample, wait for an error, and find the JAR file that contains the missing class. For the most part, this consists of running the bundle, Google-ing the resultant ClassNotFoundException, finding the associated project, downloading it, and trying again. I immediately ran across the usual suspects, Apache Commons and Log4j, as well as some Parquet format dependencies. There were a couple that were trickier to track down, both related to code from Google itself: com.google.common.base.Preconditions, which comes from Guava, and com.google.protobuf.ServiceException, which comes from Protobuf.
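As an aside, once you have a directory full of candidate JARs, it's handy to be able to check which one actually provides a given class without relying on Google. Something like this little helper (purely hypothetical, not part of the original workflow) does the trick by scanning lib/ for the class file:

import java.io.File
import java.util.jar.JarFile

// Usage (after compiling): scala FindClass com.google.common.base.Preconditions
// Scans every JAR under lib/ and prints the ones that contain the class.
object FindClass {
  def main(args: Array[String]): Unit = {
    val entryName = args(0).replace('.', '/') + ".class"
    for (jar <- new File("lib").listFiles().filter(_.getName.endsWith(".jar"))) {
      val jarFile = new JarFile(jar)
      try {
        if (jarFile.getEntry(entryName) != null)
          println(s"${args(0)} -> ${jar.getName}")
      } finally {
        jarFile.close()
      }
    }
  }
}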
After adding ten more dependencies to the runtime list:
- commons-cli-1.2.jar
- commons-configuration-1.9.jar
- commons-lang-2.6.jar
- commons-logging-1.2.jar
- guava-11.0.2.jar
- hadoop-auth-2.6.0.jar
- parquet-column-1.8.1.jar
- parquet-common-1.10.1.jar
- protobuf-java-3.5.1.jar
- slf4j-api-1.7.28.jar
I had resolved all of the ClassNotFoundExceptions, but ran into a harder-to-diagnose error:
Sep 30, 2019 10:49:04 AM org.apache.hadoop.ipc.Client call
WARNING: interrupted waiting to send rpc request to server
java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:970)
at org.apache.hadoop.ipc.Client.call(Client.java:1320)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy7.getBlockLocations(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy7.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
This one took a little bit more research to resolve, but I finally tracked it back to another missing dependency: parquet-format.
To finally get it working, I had to add eight more dependencies:
- jackson-annotations-2.9.9.jar
- jackson-core-2.9.9.jar
- jackson-core-asl-1.9.13.jar
- jackson-databind-2.9.9.3.jar
- jackson-mapper-asl-1.9.13.jar
- parquet-encoding-1.8.1.jar
- parquet-format-2.6.0.jar
- snappy-java-1.1.7.3.jar
I finally ended up with a simple shell script to compile the code:
#!/bin/sh
CLASSPATH=${CLASSPATH}:lib/hadoop-hdfs-2.2.0.jar
CLASSPATH=${CLASSPATH}:lib/hadoop-common-2.2.0.jar
CLASSPATH=${CLASSPATH}:lib/parquet-hadoop-1.8.1.jar
CLASSPATH=${CLASSPATH}:lib/parquet-avro-1.8.1.jar
CLASSPATH=${CLASSPATH}:lib/avro-1.8.2.jar
scalac -classpath $CLASSPATH ParquetSample.scala
And a companion script to run it:
#!/bin/sh
CLASSPATH=${CLASSPATH}:lib/hadoop-hdfs-2.2.0.jar
CLASSPATH=${CLASSPATH}:lib/hadoop-common-2.2.0.jar
CLASSPATH=${CLASSPATH}:lib/parquet-hadoop-1.8.1.jar
CLASSPATH=${CLASSPATH}:lib/parquet-avro-1.8.1.jar
CLASSPATH=${CLASSPATH}:lib/avro-1.8.2.jar
CLASSPATH=${CLASSPATH}:lib/commons-cli-1.2.jar
CLASSPATH=${CLASSPATH}:lib/commons-configuration-1.9.jar
CLASSPATH=${CLASSPATH}:lib/commons-lang-2.6.jar
CLASSPATH=${CLASSPATH}:lib/commons-logging-1.2.jar
CLASSPATH=${CLASSPATH}:lib/guava-11.0.2.jar
CLASSPATH=${CLASSPATH}:lib/hadoop-auth-2.6.0.jar
CLASSPATH=${CLASSPATH}:lib/jackson-annotations-2.9.9.jar
CLASSPATH=${CLASSPATH}:lib/jackson-core-2.9.9.jar
CLASSPATH=${CLASSPATH}:lib/jackson-core-asl-1.9.13.jar
CLASSPATH=${CLASSPATH}:lib/jackson-databind-2.9.9.3.jar
CLASSPATH=${CLASSPATH}:lib/jackson-mapper-asl-1.9.13.jar
CLASSPATH=${CLASSPATH}:lib/parquet-column-1.8.1.jar
CLASSPATH=${CLASSPATH}:lib/parquet-common-1.10.1.jar
CLASSPATH=${CLASSPATH}:lib/parquet-encoding-1.8.1.jar
CLASSPATH=${CLASSPATH}:lib/parquet-format-2.6.0.jar
CLASSPATH=${CLASSPATH}:lib/protobuf-java-3.5.1.jar
CLASSPATH=${CLASSPATH}:lib/slf4j-api-1.7.28.jar
CLASSPATH=${CLASSPATH}:lib/snappy-java-1.1.7.3.jar
scala -classpath .:$CLASSPATH ParquetSample
This exercise highlights why, in an ecosystem built on this much code reuse, automated dependency management systems like Maven and SBT are so handy. Still, when they have issues, especially with projects that don't get a lot of exercise like parquet-mr, it's helpful to know how to go back to basics and pare things down to the bare minimum of what's necessary.
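As a postscript: with working dependency management, all twenty-three of those JARs collapse into a couple of declared dependencies, with everything else pulled in transitively. Something like the following build.sbt should be roughly equivalent, though I haven't verified this exact combination, so treat it as a sketch rather than a known-good build:

libraryDependencies ++= Seq(
  "org.apache.parquet" % "parquet-avro" % "1.8.1",
  "org.apache.hadoop" % "hadoop-client" % "2.2.0"
)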