Reading a Parquet file outside of Spark
So, Spark is becoming, if it hasn't already become, the de facto standard for large batch processing. Its big selling point is easy integration with the Hadoop file system and Hadoop's data types; however, I find it to be a bit opaque at times, especially when something goes wrong. Recently I was troubleshooting a Parquet file and wanted to rule out Spark itself as the culprit. That turns out to be non-trivial, especially since most of the documentation I could find on reading Parquet files assumes you want to do it from a Spark job. After a bit of Google/StackOverflow-ing, I found the parquet-mr project. All of its documentation, though, is again focused on how it integrates with other projects like Pig and Hive.
I started with this brief Scala example, but it didn't include the imports or the dependencies.
object ParquetSample {
  def main(args: Array[String]): Unit = {
    val path = new Path("hdfs://hadoop-cluster/path-to-parquet-file")
    val reader = AvroParquetReader.builder[GenericRecord](path).build()
      .asInstanceOf[ParquetReader[GenericRecord]]
    // Pull records until the reader returns null (end of file)
    val iter = Iterator.continually(reader.read).takeWhile(_ != null)
  }
}
This, obviously, doesn't compile, since it needs some imports. The project page includes sample Maven dependencies, but nothing for SBT. I tried translating directly from Maven to SBT:
libraryDependencies ++= Seq(
  "org.apache.parquet" % "parqet-common" % "1.10.0",
  "org.apache.parquet" % "parqet-encoding" % "1.10.0",
  "org.apache.parquet" % "parqet-column" % "1.10.0",
  "org.apache.parquet" % "parqet-hadoop" % "1.10.0"
)
SBT didn't like this at all, though:
sbt.librarymanagement.ResolveException: Error downloading org.apache.parquet:parqet-hadoop:1.10.0
Update: Tomasz Michniewski points out in the comments that there's actually a typo in my .sbt file! Maybe I shouldn't have given up so quickly... however, even after correcting the typo I still run into the same dependency management problems I discuss below.
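For the record, the corrected block is the same four lines with "parquet" spelled properly:

libraryDependencies ++= Seq(
  "org.apache.parquet" % "parquet-common" % "1.10.0",
  "org.apache.parquet" % "parquet-encoding" % "1.10.0",
  "org.apache.parquet" % "parquet-column" % "1.10.0",
  "org.apache.parquet" % "parquet-hadoop" % "1.10.0"
)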
I'm sure I could figure this out eventually, but I also know that I can go super old-school command-line fanatic, abandon the IDE completely, roll up my sleeves and do some manual dependency management.
The first thing I did was download the aforementioned parquet-mr project from mvnrepository and add it to my scalac classpath:
$ scalac -classpath lib/parquet-scala_2.10-1.10.1.jar ParquetSample.scala
Of course, this fails to compile, since it also can't find AvroParquetReader, GenericRecord, or Path.
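For what it's worth, the imports that resolve those symbols (plus ParquetReader, used in the cast) are:

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader

Knowing the packages at least tells you which projects to go hunting in: Avro, Hadoop, and Parquet itself.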
After a brief recursive search for dependencies, I ended up with the following minimal (compile time) dependency list:
- hadoop-hdfs-2.2.0.jar
- hadoop-common-2.2.0.jar
- parquet-hadoop-1.8.1.jar
- parquet-avro-1.8.1.jar
- avro-1.8.2.jar
Side note: this sort of recursive dependency resolution can be dangerous, since you might end up grabbing a version that's later than the correct backward-compatible version. Of course, you could always read through the POM file to find the documented dependencies... but I usually save that as a last resort if nothing is working.
Of course, that's just the compile-time dependencies. Thus begins the longer exercise of discovering the runtime dependencies: run the sample, wait for an error, and find the JAR file that contains the missing class. For the most part, this consists of running the bundle, Google-ing the resultant ClassNotFoundException, finding the associated project, downloading it, and trying again. I immediately ran across the usual suspects, Apache Commons and Log4j, as well as some Parquet format dependencies. There were a couple that were trickier to track down, both related to code from Google itself: com.google.common.base.Preconditions, which comes from Guava, and com.google.protobuf.ServiceException, which comes from Protobuf.
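As an aside, once you have a directory full of candidate JARs, it's handy to be able to check which one actually provides a given class without relying on Google. Something like this little helper (purely hypothetical, not part of the original workflow) does the trick by scanning lib/ for the class file:

import java.io.File
import java.util.jar.JarFile

// Usage (after compiling): scala FindClass com.google.common.base.Preconditions
// Scans every JAR under lib/ and prints the ones that contain the class.
object FindClass {
  def main(args: Array[String]): Unit = {
    val entryName = args(0).replace('.', '/') + ".class"
    for (jar <- new File("lib").listFiles().filter(_.getName.endsWith(".jar"))) {
      val jarFile = new JarFile(jar)
      try {
        if (jarFile.getEntry(entryName) != null)
          println(s"${args(0)} -> ${jar.getName}")
      } finally {
        jarFile.close()
      }
    }
  }
}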
After adding ten more dependencies to the runtime list:
- commons-cli-1.2.jar
- commons-configuration-1.9.jar
- commons-lang-2.6.jar
- commons-logging-1.2.jar
- guava-11.0.2.jar
- hadoop-auth-2.6.0.jar
- parquet-column-1.8.1.jar
- parquet-common-1.10.1.jar
- protobuf-java-3.5.1.jar
- slf4j-api-1.7.28.jar
I had resolved all of the ClassNotFoundExceptions, but ran into a harder-to-diagnose error:
Sep 30, 2019 10:49:04 AM org.apache.hadoop.ipc.Client call
WARNING: interrupted waiting to send rpc request to server
java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:970)
at org.apache.hadoop.ipc.Client.call(Client.java:1320)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy7.getBlockLocations(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy7.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
This one took a little bit more research to resolve, but I finally tracked it back to another missing dependency: parquet-format.
To finally get it working, I had to add eight more dependencies:
- jackson-annotations-2.9.9.jar
- jackson-core-2.9.9.jar
- jackson-core-asl-1.9.13.jar
- jackson-databind-2.9.9.3.jar
- jackson-mapper-asl-1.9.13.jar
- parquet-encoding-1.8.1.jar
- parquet-format-2.6.0.jar
- snappy-java-1.1.7.3.jar
I finally ended up with a simple shell script to compile the code:
#!/bin/sh
CLASSPATH=${CLASSPATH}:lib/hadoop-hdfs-2.2.0.jar
CLASSPATH=${CLASSPATH}:lib/hadoop-common-2.2.0.jar
CLASSPATH=${CLASSPATH}:lib/parquet-hadoop-1.8.1.jar
CLASSPATH=${CLASSPATH}:lib/parquet-avro-1.8.1.jar
CLASSPATH=${CLASSPATH}:lib/avro-1.8.2.jar
scalac -classpath $CLASSPATH ParquetSample.scala
And a companion script to run it:
#!/bin/sh
CLASSPATH=${CLASSPATH}:lib/hadoop-hdfs-2.2.0.jar
CLASSPATH=${CLASSPATH}:lib/hadoop-common-2.2.0.jar
CLASSPATH=${CLASSPATH}:lib/parquet-hadoop-1.8.1.jar
CLASSPATH=${CLASSPATH}:lib/parquet-avro-1.8.1.jar
CLASSPATH=${CLASSPATH}:lib/avro-1.8.2.jar
CLASSPATH=${CLASSPATH}:lib/commons-cli-1.2.jar
CLASSPATH=${CLASSPATH}:lib/commons-configuration-1.9.jar
CLASSPATH=${CLASSPATH}:lib/commons-lang-2.6.jar
CLASSPATH=${CLASSPATH}:lib/commons-logging-1.2.jar
CLASSPATH=${CLASSPATH}:lib/guava-11.0.2.jar
CLASSPATH=${CLASSPATH}:lib/hadoop-auth-2.6.0.jar
CLASSPATH=${CLASSPATH}:lib/jackson-annotations-2.9.9.jar
CLASSPATH=${CLASSPATH}:lib/jackson-core-2.9.9.jar
CLASSPATH=${CLASSPATH}:lib/jackson-core-asl-1.9.13.jar
CLASSPATH=${CLASSPATH}:lib/jackson-databind-2.9.9.3.jar
CLASSPATH=${CLASSPATH}:lib/jackson-mapper-asl-1.9.13.jar
CLASSPATH=${CLASSPATH}:lib/parquet-column-1.8.1.jar
CLASSPATH=${CLASSPATH}:lib/parquet-common-1.10.1.jar
CLASSPATH=${CLASSPATH}:lib/parquet-encoding-1.8.1.jar
CLASSPATH=${CLASSPATH}:lib/parquet-format-2.6.0.jar
CLASSPATH=${CLASSPATH}:lib/protobuf-java-3.5.1.jar
CLASSPATH=${CLASSPATH}:lib/slf4j-api-1.7.28.jar
CLASSPATH=${CLASSPATH}:lib/snappy-java-1.1.7.3.jar
scala -classpath .:$CLASSPATH ParquetSample
This exercise highlights why, in an ecosystem built on this much code reuse, automated dependency management systems like Maven and SBT are so handy. Still, when they have issues, especially with projects that don't get a lot of exercise like parquet-mr, it's helpful to know how to go back to basics and pare things down to the bare minimum of what's necessary.
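As a postscript: with working dependency management, all twenty-three of those JARs collapse into a couple of declared dependencies, with everything else pulled in transitively. Something like the following build.sbt should be roughly equivalent, though I haven't verified this exact combination, so treat it as a sketch rather than a known-good build:

libraryDependencies ++= Seq(
  "org.apache.parquet" % "parquet-avro" % "1.8.1",
  "org.apache.hadoop" % "hadoop-client" % "2.2.0"
)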