• Spark Parallel Job Execution

    A pretty common use case for Spark is to run many jobs in parallel. Spark is excellent at running stages in parallel after constructing the job DAG, but this doesn't help us run two entirely independent jobs in the same Spark application at the same time. One use case I can think of for parallel job execution is a step in an ETL pipeline in which we pull data from several remote sources and land them in an HDFS cluster.
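    One way this can be done is to submit each independent job from its own driver-side thread, since Spark's scheduler accepts concurrent action submissions. Below is a minimal sketch in Scala, assuming Scala Futures on the driver, JSON inputs, and hypothetical HDFS paths (none of these come from the original post):

    ```scala
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    import org.apache.spark.sql.SparkSession

    object ParallelJobs {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parallel-jobs").getOrCreate()

        // Each Future submits an independent job from its own driver thread.
        // The paths and the JSON input format are assumptions for illustration.
        val jobA = Future {
          spark.read.json("hdfs:///landing/source_a")
            .write.parquet("hdfs:///warehouse/source_a")
        }
        val jobB = Future {
          spark.read.json("hdfs:///landing/source_b")
            .write.parquet("hdfs:///warehouse/source_b")
        }

        // Block the driver until both independent jobs have completed.
        Await.result(Future.sequence(Seq(jobA, jobB)), Duration.Inf)
        spark.stop()
      }
    }
    ```

    By default, jobs submitted this way queue FIFO within the application; setting spark.scheduler.mode to FAIR lets concurrent jobs share executor resources more evenly.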
  • A Spark Gradle Project

    An Empty Spark Project with Gradle Goodies

    When I started working with Spark, I was new to many technologies, and one of the most time-consuming aspects for me was putting together a set of build tools.

    Objectives

    Over the course of this project, I found myself wanting a handful of features from the build:

    - Build the code
    - Run the tests
    - Run Spark integration tests
    - Bundle my project into an uberjar for easy use with Spark
    - Build a tar file that contains my jar and other files I might want (such as configuration templates)
    - Add the git commit hash to the jar manifest

    Using the code

    Grab the code from Github.
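    For a sense of what those objectives look like in practice, here is a minimal build.gradle sketch, not the project's actual build file. It assumes the Shadow plugin for the uberjar, Gradle's distribution plugin for the tar, and hypothetical dependency versions and paths:

    ```groovy
    plugins {
        id 'scala'
        id 'distribution'
        id 'com.github.johnrengelman.shadow' version '8.1.1'
    }

    repositories { mavenCentral() }

    dependencies {
        implementation 'org.scala-lang:scala-library:2.12.18'
        // Spark is provided by the cluster at runtime, so keep it out of the uberjar.
        compileOnly 'org.apache.spark:spark-sql_2.12:3.5.0'
        testImplementation 'org.scalatest:scalatest_2.12:3.2.17'
    }

    // Stamp the current git commit into the jar manifest (assumes git on the PATH).
    def gitHash = 'git rev-parse --short HEAD'.execute().text.trim()
    jar {
        manifest { attributes('Git-Commit': gitHash) }
    }

    // distTar bundles the uberjar together with any extra files, such as
    // configuration templates (the src/main/conf path is an assumption).
    distributions {
        main {
            contents {
                from shadowJar
                from 'src/main/conf'
            }
        }
    }
    ```

    With a layout like this, ./gradlew test runs the tests, ./gradlew shadowJar produces the uberjar, and ./gradlew distTar builds the tar file.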