• Spark Parallel Job Execution

    A pretty common use case for Spark is to run many jobs in parallel. Spark is excellent at running stages in parallel once it has constructed the job DAG, but this doesn't help us run two entirely independent jobs in the same Spark application at the same time. One use case I can think of for parallel job execution is an ETL pipeline in which we pull data from several remote sources and land it in an HDFS cluster.
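    Running independent jobs concurrently in one application is typically done by submitting actions from separate threads, for example with Scala Futures. A minimal sketch of that pattern follows; the job body here is a stand-in for a real Spark action (such as a read-and-count against a remote source), so the idea stays self-contained:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical "job": in a real application this would trigger a Spark
// action, e.g. spark.read.jdbc(...).count(); here it just measures the name.
def runJob(source: String): Future[Int] = Future {
  source.length
}

// Kick off both jobs at once; each Future runs on its own thread,
// so the two jobs proceed independently.
val jobs = Seq("jdbc-orders", "s3-clickstream").map(runJob)

// Block until every job has finished and collect the results.
val results = Await.result(Future.sequence(jobs), 30.seconds)
```

    Spark's scheduler is thread-safe, so actions submitted from different threads can run concurrently within one application, subject to the scheduler configuration.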
  • Ignoring Scalatest Tests By Tag When Using JUnit Runner and Gradle

    I use Drone, a fully awesome Docker-based build pipeline tool, for CI on some projects. On one of my projects that uses Gradle and scalatest I was experiencing a hang at the end of the build & test phase, preventing further Drone tasks from running. After a lengthy chunk of work and diagnosis I found this issue in Docker v1.12, which matched our Docker version on Drone. However, I learnt some things along the way which I'll blog about piece by piece, and first up is ignoring scalatest tests by tag when using the JUnit Runner.
  • A Spark Gradle Project

    An Empty Spark Project with Gradle Goodies. When I started working with Spark, I was new to many technologies, and one of the most time-consuming aspects for me was putting together a set of build tools. Objectives: over the course of this project, I found myself wanting a handful of features from the build: build the code; run the tests; run Spark integration tests; bundle my project into an uberjar for easy use with Spark; build a tar file that contains my jar and other files I might want (such as configuration templates); and add the git commit hash to the jar manifest. Using the code: grab the code from Github.
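    A build covering the objectives above might be sketched in Gradle roughly as follows; this is an illustrative fragment, not the project's actual build file, and the plugin version and the git-hash helper are assumptions:

```groovy
plugins {
    id 'scala'
    id 'distribution'
    // Shadow plugin builds the uberjar for use with spark-submit
    id 'com.github.johnrengelman.shadow' version '7.1.2'
}

// Hypothetical helper: ask git for the current commit hash at build time
def gitHash = 'git rev-parse --short HEAD'.execute().text.trim()

jar {
    manifest {
        // Record the commit the jar was built from
        attributes 'Git-Commit': gitHash
    }
}

distributions {
    main {
        contents {
            from shadowJar            // the uberjar
            from 'src/main/templates' // configuration templates, etc.
        }
    }
}
```

    With this shape, `gradlew shadowJar` produces the uberjar and `gradlew distTar` bundles it, together with the template files, into a tar for deployment.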