While adding AWS Glue to our data processing pipelines, we needed a way to develop new scripts and incorporate them into our testing and CI/CD systems.

What is AWS Glue?

Amazon describes AWS Glue as "a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics." More specifically, it is a managed service that executes Apache Spark jobs on Hadoop YARN to perform MapReduce-style operations over large data sets in Amazon Simple Storage Service (S3) and other data stores.

Our Problem

As our data pipeline has expanded, we found some processes that benefit from using the MapReduce model. Having the ability to perform large-scale batch processing in parallel is a powerful tool. In our investigation, we found our tooling was inadequate to rapidly integrate Glue into our existing applications.

AWS offers a great UI for editing and executing scripts in its console, but there is a significant delay of ten to twelve minutes before a job even begins execution. Executing a job and waiting ten minutes only to get back a syntax error is not a viable development workflow. We needed a way to test scripts before deploying them.

AWS provides development endpoints as a solution to this problem, each of which costs $0.44 per hour. These endpoints enable developing with a notebook (Jupyter, Zeppelin, or SageMaker) or direct REPL access over SSH. This gives you an on-demand cluster and lets you see results rapidly. While powerful, this solution comes with additional costs that did not make sense for our initial exploration. We needed to be able to test our scripts locally and integrate them into our CI/CD systems.

In August 2019, AWS released their official Glue libraries and some documentation on implementing local development. Unfortunately, we were unable to find a readily available local test suite.

Our Solution

In response to the problems above, we developed a small application for testing Glue scripts written in Scala locally. The tool allows us to rapidly iterate during development and provides additional validation by incorporating a testing framework. Our solution utilizes the released libraries, SBT, and ScalaTest to allow rapid iteration and testing of AWS Glue scripts locally, while incorporating them into our existing deployment processes.

Check out our open source tool here: https://github.com/Gamesight/aws-glue-local-scala

You can test Glue job scripts in aws-glue-local-scala by defining test classes and running sbt test.

import org.scalatest._

class ExampleSpec extends FunSpec {
  describe("Example") {
    it("should run the job") {

      println(s"Starting ExampleJob at ${new java.util.Date()}")

      // Trigger the execution by directly calling the job's main method and
      // supplying arguments. AWS Glue job argument names always begin with "--"
      // so that the resolver can correctly convert them to a Map
      io.gamesight.AWSGlue.ExampleJob.main(Array(
        "--JOB_NAME", "job",
        "--stage", "dev",
        "--inputBucket", "<YOUR BUCKET NAME>",
        "--outputBucket", "<YOUR OUTPUT BUCKET NAME>",
        "--inputPrefix", "<YOUR INPUT PREFIX>",
        "--outputPrefix", "<YOUR OUTPUT PREFIX>"
      ))

      println(s"ExampleJob Finished at ${new java.util.Date()}")

    }
  }
}
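The comment in the spec about "--" prefixed arguments can be made concrete with a small sketch. This is not the AWS Glue resolver itself, only a plain-Scala stand-in under our own assumptions: the names ArgSketch and toOptionMap are hypothetical, and the sketch assumes every argument arrives as a "--key" followed by exactly one value.

```scala
// Illustrative stand-in only: this is NOT the AWS Glue resolver, just a
// plain-Scala sketch of the "--key value" convention used in the spec.
// ArgSketch and toOptionMap are hypothetical names.
object ArgSketch {

  // Walk the arguments two at a time, keep pairs whose first element starts
  // with "--", and strip that prefix to form the Map key.
  def toOptionMap(args: Array[String]): Map[String, String] =
    args.grouped(2).collect {
      case Array(key, value) if key.startsWith("--") => key.drop(2) -> value
    }.toMap

  def main(args: Array[String]): Unit = {
    val options = toOptionMap(Array(
      "--JOB_NAME", "job",
      "--stage", "dev"
    ))
    println(options("JOB_NAME")) // prints "job"
    println(options("stage"))    // prints "dev"
  }
}
```

Pairs that do not start with "--" are simply dropped in this sketch, which shows why the prefix matters: without it there is no reliable way to tell an argument name from a value. The spec relies on the same convention when it passes its arguments to ExampleJob.main.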

By executing the job from a ScalaTest spec in this way, we are able to quickly ensure that our code compiles correctly and validate the results of our jobs with assertions. Along with a test data set, we are able to use GitHub Actions to automatically verify that our jobs work properly before deploying.

About us

At Gamesight, we use data to help games find commercial success. If you are looking for help getting your game's performance or influencer marketing up and running, or want to talk about this article, identifying the right influencers, or measuring your marketing efforts, please reach out on our website!