Stopping Spark Broadcast Variables From Leaking Throughout Your Application

When working with Spark it’s tempting to use Broadcast variables everywhere you need them without thinking too much about it. However, if you are not careful, this can lead to strong coupling between all of your application components and the implementation of Spark itself.

How Broadcast variables work

Broadcast variables are intended to solve a very specific problem in Spark: namely, preparing some static value on the driver side to be passed efficiently to all nodes that need it during some processing step. Spark takes care of implementing the distribution of this variable in the most efficient manner, eg. it can elect to use BitTorrent for larger variables (if enabled), and promises that only the nodes that need the data will be sent it at the point they need it.

However, in order to work, broadcast variables effectively provide a wrapper type for the variable they close over:

// This is a vastly simplified version of the class!
abstract class Broadcast[T: ClassTag](val id: Long) {
  // ...
  def value: T = // ...
  // ...

Effectively, Spark will store our data and reference it via an ID (a Long) which is all we need to send to the nodes until they actually need the data. When they need the data, they send the driver the id, and they get what they need back!

Working with Broadcast variables

In order for us to setup and get the value out of a broadcast variable, we need to do something like the following:

val data: Broadcast[Array[Int]] = sparkContext.broadcast(Array(1, 2, 3))

// Later on, to use it somewhere...
data.value // => Array[Int] = Array(1, 2, 3))

Leaking the types

When working with Spark, it can be easy to end up with one “god job” where the implementation of all the steps of a job are inlined and heavily dependent on Spark features.

With some effort, we can factor out our pure functions into smaller classes which we can call into:

class PostcodeToLatLong(lookup: Map[String, (Double, Double)]) {
  def lookupPostcode(postcode: String): (Double, Double) = ???

Here we have a simple class which needs a lookup table to function: in this case, it will convert some string based postcode to a latitude/longitude pair which is represented as a tuple of Doubles.

If this class was going to be used in a Spark job, we may consider making the postcodeTable a broadcast variable so that only the nodes that need it will request the data for it. However, we hit a snag: we can’t do this without leaking the Brodcast wrapper type into the implementation details of the class like so:

class PostcodeToLatLong(lookup: Broadcast[Map[String, (Double, Double)]]) {
  // ...

This is a nightmare for testing, as now we’ve gone from having a simple, pure class which can be unit tested easily to having a class that depends entirely on Spark implementation details and needs a SparkContext just to setup!

Abstracting away the broadcast functionality

Thankfully, we can solve this using a trait that abstracts what we actually care about: that the thing we have passed is lazy and will only be grabbed when we call .value on it!

trait Lazy[T] extends Serializable {
  def value: T

And now we can change our class to take this trait instead:

class PostcodeToLatLong(lookup: Lazy[Map[String, (Double, Double)]]) {
  // ...

With some simple implicit classes we can make it easy to call this class with either a Spark broadcast variable or any old primitive object:

object Lazy {
  object implicits {
    implicit class LazySparkBroadcast[T: ClassTag](bc: Broadcast[T]) extends Lazy[T] {
      override def value: T = bc.value

    implicit class LazyPrimitive[T: ClassTag](inner: T) extends Lazy[T] {
      override def value: T = inner

And now we can put the pieces together:

import Lazy.implicits._

val lookup = Map("foo" -> (123.45, 123.45))
val bcLookup = sparkContext.broadcast(lookup)

// And later on...

val postcodeMapper = new PostcodeToLatLong(bcLookup)
// This also works!
val postcodeMapper = new PostcodeToLatLong(lookup)

The Parable Of The Getting Started Experience

This is the story of a boy named Stanley. Stanley had never written any code before, but he had always wanted to. His Dad worked for a company called “Google” where he programmed and it sounded like an amazing place to work! Stanley had always been interested in computers since a young age and had dreamed of being able to learn programming some day — plus, he sure liked the sound of free food, sleeping pods and dinosaurs battling flamingos.

Stanley decided to try to learn to code in his spare time. A quick search on the internet yielded plenty of results for courses he could take to get off on the right foot. He quickly determined that he wanted to learn how to make a website. After doing some reading, he discovered he would need to learn a language called “HTML”. He enrolled in a course and set to work right away, his enthusiasm palpable. The possibilities were endless once he could make things like he’d seen on the internet!

After many days of working away trying to understand HTML, Stanley had made his first web page. It was very basic, but it was something! He was very proud of what he’d achieved, but he couldn’t shake the want of making his website look more like the other websites he’d seen on the internet.

To make things pretty like those other websites, he did some research and discovered he would have to learn “CSS”. Somewhat disappointed that he’d have to learn a whole new thing from scratch, he begun to delve into that next. He found the whole ordeal rather too much for a while; between having to learn what each property did and how to combine them to make his website look nice, he was starting to think he’d never be able to make a real website.

After many weeks of learning CSS, Stanley finally made his web pages looks pretty. It wasn’t the most amazing creation, but it was something! He was very proud of what he’d achieved, but he couldn’t dispose of the feeling that the other websites still had something his didn’t. His website felt very static.

When he did some research, he found lots of tutorials throwing around words like “React” and “Angular” that made him want to give up for fear of being overwhelmed. He eventually found that underneath these two things lay a language called “JavaScript” which promised great things: being able to change the page after it loaded, do complex animations and script complex interactions with the user. He persisted and started looking into JavaScript. He questioned the need for a whole new language and set of tools to do this, but following instructions he set off to learn in the hope that he could make his website look and feel like the others.

After many months of absorbing book after book and video after video, he finally was able to embellish his website with JavaScript the way he’d seen others do. The website looked amazing, and page after page Stanley had poured his newfound knowledge into making the best website he could. There was, however, still something missing.

One evening Stanley sat down with his Dad after dinner. He showed him the website he had created and explained how he felt something was missing. He wondered how these other websites were able to allow people to write comments on their pages, or to have thousands of pages without having to change every single page one by one. Stanley’s Dad explained that they were using “Server-Side Programming”, a technique to build up the HTML from data by writing code.

Stanley, dreading having to read more books, asked his Dad if he could do this with JavaScript, HTML and CSS, the languages he’d already learned. Stanley’s Dad walked over to the computer and typed into the search box “Node.js”. Up popped a website full of new technology Stanley would have to get to grips with.

Stanley sighed, thanked his Dad, and got to work.

10 Things, 1 Year

As a personal challenge to myself, I am going to set myself 10 goals to achieve this year. As we all know, goals are nothing if you don’t set them in stone, so without further ado:

  1. Blog more. Start blogging regularly, and writing for personal enjoyment. Being a better writer makes you a better coder, so embrace it.
  2. Read more. You picked up some books in the later half of last year that have collected dust for some time. Keep up the reading and get back to the old days of reading quietly before bed instead of staring at a screen!
  3. Keep learning more about functional programming. You’ve come a long way and are doing functional programming every day at work and at home. Don’t stop learning: keep looking into new things, and discovering more about the theory behind it.
  4. Do more with Rust. Rust changed everything by challenging the notion that a programming language had to be slow to be safe. Keep writing open source projects in and learning about Rust as often as possible.
  5. Double down on Haskell. Haskell is an amazing language, try to write more with it.
  6. Attend some meetups. Try to attend some programming meetups, perhaps about Rust, Haskell or functional programming!
  7. Start writing creatively again. Once upon a time you used to write for fun, but the pen
  8. Geocache more! Try to find more geocaches this year than last and in the process discover places you would never have found otherwise!
  9. Spend more time with your family. Make sure they know how much you love and care for them, and how important they are to you. Don’t get carried away working or studying and forget to find time for them.
  10. Marry the most amazing girl in the world, Kitty. In July, you’ll say your vows — make the most of it, and make sure she knows how happy she’ll make you for the rest of your life.

Our CI and Builds at Intent HQ

I gave another presentation, this time on how we tackle CI and builds at Intent HQ! The slides are available on Speaker Deck as always!

Distributed ID Generation

I presented a talk on distributed ID generation using Redis Lua scripting as we do at Intent HQ! The slides are available on Speaker Deck and we’re planning on open-sourcing our implementation from Intent HQ soon!

In the meantime, you might find this useful as a guide on the theory and ideas behind generating unique IDs in a distributed fashion.