Nathan Kleyn

Scala by day — Rust, Ruby and Haskell by night. I hail from London.

Hi! I currently work for Intent HQ as Director of Engineering, writing functional Scala by day and Rust by night. I write for SitePoint about Ruby, and I’m on Twitter and on GitHub. You can get in touch with me at mail/at/

Using Transient Lazy Vals To Avoid Spark Serialisation Issues

29 Dec 2017

Occasionally within a Spark application, you may encounter issues because some member of a (case) class or object is not serialisable. This most often manifests as an exception when Spark first attempts to send a task containing that item to an executor. The vast majority of the time, the fix you will probably reach for is to make the object implement Serializable. Sometimes, however, this may not be easy or even possible (for example, when the type in question is outside your control).

It turns out that there is another way! If the object in question can be constructed again inexpensively, the @transient lazy val pattern may be for you:

case class Foo(foo: SomeType) {
  @transient lazy val bar: SomeOtherType = SomeOtherType(foo)
}

The @transient annotation has the effect of excluding the annotated item from the object it is contained within when that object is serialised. In conjunction with the lazy val, this means the field will be constructed again when first accessed on each of the executors, rather than being sent to each executor as a series of serialised bytes to deserialise as part of the task.
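The same behaviour can be observed outside Spark with plain Java serialisation, which is what Spark uses under the hood for task closures. The sketch below uses a hypothetical `Expensive` class in place of `SomeOtherType`, and round-trips a `Foo` through an `ObjectOutputStream` to show that the `@transient lazy val` survives the trip by being recomputed:

```scala
import java.io._

// Hypothetical stand-in for SomeOtherType: cheap to rebuild from its seed.
case class Expensive(seed: Int) {
  val data: Seq[Int] = (1 to 3).map(_ * seed)
}

case class Foo(foo: Int) {
  // Excluded from the serialised bytes; rebuilt lazily on first access
  // after deserialisation.
  @transient lazy val bar: Expensive = Expensive(foo)
}

object TransientDemo {
  // Round-trip an object through Java serialisation, much as Spark does
  // when shipping a task closure to an executor.
  def roundTrip[A](a: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(a)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[A]
  }

  def main(args: Array[String]): Unit = {
    val original = Foo(2)
    original.bar // force the lazy val before serialising
    val copy = roundTrip(original)
    // bar was not serialised; it is recomputed here on first access.
    assert(copy.bar == Expensive(2))
    println(copy.bar.data) // Vector(2, 4, 6)
  }
}
```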

Sometimes this trick can actually result in modest performance improvements — for example, if the object in question is large when serialised but is cheap to construct again (like a large collection computed from some smaller seed). However, note carefully that it is constructed again once per executor, making it useful only for stateless items.

The next time you hit a serialisation issue in Spark, give @transient lazy val a try and see if it’s a fit for you!

Stopping Spark Broadcast Variables From Leaking Throughout Your Application

20 Aug 2016

When working with Spark it’s tempting to use Broadcast variables everywhere you need them without thinking too much about it. However, if you are not careful, this can lead to strong coupling between all of your application components and the implementation of Spark itself.

How Broadcast variables work

Broadcast variables are intended to solve a very specific problem in Spark: namely, preparing some static value on the driver side to be passed efficiently to all nodes that need it during some processing step. Spark takes care of implementing the distribution of this variable in the most efficient manner, eg. it can elect to use BitTorrent for larger variables (if enabled), and promises that only the nodes that need the data will be sent it at the point they need it.

However, in order to work, broadcast variables effectively provide a wrapper type for the variable they close over:

// This is a vastly simplified version of the class!
abstract class Broadcast[T: ClassTag](val id: Long) {
  // ...
  def value: T = // ...
  // ...
}
Effectively, Spark stores our data and references it via an ID (a Long), which is all that needs to be sent to the nodes until they actually need the data. When they do, they request it by id and get the value back!
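As an illustration only — this is emphatically not Spark's real implementation, which distributes blocks across the cluster — the id-based indirection can be sketched as a driver-side registry plus a thin handle that carries nothing but the id:

```scala
import scala.collection.mutable

// Toy sketch: a single-JVM registry keyed by id. The real Spark driver
// distributes the data to executors (optionally via BitTorrent).
object ToyDriver {
  private val store = mutable.Map.empty[Long, Any]
  private var nextId = 0L

  def broadcast[T](value: T): ToyBroadcast[T] = {
    val id = nextId
    nextId += 1
    store(id) = value
    new ToyBroadcast[T](id)
  }

  def fetch(id: Long): Any = store(id)
}

// Only the small `id` travels with a task; the value is fetched lazily
// on first access and cached thereafter.
class ToyBroadcast[T](val id: Long) extends Serializable {
  @transient private lazy val cached: T = ToyDriver.fetch(id).asInstanceOf[T]
  def value: T = cached
}

val data = ToyDriver.broadcast(Array(1, 2, 3))
println(data.value.mkString(",")) // 1,2,3
```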

Working with Broadcast variables

To set up a broadcast variable and get the value back out of it, we do something like the following:

val data: Broadcast[Array[Int]] = sparkContext.broadcast(Array(1, 2, 3))

// Later on, to use it somewhere...
data.value // => Array[Int] = Array(1, 2, 3)

Leaking the types

When working with Spark, it can be easy to end up with one “god job” where the implementation of all the steps of a job are inlined and heavily dependent on Spark features.

With some effort, we can factor out our pure functions into smaller classes which we can call into:

class PostcodeToLatLong(lookup: Map[String, (Double, Double)]) {
  def lookupPostcode(postcode: String): (Double, Double) = ???
}

Here we have a simple class which needs a lookup table to function: in this case, it will convert some string-based postcode to a latitude/longitude pair, represented as a tuple of Doubles.

If this class were going to be used in a Spark job, we might consider making the lookup table a broadcast variable so that only the nodes that need it will request its data. However, we hit a snag: we can’t do this without leaking the Broadcast wrapper type into the implementation details of the class, like so:

class PostcodeToLatLong(lookup: Broadcast[Map[String, (Double, Double)]]) {
  // ...
}

This is a nightmare for testing: we’ve gone from a simple, pure class which can be unit tested easily to a class that depends entirely on Spark implementation details and needs a SparkContext just to set up!

Abstracting away the broadcast functionality

Thankfully, we can solve this using a trait that abstracts what we actually care about: that the thing we have passed is lazy and will only be grabbed when we call .value on it!

trait Lazy[T] extends Serializable {
  def value: T
}

And now we can change our class to take this trait instead:

class PostcodeToLatLong(lookup: Lazy[Map[String, (Double, Double)]]) {
  // ...
}

With some simple implicit classes we can make it easy to call this class with either a Spark broadcast variable or any old primitive object:

object Lazy {
  object implicits {
    implicit class LazySparkBroadcast[T: ClassTag](bc: Broadcast[T]) extends Lazy[T] {
      override def value: T = bc.value
    }

    implicit class LazyPrimitive[T: ClassTag](inner: T) extends Lazy[T] {
      override def value: T = inner
    }
  }
}
And now we can put the pieces together:

import Lazy.implicits._

val lookup = Map("foo" -> (123.45, 123.45))
val bcLookup = sparkContext.broadcast(lookup)

// And later on...

val postcodeMapper = new PostcodeToLatLong(bcLookup)
// This also works!
val plainPostcodeMapper = new PostcodeToLatLong(lookup)
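This is exactly what makes the class testable again: a unit test can construct it from a plain Map, with no SparkContext in sight. The sketch below is self-contained (it repeats the Lazy machinery without the Spark side) and assumes a trivial body for lookupPostcode, since the original leaves it as ???:

```scala
import scala.reflect.ClassTag

trait Lazy[T] extends Serializable {
  def value: T
}

object Lazy {
  object implicits {
    // Lifts any plain value into Lazy, so tests need no Spark at all.
    implicit class LazyPrimitive[T: ClassTag](inner: T) extends Lazy[T] {
      override def value: T = inner
    }
  }
}

import Lazy.implicits._

// Assumed implementation for illustration: the original elides the body.
class PostcodeToLatLong(lookup: Lazy[Map[String, (Double, Double)]]) {
  def lookupPostcode(postcode: String): (Double, Double) = lookup.value(postcode)
}

// A plain unit test: no Spark, no broadcast, just data.
val mapper = new PostcodeToLatLong(Map("SW1A 1AA" -> (51.501, -0.142)))
assert(mapper.lookupPostcode("SW1A 1AA") == ((51.501, -0.142)))
```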

10 Things, 1 Year

01 Jan 2016

As a personal challenge to myself, I am going to set myself 10 goals to achieve this year. As we all know, goals are nothing if you don’t set them in stone, so without further ado:

  1. Blog more. Start blogging regularly, and writing for personal enjoyment. Being a better writer makes you a better coder, so embrace it.
  2. Read more. You picked up some books in the later half of last year that have collected dust for some time. Keep up the reading and get back to the old days of reading quietly before bed instead of staring at a screen!
  3. Keep learning more about functional programming. You’ve come a long way and are doing functional programming every day at work and at home. Don’t stop learning: keep looking into new things, and discovering more about the theory behind it.
  4. Do more with Rust. Rust changed everything by challenging the notion that a programming language had to be slow to be safe. Keep writing open source projects in and learning about Rust as often as possible.
  5. Double down on Haskell. Haskell is an amazing language, try to write more with it.
  6. Attend some meetups. Try to attend some programming meetups, perhaps about Rust, Haskell or functional programming!
  7. Start writing creatively again. Once upon a time you used to write for fun, but the pen has sat idle for too long; pick it up again this year.
  8. Geocache more! Try to find more geocaches this year than last and in the process discover places you would never have found otherwise!
  9. Spend more time with your family. Make sure they know how much you love and care for them, and how important they are to you. Don’t get carried away working or studying and forget to find time for them.
  10. Marry the most amazing girl in the world, Kitty. In July, you’ll say your vows — make the most of it, and make sure she knows how happy she’ll make you for the rest of your life.

Our CI and Builds at Intent HQ

03 Mar 2015

I gave another presentation, this time on how we tackle CI and builds at Intent HQ! The slides are available on Speaker Deck as always!

Distributed ID Generation

25 Feb 2015

I presented a talk on distributed ID generation using Redis Lua scripting as we do at Intent HQ! The slides are available on Speaker Deck and we’re planning on open-sourcing our implementation from Intent HQ soon!

In the meantime, you might find this useful as a guide on the theory and ideas behind generating unique IDs in a distributed fashion.