Nathan Kleyn

Scala by day — Rust, Ruby and Haskell by night. I hail from London.

Hi! I currently work for Intent HQ as Head of Engineering, writing functional Scala by day and Rust by night. I write for SitePoint about Ruby, and I’m on Twitter and on GitHub. You can get in touch with me at mail/at/nathankleyn.com.


Top and Bottom

06 Sep 2018

As a follow-up to my previous post on Unit vs null, I thought it might be useful to talk about some of the other special types we have in statically typed languages.

There are actually two other special types in the type-system that you may be interested in: namely, the bottom and the top type.

The Bottom Type

The bottom type (written as Nothing in Scala) is a type that is impossible to create. It is commonly used as a placeholder when a type isn’t known to the compiler. See for example what happens when we don’t tell the compiler the types of the keys and values in an empty Map:

scala> Map.empty
res0: scala.collection.immutable.Map[Nothing,Nothing] = Map()

As it has no other information to go on, the types for both end up as Nothing.

It’s also useful for saying that a function never returns — for example, a function that exits the program before returning could be written as:

def doSomething(): Nothing = sys.exit(1)

It should be clear from the type that it is impossible for this function to produce a return value. In fact, it’s not even possible to write a function that has a real return value if you have Nothing as the return type:

scala> def doSomething(): Nothing = 123
<console>:11: error: type mismatch;
 found   : Int(123)
 required: Nothing
       def doSomething(): Nothing = 123

You therefore know that upon seeing a function returning Nothing, it will definitely never return if you call it. This is guaranteed because it’s impossible to make a value of type Nothing!
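One consequence worth sketching: because Nothing has no values, Scala can safely treat it as a subtype of every other type, so an expression of type Nothing fits anywhere. The names below (BottomTypeExample, fail, parsePort) are hypothetical and only there for illustration.

```scala
object BottomTypeExample {
  // Hypothetical helper: throwing never produces a value, so the return type is Nothing.
  def fail(message: String): Nothing = throw new IllegalArgumentException(message)

  // Both branches must agree on a type. Nothing is a subtype of Int,
  // so the whole expression is still an Int.
  def parsePort(raw: String): Int =
    if (raw.nonEmpty && raw.forall(_.isDigit)) raw.toInt
    else fail(s"not a port: $raw")

  // A List[Nothing] can stand in for a list of any element type.
  val empty: List[Nothing] = List.empty
  val noNames: List[String] = empty
  val noPorts: List[Int] = empty
}
```

Calling parsePort with a non-numeric string never yields an Int: the call to fail throws before a value could be produced.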

The Top Type

The opposite of the bottom type is the top type. Also called the universal type (and written in Scala as Any), this is simply a way to say “this could be any possible type”. For example:

def doSomething(): Any = 123

This function actually returns an Int but typing it Any works — what’s going on here? This is because Any is actually the ancestor of all types in Scala, even Object:

scala> val x: Object = new Object {}
x: Object = $anon$1@b428830

scala> x.isInstanceOf[Any]
res1: Boolean = true

You’ll often see Any used by the compiler when it is trying to fill in a missing type (so called “type inference”) but couldn’t find anything more specific to choose.
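As a rough sketch of what working with Any looks like in practice: once a value is typed as Any, very little is known about it, so pattern matching is the usual way to recover something more specific. The object and method names here are hypothetical, and the list is annotated explicitly so the sketch does not rely on what a particular compiler version would infer.

```scala
object TopTypeExample {
  // Elements with no more specific type in common can always be held as Any.
  val mixed: List[Any] = List(1, "two", 3.0)

  // Pattern match to recover a more specific type from an Any.
  def describe(value: Any): String = value match {
    case i: Int    => s"an Int: $i"
    case s: String => s"a String: $s"
    case other     => s"something else: $other"
  }

  val descriptions: List[String] = mixed.map(describe)
}
```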

Why Do We Need These Types?

The reason both bottom and top types have to exist comes down to how static typing works. Compilers of statically typed languages are, fundamentally, trying to solve two problems: type inference and constraint reduction.

The bottom type is a natural way to represent an impossible scenario during the first of these tasks. The bottom type is the default type if a type cannot be inferred — that is, when you don’t annotate a type, Nothing will be the end result if the compiler cannot find another more specific type after searching through everything else.

The top type is the default for constraint reduction. It gives a compiler a final ultimate result if nothing can be proven when doing constraint reduction. Any is what you end up with when two types constrain each other so much that they have no other ancestor in common, and works because every single type has Any at least in common.

Unit vs Null

05 Sep 2018

The other day, I was asked by somebody why we had Unit in Scala, and what was the difference over just using null. I thought this was a really interesting question — so here’s my response in a slightly longer form!

The Boolean Type

Let’s start from something we know well — the Boolean type! When writing code, we use these all the time. As you know, they can only have two possible values — true or false.

The boolean type is the smallest possible type with which we can represent information. It can show something was true, or false, on or off, yes or no, 1 or 0 — just a single bit is all it has to work with, but it’s enough 1.

Imagine the following example function:

def doSomething(bool: Boolean): Boolean = ???

We have a function that takes a Boolean and returns a Boolean. We know immediately some things about this function:

We know all of these things because the number of possible values for Boolean is limited. This limitation is what allows us to reason about what a function can do just from its type, and this is an incredibly useful tool that static typing gives you.
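To make the counting argument concrete, here is a small sketch. If we ignore side effects, exceptions and non-termination, a function from Boolean to Boolean is fully described by its two outputs, so only four distinct behaviours exist. The names used are hypothetical.

```scala
object BooleanFunctions {
  // Every pure, total Boolean => Boolean function behaves like one of these four.
  val all: List[(String, Boolean => Boolean)] = List(
    "identity"    -> ((b: Boolean) => b),
    "negation"    -> ((b: Boolean) => !b),
    "alwaysTrue"  -> ((_: Boolean) => true),
    "alwaysFalse" -> ((_: Boolean) => false)
  )

  // Each function is summarised by its results for false and for true.
  val truthTables: List[(Boolean, Boolean)] =
    all.map { case (_, f) => (f(false), f(true)) }
}
```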

Parameters and Return Values As Tuples

Remember that function we wrote before?

def doSomething(bool: Boolean): Boolean = ???

We can think about functions in a slightly different way: we can think about their parameters being passed as tuples and them returning tuples:

// Is the same thing as:
val doSomething: (Boolean) => (Boolean) = ???
//               ^ tuple      ^ tuple

This is how a compiler actually sees the functions you wrote — names given to parameters and syntax around showing the return type is just sugar that makes it read better to our human eyes. Really, behind the scenes, it’s more like passing tuples of arguments and receiving tuples of results back.
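Scala lets us see this correspondence directly: the standard library's tupled method turns a function of several parameters into a function of one tuple. This is only a sketch of the idea, with hypothetical names, and says nothing about how the compiler represents calls internally.

```scala
object TupledExample {
  // An ordinary two-parameter function.
  val add: (Int, Int) => Int = (a, b) => a + b

  // The same function, now taking a single tuple of arguments.
  val addTupled: ((Int, Int)) => Int = add.tupled

  val args: (Int, Int) = (1, 2)
  val sum: Int = addTupled(args)
}
```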

Keep this tuple thing in mind!

The Unit Type

Sometimes we do things where we don’t care about or can’t get a result — things that may have side-effects, for example. Often we need a way to say “this function returns nothing useful” — how can we do that without wasting space on a boolean if a boolean is the smallest piece of information we have to work with?

Well, it turns out there is a type for saying “I have nothing useful to return” — and that type is called Unit.

Unit is a type that only has a single value (hence the name). This single value is also called “unit” but to differentiate the value from the type, in some languages we write it as ().

Notice something? That looks like a tuple! And indeed, that’s not a mistake — Unit is just a way of saying “an empty tuple”!

Unit is useful because of an important property about the empty tuple — there is only one possible value of it, (). Where, for example, (Boolean) could become (true) or (false), Unit will always be () only. As a result of only having a single value, this means it is entirely useless for actually encoding any information — after all, what information can you impart if you’re only allowed to respond with one thing?

Being able to say “I have no information” is still a useful thing, and so we use Unit and () to do just that:

def doSomething(file: File): Unit = ???

Just from looking at the definition of this function, and as you were able to with the boolean example, you can tell this function must have some side-effect 2. We could even use this information to guess what this function does (perhaps it deletes the file, or maybe calls touch?)
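A minimal sketch of a Unit-returning function, using a hypothetical in-memory log in place of a real file: the only way to observe that anything happened is through the side effect, because the returned () carries no information.

```scala
object UnitExample {
  // Hypothetical in-memory log standing in for some external resource.
  val log = new StringBuilder

  // Returns Unit: all the caller learns is that the call finished.
  def record(message: String): Unit = {
    log.append(message).append('\n')
    ()
  }

  // The result is always the single value ().
  val result: Unit = record("started")
}
```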

Null: The Odd One Out

So now to the question at hand: what makes Unit useful over null?

The issue with null is that it is, basically, a hack! null is a special value that can take the place of any type, but it is not a type itself. This completely breaks our ability to reason using just types. Take for example this function:

def doSomething(path: Path): File = ???

We might look at this and think “oh cool, maybe it’s opening a File for us?”. Well, if we didn’t have null you might possibly be right 3. Unfortunately, it’s completely valid for this function to be implemented like this:

def doSomething(path: Path): File = null

What’s wrong with null is that it is a magic value not represented in the type-system, and therefore it limits a lot of our ability to use the types to their full power. This is why most idiomatic Scala eschews null in favour of things like Unit, Option[A], or Either[A, B]: these types tell the full story of what is going on, without hacks, and allow us the full power of deductive reasoning. null in Scala exists purely because of Java compatibility, and you would be wise to avoid it at all costs!
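As a sketch of the alternative, here is one way the Path to File function could put the "there might be no file" case into its type with Option. The names openExisting and describe are hypothetical, and checking with Files.isRegularFile is just one assumed way to decide whether a file is there.

```scala
import java.io.File
import java.nio.file.{Files, Path, Paths}

object OptionExample {
  // The return type now admits that there may be nothing to return.
  def openExisting(path: Path): Option[File] =
    if (Files.isRegularFile(path)) Some(path.toFile) else None

  // Callers are pushed to handle both cases instead of hitting a surprise null.
  def describe(path: Path): String = openExisting(path) match {
    case Some(file) => s"found ${file.getName}"
    case None       => "no such file"
  }
}
```

When calling Java APIs that may return null, wrapping the result in Option(...) converts a null into None at the boundary.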

  1. Space is an important limitation in computers, and it is natural for us to seek out the smallest way to represent things — booleans are as small as it gets. It also fundamentally encodes the entirety of boolean logic, which enables us to write code that can make decisions based on conditional statements. 

  2. Assuming you aren’t making empty functions, of course. 

  3. Technically, exceptions are another thing that isn’t represented in the type-system. Languages like Rust are taking this thinking to the mainstream by using an Either-like type (Result) and Option instead of exceptions! 

Data-Only Images In Docker By Abusing COPY --from

02 Jan 2018

I recently employed a little trick to make Docker data-only images which can be joined into other Docker images as needed.

Imagine you have a set of configuration that you want to store somewhere separately to the code. Well, you can make a Docker image of just this configuration very easily:

FROM scratch
ADD ./ /fake-config

After building it with docker build -t test/fake-config ., the only thing inside this image will be the directory you added. This is the most barebones image it is possible to make outside of scratch itself 1.

Now, you may wonder what the use of an image like this is — after all, it can’t be executed or run because it has no binaries inside. Enter the COPY --from command:

FROM <some base image>
# ...
COPY --from=test/fake-config /fake-config /fake-config
# ...

With one command, the contents of your data-only image have been added to another image. This is super useful for adding configuration that you want shared between applications to the Docker images for them 2.

  1. There is actually some low-level plumbing within an image based on scratch

  2. This is especially true since you can’t use symlinks within a Docker context, making it difficult to otherwise do this without some script to wrap the call to docker build that copies the common stuff around — messy. 

Using Transient Lazy Val's To Avoid Spark Serialisation Issues

29 Dec 2017

Occasionally within a Spark application, you may encounter issues because some member of a (case) class / object is not serialisable. This manifests most often as an exception when Spark first tries to send a task containing that item to an executor. The vast majority of the time, the fix you will probably reach for is to make the object implement Serializable. Sometimes, however, this may not be easy or even possible (for example, when the type in question is out of your control).

It turns out that there is another way! If the object in question can be constructed again inexpensively, the @transient lazy val pattern may be for you:

case class Foo(foo: SomeType) {
  @transient lazy val bar: SomeOtherType = SomeOtherType(foo)
}

The @transient annotation has the effect of excluding the annotated item from the object it is contained within when that object is serialised. In conjunction with the lazy val, this means the field will be constructed again when first accessed on each of the executors, rather than being sent to each executor as a series of serialised bytes to deserialise as part of the task.

Sometimes this trick can actually result in modest performance improvements: for example, if the object in question is large when serialised but is cheap to construct again (like large collections computed from some smaller seed). However, carefully note that it is constructed again once per executor, making it only useful for stateless items.
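The behaviour can be observed without a cluster by pushing an object through plain Java serialisation, which is assumed here as a stand-in for what happens when a task is shipped. In this sketch Parser is a hypothetical type that is deliberately not Serializable, and roundTrip is a hypothetical helper.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TransientLazyValExample {
  // Hypothetical stand-in for a type we cannot make Serializable.
  class Parser(val separator: String) {
    def split(line: String): List[String] = line.split(separator).toList
  }

  case class Splitter(separator: String) {
    // Skipped when serialising, rebuilt on first access afterwards.
    @transient lazy val parser: Parser = new Parser(separator)
  }

  // Serialise to bytes and read the object back.
  def roundTrip[A <: java.io.Serializable](value: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }
}
```

Forcing parser before calling roundTrip still serialises cleanly, and the copy builds its own Parser the first time it is used. Dropping the @transient annotation would be expected to fail with a NotSerializableException once the field has been initialised.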

The next time you hit a serialisation issue in Spark, give @transient lazy val a try and see if it’s a fit for you!

Stopping Spark Broadcast Variables From Leaking Throughout Your Application

20 Aug 2016

When working with Spark it’s tempting to use Broadcast variables everywhere you need them without thinking too much about it. However, if you are not careful, this can lead to strong coupling between all of your application components and the implementation of Spark itself.

How Broadcast variables work

Broadcast variables are intended to solve a very specific problem in Spark: namely, preparing some static value on the driver side to be passed efficiently to all nodes that need it during some processing step. Spark takes care of implementing the distribution of this variable in the most efficient manner, e.g. it can elect to use BitTorrent for larger variables (if enabled), and promises that only the nodes that need the data will be sent it at the point they need it.

However, in order to work, broadcast variables effectively provide a wrapper type for the variable they close over:

// This is a vastly simplified version of the class!
abstract class Broadcast[T: ClassTag](val id: Long) {
  // ...
  def value: T = // ...
  // ...
}

Effectively, Spark will store our data and reference it via an ID (a Long) which is all we need to send to the nodes until they actually need the data. When they need the data, they send the driver the id, and they get what they need back!

Working with Broadcast variables

In order for us to set up and get the value out of a broadcast variable, we need to do something like the following:

val data: Broadcast[Array[Int]] = sparkContext.broadcast(Array(1, 2, 3))

// Later on, to use it somewhere...
data.value // => Array[Int] = Array(1, 2, 3)

Leaking the types

When working with Spark, it can be easy to end up with one “god job” where the implementation of all the steps of a job is inlined and heavily dependent on Spark features.

With some effort, we can factor out our pure functions into smaller classes which we can call into:

class PostcodeToLatLong(lookup: Map[String, (Double, Double)]) {
  def lookupPostcode(postcode: String): (Double, Double) = ???
}

Here we have a simple class which needs a lookup table to function: in this case, it will convert some string based postcode to a latitude/longitude pair which is represented as a tuple of Doubles.

If this class was going to be used in a Spark job, we may consider making the lookup table a broadcast variable so that only the nodes that need it will request the data for it. However, we hit a snag: we can’t do this without leaking the Broadcast wrapper type into the implementation details of the class like so:

class PostcodeToLatLong(lookup: Broadcast[Map[String, (Double, Double)]]) {
  // ...
}

This is a nightmare for testing, as now we’ve gone from having a simple, pure class which can be unit tested easily to having a class that depends entirely on Spark implementation details and needs a SparkContext just to set up!

Abstracting away the broadcast functionality

Thankfully, we can solve this using a trait that abstracts what we actually care about: that the thing we have passed is lazy and will only be grabbed when we call .value on it!

trait Lazy[T] extends Serializable {
  def value: T
}

And now we can change our class to take this trait instead:

class PostcodeToLatLong(lookup: Lazy[Map[String, (Double, Double)]]) {
  // ...
}

With some simple implicit classes we can make it easy to call this class with either a Spark broadcast variable or any old primitive object:

object Lazy {
  object implicits {
    implicit class LazySparkBroadcast[T: ClassTag](bc: Broadcast[T]) extends Lazy[T] {
      override def value: T = bc.value
    }

    implicit class LazyPrimitive[T: ClassTag](inner: T) extends Lazy[T] {
      override def value: T = inner
    }
  }
}

And now we can put the pieces together:

import Lazy.implicits._

val lookup = Map("foo" -> (123.45, 123.45))
val bcLookup = sparkContext.broadcast(lookup)

// And later on...

val broadcastMapper = new PostcodeToLatLong(bcLookup)
// This also works!
val localMapper = new PostcodeToLatLong(lookup)
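The payoff for testing can be sketched with no Spark on the classpath at all. The body of lookupPostcode below is an assumption (the post leaves it as ???) and simply throws for unknown postcodes. CountingLazy is a hypothetical test helper that records how often the value is requested, and the postcode and coordinates are made-up sample data.

```scala
object LazyWithoutSpark {
  trait Lazy[T] extends Serializable {
    def value: T
  }

  class PostcodeToLatLong(lookup: Lazy[Map[String, (Double, Double)]]) {
    // Assumed implementation: the table is only fetched when a lookup happens.
    def lookupPostcode(postcode: String): (Double, Double) = lookup.value(postcode)
  }

  // Hypothetical test double that counts how often the value is read.
  class CountingLazy[T](inner: T) extends Lazy[T] {
    var reads = 0
    override def value: T = { reads += 1; inner }
  }
}
```

A unit test can now construct the class from an ordinary Map and also check that nothing is fetched until a lookup is actually made.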