TypedDict vs dataclasses in Python — Epic typing BATTLE!

Mike Solomon

CEO

27th Mar 2020

We recently migrated our Meeshkan product from Python TypedDict to dataclasses. This article explains why. We'll start with a general overview of types in Python. Then, we'll walk through the difference between the two typing strategies with examples. By the end, you should have the information you need to choose the one that's the best fit for your Python project.


Types in Python#

PEP 484, co-authored by Python's creator Guido van Rossum, gives a rationale for types in Python. It aims to provide:

A standard syntax for type annotations, opening up Python code to easier static analysis and refactoring, potential runtime type checking, and (perhaps, in some contexts) code generation utilizing type information.

For me, static analysis is the strongest benefit of types in Python.

It takes code like this:

Which raises this error at runtime:

And allows you to do this:

Which raises this error at compile time:

Types help us catch bugs earlier and reduce the number of unit tests to maintain.
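As a minimal sketch of this idea (the function `double` is hypothetical and chosen only for illustration), an annotated function lets a static checker such as mypy flag a bad call before the program runs:

```python
def double(n: int) -> int:
    """Return twice the given integer."""
    return n * 2


# A correct call passes both the type checker and the runtime.
print(double(21))  # 42

# A call such as double(None) fails at runtime with a TypeError, because
# None does not support multiplication. With the annotation in place, a
# static checker like mypy is expected to reject that call beforehand.
```

The annotation costs one line and replaces a unit test that would otherwise have to exercise the bad call.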

Classes and dataclasses#

Python typing works for classes as well. Let's see how static typing with classes can move two errors from runtime to compile time.

Setting up our example#

The following area.py file contains a function that calculates the area of a shape using the data provided by two classes:

The first runtime error this produces is:

Yikes! Bitten by a spelling mistake in the area function. Let's fix that by changing lefft to left.

We run again, and:

Oh no! In the definition of area, we have used right and left for y instead of up and down. This is a common copy-and-paste error.

Let's change the area function again so that the final function reads:

After running our code again, we get the result of 27. This is what we would expect the area of a 9x3 rectangle to be.
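A minimal sketch of what the corrected file might look like, assuming plain classes with annotated attributes. The class and attribute names come from the text above; the concrete values for `y` are assumptions chosen so that the result is 27.

```python
class RangeX:
    left: float
    right: float


class RangeY:
    up: float
    down: float


def area(x, y):
    # Width comes from left/right, height from up/down.
    x_len = x.right - x.left
    y_len = y.up - y.down
    return x_len * y_len


# Plain classes have no generated constructor, so attributes are set by hand.
x = RangeX()
x.left = 1
x.right = 4

y = RangeY()
y.down = 0  # assumed value
y.up = 9    # assumed value

print(area(x, y))  # 27
```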

Adding type definitions#

Now let's see how Python would have caught both of these errors using types at compile time.

We first add type definitions to the area function:

Then we can run our area.py file using mypy, a static type checker for Python:

It reports three errors covering the same two mistakes (the misspelled lefft, plus right and left used on y) before we even run our code.
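A sketch of how the annotations expose the mistakes, assuming the RangeX and RangeY classes described earlier. `area_buggy` is a hypothetical name for the faulty function. Defining it is harmless at runtime, but a checker like mypy should reject the attribute accesses marked in the comments.

```python
class RangeX:
    left: float
    right: float


class RangeY:
    up: float
    down: float


def area_buggy(x: RangeX, y: RangeY) -> float:
    # With annotated parameters, the checker knows which attributes exist,
    # so both lines below should be reported without running the code.
    x_len = x.right - x.lefft  # RangeX has no attribute "lefft"
    y_len = y.right - y.left   # RangeY has neither "right" nor "left"
    return x_len * y_len


def area(x: RangeX, y: RangeY) -> float:
    # Corrected version: every attribute exists on its annotated class.
    return (x.right - x.left) * (y.up - y.down)
```

Without the annotations on `x` and `y`, the checker treats both parameters as untyped and has nothing to compare the attribute names against.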

Working with dataclasses#

In our previous example, you'll notice that the assignment of attributes like x.left and x.right is clunky. Instead, what we'd like to do is RangeX(left=1, right=4). The dataclass decorator makes this possible. It takes a class and turbocharges it with a constructor and several other useful methods.

Let's take our area.py file and use the dataclass decorator.

According to mypy, our file is now error-free:

And it gives us the expected result of 27:
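A minimal sketch of the dataclass version, under the same assumptions as before: the class and attribute names come from the text, while the values for `y` are assumed so that the area is 27.

```python
from dataclasses import dataclass


@dataclass
class RangeX:
    left: float
    right: float


@dataclass
class RangeY:
    up: float
    down: float


def area(x: RangeX, y: RangeY) -> float:
    x_len = x.right - x.left
    y_len = y.up - y.down
    return x_len * y_len


# The decorator generates __init__, so keyword construction works directly.
x = RangeX(left=1, right=4)
y = RangeY(up=9, down=0)  # assumed values

print(area(x, y))  # 27
```

Besides the constructor, the decorator also generates `__repr__` and `__eq__`, so two RangeX objects with the same fields compare equal.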

class and dataclass are nice ways to represent objects as types. They suffer from several limitations, though, that TypedDict solves.

TypedDict#

But first...

A brief introduction to duck typing#

In the world of types, there is a notion called duck typing. Here's the idea: If an object looks like a duck and quacks like a duck, it's a duck.

For example, take the following JSON:

In a language with duck typing, we would define a type with the attributes name and age. Then, any object with these attributes would correspond to the type.

In Python, classes aren't duck typed, which leads to the following situation:

This example should return False, since a Comet is not a Person. But once both are converted to JSON or a dict, Comet and Person look exactly the same.

We can see this when we check our example with asdict:

Duck typing helps us encode classes to another format without losing information. That is, we can create a field called type that represents a "person" or a "comet".
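To make this concrete, here is a small sketch assuming two dataclasses with identical fields. The field values and the `type` key are illustrative assumptions, not taken from the original example.

```python
from dataclasses import asdict, dataclass


@dataclass
class Person:
    name: str
    age: int


@dataclass
class Comet:
    name: str
    age: int


person = Person(name="Halley", age=76)
comet = Comet(name="Halley", age=76)

# Dataclass equality compares the class as well as the fields.
print(person == comet)  # False

# As plain dicts, the two are indistinguishable.
print(asdict(person) == asdict(comet))  # True

# A hypothetical "type" field keeps the distinction after encoding.
person_doc = {"type": "person", **asdict(person)}
comet_doc = {"type": "comet", **asdict(comet)}
print(person_doc == comet_doc)  # False
```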

Working with TypedDict#

TypedDict brings duck typing to Python by allowing dicts to act as types.

An extra advantage of this approach is that optional fields can simply be left out instead of being stored as None.

Let's imagine, for example, that we extended Person like so:

If we print a Person, we'll see that the None values are still present:

This feels a bit off - it has lots of explicit None fields and gets verbose as you add more optional fields. Duck typing avoids this by only adding existing fields to an object.
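A sketch of the verbosity in question, assuming a dataclass with two hypothetical optional fields (`car` and `bike` are made-up names for illustration):

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Person:
    name: str
    age: int
    # Hypothetical optional fields; each one needs an explicit None default.
    car: Optional[str] = None
    bike: Optional[str] = None


mike = Person(name="Mike", age=30)

# Every declared field shows up, whether or not it was set.
print(asdict(mike))  # {'name': 'Mike', 'age': 30, 'car': None, 'bike': None}
```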

So let's rewrite our person.py file to use TypedDict:

Now when we print our Person, we only see the fields that exist:
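A minimal sketch of a TypedDict version, assuming Python 3.8 or later. The `car` and `bike` keys are hypothetical, and `total=False` is one way (an assumption here) to let keys be omitted.

```python
from typing import TypedDict


class Person(TypedDict, total=False):
    # total=False makes every key optional for the type checker.
    name: str
    age: int
    car: str
    bike: str


# A TypedDict value is an ordinary dict at runtime.
mike: Person = {"name": "Mike", "age": 30}

print(mike)  # {'name': 'Mike', 'age': 30}
```

Note that `total=False` also makes `name` and `age` optional. A stricter model could declare the required keys in a base TypedDict and the optional ones in a `total=False` subclass.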

Migrating from TypedDict to dataclasses#

You may have guessed by now, but generally, we prefer duck typing over classes. For this reason, we're very enthusiastic about TypedDict. That said, at Meeshkan, we migrated from TypedDict to dataclasses. Throughout the rest of this article, we'll explain why we made the move.

The two reasons we migrated from TypedDict to dataclasses are matching and validation:

  • Matching means determining an object's class when there's a union of several classes.
  • Validation means making sure that unknown data structures, like JSON, will map to a class.

Matching#

Let's use the person_vs_comet.py example from earlier to see why class is better at matching in Python.

In Python, isinstance can discriminate between the members of a union type. This is critical for most real-world programs that support several types.

In Meeshkan, we work with union types all the time in OpenAPI. For example, most object specifications can be a Schema or a Reference to a schema. All over our codebase, you'll see isinstance(r, Reference) to make this distinction.

TypedDict doesn't work with isinstance - and for good reason. Under the hood, isinstance looks up the class of the Python object. That's a very fast operation. With duck typing, you'd have to inspect the whole object to see if "it's a duck." While this is fast for small objects, it is too slow for large objects like OpenAPI specifications. The isinstance pattern has sped up our code a lot.
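A sketch of the pattern, assuming Python 3.8 or later. `Schema` and `Reference` here are drastically simplified stand-ins, not Meeshkan's actual OpenAPI classes, and `describe` and `ReferenceDict` are hypothetical names.

```python
from dataclasses import dataclass
from typing import TypedDict, Union


@dataclass
class Schema:
    type: str


@dataclass
class Reference:
    ref: str


def describe(item: Union[Schema, Reference]) -> str:
    # isinstance narrows the union, both for the checker and at runtime.
    if isinstance(item, Reference):
        return "reference to " + item.ref
    return "schema of type " + item.type


class ReferenceDict(TypedDict):
    ref: str


# The same check against a TypedDict class raises TypeError at runtime.
try:
    isinstance({"ref": "#/components/schemas/Pet"}, ReferenceDict)
except TypeError as err:
    print("isinstance is not supported:", err)
```

With dicts, telling the two apart means inspecting keys (for example, checking for a reference key) rather than asking for the object's class.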

Validation#

Most code receives input from an external source, like a file or an API. In these cases, it's important to verify that the input is usable by the program. This often requires mapping the input to an internal class. With duck typing, after the validation step, this requires a call to cast.

The problem with cast is that it allows incorrect validation code to slip through. In the following person.py example, there is an intentional mistake. It asks if isinstance(d['age'], str) even though age is an int. cast, because it's so permissive, won't catch this error:

However, a class will only ever work with a constructor. So this will catch the error at the moment of construction:

The above to_person will raise an error, whereas the TypedDict version won't. This means that, when an error arises, it will happen later down the line. These types of errors are much harder to debug.
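The contrast can be sketched as follows, assuming Python 3.8 or later. The names `PersonDict` and `to_person_dict` are hypothetical, and the use of local variables is a choice made for this sketch. The first function carries the intentional mistake. The second is written correctly, with a comment on what a checker like mypy should do if the same mistake were made there.

```python
from dataclasses import dataclass
from typing import Optional, TypedDict, cast


class PersonDict(TypedDict):
    name: str
    age: int


def to_person_dict(d: dict) -> Optional[PersonDict]:
    name, age = d.get("name"), d.get("age")
    # Intentional mistake: age is tested against str, although it is an int.
    if isinstance(name, str) and isinstance(age, str):
        # cast performs no check at all, so the wrong test goes unnoticed.
        return cast(PersonDict, d)
    return None


@dataclass
class Person:
    name: str
    age: int


def to_person(d: dict) -> Optional[Person]:
    name, age = d.get("name"), d.get("age")
    # Had this tested isinstance(age, str), a static checker should reject
    # the constructor call below, because the int field would receive a str.
    if isinstance(name, str) and isinstance(age, int):
        return Person(name=name, age=age)
    return None
```

With the faulty check, `to_person_dict({"name": "Mike", "age": "thirty"})` returns a value typed as a PersonDict whose age is a string, and nothing complains until that value is used much later.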

When we changed from TypedDict to dataclasses in Meeshkan, some tests started to fail. Looking them over, we realized that they never should have succeeded. Their success was due to the use of cast, whereas the class approach surfaced several bugs.

Conclusion#

While we love the idea of TypedDict and duck typing, they have practical limitations in Python. This makes TypedDict a poor choice for most large-scale applications. We would recommend using TypedDict in situations where you're already using dicts. In these cases, TypedDict can add a degree of type safety without having to rewrite your code. For a new project, though, we'd recommend using dataclasses. They work better with Python's type system and will lead to more resilient code.

Disagree with us? Are there any strengths or weaknesses of either approach that we're missing? Leave us a comment!
