TypedDict vs dataclasses in Python — Epic typing BATTLE!
27th Mar 2020
We recently migrated our Meeshkan product from Python TypedDict to dataclasses. This article explains why. We'll start with a general overview of types in Python. Then, we'll walk through the difference between the two typing strategies with examples. By the end, you should have the information you need to choose the one that's the best fit for your Python project.
- Types in Python
- Classes and dataclasses
- Migrating from TypedDict to dataclasses
PEP 484, co-authored by Python's creator Guido van Rossum, gives a rationale for types in Python. It proposes:
A standard syntax for type annotations, opening up Python code to easier static analysis and refactoring, potential runtime type checking, and (perhaps, in some contexts) code generation utilizing type information.
For me, static analysis is the strongest benefit of types in Python.
It takes code like this:
Which raises this error at runtime:
And allows you to do this:
Which raises this error at compile time:
Types help us catch bugs earlier and reduce the number of unit tests to maintain.
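Here is a minimal sketch of that shift, using a hypothetical greet function rather than the article's original snippets. The untyped version only fails once it is called with bad input. The annotated version gives a checker such as mypy enough information to reject the same call without running it.

```python
def greet_untyped(name):
    # No annotations: a static checker has nothing to verify here.
    return "Hello, " + name


def greet(name: str) -> str:
    # Annotated: a checker such as mypy can reject greet(42) before the code runs.
    return "Hello, " + name


# The untyped call fails only when this line actually executes.
try:
    greet_untyped(42)
    runtime_error = None
except TypeError as err:
    runtime_error = err

print(type(runtime_error).__name__)
print(greet("Meeshkan"))
```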
Python typing works for classes as well. Let's see how static typing with classes can move two errors from runtime to compile time.
The following area.py file contains a function that calculates the area of a shape using the data provided by two classes:
The first runtime error this produces is:
Yikes! Bitten by a spelling mistake in the area function. Let's fix that by correcting the misspelled attribute name.
We run again, and:
Oh no! In the definition of area, we have used left for y instead of down. This is a common copy-and-paste error.
Let's change the area function again so that the final function reads:
After running our code again, we get the result of 27. This is what we would expect the area of a 9x3 rectangle to be.
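The walkthrough above can be sketched as follows. The class names, the left, right and down attributes, and the 9x3 result come from the text. The up attribute, the misspelling lefft, and the concrete coordinates are assumptions made for illustration.

```python
class RangeX:
    left: float
    right: float


class RangeY:
    up: float  # assumed name for the counterpart of `down`
    down: float


def area_with_typo(x, y):
    # Hypothetical misspelling of `left`; it only fails when this line runs.
    return (x.right - x.lefft) * (y.up - y.down)


def area(x, y):
    # Final version: width from RangeX, height from RangeY.
    return (x.right - x.left) * (y.up - y.down)


# Attributes are assigned one by one because a plain class has no
# generated constructor.
x = RangeX()
x.left = 1
x.right = 10
y = RangeY()
y.down = 2
y.up = 5

try:
    area_with_typo(x, y)
    typo_error = None
except AttributeError as err:
    typo_error = err

print(area(x, y))  # 27
```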
Now let's see how both of these errors could have been caught at compile time using types.
We first add type definitions to the area function:
Then we can check our area.py file using mypy, a static type checker for Python:
It spots the same errors before we even run our code.
In our previous example, you'll notice that the assignment of attributes like x.right is clunky. Instead, what we'd like to do is RangeX(left=1, right=4). The dataclass decorator makes this possible. It takes a class and turbocharges it with a constructor and several other useful methods.
Let's take our area.py file and use the dataclass decorator. According to mypy, our file is now error-free:
And it gives us the expected result of 27.
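As a sketch of what the decorator provides (the up attribute and the coordinates are assumed for illustration), the generated constructor accepts keyword arguments. The class also gains a readable repr and field-by-field equality.

```python
from dataclasses import dataclass


@dataclass
class RangeX:
    left: float
    right: float


@dataclass
class RangeY:
    up: float  # assumed name for the counterpart of `down`
    down: float


def area(x: RangeX, y: RangeY) -> float:
    return (x.right - x.left) * (y.up - y.down)


# The generated __init__ replaces attribute-by-attribute assignment.
x = RangeX(left=1, right=10)
y = RangeY(down=2, up=5)

print(x)  # RangeX(left=1, right=10)
print(x == RangeX(left=1, right=10))  # True
print(area(x, y))  # 27
```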
Both class and dataclass are nice ways to represent objects as types. They suffer from several limitations, though.
In the world of types, there is a notion called duck typing. Here's the idea: If an object looks like a duck and quacks like a duck, it's a duck.
For example, take the following JSON:
In a language with duck typing, we would define a type with the JSON's attributes, such as age. Then, any object with these attributes would correspond to the type.
In Python, classes aren't duck typed, which leads to the following situation:
This example should return False, since a Comet is not a Person. Yet the JSON or dict versions of Comet and Person would be the same, because nothing in the data records which class it came from.
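A small sketch of this situation, assuming Person and Comet share the hypothetical fields name and age: the two dataclass instances are never equal because their classes differ, yet their dict forms are identical unless a discriminating field is added.

```python
from dataclasses import asdict, dataclass


@dataclass
class Person:
    name: str
    age: int


@dataclass
class Comet:
    name: str
    age: int


person = Person(name="Halley", age=76)
comet = Comet(name="Halley", age=76)

# Dataclass equality requires both sides to be the same class.
same_object = person == comet
# Once converted to plain dicts, the class information is gone.
same_dict = asdict(person) == asdict(comet)

# A discriminating "type" field keeps the two apart after encoding.
tagged_person = {"type": "person", **asdict(person)}
tagged_comet = {"type": "comet", **asdict(comet)}

print(same_object)  # False
print(same_dict)  # True
print(tagged_person == tagged_comet)  # False
```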
We can see this when we check our example with mypy:
Duck typing helps us encode classes to another format without losing information. That is, we can create a field called type that represents a "person" or a "comet".
TypedDict brings duck typing to Python by allowing dicts to act as types.
An extra advantage of this approach is that it treats None values as optional.
Let's imagine, for example, that we extended Person like so:
If we print a Person, we'll see that the None values are still present:
This feels a bit off - it has lots of explicit None fields and gets verbose as you add more optional fields. Duck typing avoids this by only adding existing fields to an object.
So let's rewrite our person.py file to use TypedDict:
Now when we print our Person, we only see the fields that exist:
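To make the contrast concrete, here is a sketch with a hypothetical optional field, car, and a made-up person; the article's actual extra fields are not shown. It assumes Python 3.8 or later, where TypedDict lives in typing. Passing total=False marks the keys declared in that class as optional.

```python
from dataclasses import dataclass
from typing import Optional, TypedDict


@dataclass
class PersonClass:
    name: str
    age: int
    car: Optional[str] = None  # hypothetical optional field


class RequiredPersonFields(TypedDict):
    name: str
    age: int


class PersonDict(RequiredPersonFields, total=False):
    car: str  # may be left out of the dict entirely


# The dataclass always carries the field, even when it is unset.
print(PersonClass(name="Joe", age=38))
# PersonClass(name='Joe', age=38, car=None)

# The TypedDict value only contains the keys that were provided.
joe: PersonDict = {"name": "Joe", "age": 38}
print(joe)
# {'name': 'Joe', 'age': 38}
```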
You may have guessed by now, but generally, we prefer duck typing over classes. For this reason, we're very enthusiastic about TypedDict. That said, in Meeshkan, we migrated from TypedDict to dataclasses for several reasons. Throughout the rest of this article, we'll explain why we made the move.
The two reasons we migrated from TypedDict to dataclasses are matching and validation:
- Matching means determining an object's class when there's a union of several classes.
- Validation means making sure that unknown data structures, like JSON, will map to a class.
Let's use the person_vs_comet.py example from earlier to see why a class is better at matching in Python.
isinstance can discriminate between the members of a union type. This is critical for most real-world programs that support several types.
In Meeshkan, we work with union types all the time in OpenAPI. For example, most object specifications can be a Schema or a Reference to a schema. All over our codebase, you'll see isinstance(r, Reference) to make this distinction.
TypedDict doesn't work with isinstance - and for good reason. Under the hood, isinstance looks up the class of the Python object. That's a very fast operation. With duck typing, you'd have to inspect the whole object to see if "it's a duck." While this is fast for small objects, it is too slow for large objects like OpenAPI specifications. The isinstance pattern has sped up our code a lot.
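Here is a sketch of this pattern, using heavily simplified stand-ins for the OpenAPI Schema and Reference objects; their fields are assumptions, not Meeshkan's real definitions. It also shows that calling isinstance with a TypedDict class fails at runtime, which is the behavior in Python 3.8 and later.

```python
from dataclasses import dataclass
from typing import TypedDict, Union


@dataclass
class Schema:
    title: str  # hypothetical field


@dataclass
class Reference:
    ref: str  # hypothetical field


def describe(r: Union[Schema, Reference]) -> str:
    # isinstance narrows the union to one concrete class.
    if isinstance(r, Reference):
        return "reference to " + r.ref
    return "schema " + r.title


class ReferenceDict(TypedDict):
    ref: str


# TypedDict classes reject instance checks with a TypeError.
try:
    isinstance({"ref": "#/components/schemas/Pet"}, ReferenceDict)
    typeddict_check_failed = False
except TypeError:
    typeddict_check_failed = True

print(describe(Reference(ref="#/components/schemas/Pet")))
print(describe(Schema(title="Pet")))
print(typeddict_check_failed)
```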
Most code receives input from an external source, like a file or an API. In these cases, it's important to verify that the input is usable by the program. This often requires mapping the input to an internal class. With duck typing, after the validation step, this requires a call to cast.
The problem with cast is that it allows incorrect validation code to slip through. In the following person.py example, there is an intentional mistake. It asks if isinstance(d['age'], str) even though age is an int. cast, because it's so permissive, won't catch this error:
A class will only ever work with a constructor. So this will catch the error at the moment of construction:
Here, to_person will raise an error under the type checker, whereas the TypedDict version won't. This means that, with TypedDict, when an error arises, it will happen later down the line. These types of errors are much harder to debug.
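The TypedDict side of this can be sketched as follows (Python 3.8 or later; the name field and the sample values are assumptions). typing.cast returns its argument unchanged and performs no runtime check, so the faulty validation lets a string age through, and the failure only surfaces when the value is used.

```python
from typing import Any, TypedDict, cast


class Person(TypedDict):
    name: str
    age: int


def to_person(d: Any) -> Person:
    # Intentional mistake, mirroring the text: age is checked as a str.
    if (
        isinstance(d, dict)
        and isinstance(d.get("name"), str)
        and isinstance(d.get("age"), str)
    ):
        # cast only informs the type checker; nothing is verified here.
        return cast(Person, d)
    raise ValueError("input is not a Person")


bad = to_person({"name": "Joe", "age": "thirty-eight"})

# The problem appears later, far from the validation code.
try:
    next_age = bad["age"] + 1
    late_failure = False
except TypeError:
    late_failure = True

print(type(bad["age"]).__name__)  # str
print(late_failure)  # True
```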
When we changed from TypedDict to dataclasses in Meeshkan, some tests started to fail. Looking them over, we realized that they never should have succeeded. Their success was due to the use of cast, whereas the class approach surfaced several bugs.
While we love the idea of TypedDict and duck typing, it has practical limitations in Python. This makes it a poor choice for most large-scale applications. We would recommend using TypedDict in situations where you're already using dicts. In these cases, TypedDict can add a degree of type safety without having to rewrite your code. For a new project, though, we'd recommend using dataclasses. They work better with Python's type system and will lead to more resilient code.
Disagree with us? Are there any strengths or weaknesses of either approach that we're missing? Leave us a comment!