In this chapter, we'll revisit our design, leveraging Python's @dataclass definitions. This holds some potential for simplification, but there are also some limitations that create difficult engineering tradeoffs.
We'll also look at immutable NamedTuple class definitions. These are stateless objects, leading to the possibility of some software simplifications. This will also change our design to make less use of inheritance and more use of composition.
Let's review the design we have so far for our model.py module. This shows the hierarchy of Sample class definitions, used to reflect the various ways samples are used.
The various Sample classes are a very good fit with the dataclass definition. These objects have a number of attributes, and the automatically built methods seem to fit the behaviors we want.
The dataclasses module defines a decorator, @dataclass, that transforms a class summary into a concrete class definition. The summary can omit a number of common features, like __init__(), __repr__(), and __eq__(), because they are constructed for us by the decorator. (No separate __str__() is generated; str() falls back to the generated __repr__().)
If we wanted to be very picky, the UML for this process might be drawn this way:
The transformation process happens as part of class definition, and is entirely transparent. It seems simpler to consider the @dataclass decoration as an implementation choice. It doesn't change the design in a material way; it serves to reduce the volume of code we need to write.
Here's the revised Sample class, implemented as a @dataclass instead of being built entirely by hand.
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Sample:
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float
We've used the @dataclass decorator to create a class from the supplied attribute type hints. The decorator builds a commonly-used suite of methods from the supplied names and their associated type hints. If we write methods here, they are preserved in the resulting class, giving us complete flexibility to add as many methods as we need.
We can use the resulting Sample class like this:
>>> from model import Sample
>>> x = Sample(1, 2, 3, 4)
>>> x
Sample(sepal_length=1, sepal_width=2, petal_length=3, petal_width=4)
This example shows how we create instances of a class defined with the @dataclass decorator. Note the representation function, __repr__(), was created automatically, and displays a useful level of detail.
This is very pleasant. It almost feels like cheating.
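The generated methods go beyond __repr__(). The following is a minimal sketch, repeating the Sample definition so it stands alone, of the generated __eq__() and of the asdict() function imported earlier. The variable names (a, b, same_values, same_object, as_mapping) are only for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass
class Sample:
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


a = Sample(5.1, 3.5, 1.4, 0.2)
b = Sample(5.1, 3.5, 1.4, 0.2)

# The generated __eq__() compares the fields, not object identity.
same_values = a == b
same_object = a is b

# asdict() converts an instance to a plain dict keyed by field name.
as_mapping = asdict(a)
```

Here same_values is True even though a and b are distinct objects, which is the behavior we'd otherwise have to write by hand.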
Here are the definitions for some more of the Sample class hierarchy.
@dataclass
class KnownSample(Sample):
    species: str

@dataclass
class TestingKnownSample(KnownSample):
    classification: Optional[str] = None

@dataclass
class TrainingKnownSample(KnownSample):
    """Note: no classification instance variable available."""
    pass
This seems to cover the user stories nicely. We didn't have to write very much code and we get a lot of useful features.
We do have a potential problem, however: we can set a classification attribute on a TrainingKnownSample instance. Here's an example, where we create a sample to be used for training, and then also set a classification attribute.
>>> from model import TrainingKnownSample
>>> s1 = TrainingKnownSample(
... sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
>>> s1
TrainingKnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa')
# This is undesirable...
>>> s1.classification = "wrong"
>>> s1
TrainingKnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa')
>>> s1.classification
'wrong'
Generally, Python doesn't stop us from creating an attribute, like classification, in an object. This behavior could be the source of hidden bugs. (A good unit test will often expose these bugs.) Note the additional attribute is not reflected in __repr__() processing or __eq__() tests for this class. It's not a serious problem. In later sections, we'll address it using frozen data classes as well as the typing.NamedTuple class.
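To make the hidden-bug risk concrete, here is a small sketch, assuming a trimmed copy of the hierarchy above. It shows that an attribute added after construction lives in the instance but is not one of the dataclass fields, so two samples still compare as equal. The names s1, s2, still_equal, and field_names are illustrative only.

```python
from dataclasses import dataclass, fields


@dataclass
class Sample:
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


@dataclass
class KnownSample(Sample):
    species: str


@dataclass
class TrainingKnownSample(KnownSample):
    """Note: no classification instance variable available."""


s1 = TrainingKnownSample(5.1, 3.5, 1.4, 0.2, "Iris-setosa")
s2 = TrainingKnownSample(5.1, 3.5, 1.4, 0.2, "Iris-setosa")

# Python allows this assignment on an ordinary (mutable) dataclass instance.
s1.classification = "wrong"

# The generated __eq__() and __repr__() only consider declared fields.
still_equal = s1 == s2
field_names = [f.name for f in fields(s1)]
```

A unit test that checks vars(s1) against the declared field names is one way to expose this kind of stray attribute.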
The remaining classes in our model don't enjoy the same huge benefit from being implemented as dataclasses as the Sample classes did. When a class has a lot of attributes, and few methods, then the @dataclass definition is a big help.
Another class that benefits from the @dataclass treatment is the Hyperparameter class. Here's the first part of the definition, with the method body omitted.
import weakref
from typing import TYPE_CHECKING

@dataclass
class Hyperparameter:
    """A specific tuning parameter set with k and a distance algorithm"""

    k: int
    algorithm: Distance
    if TYPE_CHECKING:
        # Mypy wants to know this
        data: weakref.ReferenceType["TrainingData"]
    else:
        # @dataclass only needs to know this.
        data: weakref.ReferenceType

    def classify(self, sample: Sample) -> str:
        """The k-NN algorithm"""
        pass
This reveals an interesting problem with some of Python's generic types. The weakref.ReferenceType, in this case, has two distinct behaviors.
When examined with mypy to check type references, we must provide a qualifier, weakref.ReferenceType["TrainingData"]. This uses a string as a forward reference to the yet-undefined TrainingData class.
When evaluated at run-time by the @dataclass decorator to build a class definition, the additional type qualifier cannot be used.
In the rare cases when these conflicting needs surface, we need to import the TYPE_CHECKING constant from the typing module. This is False at run time. When mypy is examining the type hints, however, it's treated as True.
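The TYPE_CHECKING pattern can be tried in isolation. This is a minimal sketch, not the case study's code: Holder and the empty TrainingData class are hypothetical stand-ins, used only to show that the else branch is the one executed at run time and that @dataclass still sees both fields.

```python
import weakref
from dataclasses import dataclass, fields
from typing import TYPE_CHECKING


class TrainingData:
    """Hypothetical stand-in for the real TrainingData class."""


@dataclass
class Holder:
    """Hypothetical class holding a weak reference to its owner."""

    k: int
    if TYPE_CHECKING:
        # Seen only by the type checker.
        data: weakref.ReferenceType["TrainingData"]
    else:
        # Executed at run time, when TYPE_CHECKING is False.
        data: weakref.ReferenceType


owner = TrainingData()
holder = Holder(k=3, data=weakref.ref(owner))

# Calling the weak reference returns the referent while it is still alive.
referent = holder.data()
field_names = [f.name for f in fields(Holder)]
```

Only one of the two annotations is ever evaluated by the interpreter, so the field is defined exactly once.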
We omitted the details of the classify() method. This doesn't change from the implementation shown in the Chapter Three case study.
We haven't seen all the features of data classes. In the next section, we'll freeze them to help spot the kind of bug where a piece of training data is used for testing purposes.
The general case for dataclasses is to create mutable objects. The state can be changed by assigning new values to the attributes. This isn't always a desirable feature, and we can make a dataclass stateless.
We can describe the design by adding a UML stereotype of «Frozen». This notation can help to remind us of the implementation choice of making the object immutable. We must also respect an important rule of frozen dataclasses: an extension via inheritance must also be frozen.
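The inheritance rule is enforced when the subclass is defined, not when it's used. Here's a short sketch, assuming a frozen Sample like the one in this section. MutableSample is a hypothetical class that exists only to trigger the error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


try:
    # A non-frozen dataclass may not extend a frozen one.
    @dataclass
    class MutableSample(Sample):
        classification: str = ""

except TypeError as ex:
    message = str(ex)
else:
    message = None
```

Because the TypeError is raised by the decorator, the mistake shows up as soon as the module is imported.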
The definition of frozen Sample must be kept separate from the mutable objects that are part of processing an unknown or testing sample. This splits our design into two families of classes:
A small hierarchy of immutable classes, specifically Sample and KnownSample.
Some associated classes that leverage these frozen classes.
The related classes for testing, training, and unknown samples form a loose collection of classes with nearly identical methods and attributes. We can call this a "paddling" of related classes. This comes from the "Duck Typing" rule: "When I see a bird that walks like a duck and quacks like a duck, I call that bird a duck." Objects created from classes with the same attributes and methods are interchangeable, even though they lack a common abstract superclass.
We can describe this revised design with a diagram like this:
Here's the change to the Sample class hierarchy. It's relatively minor, and it's easy to overlook the frozen=True in a few places.
@dataclass(frozen=True)
class Sample:
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

@dataclass(frozen=True)
class KnownSample(Sample):
    species: str

@dataclass
class TestingKnownSample:
    sample: KnownSample
    classification: Optional[str] = None

@dataclass(frozen=True)
class TrainingKnownSample:
    """Cannot be classified."""
    sample: KnownSample
When we create an instance of a TrainingKnownSample or TestingKnownSample, we have to respect the composition of these objects: there's a frozen KnownSample object inside each of these classes.
The following example shows one way to create a composite object.
>>> from model_f import TrainingKnownSample, KnownSample
>>> s1 = TrainingKnownSample(
... sample=KnownSample(
... sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa"
... )
... )
>>> s1
TrainingKnownSample(sample=KnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa'))
This nested construction of a TrainingKnownSample instance containing a KnownSample object is explicit. It exposes the immutable KnownSample object.
The frozen design has a very pleasant consequence for detecting subtle bugs. The following example shows the exception raised by improper use of a TrainingKnownSample.
>>> s1.classification = "wrong"
Traceback (most recent call last):
... details omitted
dataclasses.FrozenInstanceError: cannot assign to field 'classification'
We can't accidentally introduce a bug that changes a training instance.
We get one more bonus feature that makes it easier to spot duplicates when allocating instances to the training set. The frozen versions of the Sample (and KnownSample) classes produce a consistent hash() value. This makes it easier to locate duplicate values by examining the subset of items with a common hash value.
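One way to use the hash value, sketched here under the assumption that the frozen definitions above are in place, is to collect samples in a set and note any that are already present. The rows list and the seen and duplicates names are invented for illustration; they are not part of the case study's code.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Sample:
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


@dataclass(frozen=True)
class KnownSample(Sample):
    species: str


rows = [
    KnownSample(5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    KnownSample(7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    KnownSample(5.1, 3.5, 1.4, 0.2, "Iris-setosa"),  # repeat of the first row
]

seen: Set[KnownSample] = set()
duplicates: List[KnownSample] = []
for row in rows:
    # Set membership uses hash() first, then == on candidates.
    if row in seen:
        duplicates.append(row)
    else:
        seen.add(row)
```

A mutable @dataclass with the default settings is not hashable, so this approach depends on frozen=True.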
Appropriate use of @dataclass and @dataclass(frozen=True) can be a big help in implementing object-oriented Python. These definitions provide a rich set of features with minimal code.
One other technique available to us is similar to the frozen dataclass, the typing.NamedTuple. We'll look at this next.
Using typing.NamedTuple is somewhat similar to using @dataclass(frozen=True). There are some significant differences in the implementation details, however. In particular, the typing.NamedTuple class does not really support inheritance: a subclass can't add new fields. This leads us to a composition-focused design among the classes in the Sample hierarchy.
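The inheritance limitation can be seen in a short experiment. In this sketch, NaiveKnownSample is a hypothetical subclass written the way we would write a dataclass extension; its extra annotation does not become a tuple field, so the constructor still accepts only the four Sample values.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


class NaiveKnownSample(Sample):
    # This annotation is recorded, but it is NOT added as a tuple field.
    species: str


inherited_fields = NaiveKnownSample._fields

try:
    NaiveKnownSample(5.1, 3.5, 1.4, 0.2, "Iris-setosa")
except TypeError:
    accepts_species = False
else:
    accepts_species = True
```

This is why the next definitions wrap a Sample inside a KnownSample instead of extending it.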
Here's the definition of Sample as NamedTuple. It looks similar to the @dataclass definition. The definition of KnownSample, however, must change dramatically.
from typing import NamedTuple

class Sample(NamedTuple):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

class KnownSample(NamedTuple):
    sample: Sample
    species: str
The KnownSample class is a composite, built from a Sample instance, plus the species assigned when the data was loaded initially. Since both are defined using typing.NamedTuple, the values are immutable.
We've shifted from inheritance to composition in our design. Here are the two concepts, side-by-side.
The difference is easy to overlook in the diagram.
Using a composition-focused design, a KnownSample instance is composed of a Sample instance and a species classification. It has two attributes.
Using an inheritance-focused design, a KnownSample instance is a Sample instance. It has five attributes: all four attributes inherited from the Sample class plus one attribute unique to the KnownSample subclass.
As we've seen, both designs will work. The choice is difficult and often revolves around the number and the complexity of the methods that are inherited from the superclass. In this example there are no methods of importance to the application defined in the Sample class.
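The practical difference between the two designs shows up in how client code reaches an attribute. The following sketch places both side by side. To let them coexist in one module, the classes are given hypothetical names (BaseSample, InheritedKnownSample, ComposedKnownSample) that don't appear in the case study.

```python
from dataclasses import dataclass, fields
from typing import NamedTuple


# Inheritance-focused: a known sample IS a sample.
@dataclass(frozen=True)
class BaseSample:
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


@dataclass(frozen=True)
class InheritedKnownSample(BaseSample):
    species: str


# Composition-focused: a known sample HAS a sample.
class Sample(NamedTuple):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


class ComposedKnownSample(NamedTuple):
    sample: Sample
    species: str


inherited = InheritedKnownSample(5.1, 3.5, 1.4, 0.2, "Iris-setosa")
composed = ComposedKnownSample(Sample(5.1, 3.5, 1.4, 0.2), "Iris-setosa")

# Five attributes directly on the object versus two, one of them nested.
inherited_count = len(fields(inherited))
composed_count = len(composed._fields)
direct_value = inherited.petal_length
nested_value = composed.sample.petal_length
```

With composition, every use of a measurement goes through the extra .sample step; with inheritance, any methods defined on the superclass would come along for free.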
The Testing and Training samples follow the Duck Typing rule. They have similar attributes and can be used interchangeably in many cases.
class TestingKnownSample:
    def __init__(
        self, sample: KnownSample, classification: Optional[str] = None
    ) -> None:
        self.sample = sample
        self.classification = classification

    def __repr__(self) -> str:
        return (
            f"{self.__class__.__name__}(sample={self.sample!r}, "
            f"classification={self.classification!r})"
        )

class TrainingKnownSample(NamedTuple):
    sample: KnownSample
In this case both TestingKnownSample and TrainingKnownSample are composite objects which contain a KnownSample object. The primary difference is the presence (or absence) of an additional attribute, the classification value.
Here's an example of creating a TrainingKnownSample and trying (erroneously) to set the classification.
>>> from model_t import TrainingKnownSample, KnownSample, Sample
>>> s1 = TrainingKnownSample(
... sample=KnownSample(
... sample=Sample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2),
... species="Iris-setosa"
... ),
... )
>>> s1
TrainingKnownSample(sample=KnownSample(sample=Sample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2), species='Iris-setosa'))
>>> s1.classification = "wrong"
Traceback (most recent call last):
...
AttributeError: 'TrainingKnownSample' object has no attribute 'classification'
The code reflects the composite-of-composite design. A TrainingKnownSample instance contains a KnownSample object, which contains a Sample object. The example shows that we cannot add a new attribute to a TrainingKnownSample instance.
We've seen a total of four ways to address object-oriented design and implementation.
In previous chapters, we've looked at creating objects "from scratch", writing all the method definitions ourselves. We've emphasized inheritance among the classes in the Sample class hierarchy.
In this chapter, we've seen a stateful class definition using @dataclass. This supports inheritance among the classes in the Sample class hierarchy.
We've also seen a stateless (or immutable) definition using @dataclass(frozen=True). This tends to discourage some aspects of inheritance and favor composition.
Finally, we've looked at stateless (or immutable) definitions using NamedTuple. This must be designed using composition.
This preliminary overview of these classes makes the design seem quite simple. We'll return to this in Chapter Eight.
We have a lot of flexibility in Python. It's important to look at the choices from the viewpoint of our future self trying to add or alter features. It helps to follow the SOLID design principles and focus on Single Responsibility and Interface Segregation to isolate and encapsulate our class definitions.