Case Study for Chapter Seven, Python Data Structures

In this chapter, we'll revisit our design, leveraging Python's @dataclass definitions. This holds some potential for simplification. There are some limitations that create difficult engineering tradeoffs.

We'll also look at immutable NamedTuple class definitions. These are stateless objects, leading to the possibility of some software simplifications. This will also change our design to make less use of inheritance and more use of composition.

Logical Model

Let's review the design we have so far for our model.py module. This shows the hierarchy of Sample class definitions, used to reflect the various ways samples are used.

uml diagram

The various Sample classes are a very good fit with the dataclass definition. These objects have a number of attributes, and the methods built automatically are seem to fit well the behaviors we want.

The dataclasses module defines a decorator @dataclass that transforms a class summary into a concrete class definition. The summary can omit a number of common features, like __init__(), __repr__(), __str__(), __eq__(), because they are constructed for us by the decorator.

If we wanted to be very picky, the UML for this process might be drawn this way:

uml diagram

The transformation process happens as part of class definition, and is entirely transparent. It seems simpler to consider the @dataclass decoration as an implementation choice. It doesn't change the design in a material way, it serves to reduce the volume of code we need to write.

Here's the revised Sample class, implemented as a @dataclass instead of being built entirely by hand.


from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Sample:
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

We've used the @dataclass decorator to create a class from the supplied attribute type hints. The decorator builds a commonly-used suite of methods from the supplied names and their associated type hints. If we write methods here, they are preseved in the resulting class, giving us complete flexibility to add as many methods as we need.

We can use the resulting Sample class like this:


>>> from model import Sample
>>> x = Sample(1, 2, 3, 4)
>>> x
Sample(sepal_length=1, sepal_width=2, petal_length=3, petal_width=4)

This example shows how we create instances of a class defined with the @dataclass decorator. Note the representation function, __repr__(), was created automatically, and displays a useful level of detail. This is very pleasant. It almost feels like cheating.

Here are the definitions for some more of the Sample class hierarchy.

@dataclass
class KnownSample(Sample):
    species: str


@dataclass
class TestingKnownSample(KnownSample):
    classification: Optional[str] = None


@dataclass
class TrainingKnownSample(KnownSample):
    """Note: no classification instance variable available."""
    pass

This seems to cover the user stories nicely. We didn't have to write very much code and we get a lot of useful features.

We do have a potential problem, however, we can set a classifier attribute on a TrainingKnownSample instance. Here's an example, where we create a sample to be used for training, and then also set a classification attribute.


>>> from model import TrainingKnownSample
>>> s1 = TrainingKnownSample(
...     sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
>>> s1
TrainingKnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa')

# This is undesirable...
>>> s1.classification = "wrong"
>>> s1
TrainingKnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa')
>>> s1.classification
'wrong'

Generally, Python doesn't stop us from creating a attribute, likeclassification, in an object. This behavior could be the source of hidden bugs. (A good unit test will often expose these bugs.) Note the additional attribute is not reflected in __repr__() processing or __eq__() tests for this class. It's not a serious problem. In later sections, we'll address it using frozen data classes as well as the typing.NamedTuple class.

The remaining classes in our model don't enjoy the same huge benefit from being implemented as dataclasses as the Sample classes did. When a class has a lot of attributes, and few methods, then the @dataclass definition is a big help.

Another class to benefit the most from the @dataclass treatment is the Hyperparameter class. Here's the first part of the definition, with the method body omitted.

@dataclass
class Hyperparameter:
    """A specific tuning parameter set with k and a distance algorithm"""

    k: int
    algorithm: Distance

    if TYPE_CHECKING:
        # Mypy wants to know this
        data: weakref.ReferenceType["TrainingData"]
    else:
        # @dataclass only needs to know this.
        data: weakref.ReferenceType

    def classify(self, sample: Sample) -> str:
        """The k-NN algorithm"""
        pass 

This reveals an interesting problem with some of Python's generic types. The weakref.ReferenceType, in this case, has two distinct behaviors.

In the rare cases when these conflicting needs surface, we need to import the TYPE_CHECKING global variable from the typing module. This is generally False. When mypy is examining the type hints, however, it's True.

We omitted the details of the classify() method. This doesn't change from the implementation shown in the Chapter Three case study.

We haven't seen all the features of data classes. In the next section, we'll freeze them to help spot the kind of bug where a piece of training data is used for testing purposes.

Frozen Dataclasses

The general case for dataclasses is to create mutable objects. The state can be changed by assigning new values to the attributes. This isn't always a desirable feature, and we can make a dataclass stateless.

We can describe the design by adding a UML stereotype of «Frozen». This notation can help to remind us of the implementation choice of making the object immutable. We must also respect an important rule of frozen dataclasses: an extension via inheritance must also be frozen.

The definition of frozen Sample must be kept separate from the mutable objects that are part of processing an unknown or testing sample. This splits our design into two families of classes:

The related classes for testing, training, and unknown samples form a loose collection of classes with nearly identical methods and attributes. We can call this a "paddling" of related classes. This comes from the "Duck Typing" rule: "When I see a bird that walks like a duck and quacks like a duck, I call that bird a duck." Objects created from classes with the same attributes and methods are interchangeable, even though they lack a common abstract superclass.

We can descibe this revised design with a diagram like this:

uml diagram

Here's the change to the Sample class hierarchy. It's relatively minor, and easy to overlook the frozen=True in a few places.

@dataclass(frozen=True)
class Sample:
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


@dataclass(frozen=True)
class KnownSample(Sample):
    species: str


@dataclass
class TestingKnownSample:
    sample: KnownSample
    classification: Optional[str] = None


@dataclass(frozen=True)
class TrainingKnownSample:
    """Cannot be classified."""
    sample: KnownSample

When we create an instance of a TrainingKnownSample or TestingKnownSample, we have to respect the composition of these objects: There's a frozen KnownSample object inside each of these classes. The following example shows one way to create a composite object.

>>> from model_f import TrainingKnownSample, KnownSample
>>> s1 = TrainingKnownSample(
...     sample=KnownSample(
...         sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa"
...     )
... )
>>> s1
TrainingKnownSample(sample=KnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa'))

This nested construction of a TrainingKnownSample instance containing a KnownSample object is explicit. It exposes the immutable KnownSample object.

The fozen design has a very pleasant consequence for detecting subtle bugs. The following example shows the exception raised by improper use of a TrainingKnownSample.

>>> s1.classification = "wrong"
Traceback (most recent call last):
... details omitted
dataclasses.FrozenInstanceError: cannot assign to field 'classification'

We can't accidentally introduce a bug that changes a training instance.

We get one more bonus feature that makes it easier to spot duplicates when allocating instances to the training set. The frozen versions of the Sample (and KnownSample) classes produce a consistent hash() value. This makes it easier to locate duplicate values by examining the subset of items with a common hash value.

Appropriate use of @dataclass and @dataclass(frozen=True) can be a big help in implementing object-oriented Python. These definitions provide a rich set of features with minimal code.

One other technique available to us is similar to the frozen dataclass, the typing.NamedTuple. We'll look at this, next.

NamedTuple classes

Using typing.NamedTuple is somewhat similar to using @dataclass(frozen=True). There are some signficant differences in the implementation details, however. In particular, the typing.NamedTuple class does not really support inheritance. This leads us to a composition-focused design among the classes in the Sample hierarchy.

Here's the definition of Sample as NamedTuple. It looks similar to the @dataclass definition. The definition of KnownSample, however, must change dramatically.

class Sample(NamedTuple):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


class KnownSample(NamedTuple):
    sample: Sample
    species: str

The KnownSample class is a composite, built from a Sample instance, plus the species assigned when the data was loaded initially. Since these are both subclasses of typing.NamedTuple, the values are immutable.

We've shifted from inheritance to composition in our design. Here are the two concepts, side-by-side.

uml diagram

The difference is easy to overlook in the diagram.

As we've seen, both designs will work. The choice is difficult and often revolves around the number and the complexity of the methods that are inherited from the superclass. In this example there are no methods of importance to the application defined in the Sample class.

The Testing and Training samples follow the Duck Typing rule. They have similar attributes and can be used interchangeably in many cases.

class TestingKnownSample:
    def __init__(
        self, sample: KnownSample, classification: Optional[str] = None
    ) -> None:
        self.sample = sample
        self.classification = classification

    def __repr__(self) -> str:
        return f"{self.__class__.__name__}(sample={self.sample!r}, classification={self.classification!r})"


class TrainingKnownSample(NamedTuple):
    sample: KnownSample

In this case both TestingKnownSample and TrainingKnownSample are composite objects which contain a KnownSample object. The primary difference is the presence (or absence) of an additional attribute, the classification value.

Here's an example of creating a TrainingKnownSample and trying (erroneously) to set the classification.

>>> from model_t import TrainingKnownSample, KnownSample, Sample
>>> s1 = TrainingKnownSample(
...     sample=KnownSample(
...         sample=Sample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2),
...         species="Iris-setosa"
...     ),
... )
>>> s1
TrainingKnownSample(sample=KnownSample(sample=Sample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2), species='Iris-setosa'))

>>> s1.classification = "wrong"
Traceback (most recent call last):
...
AttributeError: 'TrainingKnownSample' object has no attribute 'classification'

The code reflects the composite-of-composite design. A TrainingKnownSample instance contains a KnownSample object, which contains a Sample object. The example shows that we cannot add a new attribute to a TrainingKnownSample` instance.

Conclusion

We've seen a total of four ways to address object-oriented design and implementation.

We have a lot of flexibility in Python. It's important to look at the choices from the viewpoint of our future self trying to add or alter features. It helps to follow the SOLID design principles and focus on Single Resposibility and Interface Segregation to isolate and encapsulate our class definitions.