Case Study for Chapter 2, Objects in Python

This section expands on the object-oriented design of a realistic example. We'll start with the diagrams created using the Unified Modeling Language (UML) to help depict and summarize the software we're going to build.

We'll describe the various considerations that are part of the Python implementation of the class definitions. We'll start with a review of the diagrams that describe the classes to be defined.

Logical View

Here's the overview of the classes we need to build. This is similar to the previous chapter's model.

uml diagram

There are four classes that define our core data model. These classes will rely on other Python class definitions, particularly instances of the list class.

We'll start with the Sample and KnownSample classes. Python 3.8 offers three essential paths for defining a new class.

Our first design decision is to use Python's class statement to write a class definition for Sample and its subclass KnownSample. This may be replaced in the future with alternatives built using dataclasses as well as NamedTuple.
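As a rough sketch of how these alternatives differ, and assuming the three paths are the class statement, dataclasses, and NamedTuple, here is a deliberately tiny example. The Point names and their two attributes are hypothetical, chosen only for illustration; they are not part of the case study's model.

```python
from dataclasses import dataclass
from typing import NamedTuple


class PointClass:
    """Path 1: an ordinary class statement with an explicit __init__()."""

    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y


@dataclass
class PointData:
    """Path 2: a dataclass; __init__(), __repr__(), and __eq__() are generated."""

    x: float
    y: float


class PointTuple(NamedTuple):
    """Path 3: a NamedTuple; instances are immutable tuples with named fields."""

    x: float
    y: float


p1 = PointClass(1.0, 2.0)
p2 = PointData(1.0, 2.0)
p3 = PointTuple(1.0, 2.0)
```

The class statement asks for the most typing but leaves every detail under our control, which is one reason it's a reasonable starting point for Sample.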

Samples and Their States

The diagram shows the Sample class and an extension, the KnownSample class. This doesn't seem to be a complete decomposition of the various kinds of samples. When we review the user stories and the process views, there seem to be some gaps.

We can make a case for two distinct subclasses of Sample:

Generally, we consider class definitions as a way to encapsulate state and behavior. An UnknownSample instance provided by a user starts out with no species. Then, after the classifier algorithm computes a species, the Sample changes state.

A question we must always ask about class definitions is this:

Is there any change in behavior that goes with the change in state?

In this case, it doesn't seem like there's anything new or different that can happen. Perhaps this is a single class with some optional attributes.

We have another possible state change concern. Currently, there's no class that owns the responsibility of partitioning Sample objects into the training or testing subsets. This, too, is a kind of state change.

This leads to a second important question:

What class has responsibility for making this state change?

In this case, it seems like the TrainingData class should own the discrimination between testing and training data.

One way to look closely at our class design is to enumerate all of the various states of individual samples. This technique helps uncover a need for attributes in the classes. It also helps identify the methods to make state changes to objects of a class.

Sample State Transitions

Let's look at the life-cycles of Sample objects. We can consider their creation, state changes, and (in some cases) the end of their processing life when there are no more references to them. We have three scenarios:

  1. Initial Load. We'll need a load() method to populate a TrainingData object from some source of raw data. We'll preview some of Chapter Nine's material by saying that reading a CSV file often produces an iterable sequence of dictionaries. We can imagine a load method using a CSV reader to create Sample objects with a species value, making them KnownSample objects. The load() method populates two lists, which is an important state change for a TrainingData object.

  2. Hyperparameter Testing. We'll need a test() method in the Hyperparameter class. The body of the test() method works with the test samples in the associated TrainingData object. For each sample, it applies the classifier and counts the matches. This points up the need for a classify() method for a single sample that's used by the test() method for a batch of samples. The test() method will change the state of the Hyperparameter object by computing a quality score.

  3. User-Initiated Classification. A RESTful web application is often decomposed into separate view functions to handle requests. When handling a request to classify an unknown sample, the view function will have a Hyperparameter object used for classification; this will be chosen by the botanist to produce the best results. The user's input will be an UnknownSample instance. The view function applies the Hyperparameter.classify() method to create a response to the user. Does this state change really matter? Here are two views:

There's a key concept underlying this detailed decomposition of these alternatives:

TIP: There's No "Right" Answer.

Some design decisions are based on non-functional and non-technical considerations. These might include the longevity of the application, future use cases, additional users who might be enticed, current schedules and budgets, pedagogical value, technical risk, the creation of intellectual property, and how cool the demo will look in a conference call.

In Chapter One, we dropped a hint that this application is the precursor to a consumer product recommender. Because of that, we'll consider a change in state from UnknownSample to ClassifiedSample to be very important. The Sample objects will live in a database for additional marketing campaigns or possibly reclassification when new products are available and the training data changes.

We'll decide to keep the classification and the species data in the UnknownSample class.

This analysis suggests we can -- perhaps -- coalesce all the various Sample details into the following design.

uml diagram

This view uses the open arrowhead to show a number of subclasses of Sample. We won't directly implement these as subclasses. (We will address this in Chapter Three.) We've included the arrows to show that we have some distinct use cases for these objects. Specifically, the box for KnownSample has a condition, "species is not None", to summarize what's unique about these Sample objects. Similarly, the UnknownSample has a condition, "species is None", to clarify our intent around Sample objects with the species attribute value of None.

In these UML diagrams, we have generally avoided showing Python's "special" methods. Generally, it seems helpful to minimize visual clutter. In some cases, a special method may be absolutely essential, and worthy of showing in a diagram. An implementation almost always needs to have an __init__() method. It will benefit from having a __repr__() method, also. In a separate part of a design document, these common aspects might be noted.

Here's the start of a class, Sample, which seems to capture all the features of a single sample.

from typing import Optional


class Sample:

    def __init__(
        self,
        sepal_length: float,
        sepal_width: float,
        petal_length: float,
        petal_width: float,
        species: Optional[str] = None,
    ) -> None:
        self.sepal_length = sepal_length
        self.sepal_width = sepal_width
        self.petal_length = petal_length
        self.petal_width = petal_width
        self.species = species
        self.classification: Optional[str] = None

    def __repr__(self) -> str:
        if self.species is None:
            known_unknown = "UnknownSample"
        else:
            known_unknown = "KnownSample"
        if self.classification is None:
            classification = ""
        else:
            classification = f", classification={self.classification!r}"
        return (
            f"{known_unknown}("
            f"sepal_length={self.sepal_length}, "
            f"sepal_width={self.sepal_width}, "
            f"petal_length={self.petal_length}, "
            f"petal_width={self.petal_width}, "
            f"species={self.species!r}"
            f"{classification}"
            f")"
        )

The __repr__() method reflects the fairly complex internal state of this Sample object. The states implied by the presence (or absence) of a species and the presence (or absence) of a classification lead to small behavior changes. So far, the behavior is limited to the __repr__() method used to display the current state of the object.

What's important is that the state changes do lead to a (tiny) behavioral change.
In Chapter Three, we'll look at some alternative designs for this.

We have two application-specific methods for the Sample class. These are shown in the next code snippet:

    def classify(self, classification: str) -> None:
        self.classification = classification

    def matches(self) -> bool:
        return self.species == self.classification

The classify() method defines the state change from unclassified to classified. The matches() method compares the result of classification with the botanist-assigned species. This is used for testing.

Here's an example of how these state changes can look:

>>> from model import Sample
>>> s2 = Sample(
...     sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
>>> s2
KnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa')
>>> s2.classification = "wrong"
>>> s2
KnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa', classification='wrong')
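The example above sets the classification attribute directly. As a small sketch of the same state change made through the classify() and matches() methods, here is a trimmed-down stand-in for Sample that keeps only the species and classification attributes; the measurements are left out purely to keep the example short.

```python
from typing import Optional


class Sample:
    """A reduced stand-in for the Sample class, for illustration only."""

    def __init__(self, species: Optional[str] = None) -> None:
        self.species = species
        self.classification: Optional[str] = None

    def classify(self, classification: str) -> None:
        self.classification = classification

    def matches(self) -> bool:
        return self.species == self.classification


known = Sample(species="Iris-setosa")
before = known.matches()        # species is set, classification is still None
known.classify("Iris-setosa")   # state change: unclassified -> classified
after = known.matches()

unknown = Sample()              # neither species nor classification
quirk = unknown.matches()       # compares None with None
```

Here before is False and after is True. The last two lines show a corner of this design worth keeping in mind: an unclassified sample with no species compares None with None, so matches() reports True. That's harmless as long as matches() is only applied to classified testing samples.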

We'll revisit this design in Chapters Three and Five. Until then, we'll look at other parts of our application, starting with the Hyperparameter class definition.

The Hyperparameter Class

Like the Sample class, the Hyperparameter class has only a few methods, but they are more complex. There doesn't seem to be a significant state change for a Hyperparameter, other than computing a quality score. The botanist provides the value of k, a part of the classification algorithm.

At this time, we're not ready to dive into all of the features of the Hyperparameter class. We're going to revisit this design in Chapter Three and complete this class.

For now, we'll provide a useful skeleton that includes the weak reference back to the TrainingData instance, and a number of methods.

A "weak" reference is a Python object which has a reference to another object, but because the reference is indirect, it isn't tracked by Python's ordinary reference counting. It works like this:

First, we'll create a tiny class, TrainingData, with no useful attributes or methods. It's a kind of stub for a class. We've created an instance of the class, and assigned it to the variable td_1.


>>> class TrainingData:
...     pass
>>> td_1 = TrainingData()

It's essential to distinguish between two things:

The object with an id of 140527584537136 has exactly one reference. The reference is the variable td_1. If we remove the variable, then there are no references to the object, and it will vanish from memory.

Trouble arises when we have two objects with mutual references. In the following diagram, the training data instance, shown with "id = 140527584537136", has three references. These are as follows:

  - The variable, td_1.
  - A reference from within a Hyperparameter object, shown with "id = 98765432".
  - A reference from within another Hyperparameter object, shown with "id = 87654321".

If we remove the variable, there's still at least one remaining reference. Even when we're done using these objects, Python can't easily prove we're done using them, and keeps them in memory.

uml diagram

We can then create a weak reference to the object like this.


>>> from weakref import ref
>>> b = ref(td_1)

We've used the weakref.ref() function to create a weak reference to the object with an id of 140527584537136.

The variable b is not a direct reference to the object of class TrainingData with an id of 140527584537136. The variable b is a reference to a weakref.ref object. The ref object can return the original object.


>>> type(b)
<class 'weakref'>

The b object is -- actually -- a kind of callable object. It's a function that will return a reference to the original TrainingData object.


>>> b() == td_1
True

The value of b() is the original object, also referred to by the variable td_1. Because b isn't a strong reference, we have a little more flexibility to remove the original TrainingData from memory.

If we delete the variable, td_1, the reference count to the underlying object with the id 140527584537136 will decrease to zero; the object will be removed from memory. Once the object is gone, the weak reference is broken, and b() returns None.


>>> del td_1
>>> b() is None
True

The weak reference did not prohibit garbage collection. This lets us have a TrainingData with a reference to a Hyperparameter while the Hyperparameter has a weak reference to the TrainingData. When we are done with a TrainingData object, it can be removed from memory without getting tangled up by having too many references.
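A minimal sketch of this arrangement follows. The tuning attribute name is an assumption made for illustration, and both classes are reduced to the bare minimum needed to show the two directions of reference.

```python
import weakref
from typing import List


class TrainingData:
    def __init__(self) -> None:
        # Hypothetical attribute name: strong references to the trials.
        self.tuning: List["Hyperparameter"] = []


class Hyperparameter:
    def __init__(self, k: int, training: TrainingData) -> None:
        self.k = k
        # Weak back-reference: this does not keep the TrainingData alive.
        self.data = weakref.ref(training)


td = TrainingData()
h = Hyperparameter(k=5, training=td)
td.tuning.append(h)

resolved_before = h.data() is td    # the weak reference resolves while td exists
del td                              # drop the only strong reference
resolved_after = h.data() is None   # the weak reference is now broken
```

This behaves as described under CPython's reference counting, where the TrainingData object is reclaimed as soon as its last strong reference goes away. Other Python implementations may reclaim the object later.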

Responsibilities

We have a second decision to make: which class is responsible for testing. It seems clear that the Hyperparameter class needs to run the test using its values.

It also seems clear the TrainingData class is an acceptable place to record the various Hyperparameter trials. This means the TrainingData can identify which of the Hyperparameter instances is best.

There are multiple, related state changes here. In this case, both the Hyperparameter and TrainingData classes will do part of the work. The system -- as a whole -- will change state as individual elements change state. This is sometimes described as "emergent behavior". Rather than write a monster class that does many things, we've written smaller classes that collaborate to achieve the expected goals.

This test() method of TrainingData is something that we didn't show in the UML image because not all ideas are drawn on the whiteboard during design conversations.

Also note that the references to the TrainingData class are provided as strings, not the simple class name. This is how mypy deals with forward references: the class name is provided as a string.
When mypy is analyzing the code, it resolves the strings into proper class names.
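mypy does this resolution statically. As a rough run-time analogy, and not a description of how mypy works internally, the standard library's typing.get_type_hints() function resolves string annotations in a similar way. The two classes below are stripped-down stand-ins used only to show the string annotation.

```python
from typing import get_type_hints


class Hyperparameter:
    # "TrainingData" is a string because the class isn't defined yet.
    def __init__(self, k: int, training: "TrainingData") -> None:
        self.k = k
        self.training = training


class TrainingData:
    """Defined after Hyperparameter, hence the string annotation above."""


# By now both names exist, so the string can be resolved to the class.
hints = get_type_hints(Hyperparameter.__init__)
```

After this runs, hints["training"] is the TrainingData class itself, not the string.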

Here's the start of the class definition.

import weakref
from typing import Optional


class Hyperparameter:
    """A hyperparameter value and the overall quality of the classification."""

    def __init__(self, k: int, training: "TrainingData") -> None:
        self.k = k
        self.data: weakref.ReferenceType["TrainingData"] = weakref.ref(training)
        self.quality: float

The testing is defined by the following method.

    def test(self) -> None:
        """Run the entire test suite."""
        training_data: Optional["TrainingData"] = self.data()
        if training_data is None:
            raise RuntimeError("Broken Weak Reference")
        pass_count, fail_count = 0, 0
        for sample in training_data.testing:
            sample.classification = self.classify(sample)
            if sample.matches():
                pass_count += 1
            else:
                fail_count += 1
        self.quality = pass_count / (pass_count + fail_count)

We start by resolving the weak reference to the training data. This will raise an exception if there's a problem.

For each testing sample, we classify the sample, setting the sample's classification attribute. The matches method tells us if the classification matches the known species.

Finally, the overall quality is the fraction of tests that passed. We can use the integer count, or a floating-point ratio of tests passed out of the total number of tests; the test() method uses the ratio.

We won't look at the classification method in this chapter. We'll save that for Chapter Three.
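Even without the real algorithm, we can exercise test() by plugging in a stand-in. The sketch below is built on assumptions: the classify() body is a placeholder threshold rule invented for this example, not the classifier developed in Chapter Three, and Sample and TrainingData are reduced to the attributes that test() touches.

```python
import weakref
from typing import List, Optional


class Sample:
    def __init__(self, petal_length: float, species: Optional[str] = None) -> None:
        self.petal_length = petal_length
        self.species = species
        self.classification: Optional[str] = None

    def matches(self) -> bool:
        return self.species == self.classification


class TrainingData:
    def __init__(self) -> None:
        self.testing: List[Sample] = []


class Hyperparameter:
    def __init__(self, k: int, training: TrainingData) -> None:
        self.k = k
        self.data = weakref.ref(training)
        self.quality: float

    def classify(self, sample: Sample) -> str:
        # Placeholder rule for illustration only; it ignores k entirely.
        return "Iris-setosa" if sample.petal_length < 2.5 else "Iris-virginica"

    def test(self) -> None:
        training_data = self.data()
        if training_data is None:
            raise RuntimeError("Broken Weak Reference")
        pass_count, fail_count = 0, 0
        for sample in training_data.testing:
            sample.classification = self.classify(sample)
            if sample.matches():
                pass_count += 1
            else:
                fail_count += 1
        self.quality = pass_count / (pass_count + fail_count)


td = TrainingData()
td.testing = [
    Sample(1.4, "Iris-setosa"),
    Sample(5.1, "Iris-virginica"),
    Sample(4.5, "Iris-versicolor"),  # the placeholder rule gets this one wrong
    Sample(1.3, "Iris-setosa"),
]
h = Hyperparameter(k=3, training=td)
h.test()
```

Three of the four testing samples match, so h.quality is 0.75. Note that an empty testing list would make the final division fail with ZeroDivisionError; a fuller implementation would need to decide how to handle that case.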

The Training Data Class

The TrainingData class has lists with two subclasses of Sample objects as well as a list with Hyperparameter instances. This class can have simple, direct references to previously-defined classes.

This class has the two methods which initiate the processing:

Because there are three stories, it seems helpful to add a method to perform a classification using a given Hyperparameter instance.

The load() method is designed to process data given by another object. We could have designed the load() method to open and read a file, but then we'd bind the TrainingData to a specific file format and logical layout. It seems better to isolate the details of file format from the details of managing training data. In Chapter Five we'll look closely at reading and validating input. In Chapter Nine, we'll revisit the file format considerations.

For now, we'll use the following outline for processing the training data.

    def load(self, raw_data_iter: Iterable[Dict[str, str]]) -> None:
        """Extract TestingKnownSample and TrainingKnownSample from raw data"""
        for n, row in enumerate(raw_data_iter):
            # ... filter and extract subsets (See Chapter 6)
            # ... create self.training and self.testing subsets
            ...
        self.uploaded = datetime.datetime.utcnow()

We'll depend on some kind of data_iter method, defined in another class. We've described the properties of this method with a type hint, Iterable[Dict[str, str]]. The Iterable states that the method's results can be used by a for statement or the list function. This is true of collections like lists. It's also true of generator functions.
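To make the point about Iterable concrete, here is a small sketch. The count_rows() and row_generator() functions are hypothetical helpers written for this example; the only thing they share with load() is the type hint on the parameter.

```python
from typing import Dict, Iterable, Iterator, List


def count_rows(rows: Iterable[Dict[str, str]]) -> int:
    """Hypothetical consumer: relies only on being able to iterate."""
    return sum(1 for _ in rows)


def row_generator() -> Iterator[Dict[str, str]]:
    """A generator function; its result also satisfies Iterable."""
    yield {"species": "Iris-setosa"}
    yield {"species": "Iris-virginica"}


row_list: List[Dict[str, str]] = [{"species": "Iris-setosa"}]

from_list = count_rows(row_list)              # a list works
from_generator = count_rows(row_generator())  # so does a generator
```

Both calls satisfy the hint, which is why the load() method doesn't need to know whether its source is an in-memory list or a lazily evaluated generator.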

The results of this iterator need to be dictionaries that map strings to strings. This is a very general structure, and it allows us to require a dictionary that looks like this:

{
    "sepal_length": "5.1",
    "sepal_width": "3.5",
    "petal_length": "1.4",
    "petal_width": "0.2",
    "species": "Iris-setosa"
}

This required structure seems flexible enough that we can build some object that will produce this. We'll look at details in Chapter Nine.
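As a brief preview, and only a sketch under stated assumptions, the standard library's csv.DictReader is one object that produces rows of this shape. The in-memory text below stands in for an open file, and its header row is assumed to use the same column names as the dictionary keys shown above.

```python
import csv
import io

# Stand-in for an open CSV file; the header names are an assumption.
raw_text = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

reader = csv.DictReader(io.StringIO(raw_text))
rows = list(reader)   # the reader is iterable, one dictionary per data row
first = rows[0]
```

Every value arrives as a string, including the measurements, so converting them to float values is left to whatever builds the Sample objects.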