This section expands on the object-oriented design of a realistic example. We'll start with diagrams created using the Unified Modeling Language (UML) to help depict and summarize the software we're going to build.
We'll describe the various considerations that are part of the Python implementation of the class definitions. We'll start with a review of the diagrams that describe the classes to be defined.
Here's the overview of the classes we need to build. This is similar to the previous chapter's model.
There are four classes that define our core data model.
These classes will rely on other Python class definitions, particularly instances of the list class.
We'll start with the Sample and KnownSample classes.
Python 3.8 offers three essential paths for defining a new class:
- A class definition; we'll focus on this to start.
- A @dataclass definition. This provides a number of built-in features. While it's handy, it's not ideal for programmers who are new to Python, because it can obscure some implementation details. We'll set this aside for later.
- An extension to the typing.NamedTuple class. The most notable feature of this definition is that the state of the object is immutable. This turns out to be a useful feature for this kind of application. We'll set it aside for the moment so we can stick with the basics to start.
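To make the three paths concrete, here is a minimal sketch that defines the same two-attribute object three ways. The class names (PlainSample, DataSample, TupleSample) and the reduced attribute list are assumptions made for illustration; they are not part of the case study's design.

```python
from dataclasses import dataclass
from typing import NamedTuple


class PlainSample:
    """Plain class statement: we write __init__() ourselves."""

    def __init__(self, sepal_length: float, sepal_width: float) -> None:
        self.sepal_length = sepal_length
        self.sepal_width = sepal_width


@dataclass
class DataSample:
    """@dataclass: __init__(), __repr__(), and __eq__() are generated for us."""

    sepal_length: float
    sepal_width: float


class TupleSample(NamedTuple):
    """typing.NamedTuple: generated methods, and the state is immutable."""

    sepal_length: float
    sepal_width: float


p1 = PlainSample(5.1, 3.5)
p2 = DataSample(5.1, 3.5)
p3 = TupleSample(5.1, 3.5)

p1.sepal_length = 10.0  # ordinary attributes can be reassigned
p2.sepal_length = 10.0  # dataclass instances are mutable by default

try:
    p3.sepal_length = 10.0  # NamedTuple fields cannot be reassigned
except AttributeError:
    immutable = True
else:
    immutable = False
```

The only behavioral difference exercised here is mutability: the first two objects accept the new value, while the NamedTuple raises AttributeError and keeps its original state.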
Our first design decision is to use Python's class statement to write a class definition for Sample and its subclass KnownSample. This may be replaced in the future with alternatives built using dataclasses as well as NamedTuple.
The diagram shows the Sample class and an extension, the KnownSample class.
This doesn't seem to be a complete decomposition of the various kinds of samples.
When we review the user stories and the process views, there seem to be some gaps.
We can make a case for two distinct subclasses of Sample:
- UnknownSample. This class contains the initial four Sample attributes. A User provides these to get them classified.
- KnownSample. This class has the Sample attributes plus the classification result, a species name. We use these for training and testing the model.
Generally, we consider class definitions as a way to encapsulate state and behavior.
An UnknownSample instance provided by a user starts out with no species.
Then, after the classifier algorithm computes a species, the Sample changes state.
A question we must always ask about class definitions is this:
Is there any change in behavior that goes with the change in state?
In this case, it doesn't seem like there's anything new or different that can happen. Perhaps this is a single class with some optional attributes.
We have another possible state change concern.
Currently, there's no class that owns the responsibility of partitioning Sample objects into the training or testing subsets. This, too, is a kind of state change.
This leads to a second important question:
What class has responsibility for making this state change?
In this case, it seems like the TrainingData class should own the discrimination between testing and training data.
One way to look closely at our class design is to enumerate all of the various states of individual samples. This technique helps uncover a need for attributes in the classes. It also helps identify the methods to make state changes to objects of a class.
Let's look at the life-cycles of Sample objects.
We can consider their creation, state changes, and (in some cases) the end of their processing life when there are no more references to them.
We have three scenarios:
- Initial Load. We'll need a load() method to populate a TrainingData object from some source of raw data. We'll preview some of Chapter Nine's material by saying that reading a CSV file often produces an iterable sequence of dictionaries. We can imagine a load() method using a CSV reader to create Sample objects with a species value, making them KnownSample objects. The load() method populates two lists, the training and testing subsets, which is an important state change for a TrainingData object.
- Hyperparameter Testing. We'll need a test() method in the Hyperparameter class. The body of the test() method works with the test samples in the associated TrainingData object. For each sample, it applies the classifier and counts the matches. This points to the need for a classify() method for a single sample that's used by the test() method for a batch of samples. The test() method will change the state of the Hyperparameter object by computing a quality score.
- User-Initiated Classification. A RESTful web application is often decomposed into separate view functions to handle requests. When handling a request to classify an unknown sample, the view function will have a Hyperparameter object used for classification; this will be chosen by the botanist to produce the best results. The user's input will be an UnknownSample instance. The view function applies the Hyperparameter.classify() method to create a response to the user. Does this state change really matter? Here are two views:
  - Each UnknownSample can have a classified attribute. Setting this is a change in the state of the Sample. It's not clear that there's any behavior change associated with this state change.
  - The classification result is not part of the Sample at all. It's a local variable in the view function. This state change in the function is used to respond to the user, but has no life within the Sample object.
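The two views of user-initiated classification can be sketched side by side. Everything in this sketch is an assumption made for illustration: the trimmed-down UnknownSample, the placeholder_classify() rule, and both view functions are hypothetical stand-ins, not the application's real classifier or web framework code.

```python
from typing import Dict, Optional


class UnknownSample:
    """Hypothetical stand-in: only the attributes needed for this sketch."""

    def __init__(self, petal_length: float) -> None:
        self.petal_length = petal_length
        self.classified: Optional[str] = None


def placeholder_classify(sample: UnknownSample) -> str:
    """Placeholder rule (an assumption), not the real classification algorithm."""
    return "Iris-setosa" if sample.petal_length < 2.5 else "Iris-virginica"


def view_storing_state(sample: UnknownSample) -> Dict[str, str]:
    """View 1: the result becomes part of the sample's state."""
    sample.classified = placeholder_classify(sample)
    return {"species": sample.classified}


def view_local_only(sample: UnknownSample) -> Dict[str, str]:
    """View 2: the result is a local variable; the sample is left unchanged."""
    result = placeholder_classify(sample)
    return {"species": result}


s1 = UnknownSample(petal_length=1.4)
s2 = UnknownSample(petal_length=1.4)
r1 = view_storing_state(s1)
r2 = view_local_only(s2)
```

Both view functions produce the same response for the user. The difference is what survives afterwards: s1 remembers its classification, while s2 still has classified set to None.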
There's a key concept underlying this detailed decomposition of these alternatives:
TIP: There's No "Right" Answer.
Some design decisions are based on non-functional and non-technical considerations. These might include the longevity of the application, future use cases, additional users who might be enticed, current schedules and budgets, pedagogical value, technical risk, the creation of intellectual property, and how cool the demo will look in a conference call.
In Chapter One, we dropped a hint that this application is the precursor to a consumer product recommender.
Because of that, we'll consider a change in state from UnknownSample to ClassifiedSample to be very important. The Sample objects will live in a database for additional marketing campaigns or possibly reclassification when new products are available and the training data changes.
We'll decide to keep the classification and the species data in the UnknownSample class.
This analysis suggests we can -- perhaps -- coalesce all the various Sample details into the following design.
This view uses the open arrowhead to show a number of subclasses of Sample. We won't directly implement these as subclasses. (We will address this in Chapter Three.)
We've included the arrows to show that we have some distinct use cases for these objects. Specifically, the box for KnownSample has a condition, "species is not None", to summarize what's unique about these Sample objects. Similarly, the UnknownSample has a condition, "species is None", to clarify our intent around Sample objects with the species attribute value of None.
In these UML diagrams, we have generally avoided showing Python's "special" methods. Generally, it seems helpful to minimize visual clutter. In some cases, a special method may be absolutely essential, and worthy of showing in a diagram.
An implementation almost always needs to have an __init__() method. It will benefit from having a __repr__() method, also. In a separate part of a design document, these common aspects might be noted.
Here's the start of a class, Sample, which seems to capture all the features of a single sample.
from typing import Optional

class Sample:
    def __init__(
        self,
        sepal_length: float,
        sepal_width: float,
        petal_length: float,
        petal_width: float,
        species: Optional[str] = None,
    ) -> None:
        self.sepal_length = sepal_length
        self.sepal_width = sepal_width
        self.petal_length = petal_length
        self.petal_width = petal_width
        self.species = species
        # Assigned later by the classifier; None until then.
        self.classification: Optional[str] = None

    def __repr__(self) -> str:
        # The displayed name reflects whether the species is known.
        if self.species is None:
            known_unknown = "UnknownSample"
        else:
            known_unknown = "KnownSample"
        # Show the classification only after one has been assigned.
        if self.classification is None:
            classification = ""
        else:
            classification = f", classification={self.classification!r}"
        return (
            f"{known_unknown}("
            f"sepal_length={self.sepal_length}, "
            f"sepal_width={self.sepal_width}, "
            f"petal_length={self.petal_length}, "
            f"petal_width={self.petal_width}, "
            f"species={self.species!r}"
            f"{classification}"
            f")"
        )
The __repr__() method reflects the fairly complex internal state of this Sample object. The states implied by the presence (or absence) of a species and the presence (or absence) of a classification lead to small behavior changes.
So far, the behavior is limited to the __repr__() method used to display the current state of the object.
What's important is that the state changes do lead to a (tiny) behavioral change.
In Chapter Three, we'll look at some alternative designs for this.
We have two application-specific methods for the Sample class. These are shown in the next code snippet:
    def classify(self, classification: str) -> None:
        self.classification = classification

    def matches(self) -> bool:
        return self.species == self.classification
The classify() method defines the state change from unclassified to classified. The matches() method compares the results of classification with a botanist-assigned species. This is used for testing.
Here's an example of how these state changes can look:
>>> from model import Sample
>>> s2 = Sample(
...     sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
>>> s2
KnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa')
>>> s2.classification = "wrong"
>>> s2
KnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa', classification='wrong')
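The same state changes can be driven through the classify() and matches() methods instead of assigning the attribute directly. The class below is a trimmed-down stand-in for Sample, reduced to a single measurement so the sketch stays self-contained; that reduction is an assumption made for brevity.

```python
from typing import Optional


class Sample:
    """Trimmed-down stand-in for the Sample class (one measurement only)."""

    def __init__(self, petal_length: float, species: Optional[str] = None) -> None:
        self.petal_length = petal_length
        self.species = species
        self.classification: Optional[str] = None

    def classify(self, classification: str) -> None:
        self.classification = classification

    def matches(self) -> bool:
        return self.species == self.classification


known = Sample(petal_length=1.4, species="Iris-setosa")
before = known.matches()         # species is set, but nothing is classified yet

known.classify("Iris-setosa")
after_right = known.matches()    # classification agrees with the species

known.classify("Iris-virginica")
after_wrong = known.matches()    # classification disagrees with the species

unknown = Sample(petal_length=1.4)
unknown_matches = unknown.matches()  # both attributes are None, so they compare equal
```

One consequence worth noticing: with this definition, an unknown sample that has not been classified reports a match, because None == None is true. That suggests matches() is only meaningful for samples with a known species, which fits its stated use for testing.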
We'll revisit this design in Chapters Three and Five.
Until then, we'll look at other parts of our application, starting with the Hyperparameter class definition.
The Hyperparameter class has more methods than the Sample class, and those methods are more complex. There doesn't seem to be a significant state change for a Hyperparameter, other than computing a quality score.
The botanist provides the value of k, a part of the classification algorithm.
At this time, we're not ready to dive into all of the features of the Hyperparameter class. We're going to revisit this design in Chapter Three and complete this class.
For now, we'll provide a useful skeleton that includes the weak reference back to the TrainingData instance, and a number of methods.
A "weak" reference is a Python object which has a reference to another object, but because the reference is indirect, it isn't tracked by Python's ordinary reference counting. It works like this:
First, we'll create a tiny class, TrainingData, with no useful attributes or methods. It's a kind of stub for a class.
We'll then create an instance of the class and assign it to the variable td_1.
>>> class TrainingData:
...     pass
>>> td_1 = TrainingData()
It's essential to distinguish between two things:
- An object, with an id of a large number, for example 140527584537136.
- A variable, td_1, with a reference to the object with an id of 140527584537136.
The object with an id of 140527584537136 has exactly one reference. The reference is the variable td_1. If we remove the variable, then there are no references to the object, and it will vanish from memory.
Trouble arises when we have two objects with mutual references.
In the following diagram, the training data instance, shown
with "id = 140527584537136" has three references.
These are as follows:
- The variable, td_1.
- A reference from within a Hyperparameter object, shown with "id = 98765432".
- A reference from within another Hyperparameter object, shown with "id = 87654321".
If we remove the variable, there's still at least one remaining reference. Even when we're done using these objects, Python can't easily prove we're done using them, and keeps them in memory.
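The effect of mutual references can be observed directly. This is a minimal sketch that assumes CPython's reference counting plus its separate cycle detector (the gc module); the Node class is hypothetical, and a weak reference is used only as a probe to see whether the object still exists.

```python
import gc
import weakref


class Node:
    """Hypothetical class; each instance can refer to another instance."""

    other: "Node"


gc.disable()  # pause the cycle detector so the effect is visible (CPython)
try:
    a = Node()
    b = Node()
    a.other = b  # a refers to b ...
    b.other = a  # ... and b refers back to a
    probe = weakref.ref(a)  # lets us check whether the object still exists

    del a, b
    # Neither variable remains, but the mutual references keep both objects alive.
    alive_after_del = probe() is not None

    gc.collect()  # the cycle detector finds and frees the unreachable pair
    alive_after_collect = probe() is not None
finally:
    gc.enable()
```

Removing the variables alone is not enough here; the pair is only released after the extra work done by gc.collect(). Avoiding the cycle in the first place sidesteps that extra work.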
We can then create a weak reference to the object like this.
>>> from weakref import ref
>>> b = ref(td_1)
We've used the weakref.ref() function to create a weak reference to the object with an id of 140527584537136.
The variable b is not a direct reference to the object of class TrainingData with an id of 140527584537136. The variable b is a reference to a weakref.ref object. The ref object can return the original object.
>>> type(b)
<class 'weakref'>
The b object is -- actually -- a kind of callable object. It's a function that will return a reference to the original TrainingData object.
>>> b() == td_1
True
The value of b() is the original object, also referred to by the variable td_1. Because b isn't a strong reference, we have a little more flexibility to remove the original TrainingData from memory.
If we delete the variable, td_1, the reference count to the underlying object with the id 140527584537136 will decrease to zero; the object will be removed from memory. Once the object is gone, the weak reference is broken, and b() returns None as a consequence.
>>> del td_1
>>> b() is None
True
The weak reference did not prohibit garbage collection.
This lets us have a TrainingData with a reference to a Hyperparameter while the Hyperparameter has a weak reference to the TrainingData.
When we are done with a TrainingData object, it can be removed from memory without getting tangled up by having too many references.
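Here is a small sketch of that arrangement, using stub versions of both classes. The tuning attribute name and the stripped-down constructors are assumptions for this illustration, and the immediate cleanup shown relies on CPython's reference counting.

```python
import weakref
from typing import List


class Hyperparameter:
    """Stub: holds only a weak reference back to its TrainingData."""

    def __init__(self, k: int, training: "TrainingData") -> None:
        self.k = k
        self.data = weakref.ref(training)


class TrainingData:
    """Stub: holds ordinary (strong) references to Hyperparameter objects."""

    def __init__(self) -> None:
        self.tuning: List[Hyperparameter] = []


td = TrainingData()
h = Hyperparameter(k=3, training=td)
td.tuning.append(h)

linked = h.data() is td   # the weak reference resolves while td exists

del td                    # drop the only strong reference to the TrainingData
gone = h.data() is None   # the object was freed; no cycle had to be untangled
```

The Hyperparameter object survives because the variable h still refers to it, but it no longer keeps the TrainingData alive.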
We have a second decision to make: which class is responsible for testing.
It seems clear that the Hyperparameter class needs to run the test using its values.
It also seems clear the TrainingData class is an acceptable place to record the various Hyperparameter trials. This means the TrainingData can identify which of the Hyperparameter instances is best.
There are multiple, related state changes here. In this case, both the Hyperparameter and TrainingData classes will do part of the work.
The system -- as a whole -- will change state as individual elements change state.
This is sometimes described as "emergent behavior".
Rather than write a monster class that does many things, we've written smaller classes that collaborate to achieve the expected goals.
This test() method of TrainingData is something that we didn't show in the UML image because not all ideas are drawn on the whiteboard during design conversations.
Also note that the references to the TrainingData class are provided as strings, not the simple class name.
This is how mypy deals with forward references: the class name is provided as a string.
When mypy is analyzing the code, it resolves the strings into proper class names.
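A short sketch shows why the string form matters at runtime. The two classes here are bare stubs written for this illustration: the hint names a class that has not been defined yet, and Python simply stores the string without trying to look the name up.

```python
class Hyperparameter:
    # "TrainingData" is not defined yet, so the hint is written as a string.
    def __init__(self, k: int, training: "TrainingData") -> None:
        self.k = k
        self.training = training


class TrainingData:
    """Stub defined after the class that refers to it."""


# The annotation is stored as the literal string; it was never evaluated.
stored_hint = Hyperparameter.__init__.__annotations__["training"]

h = Hyperparameter(k=3, training=TrainingData())
```

A type checker such as mypy reads the whole module before resolving the string, so the order of the two class definitions does not matter to it.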
Here's the start of the class definition.
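import weakref

class Hyperparameter:
    """A hyperparameter value and the overall quality of the classification."""

    def __init__(self, k: int, training: "TrainingData") -> None:
        self.k = k
        # Weak reference back to the TrainingData, avoiding a reference cycle.
        self.data: weakref.ReferenceType["TrainingData"] = weakref.ref(training)
        # Declared here; the value is assigned by test().
        self.quality: float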
The testing is defined by the following method.
    def test(self) -> None:
        """Run the entire test suite."""
        # Resolve the weak reference; None means the TrainingData is gone.
        training_data: Optional["TrainingData"] = self.data()
        if training_data is None:
            raise RuntimeError("Broken Weak Reference")
        pass_count, fail_count = 0, 0
        for sample in training_data.testing:
            sample.classification = self.classify(sample)
            if sample.matches():
                pass_count += 1
            else:
                fail_count += 1
        # Fraction of test samples classified correctly.
        self.quality = pass_count / (pass_count + fail_count)
We start by resolving the weak reference to the training data. This will raise an exception if there's a problem.
For each testing sample, we classify the sample, setting the sample's classification attribute. The matches() method tells us if the classification matches the known species.
Finally, the overall quality is the fraction of tests that passed. We can use the integer count, or a floating point ratio of tests passed out of the total number of tests.
We won't look at the classification method in this chapter. We'll save that for Chapter Three.
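To see how the quality score comes out, the test() logic can be exercised with a stand-in classifier. This is a sketch under stated assumptions: Sample and TrainingData are reduced stubs, and the classify() rule below is a made-up placeholder, not the algorithm the case study will develop.

```python
import weakref
from typing import List, Optional


class Sample:
    """Stub with one measurement and the attributes test() relies on."""

    def __init__(self, petal_length: float, species: Optional[str] = None) -> None:
        self.petal_length = petal_length
        self.species = species
        self.classification: Optional[str] = None

    def matches(self) -> bool:
        return self.species == self.classification


class TrainingData:
    """Stub holding only the testing subset."""

    def __init__(self, testing: List[Sample]) -> None:
        self.testing = testing


class Hyperparameter:
    def __init__(self, k: int, training: TrainingData) -> None:
        self.k = k
        self.data = weakref.ref(training)
        self.quality: float

    def classify(self, sample: Sample) -> str:
        # Placeholder rule (an assumption for this sketch), NOT the real algorithm.
        return "Iris-setosa" if sample.petal_length < 2.5 else "Iris-versicolor"

    def test(self) -> None:
        training_data = self.data()
        if training_data is None:
            raise RuntimeError("Broken Weak Reference")
        pass_count, fail_count = 0, 0
        for sample in training_data.testing:
            sample.classification = self.classify(sample)
            if sample.matches():
                pass_count += 1
            else:
                fail_count += 1
        self.quality = pass_count / (pass_count + fail_count)


td = TrainingData(
    testing=[
        Sample(1.4, "Iris-setosa"),      # placeholder gets this right
        Sample(4.7, "Iris-versicolor"),  # right
        Sample(6.0, "Iris-virginica"),   # wrong: the placeholder never answers virginica
        Sample(1.3, "Iris-setosa"),      # right
    ]
)
h = Hyperparameter(k=3, training=td)
h.test()
```

Three of the four samples match, so h.quality is 0.75. Note that an empty testing list would make the final division raise ZeroDivisionError; a complete implementation may want to guard against that case.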
The TrainingData class has lists with two subclasses of Sample objects as well as a list with Hyperparameter instances.
This class can have simple, direct references to previously-defined classes.
This class has the two methods which initiate the processing:
- The load() method reads raw data and partitions it into training data and test data. Both of these are subclasses of KnownSample.
- The test() method uses a Hyperparameter object, performs the test, and saves the result.
Because there are three stories, it seems helpful to add a method to perform a classification using a given Hyperparameter instance.
The load() method is designed to process data given by another object. We could have designed the load() method to open and read a file, but then we'd bind the TrainingData to a specific file format and logical layout. It seems better to isolate the details of file format from the details of managing training data. In Chapter Five we'll look closely at reading and validating input. In Chapter Nine, we'll revisit the file format considerations.
For now, we'll use the following outline for processing the training data.
    def load(self, raw_data_iter: Iterable[Dict[str, str]]) -> None:
        """Extract TestingKnownSample and TrainingKnownSample from raw data"""
        for n, row in enumerate(raw_data_iter):
            # ... filter and extract subsets (See Chapter 6)
            # ... Create self.training and self.testing subsets
            pass
        self.uploaded = datetime.datetime.utcnow()
We'll depend on some kind of data_iter method, defined in another class.
We've described the properties of this method with a type hint, Iterable[Dict[str, str]].
The Iterable states that the method's results can be used by a for statement or the list function.
This is true of collections like lists. It's also true of generator functions.
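As a quick illustration of what the Iterable hint permits, the sketch below passes both a list and the result of a generator function to the same consumer. The function names species_names() and row_generator() are hypothetical, introduced only for this example.

```python
from typing import Dict, Iterable, Iterator, List


def species_names(rows: Iterable[Dict[str, str]]) -> List[str]:
    """Hypothetical consumer: anything usable in a for statement will do."""
    return [row["species"] for row in rows]


def row_generator() -> Iterator[Dict[str, str]]:
    """A generator function; the object it returns is also an Iterable."""
    yield {"species": "Iris-setosa"}
    yield {"species": "Iris-virginica"}


from_list = species_names(
    [{"species": "Iris-setosa"}, {"species": "Iris-virginica"}]
)
from_generator = species_names(row_generator())
```

Both calls return the same list of names, so the consumer does not need to know which kind of source it was given.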
The results of this iterator need to be dictionaries that map strings to strings. This is a very general structure, and it allows us to require a dictionary that looks like this:
{
    "sepal_length": "5.1",
    "sepal_width": "3.5",
    "petal_length": "1.4",
    "petal_width": "0.2",
    "species": "Iris-setosa"
}
This required structure seems flexible enough that we can build some object that will produce this. We'll look at details in Chapter Nine.
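One object that can produce this structure is the standard library's csv.DictReader. The following is a minimal sketch, assuming a CSV source with a header row naming the five columns; the row_iter() helper and the in-memory text are hypothetical stand-ins for whatever Chapter Nine settles on.

```python
import csv
import io
from typing import Dict, Iterator, TextIO

# Hypothetical CSV text with a header row; a real source would be an open file.
raw_text = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)


def row_iter(source: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dictionary per data row, keyed by the header names."""
    reader = csv.DictReader(source)
    for row in reader:
        yield dict(row)


rows = list(row_iter(io.StringIO(raw_text)))
first = rows[0]
```

Every value arrives as a string, for example "5.1" for sepal_length, so converting to float is left to whatever consumes these rows.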