We want to explore some additional features of object-oriented design in Python. The first is what is sometimes called "syntactic sugar": a convenient notation that simplifies expressing something fairly complex. The second is the context manager, a concept for managing external resources.
In Chapter Four, we built the @authenticate decorator. This decorator was used on Flask view functions to check the Authorization HTTP header to be sure the request included proper credentials. (We also added a self-signed certificate so the web service would use the HTTPS protocol with a secured connection to prevent exposing credentials.)
The TrainingData object is loaded from a file. Currently, we don't make a large effort to validate the contents of the file.
There are two common ways a file with training data can be processed in our application:
- An application can use the load() method of TrainingData. Currently, this method requires an iterable sequence of dictionaries. The type hint is Iterable[Dict[str, str]].
- Based on the use cases shown in the context diagram in Chapter One, the classifier web server handles training data uploads and testing. These view functions will make use of the load() method of TrainingData. We'll turn to this in chapter TBD.
This case study will focus on the load() method of TrainingData. Currently, the file must be CSV-formatted data. In Chapter Nine, we'll look at alternative formats.
Thinking about alternative formats suggests the TrainingData class should not depend on the Dict[str, str] row definition that is part of CSV file processing. While this approach is simple, it pushes details into the TrainingData class that don't really belong there. Specifically, any detail of the source document's representation has nothing to do with managing a collection of training and test samples.
In order to support multiple sources of data, we will need some variant rules for validating the input values.
We'll need a class like this:
class SampleReader:
    """See iris.names for attribute ordering in bezdekIris.data file"""

    header = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

    def __init__(self, source: Path) -> None:
        self.source = source

    def sample_iter(self) -> Iterator[Sample]:
        with self.source.open() as source_file:
            reader = csv.DictReader(source_file, self.header)
            for row in reader:
                yield Sample(
                    sepal_length=float(row["sepal_length"]),
                    sepal_width=float(row["sepal_width"]),
                    petal_length=float(row["petal_length"]),
                    petal_width=float(row["petal_width"]),
                )
This builds an instance of the Sample superclass from the input fields read by a CSV DictReader instance. The sample_iter() method uses a series of conversion expressions to translate input data into useful Python objects. In this example, the conversions are simple: a bunch of float() function calls convert CSV string data to Python objects.
This isn't ideal because any failure to process the input can look like a bug in the software. The ValueError exception is used widely in Python. A unique exception is a better way to disentangle our application's errors from ordinary bugs in our Python code.
class BadSampleRow(ValueError):
    pass
The float() functions -- when confronted with bad data -- will raise a ValueError exception. This is good, but a bug in a distance formula may also raise a ValueError, leading to possible confusion. We want our application to produce unique exceptions to make it easy to identify the root cause of the exception.
Second, we want to map the various float() problems to our application's exception. This is a change to the sample_iter() method.
def sample_iter(self) -> Iterator[Sample]:
    with self.source.open() as source_file:
        reader = csv.DictReader(source_file, self.header)
        for row in reader:
            try:
                sample = Sample(
                    sepal_length=float(row["sepal_length"]),
                    sepal_width=float(row["sepal_width"]),
                    petal_length=float(row["petal_length"]),
                    petal_width=float(row["petal_width"]),
                )
            except ValueError as ex:
                raise BadSampleRow(f"Invalid {row!r}") from ex
            yield sample
The creation of the target class is wrapped in a try: statement. Any ValueError that's raised will become a BadSampleRow exception. This includes the source data so we can pinpoint the problem. We've used raise...from... so that the original exception is preserved to help with the debugging.
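As a minimal, standalone sketch of this technique (the convert() helper here is illustrative, not part of the case study), raise...from... attaches the original exception as the new exception's __cause__:

```python
from typing import Dict

class BadSampleRow(ValueError):
    pass

def convert(row: Dict[str, str]) -> float:
    # Illustrative helper: convert one field, wrapping any low-level
    # ValueError in our application's unique exception.
    try:
        return float(row["petal_width"])
    except ValueError as ex:
        raise BadSampleRow(f"Invalid {row!r}") from ex

try:
    convert({"petal_width": "0.2mm"})
except BadSampleRow as ex:
    print(type(ex.__cause__).__name__)  # ValueError: the original is preserved
```

A debugger or traceback display will show both exceptions, chained, so we can see the application-level error and the low-level cause together.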
Once we have valid input, we have to decide whether the object should be used for training or testing. We'll turn to that problem next.
The SampleReader class -- while simple -- has a problem. It isn't easily extended to cover the two KnownSample subclasses, TrainingKnownSample and TestingKnownSample. The class name used to create objects is part of the sample_iter() method, and difficult to change.
We could create subclasses of SampleReader to create the two subclasses of Sample. The bulk of the subclass implementations of the sample_iter() method would be a copy of the superclass method. It's not a good idea to copy and paste the body of the sample_iter() method with one small change to the class reference. Two nearly identical methods risk a change to one subclass that isn't properly applied to the copy in the other subclass. Inconsistent changes are an example of "code rot."
A better approach is to extract the class name and make it a class-level variable, like this:
class SampleReader:
    """See iris.names for attribute ordering in bezdekIris.data file"""

    target_class = KnownSample
    header = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

    def __init__(self, source: Path) -> None:
        self.source = source

    def sample_iter(self) -> Iterator[Sample]:
        target_class = self.target_class
        with self.source.open() as source_file:
            reader = csv.DictReader(source_file, self.header)
            for row in reader:
                try:
                    sample = target_class(
                        sepal_length=float(row["sepal_length"]),
                        sepal_width=float(row["sepal_width"]),
                        petal_length=float(row["petal_length"]),
                        petal_width=float(row["petal_width"]),
                    )
                except ValueError as ex:
                    raise BadSampleRow(f"Invalid {row!r}") from ex
                yield sample
This is pleasantly Pythonic. We can create subclasses of this reader to create different kinds of samples from the raw data.
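Here's a sketch of that subclassing. The reader and sample classes below are simplified stand-ins for the case study's versions (the reader takes a text stream instead of a Path, and the samples carry only two fields); the point is that a subclass replaces only the target_class variable, not the method body:

```python
import csv
import io
from typing import Iterator

class Sample:
    def __init__(self, sepal_length: float, sepal_width: float) -> None:
        self.sepal_length = sepal_length
        self.sepal_width = sepal_width

class TrainingKnownSample(Sample):
    pass

class SampleReader:
    target_class = Sample
    header = ["sepal_length", "sepal_width"]

    def __init__(self, source: io.TextIOBase) -> None:
        self.source = source

    def sample_iter(self) -> Iterator[Sample]:
        reader = csv.DictReader(self.source, self.header)
        for row in reader:
            # The class created here is whatever target_class names.
            yield self.target_class(
                sepal_length=float(row["sepal_length"]),
                sepal_width=float(row["sepal_width"]),
            )

class TrainingSampleReader(SampleReader):
    # The only change: which class gets created.
    target_class = TrainingKnownSample

data = io.StringIO("5.1,3.5\n4.9,3.0\n")
samples = list(TrainingSampleReader(data).sample_iter())
print(type(samples[0]).__name__)  # TrainingKnownSample
```

No method body is duplicated; the variant behavior lives in a single class-level name.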
This class exposes another problem. A single source of raw sample data needs to be partitioned into two separate subclasses of KnownSample. Ideally, a single reader would emit a mixture of the two classes.
We have two paths forward to provide the needed functionality:

- A more sophisticated algorithm for deciding which class to create. The algorithm would likely be an if statement that creates an instance of one class or the other.
- A simplified definition of KnownSample that handles immutable training samples separately from testing samples that can be classified (and reclassified) any number of times.
Simplification seems to be a good idea. The second alternative suggests we can separate three distinct aspects of a sample:

- The "raw" data. This is the core collection of measurements. They are immutable. (We'll address this design variation in Chapter Seven.)
- The botanist-assigned species. This is available for training or testing data, but not part of an unknown sample. The assigned species, like the measurements, is immutable.
- A classifier-assigned classification. This is applied to the testing and unknown samples. This value can be seen as mutable; each time we classify a sample (or reclassify a test sample), the value changes.
This is a profound change to the design created so far. Early in a project, this kind of change can be necessary. Way back in Chapters One and Two, we decided to create a fairly sophisticated class hierarchy for various kinds of samples. It's time to revisit that design.
We can rethink our earlier designs from several points of view. One alternative is to separate the essential Sample class from the additional features. There seem to be four combinations of features:
| | Known | Unknown |
|---|---|---|
| Unclassified | Training Data | Sample waiting to be classified |
| Classified | Testing Data | Classified Sample |
We've omitted a detail from the Classified row. Each time we do a classification, a specific Hyperparameter is associated with the classification. It's not simply a "Classified Sample"; it would be more accurate to say it's a sample classified by a specific hyperparameter.
The distinction between the two cells in the Unknown column is minute. The distinction is so minor as to be essentially irrelevant to most processing: an unknown sample will be waiting to be classified for -- at most -- a few lines of code.
If we rethink this, we may be able to create fewer classes and still reflect the object state and behavior changes correctly.
There can be two subclasses of Sample with a separate Classification object. Here's a diagram.
We've refined the class hierarchy to reflect two essentially different kinds of samples:
- A KnownSample instance can be used for testing or training. The difference is confined to the method that does classification. We can make this depend on a purpose attribute, shown with a small square or a "-" as a prefix. Python doesn't have private variables, but this marker can be helpful as a design note. The public attributes can be shown with a small circle or a "+" as a prefix. When the purpose has a value of Training, the classify() method will raise an exception: the sample cannot be classified. When the purpose has a value of Testing, the classify() method will work normally, applying a given Hyperparameter to compute a species.
- An UnknownSample instance can be used for user classification. The classification method here does not depend on the value of the purpose attribute, and always performs classification.
Let's look at implementing these behaviors with the @property decorator. We can use a property to fetch computed values as if they were simple attributes. We can also use @property to define attributes that cannot be set.
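As a small illustration of both ideas (the Circle class here is hypothetical, not part of the case study), a property computes its value on access, and because no setter is defined, assignment fails:

```python
class Circle:
    def __init__(self, radius: float) -> None:
        self.radius = radius

    @property
    def area(self) -> float:
        # Computed on each access, used as if it were a simple attribute.
        return 3.14159 * self.radius ** 2

c = Circle(2.0)
print(c.area)  # 12.56636
try:
    c.area = 100.0  # no setter defined, so assignment raises AttributeError
except AttributeError:
    print("area cannot be set")
```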
We'll start by enumerating a domain of purpose values.
class Purpose(int, enum.Enum):
    Classification = 0
    Testing = 1
    Training = 2
This definition creates a namespace with three objects we can use in our code: Purpose.Classification, Purpose.Testing, and Purpose.Training. For example, we can use if sample.purpose == Purpose.Testing: to identify a testing sample. We can convert input values to Purpose objects using Purpose(x), where x is an integer value: zero, one, or two. Any other value will raise a ValueError exception. We can convert back to numeric values, also. For example, Purpose.Training.value is 2.
This use of numeric codes can fit well with external software that doesn't deal well with an enumeration of Python objects. This will be handy in Chapter 9 when we look closely at serialization techniques.
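A quick, self-contained sketch of these conversions (repeating the enum definition so it runs on its own):

```python
import enum

class Purpose(int, enum.Enum):
    Classification = 0
    Testing = 1
    Training = 2

assert Purpose(1) is Purpose.Testing   # decode an integer code to a member
assert Purpose.Training.value == 2     # encode back for external software
try:
    Purpose(42)                        # any other code raises ValueError
except ValueError:
    print("42 is not a valid Purpose")
```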
We'll decompose the KnownSample subclass of Sample into two parts. Here's the first part. We initialize a sample with the data required by Sample.__init__() plus two additional values.
class KnownSample(Sample):
    """Represents a sample of testing or training data; the species is set once.
    The purpose determines if it can or cannot be classified.
    """

    def __init__(
        self,
        sepal_length: float,
        sepal_width: float,
        petal_length: float,
        petal_width: float,
        purpose: int,
        species: str,
    ) -> None:
        purpose_enum = Purpose(purpose)
        if purpose_enum not in {Purpose.Training, Purpose.Testing}:
            raise ValueError(f"Invalid purpose: {purpose!r}: {purpose_enum}")
        super().__init__(
            sepal_length=sepal_length,
            sepal_width=sepal_width,
            petal_length=petal_length,
            petal_width=petal_width,
        )
        self.purpose = purpose_enum
        self.species = species
        self._classification: Optional[str] = None

    def matches(self) -> bool:
        return self.species == self.classification
We validate the purpose parameter's value to be sure it decodes to either Purpose.Training or Purpose.Testing. If the purpose value isn't one of the two allowed values, we'll get a ValueError exception. This is not a user-supplied value; if there's a problem with the value provided, it's a bug in the application.
We've created an instance variable, self._classification, with a leading _ in the name. This is a convention that suggests the name is not for general use by clients of this class. It's not "private," since there's no notion of privacy in Python. We could call it "concealed," or perhaps "none-of-your-business." Instead of a large, opaque wall, it's a low, decorative floral border that sets this variable apart from the others. You can march right through the floral _ character to look at it closely, but you probably shouldn't.
Here's the first @property method.
@property
def classification(self) -> Optional[str]:
    if self.purpose == Purpose.Testing:
        return self._classification
    else:
        raise AttributeError("Training samples have no classification")
This defines a method that will be visible as an attribute name. Here's an example of creating a sample for testing purposes:
>>> from model import KnownSample, Purpose
>>> s2 = KnownSample(
... sepal_length=5.1,
... sepal_width=3.5,
... petal_length=1.4,
... petal_width=0.2,
... species="Iris-setosa",
... purpose=Purpose.Testing.value)
>>> s2
KnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, purpose=1, species='Iris-setosa')
>>> s2.classification is None
True
When we evaluate s2.classification, Python calls the property's method. The method makes sure this is a sample to be used for testing, and returns the value of the "concealed" instance variable, self._classification.
If this is a Purpose.Training sample, the property raises an AttributeError exception, because any application that checks the value of the classification for a training sample has a bug in it that needs to be fixed.
How do we set the classification? Do we really use self._classification = h.classify(self)? The answer is no; we can create a property setter that updates the "concealed" instance variable. This is a bit more complex than the example above.
@classification.setter
def classification(self, value: str) -> None:
    if self.purpose == Purpose.Testing:
        self._classification = value
    else:
        raise AttributeError("Training samples cannot be classified")
The initial @property definition for classification is called a "getter": it gets the value of an attribute. (The implementation uses the __get__() method of a descriptor object that was created for us.) The @property definition for classification also creates two additional decorators, @classification.setter and @classification.deleter. The method decorated by the setter is used by assignment statements; the method decorated by the deleter is used by the del statement.
Note that the method names are both classification. This is the attribute name to be used.
Now a statement like s2.classification = h.classify(s2) can save the classification from a particular Hyperparameter object. This assignment statement uses the setter method to examine the purpose of this sample. If the purpose is Purpose.Testing, the value is saved. If the purpose is not Purpose.Testing, then attempting to set a classification raises an AttributeError exception, identifying a place where something's wrong in our application.
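Here's a compact, standalone sketch of the getter/setter pair in action. MiniSample is a stripped-down stand-in for KnownSample, keeping only the purpose attribute and the classification property:

```python
import enum
from typing import Optional

class Purpose(int, enum.Enum):
    Classification = 0
    Testing = 1
    Training = 2

class MiniSample:
    def __init__(self, purpose: Purpose) -> None:
        self.purpose = purpose
        self._classification: Optional[str] = None

    @property
    def classification(self) -> Optional[str]:
        if self.purpose == Purpose.Testing:
            return self._classification
        raise AttributeError("Training samples have no classification")

    @classification.setter
    def classification(self, value: str) -> None:
        if self.purpose == Purpose.Testing:
            self._classification = value
        else:
            raise AttributeError("Training samples cannot be classified")

test_sample = MiniSample(Purpose.Testing)
test_sample.classification = "Iris-setosa"  # the setter runs; value is saved
print(test_sample.classification)           # Iris-setosa

train_sample = MiniSample(Purpose.Training)
try:
    train_sample.classification = "Iris-setosa"
except AttributeError as ex:
    print(ex)                               # Training samples cannot be classified
```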
We have a number of if statements checking for specific Purpose values. This is a suggestion that this design is not optimal: the variant behavior is not encapsulated in a single class; instead, multiple behaviors are combined into one class. The presence of a Purpose enumeration and if statements to check for the enumerated values suggests we really have multiple classes. The "simplification" here isn't desirable.
In the Input Partitioning section of this case study, we suggested there were two paths forward. One was to try to simplify the classes by using the purpose attribute to separate testing from training data. This seems to have added if statements without simplifying the design. This means we'll have to search for a better partitioning algorithm.
There's one additional note in this case study. We've shown the with statement; now we need to look at it more closely.
Whenever we open a file (or a network connection) we create an entanglement with operating system resources outside the direct control of our Python application. For a small application that runs quickly, these entanglements are fleeting, and cleaned up nicely when our application exits.
For long-running web servers, however, these entanglements become very important. An application that does not clean up properly will "leak" resources. If each OS entanglement with an open file is left active, then -- eventually -- the server will run out of OS file handles. The bucket will be empty because it was slowly drained and never refilled. The next time a file-related operation is attempted, the application crashes.
But when it's restarted, it runs perfectly.
Depending on the nature of the leak, the application could run for weeks before running out of resources. Or, it could run for twenty minutes.
One way we make sure we release all of our external entanglements is by opening all files using a context manager.
with some_source_path.open() as source_file:
    ...  # process the file
The with statement guarantees that resources are released even when an exception is raised inside the with statement's body. While not strictly required for small applications, context managers should be considered an essential technique and used consistently. It's better to have a context manager and not need it than to have to debug a web application that crashes "randomly" because of a resource leak.
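A small demonstration of that guarantee, using a scratch temporary file (the file name and the simulated failure are, of course, arbitrary):

```python
import os
import tempfile

# Create an empty scratch file we can open.
handle, path = tempfile.mkstemp(suffix=".csv")
os.close(handle)

try:
    with open(path) as source_file:
        raise RuntimeError("simulated failure while processing")
except RuntimeError:
    pass

print(source_file.closed)  # True: the with statement closed the file anyway
os.unlink(path)
```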
We can define our own context managers. These classes need to have an __enter__() and an __exit__() method. A class defined with these methods can be used to create a context that can be cleaned up to remove entanglements with external resources.
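As a sketch of that shape (the Resource class here is illustrative, not from the case study), the two methods bracket the body of the with statement, and __exit__() runs even when the body raises:

```python
from types import TracebackType
from typing import List, Optional

class Resource:
    """Records acquire/release events so we can see __enter__() and __exit__() run."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def __enter__(self) -> "Resource":
        self.events.append("acquired")
        return self  # this is the value bound by "as"

    def __exit__(
        self,
        exc_type: Optional[type],
        exc_value: Optional[BaseException],
        traceback: Optional[TracebackType],
    ) -> Optional[bool]:
        self.events.append("released")
        return None  # returning a falsy value lets exceptions propagate

r = Resource()
try:
    with r as res:
        raise RuntimeError("boom")
except RuntimeError:
    pass
print(r.events)  # ['acquired', 'released']: cleanup ran despite the exception
```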