Case Study for Chapter 5, When to Use Object-Oriented Programming

We want to explore some additional features of object-oriented design in Python. The first is what is sometimes called "syntactic sugar": a handy notation that offers a simpler way to express something fairly complex. The second is the concept of a context manager for resource management.

In Chapter Four, we built the @authenticate decorator. This decorator was used on Flask view functions to check the Authorization HTTP header to be sure the request included proper credentials. (We also added a self-signed certificate so the web service would use the HTTPS protocol with a secured connection to prevent exposing credentials.)

Input Validation

The TrainingData object is loaded from a file. Currently, we don't make a large effort to validate the contents of the file.

There are two common ways a file with training data can be processed in our application:

This case study will focus on the load() method of training data. Currently, the file must be CSV-formatted data. In Chapter Nine, we'll look at alternative formats.

Thinking about alternative formats suggests the TrainingData class should not depend on the Dict[str, str] row definition that is part of CSV file processing. While this approach is simple, it pushes some details into the TrainingData class that don't really belong there. Specifically, any detail of the source document's representation has nothing to do with managing a collection of training and test samples.

In order to support multiple sources of data, we will need some variant rules for validating the input values.

We'll need a class like this:

import csv
from pathlib import Path
from typing import Iterator

class SampleReader:
    """See iris.names for attribute ordering in bezdekIris.data file"""

    header = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

    def __init__(self, source: Path) -> None:
        self.source = source

    def sample_iter(self) -> Iterator[Sample]:
        with self.source.open() as source_file:
            reader = csv.DictReader(source_file, self.header)
            for row in reader:
                yield Sample(
                    sepal_length=float(row["sepal_length"]),
                    sepal_width=float(row["sepal_width"]),
                    petal_length=float(row["petal_length"]),
                    petal_width=float(row["petal_width"]),
                )

This builds an instance of the Sample superclass from the input fields read by a CSV DictReader instance.

The sample_iter() method uses a series of conversion expressions to translate input data into useful Python objects. In this example, the conversions are simple, and the implementation is a bunch of float() function calls to convert CSV string data to Python objects.

This isn't ideal, because any failure to process the input can look like a bug in the software. The ValueError exception is used widely in Python.

A unique exception is a better way to disentangle our application's errors from ordinary bugs in our Python code.

class BadSampleRow(ValueError):
    pass

The float() functions -- when confronted with bad data -- will raise a ValueError. This is good. However, a bug in a distance formula may also raise a ValueError, leading to possible confusion. We want our application to produce unique exceptions to make it easy to identify the root cause of a problem.

Second, we want to map the various float() problems to our application's exception. This is a change to the sample_iter() method.

    def sample_iter(self) -> Iterator[Sample]:
        with self.source.open() as source_file:
            reader = csv.DictReader(source_file, self.header)
            for row in reader:
                try:
                    sample = Sample(
                        sepal_length=float(row["sepal_length"]),
                        sepal_width=float(row["sepal_width"]),
                        petal_length=float(row["petal_length"]),
                        petal_width=float(row["petal_width"]),
                    )
                except ValueError as ex:
                    raise BadSampleRow(f"Invalid {row!r}") from ex
                yield sample

The creation of the target class is wrapped in a try: statement. Any ValueError that's raised will become a BadSampleRow exception. This will include the source data so we can pinpoint the problem. We've used a raise...from... so that the original exception is preserved to help with the debugging.
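To see how raise...from preserves the original exception, here's a minimal, self-contained sketch. The to_float() helper is hypothetical, standing in for the conversions inside sample_iter(); the wrapped ValueError remains available as the __cause__ attribute of the new exception.

```python
class BadSampleRow(ValueError):
    """Application-specific error for invalid source rows."""

def to_float(row: dict[str, str], name: str) -> float:
    """Hypothetical helper: convert one field, mapping ValueError
    to the application's BadSampleRow exception."""
    try:
        return float(row[name])
    except ValueError as ex:
        raise BadSampleRow(f"Invalid {row!r}") from ex

bad_row = {"sepal_length": "5.1?"}
try:
    to_float(bad_row, "sepal_length")
except BadSampleRow as ex:
    caught = ex

# The original ValueError is preserved for debugging.
print(type(caught).__name__, "caused by", type(caught.__cause__).__name__)
```

A debugger or traceback display will show both exceptions, which makes the root cause easy to locate.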

Once we have valid input, we have to decide whether the object should be used for training or testing. We'll turn to that problem next.

Input Partitioning

The SampleReader class -- while simple -- has a problem. It isn't easily extended to cover the two KnownSample subclasses, TrainingKnownSample and TestingKnownSample. The class name used to create objects is part of the sample_iter() method, and difficult to change.

We could create subclasses of SampleReader to create the two subclasses of Sample. The bulk of the subclass implementations of the sample_iter() method would be a copy of the superclass method.

It's not a good idea to copy and paste the body of the sample_iter() method with one small change to the class reference. With two nearly identical methods, a change to one subclass can easily fail to be applied to the copy in the other subclass. Inconsistent changes are an example of "code rot."

A better approach is to extract the class name and make it a class level variable, like this:

class SampleReader:
    """See iris.names for attribute ordering in bezdekIris.data file"""

    target_class = KnownSample
    header = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

    def __init__(self, source: Path) -> None:
        self.source = source

    def sample_iter(self) -> Iterator[Sample]:
        target_class = self.target_class
        with self.source.open() as source_file:
            reader = csv.DictReader(source_file, self.header)
            for row in reader:
                try:
                    sample = target_class(
                        sepal_length=float(row["sepal_length"]),
                        sepal_width=float(row["sepal_width"]),
                        petal_length=float(row["petal_length"]),
                        petal_width=float(row["petal_width"]),
                        species=row["class"],
                    )
                except ValueError as ex:
                    raise BadSampleRow(f"Invalid {row!r}") from ex
                yield sample

This is pleasantly Pythonic. We can create subclasses of this reader to create different kinds of samples from the raw data.
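As a sketch of the technique, here's a stripped-down reader with hypothetical stand-in classes (not the case study's real TrainingKnownSample and TestingKnownSample); only the class-level target_class variable changes between subclasses.

```python
class Sample:
    """Simplified stand-in for the case study's Sample class."""
    def __init__(self, **features: float) -> None:
        self.features = features

class TrainingSample(Sample):
    pass

class TestingSample(Sample):
    pass

class SampleReader:
    """The target class is a class-level variable, not a hard-coded name."""
    target_class = Sample

    def build(self, row: dict[str, str]) -> Sample:
        # target_class is looked up on the subclass, so each subclass
        # produces its own kind of sample from the same row processing.
        return self.target_class(**{k: float(v) for k, v in row.items()})

class TrainingSampleReader(SampleReader):
    target_class = TrainingSample

class TestingSampleReader(SampleReader):
    target_class = TestingSample

row = {"sepal_length": "5.1", "sepal_width": "3.5"}
train_sample = TrainingSampleReader().build(row)
test_sample = TestingSampleReader().build(row)
print(type(train_sample).__name__, type(test_sample).__name__)
```

The row-processing logic lives in exactly one place; each subclass is a two-line declaration.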

This class exposes another problem. A single source of raw sample data needs to be partitioned into two separate subclasses of KnownSample. Ideally, a single reader would emit a mixture of the two classes.

We have two paths forward to provide the needed functionality:

Simplification seems to be a good idea. The second alternative suggests we can separate three distinct aspects of a sample.

This is a profound change to the design created so far. Early in a project, this kind of change can be necessary. Way back in Chapters One and Two, we decided to create a fairly sophisticated class hierarchy for various kinds of samples. It's time to revisit that design.

The Sample Class Hierarchy

We can rethink our earlier designs from several points of view. One alternative is to separate the essential Sample class from the additional features. There seem to be four additional features.

              Known          Unknown
Unclassified  Training data  Sample waiting to be classified
Classified    Testing data   Classified sample

We've omitted a detail from the Classified row. Each time we do a classification, a specific Hyperparameter is associated with the classification. It's not simply a "Classified Sample"; it would be more accurate to say it's a sample classified by a specific hyperparameter.

The distinction between the two cells in the Unknown column is minute. The distinction is so minor as to be essentially irrelevant to most processing. An Unknown Sample will be waiting to be classified for -- at most -- a few lines of code.

If we rethink this, we may be able to create fewer classes and still reflect the object state and behavior changes correctly.

There can be two subclasses of Sample with a separate Classification object. Here's a diagram.

[UML diagram: two subclasses of Sample, with a separate Classification object]

We've refined the class hierarchy to reflect two essentially different kinds of samples:

Let's look at implementing these behaviors with the @property decorator. We can use the property to fetch computed values as if they were simple attributes. We can also use the @property to define attributes which cannot be set.

The Purpose Enumeration

We'll start by enumerating a domain of purpose values.

class Purpose(int, enum.Enum):
    Classification = 0
    Testing = 1
    Training = 2

This definition creates a namespace with three objects we can use in our code: Purpose.Classification, Purpose.Testing and Purpose.Training. For example, we can use if sample.purpose == Purpose.Testing: to identify a testing sample.

We can convert to Purpose objects from input values using Purpose(x), where x is an integer value: zero, one, or two. Any other value will raise a ValueError exception. We can also convert back to numeric values. For example, Purpose.Training.value is 2.
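These conversions can be verified with a short, self-contained sketch of the enumeration:

```python
import enum

class Purpose(int, enum.Enum):
    Classification = 0
    Testing = 1
    Training = 2

# Decode an external integer code into an enum member.
decoded = Purpose(1)
print(decoded is Purpose.Testing)   # True

# Convert back to the numeric code.
print(Purpose.Training.value)       # 2

# Any value outside 0, 1, 2 is rejected with a ValueError.
try:
    Purpose(3)
    rejected = False
except ValueError:
    rejected = True
print(rejected)                     # True
```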

This use of numeric codes can fit well with external software that doesn't deal well with an enumeration of Python objects. This will be handy in Chapter Nine when we look closely at serialization techniques.

We'll decompose the KnownSample subclass of Sample into two parts. Here's the first part. We initialize a sample with the data required by Sample.__init__() plus two additional values.

class KnownSample(Sample):
    """Represents a sample of testing or training data, the species is set once
    The purpose determines if it can or cannot be classified.
    """

    def __init__(
        self,
        sepal_length: float,
        sepal_width: float,
        petal_length: float,
        petal_width: float,
        purpose: int,
        species: str,
    ) -> None:
        purpose_enum = Purpose(purpose)
        if purpose_enum not in {Purpose.Training, Purpose.Testing}:
            raise ValueError(f"Invalid purpose: {purpose!r}: {purpose_enum}")
        super().__init__(
            sepal_length=sepal_length,
            sepal_width=sepal_width,
            petal_length=petal_length,
            petal_width=petal_width,
        )
        self.purpose = purpose_enum
        self.species = species
        self._classification: Optional[str] = None

    def matches(self) -> bool:
        return self.species == self.classification

We validate the purpose parameter's value, to be sure it decodes to either Purpose.Training or Purpose.Testing. If the purpose value isn't one of the two allowed values, we'll get a ValueError exception. This is not a user-supplied value; if there's a problem with the value provided, it's a bug in the application.

We've created an instance variable, self._classification, with a leading _ in the name. This is a convention that suggests the name is not for general use by clients of this class. It's not "private", since there's no notion of privacy in Python. We could call it "concealed" or perhaps "none-of-your-business."

Instead of a large, opaque wall, it's a low, decorative floral border that sets this variable apart from the others. You can march right through the floral _ character to look at it closely, but you probably shouldn't.

Here's the first @property method.

    @property
    def classification(self) -> Optional[str]:
        if self.purpose == Purpose.Testing:
            return self._classification
        else:
            raise AttributeError(f"Training samples have no classification")

This defines a method that will be visible as an attribute name. Here's an example of creating a sample for testing purposes:

>>> from model import KnownSample, Purpose
>>> s2 = KnownSample(
...     sepal_length=5.1, 
...     sepal_width=3.5, 
...     petal_length=1.4, 
...     petal_width=0.2, 
...     species="Iris-setosa", 
...     purpose=Purpose.Testing.value)
>>> s2
KnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, purpose=1, species='Iris-setosa')
>>> s2.classification is None
True

When we evaluate s2.classification, this will call the getter method. The method makes sure this is a sample to be used for testing, and returns the value of the "concealed" instance variable self._classification.

If this is a Purpose.Training sample, it raises an AttributeError exception because any application that checks the value of the classification for a training sample has a bug in it that needs to be fixed.

Property Setters

How do we set the classification? Do we really use self._classification = h.classify(self)? The answer is no; we can create a setter property that updates the "concealed" instance variable. This is a bit more complex than the getter example above.


    @classification.setter
    def classification(self, value: str) -> None:
        if self.purpose == Purpose.Testing:
            self._classification = value
        else:
            raise AttributeError(f"Training samples cannot be classified")

The initial @property definition for classification is called a "getter". It gets the value of an attribute. (The implementation uses the __get__() method of a descriptor object that was created for us.) The @property definition for classification also created two additional decorators, @classification.setter and @classification.deleter. The method decorated by the setter is used by assignment statements. The method decorated by deleter is used by the del statement.

Note that the method names are both classification. This is the attribute name to be used.

Now a statement like s2.classification = h.classify(s2) can save the classification from a particular Hyperparameter object. This assignment statement will use the setter method to examine the purpose of this sample. If the purpose is testing, the value will be saved. If the purpose is not Purpose.Testing, then attempting to set a classification raises an AttributeError exception, and identifies a place where something's wrong in our application.
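Here's a minimal, self-contained sketch of the getter/setter pair in action, using a simplified KnownSample that keeps only the attributes this example needs:

```python
import enum
from typing import Optional

class Purpose(int, enum.Enum):
    Classification = 0
    Testing = 1
    Training = 2

class KnownSample:
    """Simplified sketch: only purpose, species, and classification."""
    def __init__(self, species: str, purpose: Purpose) -> None:
        self.species = species
        self.purpose = purpose
        self._classification: Optional[str] = None

    @property
    def classification(self) -> Optional[str]:
        if self.purpose == Purpose.Testing:
            return self._classification
        raise AttributeError("Training samples have no classification")

    @classification.setter
    def classification(self, value: str) -> None:
        if self.purpose == Purpose.Testing:
            self._classification = value
        else:
            raise AttributeError("Training samples cannot be classified")

test = KnownSample("Iris-setosa", Purpose.Testing)
test.classification = "Iris-setosa"   # assignment goes through the setter
print(test.classification)            # Iris-setosa

train = KnownSample("Iris-setosa", Purpose.Training)
blocked = False
try:
    train.classification = "Iris-setosa"
except AttributeError:
    blocked = True
print(blocked)                        # True
```

The assignment syntax is unchanged from a plain attribute; the validation is invisible to the client code.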

Repeated If Statements

We have a number of if-statements checking for specific Purpose values. This is a suggestion that the design is not optimal. The variant behavior is not encapsulated in a single class; instead, multiple behaviors are combined into one class.

The presence of a Purpose enumeration and if-statements to check for the enumerated values is a suggestion that we have multiple classes. The "simplification" here isn't desirable.

In the Input Partitioning section of this case study, we suggested there were two paths forward. One was to try to simplify the classes by setting the purpose attribute to separate testing from training data. This seems to have added if-statements, and didn't simplify the design.

This means we'll have to search for a better partitioning algorithm. There's one additional note in this case study: we've shown the with statement in passing, and we need to look at it more closely.

Context Managers

Whenever we open a file (or a network connection) we create an entanglement with operating system resources outside the direct control of our Python application. For a small application that runs quickly, these entanglements are fleeting, and cleaned up nicely when our application exits.

For long-running web servers, however, these entanglements become very important. An application that does not clean up properly will "leak" resources. If each OS entanglement with an open file is left active, then -- eventually -- the server will run out of OS file handles. The bucket will be empty because it was slowly drained and never refilled. The next time a file-related operation is attempted, the application crashes.

But when it's restarted, it runs perfectly.

Depending on the nature of the leak, the application could run for weeks before running out of resources. Or, it could run for twenty minutes.

One way we make sure we release all of our external entanglements is by opening all files using a context manager.


    with some_source_path.open() as source_file:
        ...  # process the file

The with statement guarantees that resources are released even when an exception is raised inside the with statement. While not required for small applications, context managers should be considered an essential technique, and always used. It's better to have a context manager and not need it than to have to debug a web application that crashes "randomly" because of a resource leak.
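A small, self-contained sketch demonstrates the guarantee: even when a ValueError escapes the with statement, the file object has been closed. (The scratch file here is created just for the example.)

```python
import tempfile
from pathlib import Path

# Create a scratch file so the example is self-contained.
scratch = Path(tempfile.mkdtemp()) / "data.csv"
scratch.write_text("5.1,3.5\nbad,row\n")

source_file = None
try:
    with scratch.open() as source_file:
        for line in source_file:
            float(line.split(",")[0])  # raises ValueError on the "bad" row
except ValueError:
    pass

# The exception escaped the with statement, but the file was still closed.
print(source_file.closed)  # True
```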

We can define our own context managers. These classes need to have an __enter__() and an __exit__() method. A class defined with these methods can be used to create a context that can be cleaned up to remove entanglements with external resources.
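Here's a minimal sketch of such a class. The Tracker resource is hypothetical, standing in for a real external entanglement; it records that __exit__() runs even when an exception is raised inside the with block.

```python
class Tracker:
    """Hypothetical context manager that records acquire/release events,
    standing in for a real external resource."""
    def __init__(self) -> None:
        self.events: list[str] = []

    def __enter__(self) -> "Tracker":
        self.events.append("acquired")
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Runs on normal exit and when an exception is raised.
        # Returning None lets any exception propagate.
        self.events.append("released")

tracker = Tracker()
try:
    with tracker:
        raise RuntimeError("something failed inside the block")
except RuntimeError:
    pass

print(tracker.events)  # ['acquired', 'released']
```

Because __exit__() returns None, the RuntimeError still propagates to the caller; the cleanup happens regardless.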