Case Study for Chapter 3, When Objects Are Alike

This section expands on the object-oriented design of a realistic example. We'll start with diagrams created using the Unified Modeling Language (UML) to help depict and summarize the software we're going to build.

We'll describe the various considerations that are part of the Python implementation of the class definitions. We'll start with a review of the diagrams that describe the classes to be defined.

There are a number of design principles that we'll be exploring as this design becomes more and more complete. One popular set of principles is the SOLID principles, which are:

- S. Single Responsibility
- O. Open/Closed
- L. Liskov Substitution
- I. Interface Segregation
- D. Dependency Inversion

We won't look at all of these in this chapter. We'll look, primarily, at Liskov Substitution, and talk a bit about Dependency Inversion.

Logical View

Here's an overview of some of the classes we need to build.

uml diagram

In the previous chapter, we avoided delving into the classification algorithm. Indeed, since Chapter One, we've been avoiding details of the k-NN algorithm.

This reflects a common design strategy, sometimes called "Hard Part, Do Later", also called "Do The Easy Part First." This strategy encourages following common design patterns where possible to isolate the hard part. In effect, the easy parts define a number of fences that enclose and constrain the novel and unknown parts.

The k-NN classifier locates the k nearest neighbors. This forces us to answer the question "What is nearest?"

In a conventional, two-dimensional geometric sense, we can use the "Euclidean" distance between samples. Given an Unknown sample located at \((u_x, u_y)\) and a Known sample at \((k_x, k_y)\), the Euclidean distance between these samples, \(ED(k, u)\), is

$$ ED(k, u) = \sqrt{ (k_x-u_x)^2 + (k_y-u_y)^2 } $$

We can visualize it like this.

Euclidean

In our case, we have four dimensions: sepal length, sepal width, petal length, and petal width. We could write this out, fully, as

$$ ED(k, u) = \sqrt{ (k_{sl}-u_{sl})^2 + (k_{sw}-u_{sw})^2 + (k_{pl}-u_{pl})^2 + (k_{pw}-u_{pw})^2} $$
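
As a worked example, using the same two samples that appear in the doctests later in this section, a known sample at (5.1, 3.5, 1.4, 0.2) and an unknown sample at (7.9, 3.2, 4.7, 1.4), the computation looks like this:

$$ ED(k, u) = \sqrt{ 2.8^2 + 0.3^2 + 3.3^2 + 1.2^2 } = \sqrt{20.26} \approx 4.50 $$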

This distance is one of many alternative definitions of distance between a known and unknown sample.

There are three other ways to compute a distance that are similar, and often produce consistently good results without the complexity of a square root.

Manhattan distance is the total number of steps along the x-axis, plus the total number of steps along the y-axis.

$$ MD(k, u) = \lvert k_x - u_x \rvert + \lvert k_y - u_y \rvert $$

This isn't too far off from the direct, Euclidean distance. It can be as much as 41% larger, but it will still parallel the direct distance in a way that can yield a good k-NN result with a faster computation.
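
The 41% figure comes from the worst case, where the two axis differences are equal. If both differences have some value \(d\), then:

$$ \frac{MD(k, u)}{ED(k, u)} = \frac{d + d}{\sqrt{d^2 + d^2}} = \frac{2d}{d\sqrt{2}} = \sqrt{2} \approx 1.41 $$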

Here's a view of the Manhattan distance.

Manhattan

The Chebyshev distance is the largest of the x or y distances.

$$ CD(k, u) = \textbf{max}(\lvert k_x - u_x \rvert, \lvert k_y - u_y \rvert) $$

Here's a view of the Chebyshev distance.

Chebyshev

The Minkowski distance generalizes both of these: each absolute difference is raised to a power, m, and the m-th root is taken of the sum. When m is 1, this is the Manhattan distance. When m is 2, this is the Euclidean distance.

$$ MD(k, u) = \Bigl( {\lvert k_x - u_x \rvert}^m + {\lvert k_y - u_y \rvert}^m \Bigr)^{1/m} $$

To an extent, we can think of the choice between Manhattan and Euclidean distances as a simple parameter, m. If m is 1, this is the Manhattan distance; if m is 2, this is the Euclidean distance. For Manhattan and Euclidean distances, the differences are reduced using a sum() operation. For the Chebyshev distance, the differences are reduced by a max() operation.
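
To make the two reductions concrete, here's a small sketch (not part of the final design) that applies each of them to the same vector of absolute differences; the values come from the two samples used in the doctests later in this section.

# Absolute differences |k_i - u_i| between the known and unknown samples.
diffs = [2.8, 0.3, 3.3, 1.2]

manhattan = sum(diffs)                           # about 7.6  (m = 1, sum() reduction)
euclidean = sum(d ** 2 for d in diffs) ** 0.5    # about 4.50 (m = 2, sum() reduction)
chebyshev = max(diffs)                           # 3.3        (max() reduction)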

These three can be lumped together as a single kind of distance metric. There are at least seven other kinds of metrics.

See Effects of Distance Measure Choice on KNN Classifier Performance - A Review. This paper surveys 54 distinct metric computations. The three we're looking at are collectively identified as "Minkowski" measures because they're similar and measure each axis equally.

How can we have all of these different distance computations available? Does this mean we have as many as 54 subclasses of the Hyperparameter class?

The Strategy Design Pattern

We have to decompose the Hyperparameter class into two parts:

- The Hyperparameter class itself, which holds the k value and a reference to a distance computation object.
- A separate hierarchy of Distance classes, each of which implements one of the distance computations.

When the Hyperparameter class requires another object to provide one (or more) methods, this is called the Strategy design pattern. For a given Hyperparameter object, h, the h.distance object has a distance() method that does the work of computing a distance. We can plug in any of the subclasses of Distance to do this work.

This means the Hyperparameter class classify() method will use the strategy's self.distance.distance() to compute the distances. We can use this to provide alternative distance objects as well as alternative k values to find a combination that provides the best quality classification of unknown samples.
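
To show how a strategy object gets used, here's a minimal sketch of what the classify() method could look like. This is not the final implementation from the case study; it assumes the known samples are available as a plain list, self.data, and that each known sample has a species attribute.

from collections import Counter
from typing import List

class Hyperparameter:
    def __init__(self, k: int, algorithm: "Distance", data: List["Sample"]) -> None:
        self.k = k
        self.distance = algorithm  # the plugged-in Strategy object
        self.data = data

    def classify(self, unknown: "Sample") -> str:
        """Find the k nearest known samples; return the most common species among them."""
        distances = sorted(
            (self.distance.distance(known, unknown), known.species)
            for known in self.data
        )
        k_nearest = [species for _, species in distances[: self.k]]
        frequency = Counter(k_nearest)
        best_fit, _votes = frequency.most_common(1)[0]
        return best_fit

Because classify() relies only on self.distance.distance(), any of the Distance subclasses can be plugged in without changing this method.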

We can summarize the relationships like this.

uml diagram

We've focused on a few of the classes: the Hyperparameter class and the Distance class with its subclasses.

We'll need to test these definitions. Chapter Thirteen covers testing in detail. We'll provide some test cases using the doctest tool. These can be incorporated into the class docstrings.

Here's an example, using the Chebyshev computation.


class Distance:
    """Abstact definition of a distance computation"""

    def distance(self, s1: Sample, s2: Sample) -> float:
        raise NotImplementedError

class Chebyshev(Distance):
    """
    Computes the Chebyshev distance between two samples.

    ::

        >>> from math import isclose
        >>> from model import TrainingKnownSample, UnknownSample, Chebyshev

        >>> s1 = TrainingKnownSample(
        ...     sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
        >>> u = UnknownSample(**{"sepal_length": 7.9, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4})

        >>> algorithm = Chebyshev()
        >>> isclose(3.3, algorithm.distance(s1, u))
        True

    """

    def distance(self, s1: Sample, s2: Sample) -> float:
        return max(
            [
                abs(s1.sepal_length - s2.sepal_length),
                abs(s1.sepal_width - s2.sepal_width),
                abs(s1.petal_length - s2.petal_length),
                abs(s1.petal_width - s2.petal_width),
            ]
        )

The docstring is both documentation and a test case: it contains the example inputs and the expected results. We strongly encourage its use.

We'll turn to the slightly more complex hierarchy of three classes to do the Minkowski distance computations.

The Minkowski Solution

The two Minkowski algorithms are nearly identical but for a single parameter. In this case, the parameter is a feature of the class, and is used in the distance() method. Here's how we can define the Minkowski superclass, with a placeholder for the m attribute.


class Minkowski(Distance):
    """An abstraction to provide a way to implement Manhattan and Euclidean."""

    m: int

    def distance(self, s1: Sample, s2: Sample) -> float:
        return (
            sum(
                [
                    abs(s1.sepal_length - s2.sepal_length) ** self.m,
                    abs(s1.sepal_width - s2.sepal_width) ** self.m,
                    abs(s1.petal_length - s2.petal_length) ** self.m,
                    abs(s1.petal_width - s2.petal_width) ** self.m,
                ]
            )
            ** (1 / self.m)
        )

We've provided a class-level type hint, m: int, and nothing more. We can't create a usable instance of this class because if we try to evaluate the distance() method, there won't be a value for self.m.

The type hint does not create an instance variable. It doesn't provide a default value. It's only a hint to mypy that will be fulfilled by subclass definitions.
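
We can see the consequence of this interactively; the annotated name simply doesn't exist on an instance of the superclass:

>>> algorithm = Minkowski()
>>> algorithm.m
Traceback (most recent call last):
...
AttributeError: 'Minkowski' object has no attribute 'm'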

Here are two subclasses that provide specific distance computations.


class Euclidean(Minkowski):
    m = 2

class Manhattan(Minkowski):
    m = 1

The distance() method is inherited from the superclass. The m value is provided by the subclass. No more code is needed.

It would, however, be smart to include a docstring with an explanation and a test case. The core test cases look like this:


>>> from math import isclose
>>> from model import TrainingKnownSample, UnknownSample, Euclidean

>>> s1 = TrainingKnownSample(
...     sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
>>> u = UnknownSample(**{"sepal_length": 7.9, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4})

>>> algorithm = Euclidean()
>>> isclose(4.50111097, algorithm.distance(s1, u))
True

This requires the isclose() function from the math module. This function avoids checking too many digits of floating-point results. Python runs on a number of platforms. On some, the last digit of a floating-point approximation may differ, leading to a test failure over an error of about 1 part in \(10^{15}\): fifteen digits past the decimal point.

The Manhattan distance for the same two samples is close to 7.6. We'll leave it to the reader to work out the doctest example for this class.

One More Example

Just to make it clear how easy it is to add subclasses, we'll define a somewhat more complex distance metric. This is the Sorensen distance, also known as Bray-Curtis.

$$ SD(k, u) = \frac{\lvert k_x - u_x \rvert + \lvert k_y - u_y \rvert }{(k_x + u_x) + (k_y + u_y)} $$

We've effectively standardized each component of the Manhattan distance by dividing by the possible range of values.

Here's a diagram:

Sorensen

The simple Manhattan distance applies no matter how far from the origin we are. The Sorensen distance reduces the importance of measures that are further from the origin so they don't dominate the k-NN by virtue of being large-valued outliers.

We can introduce this into our design by adding a new subclass of Distance. While this is similar, in some ways, to the Manhattan distance, it's often classified separately.


class Sorensen(Distance):
    def distance(self, s1: Sample, s2: Sample) -> float:
        return sum(
            [
                abs(s1.sepal_length - s2.sepal_length),
                abs(s1.sepal_width - s2.sepal_width),
                abs(s1.petal_length - s2.petal_length),
                abs(s1.petal_width - s2.petal_width),
            ]
        ) / sum(
            [
                s1.sepal_length + s2.sepal_length,
                s1.sepal_width + s2.sepal_width,
                s1.petal_length + s2.petal_length,
                s1.petal_width + s2.petal_width,
            ]
        )

We don't try to leverage the Minkowski class. There may be a way to fold this into that hierarchy, but it seems simpler to provide a clear, standalone definition that shows the absolute differences weighted by the sum of the attribute values.
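
As with the other distance classes, a doctest helps confirm the implementation. Assuming the Sorensen class lives in the same model module as the others, and reusing the same two samples, the core test case might look like this:

>>> from math import isclose
>>> from model import TrainingKnownSample, UnknownSample, Sorensen

>>> s1 = TrainingKnownSample(
...     sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
>>> u = UnknownSample(**{"sepal_length": 7.9, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4})

>>> algorithm = Sorensen()
>>> isclose(0.2773722627737226, algorithm.distance(s1, u))
True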

Another Strategy

We've kept the Chebyshev computation separate from the other Minkowski computations. They're not really very different, though. The essential difference between Chebyshev and the Minkowski algorithm is the use of max() vs. sum() to reduce a vector of differences to a single value.

We can still exploit the strategy design pattern in Python. A Python function is an object; we can make the function an attribute value of a subclass of Minkowski.

We want to be able to write code like this:


>>> from model import TrainingKnownSample, UnknownSample, Minkowski_2

>>> class CD(Minkowski_2):
...     m = 1
...     reduction = max

>>> class MD(Minkowski_2):
...     m = 1
...     reduction = sum

>>> class ED(Minkowski_2):
...     m = 2
...     reduction = sum

We've extended a Minkowski_2 class with two class-level attributes. The m attribute provides the power to use and the root to take. The reduction attribute provides the function used to perform the reduction: either the sum() function or the max() function.

The Minkowski_2 class is very similar to the Minkowski class, shown earlier in this case study. The class extends the abstract Distance class. It introduces a reduction class-level attribute.


class Minkowski_2(Distance):
    """
    A generic way to implement Manhattan, Euclidean, and Chebyshev.
    """

    m: int
    reduction: Reduce_Function

    def distance(self, s1: Sample, s2: Sample) -> float:
        # Required to prevent Python from passing `self` as the first argument.
        summarize = self.reduction
        return (
            summarize(
                [
                    abs(s1.sepal_length - s2.sepal_length) ** self.m,
                    abs(s1.sepal_width - s2.sepal_width) ** self.m,
                    abs(s1.petal_length - s2.petal_length) ** self.m,
                    abs(s1.petal_width - s2.petal_width) ** self.m,
                ]
            )
            ** (1 / self.m)
        )

We have to use an assignment statement to "fool" Python into using this function correctly. When we write self.reduction(...), the rule is that the first argument value provided to the method must be the instance, self. When we write reduction(...), we're calling a function outside the class, and no special argument will be automatically included.

There's nothing magical about the variable self. The magic applies to the first variable for methods inside a class. It's generally spelled self so we can all recognize it as being special. When we assign summarize = self.reduction, we can then use summarize() without introducing the self variable into the arguments.
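
With these definitions in place, the three subclasses shown earlier should reproduce the values we've already seen in the doctests; here's a possible set of checks, reusing the s1 and u samples and the isclose() function from the earlier examples:

>>> isclose(3.3, CD().distance(s1, u))
True
>>> isclose(7.6, MD().distance(s1, u))
True
>>> isclose(4.50111097, ED().distance(s1, u))
True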

Type Hints for Minkowski_2

While the typing module offers a Callable type hint for functions, this doesn't work out well for functions that are attributes of a class. An attribute that's a Callable looks a lot like an ordinary method of the class. It's difficult to distinguish between the two cases without providing some details to mypy.

To make it clear, we've introduced a Reduce_Function type definition. This is an example of a relatively new kind of type hint, a Protocol. In this case, we use the formal (and wordy) protocol definition instead of a terse (and ambiguous) Callable definition.


from typing import List, Protocol

class Reduce_Function(Protocol):
    """Define a callable object with specific parameters."""

    def __call__(self, values: List[float]) -> float:
        pass

An instance of the Reduce_Function class is a "callable object". This class definition declares our intent behind the class-level reduction attribute in the Minkowski_2 definition.

The __call__ special method is how we build callable objects. Here's a non-Protocol callable object definition. This class can be instantiated to create callables.


class SumPow:
    def __call__(self, sequence: List[float], p: float=1.0) -> float:
        return sum(x**p for x in sequence)

sum_pow = SumPow()

The sum_pow object is a callable object that can be used like a function, sum_pow([2.5, 2.5], 2) == 12.5. The SumPow class doesn't have an __init__() method because it doesn't have any internal state; it's a very thin wrapper around the __call__() method.

Callable objects can have other methods. The method named __call__() is used implicitly when we evaluate a callable object with a list of argument values in (). We can think of sum_pow(s, p) as if it were SumPow.__call__(sum_pow, s, p): we've provided the instance variable, self, and the two argument values to the function for evaluation.
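
Here's how that evaluation looks in practice:

>>> sum_pow = SumPow()
>>> sum_pow([2.5, 2.5], 2)
12.5
>>> sum_pow([1.0, 2.0, 3.0])
6.0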

Using typing.Protocol lets us provide a class definition as a type hint. This provides some flexibility for the edge case of defining an attribute that is almost like a method. We'll return to these concepts in Chapter Six, where we'll look at additional special methods we can use.