This section expands on the object-oriented design of a realistic example. We'll start with the diagrams created using the Unified Modeling Language (UML) to help depict and summarize the software we're going to build.
We'll then describe the various considerations that are part of the Python implementation of the class definitions, beginning with a review of the diagrams that describe the classes to be defined.
There are a number of design principles that we'll be exploring as this design becomes more complete. One popular set of principles is SOLID, which stands for the following:
S. Single Responsibility Principle. A class should have one responsibility. This can mean one reason to change when the application's requirements change.
O. Open/Closed. A class should be open to extension but closed to modification.
L. Liskov Substitution. (Named after Barbara Liskov, who created one of the first object-oriented programming languages, CLU.) Any subclass can be substituted for its superclass. This tends to focus a class hierarchy on classes which have very similar interfaces, leading to polymorphism among the objects.
I. Interface Segregation. A class should have the smallest interface possible. This is, perhaps, the most important of these principles. Classes should be relatively small and isolated.
D. Dependency Inversion. This has a peculiar name. We need to know what a bad dependency relationship is so we know how to invert it to have a good relationship. Pragmatically, we'd like classes to be independent, so a Liskov Substitution doesn't involve a lot of code changes. In Python, this often means referring to superclasses in type hints to be sure we have the flexibility to make changes. In some cases, it also means providing parameters to allow making global class changes without revising any of the code.
We won't look at all of these in this chapter. We'll look, primarily, at Liskov Substitution, and talk a bit about Dependency Inversion.
Here's the overview of some of the classes we need to build.
In the previous chapter, we avoided delving into the classification algorithm. Indeed, since Chapter One, we've been avoiding details of the k-NN algorithm.
This reflects a common design strategy, sometimes called "Hard Part, Do Later", also called "Do The Easy Part First." This strategy encourages following common design patterns where possible to isolate the hard part. In effect, the easy parts define a number of fences that enclose and constrain the novel and unknown parts.
The k-NN classifier locates the k nearest neighbors. This forces us to answer the question "What is nearest?"
In a conventional, two-dimensional geometric sense, we can use the "Euclidean" distance between samples. Given an Unknown sample located at \((u_x, u_y)\) and a Known sample at \((k_x, k_y)\), the Euclidean distance between these samples, \(ED(k, u)\), is
$$ ED(k, u) = \sqrt{ (k_x-u_x)^2 + (k_y-u_y)^2 } $$
We can visualize it like this.
In our case, we have four dimensions: sepal length, sepal width, petal length, and petal width. We could write this out, fully, as
$$ ED(k, u) = \sqrt{ (k_{sl}-u_{sl})^2 + (k_{sw}-u_{sw})^2 + (k_{pl}-u_{pl})^2 + (k_{pw}-u_{pw})^2} $$
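As a quick sanity check of this formula, the standard library's math.dist() function (available since Python 3.8) computes the same value; the measurements below are the same ones used in the test cases later in this section.
from math import dist

known = (5.1, 3.5, 1.4, 0.2)      # sepal length, sepal width, petal length, petal width
unknown = (7.9, 3.2, 4.7, 1.4)
round(dist(known, unknown), 2)    # 4.5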
This distance is one of many alternative definitions of distance between a known and unknown sample.
There are three other ways to compute a distance that are similar and often produce consistently good results without the complexity of a square root.
Manhattan Distance. This is the distance one would walk in a city with square blocks (somewhat like parts of the city of Manhattan).
Chebyshev Distance. This counts a diagonal step as 1. A Manhattan computation would rank this as 2. The Euclidean distance would be \(\sqrt{2} \approx 1.41\).
Minkowski Distance. This is a generalization of Manhattan, Chebyshev, and Euclidean distances to provide a single formula.
Manhattan distance is the total number of steps along the x-axis, plus the total number of steps along the y-axis.
$$ MD(k, u) = \lvert k_x - u_x \rvert + \lvert k_y - u_y \rvert $$
This isn't too far off from the direct Euclidean distance. In the worst case, a purely diagonal path, it's \(\sqrt{2} \approx 1.41\) times the Euclidean distance, or about 41% larger. Even so, it parallels the direct distance in a way that can yield a good k-NN result with a faster computation.
Here's a view of the Manhattan distance.
The Chebyshev distance is the largest of the x or y distances.
$$ CD(k, u) = \textbf{max}(\lvert k_x - u_x \rvert, \lvert k_y - u_y \rvert) $$
Here's a view of the Chebyshev distance.
Minkowski distance generalizes the others with a single parameter, m: each absolute difference is raised to the power m, the results are summed, and the m-th root of the sum is taken. When m is 1, this is the Manhattan distance. When m is 2, this is the Euclidean distance.
$$ MD(k, u) = \Bigl( {\lvert k_x - u_x \rvert}^m + {\lvert k_y - u_y \rvert}^m \Bigr)^{1/m} $$
To an extent, then, we can think of the choice between Manhattan and Euclidean distances as a simple parameter, m: if m is 1, this is the Manhattan distance; if m is 2, it's the Euclidean distance.
For the Manhattan and Euclidean distances, the differences are reduced using a sum() operation. For the Chebyshev distance, the differences are reduced by a max() operation.
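Before we turn these formulas into classes, here's a throwaway sketch, not part of the final design, showing the Minkowski formula as a plain function; it makes the relationship between the m parameter and the sum() reduction concrete.
from typing import Iterable

def minkowski(k: Iterable[float], u: Iterable[float], m: int) -> float:
    """Reduce per-axis differences with sum(); m=1 is Manhattan, m=2 is Euclidean."""
    return sum(abs(k_i - u_i) ** m for k_i, u_i in zip(k, u)) ** (1 / m)

minkowski((0, 0), (1, 1), m=1)    # 2.0: the Manhattan distance of one diagonal step
minkowski((0, 0), (1, 1), m=2)    # about 1.414: the Euclidean distance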
These three can be lumped together as a single kind of distance metric. There are at least seven other kinds of metrics.
See Effects of Distance Measure Choice on KNN Classifier Performance - A Review. This paper surveys 54 distinct metric computations. The three we're looking at are collectively identified as "Minkowski" measures because they're similar and measure each axis equally.
How can we have all of these different distance computations available? Does this mean we need as many as 54 subclasses of the Hyperparameter class?
We have to decompose the Hyperparameter class into two parts:
The parameter value, k, and the reference to the TrainingData.
A distance algorithm. We'd like to be able to plug in any algorithm here. We'd rather not be constrained to use a trivial \(m=1\) or \(m=2\) to choose between Manhattan and Euclidean distances.
When the Hyperparameter class requires another object to provide one (or more) methods, this is called the Strategy design pattern. For a given Hyperparameter object, h, the h.distance object has a distance() method that does the work of computing a distance. We can plug in any of the subclasses of Distance to do this work.
This means the classify() method of the Hyperparameter class will use the strategy's self.distance.distance() to compute the distances. We can use this to provide alternative distance objects, as well as alternative k values, to find a combination that provides the best-quality classification of unknown samples.
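Here's a rough sketch of how that wiring could look. This is not the final implementation: the training attribute on TrainingData and the voting logic are placeholder assumptions, shown only to make the role of self.distance.distance() concrete.
from collections import Counter

class Hyperparameter:
    def __init__(self, k: int, algorithm: "Distance", training: "TrainingData") -> None:
        self.k = k
        self.distance = algorithm      # the plugged-in Strategy object
        self.data = training

    def classify(self, unknown: "Sample") -> str:
        """Order known samples by the strategy's distance, then vote among the k nearest."""
        ordered = sorted(
            self.data.training,
            key=lambda known: self.distance.distance(known, unknown),
        )
        frequencies = Counter(known.species for known in ordered[: self.k])
        best_fit, _count = frequencies.most_common(1)[0]
        return best_fit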
We can summarize the relationships like this.
We've focused on a few of the classes:
An instance of the Hyperparameter class will have a reference to a Distance object. This lets us create any number of subclasses of Distance with any of the algorithms found in the literature.
An instance of the Distance class will compute a distance between two samples. Researchers have designed 54 implementations. We'll stick with four simple ones to start:
Chebyshev uses max() to reduce four distances to a single value.
Minkowski uses sum() and some exponentiation operations to reduce four distances to a single value. There are two variations, with different powers to which the values are raised.
Euclidean is the Minkowski algorithm using the second power: the square root of a sum of squares.
Manhattan is the Minkowski algorithm using the first power: the simple sum of absolute differences.
Two subclasses of Sample will be involved. It turns out, however, that the distance computation won't make use of the additional features of any of the subclasses of Sample.
A TrainingData object contains the original Sample objects. This object doesn't really play much of a role. We included it on the diagram to provide some context, but we won't be writing any software for this class.
We'll need to test these definitions. Chapter Thirteen covers testing in detail. We'll provide some test cases using the doctest tool. These can be incorporated into the class docstrings.
Here's an example, using the Chebyshev computation.
class Distance:
    """Abstract definition of a distance computation"""

    def distance(self, s1: Sample, s2: Sample) -> float:
        raise NotImplementedError
class Chebyshev(Distance):
    """
    Computes the Chebyshev distance between two samples.

    ::

        >>> from math import isclose
        >>> from model import TrainingKnownSample, UnknownSample, Chebyshev
        >>> s1 = TrainingKnownSample(
        ...     sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
        >>> u = UnknownSample(**{"sepal_length": 7.9, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4})
        >>> algorithm = Chebyshev()
        >>> isclose(3.3, algorithm.distance(s1, u))
        True
    """

    def distance(self, s1: Sample, s2: Sample) -> float:
        return max(
            [
                abs(s1.sepal_length - s2.sepal_length),
                abs(s1.sepal_width - s2.sepal_width),
                abs(s1.petal_length - s2.petal_length),
                abs(s1.petal_width - s2.petal_width),
            ]
        )
The docstring comment is both test data and computation. We strongly encourage its use.
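One way to run these embedded examples is with the doctest module itself. Assuming the classes live in a module named model.py, as the import statements in the docstring suggest, a short script (or the equivalent python -m doctest command) will execute every example found in the docstrings:
import doctest
import model

# Prints a summary of failed and attempted examples from all docstrings in model.py.
print(doctest.testmod(model))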
We'll turn to the slightly more complex hierarchy of three classes to do the Minkowski distance computations.
The two Minkowski algorithms are nearly identical but for a single parameter. In this case, the parameter is a feature of the class, and is used in the distance() method.
Here's how we can define the Minkowski superclass, with a placeholder for the m attribute.
class Minkowski(Distance):
    """An abstraction to provide a way to implement Manhattan and Euclidean."""

    m: int

    def distance(self, s1: Sample, s2: Sample) -> float:
        return (
            sum(
                [
                    abs(s1.sepal_length - s2.sepal_length) ** self.m,
                    abs(s1.sepal_width - s2.sepal_width) ** self.m,
                    abs(s1.petal_length - s2.petal_length) ** self.m,
                    abs(s1.petal_width - s2.petal_width) ** self.m,
                ]
            )
            ** (1 / self.m)
        )
We've provided a class-level type hint, m: int, and nothing more. We can't create a usable instance of this class, because if we try to evaluate the distance() method, there won't be a value for self.m.
The type hint does not create an instance variable. It doesn't provide a default value. It's only a hint to mypy that will be fulfilled by subclass definitions.
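We can see this by trying to use the superclass directly; the s1 and u objects here are the samples from the earlier test cases.
broken = Minkowski()
broken.distance(s1, u)    # raises AttributeError: 'Minkowski' object has no attribute 'm'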
Here are two subclasses that provide specific distance computations.
class Euclidean(Minkowski):
    m = 2


class Manhattan(Minkowski):
    m = 1
The distance() method is inherited from the superclass. The m value is provided by the subclass. No more code is needed.
It would, however, be smart to include a docstring with an explanation and a test case. The core test cases look like this:
>>> from math import isclose
>>> from model import TrainingKnownSample, UnknownSample, Euclidean
>>> s1 = TrainingKnownSample(
... sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
>>> u = UnknownSample(**{"sepal_length": 7.9, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4})
>>> algorithm = Euclidean()
>>> isclose(4.50111097, algorithm.distance(s1, u))
True
This requires the isclose() function from the math module. This function avoids checking too many digits of floating-point results. Python runs on a number of platforms. On some, the last digit of a floating-point approximation may differ, leading to a test failure for an error that's about 1 part in \(10^{15}\): fifteen digits past the decimal point.
The Manhattan distance for the same two samples is close to 7.6. We'll leave it to the reader to work out the doctest example for this class.
Just to make it clear how easy it is to add subclasses, we'll define a somewhat more complex distance metric. This is the Sorensen distance, also known as Bray-Curtis.
$$ SD(k, u) = \frac{\lvert k_x - u_x \rvert + \lvert k_y - u_y \rvert }{(k_x + u_x) + (k_y + u_y)} $$
We've effectively standardized each component of the Manhattan distance by dividing by the possible range of values.
Here's a diagram:
The simple Manhattan distance applies no matter how far from the origin we are. The Sorensen distance reduces the importance of measures that are further from the origin so they don't dominate the k-NN by virtue of being large-valued outliers.
We can introduce this into our design by adding a new subclass of Distance. While this is similar, in some ways, to the Manhattan distance, it's often classified separately.
class Sorensen(Distance):
    def distance(self, s1: Sample, s2: Sample) -> float:
        return sum(
            [
                abs(s1.sepal_length - s2.sepal_length),
                abs(s1.sepal_width - s2.sepal_width),
                abs(s1.petal_length - s2.petal_length),
                abs(s1.petal_width - s2.petal_width),
            ]
        ) / sum(
            [
                s1.sepal_length + s2.sepal_length,
                s1.sepal_width + s2.sepal_width,
                s1.petal_length + s2.petal_length,
                s1.petal_width + s2.petal_width,
            ]
        )
We don't try to leverage the Minkowski class. There may be some way to reuse that class, but it seems simpler to provide a clear definition that shows the absolute differences weighted by the sum of the attribute values.
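A docstring test case for this class might look like the following. The samples are the same ones used for Chebyshev and Euclidean, so the expected value is the Manhattan total of 7.6 divided by the sum of all the attribute values, 27.4. This assumes the Sorensen class is added to the same model module as the others.
>>> from math import isclose
>>> from model import TrainingKnownSample, UnknownSample, Sorensen
>>> s1 = TrainingKnownSample(
...     sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
>>> u = UnknownSample(**{"sepal_length": 7.9, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4})
>>> algorithm = Sorensen()
>>> isclose(7.6 / 27.4, algorithm.distance(s1, u))
True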
We've kept the Chebyshev computation separate from the other Minkowski computations. They're not really very different, though.
The essential difference between the Chebyshev and Minkowski algorithms is the use of max() vs. sum() to reduce a vector of differences to a single value.
We can still exploit the Strategy design pattern in Python. A Python function is an object; we can make the function an attribute value of a new Minkowski-like class.
We want to be able to write code like this:
>>> from model import TrainingKnownSample, UnknownSample, Minkowski_2
>>> class CD(Minkowski_2):
...     m = 1
...     reduction = max
>>> class MD(Minkowski_2):
...     m = 1
...     reduction = sum
>>> class ED(Minkowski_2):
...     m = 2
...     reduction = sum
We've extended the Minkowski_2 class with two class-level attributes in each subclass. The m attribute provides the power to use and the root to take. The reduction attribute provides a function which is used to perform the reduction: either the sum() function or the max() function.
The Minkowski_2 class is very similar to the Minkowski class, shown earlier in this case study. The class will extend the abstract Distance class. It introduces a reduction class-level attribute.
class Minkowski_2(Distance):
    """
    A generic way to implement Manhattan, Euclidean, and Chebyshev.
    """

    m: int
    reduction: Reduce_Function

    def distance(self, s1: Sample, s2: Sample) -> float:
        # Fetch the class-level reduction function (sum or max).
        # Built-in functions are not bound as methods, so no `self`
        # argument is passed to them.
        summarize = self.reduction
        return (
            summarize(
                [
                    abs(s1.sepal_length - s2.sepal_length) ** self.m,
                    abs(s1.sepal_width - s2.sepal_width) ** self.m,
                    abs(s1.petal_length - s2.petal_length) ** self.m,
                    abs(s1.petal_width - s2.petal_width) ** self.m,
                ]
            )
            ** (1 / self.m)
        )
We have to be careful about how Python handles functions stored as class attributes. When an ordinary Python function is assigned as a class-level attribute, writing self.reduction(...) treats it as a method: the instance, self, is automatically supplied as the first argument value, which is not what we want here. Built-in functions like sum() and max() don't participate in this binding; they are returned unchanged by the attribute lookup, so no extra argument is added.
There's nothing magical about the variable self. The magic applies to the first parameter of methods defined inside a class. It's generally spelled self so we can all recognize it as being special.
The assignment summarize = self.reduction fetches the plugged-in function once and makes it clear, at the point of call, that we're invoking a stand-alone function rather than a method of the class.
While the typing module offers a Callable type hint for functions, this doesn't work out well for functions that are attributes of a class. An attribute that's a Callable looks a lot like an ordinary method of the class. It's difficult to distinguish between the two cases without providing some details to mypy. To make it clear, we've introduced a Reduce_Function type definition.
This is an example of a new kind of Protocol. In this case, we use the formal (and wordy) protocol definition instead of a terse (and ambiguous) Callable definition.
from typing import List, Protocol

class Reduce_Function(Protocol):
    """Define a callable object with specific parameters."""

    def __call__(self, values: List[float]) -> float:
        pass
An instance of the Reduce_Function class is a "callable object". This class definition declares our intent behind the class-level reduction attribute in the Minkowski_2 definition. The __call__ special method is how we build callable objects.
Here's a non-Protocol callable object definition. This class can be instantiated to create
callables.
class SumPow:
    def __call__(self, sequence: List[float], p: float = 1.0) -> float:
        return sum(x**p for x in sequence)

sum_pow = SumPow()
The sum_pow object is a callable object that can be used like a function: sum_pow([2.5, 2.5], 2) == 12.5. The SumPow class doesn't have an __init__() method because it doesn't have any internal state; it's a very thin wrapper around the __call__() method.
Callable objects can have other methods. The method named __call__() is used implicitly when we evaluate a callable object with a list of argument values in (). We can think of sum_pow(s, p) as if it were SumPow.__call__(sum_pow, s, p): we've provided the instance variable, self, and the two argument values to the function for evaluation.
Using typing.Protocol lets us provide a class definition as a type hint. This provides some flexibility for the edge case of defining an attribute that is almost like a method. We'll return to these concepts in Chapter Six, where we'll look at additional special methods we can use.