Case Study for Chapter Six, Abstract Base Classes (ABCs) and Operator Overloading

In Chapter Two, we talked in a vague way about loading the training data and splitting it into two clumps -- the training set and the testing set. In Chapter Nine, we looked at ways to deserialize the raw data into Sample instances.

In this chapter we want to look further at this operation of using the raw data to create a number of TrainingKnownSample instances separate from a number of TestingKnownSample instances.

We'll use a variety of approaches in this chapter, including a number of overloaded comparison operations. We'll look at processing we can do on the entire collection of data, including shuffling, sorting, and random choices. We'll also look at processing that can be done "incrementally", one sample at a time.

Design Patterns

In Chapter Three, we looked at a number of ways to compute distances among samples. This was an example of applying the Strategy design pattern. We identified a number of alternatives, defined a common interface for each alternative, and then packaged them as methods of distinct classes.

A Hyperparameter object would be given a Distance object. The Distance object provided a strategy to compute sample distances using one of the defined algorithms.
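
As a refresher, a minimal sketch of that pattern might look like the following. The class names and the method signature here are illustrative; they're not the exact definitions from Chapter Three.

    from math import hypot

    class Distance:
        """A distance computation strategy."""
        def distance(self, s1: "Sample", s2: "Sample") -> float:
            raise NotImplementedError

    class Euclidean(Distance):
        def distance(self, s1: "Sample", s2: "Sample") -> float:
            # math.hypot() accepts any number of coordinates in Python 3.8+.
            return hypot(
                s1.sepal_length - s2.sepal_length,
                s1.sepal_width - s2.sepal_width,
                s1.petal_length - s2.petal_length,
                s1.petal_width - s2.petal_width,
            )

A Hyperparameter object can hold any Distance instance and invoke its distance() method without knowing which concrete algorithm was plugged in.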

In previous chapters, we made use of a number of Python's special methods. These represent a variety of class design aspects.

A common thread to Python's special methods is that many of them have a default behavior which we then replace with something we need. This isn't universally true. Some special methods, like __call__(), don't have a default behavior.
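
Here's a small example, unrelated to the case study, of providing behavior where there is no default. An instance of a class isn't callable at all until we define __call__():

>>> class Adder:
...     def __init__(self, increment: int) -> None:
...         self.increment = increment
...     def __call__(self, value: int) -> int:
...         return value + self.increment
...
>>> add_2 = Adder(2)
>>> add_2(40)
42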

Our training data partitioning can be approached from two distinct directions:

  1. We can build a single collection of all the samples, then shuffle and split the whole thing at once.

  2. We can allocate each sample, as it arrives, to either the training collection or the testing collection.

The net effect is the same. The processes are similar. Working with an entire collection can be relatively simple, but uses a great deal of memory. Processing items individually can be a bit more complex, but doesn't require as much memory.

We'll start by building some sophisticated collections.

Smart List

We can extend the built-in list class to add some features. It's important to note that extending built-in types can be tricky because the type hints for these types are sometimes surprisingly complex.

Python's built-in structures like list have a variety of initialization alternatives.

To make this clear to mypy, we need to use the @overload decorator for the variant definitions.

    from typing import Iterable, List, Optional, overload

    class SamplePartition(List["SampleDict"]):
        @overload
        def __init__(self) -> None:
            ...

        @overload
        def __init__(self, iterable: Iterable["SampleDict"]) -> None:
            ...

        def __init__(self, iterable: Optional[Iterable["SampleDict"]] = None) -> None:
            if iterable:
                super().__init__(iterable)
            else:
                super().__init__()

        @property
        def training(self) -> List[TrainingKnownSample]:
            raise NotImplementedError

        @property
        def testing(self) -> List[TestingKnownSample]:
            raise NotImplementedError

We've defined two overloads for the __init__() method. The first overload is __init__() with no arguments. This should create an empty list of SampleDict objects. The second overload is __init__() with an iterable source of SampleDict objects.

The implementation is the definition of the __init__() method without the @overload decorator. This definition uses the superclass' __init__() method to build a List[SampleDict] object.

Since this is a superclass, we haven't provided a definition for the training or testing properties. Each algorithm can have different implementations of the methods that provide values for these attributes.

A Shuffling Strategy

One alternative is to shuffle and cut a list -- precisely the way a deck of cards is shuffled and cut before a game. We can use random.shuffle() to handle the randomized shuffling. The cut is -- in a way -- a hyperparameter. How large should the training set be compared with the testing set? Suggestions include 80% to 20%, 67% to 33% and an even 50% to 50% split.

We'll make the split a feature of the class. We can create separate subclasses to implement alternative splits.

Here's a shuffling implementation.

    import random

    class Shuffling_SamplePartition_80(SamplePartition):
        training_subset = 0.8

        def __init__(self, iterable: Optional[Iterable["SampleDict"]] = None) -> None:
            super().__init__(iterable)
            self.split: Optional[int] = None

        def shuffle(self) -> None:
            # Shuffle and split exactly once; self.split is None until then.
            if self.split is None:
                random.shuffle(self)
                self.split = int(len(self) * self.training_subset)

        @property
        def training(self) -> List[TrainingKnownSample]:
            self.shuffle()
            return [TrainingKnownSample(**sd) for sd in self[: self.split]]

        @property
        def testing(self) -> List[TestingKnownSample]:
            self.shuffle()
            return [TestingKnownSample(**sd) for sd in self[self.split :]]

Since we're extending the SamplePartition superclass, we can leverage the overloaded __init__() method definitions. For this subclass, we need to provide a concrete implementation that's compatible with the superclass.

The two properties, training and testing, both make use of an internal shuffle() method. This method will shuffle the samples exactly one time, and then set a value, self.split, to show where to split the samples into training and testing subsets.

The two properties use Python list slicing to subdivide the raw SampleDict objects and build useful TrainingKnownSample and TestingKnownSample objects from the raw data.

Because this depends on the random module, we can create reproducible results. By setting random.seed() to a fixed value, we can create a random, but predictable, collection of samples. This works as follows:

>>> import random
>>> from model import Shuffling_SamplePartition_80
>>> from pprint import pprint
>>> data = [
...     {
...         "sepal_length": i + 0.1,
...         "sepal_width": i + 0.2,
...         "petal_length": i + 0.3,
...         "petal_width": i + 0.4,
...         "species": f"sample {i}",
...     }
...     for i in range(10)
... ]

>>> random.seed(42)
>>> ssp = Shuffling_SamplePartition_80(data)
>>> pprint(ssp.testing)
[TestingKnownSample(sepal_length=0.1, sepal_width=0.2, petal_length=0.3, petal_width=0.4, species='sample 0', classification=None, ),
 TestingKnownSample(sepal_length=1.1, sepal_width=1.2, petal_length=1.3, petal_width=1.4, species='sample 1', classification=None, )]

With a random seed of 42, we always get the same two samples in the testing set.

This allows us to build the initial list in a variety of ways. We can, for example, append data items to an empty list like this:

    ssp = Shuffling_SamplePartition_80()
    for row in data:
        ssp.append(row)

The SamplePartition subclass of list will inherit all the methods of the parent class. This allows us to make changes to the internal state of the list prior to extracting the training and testing subsets.

Once we ask for testing or training data, however, any further changes to the list can appear peculiar. The shuffle is not redone, and the split is not recomputed. This means that appending another SampleDict object will lead to the creation of a new testing sample: the new item lands past the split point, in the testing slice. While this behavior is very simple, it may not be desirable.
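
Continuing the doctest session above, here's a sketch of that peculiarity. The new SampleDict values are made up for illustration:

>>> ssp.append(
...     {"sepal_length": 10.1, "sepal_width": 10.2,
...      "petal_length": 10.3, "petal_width": 10.4,
...      "species": "sample 10"}
... )
>>> len(ssp.training), len(ssp.testing)
(8, 3)

The split point stays at eight, so the new item falls into the testing slice without any reshuffling.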

We could override the built-in append() method to set self.split back to None when a new item is appended to the list. We'd have to do the same thing for methods like extend(), remove(), clear(), pop(), and insert() as well. Among the special methods, __delitem__() will remove an item from the list, changing the number of items, so this, too, would need to reset self.split.

While this is a large number of methods to override, the overrides are not terribly complex.

    def extend(self, items: Iterable["SampleDict"]) -> None:
        self.split = None
        super().extend(items)

    def pop(self, index: int = -1) -> "SampleDict":
        self.split = None
        return super().pop(index)

There will be a long list of methods with definitions like this. Setting the self.split value back to None with each state change ensures that the next time there's a request for the testing or training property, the list is shuffled and split again.

We could define a decorator that embodies this aspect. Extracting a common aspect from a number of methods can make the class definition slightly simpler and much more consistent. Here's the decorator with a sketchy set of type hints.

>>> from typing import Callable, Any, Iterable
>>> from functools import wraps
>>> def unsplit(method: Callable[..., Any]) -> Callable[..., Any]:
...     @wraps(method)
...     def concrete_unsplit_method(self, *arg, **kwarg):
...         self.split = None
...         return method(self, *arg, **kwarg)
...     return concrete_unsplit_method
...

Here's the class using the @unsplit decorator.

>>> from model import Shuffling_SamplePartition_80, SampleDict
>>> class Extendable_SamplePartition(Shuffling_SamplePartition_80):
...     @unsplit
...     def append(self, item: SampleDict) -> None:
...         super().append(item)
...     @unsplit
...     def extend(self, item_iterable: Iterable[SampleDict]) -> None:
...         super().extend(item_iterable)
...

We've only decorated a few methods. Each method that changes the state of the list would need similar decoration.

Here's an example of using this class.

>>> samples = [
...        {
...            "sepal_length": i + 0.1,
...            "sepal_width": i + 0.2,
...            "petal_length": i + 0.3,
...            "petal_width": i + 0.4,
...            "species": f"sample {i}",
...        }
...        for i in range(10)
... ]
>>> x = Extendable_SamplePartition()
>>> x.append(samples[0])
>>> len(x.training)
0
>>> len(x.testing)
1
>>> x.append(samples[0])
>>> len(x.training)
1
>>> len(x.testing)
1

In this application, however, where the only client is the TrainingData class, do we really need to tweak all the methods of List[SampleDict] so they also set the split attribute to None? Perhaps not. An alternative is to describe, in the docstring, the consequences of changing the state of a Shuffling_SamplePartition_80 object after asking for the testing or training values.
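
That documentation-only approach might look like the following; the wording is our suggestion, not part of the original design.

    class Shuffling_SamplePartition_80(SamplePartition):
        """Splits a list of SampleDict objects 80% training, 20% testing.

        Note: the shuffle and split happen once, on the first use of the
        ``training`` or ``testing`` property. Items appended after that
        will land in the testing subset without any reshuffling.
        """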

An Incremental Strategy

We have an alternative to splitting a single list after it's been built. Instead of extending the list class to provide two sub-lists, we can reframe the problem slightly. Let's define a new class, separate from SamplePartition, that makes a choice -- possibly random -- between testing and training for each SampleDict object that is presented via initialization, append(), or extend().

Here's an abstraction that summarizes our thinking on this. We'll have three methods for loading samples. We'll have two properties that will provide the training and testing sets. We don't inherit from List because we're not providing any other list-like features -- not even __len__().

    class DealingPartition:
        def __init__(self, items: Optional[Iterable["SampleDict"]] = None) -> None:
            pass

        def append(self, item: "SampleDict") -> None:
            pass

        def extend(self, items: Iterable["SampleDict"]) -> None:
            pass

        @property
        def training(self) -> List[TrainingKnownSample]:
            raise NotImplementedError

        @property
        def testing(self) -> List[TestingKnownSample]:
            raise NotImplementedError

Here's how we can extend this to wrap two internal collections. We'll break this into two parts. First, building the collections, then the properties to expose the values of the collections.

    class Counting_DealingPartition_80(DealingPartition):
        training_subset = (8, 10)  # 8/10 == 80%
        # 2/3 and 1/2 are common choices.

        def __init__(self, items: Optional[Iterable["SampleDict"]] = None) -> None:
            self.counter = 0
            self._training: List[TrainingKnownSample] = []
            self._testing: List[TestingKnownSample] = []
            if items:
                self.extend(items)

        def extend(self, items: Iterable["SampleDict"]) -> None:
            for item in items:
                self.append(item)

        def append(self, item: "SampleDict") -> None:
            n, d = self.training_subset
            if self.counter % d < n:
                self._training.append(TrainingKnownSample(**item))
            else:
                self._testing.append(TestingKnownSample(**item))
            self.counter += 1

We've defined an initializer that sets the initial state of two empty collections. Then it uses the extend() method to build the collections from a source iterable.

The extend() method relies on the append() method to allocate a SampleDict instance to either testing or training. The append() method actually does all the work: it counts the items and makes a decision based on some modulus arithmetic.

The training subset is defined as a fraction. We've shown it defined as a tuple, (8, 10), with a comment suggesting this means 8/10, or 80%, training, with the balance testing.

For a given counter value, \(c\), if \(c \bmod 10 < 8\), we'll call it training; if \(c \bmod 10 \geq 8\), we'll call it testing.
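
We can see the allocation pattern for the first ten counter values directly:

>>> n, d = 8, 10
>>> ["training" if c % d < n else "testing" for c in range(10)]
['training', 'training', 'training', 'training', 'training', 'training', 'training', 'training', 'testing', 'testing']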

Here are the remaining two methods. These expose the values of the two internal list objects. To an extent, these are useless in this context. It's common in object-oriented Python to simply name the two internal collections self.training and self.testing and drop these property methods.

    @property
    def training(self) -> List[TrainingKnownSample]:
        return self._training

    @property
    def testing(self) -> List[TestingKnownSample]:
        return self._testing

This class also partitions the data. It doesn't rely on a random number generator. It can incrementally provide results at any stage of the operation.

Polymorphism

Because we were careful in our design, the various partitioning algorithms all have a very similar interface. They are not all subclasses of a common superclass, but they are functionally identical and we can use any of them.

The use cases are all essentially identical. Here's how we'd implement the load() method of the TrainingData class.

    def load(self, raw_data_iter: Iterator[Dict[str, str]]) -> None:
        """Extract TestingKnownSample and TrainingKnownSample from raw data"""
        partitioner = Counting_DealingPartition_80(raw_data_iter)
        self.training = partitioner.training
        self.testing = partitioner.testing
        self.uploaded = datetime.datetime.utcnow()

We can then replace the Counting_DealingPartition_80 class with any of the alternative definitions to experiment with different ways to split the data.
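
For example, switching to the shuffling strategy changes only the construction line; the rest of the load() method stays the same:

        partitioner = Shuffling_SamplePartition_80(raw_data_iter)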

Duplicate Rejection

A feature that's sometimes needed is to manage the remote possibility of duplicates between the testing and training sets. We don't want to test with data values that are exact matches of training values.

This is a potentially severe complication to our simple shuffling and dealing algorithms.

We do not want to simply compare each input sample against all other input samples, if we can avoid it. This is an \(\textbf{O}(n^2)\) operation. For a few thousand rows it becomes millions of comparisons.

We don't really have to locate all duplicates. We only need to be sure that no testing item is a simple duplicate of a training item. This is a slight simplification that lets us compare items in the testing set to items in the training set. Since the testing set is smaller, the work is somewhat reduced.

We can allow the training items to be duplicates of each other. This tends to reflect real-world clustering of data.

Before allowing an item to be used for testing, we need to compare it to all the items in the training set. If it's a duplicate, it belongs in the training set.

We can avoid a huge search cost by partitioning the training data into reasonably-sized buckets. A common technique for doing this is to compute a hash of the values, and partition the hashes into buckets. We can then examine the items in a hash bucket to check for duplicates.

This is relatively easy to do in Python because most objects provide an internal hash value. The built-in hash() function exposes this. We can leverage this via the modulus operator, %, to partition objects into a small number of buckets. The core of the code looks like this:

    bucket.setdefault(hash(obj) % n, []).append(obj)

This appends an object, obj, into one of the n distinct buckets in a mapping named bucket.
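
Here's a tiny demonstration with float keys; the number of buckets, 4, is arbitrary:

>>> bucket = {}
>>> for obj in (1.1, 2.2, 3.3, 4.4):
...     bucket.setdefault(hash(obj) % 4, []).append(obj)
...
>>> sum(len(contents) for contents in bucket.values())
4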

For mutable objects, there is no hash. This means none of the objects we've defined have hash values. The individual float values for each attribute have hashes, but the composite object can't be hashed because it's mutable.

For immutable objects, like a tuple or a float, the hash reflects the actual value. Python's hashed collections, dict and set, rely on this: two items must have equal hashes before they're compared in detail for equality.

When we look up "hello" as a key in a dict, the hash values are compared first. Only when the hash values match are the individual characters compared. If the hash values aren't equal, there's no need to look at the details within the str objects.
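
The contract runs in the other direction, too: objects that compare as equal are required to have equal hashes. A quick check:

>>> 1 == 1.0
True
>>> hash(1) == hash(1.0)
True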

In our case study, we don't want to define the special __hash__() method for our SampleDict items. They're mutable dictionary objects, and offering a __hash__() method is misleading.

When we compute a temporary hash for a mutable object, we need to be clear that it's only valid until the object's attribute values change. We also need to use a method name other than the special method name, __hash__().

We'll add a hash() method to the Sample class so that all of the samples can produce a hash. We can use this to put test samples into buckets to check for duplicates. The algorithm we'll use computes a sum of the hashes of the "interesting" values -- the values we'd compare to test for equality -- modulo a system-wide hash parameter.

Generally, sys.hash_info.modulus has the modulus value we need to use. The method looks like this:

    def hash(self) -> int:
        # Deliberately not the special method __hash__(); Sample is mutable.
        return sum(
            [
                hash(self.sepal_length),
                hash(self.sepal_width),
                hash(self.petal_length),
                hash(self.petal_width),
            ]
        ) % sys.hash_info.modulus

For our purposes a simple sum is good enough to distinguish similar sample values. Consider this pair of samples.

    x = Sample(1, 2, 3, 4)
    y = Sample(2, 1, 4, 3)

The two samples have the same hash value. The samples are not duplicates. This is acceptable because the hash value is not the only test for equality.
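
Because the hash of a small integer is the integer itself, we can verify the collision without building Sample objects:

>>> sum(map(hash, [1, 2, 3, 4])) == sum(map(hash, [2, 1, 4, 3]))
True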

We'll need to modify our DealingPartition classes to exclude duplicates from the testing set. There are a few places we can inject this additional feature.

  1. Rewrite the append() method to check for duplicates. This means also revising the class to use a mapping of buckets instead of a simple List[TrainingKnownSample]. This seems like it would make the DealingPartition class extremely complex by combining two nearly unrelated aspects in a single class.

  2. Provide a class to implement Collection[TrainingKnownSample]. This would use a mapping of buckets and implement a fast __contains__() check based on hashing. We can then rewrite the DealingPartition class to use this special fast-search container implementation.

Creating Our Own Collection

The typing module provides some generic type descriptions. The collections.abc module also provides generic type descriptions. Prior to Python 3.9, this created an unpleasant conundrum. The collections.abc definitions provided a number of built-in methods, making them ideal as base classes. The typing module definitions were used by mypy.

During the transition from older Python 3.x releases to 3.9, the following kind of version check is used when creating new kinds of collections.

    import collections.abc
    import typing
    import sys

    if sys.version_info >= (3, 9):
        BucketCollection = collections.abc.Collection[Sample]
    else:
        BucketCollection = typing.Collection[Sample]

For new versions of Python, the collections.abc definitions can be used directly. For older versions of Python, a fallback to typing.Collection is required.

The Collection type is a mixture of three separate protocols:

  1. Sized, which supports the len() function via __len__().

  2. Iterable, which supports the for statement via __iter__().

  3. Container, which supports the in operator via __contains__().

These are defined as separate features so we can create collections that have some of these, but not all. We might, for example, have a complex graph of arcs and nodes where iteration doesn't make sense.
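
Here's a hypothetical sketch of that idea: a container of graph arcs that is Sized and supports the in operator, but deliberately omits __iter__().

    from typing import Set, Tuple

    class ArcIndex:
        """Sized and Container, but not Iterable."""
        def __init__(self) -> None:
            self._arcs: Set[Tuple[int, int]] = set()

        def add(self, from_node: int, to_node: int) -> None:
            self._arcs.add((from_node, to_node))

        def __contains__(self, arc: object) -> bool:
            return arc in self._arcs

        def __len__(self) -> int:
            return len(self._arcs)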

Also, the Collection type is generic, and requires a type parameter to specify what it is a collection of. In this case, we've made it a collection of Sample objects.

Our collection is a kind of composition. Inside the class, we'll use the collections.defaultdict to create a mapping from buckets to lists of samples.

Here's the class definition:

    import collections
    import itertools
    from typing import Any, DefaultDict, Iterable, Iterator, List, Optional, cast

    class BucketedCollection(BucketCollection):
        def __init__(self, samples: Optional[Iterable[Sample]] = None) -> None:
            super().__init__()
            self.buckets: DefaultDict[int, List[Sample]] = collections.defaultdict(list)
            if samples:
                self.extend(samples)

        def extend(self, samples: Iterable[Sample]) -> None:
            for sample in samples:
                self.append(sample)

        def append(self, sample: Sample) -> None:
            # Use the Sample's hash() method to pick one of 128 buckets.
            b = sample.hash() % 128
            self.buckets[b].append(sample)

        def __contains__(self, target: Any) -> bool:
            # Only the one relevant bucket needs to be searched.
            b = cast(Sample, target).hash() % 128
            return any(existing == target for existing in self.buckets[b])

        def __len__(self) -> int:
            return sum(len(b) for b in self.buckets.values())

        def __iter__(self) -> Iterator[Sample]:
            return itertools.chain(*self.buckets.values())

The construction of a BucketedCollection can be done in any of three different ways:

  1. With no arguments, creating an empty collection.

  2. With an iterable source of Sample objects provided to the initializer.

  3. Incrementally, using the append() or extend() methods.

All three use cases eventually rely on the append() method. This method computes a bucket number using the hash() method of a Sample instance, and appends the item to that bucket's list of Sample instances.

The __contains__() method implements the in operator. When we write Sample(1, 2, 3, 4) in b, this will evaluate __contains__(). The implementation locates a bucket and checks only the items in that one bucket. Since there are 128 buckets, this examines -- on average -- less than 1% of the data.

The __len__() method returns the current size by counting items in each bucket.

The __iter__() method iterates through the items. We use itertools.chain() to return all of the values from all of the buckets. The order depends on the hash values, and is very difficult to predict, making unit testing potentially annoyingly complex.
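
One way to keep unit tests stable in spite of this is to impose an ordering inside the test itself. Here's a sketch; the test name and the expected parameter are assumptions for illustration:

    from operator import attrgetter

    def test_bucketed_contents(bucketed: BucketedCollection, expected: List[Sample]) -> None:
        # Sorting removes the dependence on hash-driven bucket ordering.
        key = attrgetter("sepal_length")
        assert sorted(bucketed, key=key) == sorted(expected, key=key)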

Changes to the DealingPartition Classes

The DealingPartition subclasses will need a few changes to use this new type of container.

    def __init__(self, items: Optional[Iterable["SampleDict"]] = None) -> None:
        self.counter = 0
        # A BucketedCollection, not a list, for fast __contains__() checks.
        self._training = BucketedCollection()
        self._testing: List[TestingKnownSample] = []
        if items:
            self.extend(items)

The self._training collection needs the fast __contains__() check, so we'll replace the original List[TrainingKnownSample] with a BucketedCollection. The BucketedCollection is not generic with respect to the items in the collection; it's bound to Sample. That's why it doesn't need a type parameter the way List requires a specific class of items.

Additionally, the append() method needs to reject candidate testing objects that would duplicate existing training objects.

    def append(self, item: "SampleDict") -> None:
        n, d = self.training_subset
        if self.counter % d < n:
            self._training.append(TrainingKnownSample(**item))
        else:
            candidate = TestingKnownSample(**item)
            if candidate in self._training:
                # Duplicate, create a Training sample from it
                self._training.append(TrainingKnownSample(**item))
            else:
                self._testing.append(candidate)
        self.counter += 1

We'll create a TestingKnownSample from the raw data. If this is already in the training set, it's not a good testing candidate, and needs to be put into the training set. If the candidate is not in the training set, it's a perfectly good test case.

The final change is to the property that emits the training data.

    @property
    def training(self) -> List[TrainingKnownSample]:
        return list(self._training)

We'll make use of the __iter__() method of the BucketedCollection to create a List[TrainingKnownSample] object that can be returned.

Each time dealer.training is used, it will create a new List object. This is, generally, less than ideal. For large sets of data, it can consume a lot of memory with nearly identical objects.

It seems like a better approach is to change -- fundamentally -- the objects involved. Here's the change we need to make to both the SamplePartition and DealingPartition families of classes: instead of building new list objects, the training and testing properties should produce iterators over the internal collections.

This means that these partitioning strategies need to be the final resting places for our data. The partitioning classes can produce iterables for the two essential processing steps; a sketch follows the list.

  1. When testing a hyperparameter, the test() method of the Hyperparameter class will examine all of the items in the iterable collection of TestingKnownSample objects. The test() method does not need a list object, it only needs a source of test data.

  2. When classifying a sample, the classify() method of the Hyperparameter class will examine all the items in the iterable collection of TrainingKnownSample objects. This method doesn't really need a list object, it only needs a source of samples.
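
Here's a sketch of the revised properties for the Counting_DealingPartition_80 class. The type hints are approximate; iterating a BucketedCollection yields Sample objects, so a cast or a more precise generic container may be needed.

    @property
    def training(self) -> Iterator[TrainingKnownSample]:
        return iter(self._training)

    @property
    def testing(self) -> Iterator[TestingKnownSample]:
        return iter(self._testing)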

This final round of changes is left as an exercise for the reader. It requires some careful testing, which is the subject of Chapter 13.