In Chapter Two, we talked in a vague way about loading the training data
and splitting it into two clumps -- the training set and the testing set.
In Chapter Nine, we looked at ways to deserialize the raw data into Sample
instances. In this chapter we want to look further at this operation of using
the raw data to create a number of TrainingKnownSample instances
separate from a number of TestingKnownSample instances.
We'll use a variety of approaches in this chapter, including a number of overloaded comparison operations. We'll look at processing we can do on the entire collection of data, including shuffling, sorting, and random choices. We'll also look at processing that can be done "incrementally", one sample at a time.
In Chapter Three, we looked at a number of ways to compute distances among samples. This was an example of applying the Strategy design pattern. We identified a number of alternatives, defined a common interface for each alternative, and then packaged them as methods of distinct classes. A Hyperparameter object would be given a Distance object. The Distance object provided a strategy to compute sample distances using one of the defined algorithms.
In previous chapters, we made use of a number of Python's special methods. These represent a variety of class design aspects.

__call__() was used to make an instance of a Distance class into a callable object.

__init__() is generally used to initialize an object. All of the Sample classes made use of this. None of the Distance classes needed an initialization. There are some good reasons for implementing this even for a class that wraps a stateless computation.

__repr__() creates a representation of the object so we can understand its internal state.

__str__() can create a useful, friendly representation of an object. For a number of built-in types, like the various kinds of numbers, the __str__() and the __repr__() are identical.

__eq__() was shown in Chapter Four as a way to compare two objects. We implement this to define which attributes are compared to be able to declare two objects as being the same.

__len__() made a brief appearance in Chapter Four. We used this so we could use len(users) to find out how many objects were inside the users.users mapping. Similarly, we defined a values() method that delegated work to self.users.values(). This made the container have some methods in common with a dictionary, without having to do the work of defining a proper extension to the built-in dictionary class.
A common thread to Python's special methods is that many of them have a default behavior which we then replace with something we need. This isn't universally true. Some special methods, like __call__(), don't have a default behavior.
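As a quick, hypothetical illustration (not part of the case study), here's a minimal class that replaces two default behaviors, __repr__() and __eq__(), while leaving the others alone:

```python
class Point:
    """A tiny demonstration of replacing default special method behavior."""

    def __init__(self, x: float, y: float) -> None:
        self.x, self.y = x, y

    def __repr__(self) -> str:
        # Replaces the default "<Point object at 0x...>" text.
        return f"Point(x={self.x}, y={self.y})"

    def __eq__(self, other: object) -> bool:
        # Replaces the default identity-based comparison with value comparison.
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)


p, q = Point(1, 2), Point(1, 2)
print(repr(p))   # Point(x=1, y=2)
print(p == q)    # True: equal values, even though they're distinct objects
```

Without the __eq__() override, p == q would be False, because the default comparison is by object identity.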
Our training data sorting can be approached from two distinct directions:
We can ingest all the raw data, then distribute into two collections for later use.
During the process of ingestion, we can make selections among the collections.
The net effect is the same. The processes are similar. Working with an entire collection can be relatively simple, but uses a great deal of memory. Processing items individually can be a bit more complex, but doesn't require as much memory.
We'll start by building some sophisticated collections. We can extend the built-in list class to add some features. It's important to note that extending built-in types can be tricky because the type hints for these types are sometimes surprisingly complex.
Python's built-in structures like list have a variety of initialization alternatives. We can use list() to create an empty list. We can use list(x) to create a list from an iterable source of data. To make this clear to mypy, we need to use the @overload decorator for the variant definitions.
class SamplePartition(List["SampleDict"]):
    @overload
    def __init__(self) -> None:
        ...

    @overload
    def __init__(self, iterable: Iterable["SampleDict"]) -> None:
        ...

    def __init__(self, iterable: Optional[Iterable["SampleDict"]] = None) -> None:
        if iterable:
            super().__init__(iterable)
        else:
            super().__init__()

    @property
    def training(self) -> List[TrainingKnownSample]:
        raise NotImplementedError

    @property
    def testing(self) -> List[TestingKnownSample]:
        raise NotImplementedError
We've defined two overloads for the __init__() method. The first overload is __init__() with no arguments. This should create an empty list of SampleDict objects. The second overload is __init__() with an iterable source of SampleDict objects.

The implementation is the definition of the __init__() method without the @overload decorator. This definition uses the superclass' __init__() method to build a List[SampleDict] object.

Since this is a superclass, we haven't provided a definition for the training or testing properties. Each algorithm can have different implementations of the methods that provide values for these attributes.
One alternative is to shuffle and cut a list -- precisely the way a deck of cards is shuffled and cut before a game. We can use random.shuffle() to handle the randomized shuffling. The cut is -- in a way -- a hyperparameter. How large should the training set be compared with the testing set? Suggestions include 80% to 20%, 67% to 33%, and an even 50% to 50% split.
We'll make the split a feature of the class. We can create separate subclasses to implement alternative splits.
Here's a shuffling implementation.
class Shuffling_SamplePartition_80(SamplePartition):
    training_subset = 0.8

    def __init__(self, iterable: Optional[Iterable["SampleDict"]] = None) -> None:
        super().__init__(iterable)
        self.split: Optional[int] = None

    def shuffle(self) -> None:
        if not self.split:
            random.shuffle(self)
            self.split = int(len(self) * self.training_subset)

    @property
    def training(self) -> List[TrainingKnownSample]:
        self.shuffle()
        return [TrainingKnownSample(**sd) for sd in self[: self.split]]

    @property
    def testing(self) -> List[TestingKnownSample]:
        self.shuffle()
        return [TestingKnownSample(**sd) for sd in self[self.split :]]
Since we're extending the SamplePartition superclass, we can leverage the overloaded __init__() method definitions. For this subclass we need to provide a concrete implementation that's compatible with the superclass.

The two properties, training and testing, both make use of an internal shuffle() method. This method will shuffle the samples exactly one time, and then set a value, self.split, to show where to split the samples into training and test subsets.
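Following this pattern, an alternative split ratio is just a one-line subclass that overrides training_subset. The sketch below is self-contained, using a simplified stand-in for the case study's class so it runs on its own (the names here are hypothetical):

```python
import random
from typing import Iterable, List, Optional


class Shuffling_Partition:
    """Simplified stand-in for Shuffling_SamplePartition_80's split logic."""

    training_subset = 0.8

    def __init__(self, items: Optional[Iterable[int]] = None) -> None:
        self.data: List[int] = list(items or [])
        self.split: Optional[int] = None

    def shuffle(self) -> None:
        # Shuffle exactly once; self.split records where to cut.
        if self.split is None:
            random.shuffle(self.data)
            self.split = int(len(self.data) * self.training_subset)

    @property
    def training(self) -> List[int]:
        self.shuffle()
        return self.data[: self.split]

    @property
    def testing(self) -> List[int]:
        self.shuffle()
        return self.data[self.split :]


class Shuffling_Partition_67(Shuffling_Partition):
    # Overriding the class attribute is the entire subclass.
    training_subset = 0.67


p = Shuffling_Partition_67(range(100))
print(len(p.training), len(p.testing))  # 67 33
```

Because the split fraction lives in a class attribute, no method needs to change to try a different ratio.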
The two properties use Python list slicing to subdivide the raw SampleDict objects, and build useful TrainingKnownSample and TestingKnownSample objects from the raw data.
Because this depends on the random module, we can create reproducible results. By setting random.seed() to a fixed value, we can create a random, but predictable, collection of samples.
This works as follows:
>>> import random
>>> from model import Shuffling_SamplePartition_80
>>> from pprint import pprint
>>> data = [
... {
... "sepal_length": i + 0.1,
... "sepal_width": i + 0.2,
... "petal_length": i + 0.3,
... "petal_width": i + 0.4,
... "species": f"sample {i}",
... }
... for i in range(10)
... ]
>>> random.seed(42)
>>> ssp = Shuffling_SamplePartition_80(data)
>>> pprint(ssp.testing)
[TestingKnownSample(sepal_length=0.1, sepal_width=0.2, petal_length=0.3, petal_width=0.4, species='sample 0', classification=None, ),
 TestingKnownSample(sepal_length=1.1, sepal_width=1.2, petal_length=1.3, petal_width=1.4, species='sample 1', classification=None, )]
With a random seed of 42, we always get the same two samples in the testing set.
This allows us to build the initial list in a variety of ways. We can, for example, append data items to an empty list like this:
ssp = Shuffling_SamplePartition_80()
for row in data:
    ssp.append(row)
The SamplePartition subclass of list will inherit all the methods of the parent class. This allows us to make changes to the internal state of the list prior to extracting the training and testing subsets.
Once we ask for testing or training data, however, any further changes to the list can appear peculiar. The shuffle is not redone, and the split is not recomputed. This means that appending another SampleDict object will lead to the creation of a new testing sample. While this behavior is very simple, it may not be desirable.
We could override the built-in append() method to set self.split back to None when a new item is appended to the list. We'd have to do the same thing for methods like extend(), remove(), pop(), insert(), and clear() as well. Among the special methods, __delitem__() will remove an item from the list, changing the number of items, so this, too, would need to reset self.split.
While this is a large number of methods to override, the overrides are not terribly complex.
def extend(self, items: Iterable["SampleDict"]) -> None:
    self.split = None
    super().extend(items)

def pop(self, index: int) -> SampleDict:
    self.split = None
    return super().pop(index)
There will be a long list of methods with definitions like this. Setting the self.split value back to None with each state change assures that the next time there's a request for the testing or training property, the list is shuffled and split again.
We could define a decorator that embodies this aspect. Extracting a common aspect from a number of methods can make the class definition slightly simpler and much more consistent. Here's the decorator with a sketchy set of type hints.
>>> from typing import Callable, Any, Iterable
>>> from functools import wraps
>>> def unsplit(method: Callable[..., Any]) -> Callable[..., Any]:
...     @wraps(method)
...     def concrete_unsplit_method(self, *arg, **kwarg):
...         self.split = None
...         return method(self, *arg, **kwarg)
...     return concrete_unsplit_method
...
Here's the class using the @unsplit decorator.

>>> from model import Shuffling_SamplePartition_80, SampleDict
>>> class Extendable_SamplePartition(Shuffling_SamplePartition_80):
...     @unsplit
...     def append(self, item: SampleDict) -> None:
...         super().append(item)
...     @unsplit
...     def extend(self, item_iterable: Iterable[SampleDict]) -> None:
...         super().extend(item_iterable)
...
We've only decorated a few methods. Each method that changes the state of the list would need similar decoration.
Here's an example of using this class.
>>> samples = [
... {
... "sepal_length": i + 0.1,
... "sepal_width": i + 0.2,
... "petal_length": i + 0.3,
... "petal_width": i + 0.4,
... "species": f"sample {i}",
... }
... for i in range(10)
... ]
>>> x = Extendable_SamplePartition()
>>> x.append(samples[0])
>>> len(x.training)
0
>>> len(x.testing)
1
>>> x.append(samples[0])
>>> len(x.training)
1
>>> len(x.testing)
1
In this application, however, where the only client is the TrainingData class, do we really need to tweak all the methods of List[SampleDict] so they also set the split attribute to None? Perhaps not. An alternative is to describe in the docstring the consequences of changing the state of a Shuffling_SamplePartition_80 object after asking for testing or training values.
We have an alternative to splitting a single list after it's been built. Instead of extending the list class to provide two sub-lists, we can reframe the problem slightly. Let's define a subclass of SamplePartition that makes a choice between testing and training on each SampleDict object that is presented via initialization, append(), or extend().
Here's an abstraction that summarizes our thinking on this. We'll have three methods for building a list. We'll have two properties that will provide the training and testing sets. We don't inherit from List because we're not providing any other list-like features. Not even __len__().
class DealingPartition:
    def __init__(self, items: Optional[Iterable["SampleDict"]] = None) -> None:
        pass

    def append(self, item: "SampleDict") -> None:
        pass

    def extend(self, items: Iterable["SampleDict"]) -> None:
        pass

    @property
    def training(self) -> List[TrainingKnownSample]:
        raise NotImplementedError

    @property
    def testing(self) -> List[TestingKnownSample]:
        raise NotImplementedError
Here's how we can extend this to wrap two internal collections. We'll break this into two parts. First, building the collections, then the properties to expose the values of the collections.
class Counting_DealingPartition_80(DealingPartition):
    training_subset = (8, 10)  # 8/10 == 80%
    # 2/3 and 1/2 are common choices.

    def __init__(self, items: Optional[Iterable["SampleDict"]] = None) -> None:
        self.counter = 0
        self._training: List[TrainingKnownSample] = []
        self._testing: List[TestingKnownSample] = []
        if items:
            self.extend(items)

    def extend(self, items: Iterable["SampleDict"]) -> None:
        for item in items:
            self.append(item)

    def append(self, item: "SampleDict") -> None:
        n, d = self.training_subset
        if self.counter % d < n:
            self._training.append(TrainingKnownSample(**item))
        else:
            self._testing.append(TestingKnownSample(**item))
        self.counter += 1
We've defined an initializer that sets the initial state of two empty collections. Then it uses the extend() method to build the collections from a source iterable.

The extend() method relies on the append() method to allocate a SampleDict instance to either testing or training.

The append() method actually does all the work. It counts the items, and makes a decision based on some modulus arithmetic. The training subset is defined as a fraction; we've shown it as a tuple, (8, 10), with a comment suggesting this means 8/10, or 80% training, with the balance testing. For a given counter value, \(c\), if \(c \bmod 10 < 8\), we'll call it training; if \(c \bmod 10 \geq 8\), we'll call it testing.
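To see the modulus arithmetic in isolation, here's a minimal sketch that deals counter values 0 through 9 into two pools with the (8, 10) split (plain integers stand in for the sample objects):

```python
training_subset = (8, 10)  # 8 of every 10 consecutive items go to training

training, testing = [], []
n, d = training_subset
for counter in range(10):
    # The first n counters of every d-item cycle are training data.
    if counter % d < n:
        training.append(counter)
    else:
        testing.append(counter)

print(training)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(testing)   # [8, 9]
```

The pattern repeats every ten items, so the 80/20 ratio holds at every point during incremental loading.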
Here are the remaining two methods. These expose the values of the two internal list objects. To an extent, these are useless in this context. It's common in object-oriented Python to simply name the two internal collections self.training and self.testing and drop these property methods.
@property
def training(self) -> List[TrainingKnownSample]:
    return self._training

@property
def testing(self) -> List[TestingKnownSample]:
    return self._testing
This class also partitions the data. It doesn't rely on a random number generator. It can incrementally provide results at any stage of the operation.
Because we were careful in our design, the various partitioning algorithms all have a very similar interface. They are not all subclasses of a common superclass, but they are functionally identical and we can use any of them.
The use cases are all essentially identical. Here's how we'd implement the load() method of the TrainingData class.

def load(self, raw_data_iter: Iterator[Dict[str, str]]) -> None:
    """Extract TestingKnownSample and TrainingKnownSample from raw data"""
    partitioner = Counting_DealingPartition_80(raw_data_iter)
    self.training = partitioner.training
    self.testing = partitioner.testing
    self.uploaded = datetime.datetime.utcnow()
We can then replace the Counting_DealingPartition_80 class with any of the alternative definitions to experiment with different ways to split the data.
A feature that's sometimes needed is to manage the remote possibility of duplicates between the testing and training sets. We don't want to test with data values that are exact matches of training values.
This is a potentially severe complication to our simple shuffling and dealing algorithms.
We do not want to simply compare each input sample against all other input samples, if we can avoid it. This is an \(O(n^2)\) operation. For a few thousand rows it becomes millions of comparisons.
We don't really have to locate all duplicates. We only need to be sure that no testing item is a simple duplicate of a training item. This is a slight simplification that lets us compare items in the testing set to items in the training set. Since the testing set is smaller, the work is somewhat reduced.
We can allow the training items to be duplicates of each other. This tends to reflect real-world clustering of data.
Before allowing an item to be used for testing, we need to compare it to all the items in the training set. If it's a duplicate, it belongs in the training set.
We can avoid a huge search cost by partitioning the training data into reasonably-sized buckets. A common technique for doing this is to compute a hash of the values, and partition the hashes into buckets. We can then examine the items in a hash bucket to check for duplicates.
This is relatively easy to do in Python because most objects provide an internal hash value. The built-in hash() function exposes this. We can leverage this via the modulus operator, %, to partition objects into a small number of buckets. The code looks like this:
bucket.setdefault(hash(obj) % n, []).append(obj)
This appends an object, obj, into one of the n distinct buckets in a mapping named bucket.
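Here's the bucketing idiom in a runnable form, with strings standing in for our sample objects:

```python
buckets: dict = {}
n = 8  # a small number of buckets, for illustration
for obj in ["alpha", "beta", "gamma", "delta"]:
    # setdefault() creates the bucket's list on first use.
    buckets.setdefault(hash(obj) % n, []).append(obj)

# Every object lands in exactly one bucket, keyed 0 through n-1.
print(sum(len(items) for items in buckets.values()))  # 4
print(all(0 <= key < n for key in buckets))           # True
```

Which bucket each string lands in varies from run to run because string hashing is randomized, but the partitioning property always holds.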
For mutable objects, there is no hash. This means none of the objects we've defined have hash values. The individual float values for each attribute have hashes, but the composite object can't be hashed because it's mutable.
For immutable objects, like a tuple or a float, the hash reflects the actual value. In hash-based collections like dict and set, Python compares hash values before two items are compared in detail for equality. When a set looks up a str key like "hello", the hash values are compared before the individual characters are compared. If the hash values aren't equal, there's no need to look at the details within the str objects.
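The practical consequence is the hash/equality contract: objects that compare as equal must have equal hashes, which is what lets hash-based containers compare hashes first. A quick check with built-in immutable types:

```python
a = (1, 2, 3)
b = (1.0, 2.0, 3.0)

# Tuples compare element by element; 1 == 1.0, so a == b.
print(a == b)              # True
# Because a == b, Python guarantees their hashes match as well.
print(hash(a) == hash(b))  # True
```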
In our case study, we don't want to define the special __hash__() method for our SampleDict items. They're mutable dictionary objects, and offering a __hash__() method is misleading.

When we can compute a temporary hash, we need to be clear on how it changes. We also need to use a method name other than the special method name of __hash__().
We'll add a hash() method to the Sample class so that all of the samples can produce a hash. We can use this to put test samples into buckets to check for duplicates. The algorithm we'll use is to sum the hashes of the "interesting" values -- the values we'd compare to test for equality -- modulo a system-wide hash parameter. Generally, sys.hash_info.modulus has the modulus value we need to use. The method looks like this:
def hash(self) -> int:
    return sum(
        [
            hash(self.sepal_length),
            hash(self.sepal_width),
            hash(self.petal_length),
            hash(self.petal_width),
        ]
    ) % sys.hash_info.modulus
For our purposes a simple sum is good enough to distinguish most sample values. Consider this pair of samples.

x = Sample(1, 2, 3, 4)
y = Sample(2, 1, 4, 3)

The two samples have the same hash value, yet the samples are not duplicates. This is acceptable because the hash value is not the only test for equality.
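We can verify the collision with a small standalone function that mirrors the hash() method above (the function name here is hypothetical):

```python
import sys


def sample_hash(*values: float) -> int:
    # Sum of the individual hashes, modulo the platform hash modulus,
    # mirroring the Sample.hash() method in the text.
    return sum(hash(v) for v in values) % sys.hash_info.modulus


x = sample_hash(1, 2, 3, 4)
y = sample_hash(2, 1, 4, 3)
print(x == y)  # True: the same values in a different order collide
```

A collision only means the two items share a bucket; the detailed equality check still tells them apart.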
We'll need to modify our DealingPartition classes to exclude duplicates from the testing set. There are a few places we can inject this additional feature.

One choice is to rewrite the append() method to check for duplicates. This means also revising the class to use a mapping of buckets instead of a simple List[TrainingKnownSample]. This seems like it would make the DealingPartition extremely complex by including two nearly unrelated aspects in a single class.

Another choice is to provide a class to implement Collection[TrainingKnownSample]. This would use a mapping of buckets and implement a fast __contains__() check based on hashing. We can then rewrite the DealingPartition class to use this special fast-search container implementation.
The typing module provides some generic type descriptions. The collections.abc module also provides generic type descriptions. Prior to Python 3.9, this created an unpleasant conundrum. The collections.abc definitions provided a number of built-in methods, making them ideal as base classes. The typing module definitions were the ones mypy could use. During the transition to Python 3.9, the following kind of thing is done when creating new kinds of collections.
import collections.abc
import typing
import sys

if sys.version_info >= (3, 9):
    BucketCollection = collections.abc.Collection[Sample]
else:
    BucketCollection = typing.Collection[Sample]
For new versions of Python, the collections.abc definitions can be used. For older versions of Python, a fallback to the typing.Collection definition is required.
The Collection type is a mixture of three separate protocols:

Container. This specifies a class with the __contains__() method to determine if an item is in the container.

Sized. A class with the __len__() method to report how many items are in the container.

Iterable. Classes with this mixin aspect must implement the __iter__() method to expose all the individual items.
These are defined as separate features so we can create collections that have some of these, but not all. We might, for example, have a complex graph of arcs and nodes where iteration doesn't make sense.
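To see the protocols separately, here's a hypothetical class that is Sized and a Container, but deliberately not Iterable:

```python
from collections.abc import Container, Sized


class NodeSet(Sized, Container):
    """Supports len() and the `in` operator, but not iteration."""

    def __init__(self, *names: str) -> None:
        self._names = set(names)

    def __len__(self) -> int:
        return len(self._names)

    def __contains__(self, name: object) -> bool:
        return name in self._names


ns = NodeSet("a", "b")
print(len(ns), "a" in ns)  # 2 True
```

Calling iter(ns) raises a TypeError, because the class implements neither __iter__() nor __getitem__(); membership and size are its only container-like features.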
Also, the Collection type is generic, and requires a type definition to specify what it is a collection of. In this case, we've made it a collection of Sample objects.
Our collection is a kind of composition. Inside the class, we'll use the collections.defaultdict to create a mapping from buckets to lists of samples. Here's the class definition:
class BucketedCollection(BucketCollection):
    def __init__(self, samples: Optional[Iterable[Sample]] = None) -> None:
        super().__init__()
        self.buckets: DefaultDict[int, List[Sample]] = collections.defaultdict(list)
        if samples:
            self.extend(samples)

    def extend(self, samples: Iterable[Sample]) -> None:
        for sample in samples:
            self.append(sample)

    def append(self, sample: Sample) -> None:
        b = sample.hash() % 128
        self.buckets[b].append(sample)

    def __contains__(self, target: Any) -> bool:
        b = cast(Sample, target).hash() % 128
        return any(existing == target for existing in self.buckets[b])

    def __len__(self) -> int:
        return sum(len(b) for b in self.buckets.values())

    def __iter__(self) -> Iterator[Sample]:
        return itertools.chain(*self.buckets.values())
The construction of a BucketedCollection can be done any of three different ways:

a = BucketedCollection(some iterable source of Samples)
b = BucketedCollection(); b.extend(some iterable source of Samples)
c = BucketedCollection(); for item in some list: c.append(item)

All three use cases eventually rely on the append() method. This method will compute a bucket using the hash() method of a Sample instance, and append the item into the bucket's list of Sample instances.
The __contains__() method implements the in operator. When we write Sample(1, 2, 3, 4) in b, this will evaluate __contains__(). The implementation locates a bucket and checks the items in the bucket. Since there are 128 buckets, this examines less than 1% of the data.
The __len__() method returns the current size by counting items in each bucket. The __iter__() method iterates through the items. We use itertools.chain() to return all of the values from all of the buckets. The order depends on the hash values, and is very difficult to predict, making unit testing potentially annoyingly complex.
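Here's a self-contained miniature of the same idea, using hashable tuples in place of Sample objects, to show the three protocol methods working together (the class name is hypothetical):

```python
import itertools
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


class MiniBuckets:
    """A toy bucketed container in the style of BucketedCollection."""

    def __init__(self) -> None:
        self.buckets: Dict[int, List[Tuple[int, ...]]] = defaultdict(list)

    def append(self, item: Tuple[int, ...]) -> None:
        self.buckets[hash(item) % 16].append(item)

    def __contains__(self, target: Tuple[int, ...]) -> bool:
        # Only the target's own bucket needs to be searched.
        return target in self.buckets[hash(target) % 16]

    def __len__(self) -> int:
        return sum(len(b) for b in self.buckets.values())

    def __iter__(self) -> Iterator[Tuple[int, ...]]:
        return itertools.chain(*self.buckets.values())


mb = MiniBuckets()
mb.append((1, 2))
mb.append((3, 4))
print((1, 2) in mb, (9, 9) in mb, len(mb))  # True False 2
```

The membership test never scans the whole collection; it hashes the target, picks one of the 16 buckets, and compares only within it.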
The DealingPartition subclasses will need a few changes to use this new type of container.
def __init__(self, items: Optional[Iterable["SampleDict"]] = None) -> None:
    self.counter = 0
    self._training: BucketedCollection = BucketedCollection()
    self._testing: List[TestingKnownSample] = []
    if items:
        self.extend(items)
The self._training collection needs the fast __contains__() check, so we'll replace the original List[TrainingKnownSample] with a BucketedCollection. The BucketedCollection is not generic with respect to the items in the collection; it's bound to Sample. That's why it doesn't need a parameter the way List requires a specific class of items.
Additionally, the append() method needs to reject candidate testing objects that would duplicate existing training objects.
def append(self, item: "SampleDict") -> None:
    n, d = self.training_subset
    if self.counter % d < n:
        self._training.append(TrainingKnownSample(**item))
    else:
        candidate = TestingKnownSample(**item)
        if candidate in self._training:
            # Duplicate, create a Training sample from it
            self._training.append(TrainingKnownSample(**item))
        else:
            self._testing.append(candidate)
    self.counter += 1
We'll create a TestingKnownSample from the raw data. If this is already in the training set, it's not a good testing candidate, and needs to be put into the training set. If the candidate is not in the training set, it's a perfectly good test case.
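The routing logic can be sketched as a standalone function, with integers standing in for samples (the function is hypothetical, for illustration only):

```python
from typing import Iterable, List, Tuple


def deal(items: Iterable[int], training_subset=(8, 10)) -> Tuple[List[int], List[int]]:
    """Deal items 8-to-2, diverting would-be test duplicates into training."""
    training: List[int] = []
    testing: List[int] = []
    n, d = training_subset
    for counter, item in enumerate(items):
        if counter % d < n:
            training.append(item)
        elif item in training:
            # A duplicate of a training item is not a fair test case.
            training.append(item)
        else:
            testing.append(item)
    return training, testing


tr, te = deal([0, 1, 2, 3, 4, 5, 6, 7, 0, 9])
print(te)  # [9] -- the repeated 0 was diverted into training
print(tr)  # [0, 1, 2, 3, 4, 5, 6, 7, 0]
```

The linear `item in training` scan here is what the BucketedCollection replaces with a fast hashed lookup.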
The final change is to the property that emits the training data.
@property
def training(self) -> List[TrainingKnownSample]:
    return list(self._training)
We'll make use of the __iter__() method of the BucketedCollection to create a List[TrainingKnownSample] object that can be returned. Each time dealer.training is used, it will create a new List object. This is, generally, less than ideal. For large sets of data, it can consume a lot of memory with nearly identical objects.
It seems like a better approach is to change -- fundamentally -- the objects involved. Here are the changes we need to make to both the SamplePartition and DealingPartition families of classes. The training property should be Iterator[TrainingKnownSample]. The testing property should be Iterator[TestingKnownSample].
This means that these partitioning strategies need to be the final resting places for our data. The partitioning classes can produce iterables for the two essential processing steps.
When testing a hyperparameter, the test() method of the Hyperparameter class will examine all of the items in the iterable collection of TestingKnownSample objects. The test() method does not need a list object; it only needs a source of test data.

When classifying a sample, the classify() method of the Hyperparameter class will examine all the items in the iterable collection of TrainingKnownSample objects. This method doesn't really need a list object; it only needs a source of samples.
This final round of changes is left as an exercise for the reader. It requires some careful testing, which is the subject of Chapter 13.