While object-oriented programming is helpful for encapsulating features, it's not the only way to create flexible, expressive, and succinct application programs. Another design approach, called "functional programming," emphasizes functional design and function composition over object-oriented design.
In Python, functional design often involves using a few object-oriented techniques. This is one of the beauties of Python: being able to choose an appropriate set of design tools to address the problem effectively.
We often depict object-oriented designs with the classes and their various associations. For functional design, we're interested in functions to transform objects. This tends to follow mathematical practices very closely, and a functional design often looks like ordinary math.
This often includes a shift in viewpoint from the data, as defined by Python classes, to the processing, as defined by Python functions.
The initial analysis from Chapter One identified three distinct processes. The context diagram looked like this:
We can think of these as functions to build some collections of sample data.
A function based on the "Provide Training Data" use case would
transform source data into two collections of samples.
A training set and a testing set. We'd like to avoid placing items in the testing set
that are exact matches for items in the training set, creating some constraints
on this process. We can think of this is a mapping from a KnownSample
to
a TestingKnownSample
or a TrainingKnownSample
.
A function based on the "Set Paramawters and Test Classifier" use case would
transform a Hyperparameter (the k value and the distance algorithm) and the testing set of samples into
a quality score. We can think of this as a mapping from TestingKnownSample
to
a correct or incorrect classification, and a reduction to a single value showing the number
correct out of the number of tests.
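As a sketch of that mapping and reduction (the classify() method and the species attribute are hypothetical stand-ins for details developed later in the case study):

def score(h: Hyperparameter, testing: List[TestingKnownSample]) -> float:
    """Map each test sample to correct/incorrect; reduce to one fraction."""
    # h.classify() and t.species are assumptions for this sketch.
    pass_count = sum(1 for t in testing if h.classify(t) == t.species)
    return pass_count / len(testing)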
A function based on the "Make Classification Request" use case would transform a Hyperparameter (the k value and the distance algorithm) and a single samples into a classification result.
We'll look at each of these functions separately. We can build an alternative model for our application using these processing steps to define a functional approach.
In effect, splitting the data into two subsets can be defined around some filter functions. We have a pair of functions, \(e(s_i)\) and \(r(s_i)\), that decide if a sample, \(s_i\), is for testing or training. The results of these functions are used to partition the samples into two subsets.
It's simpler if these two functions are exclusive, \(e(s_i) = \neg r(s_i)\). This means we only need to define one of the two.
$$ \begin{align} R &= \{s_i \mid s_i \in S \wedge r(s_i)\} \\ E &= \{s_i \mid s_i \in S \wedge \neg r(s_i)\} \end{align} $$
The training set is all items, \(s_i\), from the source data, S, where \(r(s_i)\) is true. The testing set is all the items from the source where \(r(s_i)\) is false.
This concept is a kind of "comprehension" or "builder" for a set of samples.
We can translate the mathematical comprehension into a Python list comprehension in a fairly direct way.
We'll rename the function \(r(s_i)\) as training().
We'll also expose the index value, i, as a separate parameter to this function.
def training(s: Sample, i: int) -> bool:
    pass  # A placeholder; concrete rules like training_80() are defined below
training_samples = [
    TrainingKnownSample(s)
    for i, s in enumerate(samples)
    if training(s, i)]

test_samples = [
    TestingKnownSample(s)
    for i, s in enumerate(samples)
    if not training(s, i)]
The Python comprehension is a generator expression wrapped in a list constructor. The [] on the outside creates a list. The generator expression has three parts: an expression, a for clause, and an if condition. The for clause provides the values. The if condition filters the values. The expression, TrainingKnownSample(s), determines what is accumulated into the list.
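Annotating the three parts of the first comprehension (the same code as above, with comments added):

training_samples = [
    TrainingKnownSample(s)          # the expression: what is accumulated
    for i, s in enumerate(samples)  # the for clause: provides the values
    if training(s, i)               # the if condition: filters the values
]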
We've composed TrainingKnownSample objects from the bare KnownSample instances. This leverages the composition-based design from Chapter Seven.
We can use the index value, an integer, to partition the data. The remainder after division, the modulus, can be used to break data into subsets. The value of i % 5, for example, is a value from 0 to 4. If we use i % 5 == 0 as test data, 20% of the values will be selected. When i % 5 != 0, this is the remaining 80% of the data, used for training.
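As a quick sanity check (standalone, not part of the case study code), we can confirm those percentages:

indices = range(100)
testing = [i for i in indices if i % 5 == 0]   # 0, 5, 10, ..., 95
training = [i for i in indices if i % 5 != 0]
assert len(testing) == 20 and len(training) == 80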
The following does the same thing without the [] wrapper. We've used the list() function to consume items from a generator expression and build a list. The two forms are equivalent; list() is wordier than [].
test_samples = list(
    TestingKnownSample(s)
    for i, s in enumerate(samples)
    if not training(s, i))
In Chapter Two, we wrestled with a number of ways of handling the state change that goes with classification. There are two similar processes: one for KnownSample objects that will be used for testing, and one for UnknownSample objects being classified by users.
The process diagrams are simple-looking, but conceal an important question.
Here's the user's classification of an unknown sample.
We can borrow this (with a few tiny class changes) and use it for testing. Here's a better approach to handling classification for test purposes.
Keeping these processes nearly identical seems to be advantageous. Ideally, the same code can be used in both cases, reducing the overall complexity of the application.
As we consider different alternatives for the process view, we see corresponding changes in the logical view. Here's a revised view that treats these classes as immutable compositions. We've included notes to suggest when these objects are created during application processing. We've highlighted two classes requiring careful consideration.
The TestingKnownSample and the TrainingKnownSample classes have very minor differences. They don't introduce new attributes or methods. Here are the differences:

- TrainingKnownSample instances are never used for classification.
- TestingKnownSample and UnknownSample instances are used for classification and testing.

We'll create a ClassifiedKnownSample object from a TestingKnownSample object by repackaging the KnownSample instance into a new container. This creates a more consistent set of definitions.
The idea is that the classifier() method of the Hyperparameter class should work with objects of two classes, summarized by the type hint Union[TestingKnownSample, UnknownSample]. This kind of hint can help us spot application code that uses the classes incorrectly.
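A minimal sketch of how that hint appears on the method (the str return type, meaning a species name, is an assumption of this sketch; the method body is developed elsewhere in the case study):

from typing import Union

class Hyperparameter:
    def classifier(
        self, unknown: Union[TestingKnownSample, UnknownSample]
    ) -> str:
        ...  # body omitted in this sketch

With this hint in place, mypy will flag a call like h.classifier(some_training_sample) as an error.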
This diagram seems to capture the ways in which these objects are used. Having these details available can lead to more detailed type hints, which can clarify our intent.
The partition() function

We can define multiple versions of the training() function to divide our data into 80/20, 75/25, or 67/33 splits.
def training_80(s: KnownSample, i: int) -> bool:
    return i % 5 != 0

def training_75(s: KnownSample, i: int) -> bool:
    return i % 4 != 0

def training_67(s: KnownSample, i: int) -> bool:
    return i % 3 != 0
Here's a function, partition(), which takes one of the training_xx() functions as an argument. The training_xx() function is applied to a sample to decide if it's training data or not.
TrainingList = List[TrainingKnownSample]
TestingList = List[TestingKnownSample]

def partition(
    samples: Iterable[KnownSample],
    rule: Callable[[KnownSample, int], bool]
) -> Tuple[TrainingList, TestingList]:
    training_samples = [
        TrainingKnownSample(s) for i, s in enumerate(samples) if rule(s, i)
    ]
    test_samples = [
        TestingKnownSample(s) for i, s in enumerate(samples) if not rule(s, i)
    ]
    return training_samples, test_samples
We've built a "higher-order" function that takes another function as an argument value. This is a very cool feature of functional programming that is an integral part of Python.
This partition() function builds two lists from a source of data and a function. This covers the simple case, where we don't care about introducing values into the testing list that are duplicates of values in the training list.
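Hypothetical usage, assuming samples is an iterable of KnownSample objects loaded elsewhere:

training, testing = partition(samples, rule=training_75)
# Roughly 75% of the samples land in training, 25% in testing.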
We'd like to avoid examining the data twice. It's not that this is particularly costly. But we may have a generator expression creating the raw data in the first place. We can only consume a generator once, and we'd like to avoid creating multiple copies of a large set of data.
Also, we'd like to avoid assigning test values that happen to be exact matches for training values. This turns into a more complex problem. We'll defer this until Chapter Ten.
We can create multiple pools of samples in one pass through the data. There are several approaches; we'll show one that has simpler type hints.
The idea is to create two empty list objects, one for training, the other for testing. We can then assign specific type hints to each list, and leverage mypy to be sure we use the lists appropriately.
def partition_1(
    samples: Iterable[KnownSample], rule: Callable[[KnownSample, int], bool]
) -> Tuple[TrainingList, TestingList]:
    """One pass through the source data to create testing and training pools."""
    training: TrainingList = []
    testing: TestingList = []
    for i, s in enumerate(samples):
        training_use = rule(s, i)
        if training_use:
            training.append(TrainingKnownSample(s))
        else:
            testing.append(TestingKnownSample(s))
    return training, testing
In this partition_1() function, we've used the rule function to determine if the data will be used for training. We expect one of the training_xx() functions defined earlier in this case study to be provided as the argument for the rule parameter. Based on this result, we can create an instance of the appropriate class, and then assign the sample to the appropriate list.
This example doesn't check for duplicates between testing samples and training samples. We don't want any test samples that are exact matches for training samples. We can see where that check can be inserted: between when the training_use variable is assigned and when the final appends are done to either list. If training_use is False and the item already exists in the training set, this item, too, must be used for training.
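Here's a sketch of where that check might go (it assumes KnownSample supports == comparison and that a TrainingKnownSample exposes the wrapped sample as a sample attribute; neither detail is settled here, and the full solution waits for Chapter Ten):

for i, s in enumerate(samples):
    training_use = rule(s, i)
    # If a would-be test item exactly matches a training item seen
    # so far, force it into the training pool instead.
    if not training_use and any(s == t.sample for t in training):
        training_use = True
    if training_use:
        training.append(TrainingKnownSample(s))
    else:
        testing.append(TestingKnownSample(s))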
We can refactor this algorithm slightly by performing the type conversions later in the process. This lets us create a dictionary of various "pools" of KnownSample objects based on the intended usage. So far, we only have two pools: training, where a training_xx() rule is true, and testing.
def partition_1p(
    samples: Iterable[KnownSample], rule: Callable[[KnownSample, int], bool]
) -> Tuple[TrainingList, TestingList]:
    """One pass through the source data to create testing and training pools."""
    pools: Dict[bool, List[KnownSample]] = collections.defaultdict(list)
    partition = ((rule(s, i), s) for i, s in enumerate(samples))
    for usage_pool, sample in partition:
        pools[usage_pool].append(sample)
    training = [TrainingKnownSample(s) for s in pools[True]]
    testing = [TestingKnownSample(s) for s in pools[False]]
    return training, testing
The defaultdict object will map boolean values to List[KnownSample] objects. We provided the list function to set a default value when a key is accessed that did not previously exist. We only anticipate two keys, so this could also have been written as pools: Dict[bool, List[KnownSample]] = {True: [], False: []}.
The partitioning starts by creating a generator expression to apply the given rule function to each sample. The result is a two-tuple; we could write an explicit type hint of Tuple[bool, KnownSample].
This generator expression is lazy, and doesn't compute anything until the values are consumed.
The for statement consumes values from the generator, appending each sample to the appropriate pool. As values are consumed, the generator expression is evaluated, producing a stream of two-tuples with the pool (a boolean value) and the KnownSample instance.
Once the KnownSample objects have been partitioned, we can wrap them up as instances of the TrainingKnownSample class or the TestingKnownSample class. The type hints seem simpler in this example than in the previous version.
This doesn't actually create two copies of the data. References to the KnownSample objects are collected into a dictionary. From these, the two lists of TrainingKnownSample and TestingKnownSample objects are created. Each of the derived objects contains a reference to the original KnownSample object. The structure of the temporary dictionary represents some memory overhead.
This example suffers from a complication: it's not perfectly clear how to prevent creating test samples that are exact matches for training samples. It seems like an additional if statement inside the for statement could check for an item with a usage_pool of False (i.e., a testing item) that also exists in pools[True] (i.e., the training items). This is quite a bit of extra complexity.
Rather than add the additional steps here, we'll wait for Chapter Ten and revise the algorithm to handle duplicate removal in a way that avoids too many special cases or extra if statements.