Case Study for Chapter Eight, Functional Techniques

While object-oriented programming is helpful for encapsulating features, it's not the only way to create flexible, expressive, and succinct application programs. Another design approach, called "functional programming," emphasizes functional design and function composition over object-oriented design.

In Python, functional design often involves using a few object-oriented techniques. This is one of the beauties of Python: being able to choose an appropriate set of design tools to address the problem effectively.

We often depict object-oriented designs with the classes and their various associations. For functional design, we're interested in functions to transform objects. This tends to follow mathematical practices very closely, and a functional design often looks like ordinary math.

This often includes a shift in viewpoint from the data, as defined by Python classes, to the processing, as defined by Python functions.

Processing Overview

The initial analysis from Chapter One identified three distinct processes. The context diagram looked like this:

uml diagram

We can think of these as functions to build some collections of sample data.

  1. A function based on the "Provide Training Data" use case would transform source data into two collections of samples: a training set and a testing set. We'd like to avoid placing items in the testing set that are exact matches for items in the training set, creating some constraints on this process. We can think of this as a mapping from a KnownSample to a TestingKnownSample or a TrainingKnownSample.

  2. A function based on the "Set Parameters and Test Classifier" use case would transform a Hyperparameter (the k value and the distance algorithm) and the testing set of samples into a quality score. We can think of this as a mapping from each TestingKnownSample to a correct or incorrect classification, and a reduction to a single value showing the number correct out of the number of tests.

  3. A function based on the "Make Classification Request" use case would transform a Hyperparameter (the k value and the distance algorithm) and a single sample into a classification result.

We'll look at each of these functions separately. We can build an alternative model for our application using these processing steps to define a functional approach.
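One way to summarize this functional view is as a set of type-hinted signatures. The following is only a sketch: the function names and exact signatures are assumptions for illustration, the bodies are deferred, and the sample classes are the ones from Chapter Seven.

from __future__ import annotations  # hints are not evaluated, so the case study classes need not be imported here
from typing import Iterable, List, Tuple

def provide_training_data(
    source: Iterable[KnownSample],
) -> Tuple[List[TrainingKnownSample], List[TestingKnownSample]]:
    """Provide Training Data: split the source into training and testing collections."""
    ...

def test_classifier(
    hyperparameter: Hyperparameter, testing: List[TestingKnownSample]
) -> float:
    """Set Parameters and Test Classifier: reduce the test results to a quality score."""
    ...

def make_classification(
    hyperparameter: Hyperparameter, sample: UnknownSample
) -> str:
    """Make Classification Request: classify a single unknown sample."""
    ...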

Splitting the Data

In effect, splitting the data into two subsets can be defined around some filter functions. We have a pair of functions, \(e(s_i)\) and \(r(s_i)\), that decide if a sample, \(s_i\), is for testing or training. The results of these functions are used to partition the samples into two subsets.

It's simpler if these two functions are exclusive, \(e(s_i) = \neg r(s_i)\). This means we only need to define one of the two.

$$ \begin{align} R &= \{s_i \mid s_i \in S \wedge r(s_i)\} \\ E &= \{s_i \mid s_i \in S \wedge \neg r(s_i)\} \end{align} $$

The training set is all items, \(s_i\), from the source data, S, where \(r(s_i)\) is true. The testing set is all the items from the source where \(r(s_i)\) is false.

This concept is a kind of "comprehension" or "builder" for a set of samples. We can translate the mathematical comprehension into a Python list comprehension in a fairly direct way. We'll rename the function \(r(s_i)\) as training(). We'll also expose the index value, i, as a separate parameter to this function.

def training(s: Sample, i: int) -> bool:
    # A placeholder rule; concrete versions such as training_80() are defined below.
    pass

training_samples = [
    TrainingKnownSample(s) 
    for i, s in enumerate(samples) 
    if training(s, i)]

test_samples = [
    TestingKnownSample(s) 
    for i, s in enumerate(samples) 
    if not training(s, i)]

The Python comprehension is essentially a generator expression wrapped in a list constructor. The [] on the outside creates a list. The generator expression has three parts: a leading expression, a for clause, and an if condition. The for clause provides the values. The if condition filters the values. The leading expression, TrainingKnownSample(s), determines what is accumulated into the list.

We've composed TrainingKnownSample objects from the bare KnownSample instances. This leverages the composition-based design from Chapter Seven.

We can use the index value, an integer, to partition the data. The remainder after division, the modulus, can be used to break data into subsets. The value of i % 5, for example, is a value from 0 to 4. If we use i % 5 == 0 as test data, 20% of the values will be selected. When i % 5 != 0, this is the remaining 80% of the data, used for training.
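As a quick check of this arithmetic, here's what the first ten index values look like:

test_indices = [i for i in range(10) if i % 5 == 0]      # [0, 5] -- 20% of the indices
training_indices = [i for i in range(10) if i % 5 != 0]  # [1, 2, 3, 4, 6, 7, 8, 9] -- the other 80%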

The following is the same comprehension without the [] wrapper: a generator expression passed to the list() function, which consumes items from the generator and builds a list. The two forms do the same thing; list() is a little wordier than [].

test_samples = list(
    TestingKnownSample(s) 
    for i, s in enumerate(samples) 
    if not training(s, i))

Rethinking classification

In Chapter Two, we wrestled with a number of ways of handling the state change that goes with classification. There are two similar processes: one for KnownSample objects that will be used for testing, and one for UnknownSample objects being classified by users. The process diagrams are simple-looking, but they conceal an important question.

Here's the user's classification of an unknown sample.

uml diagram

We can borrow this (with a few tiny class changes) and use it for testing. Here's a better approach to handling classification for test purposes.

uml diagram

Keeping these processes nearly identical seems to be advantageous. Ideally, the same code can be used in both cases, reducing the overall complexity of the application.

As we consider the different alternatives for the process view, we find they lead to changes in the logical view. Here's a revised view that treats these classes as immutable compositions. We've included notes to suggest when these objects are created during application processing. We've highlighted two classes requiring careful consideration.

uml diagram

The TestingKnownSample and the TrainingKnownSample classes have very minor differences. Neither introduces new attributes or methods; the difference lies in how the objects are used. A TestingKnownSample is handed to the classifier so the result can be compared with the known species, while a TrainingKnownSample is never classified.

The idea is that the classifier() method of the Hyperparameter class should work with objects of either of two classes, summarized by the type hint Union[TestingKnownSample, UnknownSample]. This kind of hint can help us spot application code that uses the classes incorrectly.

This diagram seems to capture the ways in which these objects are used. Having these details available can lead to more detailed type hints that clarify our intent.
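As a sketch of what that hint might look like inside the Hyperparameter class (the str result for the species name is an assumption here, and the quoted class names simply avoid import-order issues):

from typing import Union

class Hyperparameter:
    def classifier(
        self, sample: Union["TestingKnownSample", "UnknownSample"]
    ) -> str:
        """Classify one sample; a TrainingKnownSample is deliberately not accepted."""
        ...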

The partition() function

We can define multiple versions of the training() function to divide our data into 80/20, 75/25, or 67/33 splits.

def training_80(s: KnownSample, i: int) -> bool:
    return i % 5 != 0

def training_75(s: KnownSample, i: int) -> bool:
    return i % 4 != 0

def training_67(s: KnownSample, i: int) -> bool:
    return i % 3 != 0

Here's a function, partition(), which takes one of the training_xx() functions as an argument. The training_xx() function is applied to each sample to decide if it's training data or not.

from typing import Callable, Iterable, List, Tuple

TrainingList = List[TrainingKnownSample]
TestingList = List[TestingKnownSample]

def partition(
    samples: Iterable[KnownSample], 
    rule: Callable[[KnownSample, int], bool]
) -> Tuple[TrainingList, TestingList]:

    training_samples = [
        TrainingKnownSample(s) for i, s in enumerate(samples) if rule(s, i)
    ]

    test_samples = [
        TestingKnownSample(s) for i, s in enumerate(samples) if not rule(s, i)
    ]

    return training_samples, test_samples

We've built a "higher-order" function that takes another function as an argument value. This is a very cool feature of functional programming that is an integral part of Python.

This partition() function builds two lists from a source of data and a rule function. It covers the simple case, where we don't care about introducing values into the testing list that are duplicates of values in the training list.
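Using it looks something like this, assuming samples is a list (or other iterable) of KnownSample instances loaded elsewhere:

training_data, testing_data = partition(samples, training_75)
# About 75% of the samples are wrapped as TrainingKnownSample objects;
# the remaining 25% are wrapped as TestingKnownSample objects.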

We'd like to avoid examining the data twice. It's not that a second pass is particularly costly, but we may have a generator expression creating the raw data in the first place. We can only consume a generator once, so we'd like to avoid creating multiple copies of a large set of data.
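The one-shot nature of a generator is easy to demonstrate with a stand-in for the real sample source:

samples_gen = (n for n in range(3))  # stands in for a generator of KnownSample objects
first_pass = list(samples_gen)       # [0, 1, 2]
second_pass = list(samples_gen)      # [] -- the generator is already exhausted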

Also, we'd like to avoid assigning test values that happen to be exact matches for training values. This turns into a more complex problem. We'll defer this until Chapter Ten.

One-Pass Partitioning

We can create multiple pools of samples in one pass through the data. There are several approaches; we'll show one that has relatively simple type hints.

The idea is to create two empty list objects, one for training, the other for testing. We can then assign specific type hints to each list and leverage mypy to be sure we use the lists appropriately.

def partition_1(
    samples: Iterable[KnownSample], rule: Callable[[KnownSample, int], bool]
) -> Tuple[TrainingList, TestingList]:
    """One pass through the source data to create testing and training pools."""

    training: TrainingList = []
    testing: TestingList = []

    for i, s in enumerate(samples):
        training_use = rule(s, i)
        if training_use:
            training.append(TrainingKnownSample(s))
        else:
            testing.append(TestingKnownSample(s))

    return training, testing

In this partition_1() function, we've used the rule function to determine if the data will be used for training. We expect one of the training_xx() functions defined earlier in this case study to be provided as the argument for the rule parameter.

Based on this result, we can wrap the sample in the appropriate class and then append it to the appropriate list.

This example doesn't check for duplicates between testing samples and training samples. We don't want any test samples that are exact matches for training samples. We can see where that check could be inserted: between the assignment of the training_use variable and the final append to either list. If training_use is False and the item already exists in the training set, this item, too, must be used for training.
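We won't solve the problem here, but a minimal sketch of where such a check could go, assuming KnownSample supports equality comparison and that a TrainingKnownSample exposes the wrapped object as a sample attribute, might rewrite the loop in partition_1() like this:

    for i, s in enumerate(samples):
        training_use = rule(s, i)
        # Hypothetical extra check: a would-be test sample that duplicates an
        # existing training sample is forced back into the training pool.
        if not training_use and any(s == t.sample for t in training):
            training_use = True
        if training_use:
            training.append(TrainingKnownSample(s))
        else:
            testing.append(TestingKnownSample(s))

Scanning the whole training list for every candidate test sample is expensive, which is one more reason to defer the issue to Chapter Ten.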

We can refactor this algorithm slightly by performing the type conversions later in the process. This lets us create a dictionary of various "pools" of KnownSample objects based on the intended usage. So far, we only have two pools, training, where a training_xx() rule is true, and testing.

import collections
from typing import Dict

def partition_1p(
    samples: Iterable[KnownSample], rule: Callable[[KnownSample, int], bool]
) -> Tuple[TrainingList, TestingList]:
    """One pass through the source data to create testing and training pools."""

    pools: Dict[bool, List[KnownSample]] = collections.defaultdict(list)
    partition = ((rule(s, i), s) for i, s in enumerate(samples))
    for usage_pool, sample in partition:
        pools[usage_pool].append(sample)

    training = [TrainingKnownSample(s) for s in pools[True]]
    testing = [TestingKnownSample(s) for s in pools[False]]
    return training, testing

The defaultdict object will map Boolean values to List[KnownSample] objects. We provided the list function to create a default value when a key is accessed that did not previously exist. We only anticipate two keys, so this could also have been written as pools: Dict[bool, List[KnownSample]] = {True: [], False: []}.
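Here's a small demonstration of that default behavior (the strings stand in for KnownSample objects):

import collections

pools = collections.defaultdict(list)    # a missing key gets a fresh empty list
pools[True].append("a training sample")  # the key True is created on first access
pools[False].append("a testing sample")
# Since only the keys True and False ever occur, an explicit literal also works:
# pools = {True: [], False: []}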

The partitioning starts by creating a generator expression that applies the given rule function to each sample. The result is a two-tuple; we could write an explicit type hint of Tuple[bool, KnownSample]. This generator expression is lazy and doesn't compute anything until the values are consumed.

The for statement consumes values from the generator, appending each sample to the appropriate pool. As values are consumed, the generator expression is evaluated, producing a stream of two-tuples with the pool (a Boolean value) and the KnownSample instance.

Once the KnownSample objects have been partitioned, we can wrap them up as instances of the TrainingKnownSample class or the TestingKnownSample class. The type hints seem simpler in this example than in the previous version.

This doesn't actually create two copies of the data. References to the KnownSample objects are collected into a dictionary. From there, the two lists of TrainingKnownSample and TestingKnownSample objects are created. Each of the derived objects contains a reference to the original KnownSample object. The structure of the temporary dictionary represents some memory overhead.

This example suffers from a complication: it's not perfectly clear how to prevent creating test samples that are exact matches for training samples. An additional if statement inside the for statement could check for an item with a usage_pool of False (i.e., a testing item) that also exists in pools[True] (i.e., the training items). This is quite a bit of extra complexity.

Rather than add the additional steps here, we'll wait for Chapter Ten to revise the algorithm so that it handles duplicate removal without too many special cases or extra if statements.