Case Study for Chapter 1, Object-Oriented Design

This section will step through a few iterations of object-oriented design on a realistic example. We'll create a collection of diagrams using the Unified Modeling Language (UML) to help depict and summarize the software we're going to build.

We'll examine the problem using a technique called "4+1 Views". The views are:

  1. A logical view of the data entities, their static attributes, and their relationships.
  2. A process view that describes how the data will be processed.
  3. A development view of the code components to be built.
  4. A physical view of the application to be integrated and deployed.
  5. A context view that provides a unifying scenario for the other four views.

It's common to start with the context view so that readers have a sense of what the other views describe. As our understanding of the users and the problem domain evolves, the context will evolve, also.

It's very important to recognize that all of these 4+1 views evolve together. A change to one will generally
be reflected in other views. It's a common mistake to think that one view is in some way foundational, and the other views build on it in a cascade of design steps that always lead to software.

We'll start with a summary of the problem and some background before we start trying to analyze the application or design software.

Introduction and Problem Overview

Our users want to automate a job often called "classification". This is the underpinning idea behind product recommendations: last time a customer bought product X, perhaps they'd be interested in a similar product, Y. We've classified their desires and can locate other items in that class of products.

It helps to start with something small and manageable. The users don't want to tackle complex consumer products. Solving a difficult problem is not a good way to learn how to build this kind of application. It's better to start with something of a manageable level of complexity and then refine and expand it until it does everything they need.

There are a number of classifier approaches; one popular approach is called "k Nearest Neighbors", or "k-NN" for short. A training set of data is required. Each training sample has a number of attributes, reduced to numeric scores, and a final, correct, classification.

Given an unknown sample, we can measure the distance between the unknown sample and any of the known samples. For some small group of nearby neighbors, we can take a vote. The unknown sample can be classified into the sub-population selected by the majority of the nearby neighbors.
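The measure-and-vote idea can be sketched in a few lines of Python. This is a minimal sketch, assuming two-dimensional samples represented as tuples and a Euclidean distance; the real application will use richer classes.

```python
from collections import Counter
from math import hypot

def classify(
    k: int, known: list[tuple[float, float, str]], unknown: tuple[float, float]
) -> str:
    """Classify an unknown (x, y) point by voting among its k nearest known samples."""
    # Rank the known samples by their distance from the unknown sample.
    by_distance = sorted(
        known, key=lambda s: hypot(s[0] - unknown[0], s[1] - unknown[1])
    )
    # Tally the classifications of the k nearest neighbors; the majority wins.
    votes = Counter(s[2] for s in by_distance[:k])
    return votes.most_common(1)[0][0]
```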

One underpinning concept is having tangible, numeric measurements for the various attributes. Converting words, addresses, and other non-ordinal data to an ordinal measurement can be challenging. The good news is the data we're going to start with already has properly ordinal measurements with explicit units of measure.

Another supporting concept is the number of neighbors involved in the voting. This is the "k" factor in the k nearest neighbors. If k is too big, then the sheer size of the populations can influence the voting. If there are more things classified as "Aye" than "Nay", then "Aye" may win all the votes even when an unknown sample is surrounded by "Nay" samples.

Imagine we have an unknown sample slightly closer to one lonely outlier of class "Nay" than it is to three slightly more distant samples of class "Aye". Any affinity for the three samples of class "Aye" would be lost if k is set to one and we only consider the single nearest neighbor. In this case, a k of three or five would have properly associated the unknown with the cluster that's slightly further away.
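The outlier scenario can be made concrete with a toy example. The coordinates here are invented for illustration: a single "Nay" outlier sits slightly closer to the unknown point than a cluster of three "Aye" samples.

```python
from collections import Counter
from math import hypot

def vote(k, known, unknown):
    # Rank known (x, y, class) samples by distance, then tally the k nearest.
    nearest = sorted(
        known, key=lambda s: hypot(s[0] - unknown[0], s[1] - unknown[1])
    )[:k]
    return Counter(s[2] for s in nearest).most_common(1)[0][0]

samples = [
    (2.1, 2.1, "Nay"),  # one lonely outlier, slightly closer to the unknown
    (3.0, 3.0, "Aye"),  # a cluster of three, slightly further away
    (3.0, 3.5, "Aye"),
    (3.5, 3.0, "Aye"),
]
unknown = (2.5, 2.5)
print(vote(1, samples, unknown))  # Nay -- the single nearest neighbor is the outlier
print(vote(3, samples, unknown))  # Aye -- a wider vote associates it with the cluster
```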

Setting the appropriate k value requires concrete sample data to be used for training, and pre-classified data that can be used as test cases. We can then try various "k" values until we've found a value for which the test data is properly classified.

A popular data set for learning how this works is the Iris Classification data. See https://archive.ics.uci.edu/ml/datasets/iris for some background on this data. This is also available here https://www.kaggle.com/uciml/iris and many other places.

In the long run, the users intend to switch to classifying consumer product purchases. There's money in making good consumer product recommendations based on previous purchases. We need to start with something simpler to demonstrate that we have the essence of the application working correctly.

More experienced readers may notice some gaps and possible contradictions as we move through the object-oriented analysis and design work. This is intentional. An initial analysis of a problem of any scope will involve learning and rework. This case study will evolve as we learn more. If you've spotted a gap or contradiction, formulate your own design and see if it converges with the lessons learned in subsequent chapters.

Having looked at some aspects of the problem, we can provide a more concrete context with actors and the use cases or scenarios that describe how an actor interacts with the system to be built. We'll start with the context view.

Context View

The context for our application involves two classes of actors.

Here are the two actors and three scenarios we will explore.

uml diagram

The system as a whole is depicted as a rectangle. It encloses ovals to represent user stories. In the UML, specific shapes have meanings, and we reserve rectangles for objects. Ovals (and circles) are for user stories, which are interfaces to the system. Later, we'll see round-cornered rectangles to describe automated processing.

In order to do any useful processing, we need training data, properly classified. There are two parts to each set of data: a training set and a test set. We'll call the whole assembly "training data" instead of the longer (but more precise) "training and test data."

The tuning parameters are set by the botanist, who must examine the test results to be sure the classifier works. There are two parameters that can be tuned:

  1. The value of k used to decide how many neighbors will vote.
  2. The distance computation used to identify the nearest neighbors.

We'll look at these parameters in detail in the process view section, later in this chapter. We'll also revisit these ideas in subsequent case study chapters. The distance computation is an interesting problem.

We can define a set of experiments by imagining a "grid" of each alternative and methodically filling in the grid with the results of measuring the test set. The combination which provides the best fit will be the recommended parameter set from the botanist.
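The grid of experiments can be pictured as a loop over every combination of candidate values. This sketch assumes a caller-supplied quality function that measures the test set for one combination; the candidate values and the toy scores are placeholders, not real measurements.

```python
from itertools import product

def grid_search(k_values, distance_names, quality):
    """Measure every (k, distance) combination and return the best fit."""
    results = {
        (k, d): quality(k, d)
        for k, d in product(k_values, distance_names)
    }
    # The recommended parameter set maximizes the quality score.
    return max(results, key=results.get)

# A toy quality function standing in for a real test-set measurement.
scores = {
    (3, "euclidean"): 42, (5, "euclidean"): 45,
    (3, "manhattan"): 40, (5, "manhattan"): 41,
}
best = grid_search([3, 5], ["euclidean", "manhattan"], lambda k, d: scores[(k, d)])
print(best)  # (5, 'euclidean')
```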

After the testing, a user can make requests. They provide unknown data to receive classification results from this trained classifier process. In the long run, this "user" won't be a person; it'll be a connection from some website's sales or catalog engine to our clever classifier-based recommendation engine.

We can summarize each of these scenarios with a "use case" or "user story" statement.

Given the nouns and verbs in the user stories, we can use that information to create a logical view of the data the application will process.

Logical View

Looking at the context diagram, processing starts with training data and testing data. This is properly classified sample data.
Here's one way to look at a class to contain various training and testing data sets.

uml diagram

This shows a Training Data class of objects with the attributes of each instance of this class. We need a Training Data object to have a name, and dates when uploading and testing were completed. Each Training Data object has a single tuning parameter, k, used for the k-NN classifier algorithm. An instance also includes two lists of individual samples: a training list and a testing list.

Each class of objects is depicted in a rectangle with a number of individual sections.

Each object of the Sample class has a handful of attributes: four floating-point measurement values, and a string value which is the botanist-assigned classification for the sample. In this case, we used the attribute name class because that's what it's called in the source data. The Botanist explained "Series" and "Species" but some of the botanical nuance isn't part of this problem domain.
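The two classes in the diagram might be sketched as Python dataclasses. The measurement names follow the Iris data set; the datetime types for the upload and test dates are assumptions, and species stands in for the attribute called class in the source data, since class is a reserved word in Python.

```python
import datetime
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Sample:
    """One sample: four measurements plus the botanist-assigned classification."""
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float
    species: Optional[str] = None  # None until classified

@dataclass
class TrainingData:
    """A named collection of samples, split into training and testing lists."""
    name: str
    uploaded: Optional[datetime.datetime] = None
    tested: Optional[datetime.datetime] = None
    k: int = 1  # the single k-NN tuning parameter
    training: list[Sample] = field(default_factory=list)
    testing: list[Sample] = field(default_factory=list)
```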

The UML arrows show two specific kinds of relationships, highlighted by a filled or an empty diamond. A filled diamond shows "Composition": a Training Data object is composed -- in part -- of two collections. The open diamond shows "Aggregation": a List[Sample] object is an aggregate of Sample items.

A Composition is an existential relationship: we can't have a Training Data without the two List[Sample] objects. And, conversely, a List[Sample] object isn't used in our application without being part of a Training Data object.

An Aggregation, on the other hand, is a relationship where items can exist independently of each other. In this diagram, a number of Sample objects can be part of List[Sample] or can exist independently of the list.

It's not clear that the open diamond to show aggregation of Sample objects into a List object is relevant. It may be an unhelpful design detail. When in doubt, it's better to omit these kinds of details until they're clearly required to assure an implementation that meets the user's expectations.

We've shown a List[Sample] as a separate class of objects. This is Python's generic List qualified with a specific class of objects that will be in the list. It's common to avoid this level of detail and summarize the relationships in a diagram like the following.

uml diagram

This slightly abbreviated diagram is better for analytical work, where the underlying data structures don't matter. It's less helpful for design work, where specific Python class information becomes more important.

Given an initial sketch, we'll compare this logical view with each of the three scenarios mentioned in the context diagram, shown in the previous section. We want to be sure all of the data and processing in the user stories can be allocated as responsibilities distributed among the classes, attributes, and methods in the diagram.

Walking through the user stories, we uncover these two problems:

  1. It's not clear where the k-NN classification processing and its tuning parameters belong among these classes.
  2. The details of handling a user's web request and response are missing.

The second problem is a question of boundaries. While the web request and response details are missing, it's more important to describe the essential problem domain -- classification and k-NN -- first. A web service for handling user requests is one (of many) solution technologies.

The first point, however, is a real defect. We'll need to re-read the user stories and try again to create a better logical view.

The classify() method of a Sample object needs access to a Tuning object. It turns out these "tuning" parameters are more commonly known as hyperparameters. We'll need to change terminology.

There are several choices for providing an appropriate Tuning (or Hyperparameter) instance to the classify() function:

  1. In the original Logical View diagrams, the Tuning object was a parameter to Sample.classify(). This is certainly simple, and allows the flexibility to use a non-optimal tuning for testing and comparison purposes.

  2. We can consider refactoring the classify method into a new Hyperparameter class.

We'll need to rethink our diagram by looking more closely at the various nouns and verbs in our user stories. This exercise is often repeated several times.

Logical View Revised

This second choice can -- perhaps -- lead to a slight simplification. Here's the diagram.

uml diagram

This diagram pushes the classification away from the TrainingData class of objects. In this revision, the classification process is defined to be part of the Hyperparameter class. The idea here is to implement the classification of an unknown Sample with a process like the following.

  1. Provide the Sample to a specific Hyperparameter object. Usually, this is the Hyperparameter instance with the highest quality after testing.

  2. Each Hyperparameter object is associated with a specific Training Data object. This Hyperparameter object computes the k nearest neighbors. The value of k is an attribute of the Hyperparameter instance.

  3. The result, after voting, is a new KnownSample object with the species attribute filled in. This is a copy of the data from the original Sample. The original Sample can be gracefully ignored at this point, and cleaned up by Python's ordinary garbage collection.

These three steps are the implementation of the Hyperparameter.classify() method.

The Hyperparameter.matches() method evaluates the Hyperparameter.classify() method on a KnownSample. The quality score can be as simple as the count of successful matches (assuming a constant-sized test set).
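The classification steps and the quality measurement might be sketched as follows. For brevity, this sketch uses two measurements per sample instead of four, assumes a Euclidean distance, and names the quality-scoring method test(); these are simplifications, not the final design.

```python
from collections import Counter
from dataclasses import dataclass
from math import hypot

@dataclass
class KnownSample:
    """A sample whose species was assigned by the botanist."""
    x: float
    y: float
    species: str

@dataclass
class Hyperparameter:
    """One k value tied to a body of training data, plus its measured quality."""
    k: int
    training: list  # the associated training subset of KnownSample objects
    quality: int = 0

    def classify(self, x: float, y: float) -> str:
        """Vote among the k nearest training samples."""
        nearest = sorted(
            self.training, key=lambda s: hypot(s.x - x, s.y - y)
        )[: self.k]
        return Counter(s.species for s in nearest).most_common(1)[0][0]

    def test(self, testing: list) -> int:
        """Count test samples whose known species matches the classification."""
        self.quality = sum(
            1 for s in testing if self.classify(s.x, s.y) == s.species
        )
        return self.quality
```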

Now that we seem to have a reasonably complete logical view of the data, we can turn our focus to the processing of the data. This seems to be the most effective order for creating a description of an application. The data has to be described first; it's the most enduring part, and the thing which is always preserved through each refinement of the processing. The processing is secondary to the data, because the processing changes as the context changes and user experience and preferences change.

Process View

There are three separate user stories. This does not necessarily force us to create three process diagrams. For complex processing, there may be more process diagrams than user stories. In some cases, a user story may be too simple to require a carefully designed diagram.

For our application, it seems as though there are at least three unique processes of interest:

  1. Uploading the initial set of classified samples.
  2. Testing the classifier with a given set of tuning parameters.
  3. Making a classification request for an unknown sample.

We'll sketch activity diagrams for these use cases. An activity diagram summarizes a number of state changes. The processing begins with a start node and proceeds until an end node is reached. In transaction-based applications, like web services, it's common to omit showing the overall web server engine. Instead, we generally focus on the processing performed by a Flask view function, since that tends to be unique for each kind of transaction.

The activities are shown in round-corner rectangles.

Where specific classes of objects or software components are relevant, they can be linked to relevant activities.

What's more important is making sure that the logical view is updated as ideas arise while working on the process view. It's difficult to get either view done completely in isolation; it's far better to make incremental changes in each view as new solution ideas arise. In some cases, additional user input is required, and this, too, will lead to evolution of these views.

We can sketch a diagram to show how the system responds when the Botanist provides the initial data.

uml diagram

The collection of KnownSample values will be partitioned into two subsets: a training subset and a testing subset. There's no rule in our problem summary or user stories for making this distinction; the gap shows we're missing details in the original user story. When details are missing from the user stories, then the logical view may be incomplete, also. For now, we can labor under an assumption that most of the data -- say 75% -- will be used for training, and the balance, 25%, will be used for testing.
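Under that assumption, the partition might be sketched as follows. The use of a seeded shuffle is itself an assumption; the real splitting rule is one of the details missing from the user stories.

```python
import random

def partition(
    samples: list, training_fraction: float = 0.75, seed: int = 42
) -> tuple[list, list]:
    """Shuffle the known samples and split them into training and testing subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]

training, testing = partition(list(range(100)))
print(len(training), len(testing))  # 75 25
```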

It often helps to create similar diagrams for each of the user stories. It also helps to be sure that the activities all have relevant classes to implement the steps and represent state changes caused by each step.

We've included a verb, Partition, in this diagram. This suggests a method will be required to implement the verb. This may lead to rethinking the class model to be sure the processing can be implemented.

We'll turn next to considering some of the components to be built. Since this is a preliminary analysis, our ideas will evolve as we do more detailed design and start creating class definitions.

Development View

There's often a delicate balance between the final deployment and the components to be developed. In rare cases, there are few deployment constraints, and the designer can think freely about the components to be developed. A physical view will evolve from the development. In more common cases, there's a specific target architecture that must be used and elements of the physical view are fixed.

In this case, we're planning on a web-services architecture where RESTful requests can be made to the server we're building. We'll detail the architecture in the Deployment View, below. For now, we'll assume that the Flask framework will be used to build a web service.

The following diagram shows some of the components we need to build.

uml diagram

This diagram shows a parent Python package, Classifier, that contains a number of modules. The three top-level modules are:

We have included dependency arrows, using dashed lines. These are annotated with the Python-specific "imports" label to help clarify how the various packages and modules are related.

As we move through the design in later chapters, we'll expand on this initial view. Having thought about what needs to be built, we can now consider how it's deployed by drawing a physical view of the application. As noted above, there's a delicate dance between development and deployment. The two views are often built together.

Physical View

The physical view shows how the software will be installed into physical hardware. For web services, we often talk about a continuous integration and continuous deployment (CI/CD) pipeline. A change to the software is tested as a unit, integrated with the existing applications, tested as an integrated whole, then deployed for the users.

The following diagram shows a view of a Flask application server.

uml diagram

This diagram shows the client and server nodes as 3-dimensional "boxes" with "components" installed on them. We've identified three components.

Of these components, the Client's "Application" is not part of the work being done to develop the classifier. We've included this to illustrate the context, but we're not really going to be building it.

We've used a dotted dependency arrow to show that our Classifier application is a dependency of the web server. GUnicorn will import our Flask object and use it to respond to requests.
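A typical way to have GUnicorn import a Flask object is a command like the following. The module path classifier.app and the attribute name app are hypothetical placeholders for whatever names the development view settles on.

```shell
# Serve the Flask object named "app" from the (hypothetical) classifier.app module.
gunicorn --bind 0.0.0.0:8000 classifier.app:app
```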

Now that we've sketched out the application, we can consider writing some code. As we write, it helps to keep the diagrams up-to-date. Sometimes, they serve as a handy road-map in a wilderness of code.

Conclusion

There are several key concepts in this overview.

  1. Software applications can be rather complicated. There are five views to depict the users, the data, the processing, the components to be built, and the target physical implementation.

  2. Mistakes will be made. This overview has some gaps in it. It's important to move forward with partial solutions. One of Python's advantages is the ability to build software quickly, meaning we're not deeply invested in bad ideas. We can (and should) remove and replace code quickly.

    Spoiler Alert. We've failed to address the choice of distance calculation used for k-NN.

  3. Extensions will be identified. After we implement this, we'll see that setting the k parameter is a tedious exercise. An important next step is to automate tuning, using a Grid Search tuning algorithm. It's often helpful to set these things aside and get something that works first, then extend working software later to add this helpful feature.

  4. Some OO design techniques are used more than others. In this example, we've focused on a few:

  5. We've tried to assign clear responsibilities to each class. This has been moderately successful, and some responsibilities are vague or omitted entirely. We'll revisit this as we expand this initial analysis into implementation details.