This section will step through a few iterations of object-oriented design on a realistic example. We'll create a collection of diagrams using the Unified Modeling Language (UML) to help depict and summarize the software we're going to build.
We'll examine the problem using a technique called "4+1 Views". The views are:
A logical view of the data entities, their static attributes, and their relationships. This is the heart of Object-Oriented Design.
A process view that describes how the data is processed. This can take a variety of forms, including state models, activity diagrams, and sequence diagrams.
A development view of the code components to be built. This diagram shows relationships among software components. This is used to show how class definitions are gathered into modules and packages.
A physical view of the application to be integrated and deployed. In the cases where an application follows a common design pattern, a sophisticated diagram isn't necessary. In other cases, a diagram is essential to show how a collection of components are integrated and deployed.
A context view that provides a unifying context for the other four views. The context view will often describe the actors that use (or interact with) the system to be built. This can involve human actors as well as automated interfaces: both are outside the system, and the system must respond to these external actors.
It's common to start with the context view so that readers have a sense of what the other views describe. As our understanding of the users and the problem domain evolves, the context will evolve, also.
It's very important to recognize that all of these 4+1 views evolve together. A change to one will generally
be reflected in other views. It's a common mistake to think that one view is in some way foundational,
and the other views build on it in a cascade of design steps that always lead to software.
We'll start with a summary of the problem and some background before we start trying to analyze the application or design software.
Our users want to automate a job often called "classification". This is the underpinning idea behind product recommendations: last time a customer bought product X, perhaps they'd be interested in a similar product, Y. We've classified their desires and can locate other items in that class of products.
It helps to start with something small and manageable. The users don't want to tackle complex consumer products. Solving a difficult problem is not a good way to learn how to build this kind of application. It's better to start with something of a manageable level of complexity and then refine and expand it until it does everything they need.
There are a number of classifier approaches; one popular approach is called "k Nearest Neighbors", or "k-NN" for short. A training set of data is required. Each training sample has a number of attributes, reduced to numeric scores, and a final, correct, classification.
Given an unknown sample, we can measure the distance between the unknown sample and any of the known samples. For some small group of nearby neighbors, we can take a vote. The unknown sample can be classified into the sub-population selected by the majority of the nearby neighbors.
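To make the idea concrete, here's a minimal sketch of k-NN, assuming each sample has been reduced to a tuple of numeric measurements paired with its known classification. The function and parameter names are illustrative only; they are not the design we'll arrive at in this chapter.

```python
from __future__ import annotations
import math
from collections import Counter


def euclidean(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Straight-line distance between two measurement tuples."""
    return math.hypot(*(x - y for x, y in zip(a, b)))


def knn_classify(
    unknown: tuple[float, ...],
    known: list[tuple[tuple[float, ...], str]],
    k: int = 5,
) -> str:
    """Classify by majority vote among the k nearest known samples."""
    nearest = sorted(known, key=lambda pair: euclidean(unknown, pair[0]))[:k]
    votes = Counter(species for _, species in nearest)
    return votes.most_common(1)[0][0]
```

With a training set like `[((5.1, 3.5, 1.4, 0.2), "Iris-setosa"), ...]`, a call such as `knn_classify((5.0, 3.4, 1.5, 0.2), training, k=3)` returns the species selected by the three nearest neighbors.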
One underpinning concept is having tangible, numeric measurements for the various attributes. Converting words, addresses, and other non-ordinal data to an ordinal measurement can be challenging. The good news is the data we're going to start with already has properly ordinal measurements with explicit units of measure.
Another supporting concept is the number of neighbors involved in the voting. This is the "k" factor in k nearest neighbors. If k is too big, then the sheer size of the populations can influence the voting. If there are more things classified as "Aye" than "Nay", then the "Aye" samples win all the votes even when an unknown sample is surrounded by "Nay" samples.
Imagine we have an unknown sample slightly closer to one lonely outlier of class "Nay" than it is to three slightly more distant samples of class "Aye". Any affinity for the three samples of class "Aye" would be lost if k is set to one and we only consider the single nearest neighbor. In this case, a k of three or five would have properly associated the unknown with the cluster that's slightly farther away.
Setting the appropriate k value requires concrete sample data to be used for training, and pre-classified data that can be used as test cases. We can then try various "k" values until we've found a value for which the test data is properly classified.
A popular data set for learning how this works is the Iris Classification data. See https://archive.ics.uci.edu/ml/datasets/iris for some background on this data. This is also available here https://www.kaggle.com/uciml/iris and many other places.
In the long run, the users intend to switch to classifying consumer product purchases. There's money in making good consumer product recommendations based on previous purchases. We need to start with something simpler to demonstrate that we have the essence of the application working correctly.
More experienced readers may notice some gaps and possible contradictions as we move through the object-oriented analysis and design work. This is intentional. An initial analysis of a problem of any scope will involve learning and rework. This case study will evolve as we learn more. If you've spotted a gap or contradiction, formulate your own design and see if it converges with the lessons learned in subsequent chapters.
Having looked at some aspects of the problem, we can provide a more concrete context with actors and the use cases or scenarios that describe how an actor interacts with the system to be built. We'll start with the context view.
The context for our application involves two classes of actors.
A Botanist who provides the properly classified training data and a properly classified set of test data. The Botanist also runs the test cases to establish the proper parameters for the classification. In the simple case of k-NN, they can decide which k value should be used.
A "User" who needs to do classification of unknown data. The user has made careful measurements and makes a request with the measurement data to get a classification from this classifier system. The name "User" seems vague, but we're not sure what's better. We'll leave it for now, and put off changing it until we foresee a problem.
Here are the two actors and three scenarios we will explore.
The system as a whole is depicted as a rectangle. It encloses ovals to represent user stories. In the UML, specific shapes have meanings, and we reserve rectangles for objects. Ovals (and circles) are for user stories, which are interfaces to the system. Later, we'll see round-cornered rectangles to describe automated processing.
In order to do any useful processing, we need training data, properly classified. There are two parts to each set of data: a training set and a test set. We'll call the whole assembly "training data" instead of the longer (but more precise) "training and test data."
The tuning parameters are set by the botanist, who must examine the test results to be sure the classifier works. There are two parameters that can be tuned:
The distance computation to use,
The number of neighbors to consider for voting.
We'll look at these parameters in detail in the processing view section, later in this chapter. We'll also revisit these ideas in subsequent case study chapters. The distance computation is an interesting problem.
We can define a set of experiments by imagining a "grid" of each alternative and methodically filling in the grid with the results of measuring the test set. The combination which provides the best fit will be the recommended parameter set from the botanist.
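Here's one way such a grid of experiments might be sketched, assuming we have a classifier function that accepts a distance function and a k value, and a pre-classified test set to score each combination against. The distance alternatives and the scoring rule shown here are assumptions for illustration, not decisions we've made yet.

```python
from __future__ import annotations
import itertools
from typing import Callable, Iterable


def euclidean(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def grid_search(
    classify: Callable[..., str],
    training: list[tuple[tuple[float, ...], str]],
    testing: list[tuple[tuple[float, ...], str]],
    k_values: Iterable[int] = (3, 5, 7, 9),
) -> tuple[int, Callable[..., float], int]:
    """Score every (distance, k) combination against the test set; keep the best."""
    best = (0, euclidean, 0)
    for distance, k in itertools.product((euclidean, manhattan), k_values):
        correct = sum(
            1
            for measurements, species in testing
            if classify(measurements, training, k=k, distance=distance) == species
        )
        if correct > best[0]:
            best = (correct, distance, k)
    return best
```

The combination with the highest count of correct test classifications becomes the botanist's recommended parameter set.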
After the testing, a user can make requests. They provide unknown data to receive classification results from this trained classifier process. In the long run, this "user" won't be a person; it'll be a connection from some website's sales or catalog engine to our clever classifier-based recommendation engine.
We can summarize each of these scenarios with a "use case" or "user story" statement.
As a botanist, I want to provide properly classified training and testing data to this system so users can correctly identify plants.
As a botanist, I want to examine the test results from the classifier to be sure that new samples are likely to be correctly classified.
As a user, I want to be able to provide a few key measurements to the classifier and have the iris species correctly classified.
Given the nouns and verbs in the user stories, we can use that information to create a logical view of the data the application will process.
Looking at the context diagram, processing starts with training data and testing data; this is properly classified sample data.
Here's one way to look at a class to contain various training and testing data sets. This shows a Training Data class of objects, with the attributes of each instance of this class. We need a Training Data object to have a name for the collection, and some dates when uploading and testing were completed. Each Training Data object has a single tuning parameter, k, used for the k-NN classifier algorithm. An instance also includes two lists of individual samples: a training list and a testing list.
Each class of objects is depicted in a rectangle with a number of individual sections. The top-most section provides a name for the class of objects. In two cases, we've used a type hint, List[Sample], because the generic class, list, is used in a way that assures the contents of the list are only Sample objects.
The next section of a class rectangle shows the attributes of each object; these attributes are also called the instance variables of the class. Later, we'll add "methods" for instances of the class to the bottom section.
Each object of the Sample class has a handful of attributes: four floating-point measurement values, and a string value which is the botanist-assigned classification for the sample. In this case, we used the attribute name class because that's what it's called in the source data. The Botanist explained "Series" and "Species", but some of the botanical nuance isn't part of this problem domain.
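As a sketch of how these two classes might begin to look in Python, here is one possible set of dataclass definitions. The four measurement names follow the Iris data set's columns, and we use species rather than class for the final attribute, since class is a reserved word in Python; all of these names are provisional.

```python
from __future__ import annotations
from dataclasses import dataclass, field
import datetime
from typing import List, Optional


@dataclass
class Sample:
    """One set of measurements, with the botanist-assigned classification when known."""
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float
    species: Optional[str] = None  # "class" in the source data; renamed to avoid the keyword


@dataclass
class TrainingData:
    """A named collection of samples, split into training and testing lists."""
    name: str
    uploaded: Optional[datetime.datetime] = None
    tested: Optional[datetime.datetime] = None
    k: int = 5
    training: List[Sample] = field(default_factory=list)
    testing: List[Sample] = field(default_factory=list)
```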
The UML arrows show two specific kinds of relationships, highlighted by filled or empty diamonds.
A filled diamond shows "Composition": a Training Data object is composed -- in part -- of two collections.
The open diamond shows "Aggregation": a List[Sample] object is an aggregate of Sample items.
A Composition is an existential relationship: we can't have a Training Data object without the two List[Sample] objects. And, conversely, a List[Sample] object isn't used in our application without being part of a Training Data object.
An Aggregation, on the other hand, is a relationship where items can exist independently of each other.
In this diagram, a number of Sample objects can be part of a List[Sample] or can exist independently of the list.
It's not clear that the open diamond to show aggregation of Sample objects into a List object is relevant. It may be an unhelpful design detail. When in doubt, it's better to omit these kinds of details until they're clearly required to assure an implementation that meets the user's expectations.
We've shown a List[Sample] as a separate class of objects. This is Python's generic List qualified with a specific class of objects that will be in the list. It's common to avoid this level of detail and summarize the relationships in a diagram like the following.
This slightly abbreviated diagram is better for analytical work, where the underlying data structures don't matter. It's less helpful for design work, where specific Python class information becomes more important.
Given an initial sketch, we'll compare this logical view with each of the three scenarios mentioned in the context diagram, shown in the previous section. We want to be sure all of the data and processing in the user stories can be allocated as responsibilities among the classes, attributes, and methods in the diagram.
Walking through the user stories, we uncover these two problems:
It's not clear how the testing and parameter tuning fit with this diagram. We know there's a k factor that's required, but there are no relevant test results to show alternative k factors and the consequences of those choices.
The user's request is not shown at all. Nor is the response to the user. No classes have these items as part of their responsibilities.
The second problem is a question of boundaries. While the web request and response details are missing, it's more important to describe the essential problem domain -- classification and k-NN -- first. The web service for handling user requests is one (of many) solution technologies.
The first point, however, is a real defect. We'll need to re-read the user stories and try again to create a better logical view.
The classify() method of a Sample object needs access to a Tuning object. It turns out these "tuning" parameters are more commonly known as hyperparameters. We'll need to change terminology.
There are several choices for providing an appropriate Tuning (or Hyperparameter) instance to the classify() function:
In the original Logical View diagrams, the Tuning object was a parameter to Sample.classify(). This is certainly simple, and allows the flexibility to use a non-optimal tuning for testing and comparison purposes.
We can consider refactoring the classify() method into a new Hyperparameter class.
We'll need to rethink our diagram by looking more closely at the various nouns and verbs in our user stories. This exercise is often repeated several times.
This second choice can -- perhaps -- lead to a slight simplification. Here's the diagram.
This diagram pushes the classification away from the TrainingData class of objects. In this revision, the classification process is defined to be part of the Hyperparameter class. The idea here is to implement the classification of an unknown Sample with a process like the following.
Provide the Sample to a specific Hyperparameter object. Usually, this is the Hyperparameter instance with the highest quality after testing. Each Hyperparameter object is associated with a specific Training Data object.
This Hyperparameter object computes the k nearest neighbors. The value of k is an attribute of the Hyperparameter instance.
The result, after voting, is a new KnownSample object with the species attribute filled in. This is a copy of the data from the original Sample. The original Sample can be gracefully ignored at this point, and cleaned up by Python's ordinary garbage collection.
These three steps are the implementation of the Hyperparameter.classify() method. The Hyperparameter.matches() method evaluates the Hyperparameter.classify() method on a KnownSample. The quality score can be as simple as the count of successful matches (assuming a constant-sized test set).
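A sketch of the Hyperparameter class along these lines might look like the following. It assumes Sample and TrainingData objects similar to the dataclasses sketched earlier, hard-codes a Euclidean-style distance, returns the winning species name rather than building a new KnownSample, and uses a test() helper (our name, not settled design) to compute the quality score.

```python
from __future__ import annotations
from collections import Counter
from dataclasses import dataclass


def _measures(sample: "Sample") -> tuple[float, float, float, float]:
    """The four measurement values of a sample, as a tuple (helper for distances)."""
    return (
        sample.sepal_length,
        sample.sepal_width,
        sample.petal_length,
        sample.petal_width,
    )


@dataclass
class Hyperparameter:
    """A k value tied to one TrainingData instance, plus its measured quality."""
    k: int
    data: "TrainingData"
    quality: int = 0

    def classify(self, unknown: "Sample") -> str:
        """Vote among the k nearest neighbors in the training subset."""
        nearest = sorted(
            self.data.training,
            key=lambda known: sum(
                (a - b) ** 2 for a, b in zip(_measures(known), _measures(unknown))
            ),
        )[: self.k]
        votes = Counter(known.species for known in nearest)
        return votes.most_common(1)[0][0]

    def matches(self, known: "Sample") -> bool:
        """Does classify() reproduce the botanist's assigned species?"""
        return self.classify(known) == known.species

    def test(self) -> int:
        """Quality is the count of successful matches over the testing subset."""
        self.quality = sum(1 for known in self.data.testing if self.matches(known))
        return self.quality
```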
Now that we seem to have a reasonably complete logical view of the data, we can turn our focus to the processing of the data. This seems to be the most effective order for creating a description of an application. The data has to be described first; it's the most enduring part, and the thing that is always preserved through each refinement of the processing. The processing is secondary to the data, because processing changes as the context changes and as user experience and preferences change.
There are three separate user stories. This does not necessarily force us to create three process diagrams. For complex processing, there may be more process diagrams than user stories. In some cases, a user story may be too simple to require a carefully designed diagram.
For our application, it seems as though there are at least three unique processes of interest.
Upload the initial set of Samples that comprise some Training Data.
Run a test of the classifier with a given k value.
Make a classification request with a new Sample object.
We'll sketch activity diagrams for these use cases. An activity diagram summarizes a number of state changes. The processing begins with a start node and proceeds until an end node is reached. In transaction-based applications, like web services, it's common to omit showing the overall web server engine. Instead, we generally focus on the processing performed by a Flask view function, since that tends to be unique for each kind of transaction.
The activities are shown in round-corner rectangles.
Where specific classes of objects or software components are relevant, they can be linked to relevant activities.
What's more important is making sure that the logical view is updated as ideas arise while working on the processing view. It's difficult to get either view done completely in isolation. It's far more important to make incremental changes in each view as new solution ideas arise. In some cases, additional user input is required, and this, too will lead to evolution of these views.
We can sketch a diagram to show how the system responds when the Botanist provides the initial data.
The collection of KnownSample values will be partitioned into two subsets: a training subset and a testing subset. There's no rule in our problem summary or user stories for making this distinction; the gap shows we're missing details in the original user story. When details are missing from the user stories, then the logical view may be incomplete, also. For now, we can labor under an assumption that most of the data -- say 75% -- will be used for training, and the balance, 25%, will be used for testing.
It often helps to create similar diagrams for each of the user stories. It also helps to be sure that the activities all have relevant classes to implement the steps and represent state changes caused by each step.
We've included a verb, Partition, in this diagram. This suggests a method will be required to implement the verb. This may lead to rethinking the class model to be sure the processing can be implemented.
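As a placeholder for that method, here's a sketch of a standalone partition() function using the 75% assumption from above. Whether this behavior belongs on TrainingData or somewhere else is exactly the kind of class-model question the diagram raises; the shuffle and the fixed seed are our assumptions.

```python
from __future__ import annotations
import random
from typing import List, Tuple


def partition(
    samples: List["Sample"],
    training_fraction: float = 0.75,
    seed: int = 42,
) -> Tuple[List["Sample"], List["Sample"]]:
    """Shuffle the known samples and split them into training and testing subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]
```

A call like `training, testing = partition(known_samples)` would then populate the two lists of a TrainingData instance.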
We'll turn next to considering some of the components to be built. Since this is a preliminary analysis, our ideas will evolve as we do more detailed design and start creating class definitions.
There's often a delicate balance between the final deployment and the components to be developed. In rare cases, there are few deployment constraints, and the designer can think freely about the components to be developed. A physical view will evolve from the development. In more common cases, there's a specific target architecture that must be used and elements of the physical view are fixed.
In this case, we're planning on a web-services architecture where RESTful requests can be made to the server we're building. We'll detail the architecture in the Deployment View, below. For now, we'll assume that the Flask framework will be used to build a web service.
The following diagram shows some of the components we need to build.
This diagram shows a parent Python package, Classifier, that contains a number of modules.
The three top-level modules are:
Data Model. (This is not a properly Pythonic name; we'll change it later.) It's often helpful to separate the classes that define the problem domain into modules to make it possible to test them in isolation from any particular application that uses those classes.
View Functions. (Not a Pythonic name, either.) This module will create an instance of the Flask class, our application. It will bind a number of routes (URL paths) and the functions that handle requests to those routes; see the sketch after this list.
Tests. This will have unit tests for the model and view functions. While testing is essential for being sure the software is usable, it's the subject of Chapter 13.
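As a sketch of the view functions module, the following shows the general shape of a Flask route for the classification request. The route path, the JSON field names, the model module name, and the globally available classifier object are all assumptions made for illustration.

```python
from __future__ import annotations
from flask import Flask, jsonify, request

from model import Sample  # hypothetical module name for the data model classes

app = Flask(__name__)

# Hypothetical: a trained Hyperparameter instance, created after training data is uploaded.
classifier = None


@app.route("/classify", methods=["POST"])
def classify_view():
    """Accept four measurements as JSON and return the predicted species."""
    document = request.get_json()
    unknown = Sample(
        sepal_length=document["sepal_length"],
        sepal_width=document["sepal_width"],
        petal_length=document["petal_length"],
        petal_width=document["petal_width"],
    )
    return jsonify(species=classifier.classify(unknown))
```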
We have included dependency arrows, using dashed lines. These are annotated with the Python-specific "imports" label to help clarify how the various packages and modules are related.
As we move through the design in later chapters, we'll expand on this initial view. Having thought about what needs to be built, we can now consider how it's deployed by drawing a physical view of the application. As noted above, there's a delicate dance between development and deployment. The two views are often built together.
The physical view shows how the software will be installed onto physical hardware. For web services, we often talk about a continuous integration and continuous deployment (CI/CD) pipeline. A change to the software is tested as a unit, integrated with the existing applications, tested as an integrated whole, then deployed for the users.
The following diagram shows a view of a Flask application server.
This diagram shows the client and server nodes as 3-dimensional "boxes" with "components" installed on them. We've identified three components.
A Client "client app" application. This is the application that connects to the classifier web service and makes RESTful requests. It might be a web site, written in Javascript. It might be a mobile application, written in Kotlin or Swift. All of these front-ends have a common HTTPS connection to our web server. This secure connection requires some configuration of certificates and encryption key pairs.
The "GUnicorn" web server. This server can handle a number of details of web service requests, including the important HTTPS protocol. See https://docs.gunicorn.org/en/stable/index.html for details.
Our "Classifier" application. From this view, the complexities have been elided, and the entire Clasifier is reduced to a small component in a larger web services framework.
Of these components, the Client's "Application" is not part of the work being done to develop the classifier. We've included this to illustrate the context, but we're not really going to be building it.
We've used a dotted dependency arrow to show that our Classifier application is a dependency of the web server. Gunicorn will import our Flask object and use it to respond to requests.
Now that we've sketched out the application, we can consider writing some code. As we write, it helps to keep the diagrams up-to-date. Sometimes, they serve as a handy road-map in a wilderness of code.
There are several key concepts in this overview.
Software applications can be rather complicated. There are five views to depict the users, the data, the processing, the components to be built, and the target physical implementation.
Mistakes will be made. This overview has some gaps in it. It's important to move forward with partial solutions. One of Python's advantages is the ability to build software quickly, meaning we're not deeply invested in bad ideas. We can (and should) remove and replace code quickly.
Spoiler Alert. We've failed to address the choice of distance calculation used for k-NN.
Extensions will be identified. After we implement this, we'll see that setting the k parameter is a tedious exercise. An important next step is to automate tuning, using a Grid Search tuning algorithm. It's often helpful to set these things aside and get something that works first, then extend working software later to add this helpful feature.
Some OO design techniques are used more than others. In this example, we've focused on a few:
Encapsulating features into classes.
Inheritance to extend a class with new features.
Composition to build a class from component objects.
We've tried to assign clear responsibilities to each class. This has been moderately successful, and some responsibilities are vague or omitted entirely. We'll revisit this as we expand this initial analysis into implementation details.