We've been skirting an issue that arises frequently when working with complex data. Files have both a logical layout and a physical format. We've been laboring under a tacit assumption that our files are in CSV format, with a layout defined by the first line of the file.
Trusting everything to be in a CSV file isn't a great assumption. We need to look at the alternatives, and elevate our assumption to a design choice. We also need to build in the flexibility to make changes as the context for using our application evolves.
We'll start by looking at the idea of serialization (and deserialization).
The term "serialization" can used to describe any process of turning a complex data structure into a series of bytes.
In our various model classes, we provided __repr__() methods to create a string representation of the underlying object. This is a kind of serialization.
Ideally, a good serialization scheme is reversible. The long-standing tradition is for the Python __repr__() method to create a sequence of characters that will -- when interpreted as Python code -- rebuild the underlying object.
We can see it like this.
>>> from model import TrainingKnownSample
>>> s1 = TrainingKnownSample(
... sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species="Iris-setosa")
>>> serialized = repr(s1)
>>> serialized
"TrainingKnownSample(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa', )"
>>> s2 = eval(serialized)
>>> s1 == s2
True
>>> id(s1) == id(s2)
False
We've created an object, s1, an instance of the TrainingKnownSample class. The built-in repr() function makes use of the __repr__() method to create a representation of the object. In this case, we've been careful to build a string that's also a valid Python expression.
When we evaluate the expression, eval(serialized), we'll interpret the text, which creates a new object that's assigned to the variable s2.
We've added an __eq__() method to the KnownSample superclass to confirm the two objects have the same attribute values. We can check the id() of each object and confirm that they are, indeed, separate objects with the same attribute values.
Python has several kinds of built-in serialization techniques. These are useful for many things, but aren't appropriate for sharing data. We'll look at some of the serialization modules we have available in the standard library.
Python has a very clever built-in serialization module, named pickle. The pickle module can serialize any Python object. The downside of pickle is that the representation is more-or-less opaque and also unique to Python. It's not great for data sharing, but it's very easy to use.
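For example, a minimal sketch of saving and restoring an object with pickle might look like this; the training_data object and the file name are hypothetical stand-ins for whatever object we want to preserve:
import pickle

# Serialize the object -- and everything it refers to -- into a file of bytes.
with open("training_data.pickle", "wb") as target_file:
    pickle.dump(training_data, target_file)

# Later, rebuild an equivalent object from those bytes.
with open("training_data.pickle", "rb") as source_file:
    recovered = pickle.load(source_file)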
The shelve module makes use of the pickle format to create a kind of database with the general behavior of a dictionary. We can use this to save and recover objects from files. This isn't a proper database; it lacks any provision for concurrent writes from multiple clients.
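As a small sketch, assuming we want to stash the s1 sample from the earlier example; the "samples_shelf" file name is hypothetical:
import shelve

with shelve.open("samples_shelf") as db:
    db["sample 1"] = s1           # pickled and written to the shelf
    recovered = db["sample 1"]    # unpickled on demand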
The json module can serialize a narrow subset of Python built-in types. There are a number of limitations imposed by the JSON definition. See json.org for technical details on what JSON can represent.
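As a quick illustration, dictionaries, strings, and numbers serialize cleanly, but something like a set is rejected:
>>> import json
>>> json.dumps({"sepal_length": 5.1, "species": "Iris-setosa"})
'{"sepal_length": 5.1, "species": "Iris-setosa"}'
>>> json.dumps({3.5, 1.4})
Traceback (most recent call last):
...
TypeError: Object of type set is not JSON serializable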
We can use XML for serializing data. This isn't very easy. Python's standard library has modules to parse XML data, but it doesn't provide any handy XML serializers. Pragmatically, we often have to use a templating engine like Jinja to create XML files.
Similar considerations apply to using HTML to serialize data. It's awkward to parse and awkward to create. Because of the complexity of HTML web pages, a simple HTML parser isn't often sufficient, and a more powerful tool like Beautiful Soup is required to make sense of the HTML.
It's common to map complex objects to dictionaries, which have a tidy JSON representation. For this reason, the Classifier web application makes use of dictionaries. We can also parse CSV data into dictionaries, providing us a kind of grand unification of CSV, Python and JSON. We'll start by looking at the CSV format.
We've made use of the csv module to read and write files. CSV stands for "Comma-Separated Values"; the format is designed to export and import the data from a spreadsheet.
The CSV format describes a sequence of rows. Each row is a sequence of strings.
The "comma" in CSV is a role, not a specific character.
The purpose is to separate the columns of data.
For the most part, the role of the comma is played
by the literal ",". But other actors can fill this role.
It's common to see the tab character, written "\t"
or "\x09"
,
can fill the role of comma.
The end-of-line is often the CRLF sequence, written "\r\n" or "\x0d\x0a". On Mac OS X and Linux, it's also possible to use a single newline character, "\n", at the end of each row.
In order to contain the comma character within a column's data, the data can be quoted. This is often done with the '"' character, but it's possible to specify a different quote character when describing a CSV dialect.
Because CSV data is simply a sequence of strings, any other interpretation of the data requires processing by our application. For example, within the TrainingData class, the load() method includes processing like the following:
test = TestingKnownSample(
    species=row["species"],
    sepal_length=float(row["sepal_length"]),
    sepal_width=float(row["sepal_width"]),
    petal_length=float(row["petal_length"]),
    petal_width=float(row["petal_width"]),
)
This load() method extracts specific row values, applies a conversion function to build a Python object from the text, and uses all of the attribute values to build a resulting object.
The design for the load() method assumes that the data will be provided as a dictionary.
There are two ways to consume (and produce) CSV-formatted data.
We can read CSV files as a sequence of strings, or as a dictionary. When we read the file as a sequence of strings, there are no special provisions for column headers. We're forced to manage the details of which column has a particular attribute. This is unpleasantly complex, but sometimes necessary.
We can also read a CSV file so each row becomes a dictionary. We can provide a sequence of keys, or the first line of the file can provide the keys. This is relatively common, and it saves a little bit of confusion when the column headers are part of the data.
In the case of the Iris data, the source file, bezdekIris.data, does not have column titles. The column titles are provided separately in a file named iris.names.
The iris.names file has a great deal of information in it, including this:
7. Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
In section 7 of the names document, the five columns of data are defined. This separation between the metadata and the data isn't ideal, but we can copy and paste it into code to make something useful from it.
We'll use it to define an Iris reader class as follows:
class CSV_IrisReader:
    """
    Attribute Information:
       1. sepal length in cm
       2. sepal width in cm
       3. petal length in cm
       4. petal width in cm
       5. class:
          -- Iris Setosa
          -- Iris Versicolour
          -- Iris Virginica
    """

    header = [
        "sepal_length",  # in cm
        "sepal_width",  # in cm
        "petal_length",  # in cm
        "petal_width",  # in cm
        "species",  # Iris-setosa, Iris-versicolour, Iris-virginica
    ]

    def __init__(self, source: Path) -> None:
        self.source = source

    def data_iter(self) -> Iterator[Dict[str, str]]:
        with self.source.open() as source_file:
            reader = csv.DictReader(source_file, self.header)
            yield from reader
We transformed the documentation into a sequence of column names. The transformation isn't arbitrary. We matched the resulting KnownSample class attribute names.
In relatively simple applications, there's a single source of data, and the attribute names for classes and the column names for CSV files are easy to keep aligned. This isn't always the case. In some problem domains, the data may have several variant names and formats. We may choose attribute names that seem good but don't match any of the input files exactly.
The data_iter() method has a name suggesting it is an iterator over multiple data items. The type hint (Iterator[Dict[str, str]]) confirms this. The function uses yield from to provide rows from the CSV DictReader object as they're demanded by a client process.
This is a "lazy" way to read lines from the CSV as they're required by another object. This doesn't slurp in the entire file, creating a giagantic list of dictionaries. Instead, the iterator produces one dictionary at a time, as they're requested.
One way to request data from an iterator is to use the built-in list() function. We can use this class as follows:
>>> from model import CSV_IrisReader
>>> from pathlib import Path
>>> test_data = Path.cwd().parent/"bezdekIris.data"
>>> rdr = CSV_IrisReader(test_data)
>>> samples = list(rdr.data_iter())
>>> len(samples)
150
>>> samples[0]
{'sepal_length': '5.1', 'sepal_width': '3.5', 'petal_length': '1.4', 'petal_width': '0.2', 'species': 'Iris-setosa'}
The csv DictReader produces a dictionary. The keys can come from the first row. In this case, the file doesn't have column headers in the first row, so we provided the column headers in our CSV_IrisReader class definition.
The data_iter() method produces rows for a consumer. In this example, the list() function consumes the available rows. As expected, the dataset has 150 rows. We've shown the first row.
Note that the attribute values are strings. This is always true when reading CSV files -- all of the input values are strings. Our application must convert the strings to float values to be able to create KnownSample objects.
Another way to consume values is with a for statement. This is how the load() method of the TrainingData class works. It uses code that looks like this.
def load(self, raw_data_iter: Iterator[Dict[str, str]]) -> None:
    """Extract TestingKnownSample and TrainingKnownSample from raw data"""
    for n, row in enumerate(raw_data_iter):
        ...  # more processing here
We combine a CSV_IrisReader object with this object to load the samples. It looks like this.
training_data = TrainingData("bezdekIris")
rdr = CSV_IrisReader(test_data)
training_data.load(rdr.data_iter())
The load() method will consume values produced by the data_iter() method. Loading the data is a cooperative process between the two objects.
The non-dictionary CSV reader produces a list of strings from each row. This is not what the load() method expects. We have two choices to meet the interface requirement for the load() method:
1. Convert the list of column values to a dictionary.
2. Change load() to use a list of values in a fixed order. This would have the unfortunate consequence of forcing the TrainingData class's load() method to match a specific file layout. Alternatively, we'd have to re-order input values to match the requirements of load(); doing this is about as complex as building a dictionary.
Building a dictionary is relatively easy and allows the load() method to work with data where the column layout varies from our initial expectation.
Here's a CSV_IrisReader_2 class that uses csv.reader() to read a file and builds dictionaries based on the attribute information published in the iris.names file.
class CSV_IrisReader_2:
    """
    Attribute Information:
       1. sepal length in cm
       2. sepal width in cm
       3. petal length in cm
       4. petal width in cm
       5. class:
          -- Iris Setosa
          -- Iris Versicolour
          -- Iris Virginica
    """

    def __init__(self, source: Path) -> None:
        self.source = source

    def data_iter(self) -> Iterator[Dict[str, str]]:
        with self.source.open() as source_file:
            reader = csv.reader(source_file)
            for row in reader:
                yield dict(
                    sepal_length=row[0],  # in cm
                    sepal_width=row[1],  # in cm
                    petal_length=row[2],  # in cm
                    petal_width=row[3],  # in cm
                    species=row[4],  # class string
                )
The data_iter() method yields individual dictionary objects. This for-with-yield construct summarizes what yield from does. When we write yield from X, that becomes:
for item in X:
    yield item
Our source file happens to be in the most common CSV dialect. There are other possible configurations of CSV readers.
The comma character could be a space, a tab ("\t"), or "|". The end-of-line character doesn't have to be "\r\n" or "\n"; it could be an obscure, unprintable character like "\x01". The quote character doesn't have to be '"'; it could be "$". The quote character wraps column values that contain the comma character, and any quote characters inside a value are doubled. Think of "Say ""$5""" as the CSV serialization of the Python string 'Say "$5"'. If the quote character is "$", the CSV serialization would be $Say "$$5"$.
There are a variety of quoting rules available, including QUOTE_ALL, QUOTE_MINIMAL, QUOTE_NONNUMERIC, and QUOTE_NONE. The default is minimal quoting -- quotes are only used on values that have the comma, quote, or end-of-line character in them. Requiring quotes on all values is a popular alternative for reducing ambiguity. Using no quotes is a potential problem.
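As a sketch of how these dialect choices appear in code -- the file name, delimiter, and quote character here are hypothetical examples, not part of our case study:
import csv
from pathlib import Path

# Write with an unusual dialect: "|" separates columns, "$" is the quote
# character, and every value is quoted.
with Path("unusual_dialect.data").open("w", newline="") as target_file:
    writer = csv.writer(
        target_file, delimiter="|", quotechar="$", quoting=csv.QUOTE_ALL
    )
    writer.writerow(["5.1", "3.5", "1.4", "0.2", "Iris-setosa"])

# Reading the file back requires the same dialect settings.
with Path("unusual_dialect.data").open(newline="") as source_file:
    reader = csv.reader(source_file, delimiter="|", quotechar="$")
    print(list(reader))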
Having looked closely at CSV alternatives, we'll turn to JSON-formatted data. This is very popular for RESTful web frameworks.
The JSON format can serialize a number of commonly-used Python object classes, including None, bool, int, float, and str, as well as lists and dictionaries built from these types.
What's essential is that a JSON document can contain a list of JSON documents. And a JSON document can contain a mapping with string keys and values that are JSON documents. This recursion allows us to represent very complex things.
We might consider a type hint like the following:
JSON = Union[
    None, bool, int, float, str, List['JSON'], Dict[str, 'JSON']
]
This isn't directly supported by mypy, but it can be a helpful conceptual framework for understanding what we can represent in JSON notation.
In JSON notation, our data will look like this:
[
    {
        "sepal_length": 5.1,
        "sepal_width": 3.5,
        "petal_length": 1.4,
        "petal_width": 0.2,
        "species": "Iris-setosa"
    },
    {
        "sepal_length": 4.9,
        "sepal_width": 3.0,
        "petal_length": 1.4,
        "petal_width": 0.2,
        "species": "Iris-setosa"
    },
Note that the numeric values don't have quotation marks. A number with a "." character will be converted to a float value; a number without one becomes an int.
Each {}-delimited sample has a syntax that's very much like Python's native syntax for dictionaries. The JSON syntax is a little less flexible than Python's, but is otherwise similar.
The entire collection of samples is a single list object. JSON uses [] in almost exactly the same syntax Python uses for lists. Again, Python is a bit more flexible than the JSON rules, but the two are very similar.
For other data types, like bool and None, the JSON notation is considerably different from Python's. In general, the json module is required to successfully read and write JSON representations of Python objects.
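A quick round trip shows the difference in spelling:
>>> import json
>>> json.dumps([True, False, None])
'[true, false, null]'
>>> json.loads('[true, false, null]')
[True, False, None]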
The json.org standards require a single JSON object in a file. This forces us to create a "list-of-dict" structure. Pragmatically, this can be summarized by this type hint:
JSON_Samples = List[Dict[str, Union[float, str]]]
The document -- as a whole -- is a list. It contains a number of dictionaries that map string keys to either float or string values. We can be a bit more precise by using the mypy TypedDict hint.
SampleDict = TypedDict(
    "SampleDict",
    {
        "sepal_length": float,
        "sepal_width": float,
        "petal_length": float,
        "petal_width": float,
        "species": str,
    },
)
This can be helpful to mypy (and other people reading our code) by showing what the expected structure should be.
This doesn't really confirm the JSON document, however. Remember, mypy is only a static check on the code, and has no run-time impact. To check the JSON document's structure, we'll need something more sophisticated than a Python type hint.
Here's our JSON reader class definition:
class JSON_IrisReader:
    def __init__(self, source: Path) -> None:
        self.source = source

    def data_iter(self) -> Iterator[SampleDict]:
        with self.source.open() as source_file:
            sample_list = json.load(source_file)
            yield from iter(sample_list)
We've opened the source file and loaded the list-of-dict objects. We can then yield the individual sample dictionaries by iterating over the list.
This has a hidden cost. We'll look at how newline-delimited JSON -- a modification to the standard -- can help reduce the memory used.
For large collections of objects, reading a single, massive list into memory first isn't ideal. The "newline delimited" JSON format (see ndjson.org) provides a way to put a large number of separate JSON documents into a single file.
The file would look like this.
{"sepal_length": 5.0, "sepal_width": 3.3, "petal_length": 1.4, "petal_width": 0.2, "species": "Iris-setosa"}
{"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4, "species": "Iris-versicolor"}
There's no overall [] to create a list. Each individual sample must be complete on one physical line of the file.
This leads to a slight difference in the way we process the sequence of documents.
class NDJSON_IrisReader:
    def __init__(self, source: Path) -> None:
        self.source = source

    def data_iter(self) -> Iterator[SampleDict]:
        with self.source.open() as source_file:
            for line in source_file:
                sample = json.loads(line)
                yield sample
We've read each line of the file and used json.loads() to parse the single string into a sample dictionary. The interface is the same: an Iterator[SampleDict]. The technique for producing that iterator is unique to newline-delimited JSON.
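Producing this format is just as easy as consuming it. Here's a minimal sketch of a writer; the write_ndjson() function is our own invention, not part of any library:
import json
from pathlib import Path
from typing import Iterable

def write_ndjson(samples: Iterable[SampleDict], target: Path) -> None:
    """Write each sample as a separate JSON document on its own line."""
    with target.open("w") as target_file:
        for sample in samples:
            target_file.write(json.dumps(sample) + "\n")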
We noted that our mypy type hint doesn't really guarantee the JSON document is -- in any way -- what we expected. We do have a library that can be used for this. The JSONSchema library lets us provide a specification for a JSON document, and then confirm whether or not the document meets the specification.
This is a run-time check, unlike the mypy type hint. This means it makes our program slower. It can also help to diagnose subtly incorrect JSON documents.
For details, see https://json-schema.org.
We'll focus on newline-delimited JSON. This means we need a schema for each sample document within the larger collection of documents.
We'll need to install an additional library to do the validation.
conda install jsonschema
Or
python -m pip install jsonschema
A JSON Schema document is also written in JSON notation. It includes some metadata to help clarify the purpose and meaning of the document. It's often a little easier to create a Python dictionary with the JSON schema definition.
Here's a candidate definition for the Iris schema for an individual sample.
Iris_Schema = {
    "$schema": "https://json-schema.org/draft/2019-09/hyper-schema",
    "title": "Iris Data Schema",
    "description": "Schema of Bezdek Iris data",
    "type": "object",
    "properties": {
        "sepal_length": {"type": "number", "description": "Sepal Length in cm"},
        "sepal_width": {"type": "number", "description": "Sepal Width in cm"},
        "petal_length": {"type": "number", "description": "Petal Length in cm"},
        "petal_width": {"type": "number", "description": "Petal Width in cm"},
        "species": {
            "type": "string",
            "description": "class",
            "enum": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
        },
    },
    "required": ["sepal_length", "sepal_width", "petal_length", "petal_width"],
}
Each sample is an "object", the JSONSchema term for a dictionary with keys and values. The "properties" of an object are the dictionary keys. Each one of these is described with a type of data, "number" in this case. We can provide additional details, like ranges of values. We provided a description, taken from the iris.names file.
In the case of the species, we've provided an enumeration of the valid string values. This can be handy for confirming that the data meets our overall expectations.
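We can exercise the schema directly before wiring it into a reader class. Here's a small sketch reusing the Iris_Schema dictionary defined above:
>>> from jsonschema import Draft7Validator
>>> validator = Draft7Validator(Iris_Schema)
>>> validator.is_valid(
...     {"sepal_length": 5.1, "sepal_width": 3.5,
...      "petal_length": 1.4, "petal_width": 0.2,
...      "species": "Iris-setosa"})
True
>>> validator.is_valid({"sepal_length": "5.1", "species": "Iris-setosa"})
False
The second sample fails because sepal_length is a string rather than a number, and because the other required properties are missing.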
We use this schema information by creating a JSONSchema validator and applying the validator to check each sample we read. An extended class might look like this:
class Validating_NDJSON_IrisReader:
    def __init__(self, source: Path, schema: Dict[str, Any]) -> None:
        self.source = source
        self.validator = jsonschema.Draft7Validator(schema)

    def data_iter(self) -> Iterator[SampleDict]:
        with self.source.open() as source_file:
            for line in source_file:
                sample = json.loads(line)
                if self.validator.is_valid(sample):
                    yield sample
                else:
                    print(f"Invalid: {sample}")
We've accepted an additional parameter in the __init__() method with the schema definition. We use this to create the Validator instance that will be applied to each document. The data_iter() method uses the validator's is_valid() method to process only samples that pass the JSONSchema validation.
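Using it looks much like the other readers; we simply pass the schema as a second argument. The bezdekIris.ndjson file name here is a hypothetical newline-delimited version of the Iris data:
valid_rdr = Validating_NDJSON_IrisReader(
    Path("bezdekIris.ndjson"), Iris_Schema
)
training_data.load(valid_rdr.data_iter())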
Note that we now have two separate, but similar, definitions for the raw data that builds a Sample instance:
1. A type hint, SampleDict, describing the expected Python intermediate data structure. This can be applied to CSV as well as JSON data, and helps summarize the relationship between the load() method of the TrainingData class and the various readers.
2. A JSON Schema that also describes an expected external data structure. This doesn't describe a Python object; it describes the JSON serialization of a Python object.
For simple case studies, these two seem redundant. In more complex situations, they will diverge, and fairly complex conversions between external schemas, intermediate results, and the final class definitions are a common feature of Python applications. This occurs because there are a variety of ways to serialize Python objects. We need to be flexible enough to work with a useful variety of representations.
The YAML format is built on JSON as a foundation. Processing YAML notation requires an additional library.
conda install pyyaml
Or
python -m pip install pyyaml
Because it's easier to read, YAML has a tiny advantage over JSON.
If we want files that people can read without checking the presence or absence of {} and [] characters, YAML can be handy.
Additionally, YAML permits storing a large number of distinct documents in a single file. We don't need to break out of the standard and use an extension like newline-delimited JSON.
The YAML version of our data looks like this:
petal_length: 1.4
petal_width: 0.2
sepal_length: 5.0
sepal_width: 3.3
species: Iris-setosa
---
petal_length: 4.7
petal_width: 1.4
sepal_length: 7.0
sepal_width: 3.2
species: Iris-versicolor
---
The "---" separator ends on YAML document.
This allows multiple documents in a single file.
The yaml.load_all()
function will iterate
through the documents, creating dictionaries
from each document.
Here's how we can use this to read YAML-formatted data.
class YAML_IrisReader:
    def __init__(self, source: Path) -> None:
        self.source = source

    def data_iter(self) -> Iterator[SampleDict]:
        with self.source.open() as source_file:
            yield from yaml.load_all(source_file, Loader=yaml.SafeLoader)
This is pleasantly simple. It works well when the data is in YAML format. Because the intermediate result is a Python dictionary, we can use JSONSchema to validate the document even though, technically, it started in YAML.
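For completeness, here's a sketch of how such a multi-document file could be produced with yaml.dump_all(); the samples variable is assumed to be a list of sample dictionaries like those produced by the readers above, and the output file name is hypothetical:
import yaml
from pathlib import Path

with Path("bezdekIris.yaml").open("w") as target_file:
    # explicit_start=True writes the "---" marker before each document.
    yaml.dump_all(samples, target_file, explicit_start=True)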