Background on Web Services

There are three topics here: the web services process view, user password hashes, and adding HTTPS to Flask.

We'll start by looking at a sequence diagram to show the general process of making a RESTful request.

Web Services Process View

The REST concept is "Representational State Transfer". We're exchanging a representation of object state between two independent processes. Often the state of an object is summarized into JSON notation, allowing the receiver to build a similar object, given the serialized state.
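As a minimal sketch of this idea, the following uses only the standard library to summarize an object's state as JSON and to rebuild a similar object from that document. The Sample class and its field names are hypothetical, chosen only for illustration; they are not the case study's model classes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Sample:
    """Hypothetical object whose state we want to transfer."""

    sepal_length: float
    sepal_width: float
    species: str


def serialize(sample: Sample) -> str:
    """Summarize the object's state as a JSON document."""
    return json.dumps(asdict(sample))


def deserialize(document: str) -> Sample:
    """Build a similar object from the serialized state."""
    return Sample(**json.loads(document))
```

The sender calls serialize() and puts the result in the body of a request or response; the receiver calls deserialize() on the text it gets. The two processes never share the object itself, only its representation.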

The protocols stack on top of the existing HTTPS protocols, so we can use all of our favorite libraries to build RESTful APIs. There's going to be one important difference, however.

The HTTPS protocols are "stateless": each request must be viewed independently of all other requests. When interacting with a web site that uses a login, the notion of state is handled via "cookies". Each response includes packets of data for the browser to save; each request includes packets of data that are sent back. Once a user authenticates by logging in, the web application uses the cookies to track session state. Each response records the state of the session in cookies; each request uses the cookies to rebuild the session information. This makes the user feel like they're engaging in a long, personal dialog. In fact, it's a series of independent actions with history tracked via cookies.

RESTful APIs are stateless. They don't use cookies. Instead, each request will include user authentication credentials. There are a variety of ways to handle the credential processing, including:

We'll focus on using HTTPS. This requires providing an appropriate certificate to secure the web server.

This leads to RESTful API processing that proceeds as shown in this sequence diagram.

uml diagram

The boxes are objects that are part of our application. The lines below them are "lifelines" showing the activities each object engages in. The requests are solid lines and the responses are dashed lines. Some of these are network requests using HTTPS protocols. Others are a method of one object being invoked and returning an object.

A request originates from a "Client". This could be a mobile app, or a browser, or another web service making a RESTful API request. The request uses HTTPS protocols, and the server will provide a public key as part of the negotiation for a secure connection. This is (almost) completely transparent. We do need to provide the certificate and key pair as part of configuring the server. Good practices suggest the keys be rotated periodically.

The client's request will be encrypted using the public key. Only the server's private key can decrypt the request.

When we write our Flask application, we'll provide a Python decorator on each view function. This decorator will validate the Authorization header to be sure that it has a known username and that user's password. If the Authorization header is missing, or a hash of the user's credentials is invalid, the response is an HTTP 401 status code to indicate that the user could not be authenticated.
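The shape of such a decorator can be sketched without Flask. This is an illustration under several assumptions, not the chapter's implementation: it assumes the HTTP Basic scheme, it passes the request headers to the view as a plain mapping (in Flask they would come from the request object), and the KNOWN_USERS table and the user name in it are hypothetical. It also compares plain-text passwords to keep the example short; a real service would compare password hashes, as described under User Password Hashes.

```python
import base64
import hmac
from functools import wraps

# Hypothetical user table, for illustration only.
# A real service stores password hashes, never plain text.
KNOWN_USERS = {"noriko": "Hunter2"}


def parse_basic(header):
    """Return (username, password) from a Basic Authorization header, or None."""
    if not header or not header.startswith("Basic "):
        return None
    try:
        decoded = base64.b64decode(header[6:], validate=True).decode("utf-8")
    except ValueError:  # bad base64 or bad UTF-8
        return None
    username, separator, password = decoded.partition(":")
    return (username, password) if separator else None


def authenticated(view):
    """Invoke the view only when the headers carry known credentials."""

    @wraps(view)
    def wrapper(headers, *args, **kwargs):
        credentials = parse_basic(headers.get("Authorization"))
        if credentials is None:
            return {"error": "not authenticated"}, 401
        username, password = credentials
        expected = KNOWN_USERS.get(username)
        if expected is None or not hmac.compare_digest(
            expected.encode("utf-8"), password.encode("utf-8")
        ):
            return {"error": "not authenticated"}, 401
        return view(headers, *args, **kwargs)

    return wrapper


@authenticated
def health(headers):
    """A stand-in view function returning (document, status)."""
    return {"status": "OK"}, 200
```

Because the check lives in the decorator, the view function body never runs for an unauthenticated request, and every view gets the same 401 behavior.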

(And yes, the Authorization header seems to have the wrong name.)

If the Authorization header is valid, the decorator will invoke the view function. The view function will use the model classes to process the request. This might be an upload of training data, a request for a test, or a request for classification. We have two different classes of users, so we must ensure that the user is authorized to perform the requested action. If they're not authorized, we can return a 403 status code. This is a nuanced distinction between a bad Authorization header (401) and a request to access inappropriate resources (403).
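The difference between the two status codes can be sketched as a small function. The role names and action names here are hypothetical placeholders for the two classes of users; only the 401 and 403 codes come from the discussion above.

```python
# Hypothetical roles and the actions each role may perform.
PERMISSIONS = {
    "botanist": {"upload_training", "request_test", "classify"},
    "researcher": {"classify"},
}


def access_status(role, action):
    """Return the HTTP status for an access check.

    role is None when authentication failed.
    """
    if role is None:
        return 401  # We don't know who this is.
    if action not in PERMISSIONS.get(role, set()):
        return 403  # We know who this is, but they may not do this.
    return 200
```

A 401 invites the client to try again with better credentials. A 403 says that no change of credentials for this user will help.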

If the authentication and authorization checks have passed, then the document that's part of the request must also be validated. It should be a JSON-format document. It should have certain fields, and the fields should have specific types of values. This is often handled by the jsonschema package. If the data is invalid, this often leads to a 400 status code: the request can't be processed.
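To show the kind of checking involved, here is a hand-written validator using only the standard library. It is a stand-in for a schema-driven check, not a use of the jsonschema package, and the field names in REQUIRED_FIELDS are hypothetical.

```python
import json

# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {"name": str, "samples": list}


def validate_document(body):
    """Return (document, 200) for a valid body, else (message, 400)."""
    try:
        document = json.loads(body)
    except json.JSONDecodeError:
        return "body is not JSON", 400
    if not isinstance(document, dict):
        return "body must be a JSON object", 400
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in document:
            return f"missing field: {field}", 400
        if not isinstance(document[field], expected_type):
            return f"wrong type for field: {field}", 400
    return document, 200
```

A schema-based validator expresses the same rules as data instead of code, which is easier to publish to the people writing clients.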

There may be additional checks to avoid duplicate collections of training data. There are several ways to respond to attempts to create duplicate data. Some servers will respond with a 400 status code to prevent problems. A more helpful approach is to examine the request and the existing data to see if the request is a simple duplication, and tolerate a second upload that matches a previous upload. It's common for a response to be lost, or for the response to be unclear, so the user clicks the upload button again.
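One way to tolerate a repeated upload is to remember a digest of what was uploaded under each name. The sketch below assumes that each collection has a name, keeps the digests in a hypothetical in-memory dictionary, and picks 201 and 200 as the success codes; only the 400 response is taken from the discussion above.

```python
import hashlib
import json

# Hypothetical store: collection name -> digest of the uploaded content.
_uploads = {}


def content_digest(document):
    """Digest of a canonical JSON form, so key order doesn't matter."""
    canonical = json.dumps(document, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def upload(name, document):
    """Return an HTTP status code for an upload attempt."""
    digest = content_digest(document)
    previous = _uploads.get(name)
    if previous is None:
        _uploads[name] = digest
        return 201  # New collection created.
    if previous == digest:
        return 200  # Same content as before: tolerate the repeat.
    return 400  # Same name, different content.
```

With this approach, a client that never saw the first response can safely send the same request again and get a successful answer.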

Once all of the validation is complete, the work can be performed. This should always be delegated to objects and methods of the model. The view functions in Flask should not be part of the problem domain. They are focused on deserializing requests, validation, and serializing responses.

User Password Hashes

See https://cloud.google.com/blog/products/gcp/12-best-practices-for-user-account

It's not apparent in the diagram, but the passwords are stored as hashes. The idea is to save a string that has an algorithm description, a "salt" value, and a digest of the salt and the user's password.

When the user attempts to make a request, they provide a candidate password. This will be in the Authorization header, and the whole request is encrypted using SSL. The hash algorithm and the salt from the original password are applied to the candidate password to compute a digest. If this candidate's digest matches the original password's digest, there's a very good possibility the passwords were identical.

(Engineering Sidebar. There's a tiny possibility of two distinct password strings colliding on a common hash summary value. For a hash with a 64-bit summary, the probability of a collision is surprisingly high. For a hash with a 256-bit summary, the probability of a collision is small. It's non-zero, but it's very small: less likely than an asteroid ending life on earth. This is why hash algorithms like MD5 and SHA1 are a bad idea. We show some MD5 examples in the text because they're shorter than the PBKDF2 with SHA256 that is used in practice.)
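One common way to put numbers on the sidebar is the "birthday" approximation, which estimates the chance that at least two values in a collection share a hash summary. This is a rough sketch assuming an ideal, uniformly distributed hash; the function name and the sample sizes are our own choices for illustration.

```python
import math


def collision_probability(n_items, bits):
    """Birthday approximation: 1 - exp(-n**2 / 2**(bits + 1))."""
    exponent = -(n_items ** 2) / 2 ** (bits + 1)
    # -expm1(x) is 1 - exp(x), computed accurately when x is tiny.
    return -math.expm1(exponent)
```

With about four billion hashed values, collision_probability(2**32, 64) is a little over 0.39, while collision_probability(2**32, 256) is far below anything we could observe. This is the sense in which a 64-bit summary is surprisingly weak.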

The "salt" value is a unique, random string that's used to further randomize the hash summaries. The MD5 hash of "Hunter2" is computed as follows:

>>> import hashlib
>>> hashlib.md5("Hunter2".encode("ascii")).hexdigest()
'5648f87c4bfdbe1edab312f2148261bc'

If all users have the same password, Hunter2, then all the hashes would also be identical.

If we use a unique salt for each user, we lose the ability to guess if two passwords are the same.

>>> hashlib.md5("salt1:Hunter2".encode("ascii")).hexdigest()
'16d54184d080a480fa0832f7260c280c'
>>> hashlib.md5("salt2:Hunter2".encode("ascii")).hexdigest()
'4b56aa025b617313b6c63c739d036b8c'

When we save the password, we save the algorithm, salt, and digest in a string punctuated by "$".

'md5$DFb5LLuA$e8bf3591e8f45483af00c16d85689159'

The algorithm and salt are used when testing a candidate password.
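The save-and-test cycle can be sketched with the standard library's PBKDF2 support. This is a minimal illustration, not the case study's code: the function names, the algorithm label "pbkdf2_sha256", the salt length, and the fixed iteration count are all assumptions made for this example.

```python
import base64
import hashlib
import hmac
import os

ITERATIONS = 100_000  # Assumed work factor, for illustration only.


def hash_password(password, salt=None):
    """Return an 'algorithm$salt$digest' string for the password."""
    if salt is None:
        # A fresh random salt; base64 text never contains "$".
        salt = base64.b64encode(os.urandom(12)).decode("ascii")
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt.encode("ascii"), ITERATIONS
    )
    return f"pbkdf2_sha256${salt}${digest.hex()}"


def check_password(candidate, stored):
    """Rehash the candidate with the stored salt and compare the results."""
    algorithm, salt, _digest = stored.split("$")
    if algorithm != "pbkdf2_sha256":
        raise ValueError(f"unsupported algorithm: {algorithm}")
    # compare_digest avoids leaking information through timing.
    return hmac.compare_digest(hash_password(candidate, salt), stored)
```

Note that check_password() never recovers the original password. It can only confirm that a candidate produces the same digest when combined with the same salt.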

We use hashes of passwords because a hash can't be reversed. This parallels the way a number like 42 is the product of two numbers: given only 42, we can't determine the original pair of numbers. The original value could be 6*7 or 2*21 or 14*3. (This example also shows how hash collisions arise with a too-simple hashing algorithm.)

This leads to some important security rules that can't be emphasized enough.

A web site should never be able to recover a password. It must be impossible. Not difficult. Not "requires special privileges." It needs to be "no amount of computing can guess the password."

A web site should never coach you on a password that's "too similar" to your previous password. This is a red flag that passwords are stored (and can be compromised); you should log off, delete cookies, and run a virus scan.

Adding HTTPS to Flask

We'll make a small change to our server to create a secure connection. This is a minimal change, and doesn't work well outside the desktop testing environment. It does, however, let us get started down the path of creating a secure service.

We'll add these lines to our classifier.py module.

if __name__ == "__main__":
    app.run(ssl_context='adhoc')

This will make use of a werkzeug.serving feature. We'll ask werkzeug to create a temporary certificate to force HTTPS negotiation.

Because the certificate is local to the server, and not visible to the client, this isn't useful outside a quick test or two.

This changes the way we start our server.

(CaseStudy) slott@MacBookPro-SLott ch_04 % export PYTHONPATH=src
(CaseStudy) slott@MacBookPro-SLott ch_04 % python -m classifier 
 * Serving Flask app "classifier" (lazy loading)
 * Environment: development
 * Debug mode: on
 * Running on https://127.0.0.1:5000/ (Press CTRL+C to quit)
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 200-216-223
127.0.0.1 - - [11/Oct/2020 09:53:12] "GET /health HTTP/1.1" 200 -

We've used python -m classifier to use the app.run() method for starting our application.

Note the change in the base URL: https://127.0.0.1:5000/ We've switched from HTTP to HTTPS.

We'll have to make a slight change to the way we use curl, also.

curl -k -w 'status: %{response_code}'  https://127.0.0.1:5000/health

The -k option is required so that curl will tolerate the self-signed, "ad-hoc" certificate that werkzeug created for us. Self-signed certificates are potentially untrustworthy, so the default is to treat them as a symptom of a "Man-In-The-Middle" hack where requests are being redirected to unknown servers.
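A Python client faces the same choice. The following sketch uses only the standard library's ssl and urllib.request modules; the function names are our own, and the URL is the health check used above. The unverified context is the rough equivalent of curl -k and is suitable only for local testing.

```python
import ssl
import urllib.request


def verifying_context(cafile=None):
    """The default: verify the server's certificate and host name."""
    return ssl.create_default_context(cafile=cafile)


def unverified_context():
    """Like curl -k: accept any certificate. For local testing only."""
    context = ssl.create_default_context()
    context.check_hostname = False  # Must be turned off before verify_mode.
    context.verify_mode = ssl.CERT_NONE
    return context


def get_health(url="https://127.0.0.1:5000/health", context=None):
    """Fetch the URL and return (status, body)."""
    with urllib.request.urlopen(url, context=context) as response:
        return response.status, response.read()
```

Calling get_health(context=unverified_context()) against the test server should tolerate the ad-hoc certificate. With the verifying context, the request is expected to fail with a certificate verification error, which is the safe default.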

Using curl -kv will show the following interaction between the client and the server.

*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 5000 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: O=Dummy Certificate; CN=*
*  start date: Oct 11 13:43:56 2020 GMT
*  expire date: Oct 11 13:43:56 2021 GMT
*  issuer: O=Dummy Certificate; CN=*
*  SSL certificate verify result: self signed certificate (18), continuing anyway.

This line is very important:

* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384

The client and server agreed on TLS 1.2, which is a trusted, secure protocol.

The applications (our Flask server and curl) did not agree to an application layer protocol. The final ALPN message reflects the way we're using Flask directly. If we use Gunicorn or NGINX as a container for our Flask application, this additional layer will participate in the application layer protocol negotiation.

This line is a consequence of using the -k option.

*  SSL certificate verify result: self signed certificate (18), continuing anyway.

The curl client could not validate the signature, but the -k option let the transaction proceed.

Creating and sharing a certificate between client and server is a better idea, but beyond the scope of this chapter. For more information, see the werkzeug.serving.make_ssl_devcert() function, which will create a certificate file and a key file. The certificate can be used by the client as well as the server.