There are three topics here:
An overview of web services in general.
How user password hashes work.
How to enable HTTPS for local testing.
We'll start by looking at an activity diagram to show the general process of making a RESTful request.
The REST concept is "Representational State Transfer". We're exchanging a representation of object state between two independent processes. Often the state of an object is summarized into JSON notation, allowing the receiver to build a similar object, given the serialized state.
The REST protocols are layered on top of the existing HTTPS protocols, so we can use all of our favorite libraries to build RESTful APIs. There's going to be one important difference, however.
The HTTPS protocols are "stateless": each request must be viewed independently of all other requests. When interacting with a web site that uses a login, the notion of state is handled via "cookies". Each response includes packets of data for the browser to save; each request includes packets of data that are sent back. Once a user authenticates by logging in, the web application uses the cookies to track session state. Each response records the state of the session in cookies; each request uses the cookies to rebuild the session information. This makes the user feel like they're engaging in a long, personal dialog. In fact, it's a series of independent actions with history tracked via cookies.
RESTful APIs are stateless. They don't use cookies. Instead, each request will include user authentication credentials. There are a variety of ways to handle the credential processing, including:
Use HTTPS to assure the exchange is secure, and provide the credentials in the HTTP Authorization header.
Use a second service to validate credentials and receive an identity token. Use this identity token with each request. The credentials check has to involve a certificate with public key information so the identity information is encrypted in a way that only the server can decrypt. The identity token is a hash that must be confirmed as part of processing each request. Typically, the tokens expire, and the user's credentials must be supplied again. A mobile app or JavaScript-based web site can provide the login credentials periodically to refresh the token.
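As a minimal sketch of the first option, the snippet below builds and parses an HTTP Basic Authorization header using only the standard library. Basic authentication is just one scheme that could be used here, and the function names are hypothetical. Note that base64 is an encoding, not encryption, which is why the HTTPS layer is essential.

```python
import base64
from typing import Dict, Tuple


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """Build the Authorization header for HTTP Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def parse_basic_auth(header_value: str) -> Tuple[str, str]:
    """Recover (username, password) from a Basic Authorization header value."""
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError(f"unsupported scheme: {scheme!r}")
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Split on the first colon only; the password may itself contain colons.
    username, _, password = decoded.partition(":")
    return username, password
```

A client would merge the returned dictionary into its request headers; the server side would call something like parse_basic_auth() before checking the password against a stored hash.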
We'll focus on using HTTPS. This requires providing an appropriate certificate to secure the web server.
This leads to RESTful API processing that proceeds as shown in this sequence diagram.
The boxes are objects that are part of our application. The lines below are "lifelines" showing activities the object engages in. The requests are solid lines and the responses are dashed lines. Some of these are network requests using HTTPS protocols. Some of these are methods of one object being invoked and returning an object.
A request originates from a "Client". This could be a mobile app, or a browser, or another web service making a RESTful API request. The request uses HTTPS protocols, and the server will provide a public key as part of the negotiation for a secure connection. This is (almost) completely transparent. We do need to provide the certificate and key pair as part of configuring the server. Good practices suggest the keys be rotated periodically.
The client's request will be encrypted using the public key. Only the server's private key can decrypt the request.
When we write our Flask application, we'll provide a Python decorator on each view function. This decorator will validate the Authorization header to be sure that it has a known username and the user's password. If the Authorization header is missing, or a hash of the user's credentials is invalid, the response is an HTTP 401 status code to indicate that the user could not be authenticated.
(And yes, the Authorization header seems to have the wrong name: it carries authentication credentials.)
If the Authorization header is valid, the decorator will invoke the view function. The view function will use the model classes to process the request. This might be an upload of training data, a request for a test, or a request for classification. We have two different classes of users, so we must assure that the user is authorized to perform the requested action. If they're not authorized, we can return a 403 status code. The distinction is nuanced: a 401 reports a missing or bad Authorization header, while a 403 reports that a known user asked for a resource they may not access.
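The 401 and 403 distinction can be sketched without Flask. The following is a framework-free illustration, not the application's actual decorator: the user table, the role names, and the function names are all hypothetical, and the view receives a plain dictionary of headers. In a Flask application the wrapper would read the headers from the request object instead.

```python
import base64
import hashlib
import hmac
from functools import wraps
from typing import Callable, Dict, Optional


def _digest(password: str, salt: bytes) -> bytes:
    """Salted digest of a password (low iteration count, for the sketch only)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 1_000)


# Hypothetical user table: username -> (role, salt, digest). No passwords are kept.
USERS = {
    "noriko": ("botanist", b"salt1", _digest("Hunter2", b"salt1")),
    "guest": ("user", b"salt2", _digest("letmein", b"salt2")),
}


def authenticate(headers: Dict[str, str]) -> Optional[str]:
    """Return the username for a valid Basic Authorization header, else None."""
    scheme, _, encoded = headers.get("Authorization", "").partition(" ")
    if scheme != "Basic":
        return None
    try:
        decoded = base64.b64decode(encoded).decode("utf-8")
    except ValueError:  # covers bad base64 and bad UTF-8
        return None
    username, _, password = decoded.partition(":")
    if username not in USERS:
        return None
    _, salt, expected = USERS[username]
    if hmac.compare_digest(_digest(password, salt), expected):
        return username
    return None


def requires_role(role: str) -> Callable:
    """Decorator: 401 if not authenticated, 403 if the role does not match."""
    def decorator(view: Callable) -> Callable:
        @wraps(view)
        def wrapper(headers: Dict[str, str], *args, **kwargs):
            username = authenticate(headers)
            if username is None:
                return {"error": "not authenticated"}, 401
            if USERS[username][0] != role:
                return {"error": "not authorized"}, 403
            return view(username, *args, **kwargs)
        return wrapper
    return decorator


@requires_role("botanist")
def upload_training_data(username: str, document: dict):
    """Hypothetical view; the real work would be delegated to the model."""
    return {"status": "accepted", "by": username}, 200
```

Keeping the authentication check separate from the role check makes it straightforward to return the two different status codes.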
If the authentication and authorization checks have passed, then the document that's part of the request must also be validated. It should be a JSON-format document. It should have certain fields, and the fields should have specific types of values. This is often handled by the JSONSchema package. If the data is invalid, this often leads to a 400 status code: the request can't be processed.
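As a rough illustration of this validation step, here is a hand-rolled check using only the standard library. It stands in for what a schema package would do more thoroughly. The required field names, the response bodies, and the function names are assumptions made for this sketch.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"name": str, "samples": list}


def validate_document(body: str) -> Tuple[Dict[str, Any], List[str]]:
    """Return (document, errors); an empty error list means the body is usable."""
    try:
        document = json.loads(body)
    except json.JSONDecodeError as ex:
        return {}, [f"not valid JSON: {ex.msg}"]
    if not isinstance(document, dict):
        return {}, ["top-level value must be a JSON object"]
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in document:
            errors.append(f"missing field: {field}")
        elif not isinstance(document[field], expected_type):
            errors.append(f"{field} must be of type {expected_type.__name__}")
    return document, errors


def handle_upload(body: str) -> Tuple[Dict[str, Any], int]:
    """Map validation problems to a 400 response before any real work starts."""
    document, errors = validate_document(body)
    if errors:
        return {"errors": errors}, 400
    return {"status": "accepted", "name": document["name"]}, 200
```

Returning the full list of problems, rather than stopping at the first one, gives the client a better chance of fixing the document in one pass.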
There may be additional checks to avoid duplicate collections of training data. There are several ways to respond to attempts to create duplicate data. Some servers will respond with a 400 status code to prevent problems. A more helpful approach is to examine the request and the existing data to see if the request is a simple duplication, and tolerate a second upload that matches a previous upload. It's common for a response to be lost, or for the response to be unclear, so the user clicks the upload button again.
Once all of the validation is complete, the work can be performed. This should always be delegated to objects and methods of the model. The view functions in Flask should not be part of the problem domain. They are focused on deserializing requests, validation, and serializing responses.
See https://cloud.google.com/blog/products/gcp/12-best-practices-for-user-account
It's not apparent in the diagram, but the passwords are stored as hashes. The idea is to save a string that has an algorithm description, a "salt" value, and a digest of the salt and the user's password.
When the user attempts to make a request, they provide a candidate password. This will be in the Authorization header, and the whole request is encrypted using TLS (often still called SSL). The hash algorithm and the salt saved with the original password are applied to the candidate password. If the candidate's digest matches the original password's digest, there's a very good possibility the passwords were identical.
(Engineering Sidebar. There's a tiny possibility of two distinct password strings colliding on a common hash summary value. For a hash with a 64-bit summary, the probability of a collision is surprisingly high. For a hash with a 256-bit summary, the probability of a collision is small. It's non-zero, but it's very small: less likely than an asteroid ending life on earth. This is why hash algorithms like MD5 and SHA1 are a bad idea. We show some MD5 examples in the text because they're shorter than the PBKDF2 with SHA256 that is used in practice.)
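To put rough numbers on the sidebar, the standard birthday-bound approximation estimates the chance that at least two of n hashed items share a digest. This sketch assumes the hash behaves like a uniform random function; the function name is hypothetical.

```python
import math


def collision_probability(n_items: int, digest_bits: int) -> float:
    """Birthday-bound approximation: 1 - exp(-(n(n-1)/2) / 2**bits)."""
    pairs = n_items * (n_items - 1) // 2
    expected_collisions = pairs / 2**digest_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-expected_collisions)
```

Under these assumptions, about four billion items (2**32) hashed to 64 bits give a collision probability of roughly 0.39, while the same number of items hashed to 256 bits give a probability far below anything of practical concern.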
The "salt" value is a unique, random string that's used to further randomize the hash summaries. The MD5 hash of "Hunter2" is computed as follows:
>>> import hashlib
>>> hashlib.md5("Hunter2".encode("ascii")).hexdigest()
'5648f87c4bfdbe1edab312f2148261bc'
If all users have the same password, Hunter2, then all the hashes would also be identical.
If we use a unique salt for each user, we lose the ability to guess if two passwords are the same.
>>> hashlib.md5("salt1:Hunter2".encode("ascii")).hexdigest()
'16d54184d080a480fa0832f7260c280c'
>>> hashlib.md5("salt2:Hunter2".encode("ascii")).hexdigest()
'4b56aa025b617313b6c63c739d036b8c'
When we save the password, we save the algorithm, salt, and digest in a string punctuated by "$".
'md5$DFb5LLuA$e8bf3591e8f45483af00c16d85689159'
The algorithm and salt are used when testing a candidate password.
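The save-and-check cycle can be sketched with PBKDF2 and SHA256 from the standard library. This is an illustration under stated assumptions rather than the application's actual code: the function names, the "pbkdf2_sha256" label, the fixed iteration count, and the salt length are all choices made for this example.

```python
import hashlib
import hmac
import secrets
from typing import Optional

ITERATIONS = 100_000  # assumed work factor for this sketch


def hash_password(password: str, salt: Optional[str] = None) -> str:
    """Return 'algorithm$salt$digest' for the given password."""
    if salt is None:
        salt = secrets.token_hex(8)  # fresh random salt for each saved password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt.encode("utf-8"), ITERATIONS
    )
    return f"pbkdf2_sha256${salt}${digest.hex()}"


def check_password(candidate: str, stored: str) -> bool:
    """Re-hash the candidate with the stored salt and compare the results."""
    algorithm, salt, _ = stored.split("$")
    if algorithm != "pbkdf2_sha256":
        raise ValueError(f"unsupported algorithm: {algorithm!r}")
    recomputed = hash_password(candidate, salt)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(recomputed, stored)
```

Saving the same password twice produces two different strings because each call picks a new salt, yet check_password() accepts the original password against either one.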
We use hashes of passwords because a hash can't be reversed. This parallels the way a number like 42 can be the product of two numbers: given only 42, we can't determine the original pair. The original value could be 6*7 or 2*21 or 14*3. (This example also shows how hash collisions arise with a too-simple hashing algorithm.)
This leads to some important security rules that can't be emphasized enough.
Never store passwords.
Never store reversibly encrypted passwords. If the key is compromised, then all passwords are lost.
Only store hashes of passwords.
A web site should never be able to recover a password. It must be impossible. Not difficult. Not "requires special privileges." It needs to be "no amount of computing can guess the password."
A web site should never coach you on a password that's "too similar" to your previous password. This is a red flag that passwords are stored (and can be compromised); you should log off, delete cookies, and run a virus scan.
We'll make a small change to our server to create a secure connection. This is a minimal change, and doesn't work well outside the desktop testing environment. It does, however, let us get started down the path of creating a secure service.
We'll add these lines to our classifier.py module.
if __name__ == "__main__":
    app.run(ssl_context='adhoc')
This will make use of a werkzeug.serving feature. We'll ask werkzeug to create a temporary certificate to force HTTPS negotiation.
Because the certificate is local to the server, and not visible to the client, this isn't useful outside a quick test or two.
This changes the way we start our server.
(CaseStudy) slott@MacBookPro-SLott ch_04 % export PYTHONPATH=src
(CaseStudy) slott@MacBookPro-SLott ch_04 % python -m classifier
* Serving Flask app "classifier" (lazy loading)
* Environment: development
* Debug mode: on
* Running on https://127.0.0.1:5000/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger PIN: 200-216-223
127.0.0.1 - - [11/Oct/2020 09:53:12] "GET /health HTTP/1.1" 200 -
We've used python -m classifier to use the app.run() method for starting our application.
Note the change in the base URL: https://127.0.0.1:5000/
We've switched from HTTP to HTTPS.
We'll have to make a slight change to the way we use curl, also.
curl -k -w 'status: %{response_code}' https://127.0.0.1:5000/health
The -k option is required so that curl will tolerate the self-signed, "ad-hoc" certificate that werkzeug created for us. Self-signed certificates are potentially untrustworthy, so the default is to treat them as a symptom of a "Man-In-The-Middle" hack where requests are being redirected to unknown servers.
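A Python client needs a similar opt-out to talk to a server with a self-signed certificate. The following sketch uses the standard library's ssl and urllib.request modules; the function names are hypothetical, and like curl -k, this is only appropriate for local testing.

```python
import ssl
import urllib.request


def unverified_context() -> ssl.SSLContext:
    """TLS context that skips certificate checks, like curl -k. Testing only."""
    context = ssl.create_default_context()
    # Order matters: hostname checking must be off before verify_mode is relaxed.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def get_status(url: str) -> int:
    """Fetch url without verifying the server certificate; return the HTTP status."""
    with urllib.request.urlopen(url, context=unverified_context(), timeout=5) as response:
        return response.status
```

With the server from this section running, get_status("https://127.0.0.1:5000/health") would make the same request as the curl command above. The traffic is still encrypted, but the client has no assurance about who is on the other end.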
Using curl -kv will show the following interaction between the client and the server.
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 5000 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/cert.pem
CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
* subject: O=Dummy Certificate; CN=*
* start date: Oct 11 13:43:56 2020 GMT
* expire date: Oct 11 13:43:56 2021 GMT
* issuer: O=Dummy Certificate; CN=*
* SSL certificate verify result: self signed certificate (18), continuing anyway.
This line is very important:
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
The client and server agreed on TLS 1.2, which is a trusted, secure protocol.
The applications (our Flask server and curl) did not agree to an application layer protocol. The final ALPN message reflects the way we're using Flask directly. If we use Gunicorn or NGINX as a container for our Flask application, this additional layer will participate in the application layer protocol negotiation.
This line is a consequence of using the -k option.
* SSL certificate verify result: self signed certificate (18), continuing anyway.
The curl client could not validate the signature, but the -k option let the transaction proceed.
Creating and sharing a certificate between client and server is a better idea, but beyond the scope of this chapter. For more information, see the werkzeug.serving.make_ssl_devcert() function, which will create a certificate file and a key file for the server; the certificate file can also be given to the client.