Deployment good practices

Kinto is a Python Web application that provides storage as a service.

It relies on 3 vital components:

  • A Web stack;
  • A database;
  • An authentication service.

This document describes a strategy to deploy a full stack with the following properties:

  • Fail-safe: respond in a way that causes a minimum of harm in case of failure;
  • Consistency: all nodes see the same data at the same time;
  • Durability: data of successful requests remains stored.

Even though it is related, this document does not cover the properties of the Kinto protocol (client race conditions etc.).

Python stack

High-availability

  • At least two nodes (e.g. Linux boxes)
  • A load balancer that spreads requests across the nodes (e.g. HAProxy)
  • Each node runs several WSGI process workers (e.g. uWSGI; see the sample configuration below)
  • Each node runs an HTTP reverse proxy that spreads requests across the workers (e.g. Nginx)
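
For illustration, a minimal uWSGI configuration for one node could look like the following sketch; the entry point, socket path and worker count are assumptions to adapt to the actual deployment:

  [uwsgi]
  # Hypothetical WSGI entry point for the Kinto application
  wsgi-file = app.wsgi
  master = true
  # Several worker processes per node; tune to the number of CPU cores
  processes = 4
  enable-threads = true
  # Unix socket that the local Nginx reverse proxy forwards requests to
  socket = /var/run/uwsgi/kinto.sock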

Vertical scaling:

  • Increase size of nodes
  • Increase number of WSGI processes

Horizontal scaling:

  • Increase number of nodes

Fail safe

WSGI process crash:

  • 503 error + Retry-After response header (see the setting below)
  • Sentry report
  • uWSGI respawns the dead worker process (and uWSGI itself can be supervised and respawned, via systemd for example)
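
The value sent in the Retry-After header can be tuned in the Kinto configuration; the snippet below is a sketch and the setting name should be checked against the deployed Kinto version:

  # Seconds that clients are asked to wait before retrying after a 503
  kinto.retry_after_seconds = 30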

Reverse proxy crash:

  • The load balancer blacklists the node

If the load balancer or all nodes are down, the service is down.

Consistency

Every worker on every node is configured with the same database DSN.
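
For example, with the PostgreSQL backend described in the Database section, every node would share settings along these lines (host name and credentials are placeholders):

  kinto.storage_backend = kinto.core.storage.postgresql
  kinto.storage_url = postgres://user:password@db.internal:5432/kinto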

See the Database section below for more details.

Configuration change

Application:

  • Modify configuration file
  • Reload workers gracefully (see the example below)
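
With uWSGI, for instance, a graceful reload can be triggered with the touch-reload option (the watched path is arbitrary):

  [uwsgi]
  # Touching this file reloads the workers gracefully, without dropping in-flight requests
  touch-reload = /etc/kinto/kinto.ini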

Reverse proxy:

  • Disable node in load balancer
  • Restart reverse proxy
  • Enable node in load balancer

Load balancer:

  • See the Scheduled down time section below

Database

Kinto can be configured to persist data in several kinds of storage.

PostgreSQL is the one that we chose at Mozilla, mainly because:

  • It is a mature and standard solution;
  • It supports sorting and filtering of JSONB fields;
  • It has an excellent reputation for data integrity.

High-availability

Deploy a PostgreSQL cluster:

  • A leader («master»);
  • One or more replication followers («slaves»);
  • A load balancer that routes queries to take advantage of the cluster (e.g. pgPool).

Writes are sent to the master, and reads are sent to the master and slaves that are up-to-date.
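
As an illustration, a pgPool-II configuration routing queries to such a cluster could contain settings like these; host names are placeholders and option names vary between pgPool versions:

  # Leader («master»)
  backend_hostname0 = 'pg-master.internal'
  backend_port0 = 5432
  # Replication follower («slave»)
  backend_hostname1 = 'pg-slave1.internal'
  backend_port1 = 5432
  # Send read-only queries to the followers as well
  load_balance_mode = on
  master_slave_mode = on
  master_slave_sub_mode = 'stream'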

Vertical scaling:

  • Increase size of nodes (RAM+#CPU)
  • Increase shared_buffers and work_mem

Horizontal scaling:

  • Increase number of nodes

Performance

  • RAID
  • Volatile data on SSD (indexes)
  • Storage on HDD
  • shared_buffers controls how much memory is used to cache table and index data
  • work_mem controls the memory available for sorts and joins (per operation, per connection); see the example values below
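
Illustrative postgresql.conf values for a node with 16 GB of RAM; these are starting points only, to adjust to the actual workload:

  # Roughly 25% of RAM is a common starting point
  shared_buffers = 4GB
  # Memory for each sort or hash operation, allocated per connection
  work_mem = 16MB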

Connection pooling:

  • via the load balancer (e.g. pgPool)
  • via Kinto (see the setting below)
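
On the Kinto side, the size of the connection pool is configurable; the setting below is a sketch whose name and default should be checked against the deployed version:

  # Maximum number of PostgreSQL connections kept open by each worker process
  kinto.storage_pool_size = 25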

Fail safe

If the master fails, one slave can be promoted to be the new master.

Database crash:

  • Restore the database from the last scheduled backup
  • Replay the WAL files archived since that backup (requires WAL archiving, see below)
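
Restoring WAL files implies that they were archived in the first place, which is enabled in postgresql.conf with something like the following (the archive destination is a placeholder):

  archive_mode = on
  # Copy each completed WAL segment to a safe location
  archive_command = 'cp %p /mnt/wal_archive/%f'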

Consistency

  • The master streams its WAL to the slaves (see below)
  • Slaves are removed from the load balancer until their data is up to date with the master
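
On the PostgreSQL side, streaming replication is typically enabled with settings along these lines in postgresql.conf; exact values depend on the PostgreSQL version and the number of followers:

  # Write enough information to the WAL for the followers (on the master)
  wal_level = replica
  # Allow the followers to connect and stream the WAL (on the master)
  max_wal_senders = 5
  # Allow read-only queries while replaying the WAL (on the followers)
  hot_standby = on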

Durability

  • ACID
  • WAL for transactions
  • pg_dump export :)

Pooling

  • automatic refresh of connections (TODO in Kinto)

Sharding

  • Shard by buckets+collections or by user id?

Via pgPool:

  • Flexible
  • Tedious to configure

Via Kinto code:

  • Not implemented yet
  • Batteries included (via INI configuration)

Using Amazon RDS

  • Consistency/Availability/Durability are handled by PostgreSQL RDS
  • Use ElastiCache for Redis
  • Use an EC2 instance with uWSGI and Nginx deployed
  • Use Route 53 for load balancing

Authentication service

Each request contains an Authorization header that needs to be verified by the authentication service.

In the case of Mozilla, Kinto is plugged into the Firefox Accounts OAuth service, as sketched below.
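
A minimal sketch of that integration in the Kinto configuration, assuming the kinto_fxa plugin; setting names may vary between plugin versions and the client id/secret are placeholders:

  kinto.includes = kinto_fxa
  fxa-oauth.oauth_uri = https://oauth.accounts.firefox.com/v1
  fxa-oauth.client_id = <client-id>
  fxa-oauth.client_secret = <client-secret>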

Fail safe

With the Firefox Accounts policy, token verifications are cached for a configurable amount of time:

fxa-oauth.cache_ttl_seconds = 300  # 5 minutes

If the remote service is down, the cache will allow the authentication of known tokens for a while. However, requests presenting new tokens will receive a 401 or 503 error response.

Scheduled down time

  • Change the Backoff setting in the application configuration (see the example below)
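
For example, before a scheduled down time, clients can be told to slow down; the value below is only illustrative:

  # Ask well-behaved clients to wait this many seconds between requests
  kinto.backoff = 3600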