We discussed the problem of reducing SPOF in a container platform.
Any 1 system has at least 1 SPOF (direct like a machine, or abstract like a cluster). A good system has a SPOF that (A) causes a minimal functionality degradation, and (B) is easily replaceable.
Corollary 1: most work should be owned “below” the SPOF.
Corollary 2: subunits should be able to approximate as many of the SPOF’s functions as possible. For example, subunits should be able to directly receive traffic if traffic is directed there, rather than requiring the central load balancer to be able to function.
Corollary 3: state in SPOFs should be avoided. State is anti-portability and therefore anti-replacement.
We validated the federation model (of distinct functional clusters with a central control plane abstraction). This model has been used in many platforms (Borg, Kubernetes, Titus, etc) with mixed usefulness.
We identified a key weakness of federation usability is lack of federation-wide awareness. For example, some workloads we want spread out, and others we want tightly knit or close to something else. An abstract concept of topology could be used for workload affinity or anti-affinity. This could help replace much deployment logic for administrators.
For example:
* We have a database that we wish to spread between clusters (cluster anti-affinity)
* We have an app that we want to serve to global users (geographic anti-affinity)
* We have a job that we want to schedule close to a specific service (workload affinity)
* We have a mapreduce job that we want to keep close together (geographic affinity)
In all of these cases, the typical solution is to have an administrator choose a specific cluster (or set) to deploy to, despise having no strict reason to deploy explicitly (and only) there.
More technical and use case research is needed.
PS – I’d be happy to put contact information out for anyone working on functionality like this.