At KubeCon EU 2018, Red Hat eagerly announced the open sourcing of the Operator Framework. There’s a lot of buzzword bingo these days, but as someone who helped pioneer the concept at CoreOS, I feel I can give a fair definition:

An Operator is a Kubernetes pattern that extends the Kubernetes control plane with a custom Controller and Custom Resource Definitions that encode additional operational knowledge of an application.

Application-specific operational knowledge is often omitted from this definition, and without it Operators are indistinguishable from any other custom Kubernetes Controllers. A great example of this concept (and a bad example of code to read) is the original Operator: the etcd Operator. The etcd Operator extends Kubernetes to handle operations such as backing up and resizing etcd clusters. These are the kinds of tasks that normally have humans paged at 4AM, following written playbooks to fix a deployment. The CoreOS philosophy was to automate solutions that eliminate operational toil going forward. If you’ve resized the cluster once, you’re going to need to resize it a million times. Why not teach the computer to do it for you?
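To make that concrete, here’s roughly the custom resource the etcd Operator watches (a sketch based on the etcd-operator project; exact field names and API versions vary by release):

```yaml
# A declarative etcd cluster, reconciled by the etcd Operator's
# custom Controller instead of a human with a playbook.
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  name: example-etcd-cluster
spec:
  size: 3           # desired number of etcd members
  version: "3.2.13" # desired etcd version
```

Resizing stops being a 4AM playbook: bump `size` to 5, `kubectl apply`, and the Operator handles the member-add dance for you.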

Now that the concept of an Operator is well defined, what’s the framework? The Operator Framework has three pillars (product managers love having three pillars): Operator SDK, Operator Lifecycle Manager, and Operator Metering. From their names, it’s easy to deduce that the Operator SDK is tooling to help you build new Operators and that Operator Metering improves the visibility of resource usage, but what the hell does an Operator Lifecycle Manager do?

There are two problems in computer science: there’s only one joke, and it isn’t funny.

Operator Lifecycle Manager is poorly named. If by the end of this article, you both understand OLM and can think of a better name, please let Red Hat know. OLM does a lot for Operator authors, but it also solves an important problem that not many people have thought about yet: how do you effectively manage first-class extensions to Kubernetes over time?

Before there were robust primitives for extending Kubernetes, one had to literally fork the controller manager to add functionality; this is the legacy incurred by Red Hat’s Kubernetes distribution, OpenShift, for being early to the game. Around the time CoreOS started doing more work on extending Kubernetes in its own distribution, Tectonic, Kubernetes had already developed naive forms of these primitives. The core of Tectonic was entirely powered by Operators. Operators were used to literally upgrade Kubernetes out from underneath itself. If you think this is getting meta, we’re just getting started.

Remember earlier how I said that the CoreOS philosophy was to automate? When a bunch of logic for your Kubernetes distribution is implemented as Operators, you need to somehow automate managing them. Can you see where I’m going with this? The only logical step forward is to begin writing Operators to manage Operators.

OLM originally shipped in Tectonic before being open sourced as a part of the Operator Framework. OLM was the meta-Operator solution for services that Tectonic exposed to end-users (e.g. Prometheus, etcd, Vault). OLM organizes its functionality into two Operators that run on the Kubernetes cluster.

The first is named the “OLM” Operator and manages a Kubernetes resource, ClusterServiceVersion, that represents an Operator. When a CSV (what an unfortunate acronym) is applied to the cluster, the OLM Operator gates the deployment of an Operator until all of its requirements are present. For example, the etcd Operator shouldn’t be running if the cluster doesn’t have an EtcdCluster CRD registered. The OLM Operator also orchestrates the finer details of upgrading a running Operator to a newer version, ensuring safety for all of the objects that the previous Operator was managing. The guiding principle of the OLM Operator’s design is to be similar to dpkg -i, which installs a .deb that’s already on your machine.
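A heavily trimmed ClusterServiceVersion sketch shows where that requirement gating and upgrade orchestration live (the real schema has many more fields, and the API group/version shown is an assumption that may differ across OLM releases):

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: etcdoperator.v0.9.2
spec:
  displayName: etcd
  version: 0.9.2
  # Upgrade orchestration: OLM hands the objects managed by the CSV
  # named here over to this newer Operator before retiring the old one.
  replaces: etcdoperator.v0.9.0
  customresourcedefinitions:
    # Requirement gating: the Operator's Deployment is not created until
    # the CRDs it owns (and any it requires from other Operators) exist.
    owned:
    - name: etcdclusters.etcd.database.coreos.com
      version: v1beta2
      kind: EtcdCluster
  install:
    strategy: deployment
    spec:
      deployments:
      - name: etcd-operator
        # the Operator's ordinary Deployment spec is elided here
```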

The second Operator is perhaps better named, but far too easy to conflate with other technologies: the “catalog” Operator. This has nothing to do with Service Catalog, a project that implements the Open Service Broker API for Kubernetes. What OLM’s catalog Operator does is automate the plumbing that satisfies the requirements the “OLM” Operator enforces. Users create Subscriptions to Operators, which encode upgrade policies such as “keep this namespace up to date with the alpha channel of the etcd Operator”. Subscriptions are realized through InstallPlans, which resolve dependencies and find all the resources that would have to be applied to the cluster in order to install a new Operator. InstallPlans aren’t meant solely for use by Subscriptions; users can also create them manually. Think of the catalog Operator as apt-get upgrade: it prompts you with what needs to be applied to the cluster, and if you say yes, it installs all of those packages.
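Here’s what a Subscription might look like, with the apt-get-style prompt expressed as a manual approval policy (a sketch against OLM’s v1alpha1 API; field names may have changed, and the catalog names are hypothetical):

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd
  namespace: default
spec:
  name: etcd                  # package to track in the catalog
  channel: alpha              # "keep this namespace up to date with alpha"
  source: my-catalog          # a CatalogSource name (hypothetical)
  sourceNamespace: olm        # namespace of that CatalogSource (assumption)
  installPlanApproval: Manual # gate changes behind human approval
```

With `Manual` approval, the catalog Operator writes out an InstallPlan enumerating every resource it resolved but applies nothing until you approve the plan; that’s the “if you say yes” step.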

Now, I know what you’re thinking: OLM is just the Operator Package Manager. I think that’s a reasonable analogy for describing OLM today. But nothing about OLM precludes it from managing any custom Controller or API server.

If you squint your eyes, everything in Kubernetes is a Resource managed by a Controller and OLM manages installing Resources and their Controllers.

I’m not suggesting you package up every application as an Operator for OLM; not every binary on your laptop is packaged as a .deb. But if you’re fundamentally extending Kubernetes, you should lean on OLM to guide you through the logistics of deploying your Operator. Even if you’re not writing Operators, OLM is the best way to manage collections of Resources and Controllers, such as those required by complex projects like OpenShift, Knative, Istio, and whatever else comes along.

Follow-Ups: