Tuesday, June 27, 2017

Configuring Some Key Production Settings for MongoDB on GKE Kubernetes

[Part 2 in a series of posts about running MongoDB on Kubernetes, on Google’s Container Engine (a.k.a. GKE). See the GitHub project gke-mongodb-demo for an example scripted deployment of MongoDB to GKE, that you can easily try yourself. The gke-mongodb-demo project combines the conclusions from all the posts in this series so far. Also see: http://k8smongodb.net/]


In the first part of my blog series I showed how to deploy a MongoDB Replica Set to GKE's Kubernetes environment, whilst ensuring that the replica set is secure by default and resilient to various types of system failures. As mentioned in that post, there are a number of other "production" considerations that need to be addressed when running MongoDB in Kubernetes and Docker environments. These considerations are primarily driven by the best practices documented in MongoDB’s Production Operations Checklist and Production Notes. In this blog post, I will address how to apply some (but not all) of these best practices on GKE's Kubernetes platform.

Host VM Modifications for Using XFS & Disabling Hugepages

For optimum performance, the MongoDB Production Notes strongly recommend applying the following configuration settings to the host operating system (OS):
  1. Use an XFS based Linux filesystem for WiredTiger data file persistence.
  2. Disable Transparent Huge Pages.
The challenge here is that neither of these settings can be configured directly within normally deployed pods/containers. Instead, they need to be set in the OS of each machine/VM that is eligible to host one or more pods and their containers. Fortunately, after a little googling I found a solution for incorporating XFS in the article Mounting XFS on GKE, which also provided the basis for deriving a solution for disabling Transparent Huge Pages.

It turns out that in Kubernetes, it is possible to run a pod (and its container) once per node (host machine), using a facility called a DaemonSet. A DaemonSet schedules a "special" container onto every node, including each newly provisioned node, where it can perform one-off setup work, ideally before the "normal" containers that depend on that setup are run there. In addition, for Docker based containers (the default on GKE Kubernetes), the container can be allowed to run in a privileged mode, which gives the "privileged" container access to other Linux namespaces running in the same host environment. With these heightened security rights, the "privileged" container can then run a utility called nsenter ("NameSpace ENTER") to spawn a shell using the namespaces belonging to the host OS's init process ("/proc/1"). The script that the shell runs can then perform essentially any root level action on the underlying host OS.
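As a rough illustration of the mechanism just described, the sketch below shows how a privileged container that shares the host's PID namespace might hand a script to the host OS. The function name run_on_host is hypothetical, and the exact nsenter invocation used by any particular image is an assumption; the flags themselves (-t for the target process, then -m, -u, -i, -n and -p for the mount, UTS, IPC, network and PID namespaces) are standard nsenter options.

```shell
#!/bin/bash
# Sketch only: this must run as root in a privileged container that shares
# the host's PID namespace, so that PID 1 is the host's init process.
run_on_host() {
  local script="$1"
  # Enter the mount, UTS, IPC, network and PID namespaces of PID 1,
  # then run the supplied script there with bash.
  nsenter -t 1 -m -u -i -n -p -- bash -c "$script"
}

# Example (hypothetical): show the host's current Transparent Huge Pages setting.
# run_on_host 'cat /sys/kernel/mm/transparent_hugepage/enabled'
```

Keeping the host-side script as a single string argument mirrors the way a script can be passed to a container through an environment variable.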

So with this in mind, the challenge is to build a Docker container image that, when run in privileged mode, uses "nsenter" to spawn a shell to run some shell script commands. As luck would have it, such a container has already been created, in a generic way, as part of the Kubernetes "contributions" project, called startup-script. The generated "startup-script" Docker image has been registered and made available in the Google Container Registry, ready to be pulled in and used by anyone's Kubernetes projects.

Therefore on GKE, to create a DaemonSet leveraging the "startup-script" image in privileged mode, we first need to define the DaemonSet's configuration:

$ cat hostvm-node-configurer-daemonset.yaml

kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
  name: hostvm-configurer
  labels:
    app: startup-script
spec:
  template:
    metadata:
      labels:
        app: startup-script
    spec:
      hostPID: true
      containers:
      - name: hostvm-configurer-container
        image: gcr.io/google-containers/startup-script:v1
        securityContext:
          privileged: true
        env:
        - name: STARTUP_SCRIPT
          value: |
            #! /bin/bash
            set -o errexit
            set -o pipefail
            set -o nounset
            # Disable hugepages
            echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
            echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag
            # Install tool to enable XFS mounting
            apt-get update || true
            apt-get install -y xfsprogs

At the base of the file, you will notice the commands used to disable Transparent Huge Pages and to install the XFS tools for formatting and mounting storage using the XFS filesystem. Further up the file is the reference to the third-party "startup-script" image from the Google Container Registry, along with the security context setting which states that the container should be run in privileged mode.

Next we need to deploy the DaemonSet with its "startup-script" container to all the hosts (nodes), before we attempt to create any GCE disks that need to be formatted as XFS:

$ kubectl apply -f hostvm-node-configurer-daemonset.yaml
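
Once the DaemonSet has been applied, it can be worth confirming that the Transparent Huge Pages setting actually took effect on a node. The kernel marks the active mode with square brackets (for example "always madvise [never]"), so a small helper can extract it. The helper name thp_mode is hypothetical; this is a minimal sketch rather than part of the demo project.

```shell
#!/bin/bash
# Print the active THP mode, given the contents of
# /sys/kernel/mm/transparent_hugepage/enabled (or .../defrag).
# The active value is the one wrapped in square brackets.
thp_mode() {
  sed -n 's/.*\[\(.*\)\].*/\1/p' <<< "$1"
}

# Example, run on a node (for instance over SSH):
# thp_mode "$(cat /sys/kernel/mm/transparent_hugepage/enabled)"
```

If the DaemonSet's script has run successfully on the node, the helper should report "never" for both files.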

In the GCE disk definitions, described in the first blog post in this series (i.e. "gce-ssd-persistentvolume?.yaml"), a new parameter ("fsType") needs to be added to indicate that the disk's filesystem type needs to be XFS:

apiVersion: "v1"
kind: "PersistentVolume"
metadata:
  name: data-volume-1
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: fast
  gcePersistentDisk:
    fsType: xfs
    pdName: pd-ssd-disk-1

Now in theory, this should be all that is required to get XFS working. Except on GKE, it isn't!

After deploying the DaemonSet and creating the GCE storage disks, the deployment of the "mongod" Service/StatefulSet will fail. The StatefulSet's pods do not start properly because the disks can't be formatted and mounted as XFS. It turns out that this is because, by default, GKE uses a variant of Chromium OS as the underlying host VM that runs the containers, and this OS flavour doesn't support XFS. However, GKE can also be configured to use a Debian based host VM OS instead, which does support XFS.

To see the list of host VM OSes that GKE supports, the following command can be run:

$ gcloud container get-server-config
Fetching server config for europe-west1-b
defaultClusterVersion: 1.6.4
defaultImageType: COS

Here, "COS" is the label for the Chromium OS based image (Container-Optimized OS) and "CONTAINER_VM", listed among the valid image types in the command's full output, is the label for the Debian OS. The easiest way to start leveraging the Debian OS image is to clear out all the GCE/GKE resources and the Kubernetes cluster from the current project and start the deployment all over again. This time, when the initial command is run to create the new Kubernetes cluster, an additional argument ("--image-type") must be provided to specify that the Debian OS should be used for each host VM that is created as a Kubernetes node.

$ gcloud container clusters create "gke-mongodb-demo-cluster" --image-type=CONTAINER_VM

This time, when all the Kubernetes resources are created and deployed, the "mongod" containers correctly utilise XFS formatted persistent volumes. 
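
To confirm that a volume really was formatted as XFS, the filesystem type can be read from "df" output inside a running pod. The sketch below parses the output of "df -PT"; the helper name fs_type is hypothetical, and both the pod name and the /data/db mount path in the usage line are assumptions (the latter being the default data directory of the DockerHub "mongo" image).

```shell
#!/bin/bash
# Read `df -PT <path>` output on stdin and print the filesystem type,
# which is the second column of the first data row.
fs_type() {
  awk 'NR == 2 { print $2 }'
}

# Example (pod name and mount path are assumptions):
# kubectl exec mongod-0 -- df -PT /data/db | fs_type
```

The -P flag keeps each filesystem on a single output line, which makes the column position dependable.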

If this all seems a bit complicated, it is probably helpful to view the full end-to-end deployment flow, provided in my example GitHub project gke-mongodb-demo.

There is one final observation to make before finishing the discussion on XFS. In Google's online documentation, it is stated that the Debian Host VM OS is deprecated in favour of Chromium OS. I hope that in the future Google will add XFS support directly to its Chromium OS distribution, to make the use of XFS a lot less painful and to ensure XFS can still be used with MongoDB, if the Debian Host VM option is ever completely removed.

Disabling NUMA

For optimum performance, the MongoDB Production Notes recommend that "on NUMA hardware, you should configure a memory interleave policy so that the host behaves in a non-NUMA fashion". The DockerHub "mongo" container image, which has been used so far with Kubernetes in this blog series, already contains some bootstrap code to start the "mongod" process with the "numactl --interleave=all" setting. This setting makes the process environment behave in a non-NUMA way.

However, I believe it is worth specifying the "numactl" settings explicitly in the "mongod" Service/StatefulSet resource definition anyway, in case other users choose to use an alternative or self-built Docker image for the "mongod" container. The excerpt below shows the added "numactl" elements, placed ahead of "mongod" in the container's command, which are required to run the containerised "mongod" process in a "non-NUMA" manner.

$ cat mongodb-service.yaml
      containers:
        - name: mongod-container
          image: mongo
          command:
            - "numactl"
            - "--interleave=all"
            - "mongod"

Controlling CPU & RAM Resource Allocation Plus WiredTiger Cache Size

Of course, when you are running a MongoDB database it is important to size both CPU and RAM resources correctly for the particular database workload, regardless of the type of host environment. In a Kubernetes containerised host environment, the amount of CPU and RAM dedicated to a container can be defined in the "resources" section of the container's declaration, as shown in the excerpt of the "mongod" Service/StatefulSet definition below:

$ cat mongodb-service.yaml
      containers:
        - name: mongod-container
          image: mongo
          command:
            - "mongod"
            - "--wiredTigerCacheSizeGB"
            - "0.25"
            - "--replSet"
            - "MainRepSet"
            - "--auth"
            - "--clusterAuthMode"
            - "keyFile"
            - "--keyFile"
            - "/etc/secrets-volume/internal-auth-mongodb-keyfile"
            - "--setParameter"
            - "authenticationMechanisms=SCRAM-SHA-1"
          resources:
            requests:
              cpu: 1
              memory: 2Gi

In this example, 1x virtual CPU (vCPU) and 2GB of RAM have been requested to run the container. You will also notice that an additional parameter has been defined for "mongod", specifying the WiredTiger internal cache size ("--wiredTigerCacheSizeGB"). In a containerised environment it is absolutely vital to state this value explicitly. If this is not done, and multiple containers end up running on the same host machine (node), MongoDB's WiredTiger storage engine may attempt to take more memory than it should. This is because a process running in a container typically sees the host machine's total memory, rather than the amount allocated to the container, and "mongod" sizes its default cache from that figure. As per the MongoDB Production Recommendations, the default cache size guidance is: "50% of RAM minus 1 GB, or 256 MB" (whichever is larger). Given that the amount of memory requested is 2GB, the WiredTiger cache size here has been set to 256MB.

If and when you define a different amount of memory for the container process, be sure to also adjust the WiredTiger cache size setting accordingly, otherwise the "mongod" process may not leverage all the memory reserved for it by the container.
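
The sizing rule quoted above is simple enough to script, which can help keep the "--wiredTigerCacheSizeGB" value in step with the container's memory request. The first helper follows the reading used in this post (half of the RAM, minus 1 GB, with a floor of 256 MB). Later revisions of the MongoDB documentation word the rule as 50% of (RAM - 1 GB), so a second helper is included for that reading. Treat both as sketches with hypothetical names, and check the documentation for the MongoDB version in use.

```shell
#!/bin/bash
# Reading used in this post: (50% of RAM) - 1 GB, floored at 0.25 GB.
# Usage: wt_cache_gb <container RAM in GB>
wt_cache_gb() {
  awk -v ram="$1" 'BEGIN {
    c = 0.5 * ram - 1
    if (c < 0.25) c = 0.25
    printf "%.2f\n", c
  }'
}

# Alternative reading: 50% of (RAM - 1 GB), floored at 0.25 GB.
wt_cache_gb_alt() {
  awk -v ram="$1" 'BEGIN {
    c = 0.5 * (ram - 1)
    if (c < 0.25) c = 0.25
    printf "%.2f\n", c
  }'
}

wt_cache_gb 2       # prints 0.25, the value used for the 2GB request here
wt_cache_gb_alt 2   # prints 0.50
```

Whichever reading applies, the point is the same: derive the cache size from the container's memory allocation, not from whatever the host reports.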

Controlling Anti-Affinity for Mongod Replicas

When running a MongoDB Replica Set, it is important to ensure that none of the "mongod" replicas in the replica set are running on the same host machine as each other, to avoid inadvertently introducing a single point of failure. In a Kubernetes containerised environment, if containers are left to their own devices, different "mongod" containers could end up running on the same node. Kubernetes provides a way of specifying pod anti-affinity to prevent this from occurring. Below is an excerpt of a "mongod" Service/StatefulSet resource file which declares an anti-affinity configuration.

$ cat mongodb-service.yaml
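  serviceName: mongodb-service
  replicas: 3
  template:
    metadata:
      labels:
        replicaset: MainRepSet
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: replicaset
                  operator: In
                  values:
                  - MainRepSet
              topologyKey: kubernetes.io/hostname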

Here, a rule has been defined that asks Kubernetes to apply anti-affinity when deploying pods with the label "replicaset" equal to "MainRepSet", by looking for potential matches on the host VM instance's hostname, and then avoiding them.
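
Because the rule carries a "weight", it expresses a preference rather than a hard requirement, so Kubernetes may still place two replicas on one node if it has no better option. A quick check after deployment can catch that. The sketch below reads "pod node" pairs and reports any node hosting more than one replica; the helper name check_spread is hypothetical.

```shell
#!/bin/bash
# Read lines of "<pod> <node>" on stdin; print any node that hosts more
# than one pod, and return non-zero if such a node exists.
check_spread() {
  awk '
    { count[$2]++ }
    END {
      bad = 0
      for (node in count) {
        if (count[node] > 1) { print "shared node: " node; bad = 1 }
      }
      exit bad
    }'
}

# Example:
# kubectl get pods -l replicaset=MainRepSet --no-headers \
#   -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName | check_spread
```

A non-zero exit status makes the helper easy to drop into a deployment script as a sanity check.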

Setting File Descriptor & User Process Limits

When deploying the MongoDB Replica Set on GKE Kubernetes, as demonstrated in the current GitHub project gke-mongodb-demo, you may notice some warnings about "rlimits" in the output of each containerised mongod's logs. These log entries can be viewed by running the following command:

$ kubectl logs mongod-0 | grep rlimits

2017-06-27T12:35:22.018+0000 I CONTROL  [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 29980 processes, 1000000 files. Number of processes should be at least 500000 : 0.5 times number of files.

The MongoDB manual provides some recommendations concerning the system settings for the maximum number of processes and open files when running a "mongod" process.

Unfortunately, thus far, I've not established an appropriate way to enforce these thresholds using GKE Kubernetes. This topic will possibly be the focus of a blog post for another day. However, I thought that it would be informative to highlight the issue here, with the supporting context, to allow others the chance to resolve it first.
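
Although enforcing the limits is left open here, the rule stated in the warning is easy to reproduce: the process limit should be at least half the open-file limit. The sketch below applies that rule to the figures from the log line above. The helper names are hypothetical, and this only mirrors the arithmetic in the warning text, not how "mongod" performs the check internally.

```shell
#!/bin/bash
# Minimum process limit implied by the warning: 0.5 x the open-file limit.
# Usage: min_procs <max open files>
min_procs() {
  echo $(( $1 / 2 ))
}

# Succeeds when the process limit satisfies the rule, fails otherwise.
# Usage: rlimits_ok <max processes> <max open files>
rlimits_ok() {
  [ "$1" -ge "$(min_procs "$2")" ]
}

min_procs 1000000   # prints 500000, matching the figure in the warning

# Example: read the current soft limits inside a pod
# kubectl exec mongod-0 -- bash -c 'ulimit -u; ulimit -n'
```

With the logged values (29980 processes, 1000000 files), rlimits_ok fails, which is consistent with the warning being raised.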


In this blog post I’ve provided some methods for addressing certain best practices when deploying a MongoDB Replica Set to GKE's Kubernetes platform. Although this post does not provide an exhaustive list of best practice solutions, I hope it proves useful for others (and myself) to build upon, in the future.

[Next post in series: Using the Enterprise Version of MongoDB on GKE Kubernetes]

Song for today: The Mountain by Jambinai
