Introduction
In the first part of my blog series I showed how to deploy a MongoDB Replica Set to GKE's Kubernetes environment, whilst ensuring that the replica set is secure by default and resilient to various types of system failures. As mentioned in that post, there are a number of other "production" considerations that need to be made when running MongoDB in Kubernetes and Docker environments. These considerations are primarily driven by the best practices documented in MongoDB’s Production Operations Checklist and Production Notes. In this blog post, I will address how to apply some (but not all) of these best practices on GKE's Kubernetes platform.
Host VM Modifications for Using XFS & Disabling Hugepages
For optimum performance, the MongoDB Production Notes strongly recommend applying the following configuration settings to the host operating system (OS):
- Use an XFS based Linux filesystem for WiredTiger data file persistence.
- Disable Transparent Huge Pages.
So with this in mind, the challenge is to build a Docker container image that, when run in privileged mode, uses "nsenter" to spawn a shell to run some shell script commands on the host. As luck would have it, such a container has already been created, in a generic way, as part of the Kubernetes "contributions" project, called startup-script. The generated "startup-script" Docker image has been registered and made available in the Google Container Registry, ready to be pulled in and used by anyone's Kubernetes projects.
Therefore on GKE, to create a DaemonSet leveraging the "startup-script" image in privileged mode, we first need to define the DaemonSet's configuration:
$ cat hostvm-node-configurer-daemonset.yaml
- name: hostvm-configurer-container
- name: STARTUP_SCRIPT
set -o errexit
set -o pipefail
set -o nounset
# Disable hugepages
echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag
# Install tool to enable XFS mounting
apt-get update || true
apt-get install -y xfsprogs
At the base of the file, you will notice the commands used to disable Transparent Huge Pages and to install the XFS tools for formatting and mounting storage using the XFS filesystem. Further up the file is the reference to the third-party "startup-script" image from the Google Container Registry and the security context setting that states the container should be run in privileged mode.
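Since only fragments of the DaemonSet file appear in the listing above, here is a minimal sketch of how such a definition could be assembled. It is written in Python and emitted as JSON, which kubectl accepts alongside YAML. The "apps/v1" API version, the "app" label, the shebang line and the image reference "gcr.io/google-containers/startup-script:v1" are assumptions on my part, not a copy of the original file.

```python
import json

# Shell commands to run on each host VM. The commands come from the listing
# in this post; the shebang line is an assumption.
STARTUP_SCRIPT = """#! /bin/bash
set -o errexit
set -o pipefail
set -o nounset

# Disable hugepages
echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag

# Install tool to enable XFS mounting
apt-get update || true
apt-get install -y xfsprogs
"""


def build_daemonset(name="hostvm-configurer"):
    """Return a DaemonSet manifest (as a dict) that runs the script on every node."""
    # Hypothetical label, used only to tie the selector to the pod template.
    labels = {"app": name}
    container = {
        "name": "hostvm-configurer-container",
        # Assumed reference to the Kubernetes contrib "startup-script" image.
        "image": "gcr.io/google-containers/startup-script:v1",
        # Privileged mode is what allows the container to change host settings.
        "securityContext": {"privileged": True},
        "env": [{"name": "STARTUP_SCRIPT", "value": STARTUP_SCRIPT}],
    }
    return {
        # Assumption: a current cluster. Older clusters used extensions/v1beta1.
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                # hostPID lets nsenter reach the host's namespaces.
                "spec": {"hostPID": True, "containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_daemonset(), indent=2))
```

Redirecting the output to a file and passing that file to "kubectl apply -f" would be one way to try it out.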
Next we need to deploy the DaemonSet with its "startup-script" container to all the hosts (nodes), before we attempt to create any GCE disks that need to be formatted as XFS:
$ kubectl apply -f hostvm-node-configurer-daemonset.yaml
In the GCE disk definitions described in the first blog post in this series (i.e. "gce-ssd-persistentvolume?.yaml"), a new parameter (the disk's "fsType" field) needs to be added to indicate that the disk's filesystem type needs to be XFS.
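As a rough illustration of that change, the sketch below builds a PersistentVolume definition with the filesystem type set to XFS and prints it as JSON. The volume name, disk name and size are hypothetical placeholders; only the "fsType" entry is the point of the example.

```python
import json


def build_xfs_persistent_volume(index, size="10Gi"):
    """Return a PersistentVolume manifest backed by a GCE disk formatted as XFS."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": f"data-volume-{index}"},  # hypothetical name
        "spec": {
            "capacity": {"storage": size},  # hypothetical size
            "accessModes": ["ReadWriteOnce"],
            "gcePersistentDisk": {
                "pdName": f"pd-ssd-disk-{index}",  # hypothetical GCE disk name
                "fsType": "xfs",  # the added parameter: format and mount as XFS
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_xfs_persistent_volume(1), indent=2))
```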
Now in theory, this should be all that is required to get XFS working. Except on GKE, it isn't!
After deploying the DaemonSet and creating the GCE storage disks, the deployment of the "mongod" Service/StatefulSet will fail. The StatefulSet's pods do not start properly because the disks can't be formatted and mounted as XFS. It turns out that this is because, by default, GKE uses a variant of Chromium OS as the underlying host VM that runs the containers, and this OS flavour doesn't support XFS. However, GKE can also be configured to use a Debian based Host VM OS instead, which does support XFS.
To see the list of host VM OSes that GKE supports, the following command can be run:
$ gcloud container get-server-config
Fetching server config for europe-west1-b
Here, "COS" is the label for the Chromium OS and "CONTAINER_VM" is the label for the Debian OS. The easiest way to start leveraging the Debian OS image is to clear out all the GCE/GKE resources and the Kubernetes cluster from the current project and start the deployment all over again. This time, when the initial command is run to create the new Kubernetes cluster, an additional argument ("--image-type") must be provided to define that the Debian OS should be used for each Host VM that is created as a Kubernetes node.
$ gcloud container clusters create "gke-mongodb-demo-cluster" --image-type=CONTAINER_VM
This time, when all the Kubernetes resources are created and deployed, the "mongod" containers correctly utilise XFS formatted persistent volumes.
If this all seems a bit complicated, it is probably helpful to view the full end-to-end deployment flow, provided in my example GitHub project gke-mongodb-demo.
There is one final observation to make before finishing the discussion on XFS. In Google's online documentation, it is stated that the Debian Host VM OS is deprecated in favour of Chromium OS. I hope that in the future Google will add XFS support directly to its Chromium OS distribution, to make the use of XFS a lot less painful and to ensure XFS can still be used with MongoDB, if the Debian Host VM option is ever completely removed.
Disabling NUMA
For optimum performance, the MongoDB Production Notes recommend that "on NUMA hardware, you should configure a memory interleave policy so that the host behaves in a non-NUMA fashion". The DockerHub "mongo" container image, which has been used so far with Kubernetes in this blog series, already contains some bootstrap code to start the "mongod" process with the "numactl --interleave=all" setting. This setting makes the process environment behave in a non-NUMA way.
However, I believe it is worth specifying the "numactl" settings explicitly in the "mongod" Service/StatefulSet resource definition anyway, just in case other users choose to use an alternative or self-built Docker image for the "mongod" container. The excerpt below shows the "numactl" elements that need to be added to the container's command to run the containerised "mongod" process in a "non-NUMA" manner.
$ cat mongodb-service.yaml
- name: mongod-container
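To make the idea concrete, here is a small sketch of how the container's command list could be prefixed with "numactl". The helper name and the replica set name are hypothetical; "numactl --interleave=all" is the setting quoted above.

```python
def mongod_command(mongod_args, interleave=True):
    """Return the container command list, optionally wrapped in numactl."""
    command = ["mongod", *mongod_args]
    if interleave:
        # Interleave memory allocation across all NUMA nodes so the process
        # behaves as it would on non-NUMA hardware.
        command = ["numactl", "--interleave=all", *command]
    return command


if __name__ == "__main__":
    # "--replSet" is a mongod option; the replica set name is hypothetical.
    print(mongod_command(["--replSet", "MainRepSet"]))
```

The resulting list is what would go into the "command" field of the "mongod" container in the StatefulSet definition.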
Controlling CPU & RAM Resource Allocation Plus WiredTiger Cache Size
Of course, when you are running a MongoDB database it is important to size both CPU and RAM resources correctly for the particular database workload, regardless of the type of host environment. In a Kubernetes containerised host environment, the amount of CPU & RAM resource dedicated to a container can be defined in the "resources" section of the container's declaration, as shown in the excerpt of the "mongod" Service/StatefulSet definition below:
$ cat mongodb-service.yaml
- name: mongod-container
In the example, 1x virtual CPU (vCPU) and 2GB of RAM have been requested to run the container. You will also notice that an additional parameter has been defined for "mongod", specifying the WiredTiger internal cache size ("--wiredTigerCacheSizeGB"). In a containerised environment it is absolutely vital to explicitly state this value. If this is not done, and multiple containers end up running on the same host machine (node), MongoDB's WiredTiger storage engine may attempt to take more memory than it should. This is because of the way a container "reports" its memory size to running processes: "mongod" may size its cache from the host's total memory rather than from the memory allotted to the container. As per the MongoDB Production Recommendations, the default cache size guidance is: "50% of RAM minus 1 GB, or 256 MB". Given that the amount of memory requested is 2GB, the WiredTiger cache size here has been set to 256MB.
If and when you define a different amount of memory for the container process, be sure to also adjust the WiredTiger cache size setting accordingly, otherwise the "mongod" process may not leverage all the memory reserved for it by the container.
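When adjusting the value, note that the quoted guidance can be read in two ways. To my understanding, MongoDB's current documentation words it as the larger of 50% of (RAM - 1 GB) and 256 MB, whereas reading it as (50% of RAM) - 1 GB gives the more conservative 256MB used in this post for a 2GB container. The sketch below simply works through both readings; the function names are my own.

```python
def cache_gb_docs_reading(ram_gb):
    """Larger of 50% of (RAM - 1 GB) and 0.25 GB."""
    return max(0.5 * (ram_gb - 1.0), 0.25)


def cache_gb_post_reading(ram_gb):
    """Larger of (50% of RAM) - 1 GB and 0.25 GB: the reading that yields this post's value."""
    return max(0.5 * ram_gb - 1.0, 0.25)


if __name__ == "__main__":
    for ram_gb in (2, 4, 8):
        print(ram_gb, cache_gb_docs_reading(ram_gb), cache_gb_post_reading(ram_gb))
```

For a 2GB container the two readings give 0.5GB and 0.25GB respectively. Either result would be passed to "mongod" via "--wiredTigerCacheSizeGB".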
Controlling Anti-Affinity for Mongod Replicas
When running a MongoDB Replica Set, it is important to ensure that none of the "mongod" replicas in the replica set are running on the same host machine as each other, to avoid inadvertently introducing a single point of failure. In a Kubernetes containerised environment, if containers are left to their own devices, different "mongod" containers could end up running on the same nodes. Kubernetes provides a way of specifying pod anti-affinity to prevent this from occurring. Below is an excerpt of a "mongod" Service/StatefulSet resource file which declares an anti-affinity configuration.
$ cat mongodb-service.yaml
- weight: 100
- key: replicaset
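The excerpt above shows only the "weight" and "key" entries, so the following is a sketch of the full structure such a rule usually takes, built in Python and printed as JSON. The label value "MainRepSet" and the "kubernetes.io/hostname" topology key are assumptions rather than a copy of the original file.

```python
import json


def build_anti_affinity(label_key="replicaset", label_value="MainRepSet"):
    """Return a pod "affinity" block that discourages co-locating matching pods."""
    return {
        "podAntiAffinity": {
            # "preferred" is a soft rule: the scheduler tries to honour it,
            # weighted against other preferences, but may still co-locate pods.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "podAffinityTerm": {
                        "labelSelector": {
                            "matchExpressions": [
                                {
                                    "key": label_key,
                                    "operator": "In",
                                    "values": [label_value],  # assumed label value
                                }
                            ]
                        },
                        # Spread matching pods across distinct nodes.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }


if __name__ == "__main__":
    print(json.dumps({"affinity": build_anti_affinity()}, indent=2))
```

If co-location must never happen, the "requiredDuringSchedulingIgnoredDuringExecution" form is the stricter alternative, at the cost of pods staying unscheduled when there are too few nodes.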
Setting File Descriptor & User Process Limits
When deploying the MongoDB Replica Set on GKE Kubernetes, as demonstrated in the current GitHub project gke-mongodb-demo, you may notice some warnings about "rlimits" in the output of each containerised mongod's logs. These log entries can be viewed by running the following command:
$ kubectl logs mongod-0 | grep rlimits
2017-06-27T12:35:22.018+0000 I CONTROL [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 29980 processes, 1000000 files. Number of processes should be at least 500000 : 0.5 times number of files.
Unfortunately, thus far, I've not established an appropriate way to enforce these thresholds using GKE Kubernetes. This topic will possibly be the focus of a blog post for another day. However, I thought that it would be informative to highlight the issue here, with the supporting context, to allow others the chance to resolve it first.
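For context, the arithmetic behind the warning can be sketched as below, using the two limits reported in the log line above. The function names are my own, and this only reproduces the check stated in the warning text rather than MongoDB's actual implementation.

```python
def min_processes_for(files_limit):
    """Minimum process limit implied by the warning: 0.5 times the file limit."""
    return files_limit // 2


def rlimits_ok(processes_limit, files_limit):
    """Return True when the process limit satisfies the rule from the warning."""
    return processes_limit >= min_processes_for(files_limit)


if __name__ == "__main__":
    # Values taken from the log entry: 29980 processes, 1000000 files.
    print(min_processes_for(1000000))
    print(rlimits_ok(29980, 1000000))
```

With the logged values, the required minimum is 500000 processes, so a limit of 29980 fails the check, which matches the warning.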
Summary
In this blog post I’ve provided some methods for addressing certain best practices when deploying a MongoDB Replica Set to GKE's Kubernetes platform. Although this post does not provide an exhaustive list of best practice solutions, I hope it proves useful for others (and myself) to build upon in the future.
[Next post in series: Using the Enterprise Version of MongoDB on GKE Kubernetes]
Song for today: The Mountain by Jambinai