Friday, April 13, 2018

MongoDB Graph Query Example, Inspired by Designing Data-Intensive Applications Book

Introduction


People who have worked with me recently are probably bored by me raving about how good this book is: Designing Data-Intensive Applications by Martin Kleppmann (O'Reilly, 2017). Suffice it to say, if you are in IT and have any sort of interest in databases and/or data-driven applications, you should read this book. You will be richly rewarded for the effort.

In the second chapter of the book ('Data Models and Query Languages'), Martin has a section called 'Graph-Like Data Models' which explores 'graph use cases', where many-to-many relationships are typically modelled as tree-like structures with indeterminate numbers of inter-connections. The book's section shows how a specific 'graph problem' can be solved both by using a dedicated graph database technology with its associated query language (Cypher), and by using an ordinary relational database with its associated query language (SQL). One thing that quickly becomes evident, when reading this section of the book, is how difficult it is to model complex many-to-many relationships in a relational database. This may come as a surprise to some people. However, it is consistent with something I've subconsciously learnt over 20 years of using relational databases, which is that, in the world of RDBMS, relationships ≠ relations.

The graph scenario illustrated in the book shows an example of two people, Lucy and Alain, who are married to each other, who were born in different places and who now live together in a third place. For clarity, I've included the diagram from the book, below, to best illustrate the scenario (annotated with the book's details, in red, for reference).




Throughout the book, numerous types of databases and data-stores are illustrated, compared and contrasted, including MongoDB in many places. However, the book's section on graph models doesn't show how MongoDB can be used to solve the example graph scenario. Therefore, I thought I'd take this task on myself. Essentially, the premise is that there is a data-set of many people, with data on the place each person was born in and the place each person now lives in. Of course, any given place may be within a larger named place, which may in turn be within a larger named place, and so on, as illustrated in the diagram above. In the rest of this blog post I show one way that such data structures and relationships can be modelled in MongoDB and then leveraged by MongoDB's graph query capabilities (specifically using the graph lookup feature of MongoDB's Aggregation Framework). What will be demonstrated is how to efficiently answer the exam question posed by the book, namely: 'Find People Who Emigrated From US To Europe'.


Solving The Book's Graph Challenge With MongoDB


To demonstrate the use of MongoDB's Aggregation 'graph lookup' capability to answer the question 'Find People Who Emigrated From US To Europe', I've created the following two MongoDB collections, populated with data:
  1. 'persons' collection. Contains around one million randomly generated person records, where each person has 'born_in' and 'lives_in' attributes, which each reference a 'starting' place record in the places collection.
  2. 'places' collection. Contains hierarchical geographical places data, with the graph structure of: SUBDIVISIONS-->COUNTRIES-->SUBREGIONS-->CONTINENTS. Note: The granularity and hierarchy of the data-set is slightly different than illustrated in the book, due to the sources of geographical data I had available to cobble together.
Similar to the book's example, amongst the many 'persons' records stored in the MongoDB data-set are the following two records, relating to 'Lucy' and 'Alain'.

{fullname: 'Lucy Smith', born_in: 'Idaho', lives_in: 'England'}
{fullname: 'Alain Chirac', born_in: 'Bourgogne-Franche-Comte', lives_in: 'England'}

Below is an excerpt of some of the records from the 'places' collection, which illustrates how a place record may refer to another place record, via its 'part_of' attribute.

{name: 'England', type: 'subdivision', part_of: 'United Kingdom of Great Britain and Northern Ireland'}
..
{name: 'United Kingdom of Great Britain and Northern Ireland', type: 'country', part_of: 'Northern Europe'}
..
{name: 'Northern Europe', type: 'subregion', part_of: 'Europe'}
..
{name: 'Europe', type: 'continent', part_of: ''}

If you want to access this data yourself and load it into the two MongoDB database collections, I've created JSON exports of both collections and made these available in a GitHub project (see the project's README for more details on how to load the data into MongoDB and then how to actually run the example's 'graph lookup' aggregation pipeline).

The MongoDB aggregation pipeline I created, to process the data across these two collections and to answer the question 'Find People Who Emigrated From US To Europe', has the following stages:
  1. $graphLookup: For every record in the 'persons' collection, using the person's 'born_in' attribute, locate the matching record in the 'places' collection and then walk the chain of ancestor place records building up a hierarchy of 'born in' place names.
  2. $match: Only keep 'persons' records, where the 'born in' hierarchy of discovered place names includes 'United States of America'.
  3. $graphLookup: For each of these remaining 'persons' records, using each person's 'lives_in' attribute, locate the matching record in the 'places' collection and then walk the chain of ancestor place records building up a hierarchy of 'lives in' place names.
  4. $match: Only keep the remaining 'persons' records where the 'lives in' hierarchy of discovered place names includes 'Europe'.
  5. $project: For the resulting records to be returned, just show the attributes 'fullname', 'born_in' and 'lives_in'.

The actual MongoDB Aggregation Pipeline for this is:

db.persons.aggregate([
    {$graphLookup: {
        from: 'places',
        startWith: '$born_in',
        connectFromField: 'part_of',
        connectToField: 'name',
        as: 'born_hierarchy'
    }},
    {$match: {'born_hierarchy.name': born}},
    {$graphLookup: {
        from: 'places',
        startWith: '$lives_in',
        connectFromField: 'part_of',
        connectToField: 'name',
        as: 'lives_hierarchy'
    }},
    {$match: {'lives_hierarchy.name': lives}},
    {$project: {
        _id: 0,
        fullname: 1, 
        born_in: 1, 
        lives_in: 1, 
    }}
])

When this aggregation is executed, after first declaring values for the two variables it references ('born' and 'lives')...

var born = 'United States of America', lives = 'Europe'

...the following is an excerpt of the output that is returned by the aggregation:

{fullname: 'Lucy Smith', born_in: 'Idaho', lives_in: 'England'}
{fullname: 'Bobby Mc470', born_in: 'Illinois', lives_in: 'La Massana'}
{fullname: 'Sandy Mc1529', born_in: 'Mississippi', lives_in: 'Karbinci'}
{fullname: 'Mandy Mc2131', born_in: 'Tennessee', lives_in: 'Budapest'}
{fullname: 'Gordon Mc2472', born_in: 'Texas', lives_in: 'Tyumenskaya oblast'}
{fullname: 'Gertrude Mc2869', born_in: 'United States of America', lives_in: 'Planken'}
{fullname: 'Simon Mc3087', born_in: 'Indiana', lives_in: 'Ribnica'}
..
..

On my laptop, using the data-set of a million person records, the aggregation takes about 45 seconds to complete. However, if I first define the index...

db.places.createIndex({name: 1})

...and then run the aggregation, it only takes around 2 seconds to execute. This shows just how efficiently the 'graphLookup' capability is able to walk a graph of relationships, by leveraging an appropriate index.
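For anyone driving this from an application rather than the mongo shell, below is a minimal PyMongo sketch of the same index creation and aggregation. The connection details and database name ('localhost', 'test') are assumptions, so adjust them to wherever the 'persons' and 'places' collections were loaded.

from pymongo import MongoClient

client = MongoClient('localhost', 27017)  # assumed connection details
db = client['test']                       # assumed database name

db.places.create_index([('name', 1)])     # the index that speeds up the graph walk

born, lives = 'United States of America', 'Europe'

pipeline = [
    {'$graphLookup': {
        'from': 'places',
        'startWith': '$born_in',
        'connectFromField': 'part_of',
        'connectToField': 'name',
        'as': 'born_hierarchy'
    }},
    {'$match': {'born_hierarchy.name': born}},
    {'$graphLookup': {
        'from': 'places',
        'startWith': '$lives_in',
        'connectFromField': 'part_of',
        'connectToField': 'name',
        'as': 'lives_hierarchy'
    }},
    {'$match': {'lives_hierarchy.name': lives}},
    {'$project': {'_id': 0, 'fullname': 1, 'born_in': 1, 'lives_in': 1}}
]

for person in db.persons.aggregate(pipeline):
    print(person)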


Summary


I've shown the expressiveness and power of MongoDB's aggregation framework, combined with 'graphLookup' pipeline stages, to perform a query of a graph of relationships across many records. A 'graphLookup' stage is efficient as it avoids the need to develop client application logic to programmatically navigate each hop of a graph of relationships, and thus avoids the network round trip latency that a client, traversing each hop, would otherwise incur. The 'graphLookup' stage can and should leverage an index, to enable the 'tree-walk' process to be even more efficient.

Although MongoDB may not provide as many graph processing primitives as 'dedicated' graph databases, it possesses some key advantages for 'graph' use cases:
  1. Business Critical Applications. MongoDB is designed for, and invariably deployed as a realtime operational database, with built-in high availability and enterprise security capabilities to support realtime business critical uses. Dedicated graph databases tend to be built for 'back-office' and 'offline' analytical uses, with less focus on high availability and security. If there is a need to leverage a database to respond to graph queries in realtime for applications sensitive to latency, availability and security, MongoDB is likely to be a great fit.
  2. Cost of Ownership & Timeliness of Insight. Often, there may be requirements to satisfy random realtime CRUD operations on individual data records and also satisfy graph-related analysis of the data-set as a whole. Traditionally, this would require an ecosystem containing two types of database: an operational database and a graph analytical database. A set of ETL processes would then need to be developed to keep the duplicated data synchronised between the two databases. By combining both roles in a single MongoDB distributed database, with appropriate workload isolation, the financial cost of this complexity can be greatly reduced, due to a far simpler deployment. Additionally, and as a consequence, there is none of the lag that arises from keeping a copy of the data in one system up to date with the copy in another system. Rather than operating on stale data, the graph analytical workloads operate on current data to provide more accurate business insight.


Song for today: Cosmonauts by Quicksand

Sunday, February 4, 2018

Run MongoDB Aggregation Facets In Parallel For Faster Insight

Introduction

MongoDB version 3.4 introduced a new Aggregation stage, $facet, to enable developers to "create multi-faceted aggregations which characterize data across multiple dimensions, or facets, within a single aggregation stage". For example, you may run a clothes retail website and use this aggregation capability to characterise the choices across a set of filtered products, by the following facets, simultaneously:
  1. Size (e.g. S, M, L)
  2. Full-price vs On-offer
  3. Brand (e.g. Nike, Adidas)
  4. Average Rating (e.g. 1 - 5 stars)
In this blog post, I explore a way in which the response times for faceted aggregation workloads can be reduced, by leveraging parallel processing.

Parallelising Aggregated Facets

If an aggregation pipeline declares the use of the $facet stage, it defines multiple facets, where each facet is a "sub-pipeline" containing a series of actions specific to its facet. When a faceted aggregation is executed, the result of the aggregation will contain the combined output of all the facets' sub-pipelines. Below is an example of the structure of a "faceted" aggregation pipeline.
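The sketch below, written as a PyMongo-style pipeline, is illustrative only: the connection details, collection, facet names, field names and stages are all hypothetical, but the overall shape is what matters.

from pymongo import MongoClient

db = MongoClient('localhost', 27017)['shop']  # assumed connection and database name

pipeline = [
    {'$facet': {
        # Facet 1 (hypothetical): count the products per brand, most numerous first
        'by_brand': [
            {'$group': {'_id': '$brand', 'count': {'$sum': 1}}},
            {'$sort': {'count': -1}}
        ],
        # Facet 2 (hypothetical): count the products per size
        'by_size': [
            {'$sortByCount': '$size'}
        ]
    }}
]

results = list(db.products.aggregate(pipeline))  # 'products' is a hypothetical collection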
In this example, there are two facets or dimensions, each containing a sub-pipeline. Each sub-pipeline is essentially a regular aggregation pipeline, with just a small handful of restrictions on what it can contain. Notable amongst these restrictions is the fact that the sub-pipeline cannot contain a $facet stage. Therefore you can't use this to go infinite levels deep!

The ability to define an aggregation containing different facets is not just useful for responding to online user interactions, in realtime. It is also useful for activities such as running a business's "internal reporting" workloads, where a report may need to analyse a full data set, and then summarise the data in different dimensions.

A data set that I've been playing around with recently is the publicly available "MOT UK Annual Vehicle Test Result Data". An MOT is a UK annual safety check on a vehicle, and is mandatory for all cars over 3 years old. The UK government makes the data available to download, in anonymised form, for anyone to consume, through its data.gov.uk platform. It's a rich data set, providing a lot of insight into the characteristics of cars that UK residents have been driving over the last ten years or so. As a result, it's a good data set for me to use to explore faceted aggregations.

2014-2016 MOT car data loaded into MongoDB - displayed in MongoDB Compass

To analyse the car data, I created a GitHub project at mongo-uk-car-data. This contains some Python scripts to load the data from the MOT data CSV files into MongoDB, and to perform various analytics on the data set using MongoDB's Aggregation Framework. One of the Python scripts I created, mdb-mot-agg-cars-facets.py, uses a $facet stage to aggregate together summary information, in the following three different dimensions:
  1. Analyse the different car makes/brands (e.g. Ford, Vauxhall) and categorise them into a range of "buckets", based on how many different unique models each car make has.
  2. Summarise the number of tested cars that fall into each fuel type category (e.g. Petrol, Diesel, Electric). Note: "Petrol" is equivalent to "Gas" for my American friends, I believe.
  3. List the top 5 car makes/brands from the car tests, showing how many cars there are for each car make, plus each car make's most popular and least popular models.
The following shows the result of the aggregation when run against the data set for years 2014-16 (a data set of approximately 113 million records).


When I ran this test on my Linux laptop (hosting both a mongod server and the test Python script), the aggregation completed in about 5:20 minutes. In my test Python client code, a faceted pipeline is constructed, and the PyMongo Driver is used to send the aggregation command and pipeline payload to the MongoDB database. Significantly, the database's Aggregation Framework processes each facet's sub-pipeline serially. Therefore, for example, if the first facet takes 5 minutes to process, the second takes 2 minutes and the third facet takes 10 minutes, the client application will only receive a full response in just over 17 minutes.

It occurred to me that there was a way to potentially speed up the execution time of this analytics job. At the point of invoking collection.aggregate(pipeline) in the mdb-mot-agg-cars-facets.py script, a custom function could be invoked instead, that internally breaks the pipeline up into separate pipelines, one for each facet. The function could then send each facet as a separate aggregation command, in parallel, to the MongoDB database to be processed, before merging the results into one, and returning it. Functionally, the behaviour of this code and the content of the response would be identical, but I hoped the response time would be significantly less. So I replaced the line of code that directly invoked the MongoDB aggregation command, collection.aggregate(pipeline), with a call to my new function aggregate_facets_in_parallel(collection, pipeline), instead. The implementation of this function can be seen in the Python file parallel_facets_agg.py. The function uses Python's multiprocessing.pool.ThreadPool class to send each facet's sub-pipeline in a separate client thread and waits for all the parallel aggregations to complete before returning the combined result.
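The actual implementation lives in the parallel_facets_agg.py file of the GitHub project. The sketch below is just an outline of the general approach, not a copy of that file's code:

from multiprocessing.pool import ThreadPool


def aggregate_facets_in_parallel(collection, pipeline):
    # Expects a pipeline consisting of a single $facet stage; each facet's
    # sub-pipeline is sent to the database as its own aggregation, concurrently
    if len(pipeline) != 1 or '$facet' not in pipeline[0]:
        raise ValueError('Pipeline must contain exactly one $facet stage')

    facets = pipeline[0]['$facet']

    def run_facet(item):
        facet_name, sub_pipeline = item
        return facet_name, list(collection.aggregate(sub_pipeline))

    # One client thread per facet; each thread blocks on its own aggregate() call
    pool = ThreadPool(processes=len(facets))
    try:
        named_results = pool.map(run_facet, facets.items())
    finally:
        pool.close()
        pool.join()

    # Merge the per-facet results back into the same shape a real $facet
    # aggregation returns: a single document keyed by facet name
    return [{name: docs for name, docs in named_results}]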

This time, when I ran the test Python script against the same data set, I received the exact same result, but in a time of just 3:30 minutes (versus the original time of 5:20 minutes). This is not a bad speed up!  :-D

Some Observations

Some people may look at this and ask why, given that there were 3 facets, the aggregation didn't respond in just one third of the original time (i.e. in around 1:47 minutes). Well, there are many reasons, including:
  1. This would assume each separate facet sub-pipeline takes the same amount of time to execute, which is highly unlikely. The different facet sub-pipelines will each have different complexities and processing requirements. The overall response time cannot be any faster than the slowest of the 3 facet sub-pipelines.
  2. Just because the client code spawns 3 "concurrent" threads, it doesn't mean that these 3 threads are actually running completely in parallel. For example, my laptop has 2 CPU cores, which would be a cause of some resource contention. There will of course be many other potential causes of resource contention, such as multiple threads competing to retrieve different data from the same hard drive.
  3. For my simple tests, the client test Python script is running on the same machine as the MongoDB database, and thus will consume some of the compute capacity (albeit, for these tests, it will mostly just be blocking and waiting).
  4. In most real world cases (but not for my simple tests here), there may also be other workloads being processed by the MongoDB database simultaneously, consuming significant portions of the shared compute resources.
The other question people may ask is: if this is so simple, why doesn't the MongoDB server implement such parallelism itself, for processing the different sub-pipelines of an aggregation $facet stage? There are at least two reasons why this is not the case in MongoDB:
  1. My test scenario places some restrictions on the aggregation pipeline as a whole. Specifically, the top level pipeline must only contain one stage (the $facet stage) and my custom function throws an exception if this is not the case. This is fine, and quite common, where the workload is an analytical one that needs to "full table scan" most or all of a data set. However, in the original retail example at the top of the post, the likelihood is that there would need to be a $match stage, before the $facet stage, to first restrict the multi-faceted clothes classifications based on a filter that the user has entered (e.g. product name contains "Black Trainers"), as shown in the sketch after this list. Thus, it may well be more efficient to perform the $match just once, to reduce the set of data to work with, before having this data passed on to a $facet stage. The workaround would be to duplicate the $match as the first stage of each of the $facet sub-pipelines, which could well turn out to be slower, as the same work would be repeated.
  2. For the most part, MongoDB's runtime architecture does not attempt to divide and process an individual client request's CRUD operations into parallel chunks, and instead processes the elements of an individual request serially. One reason why this is a good thing, is that typically a MongoDB database will be processing many requests in parallel from many clients. If one particular request was allowed to dominate the system's resources, by being parallelised "server-side" for a "burst of time", this may adversely affect other requests and cause the database to exhibit inconsistent performance as a whole. MongoDB's architecture generally encourages a fairer share of resources, spread across all clients' requests. For this reason, you may want to carefully consider how much you use the "client-side parallelism" tip in this blog post, in order to avoid abusing this "fair share" trust.
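To make the first point above concrete, a rough PyMongo-style sketch of that shape of pipeline (with a hypothetical connection, collection and retail field names) might look like this:

from pymongo import MongoClient

collection = MongoClient('localhost', 27017)['shop']['products']  # assumed connection/collection

pipeline = [
    # Filter once, up-front, so every facet's sub-pipeline only sees the reduced data set
    {'$match': {'product_name': {'$regex': 'Black Trainers', '$options': 'i'}}},
    {'$facet': {
        'by_size': [{'$sortByCount': '$size'}],
        'by_brand': [{'$sortByCount': '$brand'}]
    }}
]

results = list(collection.aggregate(pipeline))

# Splitting this into per-facet aggregations would force the $match stage to be
# duplicated at the start of each sub-pipeline, repeating the same filtering work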

Summary

I've shown an example in this blog post of how multi-faceted MongoDB aggregations can be sped up by encouraging parallelism, from the client application's perspective. The choice of Python to implement this was fairly arbitrary. I could have implemented it in any programming language that has a MongoDB Driver, using the appropriate multi-threading libraries for that language. The parallelism benefits discussed in this post are really aimed at analytics type workloads that need to process a whole data-set to produce multi-faceted insight. This is in contrast to the sorts of workloads that would first match a far smaller subset of records, against an index, before then aggregating on the small data subset.



Song for today: Bella Muerte by The Patch of Sky

Thursday, July 13, 2017

Deploying a MongoDB Sharded Cluster using Kubernetes StatefulSets on GKE

[Part 4 in a series of posts about running MongoDB on Kubernetes, with the Google Kubernetes Engine (GKE). For this post, a newer GitHub project gke-mongodb-shards-demo has been created to provide an example of a scripted deployment for a Sharded cluster specifically. This gke-mongodb-shards-demo project also incorporates the conclusions from the earlier posts in the series. Also see: http://k8smongodb.net/]


Introduction

In the previous posts of my blog series (1, 2, 3), I focused on deploying a MongoDB Replica Set in GKE's Kubernetes environment. A MongoDB Replica Set provides data redundancy and high availability, and is the basic building block for any mission critical deployment of MongoDB. In this post, I now focus on the deployment of a MongoDB Sharded Cluster, within the GKE Kubernetes environment. A Sharded Cluster enables the database to be scaled out over time, to meet increasing throughput and data volume demands. Even for a Sharded Cluster, the recommendations from my previous posts are still applicable. This is because each Shard is a Replica Set, to ensure that the deployment exhibits high availability, in addition to scalability.


Deployment Process

My first blog post on MongoDB & Kubernetes observed that, although Kubernetes is a powerful tool for provisioning and orchestrating sets of related containers (both stateless and now stateful), it is not a solution that caters for every required type of orchestration task. These tasks are invariably technology specific and need to operate below or above the “containers layer”. The example I gave in my first post, concerning the correct management of a MongoDB Replica Set's configuration, clearly demonstrates this point.

In the modern Infrastructure as Code paradigm, before orchestrating containers using something like Kubernetes, other tools are first required to provision infrastructure/IaaS artefacts such as Compute, Storage and Networking. You can see a clear example of this in the provisioning script used in my first post, showing non-Kubernetes commands ("gcloud"), which are specific to the Google Cloud Platform (GCP), being used first, to provision storage disks. Once containers have been provisioned by a tool like Kubernetes, higher level configuration tasks, such as data loading, system user identity provisioning, secure network modification for service exposure and many other "final bootstrap" tasks, will also often need to be scripted.

With the requirement here to deploy a MongoDB Sharded Cluster, the distinction between container orchestration tasks, lower level infrastructure/IaaS provisioning tasks and higher level technology-specific orchestration tasks, becomes even more apparent...

For a MongoDB Sharded Cluster on GKE, the following categories of tasks must be implemented:

    Infrastructure Level  (using Google's "gcloud" tool)
  1. Create 3 VM instances
  2. Create storage disks of various sizes, for containers to attach to
    Container Level  (using Kubernetes' "kubectl" tool)
  1. Provision 3 "startup-script" containers using a Kubernetes DaemonSet, to enable the XFS filesystem to be used and to disable Huge Pages
  2. Provision 3 "mongod" containers using a Kubernetes StatefulSet, ready to be used as members of the Config Server Replica Set to host the ConfigDB
  3. Provision 3 separate Kubernetes StatefulSets, one per Shard, where each StatefulSet is composed of 3 "mongod" containers ready to be used as members of the Shard's Replica Set
  4. Provision 2 "mongos" containers using a Kubernetes Deployment, ready to be used for managing and routing client access to the Sharded database
    Database Level  (using MongoDB's "mongo shell" tool)
  1. For the Config Servers, run the initialisation command to form a Replica Set
  2. For each of the 3 Shards (composed of 3 mongod processes), run the initialisation command to form a Replica Set
  3. Connecting to one of the Mongos instances, run the addShard command, 3 times, once for each of the Shards, to enable the Sharded Cluster to be fully assembled
  4. Connecting to one of the Mongos instances, under localhost exception conditions, create a database administrator user to apply to the cluster as a whole

The quantities of resources highlighted above are specific to my example deployment, but the types and order of provisioning steps apply regardless of deployment size.
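The example project drives the database-level steps above with the mongo shell, run under the localhost exception from inside the relevant containers. Purely as an illustration of what those steps amount to, the rough PyMongo sketch below shows the equivalent database commands; the mongos service name, shard member hostname and password are assumptions, and the real scripts in the project differ:

from pymongo import MongoClient

CONFIG_HOSTS = [
    'mongod-configdb-0.mongodb-configdb-service.default.svc.cluster.local:27017',
    'mongod-configdb-1.mongodb-configdb-service.default.svc.cluster.local:27017',
    'mongod-configdb-2.mongodb-configdb-service.default.svc.cluster.local:27017'
]

# 1. Initialise the Config Server Replica Set (connecting directly to the first config mongod)
configdb = MongoClient(CONFIG_HOSTS[0])
configdb.admin.command('replSetInitiate', {
    '_id': 'ConfigDBRepSet',
    'configsvr': True,
    'members': [{'_id': i, 'host': host} for i, host in enumerate(CONFIG_HOSTS)]
})

# 2. Initialise each Shard's Replica Set in the same way (repeat for 'Shard1RepSet',
#    'Shard2RepSet' and 'Shard3RepSet', connecting to the first mongod of each shard)

# 3 & 4. Via a mongos, add each of the 3 shards and create the admin user
#        (the mongos service name and shard member hostname below are hypothetical)
mongos = MongoClient('mongos-router-service', 27017)
mongos.admin.command('addShard', 'Shard1RepSet/mongod-shard1-0.mongodb-shard1-service:27017')
mongos.admin.command('createUser', 'main_admin',
                     pwd='SomeSecurePassword',  # placeholder password
                     roles=[{'role': 'root', 'db': 'admin'}])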

In the accompanying example project, for brevity and clarity, I use a simple Bash shell script to wire together these three different categories of tasks. In reality, most organisations would use more specialised automation software, such as Ansible, Puppet or Chef, to glue all such steps together.


Kubernetes Controllers & Pods

The Kubernetes StatefulSet definitions for the MongoDB Config Server "mongod" containers and the Shard member "mongod" containers hardly differ from those described in my first blog post.

Below is an excerpt of the StatefulSet definition for each Config Server mongod container (the config server specific additions are the "--configsvr" flag and the "ConfigDBRepSet" replica set name):

containers:
  - name: mongod-configdb-container
    image: mongo
    command:
      - "mongod"
      - "--port"
      - "27017"
      - "--bind_ip"
      - "0.0.0.0"
      - "--wiredTigerCacheSizeGB"
      - "0.25"
      - "--configsvr"
      - "--replSet"
      - "ConfigDBRepSet"
      - "--auth"
      - "--clusterAuthMode"
      - "keyFile"
      - "--keyFile"
      - "/etc/secrets-volume/internal-auth-mongodb-keyfile"
      - "--setParameter"
      - "authenticationMechanisms=SCRAM-SHA-1"

Below is an excerpt of the StatefulSet definition for each Shard member mongod container (the shard specific additions are the "--shardsvr" flag and the "Shard1RepSet" replica set name):

containers:
  - name: mongod-shard1-container
    image: mongo
    command:
      - "mongod"
      - "--port"
      - "27017"
      - "--bind_ip"
      - "0.0.0.0"
      - "--wiredTigerCacheSizeGB"
      - "0.25"
      - "--shardsvr"
      - "--replSet"
      - "Shard1RepSet"
      - "--auth"
      - "--clusterAuthMode"
      - "keyFile"
      - "--keyFile"
      - "/etc/secrets-volume/internal-auth-mongodb-keyfile"
      - "--setParameter"
      - "authenticationMechanisms=SCRAM-SHA-1"

For the Shard's container definition, the name of the specific Shard's Replica Set is declared. This Shard definition will result in 3 mongod replica containers being created for the Shard Replica Set, called "Shard1RepSet". Two additional and similar StatefulSet resources also have to be defined, to represent the second Shard ("Shard2RepSet") and the third Shard ("Shard3RepSet").

To provision the Mongos Routers, a StatefulSet is not used. This is because neither persistent storage nor a fixed network hostname is required. Mongos Routers are stateless and, to a degree, ephemeral. Instead, a Kubernetes Deployment resource is defined, which is the preferred Kubernetes approach for stateless services. Below is the Deployment definition for the Router mongos container:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: mongos
spec:
  replicas: 2
  template:
    spec:
      volumes:
        - name: secrets-volume
          secret:
            secretName: shared-bootstrap-data
            defaultMode: 256
      containers:
        - name: mongos-container
          image: mongo
          command:
            - "numactl"
            - "--interleave=all"
            - "mongos"
            - "--port"
            - "27017"
            - "--bind_ip"
            - "0.0.0.0"
            - "--configdb"
            - "ConfigDBRepSet/mongod-configdb-0.mongodb-configdb-service.default.svc.cluster.local:27017,mongod-configdb-1.mongodb-configdb-service.default.svc.cluster.local:27017,mongod-configdb-2.mongodb-configdb-service.default.svc.cluster.local:27017"
            - "--clusterAuthMode"
            - "keyFile"
            - "--keyFile"
            - "/etc/secrets-volume/internal-auth-mongodb-keyfile"
            - "--setParameter"
            - "authenticationMechanisms=SCRAM-SHA-1"
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: secrets-volume
              readOnly: true
              mountPath: /etc/secrets-volume

The actual structure of this resource definition is not a radical departure from the definition of a StatefulSet. A different command has been declared ("mongos", rather than "mongod"), but the same base container image has been referenced (mongo image from Docker Hub). For the "mongos" container, it is still important to enable authentication and to reference the generated cluster key file. Specific to the "mongos" parameter list is "--configdb", to specify the URL of the "ConfigDB" (the 3-node Config Server Replica Set). This URL will be connected to by each instance of the mongos router, to enable discovery of the Shards in the cluster and to determine which Shard holds specific ranges of the stored data. Due to the fact that the "mongod" containers, used to host the "ConfigDB", are deployed as a StatefulSet, their hostnames remain constant. As a result, a fixed URL can be defined in the "mongos" container resource definition, as shown above.

Once the "generate" shell script in the example project has been executed, and all the Infrastructure, Kubernetes and Database provisioning steps have successfully completed, 17 different Pods will be running, each hosting one container. Below is a screenshot showing the results from running the "kubectl" command, to list all the running Kubernetes Pods:
This tallies up with what Kubernetes was asked to deploy, namely:
  • 3x DaemonSet startup-script containers (one per host machine)
  • 3x Config Server mongod containers
  • 3x Shards, each composed of 3 replica mongod containers
  • 2x Router mongos containers
Only the "mongod" containers for the Config Servers and the Shard Replicas have fixed and deterministic names, because these were deployed as Kubernetes StatefulSets. The names of the other containers reflect the fact that they are regarded as stateless, disposable and trivial to re-create, on-demand, by Kubernetes.

With the full deployment generated and running, it is a straightforward process to connect to the sharded cluster to check its status. The screenshot below shows the use of the "kubectl" command to open a Bash shell connected to the first "mongos" container. From the Bash shell, the "mongo shell" has been opened, connecting to the local "mongos" process running in the same container. Before running the command to view the status of the Sharded cluster, the session is first authenticated using the "admin" user.
The output of the status command shows the URLs of all three Shards that have been defined (each is a MongoDB Replica Set). Again, these URLs remain fixed by virtue of the use of Kubernetes StatefulSets for the "mongod" containers that compose each Shard.

UPDATE 02-Jan-2018: Since writing this blog post, I realised that because the mongos routers ideally require stable hostnames, to be easily referenceable from the app tier, the mongos router containers should also be declared and deployed as a Kubernetes StatefulSet and Service, rather than a Kubernetes Deployment. The GitHub project gke-mongodb-shards-demo, associated with this blog post, has been changed to reflect this.


Summary

In this blog post I’ve shown how to deploy a MongoDB Sharded Cluster using Kubernetes with the Google Kubernetes Engine. I've mainly focused on the high level considerations for such deployments, rather than listing every specific resource and step required (for that level of detail, please view the accompanying GitHub project). What this study does reinforce is that a single container orchestration framework (Kubernetes in this case) does not cater for every step required to provision and deploy a highly available and scalable database, such as MongoDB. I believe that this would be true for any non-trivial and mission critical distributed application or set of services. However, I don't see this as a bad thing. Personally, I value flexibility and choice, and having the ability to use the right tool for the right job. In my opinion, a container orchestration framework that tries to cater for things beyond its obvious remit would be the wrong framework. A framework that attempts to be all things to all people would end up diluting its own value, down to the lowest common denominator. At the very least, it would become far too prescriptive and restrictive. I feel that Kubernetes, in its current state, strikes a suitable and measured balance, and with its StatefulSets capability, provides a good home for MongoDB deployments.


Song for today: Three Days by Jane's Addiction

Friday, June 30, 2017

Using the Enterprise Version of MongoDB on GKE Kubernetes

[Part 3 in a series of posts about running MongoDB on Kubernetes, with the Google Kubernetes Engine (GKE). See the GitHub project gke-mongodb-demo for an example scripted deployment of MongoDB to GKE, that you can easily try yourself. The gke-mongodb-demo project combines the conclusions from all the posts in this series so far. Also see: http://k8smongodb.net/]


Introduction

In the previous two posts of my blog series (1, 2) about running MongoDB on GKE's Kubernetes environment, I showed how to ensure a MongoDB Replica Set is secure by default and resilient to system failures, and how to ensure various best practice "production" environment settings are in place. In those examples, the community version of the MongoDB binaries was used. In this blog post I show how the enterprise version of MongoDB can be utilised, instead.


Referencing a Docker Image for Use in Kubernetes 

For the earlier two blog post examples, a pre-built "mongo" Docker image was "pulled" by Kubernetes, from Docker Hub's "official" MongoDB repository. Below is an excerpt of the Kubernetes StatefulSet definition, which shows how this image was referenced (the "image: mongo" line):

$ cat mongodb-service.yaml
....
....
      containers:
        - name: mongod-container
          image: mongo
          command:
            - "mongod"
....
....

By default, if no additional metadata is provided, Kubernetes will look in the Docker Hub repository for the image with the given name. Other repositories such as Google's Container Registry, AWS's EC2 Container Registry, Azure's Container Registry, or any private repository, can be used instead.

It is worth clarifying what is meant by the "official" MongoDB repository. This is a set of images that are "official" from Docker Hub's perspective because Docker Hub manages how the images are composed and built. They are not, however, official releases from MongoDB Inc.'s perspective. When the Docker Hub project builds an image, in addition to sourcing the "mongod" binary from MongoDB Inc's website, other components like the underlying Debian OS, plus various custom scripts, are generated into the image too.

At the time of writing this blog post, Docker Hub provides images for MongoDB community versions 3.0, 3.2, 3.4 and 3.5 (unstable). The Docker Hub repository only contains images for the community version of MongoDB and not the enterprise version.


Building a Docker Image Using the MongoDB Enterprise binaries 

You can build a Docker image to run "mongod" in any way you want, using your own custom Dockerfile to define how the image should be generated. The Docker manual even provides a tutorial for creating a custom Docker image specifically for MongoDB, called Dockerize MongoDB. That process can be followed as the basis for building an image which pulls down and uses the enterprise version of MongoDB, rather than the community version.

However, to generate an image containing the enterprise MongoDB version, it isn't actually necessary to create your own custom Dockerfile. This is because, a few weeks ago, I created a pull request to add support for building MongoDB enterprise based images, as part of the normal "mongo" GitHub project that is used to generate the "official" Docker Hub "mongo" images. The GitHub project owner accepted and merged this enhancement a few days later. My enhancement essentially allows the project's Dockerfile, when used by Docker's build tool, to instruct that the enterprise MongoDB binaries should be downloaded into the image, rather than the community ones. This is achieved by providing some "build arguments" to the Docker build process, which I will detail further below. This doesn't mean that Docker Hub's "official repository" for MongoDB now contains pre-generated "enterprise mongod" images ready to use. It just means you can use the source GitHub project directly, without any changes to the project's Dockerfile, to generate your own image containing the enterprise binary.

To build the Docker "mongo" image, that pulls in the enterprise version of MongoDB, follow these steps:

  1. Install Docker locally on your own workstation/laptop.

  2. Download the source files for the Docker Hub "mongo" project (on the project page, click the "Clone or download" green button, then click the "Download zip" link and once downloaded, unpack into a new local folder).

  3. From a command line shell, in the project folder, change directory to the sub-folder of the major version you want to build (e.g. "3.4"), and run:

  $ docker build -t pkdone/mongo-ent:3.4 --build-arg MONGO_PACKAGE=mongodb-enterprise --build-arg MONGO_REPO=repo.mongodb.com .

This tells Docker to build an image using the Dockerfile in the current directory (".") with the resulting image name being "pkdone/mongo-ent" and tag being ":3.4". By convention, the image name is prefixed by the author's username, which in my case is "pkdone" (obviously this should be replaced by a different prefix, for whoever follows these steps).

The two new "--build-arg" parameters, "MONGO_PACKAGE" and "MONGO_REPO", are passed to the "mongo" Dockerfile. The version of the Dockerfile with my enhancements uses these two parameters to locate where to download the specific type of MongoDB binary from. In this case, the values specified mean that the enterprise binary is pulled down into the generated image. If no "build args" are specified, the community version of MongoDB is used, by default.

Important note: When running the "docker build" command above, because the enterprise version of MongoDB will be downloaded, it will mean you are implicitly accepting MongoDB Inc's associated commercial licence terms.

Once the image is generated, you can also quickly test the image by running it in a Docker container on your local machine (as just a single "mongod" instance):

$ docker run -d --name mongo -t pkdone/mongo-ent:3.4

To be sure this is running properly, connect to the container using a shell, check if the "mongod" process is running and check that the Mongo Shell can connect to the containerised "mongod" process.

$ docker exec -it mongo bash
$ ps -aux
$ mongo

The output of the Shell should include the prompt "MongoDB Enterprise >" which shows that the database is using the enterprise version of MongoDB. Exit out of the Mongo Shell, exit out of the container and then from the local machine, run the command to view the "mongod" container's logged output (with example results shown):

$ docker logs mongo | grep enterprise

2017-07-01T12:08:42.177+0000 I CONTROL  [initandlisten] modules: enterprise

Again this result should demonstrate that the enterprise version of MongoDB has been used.


Using the Generated Enterprise Mongod Image from the Kubernetes Project

The easiest way to use the "mongod" container image that has just been created, from GKE Kubernetes, is to first register it with Docker Hub, using the following steps:

  1. Create a new free account on Docker Hub.

  2. Run the following commands to associate your local workstation environment with your new Docker Hub account, to list the built image registered on your local machine, and to push this newly generated image to your remote Docker Hub account:

    $ docker login
    $ docker images
    $ docker push pkdone/mongo-ent:3.4   

  3. Once the image has finished uploading, in a browser, return to the Docker Hub site and log-in to see the list of your registered repository instances. The newly pushed image should now be listed there.

Now back in the GKE Kubernetes project, for the "mongod" Service/StatefulSet resource definition, change the image reference to be the newly uploaded Docker Hub image, as shown below (the "image" line):

$ cat mongodb-service.yaml
....
....
      containers:
        - name: mongod-container
          image: pkdone/mongo-ent:3.4
          command:
            - "mongod"
....
....

Now re-perform all the normal steps to deploy the Kubernetes cluster and resources as outlined in the first blog post in the series. Once the MongoDB Replica Set is up and running, you can check the output logs of the first "mongod" container, to see if the enterprise version of MongoDB is being used:

$ kubectl logs mongod-0 | grep enterprise

2017-07-01T13:01:42.794+0000 I CONTROL  [initandlisten] modules: enterprise


Summary

In this blog post I’ve shown how to use the enterprise version of MongoDB, when running a MongoDB Replica Set, using Kubernetes StatefulSets, on the Google Kubernetes Engine. This builds on the work done in previous blog posts in the series, around ensuring MongoDB Replica Sets are resilient and better tuned for production workloads, when running on Kubernetes.

[Next post in series: Deploying a MongoDB Sharded Cluster using Kubernetes StatefulSets on GKE]


Song for today: Standing In The Way Of Control by Gossip

Tuesday, June 27, 2017

Configuring Some Key Production Settings for MongoDB on GKE Kubernetes

[Part 2 in a series of posts about running MongoDB on Kubernetes, with the Google Kubernetes Engine (GKE). See the GitHub project gke-mongodb-demo for an example scripted deployment of MongoDB to GKE, that you can easily try yourself. The gke-mongodb-demo project combines the conclusions from all the posts in this series so far. Also see: http://k8smongodb.net/]


Introduction

In the first part of my blog series I showed how to deploy a MongoDB Replica Set to GKE's Kubernetes environment, whilst ensuring that the replica set is secure by default and resilient to various types of system failures. As mentioned in that post, there are a number of other "production" considerations that need to be made when running MongoDB in Kubernetes and Docker environments. These considerations are primarily driven by the best practices documented in MongoDB’s Production Operations Checklist and Production Notes. In this blog post, I will address how to apply some (but not all) of these best practices, on GKE's Kubernetes platform.


Host VM Modifications for Using XFS & Disabling Hugepages

For optimum performance, the MongoDB Production Notes strongly recommend applying the following configuration settings to the host operating system (OS):
  1. Use an XFS based Linux filesystem for WiredTiger data file persistence.
  2. Disable Transparent Huge Pages.
The challenge here is that neither of these elements can be configured directly within normally deployed pods/containers. Instead, they need to be set in the OS of each machine/VM that is eligible to host one or more pods and their containers. Fortunately, after a little googling I found a solution to incorporating XFS, in the article Mounting XFS on GKE, which also provided the basis for deriving a solution for disabling Huge Pages too. It turns out that in Kubernetes, it is possible to run a pod (and its container) once per node (host machine), using a facility called a DaemonSet. A DaemonSet is used to schedule a "special" container to run on every newly provisioned node, as a one off, before any "normal" containers are scheduled and run on the node. In addition, for Docker based containers (the default on GKE Kubernetes), the container can be allowed to run in a privileged mode, which gives the "privileged" container access to other Linux Namespaces running in the same host environment. With heightened security rights the "privileged" container can then run a utility called nsenter ("NameSpace ENTER") to spawn a shell using the namespace belonging to the host OS ("/proc/1"). The script that the shell runs can then essentially perform any arbitrary root level actions on the underlying host OS.

So with this in mind, the challenge is to build a Docker container image that, when run in privileged mode, uses "nsenter" to spawn a shell to run some shell script commands. As luck would have it, such a container has already been created, in a generic way, as part of the Kubernetes "contributions" project, called startup-script. The generated "startup-script" Docker image has been registered and made available in the Google Container Registry, ready to be pulled in and used by anyone's Kubernetes projects.

Therefore on GKE, to create a DaemonSet leveraging the "startup-script" image in privileged mode, we first need to define the DaemonSet's configuration:

$ cat hostvm-node-configurer-daemonset.yaml

kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
  name: hostvm-configurer
  labels:
    app: startup-script
spec:
  template:
    metadata:
      labels:
        app: startup-script
    spec:
      hostPID: true
      containers:
      - name: hostvm-configurer-container
        image: gcr.io/google-containers/startup-script:v1
        securityContext:
          privileged: true
        env:
        - name: STARTUP_SCRIPT
          value: |
            #! /bin/bash
            set -o errexit
            set -o pipefail
            set -o nounset
            
            # Disable hugepages
            echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
            echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag
                        
            # Install tool to enable XFS mounting
            apt-get update || true
            apt-get install -y xfsprogs

At the base of the file, you will notice the commands used to disable Huge Pages and to install the XFS tools for mounting and formatting storage using the XFS filesystem. Further up the file is the reference to the 3rd party "startup-script" image from the Google Container Registry and the security context setting to state that the container should be run in privileged mode.

Next we need to deploy the DaemonSet with its "startup-script" container to all the hosts (nodes), before we attempt to create any GCE disks that need to be formatted as XFS:

$ kubectl apply -f hostvm-node-configurer-daemonset.yaml

In the GCE disk definitions described in the first blog post in this series (i.e. "gce-ssd-persistentvolume?.yaml"), a new parameter ("fsType: xfs", shown below) needs to be added, to indicate that the disk's filesystem type should be XFS:

apiVersion: "v1"
kind: "PersistentVolume"
metadata:
  name: data-volume-1
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: fast
  gcePersistentDisk:
    fsType: xfs
    pdName: pd-ssd-disk-1

Now in theory, this should be all that is required to get XFS working. Except on GKE, it isn't!

After deploying the DaemonSet and creating the GCE storage disks, the deployment of the "mongod" Service/StatefulSet will fail. The StatefulSet's pods do not start properly because the disks can't be formatted and mounted as XFS. It turns out that this is because, by default, GKE uses a variant of Chromium OS as the underlying host VM that runs the containers, and this OS flavour doesn't support XFS. However, GKE can also be configured to use a Debian based Host VM OS instead, which does support XFS.

To see the list of host VM OSes that GKE supports, the following command can be run:

$ gcloud container get-server-config
Fetching server config for europe-west1-b
defaultClusterVersion: 1.6.4
defaultImageType: COS
validImageTypes:
- CONTAINER_VM
- COS
....

Here, "COS" is the label for the Chromium OS and "CONTAINER_VM" is the label for the Debian OS. The easiest way to start leveraging the Debian OS image is to clear out all the GCE/GKE resources and Kubernetes cluster from the current project and start the deployment all over again. This time, when the initial command is run to create the new Kubernetes cluster, an additional argument ("--image-type=CONTAINER_VM", shown below) must be provided to define that the Debian OS should be used for each Host VM that is created as a Kubernetes node.

$ gcloud container clusters create "gke-mongodb-demo-cluster" --image-type=CONTAINER_VM

This time, when all the Kubernetes resources are created and deployed, the "mongod" containers correctly utilise XFS formatted persistent volumes. 

If this all seems a bit complicated, it is probably helpful to view the full end-to-end deployment flow, provided in my example GitHub project gke-mongodb-demo.

There is one final observation to make before finishing the discussion on XFS. In Google's online documentation, it is stated that the Debian Host VM OS is deprecated in favour of Chromium OS. I hope that in the future Google will add XFS support directly to its Chromium OS distribution, to make the use of XFS a lot less painful and to ensure XFS can still be used with MongoDB, if the Debian Host VM option is ever completely removed.

UPDATE 13-Oct-2017: Google has recently updated GKE and, in addition to upgrading Kubernetes to version 1.7, has removed the old Debian container VM option and added an Ubuntu container VM option, instead. In the new Ubuntu container VM, the XFS tools are already installed, and therefore do not need to be configured by the DaemonSet. The gke-mongodb-demo project has been updated accordingly, to use the new Ubuntu container VM and to omit the command to install the XFS tools.

Disabling NUMA

For optimum performance, the MongoDB Production Notes recommend that "on NUMA hardware, you should configure a memory interleave policy so that the host behaves in a non-NUMA fashion". The Docker Hub "mongo" container image, which has been used so far with Kubernetes in this blog series, already contains some bootstrap code to start the "mongod" process with the "numactl --interleave=all" setting. This setting makes the process environment behave in a non-NUMA way.

However, I believe it is worth specifying the "numactl" settings explicitly in the "mongod" Service/StatefulSet resource definition anyway, just in case other users choose to use an alternative or self-built Docker image for the "mongod" container. The excerpt below shows the added "numactl" command and its "--interleave=all" argument, required to run the containerised "mongod" process in a "non-NUMA" manner.

$ cat mongodb-service.yaml
....
....
      containers:
        - name: mongod-container
          image: mongo
          command:
            - "numactl"
            - "--interleave=all"
            - "mongod"
....
....


Controlling CPU & RAM Resource Allocation Plus WiredTiger Cache Size

Of course, when you are running a MongoDB database it is important to size both CPU and RAM resources correctly for the particular database workload, regardless of the type of host environment. In a Kubernetes containerised host environment, the amount of CPU & RAM resource dedicated to a container can be defined in the "resources" section of the container's declaration, as shown in the excerpt of the "mongod" Service/StatefulSet definition below:

$ cat mongodb-service.yaml
....
....
      containers:
        - name: mongod-container
          image: mongo
          command:
            - "mongod"
            - "--wiredTigerCacheSizeGB"
            - "0.25"
            - "--bind_ip"
            - "0.0.0.0"
            - "--replSet"
            - "MainRepSet"
            - "--auth"
            - "--clusterAuthMode"
            - "keyFile"
            - "--keyFile"
            - "/etc/secrets-volume/internal-auth-mongodb-keyfile"
            - "--setParameter"
            - "authenticationMechanisms=SCRAM-SHA-1"
          resources:
            requests:
              cpu: 1
              memory: 2Gi
....
....

In the example above, 1 virtual CPU (vCPU) and 2GB of RAM have been requested to run the container. You will also notice that an additional parameter has been defined for "mongod", specifying the WiredTiger internal cache size ("--wiredTigerCacheSizeGB"). In a containerised environment it is absolutely vital to explicitly state this value. If this is not done, and multiple containers end up running on the same host machine (node), MongoDB's WiredTiger storage engine may attempt to take more memory than it should. This is because of the way a container "reports" its memory size to running processes. As per the MongoDB Production Recommendations, the default cache size guidance is: "50% of RAM minus 1 GB, or 256 MB". Given that the amount of memory requested is 2GB (and 50% of 2GB minus 1GB is zero), the 256MB floor applies, so the WiredTiger cache size here has been set to 256MB (0.25GB).

If and when you define a different amount of memory for the container process, be sure to also adjust the WiredTiger cache size setting accordingly, otherwise the "mongod" process may not leverage all the memory reserved for it, by the container.
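As a quick rule-of-thumb check, the cache sizing guidance above can be captured in a throwaway Python helper (purely illustrative):

def wiredtiger_cache_size_gb(container_memory_gb):
    # "50% of RAM minus 1 GB, or 256 MB" - whichever is larger
    return max(0.25, (0.5 * container_memory_gb) - 1.0)

print(wiredtiger_cache_size_gb(2))  # 0.25 (i.e. 256MB, matching the value used above)
print(wiredtiger_cache_size_gb(8))  # 3.0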


Controlling Anti-Affinity for Mongod Replicas

When running a MongoDB Replica Set, it is important to ensure that none of the "mongod" replicas in the replica set are running on the same host machine as each other, to avoid inadvertently introducing a single point of failure. In a Kubernetes containerised environment, if containers are left to their own devices, different "mongod" containers could end up running on the same nodes. Kubernetes provides a way of specifying pod anti-affinity to prevent this from occurring. Below is an excerpt of a "mongod" Service/StatefulSet resource file which declares an anti-affinity configuration.

$ cat mongodb-service.yaml
....
....
  serviceName: mongodb-service
  replicas: 3
  template:
    metadata:
      labels:
        replicaset: MainRepSet
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: replicaset
                  operator: In
                  values:
                  - MainRepSet
              topologyKey: kubernetes.io/hostname
....
....

Here, a rule has been defined that asks Kubernetes to apply anti-affinity when deploying pods with the label "replicaset" equal to "MainRepSet", by looking for potential matches on the host VM instance's hostname, and then avoiding them.


Setting File Descriptor & User Process Limits

When deploying the MongoDB Replica Set on GKE Kubernetes, as demonstrated in the current GitHub project gke-mongodb-demo, you may notice some warnings about "rlimits" in the output of each containerised mongod's logs. These log entries can be viewed by running the following command:

$ kubectl logs mongod-0 | grep rlimits

2017-06-27T12:35:22.018+0000 I CONTROL  [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 29980 processes, 1000000 files. Number of processes should be at least 500000 : 0.5 times number of files.

The MongoDB manual provides some recommendations concerning the system settings for the maximum number of processes and open files when running a "mongod" process.

Unfortunately, thus far, I've not established an appropriate way to enforce these thresholds using GKE Kubernetes. This topic will possibly be the focus of a blog post for another day. However, I thought that it would be informative to highlight the issue here, with the supporting context, to allow others the chance to resolve it first.

UPDATE 13-Oct-2017: Google has recently updated GKE and, in addition to upgrading Kubernetes to version 1.7, has removed the old Debian container VM option and added an Ubuntu container VM option, instead. The default "rlimits" settings in the Ubuntu container VM are already appropriate for running MongoDB. Therefore a fix is no longer required to address this issue.

Summary

In this blog post I’ve provided some methods for addressing certain best practices when deploying a MongoDB Replica Set to the Google Kubernetes Engine. Although this post does not provide an exhaustive list of best practice solutions, I hope it proves useful for others (and myself) to build upon, in the future.

[Next post in series: Using the Enterprise Version of MongoDB on GKE Kubernetes]


Song for today: The Mountain by Jambinai