Friday, May 2, 2014

MongoDB Connector for Hadoop with Authentication - Quick Tip

If you are using the MongoDB Connector for Hadoop and you have enabled authentication on your MongoDB database (e.g. auth=true), you may find that you are prevented from getting data into or out of the database.

You may have provided the username and password to the connector (e.g. mongo.input.uri = "mongodb://myuser:mypassword@host:27017/mytestdb.mycollctn") for a Hadoop job that pulls data from the database. The connector will authenticate to the database successfully, but early in the job run, the job will fail with an error message similar to the following:

14/05/02 13:17:01 ERROR util.MongoTool: Exception while executing job... com.mongodb.hadoop.splitter.SplitFailedException: Unable to calculate input splits: need to login
        at com.mongodb.hadoop.MongoInputFormat.getSplits(
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(
        at org.apache.hadoop.mapreduce.Job$

This is because, under the covers, the connector needs to run the MongoDB-internal splitVector DB command to work out how to split the MongoDB data into sections ready to distribute across the Hadoop cluster. However, by default, the user that the connector logs in as is unlikely to have sufficient privileges to run this DB command. You can reproduce the issue easily by opening a mongo shell against the database, authenticating with your username and password, and then running the splitVector command manually. For example:

> var result = db.runCommand({splitVector: 'mytestdb.mycollctn', keyPattern: {_id: 1},
                              maxChunkSizeBytes: 32000000})
> result
{
    "ok" : 0,
    "errmsg" : "not authorized on mytestdb to execute command {
        splitVector: "mytestdb.mycollctn", keyPattern: { _id: 1.0 },
        maxChunkSizeBytes: 32000000.0 }",
    "code" : 13
}
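
The same check can be wrapped in a small shell function, which is handy for confirming later that a privilege change has taken effect. This is only a minimal sketch: the function name canRunSplitVector is hypothetical (it is not part of MongoDB or the connector), and it assumes it is handed a db handle that has already been authenticated as the connector's user.

```javascript
// Hypothetical helper (not part of MongoDB or the connector): returns true if
// the currently authenticated user may run splitVector on the given namespace.
function canRunSplitVector(db, namespace) {
    var result = db.runCommand({
        splitVector: namespace,         // full "database.collection" name
        keyPattern: { _id: 1 },         // split on the default _id index
        maxChunkSizeBytes: 32000000     // same chunk size as the manual test
    });
    // A denied command comes back as a document with ok: 0 instead of throwing
    return result.ok == 1;
}
```

In the mongo shell, after authenticating, you would call it as canRunSplitVector(db, "mytestdb.mycollctn"). A return value of false indicates that the connector's user would hit the same failure.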

To address this issue, open the mongo shell, authenticate as your administration user, and run the updateUser command to give the connector user the clusterManager role, which allows the connector to run the DB commands it requires. Note that updateUser replaces the user's whole roles array, so list every role the user should keep. For example:

use mytestdb
db.updateUser("myuser", {
    roles : [
        { role : "readWrite", db : "mytestdb" },     // normal data access
        { role : "clusterManager", db : "admin" }    // permits splitVector
    ]
})

After this, your Hadoop jobs with the connector should run fine.
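
If you only want to add the new role, MongoDB 2.6 also provides the db.grantRolesToUser() shell helper, which appends roles to a user and leaves the existing ones untouched. The following is a minimal sketch rather than something I ran in the test above: the wrapper name grantClusterManager is hypothetical, and it assumes a db handle authenticated as an administrator and pointing at the database where the user was created.

```javascript
// Hypothetical wrapper: adds clusterManager to a user without touching
// the roles that user already holds.
function grantClusterManager(db, username) {
    // db must be the database in which the user was created (mytestdb here)
    db.grantRolesToUser(username, [
        { role: "clusterManager", db: "admin" }
    ]);
}
```

In the mongo shell this would be used as: use mytestdb, followed by grantClusterManager(db, "myuser").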

Note: In my test, I ran Cloudera CDH version 5, MongoDB version 2.6 and Connector version 1.2 (built with target set to 'cdh4').

Song for today: Spanish Sahara by Foals