Saturday, November 7, 2015

Groovy Shell Scripting Meets Hadoop

In this post I show how Hadoop map and reduce functions can be written as Groovy shell scripts in a concise and elegant way.

The most common way to kick off a Hadoop map-reduce job from the command line is a Python script. Alternatively, one could write the map and reduce functions in Java, but then one pays the additional cost of packaging them in a jar and running the jar from the command line. The difference between Python and Java is that Python is a scripting language, while Java is not. For instance, you can run a Python file directly as a script by adding "#!/usr/bin/env python" at the top.

Groovy fills this gap, being the first scripting language among the family of JVM languages. Reasons to prefer Groovy over Python include:
  1. You are a Java/Groovy developer and you want to write the map and reduce functions in a familiar language;
  2. You want to leverage the power of the JVM, as well as the vast amount of libraries written in any JVM language;
  3. You find Python's syntax pathetic, e.g. change the indentation and you change the code's semantics;
  4. Groovy's goodies allow you to quickly write concise and elegant code.

Requirements

For running Hadoop, I use the Cloudera VM (you can download it here for free). Of course, you can use whatever Hadoop infrastructure you wish; hereafter I refer to the Cloudera VM, and you will need to adjust accordingly if you use an alternative.

Groovy needs to be installed on the machines that run Hadoop. It does not come with the Cloudera VM, so you need to install it yourself; read the instructions here.

The Wordcount Example

This is probably the most popular "hello world" Hadoop example. The problem is stated as follows: given a collection of files that contain strings separated by blank spaces, compute the number of times each word occurs across the collection. The output should be a file with records of the form <word #occurrences>.

The mapper groovy script follows.
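(A minimal sketch; it assumes, as stated below, that words are separated by whitespace.)

#!/usr/bin/env groovy
// For each line on standard input: trim it, split it into words
// and emit "word<TAB>1" for every occurrence found.
System.in.eachLine { line ->
    line.trim().split(/\s+/).each { word ->
        if (word) println "${word}\t1"  // skip the empty token a blank line produces
    }
}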


The logic is clearly communicated by the code: for each line read from the standard input, trim it, get all the words (assuming they are separated from each other by blank spaces) and for each word emit "word	1" (a tab separates the key from the value), meaning that we found one occurrence of the word. The first line, #!/usr/bin/env groovy, tells the OS to use Groovy to interpret this file.

The reducer groovy script follows.
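(Again a minimal sketch; it assumes tab-separated "word	count" records, sorted by word, arrive on standard input.)

#!/usr/bin/env groovy
// Records arrive sorted by key, so all counts for the same word are consecutive.
def previous = null
def counter = 0

System.in.eachLine { line ->
    def (word, count) = line.trim().split(/\t/)
    if (previous != null && word != previous) {
        println "${previous}\t${counter}"  // the previous word is complete, emit it
        counter = 0
    }
    previous = word
    counter += count.toInteger()
}

// Emit the record for the last word.
if (previous != null) println "${previous}\t${counter}"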

Again, the logic is straightforward. Hadoop sorts the mapper's output by key before it reaches the reducer, so all records of the same word arrive consecutively. For each line read from the standard input, trim it and extract the word and #occurrences fields out of it. If the current word is different from the previous one, emit the record "word	counter" and reset the counter. In any case, update the previous word with the current one and add the current value (#occurrences) to the total counter. In the end, emit the record for the last word.

You need to make these 2 scripts executable:
chmod +x wordcount_mapper.groovy
chmod +x wordcount_reducer.groovy
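
Since the scripts just read from stdin and write to stdout, you can test them locally before involving Hadoop, simulating the sort phase with the sort command (sample.txt stands for any local text file):

cat sample.txt | ./wordcount_mapper.groovy | sort | ./wordcount_reducer.groovy

You can then copy your input files to HDFS with something like (assuming they are local .txt files):

hdfs dfs -mkdir -p /user/cloudera/wordcount_input
hdfs dfs -put *.txt /user/cloudera/wordcount_input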


Assuming you have put the input files under the HDFS directory '/user/cloudera/wordcount_input', you can run the Hadoop streaming job with:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/cloudera/wordcount_input \
    -output /user/cloudera/wordcount_output \
    -mapper wordcount_mapper.groovy \
    -reducer wordcount_reducer.groovy \
    -file wordcount_mapper.groovy \
    -file wordcount_reducer.groovy

The -file options ship the two scripts to the nodes that run the tasks.
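
When the job succeeds, you can inspect the output with:

hdfs dfs -cat /user/cloudera/wordcount_output/part-*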

If you get a failure that includes the message "Container killed on request. Exit code is 143", it means that Groovy is not visible to the datanode (the Cloudera VM includes just one datanode) that needs to run the map or reduce tasks. You may have installed Groovy on the cluster's main node, but you need to install it on all datanodes too, as Hadoop pushes the computation to them.

A workaround for the Cloudera VM is to replace "groovy" in the first line of your scripts with the absolute path of the groovy executable. To find where Groovy is installed, run:
printenv GROOVY_HOME

For instance, if you have installed it with SDKMAN!, groovy is located under '/home/cloudera/.sdkman/candidates/groovy/current/bin/' and you have to replace the first line of both scripts with:
#!/usr/bin/env /home/cloudera/.sdkman/candidates/groovy/current/bin/groovy

Although not needed for this problem, you can reuse any published JVM library: Grape will fetch the dependency from a repository at runtime; you just annotate with @Grab and add the import statements as you normally do in Java. An example can be found here.
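For illustration, a mapper that delegates the word splitting to a library could look like this (the library and version are arbitrary examples):

#!/usr/bin/env groovy
// Grape fetches the dependency from Maven Central on first run.
@Grab('org.apache.commons:commons-lang3:3.12.0')
import org.apache.commons.lang3.StringUtils

System.in.eachLine { line ->
    // StringUtils.split splits on whitespace and drops empty tokens.
    StringUtils.split(line).each { word ->
        println "${word}\t1"
    }
}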
