Elastic Map Reduce on AWS

Last week, I put out a post about Redshift on AWS as an effective tool for quickly and dynamically dipping your toe into a large data warehouse environment.

Another AWS tool that I experimented with is Amazon’s Elastic Map Reduce (EMR). It is a managed installation of open source Hadoop that supports MapReduce as well as a number of other highly parallel computing approaches. EMR also supports a large number of tools to help with implementation (and keeps the environment fresh), such as Pig, Apache Hive, HBase, Spark, Presto… It also works with data in a range of AWS data stores, such as Amazon S3 and DynamoDB.

EMR supports a strong security model, enabling encryption at rest as well as in transit, and is available in GovCloud. It handles a range of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.

For many organizations, a Hadoop cluster has been a bridge too far for a range of reasons, including support and infrastructure costs and skills. EMR seems to have effectively addressed those concerns, allowing you to set up or tear down the cluster in minutes without having to worry much about the details of node provisioning, cluster setup, Hadoop configuration, or cluster tuning.

For my proof of concept efforts, Amazon EMR pricing appeared to be simple and predictable: you pay a per-second rate for the cluster's installation and use, with a one-minute minimum charge (it used to be an hour!). You can launch a 10-node Hadoop cluster for less than a dollar an hour (naturally, data transport charges are handled separately). There are ways to keep your EMR costs down though.
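To make the per-second billing with a one-minute minimum concrete, here is a rough sketch in Python. The function name and the $0.09 per node-hour rate are my own illustrative assumptions (chosen only so that 10 nodes come in under a dollar an hour), not published AWS prices, and data transport charges are left out.

```python
import math


def emr_cluster_cost(runtime_seconds, node_count, hourly_rate_per_node,
                     minimum_seconds=60):
    """Estimate a cluster's cost under per-second billing.

    hourly_rate_per_node is a hypothetical combined rate in dollars;
    look up real prices before relying on the result.
    """
    if runtime_seconds < 0 or node_count < 0 or hourly_rate_per_node < 0:
        raise ValueError("inputs must be non-negative")
    # Round partial seconds up, then apply the one-minute minimum charge.
    billed_seconds = max(math.ceil(runtime_seconds), minimum_seconds)
    return node_count * hourly_rate_per_node * billed_seconds / 3600.0


def hourly_billing_cost(runtime_seconds, node_count, hourly_rate_per_node):
    """The same estimate under the older model that billed whole hours."""
    billed_hours = max(math.ceil(runtime_seconds / 3600.0), 1)
    return node_count * hourly_rate_per_node * billed_hours


if __name__ == "__main__":
    # A 25-minute experiment on 10 nodes at the assumed rate.
    per_second = emr_cluster_cost(25 * 60, 10, 0.09)
    per_hour = hourly_billing_cost(25 * 60, 10, 0.09)
    print(f"per-second billing: ${per_second:.3f}")
    print(f"hourly billing:     ${per_hour:.3f}")
```

For short experiments, the difference between the two models is most of the bill, which is why the move away from the one-hour minimum matters for proof of concept work.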

The EMR approach appears to be focused on flexibility, allowing complete control over your cluster. You have root access to every instance and can install additional applications and customize the cluster with bootstrap actions (which can be important since it takes a few minutes to get a cluster up and running). This takes time and personnel out of repetitive tasks.
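As an illustration of customizing a cluster with a bootstrap action, here is a minimal sketch that builds the request for boto3's `run_job_flow` call. The helper name, the S3 script path, the instance type, the release label, and the choice of applications are all assumptions for the example. The two role names are the usual EMR defaults, which may not exist in your account.

```python
def build_cluster_request(name, node_count, bootstrap_script_s3_path,
                          instance_type="m5.xlarge",
                          release_label="emr-6.15.0"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Instance type, release label, and applications are illustrative
    defaults, not recommendations.
    """
    if node_count < 2:
        raise ValueError("need one primary node and at least one core node")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": node_count - 1},
            ],
            # Shut the cluster down once its steps finish, to limit cost.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # The bootstrap script runs on every node while the cluster starts.
        "BootstrapActions": [
            {"Name": "install-extra-tools",
             "ScriptBootstrapAction": {"Path": bootstrap_script_s3_path,
                                       "Args": []}},
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


if __name__ == "__main__":
    # Hypothetical bucket and script name.
    request = build_cluster_request(
        "hadoop-poc", 10, "s3://example-bucket/bootstrap/install-tools.sh")
    print(request["Instances"]["InstanceGroups"])
    # With AWS credentials configured, launching and tearing down would be:
    #   import boto3
    #   emr = boto3.client("emr", region_name="us-east-1")
    #   response = emr.run_job_flow(**request)
    #   emr.terminate_job_flows(JobFlowIds=[response["JobFlowId"]])
```

Keeping the request in a plain function means the same bootstrap script and layout get reused each time a cluster is recreated, so nobody has to repeat the setup by hand.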

There is a wide range of tutorials and training available as well as tools to help estimate billing.

Overall, I’d say that if an organization is interested in experimenting with Hadoop, this is a great way to dive in without getting soaked.