What Hadoop can and can’t do
By Brian Proffitt
Created 2012-06-21 03:00AM
The lure of using big data for your business is a strong one, and there is no brighter lure these days than Apache Hadoop, the scalable data storage platform that lies at the heart of many big data solutions.
But as attractive as Hadoop is, there is still a steep learning curve involved in understanding what role Hadoop can play for an organization, and how best to deploy it.
By understanding what Hadoop can and can’t do, you can get a clearer picture of how it can best be implemented in your own data center or cloud. From there, best practices can be laid out for a Hadoop deployment.
What Hadoop can’t do
We’re not going to spend a lot of time on what Hadoop is, since that’s well covered in documentation and media sources. It’s important to know the two major components of Hadoop: the Hadoop Distributed File System (HDFS) for storage, and the MapReduce framework that lets you perform batch analysis on whatever data you have stored within Hadoop. That data, notably, does not have to be structured, which makes Hadoop ideal for analyzing and working with data from sources like social media, documents, and graphs, or anything else that can’t easily fit within rows and columns.
That’s not to say you can’t use Hadoop for structured data. In fact, there are many solutions that take advantage of the relatively low storage expense per TB of Hadoop to simply store structured data there instead of in an RDBMS (relational database management system). But if your storage needs are not all that great, then shifting data back and forth between Hadoop and an RDBMS would be overkill.
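To make the MapReduce idea concrete, here is a minimal, single-machine sketch of the model in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are made up for illustration, and a real cluster would spread each phase across many nodes.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (word, 1) pair for every word in one unstructured text record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values collected for one key into a single result."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle, and reduce in sequence over an in-memory data set."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical unstructured input, standing in for files stored in Hadoop.
posts = ["Hadoop stores data", "Hadoop processes data in batches"]
counts = run_job(posts)
print(counts)
```

The point of the split is that the map and reduce steps never need to see the whole data set at once, which is what lets the work be divided among nodes.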
One area you would not want to use Hadoop for is transactional data. Transactional data, by its very nature, is highly complex, as a transaction on an e-commerce site can generate many steps that all have to be implemented quickly. That scenario is not at all ideal for Hadoop.
Nor would it be optimal for structured data sets that require very minimal latency, like when a website is served up by a MySQL database in a typical LAMP stack. That’s a speed requirement that Hadoop would poorly serve.
What Hadoop can do
Because of its batch processing, Hadoop should be deployed in situations like index building, pattern recognition, creating recommendation engines, and sentiment analysis. These are all situations where data is generated at a high volume, stored in Hadoop, and queried at length later using MapReduce functions.
But this does not mean that Hadoop should replace existing elements within your data center. On the contrary, Hadoop should be integrated within your existing IT infrastructure in order to capitalize on the myriad pieces of data that flow into your organization.
Consider, for instance, a fairly typical non-Hadoop enterprise website that handles commercial transactions. According to Sarah Sproehnle, Director of Educational Services for Cloudera, the logs from one of their customer’s popular sites would undergo an ETL (extract, transform, and load) procedure on a nightly run that could take up to three hours before depositing the data in a data warehouse, at which time a stored procedure would be kicked off and (after another two hours) the cleansed data would reside in the data warehouse. The final data set, though, would only be a fifth of its original size — meaning that if there was any value to be gleaned from the entire original data set, it would be lost.
After Hadoop was integrated into this organization, things improved dramatically in terms of time and effort. Instead of undergoing an ETL operation, the log data from the Web servers was sent straight to the HDFS within Hadoop in its entirety. From there, the same cleansing procedure was performed on the log data, only now using MapReduce jobs. Once cleaned, the data was then sent to the data warehouse. But the operation was much faster, thanks to the removal of the ETL step and the speed of the MapReduce operation. And, all of the data was still being held within Hadoop — ready for any additional questions the site’s operators might come up with later.
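The cleansing step described above might look something like the following sketch, written as plain Python map and reduce functions rather than an actual Hadoop job. The log format, the regular expression, the field names, and the sample lines are all assumptions made for illustration; the article does not describe the customer's real logs or cleansing rules.

```python
import re
from collections import Counter

# Assumed pattern for a Common Log Format-style line; real logs may differ.
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)$'
)

def clean_record(line):
    """Parse one raw log line, returning None if it is malformed."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    host, timestamp, method, path, status, size = match.groups()
    return {
        "host": host,
        "timestamp": timestamp,
        "method": method,
        "path": path,
        "status": int(status),
        "bytes": 0 if size == "-" else int(size),  # "-" means no body sent
    }

def map_clean(lines):
    """Map step: emit (path, record) for every line that parses cleanly."""
    for line in lines:
        record = clean_record(line)
        if record is not None:
            yield record["path"], record

def reduce_hits(mapped):
    """Reduce step: count cleaned records per requested path."""
    return Counter(path for path, _ in mapped)

# Hypothetical raw logs, standing in for files already landed in Hadoop.
raw_logs = [
    '10.0.0.1 - - [21/Jun/2012:03:00:00 +0000] "GET /cart HTTP/1.1" 200 512',
    'garbled line with no structure',
    '10.0.0.2 - - [21/Jun/2012:03:00:05 +0000] "GET /cart HTTP/1.1" 200 -',
    '10.0.0.3 - - [21/Jun/2012:03:00:09 +0000] "POST /checkout HTTP/1.1" 302 128',
]
cleaned = list(map_clean(raw_logs))  # what would go on to the warehouse
hits = reduce_hits(cleaned)
print(len(raw_logs), len(cleaned), dict(hits))
```

Note that `raw_logs` is left untouched: the cleaned subset moves on, while the full original stays available for later questions, which is the benefit the example above describes.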
This is a critical point to understand about Hadoop: It should never be thought of as a replacement for your existing infrastructure, but rather as a tool to augment your data management and storage capabilities. Using tools like Apache Sqoop, which can move data from an RDBMS to Hadoop and back, or Apache Flume, which can stream system logs into Hadoop in real time, you can connect your existing systems with Hadoop and have your data processed no matter the size. All you need to do is add nodes to Hadoop to handle the storage and the processing.
Required hardware and costs
So how much hardware are we talking?
Estimates for the hardware needed for Hadoop vary a bit, depending on who you ask. Cloudera’s list is detailed and specific on what a typical slave node for Hadoop should be:
- Midrange processor
- 4GB to 32GB of memory
- 1 GbE network connection to each node, with a 10 GbE top-of-rack switch
- A dedicated switching infrastructure to avoid Hadoop saturating the network
- 4 to 12 drives per machine, non-RAID
Hortonworks, another Hadoop distributor, has similar specs, though it is a little more vague on the network stats, because of the varying workloads any given organization can apply to their Hadoop instance.
“As a rule of thumb, watch the ratio of network-to-computer cost and aim for network cost being somewhere around 20 percent of your total cost. Network costs should include your complete network, core switches, rack switches, any network cards needed, etc.,” wrote Hortonworks CTO Eric Baldeschwieler.
For its part, Cloudera estimates anywhere from $3,000 to $7,000 per node, depending on what you settle on for each node.
Sproehnle also outlined a fairly easy to follow rule-of-thumb for planning your Hadoop capacity. Because Hadoop is linearly scalable, you will increase your storage and processing power whenever you add a node. That makes planning straightforward.
If your data is growing by 1TB a month, for instance, then here’s how to plan: Hadoop replicates data three times by default, so you will need 3TB of raw storage space to accommodate the new terabyte. Allowing a little extra space (Sproehnle estimates 30 percent overhead) for processing operations of data, that puts the actual need at about 4TB that month (3TB x 1.3 = 3.9TB). If you’re using machines with four 1TB drives for your nodes, that’s one new node per month.
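That rule of thumb is easy to turn into a small calculator. The sketch below is only an illustration: the function and parameter names are made up, the defaults come from the figures quoted in this article (three-way replication, 30 percent overhead, four 1TB drives per node, $3,000 to $7,000 per node), and your own numbers may well differ.

```python
import math

def nodes_needed_per_month(growth_tb, replication=3, overhead=0.30,
                           drives_per_node=4, drive_tb=1.0):
    """Return (required storage in TB, whole nodes to add) for one month."""
    raw_tb = growth_tb * replication           # every block is stored 3 times
    required_tb = raw_tb * (1 + overhead)      # headroom for processing
    node_capacity_tb = drives_per_node * drive_tb
    nodes = math.ceil(required_tb / node_capacity_tb)  # no partial nodes
    return required_tb, nodes

def cost_range(nodes, low_per_node=3000, high_per_node=7000):
    """Return the (low, high) dollar estimate for a given number of nodes."""
    return nodes * low_per_node, nodes * high_per_node

# The article's example: data growing by 1TB a month.
required, nodes = nodes_needed_per_month(growth_tb=1)
low, high = cost_range(nodes)
print(f"{required:.1f}TB needed, {nodes} node(s), ${low:,} to ${high:,}")
```

Rounding up with `math.ceil` is a deliberate choice here, since you cannot buy a fraction of a node; at 5TB of monthly growth, for example, the same function asks for five nodes.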
The nice thing is that all new nodes are immediately put to use when connected, getting you X times the processing and storage, where X is the number of nodes.
Installing and managing Hadoop nodes is not exactly trivial, but there are many tools out there that can help. Cloudera Manager, Apache Ambari (which is what Hortonworks uses for its management system), and the MapR Control System are all equally effective Hadoop cluster managers. If you are using a “pure” Apache Hadoop solution, you can also look at the third-party Hadoop management systems Platform Symphony MapReduce, StackIQ Rocks + Big Data, and Zettaset Data Platform.
This is just the tip of the iceberg, of course, when it comes to deploying a Hadoop solution for your organization. Perhaps the biggest take-away is understanding that Hadoop is not meant to replace your current data infrastructure, only augment it.
Once this important distinction is made, it becomes easier to start thinking about how Hadoop can help your organization, without ripping out the guts of your data processes.