
This excerpt is from InfoWorld. To view the whole article, click here.

Big Data, Big Challenges: Hadoop in the Enterprise

 

Fresh from the front lines: Common problems encountered when putting Hadoop to work — and the best tools to make Hadoop less burdensome

As I work with larger enterprise clients, a few Hadoop themes have emerged. A common one is that most companies seem to be trying to avoid the pain they experienced in the heyday of Java EE, SOA, and .NET — as well as that terrible time when every department had to have its own portal.

To this end, they’re trying to centralize Hadoop, much as many companies attempt to do with RDBMSes or storage. Although you wouldn’t use Hadoop for the same stuff you’d use an RDBMS for, Hadoop has many advantages over an RDBMS in terms of manageability. The row-store RDBMS paradigm (that is, Oracle) has inherent scalability limits, so when you attempt to create one big instance or RAC cluster to serve all, you end up serving none. With Hadoop, you have more ability to pool compute resources and dish them out.

Unfortunately, Hadoop management and deployment tools are still early-stage at best. As awful as Oracle’s reputation may be, I could install it by hand in minutes. Installing a Hadoop cluster that does more than “hello world” will take hours at least. Then, once you start handling hundreds or thousands of nodes, you’ll find the tooling a bit lacking.

Companies are using devops tools like Chef, Puppet, and Salt to create manageable Hadoop solutions. They face many challenges on the way to centralizing Hadoop:

  • Hadoop isn’t a thing: Hadoop is a word we use to mean “that big data stuff” like Spark, MapReduce, Hive, HBase, and so on. There are a lot of pieces.
  • Diverse workloads: Not only do you potentially need to balance a Hive-on-Tez workload against a Spark workload, but some workloads are more constant and sustained than others.
  • Partitioning: YARN is pretty much a clusterwide version of the process scheduler and queuing system that you take for granted in the operating system of the computer, phone, or tablet you’re using right now. You ask it to do stuff, and it balances it against the other stuff it’s doing, then distributes the work accordingly. Obviously, this is essential. But there’s a pecking order — and who you are often determines how many resources you get. Also, streaming jobs and batch jobs may need different levels of service. You may have no choice but to deploy two or more Hadoop clusters, which you need to manage separately. Worse, what happens when workloads are cyclical?
  • Priorities: Though your organization may want to provision a 1,000-node Spark cluster, that doesn’t mean you have the right to provision 1,000 nodes. Can you really get the resources you need? (A minimal sketch of checking cluster headroom follows this list.)
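
On that last point, it helps to check what the cluster can actually give you before you ask. The sketch below is an assumption-laden illustration, not a prescribed approach: it queries the YARN ResourceManager’s REST API for cluster metrics. The hostname is hypothetical, 8088 is the default ResourceManager web port, and the field names match recent Hadoop 2.x releases, so verify them against your own version.

    import json
    from urllib.request import urlopen

    # Hypothetical ResourceManager host; 8088 is the default web/REST port.
    RM_URL = "http://resourcemanager.example.com:8088"

    def cluster_headroom(rm_url=RM_URL):
        # /ws/v1/cluster/metrics returns a "clusterMetrics" object with
        # memory and vcore totals for the whole cluster.
        with urlopen(rm_url + "/ws/v1/cluster/metrics") as resp:
            m = json.load(resp)["clusterMetrics"]
        return {
            "active_nodes": m["activeNodes"],
            "available_mb": m["availableMB"],
            "total_mb": m["totalMB"],
            "available_vcores": m["availableVirtualCores"],
            "total_vcores": m["totalVirtualCores"],
        }

    if __name__ == "__main__":
        headroom = cluster_headroom()
        print(headroom)
        # Crude sanity check before a big Spark submission: is even half
        # the cluster's memory free right now, or will the request just queue?
        if headroom["available_mb"] < 0.5 * headroom["total_mb"]:
            print("Less than half the cluster is free; that 1,000-node ask will likely wait.")

The same ResourceManager API also exposes per-queue capacities (under /ws/v1/cluster/scheduler), which is where the pecking order described above actually gets enforced.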

On one hand, many organizations have deployed Hadoop successfully. On the other, if this smells like building your own PaaS with devops tools, your nose is working correctly. You don’t have a lot of choice yet. Solutions are coming, but none of them yet solves the problems of deploying and maintaining Hadoop in a large organization:

  • Ambari: This Apache project is a marvel when it works. Each version gets better and manages more nodes. But Ambari isn’t for provisioning more VMs, and it does a better job of initial provisioning than of reprovisioning or reconfiguring. Ambari probably isn’t a long-term solution for provisioning large multitenant environments with diverse workloads.
  • Slider: Slider enables non-YARN applications to be managed by YARN. Many Hadoop projects at Apache are really controlled or sponsored by one of the major vendors. In this case, the sponsor is Hortonworks, so it pays to look at Hortonworks’ road map for Slider. One of the more interesting developments is the ability to deploy Dockerized apps via YARN based on your workload. I haven’t seen this in production yet, but it’s very promising.
  • Kubernetes: I admit to being biased against Kubernetes because I can’t spell it. Kubernetes is a way to pool compute resources Google-style, and it brings us one step closer to a PaaS-like feel for Hadoop. I can see a potential future in which you use OpenShift, Kubernetes, Slider, YARN, and Docker together to manage a diverse cluster of resources. Cloudera hired a Google exec with that on his resume.
  • Mesos: Mesos has some overlap with Kubernetes but competes directly with YARN, or more accurately with YARN/Slider. The best way to understand the difference is that YARN is more like traditional task scheduling: a process gets scheduled against resources that YARN has available to it on the cluster. With Mesos, an app requests resources, Mesos responds with an offer, and the app can “reject” that offer and wait for a better one, sort of like dating (a toy sketch contrasting the two models follows this list). If you really want to understand this in detail, MapR has a good walkthrough (though the conclusions may be a bit biased). Finally, there’s a YARN/Mesos hybrid called Myriad. The hype cycle has burned through a bit quickly for Mesos.
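
To make that YARN-versus-Mesos distinction concrete, here is a toy sketch of the two scheduling models. It deliberately uses made-up function names rather than either project’s real API; the point is only the shape of the interaction, “ask and wait” versus “receive offers and accept or decline.”

    from dataclasses import dataclass
    import random

    @dataclass
    class Offer:
        node: str
        cpus: int
        mem_gb: int

    def yarn_style_request(queue, cpus, mem_gb):
        # YARN model: state what you need; the ResourceManager decides when
        # and where those containers are granted, subject to queue capacity.
        return f"request queued on '{queue}': {cpus} vcores / {mem_gb} GB"

    def mesos_style_negotiation(offers, cpus, mem_gb):
        # Mesos model: the master sends resource offers; the framework can
        # decline and hold out for a better one ("sort of like dating").
        for offer in offers:
            if offer.cpus >= cpus and offer.mem_gb >= mem_gb:
                return f"accepted offer on {offer.node}"
        return "declined all offers; waiting for something better"

    if __name__ == "__main__":
        print(yarn_style_request("spark-batch", cpus=8, mem_gb=32))
        offers = [Offer(f"node{i}", random.randint(2, 16), random.randint(8, 64))
                  for i in range(5)]
        print(mesos_style_negotiation(offers, cpus=8, mem_gb=32))

In short, YARN centralizes the placement decision, while Mesos pushes part of that decision down to the framework itself.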

By Andrew C. Oliver, Strategic Developer, www.infoworld.com
