Machine Learning Times

This excerpt is from InfoWorld. To view the whole article, click here.

9 years ago
Debunked! 9 Myths about Big Data and Hadoop

 

These unfounded beliefs about big data skills, technology, and technology fit can lead you astray.

Big data analytics is one of the major trends every company is told it must jump on for competitive advantage, even survival. As a result, there’s a lot of mythology around big data. Those myths can lead you astray, wasting resources or putting you on dead-end paths. They can also cause you to miss opportunities where big data approaches help.

Here are the nine biggest myths about big data and Hadoop that you should not believe.

Myth No. 1: You can get data scientists

Recently, a presales engineer at one of my company’s partners mentioned how much trouble his firm had finding data scientists. I asked about the qualifications his company was seeking. Well, they need to have a doctorate in math, a background in computer science, and what amounts to an MBA, not to mention actual work experience in all of those fields. I asked, “How old is this person, 90?”

Here’s what actually exists:

  • Good mathematicians who write crap Python and often need the business stuff spoon-fed to them
  • Good computer science people who understand some math
  • Good computer science people who understand business after working enough problems
  • Business types who understand math
  • Subject matter experts
  • Leaders who know how to get these people to work together

Because that company could not find this data-scientist unicorn, it had to create a working group with a cross-section of expertise. This is in fact what you have to do.

Myth No. 2: Everything is new

Technologists like to throw away the past, preferring new tools for what they claim is a totally new reality or problem set. That’s rarely the case.

For example, the Kafka message broker is portrayed as a big-data-needs-a-new-tool product. But compared to other message brokers, it has a pretty poor feature set and is immature. What’s actually new (meaning different): Kafka is architected for the Hadoop platform and with massive distribution in mind. That could be useful, if you can accept its flaws.

That said, sometimes you need more sophisticated routing and guarantees. Use ActiveMQ or a more robust option for those situations.

Myth No. 3: Machine learning is what you need

I estimate that about 85 percent of what people call machine learning is simple statistics. Most of your problems are probably simple math and analysis. Start there.
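
To make "start there" concrete, here is a minimal sketch of the kind of simple statistics that often gets labeled machine learning: an ordinary least-squares line fit using nothing but the Python standard library. The data and the names `fit_line`, `predict`, `ad_spend`, and `units_sold` are hypothetical, invented for illustration, and are not from any particular project.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Covariance-style and variance-style sums (unnormalized).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new x value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: monthly ad spend (in thousands) vs. units sold.
ad_spend = [1, 2, 3, 4, 5]
units_sold = [12, 14, 16, 18, 20]

model = fit_line(ad_spend, units_sold)
forecast = predict(model, 6)
print(model, forecast)
```

If a baseline like this already answers the business question, a heavier model may not be worth the effort. If it does not, at least you now have a number to beat.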

Myth No. 4: You are special

As the philosopher Durden once said, “You are not special. You are not a delicate and unique snowflake.” Guess what? About half of the industry is busy writing the same ETL scripts for many of the same data sources and custom-creating the same analysis. Hell, in any sizable company, many departments probably are duplicating this work as well.

Needless to say, it’s a good time to be a big data consultant.

Myth No. 5: Hive is fast

Hive is not fast. It cannot be made to impress you. Yes, the new version is better, but it will still underwhelm you from a performance perspective. It scales well, but you may need multiple tools in your chest to hit Hadoop with SQL.

Myth No. 6: You can use clusters with fewer than 12 nodes

Hadoop 2+ barely fits on 12 nodes — anything less and you will wait forever for it to even start. Plus, anything you run will complete in cricket time, if at all. (Well, you can run “hello world” on 12 nodes.) Hadoop 2 runs more processes, which means you need more nodes and more memory.

Spark will do better minus the load time from HDFS so long as the data set fits in memory.

Myth No. 7: Virtualization is a solution for your data nodes

Your vendor told you no. Your IT team balked. No, you cannot put data nodes on your SAN. But if you put your management nodes in VMs, you could hit a bottleneck if writes to the logs and journals incur latency, or if you get low IOPS or high latency to the data nodes.

That said, Amazon Web Services and others navigate these issues and still manage reasonable performance and scalability. You can too, but you need to distinguish this from your internal file servers and your external corporate presence site, as well as manage hardware and virtualized resources effectively.

Remember: Throughput and latency are orthogonal. HDFS cares about both in different places.
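
A rough way to see why the two are independent: model the time for one transfer as latency plus size divided by throughput. The sketch below is a simplification under stated assumptions. The two storage paths, their numbers, and the request sizes are all hypothetical, and the function `transfer_seconds` is invented for illustration rather than taken from Hadoop.

```python
MB = 1_000_000  # bytes, decimal megabyte for easy arithmetic


def transfer_seconds(size_bytes, latency_s, throughput_bytes_per_s):
    """Simplified model: fixed startup latency plus streaming time."""
    return latency_s + size_bytes / throughput_bytes_per_s


# Hypothetical path A: high throughput, but slow to respond.
path_a = {"latency_s": 0.050, "throughput_bytes_per_s": 1000 * MB}
# Hypothetical path B: quick to respond, but a narrow pipe.
path_b = {"latency_s": 0.001, "throughput_bytes_per_s": 100 * MB}

small_write = 4_000      # e.g. a small journal or log entry
large_read = 128 * MB    # e.g. one large sequential read

for name, size in [("small write", small_write), ("large read", large_read)]:
    t_a = transfer_seconds(size, **path_a)
    t_b = transfer_seconds(size, **path_b)
    winner = "A" if t_a < t_b else "B"
    print(f"{name}: A={t_a:.6f}s B={t_b:.6f}s faster={winner}")
```

With these made-up numbers, the small write finishes sooner on the low-latency path, while the large read finishes sooner on the high-throughput path. That is the sense in which small journal and log writes are sensitive to latency and bulk data reads are sensitive to throughput.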


By: Andrew Oliver, Strategic Developer, InfoWorld
Originally published at www.infoworld.com
