The 7 most common Hadoop and Spark projects
There's an old axiom that goes something like this: If you offer someone your full support and financial backing to do something different and innovative, they’ll end up doing what everyone else is doing.
So it goes with Hadoop, Spark, and Storm. Everyone thinks they're doing something special with these new big data technologies, but it doesn't take long to encounter the same patterns over and over. Specific implementations may differ somewhat, but based on my experience, here are the seven most common projects.
Project No. 1: Data consolidation
Call it an "enterprise data hub" or a "data lake." The idea is that you have disparate data sources and you want to perform analysis across them. This type of project consists of taking feeds from all the sources (in real time or in batches) and shoving them into Hadoop. Sometimes this is step one toward becoming a "data-driven company"; sometimes you simply want pretty reports. Data lakes usually materialize as files on HDFS and tables in Hive or Impala. There's a bold new world where much of this shows up in HBase -- and, in the future, Phoenix, because Hive is slow.
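To make the ingestion step concrete, here's a minimal sketch of a Spark batch job that lands one feed in the lake and registers it as a Hive table. The paths and table name (hdfs:///landing/crm/..., lake.crm_contacts) are hypothetical stand-ins, and the sketch assumes a cluster where Spark is configured with Hive support:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class FeedIngest {
    public static void main(String[] args) {
        // Hive support lets the job register landed data as a queryable table.
        SparkSession spark = SparkSession.builder()
                .appName("data-lake-ingest")
                .enableHiveSupport()
                .getOrCreate();

        // Hypothetical source: CSV feeds dropped by an upstream system.
        Dataset<Row> feed = spark.read()
                .option("header", "true")
                .csv("hdfs:///landing/crm/2015-09-01/*.csv");

        // Land the batch as Parquet and expose it through Hive/Impala.
        feed.write()
                .mode(SaveMode.Append)
                .format("parquet")
                .saveAsTable("lake.crm_contacts");

        spark.stop();
    }
}

Parquet appears here because a columnar format keeps the resulting Hive or Impala tables reasonably fast to scan; substitute whatever format your lake standardizes on.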