Hadoop & Big Data Analysis


Enterprises generate vast volumes of data every day and they need to extract business value from this information. However, data troves are so huge that traditional business intelligence tools like relational databases and math packages are no longer effective. What has recently emerged as the de facto standard for big data analysis is Apache Hadoop, a free, Java-based programming framework. Hadoop is effective because it distributes the processing to where the big data is stored. A large data cluster is broken down into hundreds or thousands of nodes where the computing is actually done. This provides for extraordinarily scalable workloads and because Hadoop replicates big data across the cluster, the failure of any node does not impact processing. This enables Hadoop jobs to be conducted, and big data to be stored, on commodity hardware. Hadoop is said to be a little like RAID in that instead of replicating big data across many inexpensive disks, big data is replicated across many inexpensive servers.

Hadoop basically consists of two parts—the Hadoop Distributed File System (HDFS) and MapReduce, both of which are derived from Google technologies. HDFS enables the distributed architecture and uses a system called NameNode to track big data across the nodes. MapReduce is the not so secret sauce at the core of Hadoop. The Map function distributes the processing to the individual nodes and Reduce collates the work from all the nodes to produce a single result.

Hadoop is a platform upon which specific applications can be created and run to process and analyze even petabytes of big data. It is a practical, cost-effective big data solution for data mining, financial analyses, and scientific simulations. If it isn’t already, Hadoop will someday directly or indirectly impact your business.