Monday, October 17, 2011

Looking at Hadoop and Mahout

Since I finished my most recent contract work assignment on Friday, I'm going to spend some time reading up on Hadoop and Mahout.
 
Hadoop:
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
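
To get a feel for that "simple programming model," here is a minimal sketch of the classic word-count MapReduce job in Java, roughly as the Hadoop tutorial presents it. The class name WordCount and the input/output paths taken from the command line are just placeholders for illustration, not anything specific to my setup.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in an input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combiner pre-aggregates map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework handles splitting the input, scheduling map and reduce tasks across the cluster, and rerunning tasks that fail; the application only supplies the two functions.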
Mahout:
The Apache Mahout project's goal is to build scalable machine learning libraries.
 
Currently Mahout mainly supports four use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes, for example, text documents and groups them into sets of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart contents) and identifies which individual items usually appear together. A small recommendation-mining sketch follows below.
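
As a concrete example of the recommendation-mining case, here is a short sketch using Mahout's Taste collaborative-filtering API, which runs as a plain Java program (the larger distributed algorithms run on Hadoop). The file name ratings.csv, the neighborhood size of 10, and the user ID 1 are illustrative assumptions, not part of any particular dataset.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv: one "userID,itemID,preference" line per rating (illustrative file name)
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Compare users by the Pearson correlation of their ratings
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

    // Use the 10 most similar users as each user's neighborhood
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 item recommendations for user 1
    List<RecommendedItem> recommendations = recommender.recommend(1L, 3);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}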
In short, Big Data and cluster-based computing.
 
Today I'm reading through the Hadoop Tutorial.

-- Jack Krupansky
