Here I will explain the classic word-counting example used to demonstrate how Hadoop works, in this case using the sixty-six books of the Protestant canon. Hadoop does something called “mapping,” which here means indexing each word. This is a big deal because mapping turns the unstructured text below into a structured index that Hadoop can reduce.
It is time for me to speak of the books of the New Testament.
Receive only four evangelists:
Matthew, then Mark, to whom, having added Luke
As third, count John as fourth in time,
But first in height of teachings,
For I call this one rightly a son of thunder,
Sounding out most greatly with the word of God.
Today we will map using white space as the delimiter between words, dropping punctuation altogether and ignoring capitalization. Since Hadoop indexes by key/value pairs, each unique word is a key and its value is the number of occurrences. In the example, “it” occurs once and “of” occurs five times.
Stop to think about that for a minute. After mapping, instead of blah, blah, blah, there is now a neat, organized set of unique words, each with a count of the number of times it occurs.
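To make the counting concrete, here is a minimal sketch in plain Python of the same idea applied to the passage above. It is not Hadoop's API; the function names `map_words` and `reduce_counts` are my own placeholders, and lowercasing every word is an assumption I made so that “It” and “it” land on the same key.

```python
import string
from collections import Counter

PASSAGE = """It is time for me to speak of the books of the New Testament.
Receive only four evangelists:
Matthew, then Mark, to whom, having added Luke
As third, count John as fourth in time,
But first in height of teachings,
For I call this one rightly a son of thunder,
Sounding out most greatly with the word of God."""


def map_words(text):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    strip_punct = str.maketrans("", "", string.punctuation)
    for token in text.split():
        # Drop punctuation and lowercase (an assumption, see lead-in).
        word = token.translate(strip_punct).lower()
        if word:
            yield (word, 1)


def reduce_counts(pairs):
    """Sum the values for each key to get one count per unique word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


word_counts = reduce_counts(map_words(PASSAGE))
print(word_counts["it"], word_counts["of"])  # prints: 1 5
```

The key is the word and the value is its count, which matches the numbers quoted for “it” and “of”.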
Hadoop takes it one step further. Since there are sixty-six books, Hadoop can allocate 66 processors to parse all the books at the same time. Then Hadoop “reduces,” or combines, the 66 indexes into 1 automagically.
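The split-then-combine step can be sketched on a single machine as well. The example below is only a local stand-in, assuming a thread pool in place of Hadoop's distributed workers and three lines of the passage above in place of sixty-six books; `count_words` and `merge` are hypothetical helper names, not Hadoop functions.

```python
import string
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Toy stand-ins for separate books: three lines from the passage above.
BOOKS = [
    "It is time for me to speak of the books of the New Testament.",
    "But first in height of teachings,",
    "Sounding out most greatly with the word of God.",
]


def count_words(text):
    """Build the word index for one book (the per-book step)."""
    strip_punct = str.maketrans("", "", string.punctuation)
    words = (t.translate(strip_punct).lower() for t in text.split())
    return Counter(w for w in words if w)


def merge(left, right):
    """Combine two partial indexes by adding their counts."""
    return left + right


# One worker per book, all indexing at the same time.
with ThreadPoolExecutor(max_workers=len(BOOKS)) as pool:
    partial_indexes = list(pool.map(count_words, BOOKS))

# Collapse the many partial indexes into a single index.
combined = reduce(merge, partial_indexes, Counter())
print(len(partial_indexes), combined["of"], combined["the"])  # prints: 3 4 3
```

Each worker only ever sees its own book, so the per-book work can run side by side; only the final merge needs to see every partial index.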
Now it is up to you to analyze this index of unique words, based on their number of occurrences, to gain business insight.
Okay, let me simplify this. Say you have 6 billion tweets and you need to know who is trending, Justin Bieber or Philip Hoffman. Hadoop does exactly the same thing, but you only need to check which count is larger to find who is trending.
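Here is a rough sketch of that comparison in plain Python. The three tweets are invented purely for illustration, `map_mentions` is a hypothetical helper, and counting a name once per tweet that contains it is my own simplifying assumption.

```python
from collections import Counter

CANDIDATES = ["Justin Bieber", "Philip Hoffman"]

# Hypothetical sample tweets, invented purely for illustration.
TWEETS = [
    "Justin Bieber just announced a new tour",
    "Rewatching an old Philip Hoffman film tonight",
    "Can't stop listening to Justin Bieber",
]


def map_mentions(tweet):
    """Emit a (name, 1) pair for each candidate named in the tweet."""
    lowered = tweet.lower()
    for name in CANDIDATES:
        if name.lower() in lowered:
            yield (name, 1)


# Reduce: add up the pairs so each name ends with one total.
mentions = Counter()
for tweet in TWEETS:
    for name, n in map_mentions(tweet):
        mentions[name] += n

# Whoever has the larger count is trending.
trending = max(CANDIDATES, key=lambda name: mentions[name])
print(trending)  # prints: Justin Bieber
```

With 6 billion tweets the logic stays the same; only the number of machines doing the mapping changes.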
Now that you have a thorough understanding of how Hadoop functions, how would you use it to gain business insight?