Enterprise Adoption of Hadoop

The buzz in Hadoop these days is around enterprise adoption. Gartner sparked* this discussion with a report last month stating that for some enterprises, “…investment remains tentative in the face of sizable challenges around business value and skills.” (See report publication here).

Are we looking at a half-full glass of Hadoop, or is adoption simply expanding into the early majority? That was the question Hortonworks addressed in their keynote at the Hadoop Summit last week, looking at this data from a different perspective. A good summary of that position is in this Hortonworks blog.

**But this is the land of Big Data. Is there a way we can deterministically count and measure Hadoop adoption? How big is it? What’s the total size of this market?**

How about a statistical theory of estimation?

Wait, we have such a device! For the problem of estimating the maximum of a discrete uniform distribution, we have a body of statistical inference and modeling known colloquially as… The German Tank Problem. The German Tank Problem was essentially Big Data analysis before we called it Big Data, but an analysis nonetheless that led to triumphant success in World War Two. As Wikipedia succinctly puts it: *“Estimating the population maximum based on a single sample yields divergent results, while the estimation based on multiple samples is an instructive practical estimation question whose answer is simple but not obvious.”* (Take that, Donald Rumsfeld!) Source: https://en.wikipedia.org/wiki/German_tank_problem

THE GERMAN TANK PROBLEM

Imagine for a moment you are in intelligence for the central command of the U.S. armed forces in World War II. By 1943, it has become obvious that a ground invasion of Europe, through France, is going to be your only option to destroy Nazi Germany. But what you need to determine is the number of German tanks that you might face in a battle in France. In fact, what you really want to determine is the number of German Tiger II (*“Königstiger”*, aka Bengal Tiger) tanks.

You don’t mind fighting the Panzer tanks, or at least, you have a good shot at defeating the Panzer IV tank divisions in battle. Your worst nightmare is the Tiger II. Armored with an unheard-of 120mm of steel plating, this tank would shrug off shells from the U.S. Sherman tanks, which would literally bounce off or fail to penetrate. Worse, at a length of over 20 feet, the gun barrel of this tank could outshoot any U.S. tank by a factor of 1.5, or about 1/2 mile. In other words, a German Tiger II could shoot you dead 10-15 mins before you could even get close enough, running at full speed, to shoot at it with shells that 99% of the time are going to bounce off. A monster indeed.

So how many tanks and tiger tanks are there?

What to do, what to do? You are in intelligence! You do what intel officers do. You count tanks rolling out of the factory, traveling on flatbed train cars, sitting in the back of battalions, based on aerial photography, and sunken ships, and feet on the street. You listen to your spies. You bribe manufacturers and construction workers and steal manifests. You watch rubber production and shipments. You build, tally, and forecast. And you get very scared. Because based on these conventional methods, U.S. Army intel officers estimated that the Germans were producing around 1,400 tanks a month between June 1940 and September 1942. Yikes.

But is there a better way? In a quest for better data analytics, the U.S. Army also turned to a group of statisticians who noticed something very different and *informative* about these German tanks. While it was very hard to shoot and destroy these tanks (see above), they did sometimes break down (Tiger IIs were notoriously difficult to maintain, suffering from complex mechanical problems due to their incredible weight). And they also often ran out of gas, especially in the North African theatre, where German supply lines were very long and stretched. As such, the U.S. and Allies did, from time to time, get their hands on complete German tanks. And here’s what they noticed.

Germans are known for their order and organization. And when they produced these tanks, they liked to use, identify and label part numbers for everything. In fact, the Germans numbered them. **Numbered them sequentially.** Yes, that means that important parts on these tanks, such as their wheels, gearboxes, chassis, and engines, were all numbered (i.e., 1, 2, 3, …, N).

That doesn’t seem so helpful. But here’s a quick thought experiment. I am going to tell you the numbers on seven balls, all drawn at random from a big bag of numbered balls. But I’m not going to tell you how many balls are in this bag.

Here I go, I’ve pulled out: 22, 44, 6, 89, 11, 54, 18.

No, this isn’t bingo, but quickly, looking at those numbers again, care to guess at how many balls in total are in the bag?

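600, 9000? Feels like about 100-150 maybe. Probably not 600, right? Because you’d expect a ball in the 300, 400, or 500 range in the set. What you are intuiting is a result from frequentist estimation: the largest number you have seen, nudged upward by roughly the average gap between the numbers drawn, is a good guess at the total. (Strictly speaking, the maximum likelihood estimate is just the largest number drawn; the standard frequentist estimate adds that upward correction.) Formally, with M the largest number drawn and S the number of draws, the estimate of the total N is:

$$\hat{N} = M + \frac{M}{S} - 1$$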

But let’s bring this down to more intuitive math. In my example above, there were 7 balls drawn, so the sample size in this experiment is 7 (S=7). The highest-numbered ball drawn was 89; we will call this the maximum (M=89). A good-enough estimator of the total number of balls in the bag would then be **(M-1)(S+1)/S**. Plugging the numbers in, we get **(89-1)(7+1)/7 = 100.57**. Statistically speaking, a bag that yields the 7 balls {22, 44, 6, 89, 11, 54, 18} has somewhere around 100 balls in it.
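For readers who would rather see this as code, here is a minimal Python sketch of the bag-of-balls calculation. The function names are my own, chosen for illustration. It computes the good-enough estimator from the paragraph above, and, for comparison, the standard frequentist estimate M + M/S - 1 from the statistics literature. It also includes a small simulation, under the assumption that balls are drawn without replacement from a bag numbered 1 to N, to check that the estimate lands near the true total on average.

```python
import random


def good_enough_estimate(sample):
    """The article's estimator: (M - 1) * (S + 1) / S."""
    m = max(sample)   # M: highest number seen
    s = len(sample)   # S: sample size
    return (m - 1) * (s + 1) / s


def textbook_estimate(sample):
    """Standard frequentist estimate: M + M/S - 1."""
    m = max(sample)
    s = len(sample)
    return m + m / s - 1


def average_estimate(estimator, true_total, sample_size, trials=5000, seed=1):
    """Average an estimator over many simulated draws (without replacement)
    from a bag numbered 1..true_total. Illustrative helper only."""
    rng = random.Random(seed)
    bag = range(1, true_total + 1)
    total = 0.0
    for _ in range(trials):
        total += estimator(rng.sample(bag, sample_size))
    return total / trials


balls = [22, 44, 6, 89, 11, 54, 18]
print(round(good_enough_estimate(balls), 2))  # 100.57
print(round(textbook_estimate(balls), 2))     # 100.71

# Sanity check: with a true total of 100 and 7 draws per trial,
# the average estimate should sit close to 100.
print(average_estimate(textbook_estimate, true_total=100, sample_size=7))
```

The two formulas differ by only a fraction of a ball here, which is why the simpler one is good enough for a thought experiment.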

The Actual German Tank Results

Using these statistical tools, and with the sets of serial numbers collected from abandoned/destroyed tanks across theatres of war, the U.S. Army was able to greatly improve tank production predictions. In fact, due to some variation in manufacturers, casts, and parts, they had quite a few maxima to work with. As discussed above, with conventional methods, they believed the Germans were producing in the range of 1,400 tanks per month. (Source: *How a statistical formula won the war*). With these new statistical data points, the predicted German tank production was calculated to be 246 tanks per month.

Emboldened with this new data, the Allies felt more confident about their landing and invasion capabilities and proceeded to victory in Berlin 18 months later.

How did those statisticians do? One of the other properties of WWII Germans is that in addition to being organized and tidy, they had great documentation! An analysis after the war, using captured German production records from Albert Speer’s ministry of industry, showed the Germans were producing… get ready for this… 245 tanks per month. The statisticians were off by one tank a month, an accuracy of about 99.6%!
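To put the two estimates side by side, here is a small Python sketch of the arithmetic, using only the three monthly figures quoted in this post (1,400 from conventional intelligence, 246 from the serial-number statistics, 245 from the captured records). The helper name `relative_error` is mine.

```python
def relative_error(estimate, actual):
    """Fraction by which an estimate misses the actual value."""
    return abs(estimate - actual) / actual


ACTUAL_PER_MONTH = 245  # from captured German production records

estimates = {
    "conventional intelligence": 1400,
    "serial-number statistics": 246,
}

for method, estimate in estimates.items():
    error = relative_error(estimate, ACTUAL_PER_MONTH)
    print(f"{method}: {estimate} vs {ACTUAL_PER_MONTH}, off by {error:.1%}")
# conventional intelligence: 1400 vs 245, off by 471.4%
# serial-number statistics: 246 vs 245, off by 0.4%
```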

Bringing this back to Hadoop Deployments …

Ok, so count serial numbers in a sequential series and you can see the big picture in tank production. How does this help determine Hadoop enterprise uptake? I’ll conclude that someone smarter than me can ideally comment and update with a data collection model for Hadoop adoption and run a similar discrete uniform distribution analysis.

More informally, all last year, I was plagued by the German Tank idea, because I kept seeing the same limited set of Hadoop customers and customer examples used repeatedly in press releases, tradeshow lectures, distribution logos and case studies: names like Netflix, Orbitz, Yahoo, and Twitter showing up over and over again. My thinking was, if the entire set of possible enterprise Hadoop customers globally should be something like 2,000 enterprises, why the same small sample of 5?

This year feels different. The applications, names, and variety of customers are exploding. I suppose with ~50% of all enterprises, that isn’t a surprise, as our new expected total pool should be 1,000. I’ll just have to go watch some logos for a while at the next show to see if this data can fit the model.

(Photo Source: Panzerfabrik in Deutschland.jpg| Bundesarchiv, Bild 101I-635-3966-27 / Hebenstreit / CC-BY-SA [CC BY-SA 3.0 de (http://creativecommons.org/licenses/by-sa/3.0/de/deed.en)], via Wikimedia Commons)

* From time to time, I find myself using the expression “the elephant in the room.” In the context of Hadoop, this is something of a double entendre that is probably getting fatiguing. Does “sparking” run the same risk soon?