Big Data Revolution, Part 1
There are a few media sources I follow zealously for insights on business, data and analytics. My usual suspects include the New York Times, the Wall Street Journal, Forbes, Wired, the Economist, the Harvard Business Review and the MIT Sloan Management Review. A few weeks back, a positive WSJ review of a new book on big data got me scurrying to make a next-day order on Amazon. I’m glad I hurried.
Big Data: A Revolution That Will Transform How We Live, Work and Think, co-authored by Oxford professor Viktor Mayer-Schönberger and Economist editor Kenneth Cukier, is destined to be one of the top business/analytics books of 2013. Big Data is an easy four-hour read, yet heavy with substance. Many of its arguments have been made before, though Big Data is more persuasive than other big data polemics I’ve read.
The point of departure for Big Data, not surprisingly, is the current obsession with datafication, a concept that refers to “taking information about all things under the sun…and transforming it into a data format to make it quantified. This allows us to use the information in new ways, such as predictive analysis.”
Though datafication’s been fundamentally enabled by advances in computing technology, the book’s most entertaining illustration is the work of 19th-century naval officer Matthew Maury, who painstakingly sifted through nautical books, maps and charts and inventoried barometers, compasses, sextants and chronometers to assemble new navigational charts that revolutionized trans-Atlantic travel. The “Pathfinder of the Seas” established and analyzed a huge body of data that ultimately cut the times of long voyages by a third. Today, Maury would be a celebrated data scientist.
The authors articulate three fundamental shifts surrounding big data’s ascendancy that sound like heresy to those steeped in the statistical tradition of the scientific method. Shift one is the ability to collect and analyze incredibly large data stores, the Holy Grail being able to work with N=all. The authors argue that statistical sampling is an artifact of an age when technology limited the amount of data that could be analyzed. Indeed, even though random sampling’s at the heart of modern measurement, according to the authors, “it is only a shortcut, a second-best alternative to collecting and analyzing the full data set.” The ability to drill into sampled data to examine rare occurrences is limited, as is the ability to ask questions unanticipated at the outset.
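The point about rare occurrences is easy to see with a little arithmetic. Here is a minimal sketch in Python, using numbers I made up purely for illustration (a population of one million records, 50 of them "rare", and a random sample of 1,000); none of these figures come from the book.

```python
import random


def rare_hits(population_size=1_000_000, rare_count=50,
              sample_size=1_000, seed=1):
    """Count how many rare records a simple random sample captures.

    Hypothetical setup: record ids 0..rare_count-1 are the rare ones.
    """
    rng = random.Random(seed)
    # Draw a simple random sample of record ids without replacement.
    sample = rng.sample(range(population_size), sample_size)
    in_sample = sum(1 for record_id in sample if record_id < rare_count)
    # Expected number of rare records in the sample.
    expected = sample_size * rare_count / population_size
    return rare_count, in_sample, expected


in_full, in_sample, expected = rare_hits()
print(f"rare records in full data: {in_full}")
print(f"rare records in the sample: {in_sample} (expected {expected})")
```

With these assumed numbers the sample is expected to contain about 0.05 rare records, so most samples contain none at all, while N=all keeps every one of the 50 available for drilling into.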
Increasing data size is often associated with more measurement error, so shift two has to do with a tolerance for “messy” data. As with sampling, the obsession with measurement error might be “an artifact of the information-deprived analog era. When data was sparse, every data point was critical.” The sheer volume of data may make it worthwhile to sacrifice exactitude, just as size often trumps better algorithms. Better large and approximate than small and exact.
Shift three transitions from the experimental method’s hunger for cause and effect to big data’s tolerance for much less rigorous correlation. Now the “what” often pre-empts the “why”: “Knowing why may be pleasant, but it’s unimportant for stimulating sales. Knowing what, however, drives clicks.” For fast-moving big data companies, hypothesis-driven business theories are too slow and flawed; relying on experts to ferret out theories is inefficient. And correlations are often enough: just ask Amazon and Netflix, whose recommendation engines have moved them to industry-leading positions by exploiting “valuable correlations without knowing the underlying causes.”
Despite the evolution from a hypothesis-driven to a data-driven world, Big Data rejects Wired editor Chris Anderson’s contention that traditional scientific theory is dead, “replaced by statistical analysis of pure correlations that is devoid of theory.” The authors argue that big data is itself founded on theory: “In fact big data may offer a fresh look and new insights precisely because it is unencumbered by the conventional thinking and inherent biases implicit in the theories of a specific field.” Meta-theory, if you will.
One issue barely addressed in Big Data is the risk of mistaking spurious relationships in wide data sets for real ones. Holding such analytics to the higher standard of experimental evidence offers a safety net against such false positives. In today’s big data world, though, the answer is more likely to be data sets divided into train/tune/test partitions, cross-validation and “shrinkage” methods rather than experiments.
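To make the false-positive risk concrete, here is a rough sketch in Python of how a held-out test set exposes a spurious correlation. Everything in it is my own assumption for illustration: the data are pure random noise, and the sizes (30 training rows, 1,000 test rows, 500 candidate features) are arbitrary. It is not a method from the book.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spurious_demo(n_train=30, n_test=1000, n_features=500, seed=0):
    """Pick the 'best' noise feature on training rows, then re-check it."""
    rng = random.Random(seed)
    n = n_train + n_test
    # The target and every feature are independent noise: no real signal.
    target = [rng.gauss(0, 1) for _ in range(n)]
    features = [[rng.gauss(0, 1) for _ in range(n)]
                for _ in range(n_features)]
    # Search the wide data set for the strongest training correlation.
    best = max(features,
               key=lambda f: abs(pearson(f[:n_train], target[:n_train])))
    train_r = pearson(best[:n_train], target[:n_train])
    # Re-measure the same feature on rows the search never saw.
    test_r = pearson(best[n_train:], target[n_train:])
    return train_r, test_r


train_r, test_r = spurious_demo()
print(f"training correlation: {train_r:+.2f}")
print(f"held-out correlation: {test_r:+.2f}")
```

Searching 500 noise features over only 30 rows should turn up one that looks impressively correlated with the target, and that apparent relationship should largely vanish on the held-out rows. Cross-validation repeats the same check across several splits rather than relying on a single one.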
I must admit my traditional statistical grounding has taken a hit with Big Data. The notion that core scientific-method concerns such as sampling, measurement error and the experimental method’s cause and effect may well lose importance as central components of the analytics tool chest hasn’t quite registered with me yet, and maybe never will. Perhaps it’s best to maintain the distinction between traditional, top-down, hypothesis-led scientific inquiries and the newer, bottom-up, “searching”, data-driven ones. I at least like to think I’m open-minded.
Next time I’ll report on what Big Data has to say about evolving business models that revolve around data, as well as the implications and risks of a big data world.
This blog originated at Information-Management, a sister publication of Health Data Management.