Thursday, August 22, 2013

Five Myths About Big Data

Samuel Arbesman, an applied mathematician and network scientist, is a senior scholar at the Ewing Marion Kauffman Foundation and the author of “The Half-Life of Facts.” Follow him on Twitter: @Arbesman.

Big data holds the promise of harnessing huge amounts of information to help us better understand the world. But when talking about big data, there’s a tendency to fall into hyperbole. That hyperbole is what compels contrarians to write tweets such as “Big Data, n.: the belief that any sufficiently large pile of s--- contains a pony.” Let’s deflate the hype.

1. “Big data” has a clear definition.

The term “big data” has been in circulation since at least the 1990s, when it is believed to have originated in Silicon Valley. IBM offers a seemingly simple definition: Big data is characterized by the four V’s of volume, variety, velocity and veracity. But the term is thrown around so often, in so many contexts — science, marketing, politics, sports — that its meaning has become vague and ambiguous.

2. Big data is new.

It’s true that today we can mine massive amounts of data — textual, social, scientific and otherwise — using complex algorithms and computer power. But big data has been around for a long time. It’s just that exhaustive datasets were more exhausting to compile and study in the days when “computer” meant a person who performed calculations.

Vast linguistic datasets, for example, go back nearly 800 years. Early biblical concordances — alphabetical indexes of words in the Bible, along with their context — allowed for some of the same types of analyses found in modern-day textual data-crunching.
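To make the comparison concrete, here is a minimal sketch of what a concordance is as a data structure: an alphabetical index from each word to the snippets of text in which it appears. The function name `build_concordance`, the `window` parameter and the sample sentence are all hypothetical, chosen only for illustration; this is not how any historical concordance was compiled.

```python
import re
from collections import defaultdict


def build_concordance(text, window=3):
    """Map each word to the contexts it appears in (illustrative sketch).

    `window` is the number of neighbouring words kept on each side;
    it is an assumption of this example, not a historical convention.
    """
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    index = defaultdict(list)
    for i, word in enumerate(words):
        # Keep up to `window` words before and after the current word.
        context = " ".join(words[max(0, i - window): i + window + 1])
        index[word].append(context)
    # Return the entries in alphabetical order, as a printed concordance would.
    return dict(sorted(index.items()))


if __name__ == "__main__":
    sample = "the light was good and the light shone"  # made-up sample text
    for word, contexts in build_concordance(sample, window=1).items():
        print(word, contexts)
```

Once such an index exists, questions like "where does this word occur, and next to what?" become lookups rather than rereadings, which is the kind of analysis that took medieval compilers years by hand.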

The sciences also have been using big data for some time. In the early 1600s, Johannes Kepler used Tycho Brahe’s detailed astronomical dataset to elucidate certain laws of planetary motion. Astronomy in the age of the Sloan Digital Sky Survey is certainly different and more awesome, but it’s still astronomy.

3. Big data is revolutionary.

When a phenomenon or an effect is large, we usually don’t need huge amounts of data to recognize it (and science has traditionally focused on these large effects). As things become more subtle, bigger data helps. It can lead us to smaller pieces of knowledge: how to tailor a product or how to treat a disease a little bit better. If those bits can help lots of people, the effect may be large. But revolutionary for an individual? Probably not.
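The relationship between effect size and the amount of data needed can be sketched with a standard textbook approximation for comparing the means of two groups: the required sample size grows with the inverse square of the effect size. The function name `samples_per_group`, the 5 percent significance level and the 80 percent power target are assumptions chosen for illustration, and the formula uses a normal approximation rather than an exact calculation.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided, two-group test.

    `effect_size` is the difference in means divided by the standard
    deviation. Uses the normal approximation n = 2 * (z_a + z_b)^2 / d^2.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


if __name__ == "__main__":
    for d in (0.8, 0.2, 0.01):
        print(f"effect size {d}: about {samples_per_group(d)} per group")
```

Under these assumptions, a large effect (0.8) needs about 25 observations per group, a small one (0.2) needs roughly 400, and a very subtle one (0.01) needs well over 100,000. That is the sense in which bigger data mostly buys access to smaller effects.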

4. Bigger data is better.

In science, some admittedly mind-blowing big-data analyses are being done. In business, companies are being told to “embrace big data before your competitors do.” But big data is not automatically better.

Really big datasets can be a mess. Unless researchers and analysts can reduce the number of variables and make the data more manageable, they get quantity without a whole lot of quality. Give me some quality medium data over bad big data any day.

5. Big data means the end of scientific theories.

Chris Anderson argued in a 2008 Wired essay that big data renders the scientific method obsolete: Throw enough data at an advanced machine-learning technique, and all the correlations and relationships will simply jump out. We’ll understand everything.

But you can’t just go fishing for correlations and hope they will explain the world. If you’re not careful, you’ll end up with spurious correlations. Even more important, to contend with the “why” of things, we still need ideas, hypotheses and theories. If you don’t have good questions, your results can be silly and meaningless.
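The fishing problem is easy to reproduce. The sketch below generates columns of pure random noise, which are unrelated by construction, and then counts how many pairs nonetheless look strongly correlated. The function names, the number of variables, the number of observations and the 0.5 threshold are all arbitrary choices made for illustration.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_strong_pairs(n_vars, n_obs, threshold, seed=0):
    """Count pairs of independent noise columns with |r| >= threshold."""
    rng = random.Random(seed)
    # Every column is independent Gaussian noise: no real relationships exist.
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    strong = sum(
        1 for a, b in combinations(columns, 2) if abs(pearson(a, b)) >= threshold
    )
    total = n_vars * (n_vars - 1) // 2  # number of distinct pairs tested
    return strong, total


if __name__ == "__main__":
    strong, total = count_strong_pairs(n_vars=200, n_obs=20, threshold=0.5)
    print(f"{strong} of {total} pairs look 'strongly' correlated by chance")
```

With 200 variables there are 19,900 pairs to test, so even a small chance of a fluke per pair typically yields some impressive-looking correlations that mean nothing. A hypothesis stated in advance is what separates a finding from one of these accidents.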