Big Data is Not So Scary

Editor’s note: This post originally appeared here.

I’ve been reading Nate Silver’s The Signal and the Noise. It’s not the sort of book I normally would read, but since Nate kept me from jumping off a tall building during the last election I felt I owed him the $27.95. Given Nate’s record predicting election outcomes, you might think this is a book that reveals the hidden secrets of the black art of predicting things. But it’s not. It’s about how hard it is to make accurate predictions even when we have mountains of data to work with. And it’s causing me to look differently at the issue of Big Data and predictive analytics.

Nate spends a lot of pages on some of those things for which we have lots of data but still aren’t good at predicting: the weather, earthquakes, economic growth, etc. Consider economic growth. We all have a sense of just how much economic data there is and how long the time series are. (Nate estimates around 4 million variables.) But forecasts of growth are all over the map and even “consensus forecasts” routinely are just plain wrong.

Nate argues that predictions fail because we fall victim to two common errors. The first is to overfit the prediction model into something that looks very sophisticated and plausible but either ignores important variables or simply fails to understand the underlying structure of the data. Machine learning is especially susceptible to overfitting. The second is the classic error of interpreting correlation as causation. A good example is the Super Bowl indicator, which says that the direction of the stock market can be predicted based on who wins the Super Bowl.
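For readers who like to see the first error in action, here is a minimal sketch in plain Python. Everything in it is my own illustration rather than anything from the book: the six data points are made up (roughly y = 2x plus a little noise), and the function names are hypothetical. It fits a simple straight line and a "sophisticated" polynomial that passes through every point, then asks both to predict one point they never saw.

```python
# Made-up illustrative data: roughly y = 2x plus small noise (not from the book).
xs = [0, 1, 2, 3, 4, 5]
ys = [0.3, 1.8, 4.4, 5.7, 8.5, 9.6]
x_new, y_new = 6, 12.1  # held-out point that neither model sees while fitting


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Polynomial that passes through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mean_abs_error(model, xs, ys):
    """Average absolute gap between model predictions and observed values."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


line = fit_line(xs, ys)
wiggly = fit_interpolant(xs, ys)

line_train_err = mean_abs_error(line, xs, ys)
wiggly_train_err = mean_abs_error(wiggly, xs, ys)
line_test_err = abs(line(x_new) - y_new)
wiggly_test_err = abs(wiggly(x_new) - y_new)

print(f"line:       train error {line_train_err:.2f}, held-out error {line_test_err:.2f}")
print(f"polynomial: train error {wiggly_train_err:.2f}, held-out error {wiggly_test_err:.2f}")
```

On this toy data the polynomial fits the six known points perfectly, yet it misses the held-out point by about 23.5, while the humble straight line misses by about 0.26. The model that looks best on the data in hand has memorized the noise rather than the underlying structure.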

Ultimately, we need to be able to make good decisions about which data are important. And we need to be able to look at what a model is saying, why it’s saying it, and judge whether it makes sense. Finally, we need to understand the uncertainty in the prediction and communicate it. That sounds a lot like market research (MR), except for that last part about uncertainty.

Right now the possibility of a future world of petabytes, MPP architectures, neural networks and naïve Bayes is scaring the pants off a lot of people in the MR industry. It may well be very bad news for MR companies but maybe not so bad for the MR profession.  There always will be demand for people who understand data, consumers and the competitive challenges that client companies face in the marketplace.

Or, as Nate writes, “Data-driven predictions can succeed—and they can fail. It is when we deny our role in the process that the odds of failure rise.”
