THE EXAGGERATED PROMISE OF SO-CALLED UNBIASED DATA MINING
Nobel laureate Richard Feynman once asked his Caltech students to calculate the probability that, if he walked outside the classroom, the first car in the parking lot would have a specific license plate, say 6ZNA74. Assuming each number and letter is equally likely and determined independently, the students estimated the probability to be less than 1 in 17 million. When the students finished their calculations, Feynman revealed that the correct probability was 1: He had seen this license plate on his way into class. Something extremely unlikely is not unlikely at all if it has already happened.
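The students' estimate takes only a few lines of arithmetic. A minimal sketch, assuming the plate format is digit-letter-letter-letter-digit-digit (as in 6ZNA74) with each character uniform and independent:

```python
# Probability that a random plate matches one specific plate such as 6ZNA74:
# three digit positions (10 choices each) and three letter positions (26 each).
p = (1 / 10) ** 3 * (1 / 26) ** 3

# 10^3 * 26^3 = 17,576,000 possible plates, so the chance of any
# particular one is less than 1 in 17 million.
print(f"1 in {1 / p:,.0f}")  # prints "1 in 17,576,000"
```

The point of the anecdote is that this tiny probability only applies to a plate specified *before* looking, which is exactly the condition data mining violates.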
The Feynman trap—ransacking data for patterns without any preconceived idea of what one is looking for—is the Achilles heel of studies based on data mining. Finding something unusual or surprising after it has already occurred is neither unusual nor surprising. Patterns are sure to be found, and are likely to be misleading, absurd, or worse.
In his best-selling 2001 book Good to Great, Jim Collins compared 11 companies that had outperformed the overall stock market over the previous 40 years to 11 companies that hadn’t. He identified five distinguishing traits that the successful companies had in common. “We did not begin this project with a theory to test or prove,” Collins boasted. “We sought to build a theory from the ground up, derived directly from the evidence.”
He stepped into the Feynman trap. When we look back in time at any group of companies, the best or the worst, we can always find some common characteristics, so finding them proves nothing at all. Following the publication of Good to Great, the performance of Collins’ magnificent 11 stocks has been distinctly mediocre: Five stocks have done better than the overall stock market, while six have done worse.
In 2011, Google created an artificial intelligence program called Google Flu that used search queries to predict flu outbreaks. Google’s data-mining program looked at 50 million search queries and identified the 45 that were the most closely correlated with the incidence of flu. It’s yet another example of the data-mining trap: A valid study would specify the keywords in advance. After issuing its report, Google Flu overestimated the number of flu cases for 100 of the next 108 weeks, by an average of nearly 100 percent. Google Flu no longer makes flu predictions.
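The pitfall Google Flu stepped into can be reproduced in miniature: search enough random series for the one best correlated with a target, and you will reliably find an impressively high correlation in pure noise. A sketch using made-up data (the variable names are illustrative, not Google's):

```python
import numpy as np

rng = np.random.default_rng(0)

# A "flu incidence" series and 10,000 candidate "search query" series,
# all pure noise with no real relationship to one another.
weeks = 20
flu = rng.standard_normal(weeks)
queries = rng.standard_normal((10_000, weeks))

# Data mining: keep whichever candidate correlates best in-sample.
corrs = np.array([np.corrcoef(flu, q)[0, 1] for q in queries])
best = np.abs(corrs).max()
print(f"best in-sample |correlation|: {best:.2f}")  # high, despite pure noise
```

The winning "query" looks predictive in-sample but has no out-of-sample value, which is why a valid study specifies its keywords in advance rather than selecting them after the fact.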
This article originally appeared on Wired.com.
Nastel Technologies uses machine learning to detect anomalies, behavior, and sentiment; accelerate decisions; satisfy customers; and innovate continuously. To answer business-centric questions and provide actionable guidance for decision-makers, Nastel’s AutoPilot® for Analytics fuses:
- Advanced predictive anomaly detection, Bayesian classification, and other machine learning algorithms
- Raw information handling and analytics speed
- End-to-end business transaction tracking that spans technologies, tiers, and organizations
- Intuitive, easy-to-use data visualizations and dashboards