Data Science – What if the Needle is Made of Hay?

“Finding a needle in a haystack” is perhaps the most overused quote of the data science trade, with most material promising to sift/burn/search the haystack faster than before to find the vast stack of needles it hides underneath. In reality, however, there is not always a needle and even when there is, we may not know what it looks like.

Within the context of data science, the haystack is the mass of available data while the needle is the valuable insight or answer. Metaphor-extending shenanigans aside, this idiom is misleading: it implies that we know what the needle looks like, and also that the insight is distinctly different from the data; in fact, insights come from the data.

Too often, a failure to find the needle is attributed to the staff performing the analysis or the technology used. In a previous post about the scientific method, I argued that there was no such thing as failure: a hypothesis can be rejected or not, but both are valid answers in the eyes of science (if not in the eyes of the business). If the question asked is “what action can I take to reduce churn by 50%?” under the constraint that significantly reducing prices is not an option, the answer may well be “nothing”*. There is no needle, and yet the science hasn’t failed the business**.

In the physical sciences, there often is a needle. What’s more, scientists have a pretty good idea of what the needle looks like, as theory and repeated experiments give an ever more precise understanding of the world. In behavioural sciences, however, confounding factors are multiple, and customer behaviour is not (yet?) fully understood. As a result, we do not know what the needle looks like, or even if there is one. Asked to “find a demographic split that leads to a customer segment with 50% probability of buying item X”, we may find that such a segment does not exist, is too small to be meaningful, or is not actionable. Following the scientific method, however, leads to scientifically valid answers that should be used to acquire more/different data, design new experiments or refine the original question.
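As a rough illustration of how such a question can return a valid answer even when there is no needle, here is a minimal Python sketch. Everything in it is an assumption made for the example: the segment names and counts are invented, the minimum segment size of 100 is arbitrary, and the Wilson score interval is just one reasonable way to judge a proportion against the 50% target, not a method prescribed by this post.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    return centre - half, centre + half


def assess_segment(buyers, customers, target=0.5, min_size=100):
    """Judge the hypothesis 'this segment buys item X with probability >= target'.

    min_size is an arbitrary, hypothetical threshold below which the segment
    is treated as too small to be meaningful.
    """
    if customers < min_size:
        return "too small to be meaningful"
    low, high = wilson_interval(buyers, customers)
    if low >= target:
        return "supported"      # whole interval sits at or above the target
    if high < target:
        return "rejected"       # whole interval sits below the target
    return "inconclusive"       # interval straddles the target: get more data


# Hypothetical (buyers, customers) counts per demographic segment.
segments = {
    "18-25, urban": (30, 40),
    "26-40, urban": (140, 1000),
    "41-60, rural": (260, 500),
    "60+, rural": (330, 500),
}

for name, (buyers, customers) in segments.items():
    print(name, "->", assess_segment(buyers, customers))
```

Each of the four outcomes is a legitimate result: "rejected" and "too small to be meaningful" are answers in their own right, and "inconclusive" is a prompt to collect more or different data rather than a failure of the analysis.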

Does failing to find a pre-specified needle show a fundamental limit to what data science can achieve? In short: no. Discovery is an iterative process: successive predictions, testing, and design of experiments separate the wheat from the chaff. Hypothesis -> Testing -> New Hypothesis -> New Test -> etc… Sorting and filtering the stack of data strand by strand is ultimately what data science is about.

The needle is not always there, and is often not made of gold. In fact, in the beginning there is no needle, but successive tests, experiments, and regular engagement between scientists and business experts allow everyone to understand the haystack better, and, strand by strand, to construct a needle made of hay.

* In reality the answer may be that doing X and Z will reduce it by 35%. How acceptable that answer is will vary.

** When the business provides both the question and the answer from the outset, it may, however, fail the science.

Clément Fredembach is a data scientist with Teradata Australia and New Zealand Advance Analytics group. With a background in Colour Science, Computational Photography and Computer Vision, Clément has designed and built perceptual statistical experiments and models for the past 10 years.

The post Data Science – What if the Needle is Made of Hay? appeared first on International Blog.
