Deriving Signals from Biased Data

Eric Sandosham, Ph.D.
5 min read · Apr 7, 2024


It’s not all garbage.

Photo by Anne Nygård on Unsplash

Background

Daniel Kahneman passed away earlier this week at the age of 90. His work on cognitive bias and behavioural economics, particularly Prospect Theory (which helps explain why investors take profits too soon while holding on to losses), had an influence on my work as a data analyst / data scientist. I also remember a story from one of his books about prolonging a colonoscopy so that it ends on a less painful note — the peak-end effect causes the patient to remember the entire procedure (which is really not pleasant) as much better, even though it was prolonged (which is somewhat questionable from a medical-ethics perspective). The key take-away for me was the exploitation of known behavioural bias for economic advantage.

Now, as a data analyst / data scientist, I come across many processes that give rise to significantly biased data. But that doesn’t mean the data is unusable. On the contrary, it may actually yield valuable information signals if we know how to interpret it. And so I dedicate my 33rd weekly article to the memory of Daniel Kahneman by unpacking my thoughts on the use of significantly biased data.

(I write a weekly series of articles where I call out bad thinking and bad practices in data analytics / data science, which you can find here.)

Garbage In, Garbage Out?

Most of us would be familiar with the term “garbage in, garbage out”, which describes how we cannot get good-quality data outputs from poor-quality data inputs. It would seem logical to extend this argument to significantly biased data and deem it unusable.

Let’s first unpack the definition of bias in data. In layman’s terms, it simply means that the data does not accurately represent the phenomenon of interest that it is measuring. This is not to be confused with skewness, which refers to the asymmetry of a distribution rather than to any misrepresentation. Skewed data can be unbiased / representative — e.g. personal income data, which is highly skewed yet faithfully reflects what people actually earn.
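To make the distinction concrete, here is a minimal Python sketch with made-up numbers, contrasting a skewed-but-representative distribution with a sample that is biased by a hypothetical collection flaw:

```python
import numpy as np

rng = np.random.default_rng(42)

# Skewed but unbiased: log-normal 'incomes' drawn from the whole population.
# The distribution is asymmetric, yet it faithfully represents the phenomenon.
incomes = rng.lognormal(mean=10, sigma=1, size=10_000)

# Biased: the same phenomenon measured through a channel that only
# reaches higher earners (an illustrative collection flaw).
biased_sample = incomes[incomes > np.median(incomes)]

print(f"True mean income:   {incomes.mean():,.0f}")
print(f"Biased sample mean: {biased_sample.mean():,.0f}")  # systematically too high
```

The skewed population is not the problem; the selective collection channel is.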

In the financial services industry, a classic case of significantly biased data is customer risk profiling in wealth management — customers indicate whether they are risk-conservative or more risk-taking when it comes to their investment appetite. What we observe is that most customers will say they are NOT risk-conservative, to give themselves wriggle room to purchase the odd risky investment if the opportunity or need arises. There is no upside for customers to be truthful, only downside, and this asymmetry of benefits results in the data becoming biased.

I came across another recent example of biased data from a client, this time around customer relationship management (CRM) data. The CRM platform was enhanced with built-in nudges aimed at increasing productivity, and sales employees were encouraged to rate these nudges. The client found that the nudges were well rated, but productivity did not increase. They soon realised that because negative ratings required the employees to provide an explanation, the employees chose the ‘easy way out’ and rated everything positively; the extra effort ran counter to the whole point of improving productivity. Once again, an asymmetry of benefits resulted in biased data.

In both these examples, it would seem that the data has lost its information signal; we can no longer ascribe meaning to its values. But perhaps it can still be ‘tweaked’ to yield some kind of information signal. Consider the investment risk profiling data: if we do come across customers who score themselves as risk-conservative, are they in fact saying that they are ultra risk-conservative? Perhaps they are also signalling that they do not want a broader engagement with the bank. Similarly, for the CRM nudge data: if one encounters a bad rating, what does it mean? And if an employee gives a good mix of both good and bad ratings, what information signal does that give us about the employee?

Information Signal in Biased Data

I would argue that ALL biased data can be useful. By definition, biased data is NOT random data, which means its information signal is distorted rather than destroyed. Interpreting biased data therefore requires expansive knowledge of context to either incorporate or unwind the distortion.

As a start, a good way to anticipate whether data is possibly biased is to ask if the context of its collection or derivation was subjected to significant asymmetric ‘pressure’. Most data sourced through human inputs is significantly biased. This is unavoidable: humans are extremely sensitive to the incentives and effects present in our interaction environments, and this sensitivity is further compounded by our biased cognitive processes.

The next question to ask is whether that asymmetric pressure is macro or individual in nature. Consider the example of (training) course evaluation ratings in Asia. Because Asians tend to prioritise harmony, we prefer to ‘hold our tongues’ when giving critical feedback. As a result, evaluation ratings tend to drift towards higher (better) scores. For example, in Singapore, where I grew up and where I teach, any course with an evaluation score below ‘4’ is considered bad (on a scale of ‘1’ to ‘5’, with ‘5’ being the best). This kind of bias is what I would term macro pressure, and the way to derive an information signal from such biased data is to re-calibrate, as sketched below.
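What might such a re-calibration look like? A minimal sketch in Python, where the ‘floor’ and ‘ceiling’ of the range raters actually use are assumptions one would estimate from the historical rating distribution (here taken from the “below 4 is bad” observation above):

```python
import numpy as np

def recalibrate(ratings, observed_floor=3.0, observed_ceiling=5.0):
    """Linearly re-map ratings from the range raters actually use back
    onto the full 1-5 scale. The floor/ceiling defaults are illustrative
    assumptions, not fixed truths."""
    ratings = np.asarray(ratings, dtype=float)
    rescaled = 1 + 4 * (ratings - observed_floor) / (observed_ceiling - observed_floor)
    return np.clip(rescaled, 1, 5)

# Courses that all look 'good' on the raw scale spread out after re-calibration.
print(recalibrate([4.0, 4.5, 5.0, 3.8]))  # -> [3.  4.  5.  2.6]
```

The point is not the specific linear formula but that a macro, population-wide pressure shifts everyone roughly the same way, so a single re-mapping can recover relative signal.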

What about asymmetric pressure that is individual in nature? The CRM nudge evaluation data is a good example. In such cases, one cannot simply re-calibrate the data by raising or lowering a ‘cut-off’ threshold. In this particular case, I would exclude from the analysis all ratings from employees whose aggregate share of good ratings exceeds 80%; it is highly unlikely that the nudge algorithm can achieve such fantastic yields. Cutting out the distorted parts of the data can make the remainder usable; the sketch below illustrates the idea.
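Here is that exclusion rule as a Python sketch; the column names, sample data, and the 80% threshold are illustrative, taken from the example above:

```python
import pandas as pd

# Hypothetical CRM nudge ratings: one row per (employee, nudge) pair,
# with rating=True meaning a positive ('good') rating.
df = pd.DataFrame({
    "employee_id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "rating":      [True, True, True, True, True,
                    True, False, True, False, True],
})

# Share of positive ratings per employee.
positive_share = df.groupby("employee_id")["rating"].mean()

# Exclude employees whose aggregate positive share exceeds 80% -- under
# individual asymmetric pressure, those ratings carry no usable signal.
credible = positive_share[positive_share <= 0.80].index
usable = df[df["employee_id"].isin(credible)]
print(usable)  # only employee 2's mixed ratings survive
```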

A Word on Data Instrumentation

One area not often covered in discussions and literature is the choice of data instrumentation to counter or reduce bias in the data-collection process. If we are collecting inputs from human ‘sensors’, then binary-choice inputs are generally inferior to sliding-scale inputs, because binary inputs do not allow for re-calibration if bias is detected. Consider the CRM nudge evaluation data. If we data-instrumented a ‘thumbs up / thumbs down’ rating, with negative ratings requiring additional explanatory text, we would arrive at the biased outcome of overwhelmingly positive ratings. If we instead data-instrumented a sliding-scale input (e.g. a 1–5 rating scale) and required explanatory text only for ratings of ‘1’ and ‘2’, we might find an overwhelming pile-up at ‘3’ (i.e. neutral), giving us a hint that something isn’t quite right.
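As a sketch of how that hint could be surfaced, the function below flags a suspicious concentration at the neutral point of a 1–5 scale; the 50% threshold is an illustrative assumption, not an established rule:

```python
from collections import Counter

def neutral_pileup(ratings, neutral=3, threshold=0.5):
    """Return the share of ratings parked at the neutral point and
    whether it exceeds a (hypothetical) suspicion threshold."""
    counts = Counter(ratings)
    share = counts[neutral] / len(ratings)
    return share, share > threshold

# Hypothetical ratings where explanatory text is required only for 1s and 2s:
# raters who dislike a nudge but won't type an explanation park at '3'.
ratings = [3, 3, 4, 3, 3, 5, 3, 3, 2, 3]
share, flagged = neutral_pileup(ratings)
print(f"Share at neutral: {share:.0%}, suspicious: {flagged}")
```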

Conclusion

All data is biased. Recognising whether a data set is possibly significantly biased because of its collection design is an important competency. It can make the difference between discarding a data set as unusable and leveraging it for insightful information signals. It’s not necessarily “garbage in, garbage out”.


Written by Eric Sandosham, Ph.D.

Founder & Partner of Red & White Consulting Partners LLP. A passionate and seasoned veteran of business analytics. Former CAO of Citibank APAC.