Information is NOT in the data
“Information is NOT in the data!” (Prof. Judea Pearl, ‘The Book of Why’). This phrase was initially articulated to highlight that data in itself does not contain information on causality or the reasons for correlation. The phrase resonated with me and became a foundational part of my PhD dissertation: the realisation that we, as observers, imbue data with meaning based on our subjective experiences and interpretations, and that information is ultimately probabilistic. This has major implications for how we approach data analytics and data science, a practice where many have lost sight of the forest for the trees. Therefore, I dedicate my 9th ‘business anniversary’ article to unpacking this idea.
Most people believe that a piece of data has a singular interpretation and that its derived meaning is immutable. This is a completely erroneous understanding of the nature of data. Meaning in data (i.e. information) is neither singular nor unchangeable. Let’s start with the latter. Data must always be interpreted against the background context of its collection period. As the background or environmental context changes, that same data may shift in its meaning or may even cease to represent the initial target phenomenon. Consider the following example: Many corporate folks bemoan that universities are no longer preparing their students for success in the corporate world; that the skills and knowledge learnt are no longer relevant. But what does an undergraduate degree ‘mean’? What information does it represent? Universities have NEVER taught corporate-relevant skills. But two generations ago, an undergraduate degree meant that the person attaining it was disciplined and conscientious, because people generally dislike studying and access to higher education was limited. This signal of discipline and conscientiousness matched the attributes needed to succeed in corporate life at the time. Fast forward to today, where the same undergraduate degree no longer strongly represents discipline and conscientiousness because of increased access to higher education and ‘smart’ ways of getting by with minimal studying. Couple this with the fact that succeeding in the modern knowledge economy requires more than hard work and discipline. The undergraduate degree as a data point has therefore changed in its meaning and representation.
What about the former point, that data can have multiple interpretations and meanings? Through ‘transformation’, the underlying data can take on new interpretations. Consider the following example: We have the variable ‘AGE’ in our HR system. In and of itself, we think about age as young versus old, but that raw data can be transformed into ‘Gen X/Y/Z’, into ‘insurance premium age group’, into ‘time-to-retirement’. Each transformation unlocks a different set of interpretations and meanings, and triggers new insights.
So more information can be derived from data through better exposure to context (i.e. being ‘in the field’ and acquiring domain expertise) or through the act of data transformation. This is not the classical kind of transformation that data scientists are familiar with, which is about making the data usable for modelling. Neither is it feature engineering, which is about data ‘manipulation’ to increase model accuracy. The objective of our kind of data transformation is to enable more variations of meaning.
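To make the ‘AGE’ example concrete, here is a minimal Python sketch of how one raw value can be transformed into three different interpretations. Every cut-off in it (the reference year, the generation birth-year boundaries, the premium bands and the retirement age) is an illustrative assumption of mine, as are all the function names; real boundaries depend on your insurer, jurisdiction and the generational definition you adopt.

```python
# Illustrative sketch only: all constants below are assumptions, not
# values taken from any HR system, insurer or official definition.
REFERENCE_YEAR = 2024          # assumed year in which AGE was recorded
RETIREMENT_AGE = 65            # assumed statutory retirement age

# (earliest birth year, label), checked from youngest to oldest.
# Assumes adult employees, so no upper bound is applied to Gen Z.
GENERATION_START_YEARS = [(1997, "Gen Z"), (1981, "Gen Y"), (1965, "Gen X")]

# (exclusive upper age bound, label) for hypothetical premium bands.
PREMIUM_BANDS = [(30, "under 30"), (45, "30-44"), (60, "45-59")]


def generation(age, reference_year=REFERENCE_YEAR):
    """Interpret age as membership of a generational cohort."""
    birth_year = reference_year - age
    for start_year, label in GENERATION_START_YEARS:
        if birth_year >= start_year:
            return label
    return "pre-Gen X"


def premium_age_group(age):
    """Interpret age as an insurance pricing band."""
    for upper_bound, label in PREMIUM_BANDS:
        if age < upper_bound:
            return label
    return "60+"


def years_to_retirement(age, retirement_age=RETIREMENT_AGE):
    """Interpret age as remaining working years (never negative)."""
    return max(retirement_age - age, 0)


def interpret_age(age):
    """Return every interpretation of the same raw AGE value."""
    return {
        "age": age,
        "generation": generation(age),
        "premium_age_group": premium_age_group(age),
        "years_to_retirement": years_to_retirement(age),
    }


if __name__ == "__main__":
    for raw_age in (28, 50, 70):
        print(interpret_age(raw_age))
```

Note that none of these functions improves a model or cleans the data. Each one simply attaches a different frame of meaning to the same number, which is the kind of transformation described above.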
All of this leads to the key point that I want to make: the next evolution of data analytics and data science is going to be Information Science (a term coined in 1955 that includes the extraction and organisation of information and meaning from data). Despite increased access to data and algorithmic prowess, many organisations are still struggling to fully leverage it all.
I was recently asked by a client: “What does good analytics capability look like?” Good analytics capability requires maturity in three dimensions: data capability, computational capability, and translation capability. Data capability relates to the ability to identify, gather and organise data. Computational capability relates to the algorithmic sophistication applied to problem solving. Translation capability relates to the cognitive ability to interpret and leverage data for decision-making. The common thread that runs through these three capabilities is the enablement of information. Knowing how to unlock meaning in data provides clarity to an organisation’s data strategy, e.g. collect only those data that add incremental interpretation of a phenomenon of interest, and keep testing that the representation remains valid. Focusing on the meaning in data can remove unnecessary computational complexity and reduce friction in decision-making.
Ultimately, for an organisation’s analytics capability to mature, it must move out of the domain of technical specialists and into the hands of everyone in the organisation. As data science tools become more intuitive and automated, placing more importance on cognitive abilities, such as assessing the utility of information, is what will sustain our competitiveness in the knowledge economy.