About Data Proxies
Stuff they don’t teach you in kindergarten.
Background
We don’t spend enough time talking about data proxies in the course of our work as data analysts and data scientists. While “feature engineering” is a kind of proxy work, it rests on transforming already-acquired data to amplify linear correlations. What’s lacking are the upstream cognitive skills to identify which data to acquire in the first place to serve as appropriate proxies for a given phenomenon of interest.
My 131st article is an attempt to bridge this gap by suggesting some useful approaches to thinking about data proxies.
(I write a weekly series of articles where I call out bad thinking and bad practices in data analytics / data science which you can find here.)
What Are Data Proxies?
First, we never truly measure anything directly. The act of measurement itself is by way of proxy. A ruler or measuring tape is simply a proxy for a set of agreed standards of length, and it can be imperfect. Second, all data are proxies. For something. Understanding the context, and the relationship with other data points, enables us to understand what they might be proxies for.
In the practice of data analytics and data science, we constantly work with data proxies. Why? Because the phenomenon of interest that we are trying to measure or estimate is typically unobservable or immeasurable. And so we have to access it by means of another variable that is directly measurable and highly correlated with it. A classic example is “customer satisfaction”, which is proxied by net promoter score. Another classic example would be measuring “interest in a product” by using the proxy data of “webpage clicks” or “search words”.
The issue with data proxies is that they often don’t fully measure the phenomenon of interest, but simply a “slice” of it. They typically need to be properly calibrated (i.e. transformed) and triangulated. But no one teaches you the “how” and “what”, and many data analysts and data scientists fumble their way through via trial and error.
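To make “calibrated and triangulated” a little more concrete, here is a minimal Python sketch. It assumes you have a handful of periods where the phenomenon of interest happened to be observable, so that each proxy can be fitted to it with a straight line. All the numbers, and the function names fit_line, calibrate and triangulate, are made up for illustration. A real calibration may well need something other than a linear fit.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit of y ~ a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def calibrate(proxy_value, a, b):
    """Transform a raw proxy reading onto the scale of the target."""
    return a + b * proxy_value


def triangulate(estimates, rel_tolerance=0.15):
    """Average several calibrated estimates and flag whether they agree.

    rel_tolerance is an arbitrary, illustrative threshold on the spread
    between estimates, relative to their average.
    """
    centre = mean(estimates)
    spread = max(estimates) - min(estimates)
    return centre, spread <= rel_tolerance * abs(centre)


# Invented weeks in which the target (units sold) was observable.
clicks = [120, 200, 310, 400]
searches = [11, 21, 30, 41]
units_sold = [14, 22, 33, 42]

click_fit = fit_line(clicks, units_sold)
search_fit = fit_line(searches, units_sold)

# A new week where only the two proxies are available.
from_clicks = calibrate(250, *click_fit)
from_searches = calibrate(26, *search_fit)
estimate, proxies_agree = triangulate([from_clicks, from_searches])
print(round(estimate, 1), proxies_agree)
```

When the calibrated proxies disagree by more than the tolerance, that is a cue to go back to your world model rather than to average the problem away.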
Rules-of-Thumb
Allow me to share some useful approaches and rules-of-thumb when thinking about data proxies:
1. Build your mental world model through active observation.
2. Think radially and not linearly; consider what else has got to be true and embrace a multi-proxy approach.
3. Don’t be afraid to be creative; there’s more data out there than you realise.
Consider point (1). You need to actively live in the world. Observe the comings and goings. Observe the interactions. This helps to develop and strengthen the world model in your mind — how things work, how things interact, how things connect. For example, in your world model, you understand that in California, people drive to the strip mall stores to buy stuff like groceries, clothes, and such. And therefore, you can estimate a retail store’s “economic success” by counting the number of cars coming in and out of the parking zone in front of it, which is obtainable via satellite imagery. As a data analyst / data scientist, the best thing you can do is to get off your ass and unglue yourself from your computer screen. Do field work. Read broadly. Watch TV. Have an anthropological mindset so that you are not consuming data numbly but intently. The fidelity of your mental world model is going to be the most critical aspect since it is likely that you won’t have the opportunity to validate the correlation of the data proxy to the actual hard-to-measure phenomenon of interest.
Consider point (2). Using the same example of measuring “economic success”, does data about operating hours reinforce or invalidate our hypothesis? Do the types of parked vehicles also tell us something about demographics and spending power? That same satellite imagery could yield more signals (see my recent article “Maybe There’s No Such Thing as Noise”). As a data analyst / data scientist, the trick here is to strengthen our thinking by continuously asking “What else has got to be true?”. We switch our thinking mode from linear to radial. Cast the net all around. Are there pre-conditions (i.e. upstream inputs) and post-conditions (i.e. downstream impact) to the phenomenon of interest that are worth considering? Are there conjoined conditions? Do these stand up to the logic of your world model (you may need to do some research to convince yourself)?
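One rough way to put “What else has got to be true?” into practice is to write down, for each extra signal, the direction it should move in if the hypothesis holds, and then check the data against that. The sketch below is in Python and is only an illustration. The signal names, the readings and the helper functions trend and corroboration are all hypothetical, and a simple trend sign is a deliberately crude test.

```python
def trend(series):
    """Slope of a least-squares line through equally spaced observations."""
    n = len(series)
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


def corroboration(signals):
    """Check each signal's trend against the direction the hypothesis predicts.

    signals maps a name to (expected_sign, series), where expected_sign is
    +1 or -1. Returns a dict of name -> True if the trend matches.
    """
    return {
        name: trend(series) * sign > 0
        for name, (sign, series) in signals.items()
    }


# Hypothesis: "this store is doing better over time".
# Invented monthly readings for one store; every value here is made up.
signals = {
    "cars_in_parking_lot": (+1, [310, 325, 340, 362, 380, 401]),
    "weekly_opening_hours": (+1, [70, 70, 70, 66, 62, 60]),
    "share_of_premium_vehicles": (+1, [0.18, 0.19, 0.19, 0.21, 0.22, 0.22]),
}

checks = corroboration(signals)
supporting = sum(checks.values())
print(checks)
print(f"{supporting} of {len(checks)} signals support the hypothesis")
```

In this made-up case the car counts rise while opening hours shrink. That kind of contradiction is exactly what radial thinking is meant to surface: it should send you back to your world model to ask why, not to a bigger regression.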
Consider point (3). We are well past the Age of Digitalisation. You’ll be surprised by the kinds of data you can extract from the digital ecosystem. For example, you can now measure the rhythm and pressure of typing on a laptop to detect stress/fatigue and possibly even early signs of Parkinson’s. You can also measure the fluctuations in wi-fi signals to detect human presence or estimate footfall in a room. As a data analyst / data scientist, you need to keep yourself informed about the latest advancements in data instrumentation. Familiarise yourself with the art of the possible.
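To give a flavour of the wi-fi example, here is a toy Python sketch that flags “someone is probably moving around” whenever signal strength becomes unusually jumpy within a short window. It is a simplification for illustration only. The readings are invented, the function name presence_flags and the 1.5 dB threshold are my own assumptions, and real wi-fi sensing systems generally rely on much richer signal information than a single strength reading.

```python
from statistics import pstdev


def presence_flags(rssi_dbm, window=5, threshold_db=1.5):
    """Flag each full, non-overlapping window whose spread exceeds a threshold.

    rssi_dbm is a list of received signal strength readings in dBm.
    The window size and threshold are illustrative guesses, not tuned values.
    """
    flags = []
    for start in range(0, len(rssi_dbm) - window + 1, window):
        chunk = rssi_dbm[start:start + window]
        flags.append(pstdev(chunk) > threshold_db)
    return flags


# Invented readings: a quiet room, then someone walking through it.
readings = [-52, -52, -53, -52, -52, -50, -57, -49, -58, -51]
print(presence_flags(readings))  # [False, True]
```

The point is not the threshold itself but the idea that a by-product of existing infrastructure can act as a proxy for something you never instrumented directly.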
Conclusion
While the algorithmic side of working with data proxies is well known, the cognitive process of identifying appropriate data proxies isn’t similarly covered. A good data analyst / data scientist needs to intentionally construct their mental world model in their domain of practice, and continuously enhance it. This will allow them to select the right data proxies, and then further enrich them with upstream, downstream or sidestream considerations, bearing in mind the advancements in data instrumentation in an ever more digitalised world.
