Defining & Measuring Success of Gen AI Implementations
It’s all about mimicry.
Background
Everywhere you turn, organisations are hard at work figuring out the opportunity space for Generative AI (Gen AI for short), the application outcome of decades of work on Large Language Models (LLMs). But Gen AI implementations are fundamentally different from the more traditional predictive AI solutions. So how should one define and measure the success of Gen AI solutions?
In a recent conversation with a client, I realised this may be a glaring gap in organisational conversations around Gen AI adoption. A look around the internet also confirms the dearth of meaningful discussions on this topic — just a bunch of recycled, idiotic drivel that sheds no real light on the issues. And so I dedicate my 63rd article to proposing an approach to defining and measuring the success of Gen AI implementations in organisations.
(I write a weekly series of articles where I call out bad thinking and bad practices in data analytics / data science which you can find here.)
There’s No Uncertainty
Let’s start with the key difference between traditional predictive AI and Gen AI. All predictive AI solutions, regardless of flavour and permutation, reduce information uncertainty, thereby leading to better decision-making. We can therefore clearly articulate the success measures around their implementations by evaluating both the decision-making processes and the decision-making outcomes. But the thrust of Gen AI isn’t about reducing information uncertainty; its key essence is mimicry. In my 40th article (How To Think About Gen AI Use Cases), I wrote that Gen AI was developed to solve a different class of problems: reduce the time wasted seeking and searching for information, increase the speed and reduce the inaccuracies of information summarisation, and reduce the cost and effort of creating non-unique content through mimicry. In summary, Gen AI plays no direct role in improving decision-making.
In my 4th article (The Problem with Dashboards), I wrote about a useful framework for metrics design: the continuum of Input -> Activity -> Output -> Outcome -> Impact. This ‘metrics continuum’ was popularised in the 2011 United Nations (UN) Results-Based Management handbook to drive country-level development. This same continuum can perhaps shed light on how we perceive and measure predictive AI vs Gen AI.
The diagram below highlights the stages in the metrics continuum where predictive AI and Gen AI operate. Reducing uncertainty is an output endeavour, while mimicry is an activity endeavour. Once we see this, we have a clearer sense of how we should define and measure the success of Gen AI implementations.
It’s A Productivity Tool
I read on the internet that the rate of adoption should be a good metric for measuring the success of Gen AI implementation. The argument is that adoption is a proxy for utility: if adoption is high, users must see value in Gen AI, regardless of what that actual value might be. While I don’t disagree that adoption rate can be a useful metric, it is too high-level to really unpack what’s going on with the implementation. Others suggest user experience, derived through surveys, as the measure of success. I am generally not a fan of survey data because of their typically poor construction and non-representative samples. And in this case, survey-based user experience is equally high-level and not constructive.
The first thing to note is that because Gen AI’s objective is mimicry (it’s doing what humans can naturally do) and it sits in the activity phase of the metrics continuum, we can conclude that, unlike traditional predictive AI, Gen AI is a tool and not a solution. We should view it as we do Excel (for example) — we all acknowledge the immense value of Excel, but no one says Excel is a solution. Excel is not doing anything you can’t do for yourself; it’s just doing it faster, and probably more accurately. So how would you define the success of implementing Excel in the workplace? We would logically define and measure it as a productivity tool. We wouldn’t measure whether Excel is particularly user-friendly. We wouldn’t measure whether Excel’s formulae are accurate. We wouldn’t measure the utilisation of Excel’s many features. None of these measures would give us a real sense of success.
Instead, a good success measure must have a direct line of sight to the P&L; it should be aligned to one of the following: revenue, expense, efficiency, risk, or user experience. Traditional predictive AI can run the gamut of all five categories. But for Gen AI, I would argue that it sits mainly in the efficiency category, given that it operates as a productivity tool. Now, Gen AI can be implemented either as a fine-tuned tool for a very specific set of activities (e.g. a client-facing or employee-facing chatbot) or as a general tool (e.g. subscription-based ChatGPT). Either way, we can measure efficiency as the time saved from the start to the end of any specific activity that uses Gen AI; this can be instrumented because the activity occurs via digital interfaces (hence the use of Gen AI). In addition to time savings, we should also measure the number of reworks in the activity. A good implementation should significantly reduce time spent while also decreasing the rate of reworks. And that’s all that really matters for defining and measuring the success of any Gen AI implementation.
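To make this concrete, below is a minimal sketch in Python of how instrumented activity logs could be turned into these two measures: time saved relative to a non-Gen-AI baseline, and the change in rework rate. The field names and numbers are entirely made up for illustration; treat this as a sketch of the calculation, not a reference implementation.

```python
from statistics import median

# Hypothetical activity log records: each dict is one completed activity,
# instrumented at the digital interface where the work happens.
# Field names and values are illustrative only.
activities = [
    {"task": "draft_reply",   "minutes": 12, "reworks": 0, "used_gen_ai": True},
    {"task": "draft_reply",   "minutes": 35, "reworks": 2, "used_gen_ai": False},
    {"task": "summarise_doc", "minutes": 8,  "reworks": 1, "used_gen_ai": True},
    {"task": "summarise_doc", "minutes": 25, "reworks": 1, "used_gen_ai": False},
]

def efficiency_summary(records):
    """Compare median cycle time and rework rate, with vs without Gen AI."""
    with_ai = [r for r in records if r["used_gen_ai"]]
    without_ai = [r for r in records if not r["used_gen_ai"]]

    def stats(group):
        return {
            "median_minutes": median(r["minutes"] for r in group),
            "rework_rate": sum(r["reworks"] for r in group) / len(group),
        }

    baseline, treated = stats(without_ai), stats(with_ai)
    return {
        # Fraction of cycle time saved versus the non-Gen-AI baseline
        "time_saving_pct": 1 - treated["median_minutes"] / baseline["median_minutes"],
        # Change in average reworks per activity (negative is good)
        "rework_change": treated["rework_rate"] - baseline["rework_rate"],
    }

print(efficiency_summary(activities))
# With the toy data above: roughly 0.67 time saving, rework change of -1.0
```

In practice, the baseline would likely come from historical cycle times for the same activity, captured before the Gen AI rollout or from a comparable group not using the tool.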
Conclusion
Gen AI is the embodiment of the Imitation Game (Alan Turing’s famous test). But success isn’t about the fidelity of the imitation. Rather, success is about replacement. The more we can replicate, and thus replace, human activities, the more successful the Gen AI implementation. And this means doing the same human activities faster and with fewer (or at least no more) errors. Measuring the replacement rate, proxied by time savings and reworks, is the only relevant measure.