I’ve been in plenty of meetings where an eager young (or not so young) scientist shows up with a great graph showing tremendous progress in their work. This line went from here to there! I got 5% more of this metric, 10%! Isn’t that exciting! Meanwhile, businesspeople in the room look to each other, shifting uncomfortably, not understanding what they’re looking at, why they should care, and mentally calculating just how much money this meeting is costing them in terms of everyone’s hourly wage.
Consider this table:
Here we have a pretty typical case of finding a needle in a haystack. Maybe your haystack is fraudulent transactions, and you need to find them. Maybe your haystack is legal discovery, and you need to find the one document that will cost your client millions of dollars because of some obscure clause. Maybe your haystack is insurance claims adjustment, or loan approvals, or … whatever it is. You're looking for something, and you're probably being paid a large amount of money to find it.
But chances are, you're not alone. Chances are, the company you're working for is already keenly aware of the problem, has likely hired a number of experts to comb through this data looking for the needles, and already has a procedure in place. Great! You can use that procedure to establish a baseline of performance, something you have to beat to justify your existence in the organization. But maybe they don't have anything so formal; maybe they just go through each case by hand. Let's suppose their expenses look like this:
Given the above, at 500k cases per week, they are spending $2.5 million to catch $1.7 million, or a net loss of $800k. No wonder they need you!
First things first: I'm making this data up, and there is no way you should do that. You should absolutely find out what the costs of a false positive and a false negative are. If the cost of a false negative is literally millions of dollars, then combing through all the documents may make perfect sense, so long as your experts don't also cost millions of dollars. You need to understand the scope of the problem in terms of money and in terms of personnel, not in terms of F1 score. That is the language of the executive: what will this cost me, and what is my Return On Investment (ROI)? A gain of 5% in F1 means nothing to them, which is why so many data scientists and analysts get a sea of confused and frustrated faces.
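That cost trade-off fits in a few lines. Here's a minimal sketch of the back-of-envelope check, where every number is an invented placeholder you'd replace with figures from the business:

```python
# Back-of-envelope check: is reviewing every case worth it? Every number
# here is a hypothetical placeholder; get the real ones from the business.
cost_per_review = 5.0            # expert time to comb through one case
needle_rate = 0.001              # fraction of cases that are needles
loss_per_missed_needle = 3_400   # cost of one false negative

# Expected loss per case if you skip review entirely, assuming a manual
# review always catches the needle:
expected_loss_if_ignored = needle_rate * loss_per_missed_needle

# If reviewing a case costs more than the expected loss of ignoring it,
# blanket manual review is losing money on every case.
print(expected_loss_if_ignored < cost_per_review)  # -> True
```

With these made-up numbers, reviewing everything loses money per case, which is exactly the situation in the table above.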
(As a side note, if they already have case workers working on the problem, then that means that you will have manually labeled data generated by experts who are extremely incentivized to get the label correct. There is no better dataset to work with, trust me. Make friends with the DBAs and engineers who track and maintain that data— while your bosses will have a vague sense that that’s gold, you should definitely know what to do with that.)
So now, let’s revisit that original table, but add in some additional columns:
Well well, what have we here. For this particular case, you can set a threshold where your precision is 2.91% and your recall is 88.24%, and you will save your employer $785k/month. Unless you and your team are (somehow) getting paid more than that, you've just justified your no-doubt expensive existence to the executives. Time for celebration! Not only did you do a good job, but you've also expressed it well to the people who make the decisions, and that earns you political capital to spend on interesting problems.
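That operating point wasn't chosen to maximize F1; it was chosen to maximize dollars. A minimal sketch of picking a threshold that way (the function name and the two dollar figures are my hypothetical stand-ins, not anything from the table):

```python
import numpy as np

def net_savings(y_true, scores, threshold, value_per_catch, cost_per_review):
    """Net dollars saved at a threshold, versus reviewing nothing.

    y_true: 1 for needles, 0 for hay. scores: model scores, higher means
    more suspicious. value_per_catch and cost_per_review are assumptions
    you must get from the business, not from your model.
    """
    flagged = scores >= threshold
    tp = int(np.sum(flagged & (y_true == 1)))   # needles actually caught
    n_reviews = int(np.sum(flagged))            # cases a human must inspect
    return tp * value_per_catch - n_reviews * cost_per_review

# Sweep thresholds and keep the one that maximizes dollars, not F1:
# best = max(np.unique(scores),
#            key=lambda t: net_savings(y_true, scores, t, 3_400, 5.0))
```

The point of the sketch is the objective function: you rank thresholds by net savings, and precision/recall fall out as a consequence rather than a goal.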
Now let’s suppose you’ve done some feature engineering. You’ve figured something out, managed to show that you can improve the model even more. You managed to get to this point:
From a sheer numbers perspective, that is quite the improvement. Look at those numbers in the bottom row: 45% precision to 81% precision, 14% recall to 38% recall. Nice!
But wait. Somehow, the row the executives are interested in is still the third row. If you've been thinking just in terms of precision/recall/F1, you might still be glossing over it. You went from 2.91% precision to 3.23% precision, and from 88.24% recall to… 88.24% recall? Well, sometimes that's just how it works, but who cares! Look at that bottom row!
No, the third row is still the most interesting, because that's the row where you will save the most money. You still have an increase in dollars saved, from $785k to $835k. So this is a slam dunk, right? Ship it!
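Why does a mere 2.91% → 3.23% move the dollar figure at all? At the same recall you catch the same needles, but higher precision means fewer cases flagged for a human to inspect. A quick sketch, where the needle count is a hypothetical stand-in and only the two precisions come from the example:

```python
def cases_flagged(needles_caught, precision):
    # precision = tp / flagged  =>  flagged = tp / precision
    return needles_caught / precision

needles_caught = 1_500  # hypothetical; recall is identical for both models
old_reviews = cases_flagged(needles_caught, 0.0291)  # old model's precision
new_reviews = cases_flagged(needles_caught, 0.0323)  # new model's precision

# Same fraud caught, thousands fewer manual reviews: that gap is where
# the extra savings comes from.
print(round(old_reviews - new_reviews))  # -> 5107
```

Same needles, smaller haystack to hand to the case workers.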
Not so fast.
What kind of feature engineering did you do? Did you just go from shipping a model that had very few infrastructure requirements to one requiring a feature store like Feast or Tecton? What will that end up costing you? Will the execs have to hire additional staff to get the new model into production?
One of the great innovations in this field has been the introduction of MLOps platforms, of which there are now many. When I started doing this work, everything was done by hand, because whatever you got off the shelf likely wasn't better, was almost certainly closed source, and was unlikely to suit your use case as well as what you could write yourself. That is no longer true. Platforms like Metaflow, MLflow, and Kubeflow (apparently, we now flow in this business) all help ease the burden of deploying new models into production. Did you do your work in one of those? Maybe you did, maybe you didn't; but if you didn't, what will it cost your company to deploy the new model, compared to how much it will save them? Will they realize that $50k/month in additional savings, or not?
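One way to frame that question for the executives is as a break-even calculation. The $50k/month delta is the $835k − $785k from the example; the infrastructure and deployment costs below are invented placeholders:

```python
# All figures are hypothetical placeholders except the $50k/month delta,
# which is $835k - $785k from the example above.
extra_savings_per_month = 50_000
extra_infra_per_month = 20_000    # hypothetical: feature store, extra ops staff
one_time_deploy_cost = 120_000    # hypothetical: engineering time to ship it

monthly_gain = extra_savings_per_month - extra_infra_per_month
months_to_break_even = one_time_deploy_cost / monthly_gain
print(months_to_break_even)  # -> 4.0
```

If the break-even horizon is longer than the model's expected shelf life, the "better" model is a worse investment, and that's a sentence an executive can act on.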
In my experience, this conversation needs to happen among all the people involved in the project, especially the executive sponsors. It is the job of the data scientist to create this kind of Rosetta stone, translating the esoteric metrics that are our day-to-day understanding of the world into numbers everyone else can understand. If that bridge cannot be built, then the brilliance of your solution is irrelevant, and that sea of confused faces will begin to wonder just why they're spending so much money on data science.
In upcoming posts— how to facilitate that conversation.