Basic Building Blocks
Listing and understanding the people involved in a project, and where data science fits
In the last post, I talked about how to translate metrics that data scientists understand into metrics that everyone else can understand. But who are these other people, and what do they want? Answering those questions requires backing up to describing the basic building blocks of a project: Who, What, When, Where, How, and Why.
I first heard these six words used together in elementary school when describing how journalists do their work. My teacher insisted that all stories have these six components, and unless I could list each of those six things for any given news story, then I couldn’t possibly understand what happened. I’ve come to understand that the same six words need to be known and understood for projects to be successful.
On top of the six words, another layer exists: who is responsible for each of those six words. Everyone has a role to play to bring a data science product into the market.
Why
First, let’s start with Why. When people do not understand why they are doing a task, they can easily get distracted and do work that is close to, but not exactly, relevant to the problem. In successful businesses that I’ve been a part of, the why is owned by the businesspeople; that is, someone in the organization understands intimately what the market is and what the market wants, whether the business is microscopes, tickets, or commercial real estate. If you do not know who that person is, you should, because they are responsible for describing the reason for the business in the first place, and most likely participated in growing it enough to afford a (no doubt expensive) data science team. That person, or those people, will determine why the business needs to invest in a new initiative.
In this context, businesspeople have two general ways of thinking about tasks, projects, and teams. A project exists either to increase revenue or reduce costs.
A project to increase revenue needs some sense of how much revenue can be expected and how much it will cost to earn it. As a data scientist, you may find yourself asked to perform some of these projections. As a general rule, I don’t consider this work to be data science per se but rather analytics, though that distinction has become very blurry over the years, so it may be useful to familiarize yourself with ways of creating cost and revenue projections.
Projects that increase revenue should only be started if the projections can show that more money will be made than spent. I’ve made the mistake of thinking that if a problem is hard to solve, it will also be lucrative to solve; those two are not correlated. The world has plenty of hard, thankless problems, and as a general rule, successful businesses avoid those problems, or find some way to apply the solution of a hard problem in such a way that the solution can make money. Finding solutions to lucrative problems is the job of the businessperson. Part of your job as a data scientist will be advising what problems you can solve, and working with the businessperson to determine if the solution has any kind of merit.
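As a concrete sketch of what such a projection might look like, here is a back-of-the-envelope break-even calculation. Every number in it is made up purely for illustration; a real projection would use the businessperson's actual cost and revenue estimates:

```python
# Hypothetical back-of-the-envelope revenue projection.
# All figures are invented for illustration only.
monthly_new_revenue = 40_000   # projected revenue lift per month
build_cost = 250_000           # one-time cost to build the feature
monthly_run_cost = 10_000      # ongoing infrastructure + maintenance

def cumulative_profit(months: int) -> int:
    """Net position after a given number of months."""
    return months * (monthly_new_revenue - monthly_run_cost) - build_cost

# Break-even: the first month where the cumulative position is non-negative.
break_even = next(m for m in range(1, 121) if cumulative_profit(m) >= 0)
print(break_even)  # months until the project pays for itself
```

If no month in the planning horizon reaches break-even, that is exactly the "more money spent than made" signal the paragraph above warns about.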
Projects that reduce costs are more cut and dried. I’ve only seen companies engage in cost-cutting when the perceived costs of a program outweigh its worth. That scenario can play out in multiple ways: cost cutting can be done to reduce cloud bills, but it can also be done to reduce fraudulent transactions. In these situations, the actual amount of money to be saved is well known, and that creates a fundamental difference between the two types of projects.
Cost reduction projects have a finite measurement of success; the company cannot save more money than it’s spending. Projects that increase revenue can produce a multiple on the amount of money that’s spent, although most projections I’ve seen tend to be handwavy about the exact size of what’s achievable. Terms like Total Addressable Market (TAM) are thrown around as an estimate of the size of the pie, and there are plenty of consulting and data firms that are happy to help produce a TAM for an idea. For a fee, of course.
For the data scientist, understanding which side of this coin a project falls on will help to understand how to craft success metrics and produce that Rosetta Stone.
What
Once the why of a project has been decided, the next step is to determine what to do. In many places where I’ve worked, that next step is a product to be managed by a Product Manager. In Agile terms, the product manager is the Single Neck To Strangle, a term that’s not very PC but captures the need for the organization to have a single individual responsible for making decisions about a product. This person is almost never a data scientist, and for good reason; this person needs to be highly attuned to the needs of the business, and needs to have a set of skills to create products— an entire discipline unto itself.
Do not underestimate your product manager; they will be the person who makes the final call about what happens next, including whether or not to continue to use and develop data science tools in their product.
If there isn’t a single person who is making decisions about what the solution to the problem is, that’s an organizational problem that needs to be addressed. One person needs to be making calls; modern engineering and data science projects are all about tradeoffs, about selecting one technology or methodology over another. Without a single arbiter or final decision maker, projects will wallow in indecision and you will probably find yourself looking for a new job when the money runs out.
For a product that will involve a data scientist, here is a list of decisions that the product manager will own:
The acceptable performance of a model deployed to production. The data scientist may provide a series of models and thresholds, but ultimately the product manager should own the model, threshold selection, etc. For instance, in the example in the previous article, it’s the job of the product manager to choose line three, not the data scientist.
The necessary speed of delivery, or SLA (Service Level Agreement) for a solution. If a system needs to be producing recommendations in sub-millisecond time, that’s a very different problem than producing recommendations in bulk once a week or once a month.
Rollout plans. Are you rolling out to one customer, then ten, then a hundred? Private invite-only beta like Gmail’s initial rollout or a Super Bowl commercial? Is there a regional component to rollouts in, say, one city or state before rolling out nationally/globally? Ideally, these plans are done with Product Marketing and Marketing teams, and data science/analytics may be used to inform different go-to-market plans.
Acceptable tradeoffs between speed, quality, and cost. Throughout my career, one aphorism I keep hearing is “Speed, quality, cost: choose two.” The product manager will be the one responsible for going to market with a lower-quality solution in exchange for a shorter time to market (i.e., getting something out the door as quickly as possible), or for taking more time to create a more complete solution. One reason to move quickly is to gather more information about what the market responds to positively or negatively. In other words, everyone has a plan until they get punched in the mouth, and maybe adjusting the plan on the fly after exposure to the market is a great idea… or maybe not. I do not envy product managers for this decision.
The trick is, data science is not the product manager’s job. As a data scientist, your job will be explaining what can and cannot be done, and more importantly, what experiments can be done to validate whether a technology is a fit for a problem.
When, Who, Where
I have seen multiple different approaches to these three questions. In larger companies, I’ve seen these questions be the responsibility of Project Managers. These people help the business as a whole understand who has been committed to what project and when projects are expected to finish so that their impacts can begin to be accounted for by the finance types. For software companies, the question of where has become less important as a lot of work has become virtual, but for companies involved in hardware or physical devices, the logistics involved in getting prototypes and physical goods into the hands of the people who will work on said prototypes can be a tremendous undertaking.
The more project/product teams a company has, the greater the potential for redundant projects and redundant work. An ideally streamlined organization would have no redundant work, since that work yields no real overall gain. Redundant work can exist for a wide variety of reasons, from managers simply not talking to one another to political infighting and other unfortunate behaviors. At some point, if enough redundant work accumulates, a cost reduction project will likely be undertaken to trim out the redundancies, which is generally not a great feeling for anyone, especially those who find themselves on teams considered redundant. Better to get ahead of redundancies by having a group of people who watch the organization as a whole and prevent that situation from arising in the first place.
Jira and other issue tracking systems are often (but not always) under the purview of project managers. These systems allow project managers to understand team velocity: how much work will be done, and by when. Those figures let project managers estimate how long a project will take, and then compare those estimates to whatever roadmap was made at the project’s inception. If a particularly important project is not on track to finish on time, it is the project manager’s job to raise red flags so that some sort of managerial intervention can happen to maintain delivery expectations.
In smaller companies with a handful of teams or fewer, there likely isn’t a need for a fully fleshed-out project management organization. The threshold differs between organizations, but generally, once a business has grown to contain more product managers than can report to a single Head of Product, and there are two or three levels of product managers reporting to that head, cohesion can begin to unravel: there simply aren’t enough hours in the day for the head of product to know what every one of their people is doing in sufficient detail to prevent redundancies.
To the data scientist, these logistical questions are very important, because you will likely be tasked with determining what technology can be applied to solve the problem and then describing how long a team will need to create the solution. In other words, you will be working with the product manager to build out a roadmap. Even so, without input from engineering, any roadmap you make will be wrong.
How
How is owned by engineering, full stop.
Let me emphasize that again— engineering, not data science, owns the how, the details of how a solution will be implemented in production.
There are many different flavors of engineering in data science projects, and the famous 2015 Google NeurIPS paper, “Hidden Technical Debt in Machine Learning Systems,” serves as a landmark declaration of just how much engineering work is required to bring data science into production.
Data engineering is a critical component of any successful data science project. Ninety percent of the work that I’ve done on data science projects has centered on feature engineering and on gathering and preprocessing data. Train/test splits, metric determination and calculation, and the like all pale in comparison to getting quality data at the start of a project and delivering it consistently to a model in production in exactly the same form as the data used to train the model.
I have found myself in charge of both data science and data engineering, and they are very different disciplines. Data engineering has much more well defined boundaries. Data can be delivered within a specified amount of time, in specified formats, and in specified locations, without any randomness. Data science, on the other hand, has a degree of probability about it; you know that some percentage of the time, the model will be wrong, just as humans can be wrong.
As a trivial example, think of the problem of facial recognition. The number of times I’ve seen someone on the street I thought I knew but was in fact a perfect stranger is not zero, and I don’t think I’m alone in that. Is that the fault of my eyes for delivering bad information to me, or the fault of my brain for a bad interpretation? Well, I’ve had LASIK, so I can’t really blame my eyes.
Moving models into production, getting data to those models, taking predictions from those models— these are all tasks for the engineering team.
So what about data science?
In all of this setup, where does data science fit?
Well, the businessperson has a reason why they want to do a project. The product manager will decide what to do on the project. The project manager (or some other function) will decide who works on the project and when (now?). The engineering team will decide how things will be done.
Data science can inform all of these steps.
In my experience, data science works best as a series of experiments around a hypothesis. The team as a whole, with the final decision going to the product manager, will have a goal of what to accomplish and can brainstorm several ideas that could accomplish the goal. The data scientist will pick a few of those ideas to explore, and then present the results of the experiment to the group. The group will then choose what to do next.
Say you have a goal of reducing fraudulent transactions. This project is squarely in the cost reduction bucket; you know exactly how much money the business is losing to fraud, and the best outcome you can hope for is to prevent all fraudulent transactions while allowing all valid ones. You will need to know the consequences of the project’s failure so that you can help determine a useful metric for success. Accuracy, for instance, is almost certainly the wrong metric, yet in my experience, a lot of people reach for accuracy. Why is accuracy such a terrible metric here? Suppose 3% of transactions are fraudulent. At that rate, credit card companies may refuse to deal with your business due to the volume of chargebacks, and then you may be out of business because you simply can’t accept money online for goods or services. But a 3% fraud rate also means that 97% of transactions are good, so a model optimized for accuracy could be right 97% of the time by declaring every transaction good.
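The accuracy trap is easy to demonstrate with a toy calculation, using the hypothetical 3% fraud rate from above:

```python
# Toy illustration of the accuracy trap at a 3% fraud rate.
# Labels: 1 = fraudulent, 0 = legitimate.
labels = [1] * 3 + [0] * 97  # 100 transactions, 3 fraudulent

# A "model" that always predicts "legitimate" (0).
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the fraud class: what fraction of fraud we actually caught.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy: {accuracy:.2f}")  # 0.97 -- looks great on paper
print(f"recall:   {recall:.2f}")    # 0.00 -- catches zero fraud
```

A do-nothing model scores 97% accuracy while catching exactly none of the fraud, which is why a class-sensitive metric such as recall (or precision/recall together) is the better starting point here.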
It is the job of the data scientist to avoid that trap and advise everyone else to avoid that trap. No one else is as equipped as the data scientist to do so, and more importantly, that’s what data science brings to the table.
In that example, the important piece of information is understanding that a 3% fraudulent transaction rate will result in serious business consequences, and selecting a metric that avoids the naive pitfall. Talking to the businesspeople will definitely help, as will crafting experiments to determine whether the proposed solution will actually improve business conditions.
At its core, data science is science, and science is about testable hypotheses. In this case, you know that you need to solve a class imbalance problem and that knowledge can lead you to a whole collection of methods and solutions that can be deployed out of the box. But which one will work for you and yours? Pick a method and create a comparison against the current fraud detection system. The testable hypothesis is that this new method will produce a better outcome than the existing system. What’s a better outcome? Well, one way to remove all fraudulent transactions is to just shut down all transactions, but then the business will close, so that’s not a better method.
You need to:
Devise a method for measuring success, ideally in communication with everyone on your team
Measure the current system using that method, and calibrate your metric so that failure and success are obvious
Choose a technique to solve the problem
Run that technique on a prepared dataset (the creation of which is a problem in and of itself, but hopefully engineering can help you here)
Measure the new system
Compare the measurements in steps 2 and 5
Present your solution to your team
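Steps 1 through 6 can be sketched as a minimal comparison harness. Everything here is a hypothetical placeholder: the metric is plain precision/recall, the “dataset” is a handful of invented transactions, and both “systems” are stand-in threshold rules rather than real models:

```python
# Minimal sketch of steps 1-6: agree on a metric, measure the current
# system and a candidate on the same labeled data, then compare.
# All names, data, and rules are hypothetical placeholders.

def recall_and_precision(predictions, labels):
    """Step 1: the agreed-upon success metric (fraud recall and precision)."""
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    fn = sum((not p) and y for p, y in zip(predictions, labels))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# A prepared, labeled dataset (step 4's input): (amount, label), 1 = fraud.
dataset = [(5, 0), (9000, 1), (30, 0), (7500, 1), (45, 0), (120, 1)]
labels = [y for _, y in dataset]

# Current system (step 2): a simple amount-threshold rule.
current = [amount > 8000 for amount, _ in dataset]

# Candidate technique (steps 3-5): a lower threshold standing in
# for a real model's predictions on the same dataset.
candidate = [amount > 100 for amount, _ in dataset]

# Step 6: compare both systems under the same metric.
print("current:  ", recall_and_precision(current, labels))
print("candidate:", recall_and_precision(candidate, labels))
```

The point of the harness is the structure, not the numbers: both systems are measured with the identical metric on identical data, so the comparison in step 6 is meaningful, and the results are what you bring to the team in step 7.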
In step 7, the goal is not just to show what you can do; it’s to help the team and ultimately the product manager decide whether to use your model in production. Assuming that the results you present show a significant improvement, the team will want to take your prototype to production. The engineering lead will need to determine how much work will be required to do step 4 repeatedly, consistently, and within SLAs/SLOs/SLIs specified by the product manager. The project manager will need to weigh in with who will be available and when to help do the work specified by engineering, as well as to coordinate other resources the product may need to go to market.
If the product manager decides that the solution is inadequate, then repeat steps 3-7 until you arrive at a solution that the entire team can support (or you go out of business or get fired because that fraud rate is awful and you should be able to find some way to improve it).
Note that while you may be relatively expensive, you are almost certainly cheaper than an entire product engineering team. It is cheaper for the business to hire you to do these explorations for a few weeks/months than to have the team build some poorly specified and poorly understood solution that may not work at all. Your work should be done before specifications are made so that everyone on the team is clear on what solution they should implement and what kind of performance they can expect from that solution.
In upcoming posts— what does that presentation look like? How frequently should you be checking in?