In the last article, I talked about one of the persistent needs of a data science team: someone to shepherd and curate data into datasets that a scientist can use with full confidence to accomplish a task. In this article, I'm going to dig a little deeper and look at the different skillsets that could fill this role.
Stable Diffusion XL seems to think that a data librarian should have a magical blue squirrel-phoenix as an assistant; who am I to say that won’t help?
To recap, a data librarian will identify, procure, label, clean, and manage different datasets that a data science team will use to create a model. Since model creation and maintenance is not a one-time task but happens over the lifetime of the product, the datasets themselves need to be maintained as well. Different versions of the dataset need to be matched to the pertinent model created by the DS team and tracked alongside those models and any relevant performance metrics, ideally under a single easy-to-find identifier.
From my perspective, for the purposes of this discussion, data falls along a spectrum between structured and unstructured. Unstructured data, such as text documents, images with or without captions, brochures, marketing materials, PowerPoint decks, videos, and the like, is becoming more and more useful in untouched form as fodder for LLM-based RAG search engines. Structured data, the kind of information that fits into a database, is the much more traditional format for data handed to a data science team. Either way, someone needs to prepare the data, and those preparation steps include (but are not limited to!):
Procurement/Creation. The data has to come from somewhere, either a well-structured data warehouse or a third party. Data brokers are happy to sell structured data to anyone, but depending on what's already available, they may not be providing new or different information. These third-party datasets may also not have been cleansed of important and sensitive information, such as phone numbers that appear on a national do-not-call registry. Unstructured data may require having someone go out and grab the data from various sources, along with a license from the creator to allow the data to be used in a model. Data engineering teams can greatly help with these steps, or own them outright. Getting data is necessarily the starting point for everything that follows, but it doesn't have to be owned entirely by the librarian; more developed teams may place ownership of the process with data engineering, driven by the librarian who sees holes in the collection.
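As a concrete illustration of that kind of scrub, here is a minimal sketch that drops purchased leads whose phone numbers appear on a do-not-call list. The file names and the phone column are assumptions, not a prescription:

```python
import pandas as pd

# Hypothetical files and column names, purely for illustration.
leads = pd.read_csv("broker_leads.csv", dtype={"phone": str})
dnc = pd.read_csv("do_not_call_registry.csv", dtype={"phone": str})

def digits_only(series: pd.Series) -> pd.Series:
    # Strip everything but digits so formatting differences don't hide matches.
    return series.str.replace(r"\D", "", regex=True)

leads["phone_norm"] = digits_only(leads["phone"])
dnc_numbers = set(digits_only(dnc["phone"]))

# Keep only leads that are not on the registry.
scrubbed = leads[~leads["phone_norm"].isin(dnc_numbers)].drop(columns="phone_norm")
scrubbed.to_csv("broker_leads_scrubbed.csv", index=False)
```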
Labeling. Labeling converts data from a collection of information that could maybe one day be useful into data that is useful right now for a data science team. Producing labels usually means labeling programs, either written and maintained in-house for specialized use cases or built on packages like Label Studio, SageMaker Ground Truth, Mechanical Turk, or Snorkel (a sketch of programmatic labeling follows the questions below). For companies with a large collection of internal documents, labeling data that has already been obtained cuts out a huge amount of the work of obtaining data in the first place. Labeling still comes with its own perils, such as:
Do you trust your labelers? Are they motivated to care about what you care about, or are they just in it to label as quickly as possible for cash? Are they subject matter experts (SMEs), or are they random people working under contract? Different labelers have different levels of trustworthiness, and different levels of expertise are needed for different tasks. Labeling medical data may require doctors, labeling street signs may require someone familiar with those signs (e.g., offshore teams unfamiliar with local signage could introduce some cultural confusion), while labeling legal clauses may require lawyers.
How much data do you need to label? The greedy answer is "all of it," but in reality, you might be able to get away with just 1000 pieces of labeled data, so long as you have good coverage. What is good coverage? "Good" coverage is achieved when as much of the space of searchable examples as possible is covered, including (and especially!) any place where humans could disagree on the answer.
How frequently do you need to label? Language changes, data drifts, and what was true yesterday may not be true tomorrow. The starting lineup of the Lakers changes every year, and anyone who produces a labeled dataset showing Magic as the current starting point guard is a bit out of date.
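To make the labeling-program point concrete, here is a minimal sketch of programmatic labeling with Snorkel. The label scheme (liability clauses in contracts) and the heuristics are invented for illustration:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier

# Invented label scheme: is a contract clause a liability clause?
ABSTAIN, NOT_LIABILITY, LIABILITY = -1, 0, 1

@labeling_function()
def lf_mentions_indemnify(row):
    # Weak heuristic: "indemnify" strongly suggests a liability clause.
    return LIABILITY if "indemnify" in row.text.lower() else ABSTAIN

@labeling_function()
def lf_mentions_payment_terms(row):
    # Another weak heuristic voting the other way.
    return NOT_LIABILITY if "payment schedule" in row.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "The vendor shall indemnify the client against all claims.",
    "The payment schedule is defined in Appendix B.",
]})

applier = PandasLFApplier(lfs=[lf_mentions_indemnify, lf_mentions_payment_terms])
label_matrix = applier.apply(df=df_train)  # one column of votes per labeling function
print(label_matrix)
```

Heuristics like these don't replace human labelers; they stretch a small pool of trusted labels further and flag where the heuristics disagree, which is exactly where SME attention is most valuable.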
Quality checks. For structured data, frameworks like Great Expectations or Pandera can be a great help. For unstructured data, actually reading/skimming documents to determine relevance seems to be the best way to go, although such activities can require an extensive labeling team. An unstructured document may be a great fit for some purposes and a lousy fit for others; a law textbook, for instance, may be a great fit for an LLM about the law, but a terrible fit for a chatbot designed to help someone navigate their credit report.
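For the structured side, a minimal Pandera sketch might look like the following; the table, column names, and bounds are assumptions standing in for whatever contract you agree on with the data science team:

```python
import pandas as pd
import pandera as pa

# Hypothetical schema for a claims table; names and bounds are made up.
schema = pa.DataFrameSchema({
    "claim_id": pa.Column(str, unique=True),
    "claim_amount": pa.Column(float, pa.Check.ge(0)),
    "state": pa.Column(str, pa.Check.isin(["CA", "NY", "TX"])),
})

df = pd.DataFrame({
    "claim_id": ["a1", "a2"],
    "claim_amount": [125.0, 980.5],
    "state": ["CA", "TX"],
})

# With lazy=True, validation collects every violation into one error
# instead of failing on the first, which makes drift reports more useful.
validated = schema.validate(df, lazy=True)
```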
Profiling. Profiling with a library like pandas-profiling (now ydata-profiling) shows the distribution of the data and helps to understand which fields in a structured data set contain the most entropy. Information entropy points to places in the data where important distinguishing characteristics may lie. A field's schema might allow a wide range of values for a floating point number or N possible values for a categorical variable, but in the actual data, that categorical field may typically take only one or two of the possible N. The appearance of a third value may or may not be indicative of something interesting; it may also be noise, and profiling to find those outlier entries helps the data science team focus their efforts accordingly. Profiling unstructured data is more about surveying the range of topics and formats covered in the constituent documents. If the entire data set is composed of medical forms, for instance, then the appearance of a spreadsheet or poster is almost certainly a fluke to be removed. Another important aspect of profiling is determining the ability of the data to cover the use cases likely to be seen in production; if the training data doesn't cover a particular case that appears in actual use, then that exact use case needs to be incorporated into the next version of the dataset. I would dearly love to know before I roll a model out to production that there are certain cases the training data does not cover so I can prepare my stakeholders for those known problems, but you can't always get what you want.
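A minimal sketch of the tooling side, assuming a tabular extract in a CSV (the file name is a placeholder):

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Placeholder extract; swap in whatever table you're curating.
df = pd.read_csv("claims_extract.csv")

# Generates an HTML report with per-column distributions, missing values,
# cardinality, and correlations -- a quick first look at where the entropy is.
profile = ProfileReport(df, title="Claims extract profile", minimal=True)
profile.to_file("claims_extract_profile.html")
```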
Redundancy checks. I put redundancy checks outside of profiling, but profiling often uncovers redundancy. Redundant information in a supposedly already-cleaned dataset has caused me more problems than I'd care to admit, so I'm including redundancy as its own category. It's already a well-known practice in structured data to remove duplicate rows, where the same information appears multiple times in a dataset. Duplication can cause that sample to be given more importance than other data since it appears more than once, which may not be the intention of the team. Duplicate unstructured documents or duplicate images can similarly unbalance an unstructured training set. Defining what makes a record a duplicate is a tricky and subjective practice, in my experience. If five of eight fields match, the remaining three fields may still contain important distinguishing information, or they may not, and consulting with SMEs and product managers will help to guide this process.
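For the structured case, a minimal pandas sketch; the file and the identity columns are assumptions, and picking those identity columns is exactly the subjective part that needs SME input:

```python
import pandas as pd

df = pd.read_csv("claims_extract.csv")  # placeholder file name

# Exact duplicates: every field identical.
df = df.drop_duplicates()

# "Business" duplicates: same entity even if a few fields differ. Which subset
# of columns defines identity is the judgment call discussed above.
key_columns = ["claim_id", "customer_id", "claim_date"]  # assumed identity columns
print(df.duplicated(subset=key_columns).sum(), "near-duplicate rows found")
df = df.drop_duplicates(subset=key_columns, keep="first")
```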
Coverage. The opposite of redundancy, coverage checks that the data covers as many of the scenarios seen in the wild as possible. First passes at solving data science problems almost never have full coverage, because some corner cases (or not-so-corner obvious cases) will appear only after first contact with users. Those cases need to be captured and labeled quickly to improve model performance, or else model performance in production will not match model performance during creation. Even before a model is first released, though, a responsible team will look through the data and theorize about what they might have missed, and attempt to compensate.
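One simple way to check this once a model is live is to compare the distribution of a key categorical field between the training set and a sample of production traffic. Everything in this sketch (file names, the document_type column, the 1% threshold) is an assumption:

```python
import pandas as pd

train = pd.read_csv("training_set.csv")
prod = pd.read_csv("production_sample.csv")

train_share = train["document_type"].value_counts(normalize=True)
prod_share = prod["document_type"].value_counts(normalize=True)

coverage = pd.concat([train_share, prod_share], axis=1, keys=["train", "prod"]).fillna(0)

# Categories that show up in production but are missing (or nearly missing)
# from training are candidates for the next labeling pass.
missing = coverage[(coverage["prod"] > 0) & (coverage["train"] < 0.01)]
print(missing.sort_values("prod", ascending=False))
```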
Ongoing maintenance. A static dataset will soon be a stale representation of the state of the world. Cleaned data used in model creation should be versioned, and newly acquired data will then be versioned along with the next model trained on the larger dataset. Whatever metrics are used to indicate the performance of each model should also be captured along with the model, ideally in such a way that the model performances can be directly compared. For instance, if a model's performance is understood through a precision/recall curve, keeping the constituent data points that create the curve will be helpful so that curves can be compared to one another in the same graph, even if the points themselves are generated months apart.
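A minimal sketch of capturing those constituent points alongside a dataset/model identifier; the toy labels, scores, and version-naming scheme are all assumptions:

```python
import json
from sklearn.metrics import precision_recall_curve

# Toy evaluation results standing in for a real held-out split.
y_true = [0, 1, 1, 0, 1]
y_scores = [0.1, 0.8, 0.65, 0.3, 0.9]

dataset_version = "claims-v3"               # assumed identifier scheme
model_version = "claims-classifier-2024-06"

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

record = {
    "dataset_version": dataset_version,
    "model_version": model_version,
    "precision": precision.tolist(),
    "recall": recall.tolist(),
    "thresholds": thresholds.tolist(),
}

# Stored next to the model artifact, these points let you overlay PR curves
# from models trained months apart on the same graph.
with open(f"metrics_{model_version}.json", "w") as f:
    json.dump(record, f)
```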
Given these preparation steps, the data librarian will need to have a very particular set of skills:
Work with labeling programs. Labeled data is the lifeblood of a properly functioning data science team, and obtaining labels is a nontrivial endeavor. I've written labeling programs, and I've used labeling programs, and there are definitely a lot of nuances that can trip people up. Determining the trustworthiness of labelers, the quality of the data to be labeled, and the quantity of data to be labeled are all decisions with serious potential downstream effects. I would expect the librarian to work with data scientists to understand the creation of the initial dataset, as well as ongoing label acquisition and checkpointing of data. For instance, assume that you have labeled data that is pertinent to a particular version of a product. As the product evolves, is that labeled data still useful, or do the users have significantly more options than they did when the data was first gathered? Should the old labeled data be pulled out of datasets in that case, or do those labels still help the underlying model perform well?
Work with data scientists to bring data cleaning steps into the flow of data management. Even with all of this help, a big part of a data scientist's job will be preparing data and features to feed into models. Whatever steps the scientist determines will create the most effective model they can build should be immortalized in production-ready code so that those functions can be run easily and repeatably on subsequently obtained data. The librarian may or may not be involved in creating a test suite to wrap around these transformation functions; that determination of responsibility should flow from a conversation with data engineering and data science. The result of DS work will be a dataset enhanced with engineered features and cleaned data; both the original and the cleaned data need to be preserved for future work, as well as the steps used to clean that data.
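As a sketch of what "immortalized in production-ready code" can look like, here is a hypothetical cleaning function with a unit test; the column names and rules are invented, and who writes the test is the responsibility conversation mentioned above:

```python
import pandas as pd

def clean_claims(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step promoted from a notebook: normalize state
    codes and drop rows missing the target amount."""
    out = df.copy()
    out["state"] = out["state"].str.strip().str.upper()
    return out.dropna(subset=["claim_amount"])

def test_clean_claims_normalizes_state_and_drops_missing_amounts():
    raw = pd.DataFrame({"state": [" ca", "NY "], "claim_amount": [100.0, None]})
    cleaned = clean_claims(raw)
    assert list(cleaned["state"]) == ["CA"]
    assert cleaned["claim_amount"].notna().all()
```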
Think critically about data coverage, from the various types of statistical bias that may be present to whole aspects of behavior that may not have been covered in the data. Perhaps one group of documents was scanned by a terrible potato scanner, and so the quality of the information is very low; without actually looking at those documents and comparing them to others in the set, that problem may get missed entirely and result in disastrous performance on whatever information those documents are intended to cover.
Work with external vendors to find data that exists outside the organization. Many times, these external vendors may not have useful information, so profiling and quality checks are even more important in this context, especially for legal compliance. This function may already be performed by other teams for their specific products; the focus of the data librarian is to make data available for data scientists, not directly for a product, so the librarian can treat data those internal teams have already cleaned and prepared as internal rather than external. In addition, there are plenty of government-issued open datasets freely available that can be onboarded at little to no cost, and the Common Crawl is a great way to get historical snapshots of the web. Even these datasets should only be used with some kind of licensing agreement or understanding in place; scraping people's data without a license can land you in some hot water and will not endear you to them.
A lot of the software packages I've mentioned here (and several others I didn't, such as data versioning tools like DVC) would need to be in the librarian's toolbox. The librarian I'm looking for would be knowledgeable in all of these packages and would also have a general background in computing, so that they can interact with the scientists, engineers, architect(s), and ops as a peer.
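As one hedged example of where DVC fits, assuming the cleaned datasets live in a git repo tracked by DVC and each dataset version is marked with a tag (the repo URL, path, and tag below are placeholders):

```python
import io
import pandas as pd
import dvc.api

# Read the exact revision of the cleaned dataset a given model was trained on.
csv_text = dvc.api.read(
    "data/claims_clean.csv",                            # placeholder path
    repo="https://github.com/example-org/claims-data",  # placeholder repo
    rev="claims-v3",                                    # placeholder dataset tag
)
df = pd.read_csv(io.StringIO(csv_text))
```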
A final note: I know it's very trite to slap "data" in front of a title these days, but I need a title that people looking for a job will recognize as involving data and not (necessarily only) books, and one that indicates programming skills will be part of the job. This person should be a peer to data scientists and engineers, not an underling, and should be able to think curiously and critically about potential issues in the datasets they're curating, anything from important records that were overlooked and not included to records that don't accurately reflect reality.