One of the most common refrains we hear from institutions considering an enterprise analytics initiative is the concern that they’re “not ready” because they have “bad data”. Maybe they think that an analytics platform is like a precisely engineered jet engine that can only run on high-octane, ultra-pure data as fuel. Or maybe they think that an analytics platform is like a decadent dessert, and they haven’t earned the right to have it until they’ve forced down their data-governance vegetables. Luckily, neither analogy rings true. A better metaphor is to think of bad data as muscles that have atrophied from years of under-use; the only way to improve is to start exercising those muscles by actually using the data for institutional decision-making. In this scenario, a data analytics initiative is like a sturdy pair of running shoes that can enable your data-integrity couch-to-5K program.
Having “bad data” is not a reason to delay an analytics initiative. It is the reason to start now.
To better understand how an enterprise data analytics initiative can help to improve data quality, it is important to understand the root causes of the bad data problem. In my experience, when people lament having “bad data”, they may mean one of three distinct things:
- the data are misclassified or misunderstood,
- the data are inaccessible, or
- the source system data are incorrect.
Typically, all three of these issues are present to varying degrees within any large and complex organization. We will look at each issue in turn, investigate why it occurs, examine some examples, and discuss how implementing a data analytics initiative is a key part of alleviating it.
Data Quality Issue #1: The data are misclassified or misunderstood
When people say that their data are “bad” or “wrong”, I usually find that what they really mean is that the data have not been organized in the way they want to see them. Take a seemingly simple example, such as the question: “What were tuition revenues last year?” Do we mean last fiscal year, academic year, or calendar year? Gross tuition or net tuition? Tuition alone or tuition and fees? Assessed or collected tuition? None of these options is wrong, and every possible variation is potentially valid in certain circumstances. However, all too often the CFO is simply handed two different reports: one of tuition assessments from last academic year, and another of tuition-and-fees collections from last fiscal year. When they inevitably don’t match, it just seems like another case of bad data.
Fundamentally, however, this is not an issue of data quality or cleanliness, but of data organization, which is a key aspect of data governance. The people who built the two reports above probably had excellent reasons for making the choices they did, which are appropriate to what they need to do their respective jobs. However, they may never have considered that others need to organize the same data in different ways to perform other functions, and they almost certainly are not familiar with all the nuances of those other variants. Dealing with exactly this type of data misunderstanding is one of the primary goals of a data analytics initiative. A well-built analytics platform will categorize the underlying data with each of the necessary labels based on a clear set of business rules (e.g., calendar vs academic vs fiscal year, tuition vs fees, etc.). Not only does this allow for greater consistency in data reporting and for straightforward comparisons between different organizational schemes, but the process of creating the labels is often an important catalyst for broader data governance conversations between units about why specific rules are applied, when to use different variants, etc.
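As a minimal sketch of what such rule-based labeling might look like in practice, consider the Python below. The fiscal-year start date, academic-year boundary, and charge-code scheme are all invented assumptions for illustration; real rules would come out of the institution’s own data governance process.

```python
from datetime import date

# Hypothetical business rules for labeling a single tuition transaction.
# Assumed conventions: fiscal year starts July 1; academic year starts
# August 15; charge codes beginning with "FEE" denote fees. Real rules
# would come from the institution's data governance process.

FISCAL_YEAR_START_MONTH = 7      # July 1 (assumption)
ACADEMIC_YEAR_START = (8, 15)    # August 15 (assumption)

def fiscal_year(d: date) -> int:
    """Label the fiscal year by the calendar year in which it ends."""
    return d.year + 1 if d.month >= FISCAL_YEAR_START_MONTH else d.year

def academic_year(d: date) -> str:
    """Label the academic year as, e.g., '2023-24'."""
    start = d.year if (d.month, d.day) >= ACADEMIC_YEAR_START else d.year - 1
    return f"{start}-{str(start + 1)[-2:]}"

def label_transaction(txn: dict) -> dict:
    """Attach every year and category label so any report can slice consistently."""
    d = txn["date"]
    return {
        **txn,
        "calendar_year": d.year,
        "fiscal_year": fiscal_year(d),
        "academic_year": academic_year(d),
        "is_fee": txn["charge_code"].startswith("FEE"),
    }

# One assessed-tuition record receives every label at once:
# calendar year 2024, fiscal year 2025, academic year 2023-24.
print(label_transaction(
    {"date": date(2024, 7, 10), "charge_code": "TUI-GRAD", "amount": 1500.00}
))
```

The specific rules matter less than the pattern: every record carries all of the labels at once, so the CFO’s two mismatched reports become nothing more than two different filters over the same consistently labeled data.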
Data Quality Issue #2: The data are inaccessible
Sometimes institutional data seem “bad” because they appear not to exist. Or, if they do exist, the only way to access them is to be fluent in a programming language that went obsolete during the Eisenhower administration. Of course, somewhere in the basement of the IT building sits a wonderful programmer who still knows this language, but when they retire the whole institution may cease to function. It is certainly a bad situation when decision-makers lack access to critical data. However, data inaccessibility also creates a perception of bad data: most users do not understand the nuances of how the data were created or pulled, nor do they have the ability to adjust the way the data are provided to them. Furthermore, the few people who can directly interact with the source data have often learned to apply bespoke operations that account for the data’s flaws and make them seem acceptable to other users. Because these operations are not transparent, they may lead to misunderstandings and miscommunications about the data.
In many ways, the data inaccessibility problem is a hybrid of the misclassification issue outlined above and the incorrect source system data described below. When users do not understand the details of the source data, they cannot organize them in a way that suits their particular needs. Similarly, the lack of transparency may lead to missing or incorrect data being entered into the source system. Including these data in a data analytics platform will not only make these critical but hidden data accessible to more users across the institution, but also enable the data governance process to bring greater clarity to their definition and function.
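As a hedged sketch of what making those bespoke operations transparent could look like: the record layout, the placeholder ID, and the historical backstory below are all hypothetical, but the pattern of writing the hidden adjustment down as a named, reviewable pipeline step is the point.

```python
# A bespoke cleanup rule that previously lived only in one programmer's head,
# written down as an explicit, reviewable pipeline step. The field names,
# placeholder ID, and historical rationale are invented for illustration.

def exclude_placeholder_students(records: list[dict]) -> list[dict]:
    """Drop records keyed to the legacy placeholder ID '000000000'.

    Assumed backstory: the legacy system required an ID at entry time, so
    staff used all zeros as a stand-in. Extracts used to remove these rows
    silently; documenting the rule makes the adjustment visible to every
    data consumer and reviewable by data governance.
    """
    return [r for r in records if r["student_id"] != "000000000"]

raw = [
    {"student_id": "104522317", "credits": 12},
    {"student_id": "000000000", "credits": 3},  # legacy placeholder row
]
print(exclude_placeholder_students(raw))
```

Once the rule is written down, it can be questioned, corrected, or retired through data governance rather than living solely in the basement programmer’s head.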
Data Quality Issue #3: The source system data are incorrect
The simplest cause of bad data is that the data in the source system really are just wrong. While all large organizations will have some incorrect source data, I have generally found that this is the least common of the major causes behind the bad data problem. Some examples might be that student activity attendance is collected sporadically, or that class instructor records in the Student Information System (SIS) do not accurately reflect faculty teaching assignments. These cases exemplify the two most common root causes of incorrect source system data. In the student activity case, the data may not be entered correctly (or not entered at all) because the data have not historically been used for any significant purpose, so there was no incentive for people to keep accurate records. In the case of instructor assignments, I have seen many examples where the data were intentionally mis-entered as a workaround to some limitation of the system: perhaps the system does not have appropriate functionality to record teaching assistants or multiple instructors. Because these data have not typically been used for any other purpose, no one ever made it a priority to fix the underlying system to enable more appropriate data entry.
The key aspect of both these common causes of incorrect data is that the data are poor because they are not used for decision-making or other critical functions. If data have not historically been used, then staff may not understand the importance of entering quality data or prioritize it in their day-to-day work. These types of incorrect data are rarely seen in systems that are critical to the functioning of the institution, such as student transcripts or payroll records. Thus, the only way to acquire cleaner data is to ensure that the data are used regularly and visibly at all levels of institutional decision-making. Of course, the best way to make that happen is through an institutional analytics initiative that provides actionable data to users throughout the enterprise.
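To make that visibility concrete, here is a minimal sketch of a data-quality check that surfaces under-reported attendance rather than quietly working around it. The section fields and the 50% coverage threshold are illustrative assumptions, not a standard.

```python
# A minimal data-quality check that surfaces under-reported attendance
# instead of papering over it. Field names and the 50% coverage threshold
# are illustrative assumptions.

def attendance_completeness(sections: list[dict]) -> list[dict]:
    """Flag course sections whose attendance entry looks incomplete.

    Rough expectation (assumed): a section meeting N times should have
    about N attendance rows per enrolled student.
    """
    flagged = []
    for s in sections:
        expected = s["meetings"] * s["enrolled"]
        coverage = s["attendance_rows"] / expected if expected else 0.0
        if coverage < 0.5:  # assumed threshold for "looks under-reported"
            flagged.append({**s, "coverage": round(coverage, 2)})
    return flagged

sections = [
    {"section": "BIO101-01", "meetings": 15, "enrolled": 30, "attendance_rows": 420},
    {"section": "HIS210-02", "meetings": 15, "enrolled": 25, "attendance_rows": 40},
]
print(attendance_completeness(sections))  # only HIS210-02 is flagged
```

Publishing a list like this on a dashboard that leadership actually reviews gives front-line staff a tangible reason to enter the data carefully, which is exactly the feedback loop an analytics initiative creates.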
The worse your data, the more reason to get started
Institutions that want to improve their data-informed decision-making are absolutely right to worry about the quality of their data. However, what many fail to realize is that improving data integrity is not a prerequisite for implementing an enterprise analytics solution, but a key outcome of the process. Robust data governance and reliable data reporting require a cycle of continuous improvement, and the best way to get started is to finally get started. NYIT's approach to data governance provides a great example to emulate.