A year ago, there was a lot of AI hype and less substance as vendors and institutions were developing their strategies and policies to adapt to this new normal. This year, there is a much firmer grasp of the technology, its use cases and limitations. As a technologist and practitioner in higher education, I find this an exciting time to dive deep into the topic of AI in Analytics.
The impact of AI on Higher Ed
Today’s new AI tools are driving a shift that will dramatically change the way we analyze data, who can perform analysis and the role of Institutional Research (IR) and Information Technology (IT) in providing these capabilities. The promise of AI in analytics is to allow the casual user (think program chairs, enrollment managers, academic advisors, etc.) to interact with and interrogate the data by simply asking questions in plain English, as if they were having a conversation with a data analyst. AI tools will answer questions, create dashboards and generate reports instantaneously. These tasks currently fall to IR and IT analysts who have the technical expertise and domain knowledge to execute these queries. In doing so, new AI tools could reduce the backlog of requests for descriptive statistics and diagnostic analyses, because users with the right security permissions would be able to answer those questions themselves.
The promise and limitations of AI applications in Higher Ed
AI and Large Language Models (LLMs) have captured our imagination. These models can generate text, create images, produce videos, summarize conference calls and even create podcasts on the fly with some prompting. Given all of this, it’s reasonable to assume that AI will make quick work of some data tables and quickly become our own personal data assistant, but this is not the case. LLMs struggle with calculations that require deep domain knowledge or reasoning over complex information. For example, an LLM might be able to calculate the average age of students in a database, but it would likely struggle to correctly calculate a more complex metric such as the retention rate for a given student cohort, which depends on how the cohort is defined and which later term counts as a return. The notion that we can just point an LLM at raw data in a database and expect accurate results is a false narrative. In fact, foundational data engineering, data modeling and data governance efforts are more important than ever when it comes to leveraging AI in Analytics... but there is a missing link.
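To make the contrast between the two metrics concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the toy records, the field names (first_term, first_time), the term codes, and the cohort rule itself (first-time students who started in the cohort term and are enrolled again in the return term). Your institution's definition may well differ, which is exactly the kind of knowledge an LLM cannot infer from raw tables.

```python
from statistics import mean

# Hypothetical toy data; field names and term codes are illustrative only.
students = [
    {"id": 1, "age": 18, "first_term": "2023FA", "first_time": True},
    {"id": 2, "age": 19, "first_term": "2023FA", "first_time": True},
    {"id": 3, "age": 24, "first_term": "2023FA", "first_time": False},  # transfer student
    {"id": 4, "age": 18, "first_term": "2023FA", "first_time": True},
    {"id": 5, "age": 20, "first_term": "2022FA", "first_time": True},   # earlier cohort
]

# Set of (student id, term) pairs in which the student was enrolled.
enrollments = {
    (1, "2023FA"), (1, "2024FA"),
    (2, "2023FA"),
    (3, "2023FA"), (3, "2024FA"),
    (4, "2023FA"), (4, "2024FA"),
    (5, "2022FA"), (5, "2023FA"), (5, "2024FA"),
}


def average_age(students):
    """Simple metric: one column, no business rules."""
    return mean(s["age"] for s in students)


def retention_rate(students, enrollments, cohort_term, return_term):
    """Share of the cohort that is enrolled again in return_term.

    Assumed cohort rule: first-time students whose first term is cohort_term.
    """
    cohort = [
        s["id"] for s in students
        if s["first_term"] == cohort_term and s["first_time"]
    ]
    if not cohort:
        raise ValueError(f"no students in cohort {cohort_term}")
    retained = [sid for sid in cohort if (sid, return_term) in enrollments]
    return len(retained) / len(cohort)


print(average_age(students))
print(retention_rate(students, enrollments, "2023FA", "2024FA"))
```

The average needs nothing beyond the column itself. The retention rate only comes out right once someone has decided who belongs in the cohort (here the transfer student and the earlier cohort are excluded) and which term counts as the return term.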
The best way to think about the limitations of LLMs in building queries is to think about the human processes that the model must account for when we ask a question. Let’s use a seemingly simple example to illustrate the point. Suppose we ask the AI, ‘How many students were enrolled in Fall 2024?’ On the surface, this should be an easy question to answer. As a practitioner, I know that getting an accurate result means settling a number of considerations first:
- Should I pull headcount, FTE or the number of course registrations?
- Should I use Census data or live data?
- Do I exclude non-degree seeking students from my query?
In the same vein, an AI tool has to make these same determinations before it returns an answer to the user's prompt. In our testing, we have observed that the AI takes its best guess, which leads to inaccurate and inconsistent results. This is enough to give any IR Director (and others who are invested in an institution's data governance practices) heart palpitations, but all is not lost.
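The considerations listed above can be sketched in a few lines of Python to show how far apart the "right" answers can be. The rows, the field names (in_census, degree_seeking) and the FTE divisor of 15 credits are all assumptions made up for this illustration, not a standard definition.

```python
# Hypothetical Fall 2024 course registrations; all values are made up.
registrations = [
    {"student_id": "A", "course": "ENG101", "credits": 3, "degree_seeking": True,  "in_census": True},
    {"student_id": "A", "course": "MAT120", "credits": 3, "degree_seeking": True,  "in_census": True},
    {"student_id": "A", "course": "BIO110", "credits": 3, "degree_seeking": True,  "in_census": True},
    {"student_id": "B", "course": "ENG101", "credits": 3, "degree_seeking": True,  "in_census": True},
    {"student_id": "B", "course": "HIS200", "credits": 3, "degree_seeking": True,  "in_census": False},  # added after census
    {"student_id": "C", "course": "ART105", "credits": 3, "degree_seeking": False, "in_census": True},   # non-degree seeking
    {"student_id": "D", "course": "MAT120", "credits": 3, "degree_seeking": True,  "in_census": False},  # late registrant
]


def enrollment(rows, measure="headcount", census_only=True,
               degree_seeking_only=True, fte_credits=15):
    """Answer 'how many students were enrolled?' under explicit choices.

    fte_credits is an assumed full-time credit load used to compute FTE.
    """
    selected = [
        r for r in rows
        if (r["in_census"] or not census_only)
        and (r["degree_seeking"] or not degree_seeking_only)
    ]
    if measure == "headcount":
        return len({r["student_id"] for r in selected})
    if measure == "registrations":
        return len(selected)
    if measure == "fte":
        return sum(r["credits"] for r in selected) / fte_credits
    raise ValueError(f"unknown measure: {measure}")


# The same question, answered four defensible ways.
print(enrollment(registrations))
print(enrollment(registrations, census_only=False, degree_seeking_only=False))
print(enrollment(registrations, measure="registrations", census_only=False, degree_seeking_only=False))
print(enrollment(registrations, measure="fte"))
```

Each call is a reasonable reading of the prompt, yet they return different numbers. An AI tool that guesses among them without guidance will be inconsistent from one question to the next.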
The Importance of the Semantic Layer
Now let me introduce the semantic layer, a new component of our data analytics platform that is the key to unlocking the value of AI in Analytics. Before we get into the specifics of this concept, I want to reiterate the importance of building a strong data foundation. Without investments in data engineering, data modeling and data governance, your AI efforts will be for naught.
A semantic layer is like a translator between data and business language. It is the instruction manual that helps the AI understand how to interact with your data. The semantic layer helps the AI tool understand the content, table definitions and grain of the data set. It provides sample queries, metric definitions and calculations, and translates source system jargon into meaningful information. Using our example above, the semantic layer would tell the AI tool how to interpret the user prompt and correctly pull the number of students who were enrolled in Fall 2024. For instance, it may inform the AI that, unless otherwise specified, the query should only pull degree-seeking students and should use Census data for historical terms and live data for the current term. The sample queries would help the AI understand that in this context, the user prompt is asking for a distinct count of students (headcount) enrolled in the Fall semester. These considerations may seem intuitive to an IR / IT analyst who has been working with this student data for years, but we need to tell the AI how to interpret the data structures and give it additional context for it to produce accurate and consistent results.
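As a rough illustration of the idea, the defaults described above could be written down as data and resolved into an explicit query specification before anything touches the database. This is a minimal sketch, not how any particular product implements a semantic layer: the structure of SEMANTIC_LAYER, the resolve function, the term codes and the assumed current term are all hypothetical.

```python
# Hypothetical semantic-layer fragment; every name here is illustrative.
SEMANTIC_LAYER = {
    "current_term": "2025SP",  # assumed for this example
    "snapshot_rule": {"historical": "census", "current": "live"},
    "metrics": {
        "enrollment": {
            "description": "Distinct headcount of students enrolled in a term.",
            "calculation": "COUNT(DISTINCT student_id)",
            "synonyms": ["enrolled", "headcount", "how many students"],
            "defaults": {"degree_seeking_only": True},
        },
    },
}


def resolve(metric_name, term, overrides=None, layer=SEMANTIC_LAYER):
    """Turn a loosely worded request into an explicit, governed query spec."""
    metric = layer["metrics"][metric_name]
    # Census snapshot for historical terms, live data for the current term.
    period = "current" if term == layer["current_term"] else "historical"
    spec = {
        "metric": metric_name,
        "calculation": metric["calculation"],
        "term": term,
        "snapshot": layer["snapshot_rule"][period],
        **metric["defaults"],
    }
    # Anything the user states explicitly wins over the defaults.
    spec.update(overrides or {})
    return spec


# 'How many students were enrolled in Fall 2024?'
print(resolve("enrollment", "2024FA"))
# Same question, but the user asked to include non-degree seeking students.
print(resolve("enrollment", "2024FA", {"degree_seeking_only": False}))
```

The point of the design is that the AI tool no longer has to guess: the choices an experienced analyst would make silently are spelled out once, and every question about enrollment is answered under the same rules unless the user says otherwise.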
The concept of the semantic layer has long existed as a way to provide users with the context they need to use data correctly and in accordance with data governance guidelines. The semantic layer is now taking on greater importance as the mechanism to provide this critical information to AI tools. Additionally, for AI tools to provide accurate results, the semantic layer must be more complete and robust than ever in order to capture the knowledge that trained human users have implicitly carried around in their heads, and make it explicit in a format that the AI tools can access. The semantic layer is thus the key to unlocking the potential of AI for analytics, and truly fulfilling the promise of data democratization to non-technical users across campus. As an added benefit, these improved AI tools will free IR and IT analysts from routine report building, allowing them to work on higher order analyses and modeling that are more deserving of their many talents. The juice is worth the squeeze, but getting to this point will be harder than it looks.