By Mathilda Dougherty

It is easy to get the impression that AI chatbots like Gemini or ChatGPT respond to questions by searching for the best available information, weighing it, summarizing it, and presenting it to you. If this were true, it would make sense to ask them to do things like evaluate a building project, create a list of readings tailored to a specific class, or opine on a dispute between two people. This impression is mistaken, however. LLMs process language, but they do not understand it. Knowing more about this process and its limitations can shape your evaluations of chatbot outputs as well as your decisions about whether and how to use them in your classes.
How LLMs “think” and respond to prompts
GenAI-based chatbots respond to prompts by using Large Language Models (LLMs). LLMs model language mathematically and produce new language based on those models (here is a fuller introduction to how this works). They break text down into chunks called “tokens,” usually a word or part of a word, and assign mathematical weights to each token. These weights allow them to predict which tokens are likely responses to a prompt.
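To make “likely next token” concrete, here is a minimal, purely illustrative sketch in Python. It builds a bigram model: it counts which word follows which in a tiny made-up corpus, then continues a prompt by sampling from those counts. Real LLMs use sub-word tokens and billions of learned parameters rather than raw counts, but the core move is the same: continue with whatever the model scores as statistically likely.

```python
import random
from collections import Counter, defaultdict

# A tiny made-up corpus standing in for training data. Real LLMs train on
# vast text collections and operate on sub-word tokens, not whole words.
corpus = (
    "the cat sat on the mat . "
    "the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# Count how often each token follows each other token (a bigram model,
# a drastically simplified stand-in for an LLM's learned weights).
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def next_token(token: str) -> str:
    """Pick the next token in proportion to how often it followed
    `token` in the corpus: likelihood, not understanding."""
    counts = following[token]
    candidates = list(counts)
    weights = [counts[c] for c in candidates]
    return random.choices(candidates, weights=weights)[0]

# Generate a short continuation from a one-word "prompt".
token = "the"
output = [token]
for _ in range(6):
    token = next_token(token)
    output.append(token)

print(" ".join(output))  # e.g. "the cat sat on the rug ."
```

Nothing in this sketch knows what a cat or a rug is; it only knows which words have appeared next to which. The same is true, at vastly greater scale, of the systems discussed below.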
This likelihood-based prediction allows these systems to do three things very well:
- Reproduce patterns of language found in their training data. For example, an LLM can produce text resembling a cover letter, an Instagram post about a vacation, and Star Trek fanfiction equally easily because its training data included examples of all these kinds of writing.
- Find patterns of language in uploaded documents and incorporate them into responses about those documents. For example, if I upload an assignment guide to an LLM and ask it to give me a bulleted list of key details about the assignment, it is likely to accurately reproduce text such as the due date, type of assignment, and marking scheme.
- Find patterns of language in Internet search results and incorporate them into responses. For example, if I ask an LLM to give me directions for how to spackle a wall, it is likely to run a web search and reproduce language found in the top results (in this case, probably a company that makes or sells spackle, or a question-and-answer site like Reddit or Quora).
These abilities easily give the impression that LLMs are thinking and researching in response to prompts. However, all these responses are entirely based on the probability that one string of tokens will be found near another, not on understanding the meanings of the source texts. That includes the reports on “thinking” or “reasoning” that some LLMs produce, which are in fact simply descriptions of what might be happening inside the system. This is because LLMs have no direct experience or knowledge of the world that would be necessary to comprehend any of the words that they generate or how they generate them. To the extent that they “understand” anything, they understand what bits of language are likely to be found near others.
The limits of LLMs’ reliability
This distinction matters because it undermines the reliability of LLMs’ answers to questions in two key ways. First, all their responses, whether accurate or not, are produced in the same way: by arranging likely chunks of language together. Because of this, the AI researcher Johan Fredrikzon characterizes LLMs as “epistemologically indifferent,” that is, unable to produce either truth or falsehood because their answers are based solely on the probability of one word following another. Fredrikzon’s point is not just that LLMs sometimes give incorrect answers. His point is that the way LLMs get things wrong undermines the truth value of everything they say. From the perspective of the LLM, a false statement is the same as a true one: a string of tokens that seems likely in the context of other tokens. Hence, LLMs don’t lie, and they don’t tell the truth: they just produce likely text.
For example, if I told you to use glue to hold your pizza toppings on, you could reasonably assume that I have a larger understanding of what a bad idea this would be and am either lying or joking (please do not put glue on pizza).
If an LLM told you to use glue to hold your pizza toppings on, however, something radically different would be happening. It might do this if it assigned a reasonable probability to the tokens “put glue on your pizza” in the context of your prompt. When Google’s AI summary gave this advice in the spring of 2024, it appears that the system rated the answer as likely because it was trained on text that included a joke about putting glue on pizza. At no point in the process did Google’s AI summary think about that answer in a way grounded in the material reality of what glue on a pizza might do to a human who ate it. This means that it did not lie, mislead, or “hallucinate.” It simply produced likely text, as it always does.
LLMs’ tendency to magnify human biases and blind spots
The fact that LLMs deal with language without understanding it undermines their credibility in another way. The ways we model the world, including through language, are never the same as the world itself. They are always approximations and incomplete descriptions. LLMs make this problem much worse.
Deepak Varuvel Dennison, a PhD student in Information Science at Cornell, recently warned that these systems tend to further marginalize perspectives that are already underrepresented in their training data. He illustrates this idea with the example of an LLM asked to name favourite foods. Imagine that an LLM was trained on a data set of favourite foods where “pizza” appeared 60% of the time, “pasta” 30% of the time, and “biryani” 10% of the time. When asked later to name favourite foods, that LLM would list “biryani” far less than 10% of the time, or possibly not at all. This is because it looks for the most probable answer based on its training data, and “biryani” is, in this system, always an improbable answer.
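Dennison’s arithmetic can be sketched in a few lines of Python. The probabilities and the “temperature” mechanism below are illustrative assumptions rather than a description of any particular chatbot, but they show how both greedy decoding (always picking the single most likely answer) and the sharpened sampling that many systems use in practice push a 10% answer well below 10%, or off the list entirely.

```python
import random
from collections import Counter

# Hypothetical training distribution from Dennison's example.
training = {"pizza": 0.60, "pasta": 0.30, "biryani": 0.10}

def sharpen(probs: dict, temperature: float) -> dict:
    """Rescale probabilities the way low-temperature sampling does.
    Raising each probability to the power 1/temperature and renormalizing
    (equivalent to dividing logits by the temperature before a softmax)
    boosts already-common answers and suppresses already-rare ones."""
    powered = {food: p ** (1 / temperature) for food, p in probs.items()}
    total = sum(powered.values())
    return {food: p / total for food, p in powered.items()}

# Greedy decoding: the single most probable answer wins every time,
# so "biryani" never appears at all.
print("greedy pick:", max(training, key=training.get))

# Sampling at temperature 0.5: "biryani" falls to roughly 2% of answers,
# well below its 10% share of the training data.
sharpened = sharpen(training, temperature=0.5)
draws = Counter(
    random.choices(
        list(sharpened), weights=list(sharpened.values()), k=10_000
    )
)
for food, count in draws.most_common():
    print(f"{food}: {count / 10_000:.1%}")
```

In this toy setup, an answer held by one person in ten is returned either never (greedy decoding) or only about one time in fifty (sharpened sampling), which is the suppression effect Dennison describes.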
Dennison’s point is that what is true of his example favourite-food LLM is true more broadly for any perspectives or ideas underrepresented in an LLM’s training data. The commercial chatbots currently available were trained on data that overrepresented the perspectives of English speakers who had the wealth, leisure, and access to technology to fill the Internet with their thoughts. Hence, they overwhelmingly reflect back these perspectives. Worse, simply expanding the data set will not solve this problem. Some perspectives will always be sufficiently in the minority to vanish from an LLM’s responses in the ways Dennison describes. In this way, using LLMs to gain knowledge will always tend to occlude what was already difficult to find.
This distortion of reality threatens the maintenance of knowledges and perspectives underrepresented in LLMs’ training data by further diverting attention and resources away from them. It also threatens those from overrepresented cultural perspectives, who will lose out on the knowledge and technologies of thought that they might have gained from taking others’ points of view seriously. As the ethicist Shannon Vallor writes in The AI Mirror, all users will have the same set of views endlessly reflected back to them in a way that forecloses the curiosity and creativity that are important goals for university education.
How this matters for the classroom
All of this matters for thinking about whether LLMs have a place in your class, and what that place might be. Below are some examples of potential uses for LLMs that appear in the literature on AI in education, organized by how the LLM is meant to be used.
Here are some quick examples of activities that involve finding and reproducing patterns in text:
| Activity | Description | Considerations |
|---|---|---|
| Identifying genre conventions | When students need to learn to write a specific genre of text, an LLM can rapidly produce many examples of that genre or reproduce similar texts in different genres so students can identify the differences. | Students will still need guidance to identify the key characteristics of a genre so they know what to read for. |
| Correcting errors of grammar or style | An LLM can offer suggestions for revising text that does not conform to the conventions of grammar and style for a specified genre of text. | LLMs are likely to “correct” students toward an average of writing in that genre, and so may eliminate a student’s distinctive voice or perspective. Students are quite likely to accept these suggestions because of automation bias, or the tendency of humans to accept the results of a technological system. |
| Identifying or mapping out the structure of a reading | An LLM can find patterns of text that indicate different stages or sections, and can use this method to produce an outline or other map. | This outline will not always be correct, and there is always the risk that students will substitute reading the outline for reading the text. For this reason, the best use of this functionality is for an instructor to use an LLM to produce a draft outline, correct it, and pair it with reading questions. |
| Locating and extracting specific information in a document | When properly and specifically prompted, an LLM can help retrieve specific kinds of information from a series of documents. | Ask for a specific reference to accompany each piece of information, since each one will need to be checked manually against the source to ensure accuracy. |
| Creating toy examples and practice problems | Because these kinds of texts are highly standardized, LLMs are good at producing many examples quickly and can be used to practice skills. | Be sure to check any toy examples or practice problems for the reinforcement of biases. |
As you can see, each of these cases has its own considerations, but LLMs’ capacities to detect and reproduce patterns in text work with the grain of the activity.
In other cases, however, the epistemic unreliability of LLMs may frustrate the purposes of a learning activity. A good example of that would be activities in which students need to find dependable information. In those cases, the considerations are more serious. Here are some examples of such activities:
| Activity | Description | Considerations |
|---|---|---|
| Building a bibliography | Students might use an LLM as a starting point for gathering sources on a given topic. | LLMs are highly likely to: a) fabricate sources that don’t exist and b) overrepresent sources that are findable with simple Internet searches, regardless of whether they are the most relevant, useful, or current. Hence, the library catalog and Google Scholar are likely to give better results. |
| Researching a new topic | Students might use an LLM as a way into a new topic and a means of grasping the basics of an issue, especially if they are using a tool suited to this purpose such as Gemini’s Guided Learning or Deep Research modes. LLM summaries are usually highly readable and can be refined or expanded with further prompts. | LLM explanations of topics may contain fabrications, cite sources in ways that ignore their context, and overrepresent majority viewpoints. The fluency of the language may conceal these weaknesses, and automation bias may make students less likely to do further research. |
| Comparing multiple viewpoints in a topic | Students might use LLMs to easily present multiple sides of an issue, especially in a tool suited for this purpose such as Gemini’s Deep Research mode. LLM summaries are usually highly readable and can be refined or expanded with further prompts. | LLM explanations of debates may contain fabrications, cite sources in ways that ignore their context, and overrepresent majority viewpoints. The fluency of the language may conceal serious weaknesses, and automation bias may make students less likely to do further research. |
| Practicing debate or other interpersonal skills | LLMs can simulate a human interlocutor to help students sharpen arguments and practice other interpersonal skills relevant to course learning goals. | Because LLMs are designed to provide users with the kind of text they ask for, they have a tendency to give responses that read as overly agreeable or even sycophantic. This tendency should be mitigated with careful prompting. There are also broader concerns about students developing emotional dependence on chatbots that may be exacerbated by using them in this way. Instructors should avoid any situation where interactions with chatbots are a substitute for students’ interactions with each other and with instructors. |
The hype around the capabilities of LLMs (as well as hype around the capabilities of putative future AI systems) can make them seem too complex, powerful, or unprecedented for instructors to be able to make decisions about their use. On the contrary, instructors equipped with a basic understanding of how these systems solve problems can weigh the advantages and disadvantages of their use in specific classroom activities, as I did in the charts above. If you are a TMU contract lecturer or faculty member and you want help making those decisions, please do not hesitate to reach out to the Centre for Excellence in Learning and Teaching (CELT) at askcelt@torontomu.ca.
Mathilda Dougherty is an Educational Developer in the Centre for Excellence in Learning and Teaching at Toronto Metropolitan University. They earned their Ph.D. in religious studies from the University of North Carolina at Chapel Hill and have taught at UNC-Chapel Hill, TMU, University of Toronto – Mississauga, Queen’s University, and Emmanuel College of Victoria University in the University of Toronto. At TMU, they support the Faculty of the Arts, the Excellence in Teaching Program, and the Certificate in Instructional Excellence. You can contact them at mdougherty@torontomu.ca.
