Intro to LLM Fundamentals: What are LLMs and How do they come to be?
- Serrano, S., Brumbaugh, Z., & Smith, N. A. (2023). Language Models: A Guide for the Perplexed. http://arxiv.org/abs/2311.17301
This paper provides a comprehensive, non-technical introduction to language models (LMs), including large language models (LLMs). It offers a high-level overview of how these models are trained, evaluated, and used. I found the sections discussing data particularly insightful, and the discussion of practical applications of LLMs thought-provoking.
Data is crucial for both training and testing. The test set serves as an indicator of the system's output quality and should not be used for any other purpose before the final evaluation. NLP tasks depend heavily on both the quality and quantity of training data, so understanding a model requires knowing what it was trained on. However, companies developing LLMs today are reluctant to share their training datasets, which raises concerns about hidden training data: if a model answers a complex question accurately and clearly, we should be impressed only if we are certain the question and answer were not in the training data. Without access to that data, we cannot verify whether the model is genuinely being tested fairly or simply recalling answers it has already seen.
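To make the contamination concern concrete, here is a minimal sketch of the kind of check one could run if the training corpus were available. It flags a test item whose word n-grams appear verbatim in any training document; the function names and toy data are hypothetical illustrations, not the paper's method.

```python
# Hypothetical test-set contamination check via verbatim n-gram overlap.
# Assumes access to both the training corpus and the evaluation items,
# which is exactly what closed training datasets prevent.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a text (lowercased)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_docs, test_item: str, n: int = 8) -> bool:
    """Flag a test item sharing any verbatim n-gram with a training doc."""
    test_grams = ngrams(test_item, n)
    return any(test_grams & ngrams(doc, n) for doc in train_docs)

# Toy corpus: the first test item is copied from training, the second is not.
train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
seen = "the quick brown fox jumps over the lazy dog near the river bank today"
fresh = "a completely different sentence about language model evaluation and data"

print(is_contaminated(train, seen))   # overlapping 8-grams found
print(is_contaminated(train, fresh))  # no overlap
```

Real contamination audits on web-scale corpora use more scalable machinery (hashing, suffix arrays), but the principle is the same: without the training data, even this basic check is impossible.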
The paper also discusses "hallucination," where the content generated by LLMs is inaccurate or nonfactual. I have noticed this issue in my own use of ChatGPT. The paper suggests that while LLMs rely on training data, they do not access it directly at inference time; instead, they encode patterns from the data rather than "remembering" it verbatim. Thus, for topics with ample supporting data and straightforward tasks, hallucination is less likely; for more complex tasks or less-discussed subjects, it is less surprising. Moreover, since the training data may itself contain incorrect or biased information, the model can encode those inaccuracies as well.
The discussion of how to use LLMs effectively is also worth considering. The paper notes that using LLMs to summarize news articles may not be worthwhile, since the first paragraph of a news article typically summarizes the entire piece. This perspective is particularly interesting today, when there is significant buzz around deploying LLMs; the challenge of how to fully leverage these technologies to solve everyday problems remains an area needing exploration.