Exploring AI’s Effectiveness in History Education: A Look at Artificial Intelligence in the Classroom

Recent research evaluated the historical knowledge of AI models like GPT-4 and Llama, revealing they answered only 33-46% of questions correctly, particularly struggling with recent history and certain regions. Conducted by Jakob Hauser’s team, the study utilized a comprehensive database to assess doctoral-level historical expertise. Despite advancements in AI, results showed significant deficiencies in nuanced historical understanding, especially regarding complex recent events, highlighting the need for improved training data and potential remedial education for AI systems.

Exploring AI’s Historical Acumen

The historical understanding of artificial intelligence (AI) has been put to the test, revealing intriguing insights into how well these systems grasp global history. To evaluate this skill set, researchers employed a rigorous doctoral-level examination covering not only historical facts but also the ability to interpret them and to navigate contradictions. Findings indicated that AI models such as GPT-4, Gemini, and Llama answered only 33 to 46 percent of the questions correctly, falling short of expert-level knowledge. Notably, the models struggled particularly with recent history and with regions such as Africa and Oceania.

Unveiling a New Benchmark in Historical Knowledge

In the past two years, AI technology has advanced at an astonishing pace, with these systems beginning to rival or outperform humans in various intellectual domains. They have passed versions of the Turing test, aided daily activities, and even showcased creative abilities. However, challenges persist: AI systems can generate inaccurate information, present fabrications convincingly, and often lack genuine comprehension of complex topics. Their responses can also be biased on ethical or politically sensitive matters, yet a growing number of people nevertheless rely on AI-generated information.

To further understand the historical knowledge of large language models (LLMs), a team led by Jakob Hauser at the Complexity Science Hub in Vienna has conducted research using an advanced benchmark. This evaluation goes beyond mere general knowledge and focuses specifically on assessing historical expertise at the doctoral level. Their extensive database encompasses knowledge of 600 societies globally, with over 36,000 data points drawn from more than 2,700 scientific publications.

According to the researchers, “Our database includes everything from fundamental facts to intricate subjects like specific religious and ideological systems. For these complex topics, it’s essential to account for various interpretations, subtleties, and historical contexts.” The data spans a timeline from 10,000 years ago to contemporary times, representing cultures from all corners of the globe.

The study compared seven AI models, including OpenAI’s GPT-3.5, GPT-4-Turbo, and GPT-4o, as well as variants of Llama and Google’s Gemini 1.5 Flash. Hauser’s team crafted multiple-choice questions, each with four answer options. To enhance clarity, each AI was first presented with four example tasks. “We also utilized personalization techniques, asking the LLMs to respond as if they were historians,” the researchers noted; such persona prompting often improves performance.
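The setup described above, a persona instruction followed by a few worked examples and then the target question, can be sketched as follows. This is a minimal illustration: the prompt wording, the example question, and the helper name `build_prompt` are assumptions for demonstration, not taken from the study.

```python
# Sketch of a four-option multiple-choice prompt with persona framing and
# few-shot examples, as described in the article. All text is illustrative.

def build_prompt(examples, question, options):
    """Assemble a persona-framed, few-shot multiple-choice prompt."""
    lines = ["You are an expert historian. Answer each question with A, B, C, or D."]
    # Few-shot block: each worked example includes its correct answer.
    for ex in examples:
        lines.append(f"Q: {ex['question']}")
        for letter, opt in zip("ABCD", ex["options"]):
            lines.append(f"{letter}) {opt}")
        lines.append(f"Answer: {ex['answer']}")
    # Target question: left open for the model to complete.
    lines.append(f"Q: {question}")
    for letter, opt in zip("ABCD", options):
        lines.append(f"{letter}) {opt}")
    lines.append("Answer:")
    return "\n".join(lines)

example = {
    "question": "Which empire built Machu Picchu?",
    "options": ["Aztec", "Inca", "Maya", "Olmec"],
    "answer": "B",
}
prompt = build_prompt([example], "Placeholder question?", ["a", "b", "c", "d"])
```

In a real evaluation, `prompt` would be sent to each model and the first letter of the reply compared against the gold answer; the study used four such worked examples rather than one.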

The AI models were tasked not only with selecting the correct answer but also with indicating whether their responses were grounded in solid evidence or based on hypotheses and potentially conflicting interpretations. “Our goal was to establish a benchmark to evaluate how adept these LLMs are in managing historical knowledge,” Hauser explained.

Co-author Peter Turchin remarked, “It was astonishing to see how poorly these models performed.” Results ranged from around 33 percent accuracy with Llama-3.1-8B to 46 percent with GPT-4-Turbo. “While the large language models outperformed random guessing, their accuracy still falls far short of expert historical knowledge,” the researchers concluded.
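The comparison to random guessing can be made concrete: with four options per question, chance accuracy is 25 percent, so the reported 33 to 46 percent sits above chance but far below expert performance. A minimal sketch, using invented toy predictions rather than study data:

```python
# Illustrative only: toy predictions, not data from the study.

def accuracy(predictions, gold_answers):
    """Fraction of questions answered correctly."""
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

CHANCE_BASELINE = 1 / 4  # four answer options per question

# A hypothetical run: 3 of 8 correct (37.5%), in the study's reported range.
preds = ["A", "B", "C", "D", "A", "B", "C", "D"]
gold  = ["A", "B", "C", "A", "B", "C", "D", "A"]
score = accuracy(preds, gold)
```

Here `score` is 0.375: above the 0.25 chance baseline, which is the sense in which the models "outperformed random guessing" while still falling well short of expert knowledge.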

Particularly concerning were the models’ deficiencies in understanding recent history post-1500, with no model exceeding 40 percent accuracy in this area. “This indicates that while LLMs can manage basic knowledge about earlier periods effectively, they struggle significantly with the increased complexity of recent events,” Hauser and his team explained. In particular, modern history demands an understanding of conflicting trends, overarching developments, and intricate relationships.

Senior author R. Maria del Rio-Chanona from University College London emphasized, “The most significant takeaway from this study is that large language models, despite their impressive capabilities, still lack the deep understanding required for advanced historical research at the doctoral level. They are excellent for conveying fundamental knowledge, but not yet equipped for nuanced historical analysis.”

The research team identified the complexity of historical contexts as a major challenge, often necessitating an understanding of the underlying social, economic, and ideological factors. “History is frequently viewed as a collection of facts, yet meaningful interpretation is sometimes essential for comprehension,” del Rio-Chanona added.

There is potential for improvement among GPT and other models. Many of the observed knowledge gaps stem from the training data utilized by current AI systems, which generally perform better regarding the histories of North and Central America compared to those of Southern Africa or Oceania. This disparity highlights the training data’s bias, which predominantly reflects information from Europe and North America.

However, focused “remedial education” for AI systems could pave the way for enhancements. “Our publicly accessible dataset could play a crucial role in advancing the historical knowledge of LLMs,” Hauser and his colleagues stated. “We also plan to test newer LLM models, such as GPT-4o3, to determine if they can rectify the weaknesses identified in this study.” (NeurIPS Conference, 2024; Preprint)
