How Do Large Language Model Chatbots Simulate Human Reasoning

Altamimi, Aishah Ahmed

dc.contributor.advisor	Baber, Christopher
dc.contributor.author	Altamimi, Aishah Ahmed
dc.date.accessioned	2023-11-20T10:03:53Z
dc.date.available	2023-11-20T10:03:53Z
dc.date.issued	2023-12-06
dc.description.abstract	Large language models excel at linguistic tasks such as writing articles and summarizing, expanding, and translating text. However, it remains unclear whether these models can reason. Identifying the mistakes large language models make when solving simple reasoning problems, determining whether they can reason, and developing rules and guidelines that improve their reasoning helps people use these models across many areas of life and enables the models to tackle tasks beyond language-related activities. In this research, the GPT-3.5-turbo model was used to generate responses, to identify the mistakes the model makes when solving simple reasoning problems, and to propose ways of correcting those errors. Rules for prompting the model were then developed and tested. Based on these rules, three approaches were proposed, implemented in Python, and compared with other approaches, such as Chain-of-Thought, Zero-shot-CoT, and few-shot using a linguistic format. The results were analyzed with inferential statistics, namely paired t-tests and analysis of variance (single-factor ANOVA and two-way ANOVA). The comparison yields recommendations for the best methods of prompting the model, which significantly enhance the model’s reasoning skills. The results show that, in in-context learning, the large language model is better at finding the pattern when the provided examples are presented in a linguistic format (mean accuracy = 0.525) than when they are presented in a numerical format (mean accuracy = 0.175; p-value = 0.006 < 0.05). However, supporting the numerical format with explanations improves the model’s performance in finding the pattern (mean accuracy rises from 0.175 to 0.663; p-value = 0.0003 < 0.05), which indicates the importance of including instructions in the prompt.
The results also show that the best performance comes from the “programmatic approach,” whose accuracy is close to 1; the worst performance comes from the “few-shot using numerical format without explanations” and “zero-shot-CoT” techniques; and mid-level performance comes from “Chain-of-Thought,” “few-shot using the linguistic format,” and “few-shot using the numerical format with explanations.” One interesting finding is that even the mid-level techniques remain inferior to the programmatic approach, which is the ideal computer output.
dc.format.extent	89
dc.identifier.citation	IEEE
dc.identifier.uri	https://hdl.handle.net/20.500.14154/69741
dc.language.iso	en
dc.publisher	Saudi Digital Library
dc.subject	Chain-of-Thought prompt engineering
dc.subject	Large language models
dc.subject	Reasoning
dc.subject	Human reasoning
dc.title	How Do Large Language Model Chatbots Simulate Human Reasoning
dc.type	Thesis
sdl.degree.department	Computer Science
sdl.degree.discipline	NLP and AI
sdl.degree.grantor	The University of Birmingham
sdl.degree.name	Masters in Human-Computer Interaction

Collections

SACM - United Kingdom
