How Do Large Language Model Chatbots Simulate Human Reasoning?
Date
2023-12-06
Publisher
Saudi Digital Library
Abstract
Large language models excel at linguistic tasks such as writing articles, summarizing, expanding, and translating text. However, it remains unclear whether these models can reason. Identifying the mistakes large language models make when solving simple reasoning problems, determining whether they can reason, and developing rules and guidelines that enhance their reasoning skills helps individuals utilize these models across various areas of life and enables them to tackle tasks that extend beyond language-related activities. In this research, the GPT-3.5-turbo model was used to generate responses, the mistakes it makes when solving simple reasoning problems were identified, and solutions for rectifying those errors were proposed. Consequently, rules for prompting the model were developed and tested. Based on the proposed rules, three approaches were designed, implemented in Python, and compared with existing techniques such as Chain-of-Thought, Zero-shot-CoT, and few-shot prompting using a linguistic format.
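The record itself contains no code; as a minimal illustration of the kind of format comparison described above, the sketch below poses the same pattern-continuation task to gpt-3.5-turbo in a linguistic and a numerical format via the OpenAI Python client. The example sequence, prompt wording, and temperature setting are illustrative assumptions, not the study's actual test items.

# Minimal sketch: one pattern task posed in a linguistic and a numerical
# format. The sequence and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LINGUISTIC_PROMPT = (
    "Each number in this sequence is double the previous one: "
    "2, 4, 8, 16. What is the next number?"
)
NUMERICAL_PROMPT = "2, 4, 8, 16, ?"  # bare numbers, no explanation

def ask(prompt: str) -> str:
    """Send one user message to gpt-3.5-turbo and return its reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep runs comparable across formats
    )
    return response.choices[0].message.content

for name, prompt in [("linguistic", LINGUISTIC_PROMPT),
                     ("numerical", NUMERICAL_PROMPT)]:
    print(f"{name}: {ask(prompt)}")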
Inferential statistics, such as paired t-tests and analysis of variance (single-factor ANOVA and two-way ANOVA), were used to analyze the results. The comparison yields recommendations for the best methods of prompting the model, which significantly enhance its reasoning skills. The results show that, in in-context learning, the large language model is better at finding a pattern when the provided examples are presented in a linguistic format (mean accuracy = 0.525) than when they are presented in a numerical format (mean accuracy = 0.175; p-value = 0.006 < 0.05). However, supplementing the numerical format with explanations raises the model's pattern-finding performance (mean accuracy increases from 0.175 to 0.663; p-value = 0.0003 < 0.05), which indicates the importance of including instructions in the prompt. The results also show that the best performance comes from the "programmatic approach," whose accuracy is close to 1; the worst performance comes from the "few-shot using a numerical format without explanations" and "Zero-shot-CoT" techniques; and mid-level performance comes from "Chain-of-Thought," "few-shot using a linguistic format," and "few-shot using a numerical format with explanations." One interesting finding is that even this mid-level performance remains inferior to the programmatic approach, which, as directly computed output, represents the ideal.
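As a minimal illustration of the statistical tests named above, the sketch below runs a paired t-test and a single-factor ANOVA with SciPy; the accuracy arrays are placeholder values, not the study's measurements.

# Sketch of the reported tests using SciPy; the per-run accuracy values
# below are placeholders, not the study's data.
import numpy as np
from scipy import stats

# Accuracy of the same model on matched runs under two prompt formats.
linguistic = np.array([0.60, 0.50, 0.45, 0.55])
numerical = np.array([0.20, 0.15, 0.20, 0.15])

# Paired t-test: matched observations, so ttest_rel rather than ttest_ind.
t_stat, p_paired = stats.ttest_rel(linguistic, numerical)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_paired:.4f}")

# Single-factor ANOVA comparing three prompting approaches at once.
numerical_explained = np.array([0.70, 0.65, 0.60, 0.70])
f_stat, p_anova = stats.f_oneway(linguistic, numerical, numerical_explained)
print(f"one-way ANOVA: F = {f_stat:.3f}, p = {p_anova:.4f}")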
Keywords
Chain-of-Thought prompt engineering, Large language models, Reasoning, Human reasoning