How Do Large Language Model Chatbots Simulate Human Reasoning?
Date
2023-12-06
Publisher
Saudi Digital Library
Abstract
Large language models excel at linguistic tasks such as writing articles, summarizing, expanding, and translating text. However, it remains unclear whether these models can reason. Identifying the mistakes large language models make when solving simple reasoning problems, determining whether they can reason, and developing rules and guidelines that enhance their reasoning skills helps individuals utilize these models across various areas of life and enables them to tackle tasks that extend beyond language-related activities. In this research, the GPT-3.5-turbo model was used to generate responses, the mistakes it makes when solving simple reasoning problems were identified, and solutions for rectifying those errors were proposed. Consequently, rules for prompting the model were developed and tested. Based on the proposed rules, three approaches were designed, implemented in Python, and compared with existing techniques such as Chain-of-Thought, Zero-shot-CoT, and few-shot prompting using a linguistic format.
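The record itself contains no code; as a minimal illustration of the kind of format comparison described above, the sketch below poses the same pattern-continuation task to gpt-3.5-turbo in a linguistic and a numerical format via the OpenAI Python client. The example sequence, prompt wording, and temperature setting are illustrative assumptions, not the study's actual test items.

# Minimal sketch: one pattern task posed in a linguistic and a numerical
# format. The sequence and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LINGUISTIC_PROMPT = (
    "Each number in this sequence is double the previous one: "
    "2, 4, 8, 16. What is the next number?"
)
NUMERICAL_PROMPT = "2, 4, 8, 16, ?"  # bare numbers, no explanation

def ask(prompt: str) -> str:
    """Send one user message to gpt-3.5-turbo and return its reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep runs comparable across formats
    )
    return response.choices[0].message.content

for name, prompt in [("linguistic", LINGUISTIC_PROMPT),
                     ("numerical", NUMERICAL_PROMPT)]:
    print(f"{name}: {ask(prompt)}")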
Inferential statistics, such as paired t-tests and analysis of variance (single-factor ANOVA and two-way ANOVA), were used to analyze the results. The comparison yields recommendations for the best methods of prompting the model, which significantly enhance its reasoning skills. The results show that, in in-context learning, the large language model is better at finding a pattern when the provided examples are presented in a linguistic format (mean accuracy = 0.525) than when they are presented in a numerical format (mean accuracy = 0.175; p-value = 0.006 < 0.05). However, supplementing the numerical format with explanations raises the model's pattern-finding performance (mean accuracy increases from 0.175 to 0.663; p-value = 0.0003 < 0.05), which indicates the importance of including instructions in the prompt. The results also show that the best performance comes from the "programmatic approach," whose accuracy is close to 1; the worst performance comes from the "few-shot using a numerical format without explanations" and "Zero-shot-CoT" techniques; and mid-level performance comes from "Chain-of-Thought," "few-shot using a linguistic format," and "few-shot using a numerical format with explanations." One interesting finding is that even this mid-level performance remains inferior to the programmatic approach, which, as directly computed output, represents the ideal.
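As a minimal illustration of the statistical tests named above, the sketch below runs a paired t-test and a single-factor ANOVA with SciPy; the accuracy arrays are placeholder values, not the study's measurements.

# Sketch of the reported tests using SciPy; the per-run accuracy values
# below are placeholders, not the study's data.
import numpy as np
from scipy import stats

# Accuracy of the same model on matched runs under two prompt formats.
linguistic = np.array([0.60, 0.50, 0.45, 0.55])
numerical = np.array([0.20, 0.15, 0.20, 0.15])

# Paired t-test: matched observations, so ttest_rel rather than ttest_ind.
t_stat, p_paired = stats.ttest_rel(linguistic, numerical)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_paired:.4f}")

# Single-factor ANOVA comparing three prompting approaches at once.
numerical_explained = np.array([0.70, 0.65, 0.60, 0.70])
f_stat, p_anova = stats.f_oneway(linguistic, numerical, numerical_explained)
print(f"one-way ANOVA: F = {f_stat:.3f}, p = {p_anova:.4f}")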
Keywords
Chain-of-Thought prompt engineering, Large language models, Reasoning, Human reasoning