Saudi Cultural Missions Theses & Dissertations

Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10


Search Results

Now showing 1 - 2 of 2
  • Item Restricted
    PyZoBot: A Platform for Conversational Information Extraction and Synthesis from Curated Zotero Reference Libraries through Advanced Retrieval-Augmented Generation.
    (Virginia Commonwealth University, 2025) Alshammari, Suad; Wijesinghe, Dayanjan
    This work presents a systematic evaluation of PyZoBot, an AI-powered platform for literature-based question answering, using the Retrieval-Augmented Generation Assessment Scores (RAGAS) framework. The study focuses on a subset of 49 cardiology-related questions extracted from the BioASQ benchmark dataset. PyZoBot's performance was assessed across 32 configurations, including standard Retrieval-Augmented Generation (RAG) and GraphRAG pipelines, implemented with both OpenAI-based models (GPT-3.5-Turbo, GPT-4o) and open-source models (LLaMA 3.1, Mistral). To establish a comparative benchmark, responses generated by PyZoBot were evaluated alongside answers manually written by six PhD students and recent graduates in the pharmacotherapy field, using a curated Zotero library containing BioASQ-referenced documents. The evaluation applied four key RAGAS metrics (faithfulness, answer relevancy, context recall, and context precision) along with a composite harmonic score to determine overall performance (an illustrative sketch of such a composite appears below). The findings reveal that 22 PyZoBot configurations surpassed the highest-performing human participant, with the top pipeline (GPT-3.5-Turbo with layout-aware chunking, k=10) achieving a harmonic RAGAS score of 0.6944. Statistical analysis using Kruskal-Wallis and Dunn's post hoc tests confirmed significant differences across all metrics, especially in faithfulness and time efficiency. These results validate PyZoBot's ability to support high-quality biomedical information synthesis and demonstrate the system's potential to meet or exceed human performance in complex, evidence-based academic tasks.
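    The composite measure in the abstract above combines the four RAGAS metrics into a single score. A minimal sketch of such a harmonic-mean composite follows, assuming per-configuration metric values are already available; the function name and example values are illustrative and are not taken from PyZoBot's actual code.

    ```python
    # Sketch: combining the four RAGAS metrics into a harmonic-mean composite.
    # The example values below are hypothetical; in practice they would come
    # from a RAGAS evaluation run over the 49 BioASQ cardiology questions.
    from statistics import harmonic_mean

    def composite_ragas_score(scores: dict[str, float]) -> float:
        """Harmonic mean of the four RAGAS metrics.

        The harmonic mean penalizes uneven profiles: a configuration must do
        reasonably well on all four metrics to obtain a high composite score.
        """
        metrics = ["faithfulness", "answer_relevancy",
                   "context_recall", "context_precision"]
        return harmonic_mean([scores[m] for m in metrics])

    # Hypothetical per-configuration scores, for illustration only.
    example = {
        "faithfulness": 0.92,
        "answer_relevancy": 0.81,
        "context_recall": 0.64,
        "context_precision": 0.55,
    }
    print(round(composite_ragas_score(example), 4))
    ```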
  • Item Restricted
    How Do Large Language Model Chatbots Simulate Human Reasoning?
    (Saudi Digital Library, 2023-12-06) Altamimi, Aishah Ahmed; Baber, Christopher
    Large language models excel at linguistic tasks such as writing articles and summarizing, expanding, and translating text. However, it is still unclear whether these models can reason. Identifying the mistakes large language models make when solving simple reasoning problems, establishing whether they can reason, and developing rules and guidelines that enhance their reasoning skills helps people use these models across many areas of life and enables them to tackle tasks that extend beyond language-related activities. In this research, the GPT-3.5-turbo model was used to generate responses, the mistakes it makes when solving simple reasoning problems were identified, and solutions for rectifying these errors were proposed. From these findings, rules for prompting the model were developed and tested. Based on the proposed rules, three approaches were designed, implemented in Python, and compared with existing approaches such as Chain-of-Thought, Zero-shot-CoT, and few-shot prompting in a linguistic format. The results were analyzed with inferential statistics, including paired t-tests and analysis of variance (single-factor and two-way ANOVA). The comparison yields recommendations for the best ways to prompt the model, which significantly enhance its reasoning skills. The results show that, in in-context learning, the model is better at finding the pattern when the provided examples are presented in a linguistic format (mean accuracy = 0.525) than in a numerical format (mean accuracy = 0.175; p = 0.006 < 0.05). However, supplementing the numerical format with explanations raises the model's pattern-finding accuracy from 0.175 to 0.663 (p = 0.0003 < 0.05), which indicates the importance of including instructions in the prompt (an illustrative sketch of these prompt formats appears below). The best performance comes from the "programmatic approach," where accuracy is close to 1; the worst comes from the "few-shot in numerical format without explanations" and "Zero-shot-CoT" techniques; and "Chain-of-Thought," "few-shot in linguistic format," and "few-shot in numerical format with explanations" achieve mid-level performance. One interesting finding is that this mid-level performance still falls short of the programmatic approach, which represents the ideal, program-computed output.
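    The key contrast in the abstract above is between few-shot examples written out linguistically and the same examples given as bare numbers, with added explanations recovering most of the lost accuracy. The sketch below illustrates the three prompt formats; the task, example sequences, and wording are placeholders chosen for illustration, not the prompts used in the thesis.

    ```python
    # Sketch of the three few-shot prompt formats compared in the abstract.
    # Only the linguistic-vs-numerical contrast and the added explanations are
    # taken from the abstract; the concrete task and wording are illustrative.
    from textwrap import dedent

    # Examples written out in words (linguistic format).
    linguistic_prompt = dedent("""\
        Example: two, four, eight, sixteen -> thirty-two
        Example: three, six, twelve, twenty-four -> forty-eight
        Continue: five, ten, twenty, forty ->
        """)

    # The same examples given as bare numbers (numerical format).
    numerical_prompt = dedent("""\
        2, 4, 8, 16 -> 32
        3, 6, 12, 24 -> 48
        5, 10, 20, 40 ->
        """)

    # Numerical format supplemented with explanations.
    numerical_with_explanations = dedent("""\
        2, 4, 8, 16 -> 32   (each number is twice the one before it)
        3, 6, 12, 24 -> 48  (each number is twice the one before it)
        5, 10, 20, 40 ->
        """)

    # Each prompt would be sent to gpt-3.5-turbo as a user message and the
    # completions scored for accuracy, mirroring the comparison reported above.
    ```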

Copyright owned by the Saudi Digital Library (SDL) © 2025