Advisors: Russo, Alessandra; Madhyastha, Pranava
Author: Al-Negheimish, Hadeel
Dates: 2024-02-13; 2024-02-13; 2024-02-01
URI: https://hdl.handle.net/20.500.14154/71429

Abstract: Answering questions about a specific context often requires integrating multiple pieces of information and reasoning about them to arrive at the intended answer. Reasoning in natural language for machine reading comprehension (MRC) remains a significant challenge. In this thesis, we focus on numerical reasoning tasks. In contrast to current black-box approaches, which provide little evidence of their reasoning process, we propose a novel approach that facilitates interpretable and verifiable reasoning by using Reasoning Templates for question decomposition. Our evaluations hinted at problematic behaviour in numerical reasoning models, underscoring the need for a better understanding of their capabilities. As the second contribution of this thesis, we conduct a controlled study to assess how well current models understand questions and to what extent they base their answers on textual evidence. Our findings indicate that applying transformations that obscure or destroy the syntactic and semantic properties of the questions does not change the output of the top-performing models. This behaviour reveals fundamental flaws in how these models operate, and it calls into question evaluation paradigms that rely solely on standard quantitative measures such as accuracy and F1 scores, as they create an illusion of progress. To improve the reliability of numerical reasoning models in MRC, our third contribution proposes and demonstrates the effectiveness of a solution to one of these fundamental problems: catastrophic insensitivity to word order. We do this by FORCED INVALIDATION: training the model to flag samples that cannot be reliably answered. We show that it is highly effective at preserving the importance of word order in machine reading comprehension tasks and generalises well to other natural language understanding tasks.
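The core idea behind forced invalidation can be sketched as a simple data-augmentation step: alongside each original (question, answer) pair, add a word-order-shuffled copy mapped to an explicit "invalid" label, so a model trained on the result can no longer answer correctly while ignoring word order. This is a minimal illustrative sketch; the function and label names are assumptions, not the thesis implementation.

```python
import random

INVALID_LABEL = "invalid"  # hypothetical label name for unanswerable samples


def shuffle_words(question, rng):
    """Destroy word order while keeping the same bag of words."""
    words = question.split()
    rng.shuffle(words)
    return " ".join(words)


def forced_invalidation_augment(dataset, rng=None):
    """Return the original (question, answer) pairs plus shuffled copies
    labelled as invalid, forcing a model trained on the result to
    distinguish well-formed questions from order-destroyed ones."""
    rng = rng or random.Random(0)
    augmented = list(dataset)
    for question, _answer in dataset:
        augmented.append((shuffle_words(question, rng), INVALID_LABEL))
    return augmented


data = [("How many points did the Bears score?", "17")]
print(forced_invalidation_augment(data))
```

A model fine-tuned on such augmented data is penalised for producing a confident answer on shuffled input, which is one way to restore sensitivity to word order.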
While our Reasoning Templates are competitive with the state of the art on a single reasoning type, engineering them incurs considerable overhead. Leveraging our improved insights into natural language understanding and concurrent advances in few-shot learning, we conduct a first investigation into overcoming these scalability limitations. Our fourth contribution combines large language models for question decomposition with symbolic rule learning for answer recomposition; with this approach, we surpass our previous results on Subtraction questions and generalise to more reasoning types.

Pages: 169
Language: en
Keywords: NLP; NLU; MRC; AI; ML; neurosymbolic; QA; LLM; reasoning
Title: Towards Numerical Reasoning in Machine Reading Comprehension
Type: Thesis
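The decompose-then-recompose pipeline for Subtraction questions can be sketched as follows. Here a naive regex decomposer stands in for few-shot LLM decomposition, and a fixed subtraction rule stands in for learned symbolic recomposition; all names and the question template are illustrative assumptions, not the thesis implementation.

```python
import re


def decompose(question):
    """Split a comparative 'How many more X than Y?' question into two
    single-hop sub-questions (stand-in for few-shot LLM decomposition)."""
    m = re.match(r"How many more (.+) than (.+)\?", question)
    if not m:
        return None
    return [f"How many {m.group(1)}?", f"How many {m.group(2)}?"]


def recompose(sub_answers):
    """Symbolic recomposition rule for Subtraction: answer = a - b."""
    a, b = sub_answers
    return a - b


question = "How many more touchdowns than field goals?"
subs = decompose(question)
# Sub-answers would come from a single-hop reading-comprehension model
# run over the passage; hard-coded here for illustration.
answers = [4, 2]
print(subs)       # ['How many touchdowns?', 'How many field goals?']
print(recompose(answers))  # 2
```

The appeal of this split is that each sub-question is a simpler extraction problem, while the arithmetic is handled by an explicit, verifiable rule rather than inside a black-box model.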