TOWARDS ROBUST AND ACCURATE TEXT-TO-CODE GENERATION

Almohaimeed, Saleh

dc.contributor.advisor	Wang, Liqiang
dc.contributor.author	Almohaimeed, Saleh
dc.date.accessioned	2024-12-18T18:55:08Z
dc.date.issued	2024
dc.description.abstract	Databases play a vital role in today’s digital landscape, enabling effective data storage, management, and retrieval for businesses and other organizations. However, interacting with databases often requires knowledge of query languages (e.g., SQL) and data analysis, which can be a barrier for many users. In natural language processing, the text-to-code task, which converts natural language text into query and analysis code, bridges this gap by allowing users to access and manipulate data using everyday language. This dissertation investigates several challenges in text-to-code (including text-to-SQL as a subtask) and makes four primary contributions to the field. First, to address the lack of statistical analysis in current text-to-code tasks, we introduce SIGMA, a text-to-code dataset with statistical analysis, featuring 6000 questions with Python code labels. Baseline models show promising results, indicating that the new task can support both statistical analysis and SQL queries simultaneously. Second, we present Ar-Spider, the first Arabic cross-domain text-to-SQL dataset, which addresses multilingual limitations. We conducted experiments with the LGESQL and S2SQL models, enhanced by our Context Similarity Relationship (CSR) approach, which demonstrates competitive performance and reduces the performance gap between the Arabic and English text-to-SQL datasets. Third, we address the context-dependent text-to-SQL task, which is often overlooked by current models. We explore the SParC dataset using different question representations and in-context learning prompt engineering techniques. We then propose GAT-SQL, an advanced prompt engineering approach that improves both zero-shot and in-context learning experiments. GAT-SQL sets new benchmarks on both the SParC and CoSQL datasets. Finally, we introduce Ar-SParC, a context-dependent Arabic text-to-SQL dataset that enables users to interact with the model through a series of interrelated questions. In total, 40 experiments were conducted to investigate this dataset using various prompt engineering techniques, and a novel technique called GAT Corrector was developed, which significantly improved the performance of all baseline models.
dc.format.extent	115
dc.identifier.uri	https://hdl.handle.net/20.500.14154/74336
dc.language.iso	en
dc.publisher	University of Central Florida
dc.subject	Artificial intelligence
dc.subject	Text-to-code
dc.subject	Text-to-SQL
dc.subject	Semantic Parsing
dc.subject	Deep Learning
dc.title	TOWARDS ROBUST AND ACCURATE TEXT-TO-CODE GENERATION
dc.type	Thesis
sdl.degree.department	Department of Computer Science
sdl.degree.discipline	Artificial intelligence
sdl.degree.grantor	University of Central Florida
sdl.degree.name	Doctor of Philosophy in Computer Science

Files

Original bundle

Name: SACM-Dissertation.pdf
Size: 1.35 MB
Format: Adobe Portable Document Format

Collections

SACM - United States of America