TOWARDS ROBUST AND ACCURATE TEXT-TO-CODE GENERATION

dc.contributor.advisor Wang, Liqiang
dc.contributor.author Almohaimeed, Saleh
dc.date.accessioned 2024-12-18T18:55:08Z
dc.date.issued 2024
dc.description.abstract Databases play a vital role in today’s digital landscape, enabling effective data storage, management, and retrieval for businesses and other organizations. However, interacting with databases often requires knowledge of query languages (e.g., SQL) and analysis, which can be a barrier for many users. In natural language processing, the text-to-code task, which converts natural language text into query and analysis code, bridges this gap by allowing users to access and manipulate data using everyday language. This dissertation investigates challenges in text-to-code (including text-to-SQL as a subtask) and makes four primary contributions to the field. First, to address the lack of statistical analysis in current text-to-code tasks, we introduce SIGMA, a text-to-code dataset with statistical analysis, featuring 6,000 questions with Python code labels. Baseline models show promising results, indicating that the new task can support both statistical analysis and SQL queries simultaneously. Second, we present Ar-Spider, the first Arabic cross-domain text-to-SQL dataset, which addresses multilingual limitations. Experiments with the LGESQL and S2SQL models, enhanced by our Context Similarity Relationship (CSR) approach, demonstrate competitive performance and reduce the performance gap between the Arabic and English text-to-SQL datasets. Third, we address the context-dependent text-to-SQL task, which is often overlooked by current models. We explore the SParC dataset using different question representations and in-context learning prompt engineering techniques. We then propose GAT-SQL, an advanced prompt engineering approach that improves both zero-shot and in-context learning experiments. GAT-SQL sets new benchmarks on both the SParC and CoSQL datasets. Finally, we introduce Ar-SParC, a context-dependent Arabic text-to-SQL dataset that enables users to interact with the model through a series of interrelated questions. In total, 40 experiments were conducted to investigate this dataset using various prompt engineering techniques, and a novel technique called GAT Corrector was developed, which significantly improved the performance of all baseline models.
dc.format.extent 115
dc.identifier.uri https://hdl.handle.net/20.500.14154/74336
dc.language.iso en
dc.publisher University of Central Florida
dc.subject Artificial intelligence
dc.subject Text-to-code
dc.subject Text-to-SQL
dc.subject Semantic Parsing
dc.subject Deep Learning
dc.title TOWARDS ROBUST AND ACCURATE TEXT-TO-CODE GENERATION
dc.type Thesis
sdl.degree.department Department of Computer Science
sdl.degree.discipline Artificial intelligence
sdl.degree.grantor University of Central Florida
sdl.degree.name Doctor of Philosophy in Computer Science

Files

Original bundle

Name:
SACM-Dissertation.pdf
Size:
1.35 MB
Format:
Adobe Portable Document Format

License bundle

Name:
license.txt
Size:
1.61 KB
Format:
Item-specific license agreed to upon submission

Copyright owned by the Saudi Digital Library (SDL) © 2024