Saudi Universities Theses & Dissertations
Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/11
Developing a High-Quality Tool for Arabic Text-To-Speech Using Deep Learning Techniques (Saudi Digital Library, 2020) — Madhfer, Mokhtar Ali; Qamar, Mustafa

Text-To-Speech (TTS) synthesis is the process of converting written text into speech. Traditional TTS systems involve two stages: a frontend, which transforms the text into linguistic features, and a backend, which uses the linguistic features produced by the frontend to generate the synthesized speech. The frontend requires linguistic expertise to define the features, which is a complex and time-consuming task. Recent advances in deep learning have enabled researchers to integrate both frontend and backend into a single system, called end-to-end TTS synthesis. These end-to-end systems provide high-quality speech synthesis with simpler designs. While the English language has many TTS synthesis models, such as Tacotron and Tacotron 2, the Arabic language lacks any high-quality model. In this work, we build upon recent advances in deep learning to develop a high-quality TTS synthesis system for the Arabic language. To achieve our goal, we dealt with many challenges, such as diacritization, building a speech corpus, and text segmentation. For diacritization, we propose three deep learning models; one of them achieves state-of-the-art performance on both the word error rate and diacritic error rate metrics. The lack of a sizable speech corpus for the Arabic language was addressed by designing two speech corpora. The first corpus was built from an audiobook: we split the audiobook into short clips (1-14 seconds) and aligned each clip with its corresponding text, aided by a web-based service we built that greatly speeds up the alignment. The second corpus was built for experimental purposes from audio synthesized by the Polly service; surprisingly, the model trained on it synthesized speech that sounded more natural than the Polly audio itself.
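The corpus-building step above splits an audiobook into 1-14 second clips. A minimal sketch of one way such splitting can be done, using per-frame RMS energy to detect silences; the frame size and energy threshold here are illustrative assumptions, not values from the thesis:

```python
import numpy as np

def split_on_silence(signal, sr, frame_ms=25, energy_thresh=0.01,
                     min_len_s=1.0, max_len_s=14.0):
    """Split a waveform into clips at low-energy (silent) frames.

    A simplified sketch of the corpus-segmentation step; thresholds,
    frame size, and the energy criterion are illustrative assumptions.
    Returns a list of (start_sample, end_sample) pairs.
    """
    frame = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame
    # Per-frame RMS energy.
    energies = np.array([
        np.sqrt(np.mean(signal[i * frame:(i + 1) * frame] ** 2))
        for i in range(n_frames)
    ])
    voiced = energies > energy_thresh
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                       # voiced run begins
        elif not v and start is not None:
            segments.append((start * frame, i * frame))
            start = None                    # voiced run ends at silence
    if start is not None:
        segments.append((start * frame, n_frames * frame))
    # Keep only clips in the 1-14 second range used for the corpus.
    return [(s, e) for s, e in segments
            if min_len_s <= (e - s) / sr <= max_len_s]
```

Each surviving clip would then be paired with its transcript text, which is the alignment task the web-based service accelerates.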
The TTS model is inspired by Tacotron, with several modifications, such as using location-based attention instead of content-based attention. We evaluated the model using the mean opinion score (MOS) on a scale of 1 (Bad) to 5 (Excellent) via a public website. Our best model, trained on the audiobook corpus, achieved an MOS of 4.60 ± 0.11 in intelligibility, 4.34 ± 0.13 in naturalness, and 4.36 ± 0.14 in overall quality. The naturalness MOS is higher than the English-language results of Tacotron (3.8 ± 0.085) and very close to those of Tacotron 2 with WaveNet (4.526 ± 0.066).
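The ± values above are MOS means with confidence intervals. A minimal sketch of how such an interval can be computed from raw listener ratings, assuming a normal approximation with z = 1.96 (the abstract does not state the exact method used):

```python
import numpy as np

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with an approximate 95% confidence interval.

    `ratings` is a sequence of listener scores on the 1 (Bad) to
    5 (Excellent) scale. The interval is mean ± z * (sample std / sqrt(n));
    the z = 1.96 normal approximation is an assumption for illustration.
    """
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    se = r.std(ddof=1) / np.sqrt(len(r))  # standard error of the mean
    return mean, z * se
```

For example, eight ratings alternating between 4 and 5 give a mean of 4.5 with an interval of roughly ±0.37; the thesis results were presumably computed from many more ratings, which shrinks the interval.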