Building a Spell Checker from Corpora

Abstract

Spelling detection and correction is a well-known text-processing problem in Natural Language Processing (NLP) that merits further investigation. According to several studies, spelling errors fall into two types: non-word errors and real-word errors. This MSc dissertation aimed to build a corpus-based spell checker that detects misspellings of the former type (non-word errors) in text and suggests correct alternative spellings using edit distance and a conventional N-gram approach, and to compare its performance with the state-of-the-art BERT model. The software uses the Project Gutenberg (PG) corpora in conjunction with a Named Entity Recognition (NER) pipeline to determine whether a word is correctly spelled. Corrections are generated by combining edit distance with unigram probability to select the five best candidate replacements for each misspelled word. Although the BERT approach bases its suggestions on the context of the misspelled word, in most cases the suggested words were not close to the misspelled word. Therefore, the backend of the software was built using edit distance, the N-gram model, a lookup dictionary, and NER. In the experiments, the N-gram with edit distance approach outperformed the BERT approach, even though it does not take context into consideration. The user interface behaves similarly to Microsoft Word: it underlines misspelled words with squiggly red lines and offers a context menu of alternative spellings, and when the user selects one of these words, it automatically replaces the misspelled word. The software was evaluated by second-language volunteers and against a list of misspellings downloaded from https://www.dcs.bbk.ac.uk/~ROGER/corpora.html. The experiments on the misspelling corpora yielded an accuracy of about 83% for the Wikipedia dataset and approximately 74% for the Aspell dataset. The report concludes with a discussion of the study's accomplishments, limitations, and future work.
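
The detection and ranking steps described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the toy corpus stands in for the Project Gutenberg corpora, the function names (`is_known`, `suggest`) and the `max_distance` threshold are assumptions made for illustration, plain Levenshtein distance is assumed as the edit distance, and the NER step is omitted.

```python
import re
from collections import Counter

# Toy stand-in for the Project Gutenberg corpora (assumption, illustration only).
TOY_CORPUS = """
the spelling of a word is checked against the words seen in the corpus
the checker suggests the closest words when a word is not in the corpus
"""


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Unigram counts double as the lookup dictionary.
UNIGRAMS = Counter(tokenize(TOY_CORPUS))
TOTAL = sum(UNIGRAMS.values())


def unigram_probability(word):
    """Relative frequency of `word` in the corpus (0.0 if unseen)."""
    return UNIGRAMS[word] / TOTAL


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_known(word):
    """A word counts as correctly spelled if it appears in the corpus."""
    return word.lower() in UNIGRAMS


def suggest(word, max_distance=2, top_n=5):
    """Return up to `top_n` corpus words, closest first, then most frequent."""
    word = word.lower()
    scored = []
    for candidate in UNIGRAMS:
        distance = edit_distance(word, candidate)
        if distance <= max_distance:
            # Negate the probability so that sorting puts frequent words first.
            scored.append((distance, -unigram_probability(candidate), candidate))
    scored.sort()
    return [candidate for _, _, candidate in scored[:top_n]]
```

In this sketch a word is flagged as a non-word error when `is_known` returns `False`, and `suggest` then supplies at most five replacements. Sorting by distance first and frequency second is one plausible way to combine the two signals; the abstract does not specify how the dissertation weights them.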

Copyright owned by the Saudi Digital Library (SDL) © 2025