Building a Spell Checker from Corpora
Abstract
Spelling-error detection and correction is a well-known text processing problem in the
NLP (Natural Language Processing) area that merits further investigation. According to
several studies, spelling errors are classified into two types: non-word errors and
real-word errors. This MSc dissertation aimed to build corpus-based spell-checker
software that detects misspellings of the former type (i.e., non-word errors) in
text and suggests correct alternative spellings using the edit distance and a conventional
N-gram approach, while comparing the performance with the state-of-the-art BERT model.
The software used the Project Gutenberg (PG) corpora in conjunction with a Named Entity
Recognition (NER) pipeline to determine whether a word is correctly spelled.
Corrections for misspelled words are carried out by combining edit distance with
unigram probability to determine the five best alternative words that can
potentially replace the erroneous word (i.e., the non-word error). Although the BERT
approach gives suggestions based on the context of the misspelled word, in most cases
its suggestions are not close to the misspelled word. Therefore, the
backend of the software was built using edit distance, an N-gram model, a lookup dictionary,
and NER. In the evaluation, the N-gram with edit distance approach outperformed the BERT
approach, albeit without taking context into consideration. Additionally, the user interface
acts similarly to Microsoft Word: it underlines misspelled words with squiggly red lines and
offers a context menu with the alternative spellings. Once the user selects one of these
words, it automatically replaces the misspelled word. Moreover, the developed software was
evaluated by second-language volunteers and against lists of misspellings downloaded from
https://www.dcs.bbk.ac.uk/~ROGER/corpora.html. Experiments on these misspelling
corpora yielded an accuracy of about 83% for the Wikipedia dataset and
approximately 74% for the Aspell dataset. Finally, the report concludes with a
discussion of the study's accomplishments, limitations, and future work.
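The candidate-generation and ranking step summarised above (edit distance combined with unigram probability) can be sketched as follows. This is a minimal illustration, not the dissertation's actual implementation: the unigram counts here are hypothetical stand-ins for frequencies that would be derived from the Project Gutenberg corpora, and only edit distance 1 is shown.

```python
# Sketch of non-word error correction: generate candidates within edit
# distance 1, keep only known words, rank by (hypothetical) unigram counts.
from collections import Counter

# Hypothetical unigram counts; in practice these would come from PG corpora.
UNIGRAMS = Counter({"spelling": 120, "spending": 80, "dwelling": 15,
                    "the": 5000, "checker": 40, "spilling": 10})
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    replaces = {L + c + R[1:] for L, R in splits if R for c in ALPHABET}
    inserts = {L + c + R for L, R in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts

def suggest(word, k=5):
    """Top-k known candidates at edit distance 1, ranked by unigram count."""
    candidates = [w for w in edits1(word) if w in UNIGRAMS]
    return sorted(candidates, key=UNIGRAMS.get, reverse=True)[:k]
```

For example, `suggest("speling")` returns `["spelling"]`, since that is the only in-vocabulary word one edit away under the toy counts above; a real system would fall back to edit distance 2 and a much larger dictionary.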