COVID-19 Fake News Detection on Twitter
Abstract
During the COVID-19 pandemic, social networks have seen a substantial amount of false news [28]. People have also been discussing numerous false remedies for COVID-19, many of which are extremely dangerous to human health. The Director-General of the World Health Organization calls this an "infodemic" [31] because of the amount of misinformation, disinformation, and "false news" relating to COVID-19. With the enormous volume of COVID-19 news on the Internet, it is difficult for many people to assess its truthfulness. Moreover, riots and panic buying have occurred as a result of the propagation of "false news" [31].

In this thesis project, I aim to build an automated COVID-19 misinformation detection system and to investigate the value of social network structure compared with a text-based classification approach. I have implemented a variety of techniques to detect fake news and misinformation in tweets related to COVID-19. The research objective is to classify each tweet as either true or fake using various text feature representation techniques and graph structure, and to compare and evaluate their performance.

The project comprises two parts: text-based and graph structure-based fake news detection. For the first part, I apply five different classification algorithms to various feature representations, including BoW, TF-IDF and Word2Vec embeddings. For the second part, I represent the data as a graph and learn feature representations for the nodes using the Node2Vec embedding algorithm; these representations can then be used for the downstream classification task. Previous studies [29, 32] revealed that N-gram-based features are effective in identifying false information, and the author of [33] uses TF-IDF features to classify hoax news with 84.67% accuracy. The classifiers used in this thesis include k-nearest neighbours (KNN), multinomial naive Bayes (MNB), linear support vector machine (LSVM), and logistic regression.
I conclude that the Gaussian naive Bayes (GNB) classifier with a combination of features achieves the most accurate performance. Moreover, text-based modelling outperforms graph-based modelling in terms of ROC score, accuracy, and weighted-average F1 score.
The overall aim was to produce a system that automatically detects fake news in tweets related to COVID-19 and improves on the results obtained by previous studies.
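
To illustrate the text-based part described above, the following minimal sketch shows how tweets could be turned into TF-IDF vectors and classified with a k-nearest-neighbour vote. It is not the thesis implementation: the toy tweets, their labels, the function names, the smoothed IDF formula, and the choice of k = 3 are all assumptions made for illustration, and it uses only the Python standard library.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every training term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each term once per document
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(doc, idf):
    """Return an L2-normalised sparse TF-IDF vector; unseen terms are dropped."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def knn_predict(text, train_vecs, train_labels, idf, k=3):
    """Majority vote among the k training tweets most similar to `text`."""
    query = tfidf(text, idf)
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, invented for this sketch (not from the thesis dataset).
train_texts = [
    "drinking bleach cures covid",
    "garlic cures covid instantly",
    "miracle tea cures the virus",
    "vaccines reduce severe covid illness",
    "wash hands to reduce virus spread",
    "masks reduce spread of covid",
]
train_labels = ["fake", "fake", "fake", "true", "true", "true"]

idf = fit_idf(train_texts)
train_vecs = [tfidf(t, idf) for t in train_texts]

prediction_a = knn_predict("bleach cures the virus", train_vecs, train_labels, idf)
prediction_b = knn_predict("masks reduce virus spread", train_vecs, train_labels, idf)
print(prediction_a, prediction_b)
```

On this toy data the first query shares most of its weighted terms with the tweets labelled fake and the second with those labelled true. A real evaluation would need a held-out test set and the metrics named above (ROC score, accuracy, weighted-average F1).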