Analysing Twitter Posts to Support the Development of Smart Cities

Abstract

The purpose of this research is to perform sentiment analysis on people’s tweets about traffic status in smart cities, using the tweets’ coordinates. To achieve this, a supervised machine learning technique called Support Vector Machine (SVM) is used. The data, including tweets and their respective coordinates, is first collected through the Twitter streaming APIs using a Python library called Tweepy. A dataset of over 12,000 tweets previously labeled using sentiment140, a sentiment analysis R package, is imported to train the SVM classifier. Within the labeled dataset, tweets about traffic congestion with negative sentiment are classified as “congestion”, whereas tweets with positive sentiment are classified as “no congestion”. Both datasets are preprocessed through various data cleaning methods, including tokenization, stopword removal, data clearing and lowercasing. The data is then visualized using Matplotlib to analyze its most frequent words. Initially, the data is visualized as a word cloud showing up to 180 of the most frequently used words. Then, the length of the most frequent words in the dataset is shown as a bar graph. The data is then sent to the data classification module. The training dataset is first converted into a document-term matrix (DTM), which records the frequency of each word in the tweets. For this purpose, modeling techniques called “Bag-of-Words (BoW)” and “Bag-of-Embeddings (BoE)” are applied to the preprocessed dataset. To apply SVM, the data is then split into a training dataset and a validation dataset at a ratio of 80:20. Next, the SVM classifier is applied to the training set with gamma at 0.01, regularization parameter C at 100, and a threshold of 50%. Training set accuracy was above 99%, whereas validation set accuracy was above 82%.
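As a rough illustration of the preprocessing and document-term matrix steps described above, the following sketch uses only the Python standard library. It is not the thesis code: the stopword list, the cleaning rules, the function names and the two sample tweets are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; the abstract does not say
# which list was actually used.
STOPWORDS = {"a", "an", "and", "the", "is", "on", "in", "at", "to", "of", "this", "it"}

def preprocess(tweet):
    """Lowercase, strip URLs/mentions/non-letters, tokenize, drop stopwords."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)  # remove URLs and @mentions
    text = re.sub(r"[^a-z\s]", " ", text)           # keep letters and whitespace only
    # Drop stopwords and single-letter leftovers (e.g. the "m" left from "m25").
    return [tok for tok in text.split() if tok not in STOPWORDS and len(tok) > 1]

def document_term_matrix(tweets):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in tweet i."""
    tokenized = [preprocess(t) for t in tweets]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Made-up sample tweets, for illustration only.
tweets = [
    "Stuck in traffic on the M25 again, traffic is awful @TfL",
    "Roads clear this morning, smooth drive to work! https://example.com/x",
]
vocabulary, dtm = document_term_matrix(tweets)
```

Each row of `dtm` is the bag-of-words count vector for one tweet. In a pipeline like the one described, such vectors would be the input features for the SVM classifier.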
The test dataset of unlabeled tweets, built from real-time traffic congestion tweets, is then passed to the classifier to predict whether there is congestion in an area. The classified tweets from each area are converted into percentages of congestion and no congestion, which are then plotted on Google Maps using Shapely. The plotted map shows the percentage of congestion in London, Manchester and Liverpool according to their latitudes and longitudes.
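The conversion from classified tweets to per-area percentages can be sketched as follows. This is a minimal example, assuming each tweet has already been assigned to a city from its coordinates. The function name and the sample labels are hypothetical and are not results from the study.

```python
from collections import Counter, defaultdict

def congestion_percentages(classified):
    """Turn (city, label) pairs into per-city percentages.

    Labels are assumed to be "congestion" or "no congestion".
    """
    counts = defaultdict(Counter)
    for city, label in classified:
        counts[city][label] += 1

    result = {}
    for city, label_counts in counts.items():
        total = sum(label_counts.values())
        result[city] = {
            "congestion": 100.0 * label_counts["congestion"] / total,
            "no congestion": 100.0 * label_counts["no congestion"] / total,
        }
    return result

# Made-up classifier output, for illustration only.
classified = [
    ("London", "congestion"), ("London", "congestion"),
    ("London", "no congestion"), ("London", "congestion"),
    ("Manchester", "no congestion"), ("Manchester", "congestion"),
    ("Liverpool", "no congestion"),
]
percentages = congestion_percentages(classified)
```

The resulting dictionary holds one pair of percentages per city, which is the kind of value that would be attached to each plotted location.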

Copyright owned by the Saudi Digital Library (SDL) © 2025