Generating an RDF dataset from Twitter data: A Study Using Machine Learning

Abstract

This thesis presents work focused on finding the most effective and efficient mechanism(s) for generating an RDF dataset from Twitter data. The motivation is the desire to extract knowledge from social media data, knowledge that can only be extracted if structure, in the form of RDF, is imposed on the data. The main research question that the thesis seeks to address is: “What is the most effective and efficient mechanism whereby an RDF database associated with a particular domain of discourse, described in the form of Twitter free text, can be generated?” Four different frameworks for RDF dataset generation from Twitter data are proposed: (i) the Stanford CoreNLP RDF dataset framework, (ii) the GATE RDF dataset framework, (iii) the regular expression RDF dataset framework and (iv) the Shortest Path Dependency Parsing and Word Mover’s Distance (SPDP-WMD) framework. Additional contributions include: (i) a regular expression pattern syntax, (ii) a regular expression parser, (iii) a fully labelled motor vehicle pollution evaluation dataset and (iv) a partially labelled diabetes evaluation dataset. The first RDF dataset generation framework was a benchmark, the second an alternative to the benchmark. Both featured a requirement for substantial amounts of training data and, in the case of the second framework, entity lists. The third and fourth frameworks were introduced to address the limitations of the first two. All the frameworks utilised Named Entity Recognition (NER) and Relation Extraction (RE) models of some kind. All the frameworks also utilised existing lexical databases such as WordNet, and/or existing schemas such as those available from Schema.org, to build the hierarchical structures of classes and relations for the RDFS that enriches the RDF dataset. Apache Jena was used to generate both RDF and RDFS files.
The frameworks were evaluated using datasets drawn from two domains of discourse: motor vehicle pollution and diabetes. The motor vehicle pollution dataset was used to evaluate all the frameworks; the most effective framework was then evaluated using the larger diabetes dataset. The NER and RE models were evaluated using k-fold cross-validation where applicable. The RDFS were populated and then further evaluated using a set of example queries, written in the SPARQL query language, to extract knowledge. In the case of the diabetes dataset, the populated RDFS was also evaluated by clinicians.
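
As a rough illustration of the final step the abstract describes (turning entity and relation pairs produced by NER and RE into RDF), the sketch below serialises a few triples as N-Triples using only the Python standard library. It is not the thesis's implementation, which used Apache Jena. The namespace `http://example.org/pollution/`, the helper names `iri` and `to_ntriples`, and the sample triples are all assumptions made for illustration.

```python
from urllib.parse import quote

# Hypothetical namespace; the abstract does not specify one.
BASE = "http://example.org/pollution/"


def iri(local_name: str) -> str:
    """Build an N-Triples IRI under the assumed namespace."""
    # Spaces become underscores; anything else unsafe is percent-encoded.
    return "<" + BASE + quote(local_name.strip().replace(" ", "_")) + ">"


def to_ntriples(triples):
    """Serialise (subject, relation, object) tuples as N-Triples lines."""
    lines = []
    for subj, rel, obj in triples:
        lines.append(f"{iri(subj)} {iri(rel)} {iri(obj)} .")
    return "\n".join(lines)


# Illustrative triples only, standing in for the output of an NER + RE stage.
extracted = [
    ("diesel cars", "emit", "nitrogen dioxide"),
    ("nitrogen dioxide", "causes", "asthma"),
]

ntriples = to_ntriples(extracted)
print(ntriples)
```

Each printed line is one subject, predicate, object statement terminated by a full stop, which is the form an RDF store can load and then query with SPARQL.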

Copyright owned by the Saudi Digital Library (SDL) © 2025