Keyword Kaleidoscope: Identifying the difference in keywords predominantly used within one community via contrasting with another community
Abstract
Online platforms try to combat unwanted activity and content by blocking search terms associated with keywords that malicious actors use frequently. A persistent challenge is that this approach can inadvertently affect legitimate content that shares those keywords.
This study uses publicly available datasets of online posts to identify differences in the most prominent keywords across those datasets. The goal is to obtain such distinctions by applying the same methods to harmful and benign communities that share similar language and, in turn, to use those distinctions for more effective search-term blocking.
To this end, we employed several analysis methods. Keyword frequencies were computed and compared in tables, in plots, and through hypothesis tests. Topic modeling was applied to the reviews in the datasets to examine the keywords within similar topics and their frequencies. Keyword co-occurrences, defined as how often two keywords appeared in the same review, were also tallied, and the keywords with the largest co-occurrence differences were explored further through plots and representative reviews.
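As an illustration of the frequency and co-occurrence tallies described above, the following is a minimal sketch, not the study's actual pipeline. The toy reviews, the keyword set, the function names, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def keyword_counts(reviews, keywords):
    """Count the number of reviews that mention each keyword."""
    counts = Counter()
    for text in reviews:
        # Assumption: naive lowercase whitespace tokenization.
        tokens = set(text.lower().split())
        counts.update(tokens & keywords)
    return counts


def cooccurrence_counts(reviews, keywords):
    """Count the number of reviews in which each keyword pair appears together."""
    pairs = Counter()
    for text in reviews:
        present = sorted(set(text.lower().split()) & keywords)
        # Sorting gives every pair one canonical (alphabetical) ordering.
        pairs.update(combinations(present, 2))
    return pairs


# Hypothetical toy data standing in for two reviewer communities.
keywords = {"fast", "shipping", "great"}
community_a = [
    "great pills fast shipping",
    "pills arrived fast",
    "shipping was slow",
]
community_b = [
    "great book fast read",
    "shipping was fast",
    "book arrived with fast shipping",
]

freq_a = keyword_counts(community_a, keywords)
co_a = cooccurrence_counts(community_a, keywords)
co_b = cooccurrence_counts(community_b, keywords)

# Per-pair difference in co-occurrence counts (community A minus community B).
co_diff = {pair: co_a[pair] - co_b[pair] for pair in set(co_a) | set(co_b)}
```

Pairs with the largest absolute values in `co_diff` would be the candidates for closer inspection. With datasets of different sizes, the counts would presumably need to be normalized by the number of reviews before being compared.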
While this study centered on two reviewer communities, it yielded several overarching insights; in particular, a similar process could be used to guide effective banning in search functionality. The two datasets examined were found to discuss similar concepts. Although the ordering of the top keywords shifts between the two, most of the most frequent keywords appear near the top of both lists. Despite these similarities, the overall frequencies of overlapping keywords differed. Notable dissimilarities between the two communities appeared either as keywords missing from one top list or the other, or as frequency differences detected by Pearson’s chi-squared contingency test. The topic model results showed that some topics were present in both communities but were linked to different keywords in each. Finally, the keyword-keyword co-occurrence analysis indicates that even keywords used commonly by both communities can have different associations.
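For readers unfamiliar with the frequency comparison, the following is a minimal sketch of Pearson's chi-squared test on a 2x2 contingency table for a single keyword. The counts are invented for illustration, the function name is hypothetical, and no continuity correction is applied; the study's exact test configuration is not stated here.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Rows are communities; columns are reviews with / without the keyword.
    Returns (statistic, p_value) with 1 degree of freedom, uncorrected.
    """
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column must have a nonzero total")
    stat = n * (a * d - b * c) ** 2 / margins
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: the keyword appears in 30 of 100 reviews in one
# community and in 10 of 100 reviews in the other.
stat, p_value = chi2_2x2(30, 70, 10, 90)
```

A small p-value would suggest that the keyword's frequency differs between the two communities by more than chance alone. When many keywords are tested this way, some correction for multiple comparisons would likely be needed.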
Keywords
Keywords, Search terms, NLP, Communities, Content moderation