A Tool For Indexing And Classifying Unstructured Textual Documents Based on Product Family Algebra

Date

2020-08-01

Abstract

Unstructured textual documents comprise the bulk of the data used and archived by organizations in all sectors of the economy. The need to index and classify these documents has attracted growing attention in the field of data analytics. Different approaches are used to index and classify textual documents, ranging from supervised Machine Learning (ML) approaches to rule-based ones. There is a need to explore novel classification approaches that exhibit better effectiveness and performance in classifying the increasing volume of this kind of data. In this thesis, we propose a novel approach to indexing and classifying unstructured textual documents that is based on Product Family Algebra (PFA) and implemented using Binary Decision Diagrams (BDDs). In the proposed approach, a signature is first constructed for a document or a family of documents. The signature is relative to a dictionary of the typical words used in the category under consideration. Then, using operations on product families implemented with BDDs, we carry out the classification of a document or of families of documents using their signatures. Because ML methods are considered the de facto standard in document classification, we also implement four ML classification methods as a baseline against which to compare the performance of our method: Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbor (K-NN), and Decision Tree (DT). We then merge these modules into one software system called the Smart Document Classification System (SDCS). The assessment of our approach to the classification of textual documents shows its flexibility in indexing and classifying families of textual documents. The classification is deterministic and, on single documents (as opposed to families of documents), it compares very well with the SVM classifier. Using rules articulated in the language of PFA, it offers a variety of ways of classifying families of documents.
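
The signature-based classification described in the abstract can be illustrated with a minimal sketch. It is not the SDCS implementation: it assumes the set-based model of product families (a product is a set of features, a family is a set of products), uses plain Python sets in place of BDDs, and treats dictionary words as features. The function names, the sample dictionary, and the category rules are all hypothetical.

```python
import re

def signature(text, dictionary):
    """Return the dictionary words that occur in the text (the document's features)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return frozenset(tokens & dictionary)

def choice(a, b):
    """Choice of two families: any product of either family."""
    return a | b

def compose(a, b):
    """Composition of two families: every pairwise union of their products."""
    return frozenset(p | q for p in a for q in b)

def refines(a, b):
    """Assumed reading of refinement in the set model:
    every product of a contains all features of some product of b."""
    return all(any(q <= p for q in b) for p in a)

# Hypothetical dictionary and category rules, for illustration only.
dictionary = frozenset({"invoice", "payment", "contract", "clause"})
finance = frozenset({frozenset({"invoice", "payment"})})
legal = frozenset({frozenset({"contract", "clause"})})

doc = "Invoice 17: payment is due under clause 4."
family = frozenset({signature(doc, dictionary)})  # a one-document family

print(refines(family, finance))                  # finance rule
print(refines(family, legal))                    # legal rule
print(refines(family, choice(finance, legal)))   # finance OR legal
print(refines(family, compose(finance, legal)))  # finance AND legal
```

Because the rules are ordinary set operations, the same document always receives the same classification, which matches the deterministic behaviour the abstract reports. A BDD encoding would represent the same families as Boolean functions over the dictionary words.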

Keywords

Algebra, Machine Learning, Documents Classification

Copyright owned by the Saudi Digital Library (SDL) © 2024