An Exploration of Methodologies to Improve Semi-supervised Hierarchical Clustering with Knowledge-Based Constraints

Clustering algorithms with constraints (also known as semi-supervised clustering algorithms) have been introduced to the field of machine learning as a significant variant of conventional unsupervised clustering algorithms. They have been shown to achieve better performance by integrating prior knowledge into the clustering process, which helps uncover relevant and useful information in the data being clustered. However, the development of semi-supervised hierarchical clustering techniques remains an open and active area of investigation. The majority of current semi-supervised clustering algorithms are partitional clustering (PC) methods, and only a few research efforts have addressed semi-supervised hierarchical clustering. The aim of this research is to enhance hierarchical clustering (HC) algorithms with prior knowledge by adopting novel methodologies. Such prior knowledge is translated into triple-wise relative constraints, which can be applied effectively in hierarchical clustering.
The research presented in this thesis makes the following contributions: the proposal of a novel clustering algorithm that takes into account six agglomerative linkage measures together with triple-wise relative constraints, and a critical investigation of its performance under various parameters, including distance metrics, linkage methods and different levels of constraints; an enhancement of the effectiveness of the Constrained Ward's Hierarchical Agglomerative Clustering (CWHAC) algorithm by addressing the issues of constraint violation and redundancy, and of its efficiency by reducing the time-consuming process of generating constraints; the development of a novel hybrid clustering approach for the Constrained Ward's Hierarchical algorithm, underpinned by the intelligent k-Means clustering algorithm (CWHC-IKM) for cluster initialization, to address the challenges of typical agglomerative clustering approaches; and the development of a novel framework for handling noisy or irrelevant features, named the Constrained Weighted Ward Hierarchical Clustering algorithm based on the intelligent k-Means algorithm (CWWHCIKM), which combines a feature-weighting approach with semi-supervised clustering. The thesis presents a rigorous performance analysis of the proposed Semi-Supervised Hierarchical Clustering (ssHC) algorithms, demonstrating their superiority in data clustering.
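
To make the idea of a triple-wise relative constraint more concrete, the following is a minimal sketch and not the thesis's algorithm. It assumes the common reading in which a triple (a, b, c) states that items a and b should be merged with each other before either is merged with c. For brevity it uses a naive single-linkage procedure on one-dimensional toy data rather than Ward's method; the function names `single_linkage_merge_heights` and `satisfies_triple` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def single_linkage_merge_heights(points):
    """Naive single-linkage agglomerative clustering on 1-D points.

    Returns a dict mapping frozenset({i, j}) to the linkage distance at
    which items i and j first end up in the same cluster.
    NOTE: single linkage is an illustrative stand-in, not Ward's method.
    """
    clusters = [{i} for i in range(len(points))]
    heights = {}
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            d = min(abs(points[i] - points[j])
                    for i in clusters[x] for j in clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        d, x, y = best
        # Record the merge height for every newly joined pair of items.
        for i in clusters[x]:
            for j in clusters[y]:
                heights[frozenset((i, j))] = d
        merged = clusters[x] | clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return heights


def satisfies_triple(heights, a, b, c):
    """Assumed semantics: (a, b, c) holds if a and b join before c joins either."""
    ab = heights[frozenset((a, b))]
    return ab < heights[frozenset((a, c))] and ab < heights[frozenset((b, c))]


# Toy data (made up for illustration): two tight pairs, far apart.
points = [0.0, 1.0, 5.0, 6.5]
heights = single_linkage_merge_heights(points)

result_ok = satisfies_triple(heights, 0, 1, 2)   # 0 and 1 merge before 2 joins
result_bad = satisfies_triple(heights, 0, 2, 1)  # 0 and 2 do not merge before 1
print(result_ok, result_bad)
```

A constrained algorithm would go further than this check and use such triples to guide or restrict which merges are allowed; the sketch only tests whether a finished hierarchy respects a given triple.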