Statistical Approaches for Binary and Categorical Data Modeling
Publisher
Saudi Digital Library
Abstract
A massive amount of data is now generated as the development of technology
and services accelerates. Consequently, the demand for data clustering as a means of
extracting knowledge has increased in many sectors, such as medical sciences, risk
assessment, and product sales. Moreover, binary data is widely used in various applications
including market basket analysis and text document analysis. Because the classic,
widely used k-means method is inappropriate for clustering binary data, we propose an
improvement of the K-medoids algorithm that uses binary similarity measures instead of
the Euclidean distance generally deployed in clustering algorithms. In addition to
the K-medoids clustering method, agglomerative hierarchical clustering methods based
on Gaussian probability models have recently been shown to be efficient in different
applications. However, the emergence of pattern recognition applications in which the
features are binary or integer-valued demands extending research efforts to such data types.
We propose a hierarchical clustering framework for categorical data based
on multinomial and Bernoulli mixture models. We compare two widely used
density-based distances, namely Bhattacharyya and Kullback-Leibler. The merits of
our proposed clustering frameworks are shown through extensive experiments
on text clustering, binary image categorization, and image categorization.
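
To make the two density-based distances concrete, the following is a minimal sketch of how they could be computed between two multivariate Bernoulli components, each treated as a product of independent Bernoulli variables. The function names, the clipping constant EPS, and the restriction to single components rather than full mixtures are assumptions made for illustration; the abstract does not specify how the framework handles these details.

```python
import math

EPS = 1e-12  # assumed clipping constant to avoid log(0) and division by zero


def _clip(p):
    """Keep a Bernoulli parameter strictly inside (0, 1)."""
    return min(max(p, EPS), 1.0 - EPS)


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence KL(p || q) between two products of
    independent Bernoulli distributions with parameter vectors p and q."""
    total = 0.0
    for a, b in zip(p, q):
        a, b = _clip(a), _clip(b)
        total += a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
    return total


def bhattacharyya_bernoulli(p, q):
    """Bhattacharyya distance between the same two distributions.
    For independent dimensions the Bhattacharyya coefficients multiply,
    so the per-dimension distances add."""
    total = 0.0
    for a, b in zip(p, q):
        coeff = math.sqrt(a * b) + math.sqrt((1 - a) * (1 - b))
        total += -math.log(max(coeff, EPS))
    return total
```

Note that the Kullback-Leibler divergence is asymmetric, whereas the Bhattacharyya distance is symmetric, so an agglomerative merge criterion built on the former would typically need a symmetrized form.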
The development of generative/discriminative approaches for classifying different
kinds of data has attracted scholars' attention. Considering the strengths and weaknesses
of both approaches, several hybrid learning approaches that combine the
desirable properties of both have been developed. Our contribution is to combine
Support Vector Machines (SVMs) and the Bernoulli mixture model in order to classify
binary data. We propose using the Bernoulli mixture model to generate probabilistic
kernels for SVMs based on information divergence. These kernels make intelligent use
of unlabeled binary data to achieve good data discrimination. We evaluate the
proposed hybrid learning approach by classifying binary and texture images.
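
As an illustration of a probabilistic kernel based on information divergence, the sketch below exponentiates the negated symmetric Kullback-Leibler divergence between two Bernoulli parameter vectors. This is one common construction and is an assumption here; the abstract does not state which divergence or kernel form is used. The names kl_kernel and gram_matrix, the scale parameter a, and the idea of one parameter vector per data item are hypothetical.

```python
import math

EPS = 1e-12  # assumed clipping constant to avoid log(0)


def kl_bernoulli(p, q):
    """KL(p || q) between two products of independent Bernoulli distributions."""
    total = 0.0
    for a, b in zip(p, q):
        a = min(max(a, EPS), 1.0 - EPS)
        b = min(max(b, EPS), 1.0 - EPS)
        total += a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
    return total


def kl_kernel(p, q, a=1.0):
    """Kernel value exp(-a * J(p, q)), where J is the symmetric KL divergence.
    The scale a is a hypothetical tuning parameter."""
    symmetric_kl = kl_bernoulli(p, q) + kl_bernoulli(q, p)
    return math.exp(-a * symmetric_kl)


def gram_matrix(params, a=1.0):
    """Pairwise kernel matrix over a list of Bernoulli parameter vectors."""
    return [[kl_kernel(p, q, a) for q in params] for p in params]
```

A matrix built this way could be passed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's SVC with kernel="precomputed". A kernel of this form is not guaranteed to be positive semi-definite, so in practice the scale parameter would need to be chosen with care.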