On the Sensitivity of Feature Selection Methods in Gene Expression Datasets

2014, 2015-02-18 14:02:55.947
Saudi Digital Library
Recent advances in DNA microarray technology have enabled biologists to measure the expression levels of thousands of genes in parallel in a single experiment. Such experiments produce gene expression datasets that are too complex for manual analysis, since the number of features (genes) is very large compared to the small number of samples. This high dimensionality also makes it difficult for traditional machine learning algorithms to build interpretable, accurate, and reasonable models from such data. Hence, feature selection methods are now frequently used as an important step when applying machine learning algorithms. Feature selection is the process of selecting informative features that are relevant to the analysis task and removing irrelevant, redundant, and noisy ones. Applying it helps to increase both the prediction accuracy and the efficiency of the learned models. This thesis addresses two well-known problems associated with gene expression datasets: multiple probe sets of the same gene, and the large number of features (genes), which calls for dimensionality reduction. Multiple probe sets of the same gene are handled by applying the average method, which is then evaluated using a filtering ranking method, a multiple linear regression (MLR) model, and leave-one-out cross-validation (LOOCV). The experiments show that this simple method gives promising results and can itself be considered a feature selection method, as it reduces the number of features (genes) while improving the prediction accuracy of the learned model. The experiments then concentrate on reducing the number of features (genes) through a comparative study of the filtering ranking method and ensemble methods.
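As an illustration of the average method for multiple probe sets, the following is a minimal C++ sketch, not the thesis implementation. It assumes each probe set is already annotated with the gene it maps to and carries one expression value per sample; the `ProbeSet` type and the `averageProbeSets` function are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record: one probe set, the gene it maps to,
// and one expression value per sample.
struct ProbeSet {
    std::string gene;
    std::vector<double> expression;
};

// Collapse all probe sets of the same gene into a single feature by
// averaging their expression values sample by sample.
std::map<std::string, std::vector<double>>
averageProbeSets(const std::vector<ProbeSet>& probes) {
    std::map<std::string, std::vector<double>> sums;
    std::map<std::string, std::size_t> counts;
    for (const ProbeSet& p : probes) {
        std::vector<double>& acc = sums[p.gene];
        if (acc.empty()) acc.assign(p.expression.size(), 0.0);
        // Assumption: every probe set covers the same samples.
        assert(acc.size() == p.expression.size());
        for (std::size_t s = 0; s < acc.size(); ++s) acc[s] += p.expression[s];
        ++counts[p.gene];
    }
    for (auto& entry : sums) {
        const double n = static_cast<double>(counts[entry.first]);
        for (double& v : entry.second) v /= n;  // sum -> mean
    }
    return sums;
}
```

The result has one entry per gene rather than one per probe set, which is why averaging also acts as a mild form of feature reduction.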
The ensemble methods combine the filtering ranking method with wrapper methods that use greedy search strategies: correlation-based forward addition (CFA), correlation-based backward elimination (CBE), sequential forward selection (SFS), and sequential backward elimination (SBE). The MLR model and LOOCV are used to evaluate the proposed methods. The experiments show that the ensemble methods using CFA and CBE give comparable results. The ensemble method with SFS provides the best results among all of the proposed methods and the shortest execution time among the ensemble methods, while the ensemble method with SBE gives the worst results and the longest execution time among all of the proposed methods. The experiments in this thesis were carried out in a C++ environment.
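To make the filter-plus-wrapper idea concrete, the following is a minimal C++ sketch of one such ensemble: a filter ranking followed by sequential forward selection, scored by LOOCV of an MLR model. It is not the thesis implementation. Several details are assumptions, since the abstract does not specify them: the filter score is the absolute Pearson correlation with the target, the LOOCV criterion is mean squared error, the MLR fit solves the normal equations by Gaussian elimination, and the search stops when no candidate lowers the error. All names (`rankFeatures`, `fitMLR`, `loocvError`, `sequentialForwardSelection`, `topK`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

using Vec = std::vector<double>;
using Matrix = std::vector<Vec>;  // rows = samples, columns = features (genes)
using Indices = std::vector<std::size_t>;

// Filter step (assumed score): rank features by decreasing |Pearson correlation| with y.
Indices rankFeatures(const Matrix& X, const Vec& y) {
    const std::size_t n = X.size(), m = X[0].size();
    Vec score(m, 0.0);
    const double my = std::accumulate(y.begin(), y.end(), 0.0) / n;
    for (std::size_t j = 0; j < m; ++j) {
        double mx = 0.0, sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (std::size_t i = 0; i < n; ++i) mx += X[i][j] / n;
        for (std::size_t i = 0; i < n; ++i) {
            const double dx = X[i][j] - mx, dy = y[i] - my;
            sxy += dx * dy; sxx += dx * dx; syy += dy * dy;
        }
        if (sxx > 0.0 && syy > 0.0) score[j] = std::fabs(sxy) / std::sqrt(sxx * syy);
    }
    Indices order(m);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return score[a] > score[b]; });
    return order;
}

// Fit y ~ intercept + chosen features on all samples except `skip`.
// Returns the coefficients, or an empty vector if the system is singular.
Vec fitMLR(const Matrix& X, const Vec& y, const Indices& feats, std::size_t skip) {
    const std::size_t p = feats.size() + 1;
    Matrix A(p, Vec(p + 1, 0.0));  // augmented normal equations [X'X | X'y]
    Vec row(p, 1.0);               // row[0] stays 1.0 for the intercept
    for (std::size_t i = 0; i < X.size(); ++i) {
        if (i == skip) continue;   // leave this sample out
        for (std::size_t k = 0; k + 1 < p; ++k) row[k + 1] = X[i][feats[k]];
        for (std::size_t a = 0; a < p; ++a) {
            for (std::size_t b = 0; b < p; ++b) A[a][b] += row[a] * row[b];
            A[a][p] += row[a] * y[i];
        }
    }
    for (std::size_t c = 0; c < p; ++c) {  // Gauss-Jordan with partial pivoting
        std::size_t piv = c;
        for (std::size_t r = c + 1; r < p; ++r)
            if (std::fabs(A[r][c]) > std::fabs(A[piv][c])) piv = r;
        A[c].swap(A[piv]);
        if (std::fabs(A[c][c]) < 1e-12) return {};
        for (std::size_t r = 0; r < p; ++r) {
            if (r == c) continue;
            const double f = A[r][c] / A[c][c];
            for (std::size_t k = c; k <= p; ++k) A[r][k] -= f * A[c][k];
        }
    }
    Vec beta(p);
    for (std::size_t c = 0; c < p; ++c) beta[c] = A[c][p] / A[c][c];
    return beta;
}

// LOOCV mean squared prediction error of the MLR model on the chosen features.
double loocvError(const Matrix& X, const Vec& y, const Indices& feats) {
    double sse = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        const Vec beta = fitMLR(X, y, feats, i);
        if (beta.empty()) return std::numeric_limits<double>::infinity();
        double pred = beta[0];
        for (std::size_t k = 0; k < feats.size(); ++k) pred += beta[k + 1] * X[i][feats[k]];
        sse += (y[i] - pred) * (y[i] - pred);
    }
    return sse / y.size();
}

// Wrapper step: greedily add, from the topK ranked features, the one that lowers LOOCV error most.
Indices sequentialForwardSelection(const Matrix& X, const Vec& y, std::size_t topK) {
    assert(!X.empty() && X.size() == y.size());
    Indices candidates = rankFeatures(X, y);
    candidates.resize(std::min(topK, candidates.size()));
    Indices selected;
    double best = std::numeric_limits<double>::infinity();
    while (!candidates.empty()) {
        std::size_t bestPos = candidates.size();
        for (std::size_t pos = 0; pos < candidates.size(); ++pos) {
            Indices trial = selected;
            trial.push_back(candidates[pos]);
            const double err = loocvError(X, y, trial);
            if (err < best) { best = err; bestPos = pos; }
        }
        if (bestPos == candidates.size()) break;  // no candidate improves the error
        selected.push_back(candidates[bestPos]);
        candidates.erase(candidates.begin() + static_cast<std::ptrdiff_t>(bestPos));
    }
    return selected;
}
```

Restricting the wrapper to the `topK` ranked features is what keeps the greedy search affordable when there are thousands of genes. A backward variant such as SBE would instead start from all `topK` features and remove one at a time, which requires fitting larger models at every step.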