METHODOLOGY FOR SELECTING MICROARRAY BIOMARKER GENES FOR CANCER CLASSIFICATION

Document Type : Original Article

Authors

1 Engineering Division, Systems & Information Department, National Research Centre, El Buhouth Street, Dokki, Cairo, Egypt

2 Engineering Division, Systems & Information Department, National Research Centre, El Buhouth Street, Dokki, Cairo, Egypt

Abstract

In the analysis of microarray gene expression data, it is very difficult to obtain a satisfactory
classification result with machine learning techniques because of the dimensionality problem: gene
expression data are very high dimensional, while datasets usually contain only a few tens of samples.
Microarray data include many redundant and noisy genes, and numerous genes carry no information
relevant to classification. The best combination of gene selection and classification is required to
identify biomarker genes from thousands of genes. In this research, a methodology has been developed
to eliminate noisy, irrelevant and redundant genes and to find a small set of significant, informative
biomarker genes that can classify cancer datasets with high accuracy. The process consists of two
phases: gene selection and classification. In the gene selection phase, the genes are ranked
according to their ranking scores, computed with two statistical approaches: class separability and
the t-test. Then, starting from the highest-ranked genes, different subsets of genes are used to
classify the dataset until the highest possible accuracy is reached. Two data mining techniques are
used for classification: K-Nearest Neighbor and Support Vector Machine. The proposed method has been
used to classify 7 benchmark gene expression cancer datasets. The results show that the proposed
methodology can identify a small subset of relevant predictive genes and can achieve high prediction
accuracy with this small subset for different datasets. The accuracy and the subset of biomarker
genes have been identified for the different cancer datasets.
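
The two-phase process described in the abstract (rank genes, then classify with growing subsets of
the top-ranked genes) can be illustrated with a minimal sketch. It is not the authors'
implementation. It makes several assumptions that the abstract does not state: a two-class problem,
the absolute Welch t-statistic as the ranking score, k-NN with k = 3 and Euclidean distance,
leave-one-out accuracy as the evaluation measure, and a small synthetic dataset in place of a real
microarray dataset. The class separability score and the Support Vector Machine are omitted. All
function names are hypothetical. For brevity the ranking uses all samples, which is a simplification.

```python
import math
import random
from statistics import mean, variance

def t_score(a, b):
    """Absolute Welch t-statistic between two lists of expression values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0

def rank_genes(X, y):
    """Gene indices sorted from highest to lowest t-score (labels must be 0 or 1)."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append(t_score(a, b))
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def loocv_accuracy(X, y, genes, k=3):
    """Leave-one-out accuracy of k-NN restricted to the given gene subset."""
    Xs = [[row[g] for g in genes] for row in X]
    hits = 0
    for i in range(len(Xs)):
        train_X = Xs[:i] + Xs[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += knn_predict(train_X, train_y, Xs[i], k) == y[i]
    return hits / len(Xs)

def best_subset(X, y, sizes=(1, 2, 5, 10), k=3):
    """Try the top-n ranked genes for each n; keep the most accurate, smallest subset."""
    ranked = rank_genes(X, y)
    results = {n: loocv_accuracy(X, y, ranked[:n], k) for n in sizes}
    best_n = max(results, key=lambda n: (results[n], -n))
    return ranked[:best_n], results[best_n]

# Synthetic stand-in data (an assumption): 20 samples, 50 genes,
# where only genes 0 and 1 differ between the two classes.
random.seed(0)
y = [0] * 10 + [1] * 10
X = [[random.gauss(6.0 if (label == 1 and g < 2) else 0.0, 1.0) for g in range(50)]
     for label in y]

genes, acc = best_subset(X, y)
print("selected genes:", genes, "leave-one-out accuracy:", acc)
```

The candidate subset sizes in `best_subset` are arbitrary; in practice they would be chosen to match
the number of genes and samples in the dataset at hand.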