Yakout, N., Madbouly, M., El Sherbiny, M. (2025). An Optimized Classification and Regression Tree Algorithm by Combining Feature Selection Methods. Journal of Computing and Communication, 4(2), 79-88. doi: 10.21608/jocc.2025.446643
An Optimized Classification and Regression Tree Algorithm by Combining Feature Selection Methods
1Institute of Graduate Studies and Research, Alexandria, Egypt
2Information Technology Department, Institute of Graduate Studies and Research, Alexandria, Egypt
Abstract
Feature selection is the process of removing features from the data set that are irrelevant to the task to be performed. It can substantially reduce the dimensionality of the data presented to the classifier, shortening execution time and improving predictive accuracy. In other words, feature selection is a dimensionality reduction technique that cuts the number of attributes down to a manageable size for processing and analysis. The accuracy of a classifier depends not only on the classification algorithm but also on the feature selection method: selecting irrelevant or inappropriate features may confuse the classifier and lead to incorrect results, so feature selection is essential for improving classifier efficiency and accuracy. Among the many classification methods, the decision tree is one of the best known; it is used to find the best way to distinguish one class from another. The five most commonly used decision tree algorithms are ID3, CART, CHAID, C4.5, and J48. CART (Classification And Regression Trees) is a nonparametric model that uses historical data to construct so-called decision trees, which are built top-down recursively beginning with a root node. In this paper, a new technique is suggested to optimize the classification tree produced by the CART algorithm by combining two feature selection methods: Principal Component Analysis (PCA) and Information Gain. The predictor importance (imp) of the decision tree expresses the accuracy of the tree. The proposed model is applied to the labor database. The results show that classifier accuracy and predictor importance are clearly enhanced by the use of feature selection methods compared with the classifier accuracy and predictor importance obtained without feature selection.
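The pipeline the abstract describes can be sketched as follows. This is a minimal illustration using scikit-learn, not the paper's implementation: the labor database and the paper's exact parameters are not available here, so a synthetic dataset and illustrative component/feature counts stand in, and mutual information serves as the information-gain estimate.

```python
# Hypothetical sketch: combined feature selection (PCA + information gain)
# feeding a CART classifier, compared against CART on all features.
# Dataset and all parameter choices below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the labor database used in the paper.
X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: CART (DecisionTreeClassifier uses a CART-style algorithm)
# trained on all features, no feature selection.
baseline = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Combined feature selection: PCA components alongside the top features
# ranked by mutual information (an information-gain estimate), then CART.
combined = Pipeline([
    ("select", FeatureUnion([
        ("pca", PCA(n_components=5)),
        ("info_gain", SelectKBest(mutual_info_classif, k=5)),
    ])),
    ("cart", DecisionTreeClassifier(random_state=0)),
]).fit(X_tr, y_tr)

print("baseline accuracy:", baseline.score(X_te, y_te))
print("with combined feature selection:", combined.score(X_te, y_te))
```

The paper reports predictor importance from its decision trees (the `predictorImportance` measure in MATLAB); in scikit-learn the closest analogue is the tree's `feature_importances_` attribute, which can be inspected on the fitted `cart` step.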