Comparison Between K-Fold Cross Validation And Percentage Split In Decision Tree Algorithms For Anemia Classification
Anemia is a significant global health challenge characterized by a pathological deficit in hemoglobin concentration, which often leads to physiological instability. Clinical diagnosis typically relies on complete blood count (CBC) tests, which provide the hematological parameters used for classification. Although machine learning models have proven effective at diagnosing anemia, existing research often relies on static data partitioning strategies that may overlook evaluation reliability and performance stability. This study addresses that gap by shifting the focus from architectural benchmarking to validation robustness, evaluating the performance of the C4.5 algorithm under different data-splitting techniques. The dataset comprises 1,281 clinical records with 14 numerical features and 9 anemia-type labels. Two partitioning strategies were implemented to assess stability: a static Percentage Split (ranging from 60:40 to 90:10) and iterative K-Fold Cross Validation (with K values of 3, 5, 7, 10, and 15). The C4.5 algorithm reached its peak performance with the 90:10 Percentage Split, with an average accuracy of 99.46%, precision of 98.32%, and recall of 99.28%. By comparison, K-Fold Cross Validation with K=10 yielded a slightly lower but more stable accuracy of 99.19%, with a substantially reduced standard deviation (±0.09), which highlights its reliability for clinical applications. While the high-ratio percentage split maximizes training exposure and predictive potential, the K-Fold method provides a more objective and generalizable benchmark because it evaluates the model across the entire data distribution. The study also identifies difficulties in classifying minority classes, such as Leukemia with thrombocytopenia, owing to data scarcity.
Ultimately, this research confirms that the C4.5 algorithm, when paired with an optimal partitioning protocol, remains a robust and highly interpretable solution for clinical anemia screening, outperforming several more complex modern architectures.
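The difference between the two partitioning strategies can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`percentage_split`, `k_fold`, `summarize`), the shuffling, and the seed are assumptions made for illustration, and no classifier is trained. Only the record count (1,281), the 90:10 ratio, and K=10 come from the abstract. The abstract does not say whether the folds were stratified or which standard deviation formula was used, so the sketch uses plain shuffled folds and the sample standard deviation.

```python
import random
import statistics

N_RECORDS = 1281  # dataset size reported in the abstract


def percentage_split(n, train_ratio, seed=0):
    """Shuffle the indices once and cut them into a single train/test pair."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_ratio)
    return idx[:cut], idx[cut:]


def k_fold(n, k, seed=0):
    """Yield (train, test) index lists; every record is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # The first n % k folds receive one extra record.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size


def summarize(scores):
    """Mean and sample standard deviation of per-run scores (assumed formula)."""
    return statistics.mean(scores), statistics.stdev(scores)


# One static 90:10 split: a single test set of roughly 10% of the records.
train, test = percentage_split(N_RECORDS, 0.90)
split_sizes = (len(train), len(test))

# 10-fold cross validation: ten test sets that together cover all records.
folds = list(k_fold(N_RECORDS, 10))
fold_test_sizes = [len(te) for _, te in folds]
tested = sorted(i for _, te in folds for i in te)

print(split_sizes)
print(fold_test_sizes)
```

In a full experiment, a decision tree would be trained on each `train` list and scored on the matching `test` list, and the per-fold accuracies would be passed to `summarize` to obtain a mean and spread of the kind the abstract reports. The single percentage split yields one score per run, which is why its estimate depends more heavily on which records happen to fall in the test set.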
Copyright (c) 2025 Nanda Putri Rahmawati, Irwan Budiman, Muhammad Itqan Mazdadi, Andi Farmadi, Friska Abadi (Author)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).





