Articles | Open Access | https://doi.org/10.37547/ijmsphr/Volume07Issue03-05

Ensemble Machine Learning and Natural Language Processing for Automated Cancer Indicator Detection in Clinical Notes

Md Yassir Mottalib, Master of Science in Information System Technology, Wilmington University, USA
Nur Nobe, Department of Health Sciences & Leadership, St. Francis College, Brooklyn, USA
MD Tanvir Islam, Department of Computer Science, Monroe University, USA
Afjal Hossain Jisan, Department of Supply Chain & Information Systems, The Pennsylvania State University, University Park, Pennsylvania, USA
Md. Emran Hossen, Department of Science in Biomedical Engineering, Gannon University, USA

Abstract

Early identification of cancer indicators within clinical documentation is essential for improving diagnostic efficiency and patient outcomes. This study presents a Natural Language Processing (NLP) and machine learning framework designed to extract cancer-related indicators from unstructured clinical notes. Clinical text data obtained from Kaggle and structured diagnostic features from the Breast Cancer Wisconsin (Diagnostic) Dataset available through the UCI Machine Learning Repository were used to develop and evaluate the proposed model. The methodology involved comprehensive text preprocessing, TF-IDF-based feature extraction, and feature engineering to represent clinically meaningful patterns in narrative medical text. Multiple machine learning algorithms, including Logistic Regression, Support Vector Machines, Random Forest, and Gradient Boosting classifiers, were trained and evaluated using standard performance metrics. Experimental results indicate that ensemble learning approaches outperform traditional classifiers in detecting cancer-related information from clinical narratives. Among the evaluated models, the Gradient Boosting classifier achieved the best performance with an accuracy of 95%, precision of 94%, recall of 93%, and an F1-score of 0.93. These results demonstrate the effectiveness of machine learning-based NLP systems in identifying cancer indicators within electronic health records. The proposed framework highlights the potential of automated clinical text analysis to support early cancer detection, enhance clinical decision support systems, and improve healthcare data analytics.
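
The abstract names TF-IDF feature extraction and the standard metrics (precision, recall, F1) but does not spell them out. The following is a minimal sketch of both in plain Python, not the authors' implementation. The helper names (`tokenize`, `tfidf`, `precision_recall_f1`), the unsmoothed idf variant ln(N/df), the note snippets, and the toy labels are all assumptions made for illustration; the study's actual preprocessing and weighting scheme are not described here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs; a stand-in for fuller preprocessing.
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    # Returns one {term: weight} dict per document.
    # Assumed variant: tf = count / document length, idf = ln(N / df).
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def precision_recall_f1(y_true, y_pred, positive=1):
    # Count true positives, false positives, and false negatives.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical note snippets, invented for illustration only.
notes = [
    "Patient reports a palpable mass in the left breast",
    "Patient reports mild seasonal cough",
    "Biopsy of the mass is recommended",
]
weights = tfidf(notes)

# Toy labels and predictions, not results from the study.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In this sketch, a term that appears in only one note (such as "biopsy") receives a higher weight than one shared across notes (such as "mass"), which is the property that lets a classifier focus on distinctive vocabulary. Applying the same harmonic-mean formula to the reported precision of 0.94 and recall of 0.93 gives roughly 0.935, which is in line with the reported F1-score of 0.93.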

Keywords

Natural Language Processing, Machine Learning, Cancer Detection, Clinical Notes, Healthcare Analytics, Electronic Health Records, Text Mining.

How to Cite

Mottalib, M. Y., Nobe, N., Islam, M. T., Jisan, A. H., & Hossen, M. E. (2026). Ensemble Machine Learning and Natural Language Processing for Automated Cancer Indicator Detection in Clinical Notes. International Journal of Medical Science and Public Health Research, 7(03), 27–37. https://doi.org/10.37547/ijmsphr/Volume07Issue03-05