Importance of a standard medical database for breast cancer research

By Santanu Banerjee on February 7, 2024

Data mining extracts significant patterns and knowledge from large databases, which in cancer research include patient characteristics, genetic records, treatment outcomes, and other information. Classification and other data mining approaches are frequently employed in healthcare settings to support decision-making in diagnosis and research (Alhasani et al., 2023).

While most cancer research focuses on clinical and biological aspects, data-driven statistical research is increasingly used as a supplement. In response to the sharp rise in breast cancer cases, several investigators have explored using patients’ clinical information to forecast their risk of developing the disease (Wang and Yoon, 2015). According to the World Health Organisation (WHO), breast cancer is a leading cause of cancer death among women (Sardouk et al., 2019). The best ways to stop the disease in its tracks are preventative measures and routine screening, and this is where data mining of cancer data becomes essential.

Mining data for cancer research using Weka

In recent years, researchers and medical professionals have used various online cancer and medical datasets to look for correlations between patient-specific attributes and survival using existing methodologies (Kaur et al., 2022). Research indicates that Weka, a Java-based application, is one of the tools used to mine cancer data. Weka primarily provides methods for pre-processing, visualisation, classification, clustering, regression, and association rule mining (Kulkarni and Bhagwat, 2015).

Laboratory investigations test theories under specific settings and are constrained by their controlled environment. Observational studies look at demographic features to find correlations, but they may not establish cause-and-effect relationships. Clinical studies in humans offer evidence of cause and effect, and treatments for breast cancer have been developed mostly through clinical trials (Lei et al., 2021).

Analysing cancer research data for early detection and diagnosis

By examining recurring trends and patterns in cancer-related data, researchers can find possible biomarkers or risk factors linked to the illness. Early detection, which can be achieved by analysing mammography data, genetic markers, or the results of other diagnostic tools, enables timely intervention and better treatment outcomes (Parmar and Garg, 2020). In countries where efficient treatment options have been implemented, breast cancer mortality has decreased by two to four per cent per year. According to Ak (2020), it is projected that two million and fifty thousand more patients will survive from 2020 to 2040 if breast cancer mortality is lowered by 2.5 per cent annually.
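
The compounding behind an annual reduction of this kind can be sketched in a few lines. This is only an illustration, assuming the 2.5 per cent reduction applies to the mortality rate each year and compounds over the 20 years from 2020 to 2040. The function name `relative_mortality` is hypothetical, and the sketch does not reproduce the survival projection itself, which also depends on population and incidence data.

```python
def relative_mortality(annual_reduction: float, years: int) -> float:
    """Return the mortality rate after `years`, relative to the starting rate.

    Assumes the same fractional reduction is applied every year (compounding).
    """
    return (1.0 - annual_reduction) ** years


# 2.5 per cent annual reduction over 2020-2040 (20 years).
remaining = relative_mortality(0.025, 20)
print(f"Rate relative to 2020: {remaining:.3f}")
print(f"Cumulative reduction: {(1 - remaining) * 100:.1f}%")
```

Under these assumptions the rate in 2040 is roughly 60 per cent of its 2020 value, which shows how a modest annual change accumulates over two decades.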

Analysing both genetic and molecular data helps to recognise the diversity of breast cancer cases. This allows customised treatment plans to be created based on each person’s distinct biological composition. According to Mendes et al. (2015), a diagnosis of HER2-positive breast cancer guides the use of targeted medicines such as Herceptin, which results in successful treatment with fewer adverse effects.

Chaurasia et al. (2018) noted that magnetic resonance imaging, biopsy, positron emission tomography, and mammography are all common techniques for diagnosing breast cancer. The outcomes of these techniques are subjected to detailed statistical analysis through data mining and machine learning algorithms to identify trends. This helps medical professionals to classify cases as benign or malignant more easily.

Importance of a standard medical history database in risk assessment

By examining sizeable databases, scientists can identify genetic abnormalities or signalling pathways linked to breast cancer, which is essential for the development of new medications and treatments. The development of PARP inhibitors, which are used to treat certain forms of breast cancer, was made possible by large-scale data analysis (Mateo et al., 2019). Research on unexplored therapeutics benefits greatly from the National Cancer Data Base (NCDB), which covers approximately 70% of new advanced cancer diagnoses in the US each year (Gill et al., 2015). Using the NCDB, researchers can investigate trends, treatment results, and patient responses, providing important new information for the continuing effort to advance cancer treatment.

Environmental and lifestyle variables also increase the likelihood of breast cancer. Researchers have statistically linked factors such as diet, childbearing history, and hormone replacement therapy with breast cancer (Berrington de Gonzalez et al., 2021). The trend towards proactive risk assessment significantly alters the role of primary care and creates the opportunity for a primary care-based screening strategy to identify individuals at moderate or high risk of breast cancer (Usher-Smith et al., 2023). According to epidemiological studies, people with low socioeconomic status and little education are more likely to develop breast cancer, and the same characteristics also have a detrimental effect on breast cancer patients’ chances of survival (Abdull, 2015).

Analysing cancer incidence data at the community level improves understanding of the disease’s prevalence in various demographic groups. Public health planning, resource allocation, and the design of focused screening programmes all depend on this data. The lowest starting peak ages were 40 in South Korea and Cameroon, whereas the highest peak ages among females were 55–60 in China, Japan, Iran, Fiji, and Morocco (Lei et al., 2021). Data of this kind is extremely helpful when developing detailed breast cancer awareness plans for different demographics.

Existing studies for the prediction and prognostication of breast cancer

The prognosis of patients with breast carcinoma can be predicted by analysing clinical and molecular data. Machine learning algorithms can process large datasets to find patterns linked to disease progression. Algorithms including k-nearest neighbours (KNN), naive Bayes (NB), decision trees (DT), support vector machines (SVM), and logistic regression (LR) are being used in ongoing research on breast cancer prediction (Nemade and Fegade, 2023). Strelcenia and Prakoonwit (2023) applied XGBoost and Random Forest models, with an emphasis on forecasting breast cancer risk, as a critical component of prognosis. Likewise, Ma et al. (2022) presented the EXSA gradient boosting method to forecast the course of breast cancer and offered insightful information on prognosis.
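
To make one of these algorithms concrete, the following is a minimal sketch of k-nearest neighbours classification in plain Python. It is not the implementation used in any of the cited studies. The two features, the toy records, and the names `TRAINING` and `knn_predict` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy records: (feature 1, feature 2) -> label.
# These values are invented for illustration and are not real patient data.
TRAINING = [
    ((12.0, 14.0), "benign"),
    ((11.5, 15.5), "benign"),
    ((13.0, 13.5), "benign"),
    ((19.0, 22.0), "malignant"),
    ((20.5, 24.0), "malignant"),
    ((18.5, 21.0), "malignant"),
]


def knn_predict(sample, training, k=3):
    """Classify `sample` by majority vote among its k nearest neighbours."""
    # Sort the training records by Euclidean distance to the sample.
    by_distance = sorted(training, key=lambda item: math.dist(sample, item[0]))
    # Count the labels of the k closest records and return the most common.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(knn_predict((12.5, 14.5), TRAINING))  # benign
print(knn_predict((19.5, 23.0), TRAINING))  # malignant
```

A new case receives the label held by most of its closest known cases. In practice, features are usually scaled first so that no single measurement dominates the distance.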

Alshammari and Mezher (2020) demonstrated the potential for forecasting outcomes and prognosis in breast cancer through a comparative study that applied a variety of pre-processing methods to a breast carcinoma dataset using the WEKA data mining tool. Their investigation used lazy algorithms, specifically IBk and K-star, and showed exceptional results, with accuracy rates as high as 98.2% within the ideal time range.


References

  • Abdull, M., 2015. Data mining techniques and breast cancer prediction: A case study of Libya. [online] Available at: [Accessed 5 Feb. 2024].
  • Ak, M.F., 2020, April. A comparative analysis of breast cancer detection and diagnosis using data visualization and machine learning applications. In Healthcare (Vol. 8, No. 2, p. 111). MDPI.
  • Alhasani, A.T., Alkattan, H., Subhi, A.A., El-Kenawy, E.S.M. and Eid, M.M., 2023. A comparative analysis of methods for detecting and diagnosing breast cancer based on data mining. Methods, 7(9).
  • Alshammari, M. and Mezher, M., 2020. A comparative analysis of data mining techniques on breast cancer diagnosis data using WEKA toolbox.
  • Berrington de Gonzalez, A., Pasqual, E. and Veiga, L., 2021. Epidemiological studies of CT scans and cancer risk: the state of the science. The British Journal of Radiology, 94, p.20210471.
  • Chaurasia, V., Pal, S. and Tiwari, B.B., 2018. Prediction of benign and malignant breast cancer using data mining techniques. Journal of Algorithms & Computational Technology, 12(2), pp.119-126.
  • Gill, B.S., Bernard, M.E., Lin, J.F., Balasubramani, G.K., Rajagopalan, M.S., Sukumvanich, P., Krivak, T.C., Olawaiye, A.B., Kelley, J.L. and Beriwal, S., 2015. Impact of adjuvant chemotherapy with radiation for node-positive vulvar cancer: A National Cancer Data Base (NCDB) analysis. Gynecologic Oncology, 137(3), pp.365-372.
  • Kaur, I., Doja, M.N. and Ahmad, T., 2022. Data mining and machine learning in cancer survival research: An overview and future recommendations. Journal of Biomedical Informatics, 128, p.104026. doi:
  • Kulkarni, S. and Bhagwat, M., 2015. Predicting breast cancer recurrence using data mining techniques. International Journal of Computer Applications, 122(23).
  • Lei, S., Zheng, R., Zhang, S., Wang, S., Chen, R., Sun, K., Zeng, H., Zhou, J. and Wei, W., 2021. Global patterns of breast cancer incidence and mortality: A population-based cancer registry data analysis from 2000 to 2020. Cancer Communications, 41(11), pp.1183-1194.
  • Ma, B., Yan, G., Chai, B. and Hou, X., 2022. XGBLC: an improved survival prediction model based on XGBoost. Bioinformatics, 38(2), pp.410-418.
  • Mateo, J., Lord, C.J., Serra, V., Tutt, A., Balmaña, J., Castroviejo-Bermejo, M., Cruz, C., Oaknin, A., Kaye, S.B. and De Bono, J.S., 2019. A decade of clinical development of PARP inhibitors in perspective. Annals of Oncology, 30(9), pp.1437-1447.
  • Mendes, D., Alves, C., Afonso, N., Cardoso, F., Passos-Coelho, J.L., Costa, L., Andrade, S. and Batel-Marques, F., 2015. The benefit of HER2-targeted therapies on overall survival of patients with metastatic HER2-positive breast cancer – a systematic review. Breast Cancer Research, 17(1), pp.1-14.
  • Nemade, V. and Fegade, V., 2023. Machine Learning Techniques for Breast Cancer Prediction. Procedia Computer Science, 218, pp.1314-1320.
  • Parmar, N. and Garg, B., 2020. Breast Cancer Prediction Using Data Mining. [online] Available at: [Accessed 5 Feb. 2024].
  • Sardouk, F., Duru, A.D. and Bayat, O., 2019. Classification of breast cancer using data mining. American Scientific Research Journal for Engineering, Technology, and Sciences (ASRJETS), 51(1), pp.38-46.
  • Strelcenia, E. and Prakoonwit, S., 2023. Effective feature engineering and classification of breast cancer diagnosis: a comparative study. BioMedInformatics, 3(3), pp.616-631.
  • Usher-Smith, J.A., Hindmarch, S., French, D.P., Tischkowitz, M., Moorthie, S., Walter, F.M., Dennison, R.A., Stutzin Donoso, F., Archer, S., Taylor, L. and Emery, J., 2023. Proactive breast cancer risk assessment in primary care: a review based on the principles of screening. British Journal of Cancer, pp.1-11.
  • Wang, H. and Yoon, S.W., 2015. Breast cancer prediction using data mining method. In IIE Annual Conference. Proceedings (p. 818). Institute of Industrial and Systems Engineers (IISE).
  • World Health Organization, 2023. Breast cancer. [online] World Health Organization. Available at: [Accessed 5 Feb. 2024].