Breast cancer is a significant health concern worldwide. Prognosis and survival depend heavily on timely detection and accurate prediction of disease progression. Many prediction models have been developed that take genomics, racial disparities, and tumor characteristics into account. However, most of them focus on short-term outcomes. Long-term follow-up studies that assess breast cancer recurrence, late-stage complications, and survival beyond the initial treatment phase are essential for a more comprehensive picture of patient outcomes.

This study first reviews key prior research on breast cancer prediction and identifies its shortcomings. It also identifies the distribution pattern and risk factors of the disease. It then uses two existing breast cancer datasets with over 1000 observations each, containing important variables such as demographics, tumor size, omics data, mutation count, cancer type, and duration of treatment, among others. Survival analysis is applied to identify independent predictors of breast cancer survival, considering factors such as tumor characteristics, treatment modalities, and patient demographics. Furthermore, machine learning algorithms are employed to enhance predictive accuracy. All analysis is carried out in Python.

Goal 1

Goal 1- To critically review existing research on breast cancer prediction using machine learning algorithms

Purpose: Healthcare datasets differ in nature from other statistical datasets, and understanding them is essential for creating a prediction model. In this goal we will identify key studies on breast cancer prediction models and study the properties, assumptions, methodologies and parameters of their datasets. We will examine them critically and systematically identify their shortcomings.

Method: Systematic and critical analysis of 50-60 existing studies on prediction modelling for breast cancer. The following elements will be reviewed:

  • Author
  • Aim
  • Study type
  • Dataset characteristics
  • Variables/parameters
  • Data analysis method
  • Findings
  • Shortcomings

Requirement: Familiarity with healthcare datasets is a must. Contributors must also possess knowledge of prediction modelling, empirical review, systematic review and literature review.

Milestones

To contribute and publish, select a pending milestone.

Completed
Importance of analysing breast cancer data
Pending
Factors causing growth in breast cancer incidences around the world
Factors affecting the outcome of breast cancer treatment
Machine learning models used in breast cancer prediction
Systematic review of survival analysis used in breast cancer prediction
Dataset characteristics and analysis method for creating a prediction model for breast cancer
Goal 2

Goal 2- To create a prediction model for breast cancer which focuses on long-term outcomes.

Purpose: To elaborate the process of creating a prediction model for healthcare data in Python with an accuracy of at least 85%.

Method: Empirical analysis, i.e. creating the prediction model, which will involve two stages.

Stage 1- We first perform the analysis on the training dataset using the following steps:

  • Step 1- exploring the dataset using descriptive analysis and histograms
  • Step 2- survival time distribution
  • Step 3- treating outliers
  • Step 4- proportional hazards assumption
  • Step 5- Cox proportional hazards regression
  • Step 6- model interpretation
  • Step 7- comparison of survival curves
  • Step 8- creating the prediction model
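As a rough illustration of Steps 2 and 7, the survival time distribution is commonly summarised with a Kaplan-Meier curve, which can then be computed per group and compared. The sketch below is a minimal standard-library version, not the project's actual implementation. The function name kaplan_meier and the toy follow-up data are hypothetical and are not drawn from either study dataset.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient (for example, in months)
    events:    1 if death was observed, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # every patient leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit update: multiply by the share surviving this time point.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # remove deaths and censored patients alike
    return curve


# Hypothetical toy data (assumption, for illustration only).
months = [6, 12, 12, 20, 30, 30, 45, 60]
died = [1, 1, 0, 1, 0, 1, 0, 0]

for t, s in kaplan_meier(months, died):
    print(f"{t:>3} months: S(t) = {s:.3f}")
```

Running the same function separately on two patient groups (for example, two treatment modalities) gives the curves that Step 7 would compare. A formal comparison would additionally need a test such as the log-rank test, which is not shown here.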

 

Stage 2- At this stage we will validate the model by running it on the test dataset. It will then be refined by taking other factors into account.

  • Step 1- time dependent covariates
  • Step 2- risk stratification
  • Step 3- model assessment and validation
  • Step 4- clinical utility
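Steps 2 and 3 of this stage could be sketched as follows. For survival models, one common validation measure is Harrell's concordance index, which checks whether patients with higher predicted risk tend to have shorter survival. Whether the project's accuracy target refers to this measure is not stated, so treat that as an assumption. The helper names concordance_index and stratify, the equal-sized tertile split, and the toy data are all hypothetical.

```python
def concordance_index(durations, events, risk_scores):
    """Harrell's C: share of comparable patient pairs that are ranked correctly.

    A pair is comparable when the patient with the shorter follow-up had an
    observed event. A higher risk score is expected to mean shorter survival.
    Tied risk scores count as half-correct.
    """
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparison
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def stratify(risk_scores, labels=("low", "medium", "high")):
    """Assign each patient to an equal-sized risk group by score rank."""
    order = sorted(range(len(risk_scores)), key=lambda k: risk_scores[k])
    groups = [None] * len(risk_scores)
    for rank, k in enumerate(order):
        groups[k] = labels[rank * len(labels) // len(risk_scores)]
    return groups


# Hypothetical toy data and model scores (assumption, for illustration only).
months = [6, 12, 12, 20, 30, 30, 45, 60]
died = [1, 1, 0, 1, 0, 1, 0, 0]
risk = [2.1, 1.8, 0.9, 1.5, 0.4, 1.9, 0.3, 0.2]

print("C-index:", round(concordance_index(months, died, risk), 3))
print("Risk groups:", stratify(risk))
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking. In practice the risk scores would come from the fitted model applied to the test dataset, and the cut-points for the risk groups would be chosen on clinical grounds rather than by equal group size.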

 

Milestones

To contribute and publish, select a pending milestone.

Completed
Pending
Methodology for breast cancer prediction model using survival analysis
Findings of the proposed breast cancer prediction model
Comparison of breast cancer prediction models