Breast cancer prediction with survival analysis
working
Breast cancer is a significant health concern worldwide. Its prognosis and survival rate depend greatly on timely detection and accurate prediction of disease progression. Many prediction models have been developed that take into consideration genomics, racial disparities, and tumor characteristics. However, most of them focus on short-term outcomes. Long-term follow-up studies that assess breast cancer recurrence, late-stage complications, and survival beyond the initial treatment phase are essential for providing a more comprehensive picture of patient outcomes.
This study first reviews critical past research on breast cancer prediction and identifies its shortcomings. It also identifies the distribution pattern and risk factors. It then uses two existing breast cancer datasets with over 1000 observations each, containing important variables such as demographics, tumor size, omics data, mutation count, cancer type, and duration of treatment, among others. Survival analysis is applied to identify independent predictors of breast cancer survival, considering factors such as tumor characteristics, treatment modalities, and patient demographics. Furthermore, machine learning algorithms are employed to enhance predictive accuracy. The analysis is carried out in Python.
Goal 1- To critically review existing research on breast cancer prediction using machine learning algorithms
Purpose: Healthcare datasets are different in nature than other statistical datasets. Understanding them is essential for creating a prediction model. In this goal we will identify key studies on breast cancer prediction models and study the properties, assumptions, methodologies and parameters of their datasets. We will look at them critically and systematically identify their shortcomings.
Method: Systematic and critical analysis of 50-60 existing studies on prediction modelling for breast cancer. The following elements will be reviewed:
- Author
- Aim
- Study type
- Dataset characteristics
- Variables/ parameters
- Data analysis method
- Findings
- Shortcomings
Requirement: Familiarity with healthcare datasets is a must. Must also possess knowledge of prediction modelling, empirical review, systematic review and literature review.
To contribute and publish select a pending milestone.
Completed
Importance of analysing breast cancer data
Aim: Breast cancer data is analysed for various reasons including predictions, prognosis recommendations, cost management, planning clinical trials, and population health management. In this article the aim is to elaborate on those reasons.
Method: Literature review of 20-25 studies examining the importance of analysing breast cancer data. The studies must be published between 2015 and 2023 in A category journals.
Presentation/ structure: Present the article in the below structure.
- Introduction to uses of breast cancer data and how it is mined
- Importance of analysing breast cancer data- identify 6-7 points of importance. Explain each point individually with examples of application. Provide sufficient evidence.
- Conclusion- what does the future look like for breast cancer data mining?
Pending
Factors causing growth in breast cancer incidences around the world
Aim: Identifying the factors behind the growing rates of breast cancer is crucial in predicting whether the disease is likely to affect an individual. These factors range from demographic to lifestyle factors, and from genetic to environmental factors. The aim of this study is to comprehensively review 40-50 recent studies on factors causing growing rates of breast cancer today.
Method: Systematic review of 40-50 articles published in A category journals in the last 10 years.
Structure: This article will carry tables and text. Tables must be in landscape view revealing the systematic review of 40 studies which contain empirical analysis, i.e., primary study. They must contain the following columns:
- Author (year)
- Aim of the study
- Methodology
- Variables considered
- Findings (factors identified)
- Limitations
Factors affecting the outcome of breast cancer treatment
Aim- The success of breast cancer treatment depends on a number of factors. In this article the aim is to identify and explain them.
Methodology- Systematic review of 25-30 studies published after 2015. They must contain empirical analysis, i.e., first-hand data analysis.
Analysis/ presentation- Present your data in the form of tables and text. The tables must contain the systematic review in the form of the following columns.
- Author (year)
- Aim
- Methodology
- Factors identified
- Findings
- Limitations
Machine learning models used in breast cancer prediction
Aim- This article will examine the different machine learning algorithms that have been used in the past for breast cancer prediction. Remember that prediction models can be made for different purposes like:
- predicting survival rate
- disease recurrence
- treatment response
- quality of life
All these purposes must be considered, with types of machine learning models used for each elaborated well.
Methodology: Systematic review of all the points mentioned above. Each point will be elaborated with analysis of 5 empirical studies conducted after 2018. All these studies must have an original prediction model. The findings must be presented in the form of tables and text discussion.
Since there are 4 points above, there will be 4 tables. And since you need to review 5 models each, there will be a total of 20 studies.
Presentation: Each table must contain the following columns, explaining briefly what you found in the paper.
- Author (year)
- Aim
- Methodology (particularly explain the dataset)
- Variables considered
- Machine learning model used
- Outcome/ findings (emphasise the accuracy of the model, using measures such as the following):
  - prediction accuracy: total number of correct predictions / total number of predictions
  - precision and recall (computed from true positives, false positives and false negatives)
  - F1 score
  - confusion matrix
  - Mean Absolute Error (MAE)
  - Mean Squared Error (MSE)
  - Root Mean Squared Error (RMSE)
  - R-squared (Coefficient of Determination)
- Limitations
After every table you must present a discussion section describing a cohesive view of the findings.
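The classification measures named in the column list above (accuracy, precision, recall, F1 score, confusion matrix) all follow from four counts. The sketch below shows one way to compute them; the function name and the recurrence labels are illustrative assumptions, not taken from any reviewed study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        # Rows are actual classes, columns are predicted classes.
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels (illustrative only): 1 = recurrence, 0 = no recurrence.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting the four counts alongside the ratios makes it easier to compare studies whose datasets differ in class balance.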
Versatility of survival analysis in breast cancer prediction
Aim: Survival analysis can be of different types. While some aim to predict time to event, others predict heterogeneity in survival experiences, relationships between covariates and survival outcomes, etc. This article aims to explain the versatility of survival analysis. It will show different scenarios in which survival analysis is applicable with emphasis on the kind of data that was used.
Methodology: Empirical review of 20-25 studies. All these studies must contain some form of primary data. They will be split into different categories based on the purpose of prediction. Analysis should discuss the findings of these studies cohesively. Therefore focus on synthesis.
Presentation: Structure of the article will be as follows.
- Introduction to survival analysis
- Why is survival analysis used in breast cancer prediction?
- Types of survival analysis used in breast cancer data
  - Type/ purpose 1 (synthesise 5-6 studies)
  - Type/ purpose 2 and so on
- Advantages of survival analysis
- Conclusion- application of the analysis today
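As a small illustration of time-to-event analysis with censored follow-up, the sketch below computes a Kaplan-Meier survival curve by hand. It is a minimal teaching example under stated assumptions: the function name and the follow-up times (in months) are hypothetical, and a real study would normally rely on an established survival analysis library.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient
    events:    1 if the event (e.g. death) was observed, 0 if censored
    """
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Patients who experienced the event exactly at time t.
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up data in months (illustrative only).
durations = [5, 8, 12, 12, 20, 25]
events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(durations, events)
for time, prob in curve:
    print(f"t={time:>2} months  S(t)={prob:.3f}")
```

Censored patients (event = 0) still count in the at-risk group until their last follow-up, which is what separates this estimate from a simple proportion of survivors.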
Systematic review of survival analysis used in breast cancer prediction
Aim: This article will focus on understanding the application process of survival analysis in breast cancer data. The aim is to understand common characteristics in the datasets used, methods/ models applied, outcomes achieved, challenges faced, and limitations of these studies. This is a very important article since it paves the way for the second goal, i.e., creation of a prediction model.
Methodology: We will use the systematic review method. Choose 20-25 studies published after 2015 which have used survival analysis for different predictions on breast cancer. The review will be presented in the form of a table followed by a discussion section. The table must contain the following columns.
- Author (year)
- Aim
- Dataset characteristics
- Methodology (survival analysis model/ method used)
- Variables of the dataset
- Findings
- Limitations
Structure: The article will be presented in the following format.
- Introduction to survival analysis in breast cancer predictions (types of predictions)
- Systematic review of breast cancer prediction using survival analysis (table)
- Discussion
- Conclusion (shortcomings in existing research)
Dataset characteristics and analysis method for creating a prediction model for breast cancer
Aim: This article is the second part of a two-part series which explains the methodology for creating a breast cancer prediction model. In this one, the focus is on explaining the dataset characteristics and analysis plan. Therefore this article cannot be written unless part 1 of the methodology is written.
Methodology: Two breast cancer datasets will be obtained from open source data libraries: a training dataset and a test dataset. The training dataset will be further split into training and test portions. The goal is to create the prediction model on the training portion of the training dataset and then test it on the two remaining sets (the held-out portion of the training dataset and the separate test dataset) to validate it. Both datasets will have similar parameters and a minimum of 1000 observations. It is essential that the ultimate outcome of the treatment/ prognosis is revealed, so that our model accuracy can be validated. In this article there will be no theoretical elaboration of our approach. It will be written in a clear and precise manner, with sole focus on conveying the approach to creating the prediction model.
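The internal split of the training dataset can be sketched as follows. The 80/20 ratio and the fixed random seed are assumptions for illustration (no ratio is specified here), and integer patient IDs stand in for real records.

```python
import random


def split_records(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and return (train_part, held_out_part)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical patient IDs standing in for the 1000+ training observations.
training_dataset = list(range(1000))
build_set, internal_test_set = split_records(training_dataset)
print(len(build_set), len(internal_test_set))
```

For survival data it may be worth stratifying the split by event status so that both portions contain a similar share of observed events; that refinement is left out of this sketch.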
Structure:
- Aim of the prediction model & variables (review the previous article for this)
- Dataset characteristics
- Sources of data
- Purpose & characteristics of training data
- Purpose & characteristics of test data
- Step-wise method of model creation (with specification of training versus test)
  - step 1- splitting into training & test
  - step 2- exploration with descriptive statistics
  - step 3- survival time distribution
  - step 4- treating outliers
  - step 5- proportional hazards assumption
  - step 6- Cox Proportional Hazards Regression
  - step 7- model interpretation
  - step 8- comparison of survival curves
  - step 9- predictive modeling
  - step 10- time dependent covariates- factors that change over time
  - step 11- Risk Stratification
  - step 12- Model Assessment and Validation
  - step 13- clinical utility
- Expected outcome
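To make the Cox proportional hazards step more concrete, the sketch below fits a Cox model with a single binary covariate by maximising the partial likelihood with Newton-Raphson. It is a teaching example under stated assumptions: the function names and the data are hypothetical, there are no tied event times, and a real analysis would normally use an established implementation such as `CoxPHFitter` from the lifelines library.

```python
import math


def score_and_information(beta, durations, events, x):
    """First derivative and negative second derivative of the Cox partial
    log-likelihood for one covariate, evaluated at beta."""
    grad, info = 0.0, 0.0
    for t_i, e_i, x_i in zip(durations, events, x):
        if e_i != 1:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = [x_j for t_j, x_j in zip(durations, x) if t_j >= t_i]
        weights = [math.exp(beta * x_j) for x_j in risk]
        s0 = sum(weights)
        s1 = sum(w * x_j for w, x_j in zip(weights, risk))
        s2 = sum(w * x_j * x_j for w, x_j in zip(weights, risk))
        grad += x_i - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return grad, info


def fit_cox_single(durations, events, x, max_iter=50, tol=1e-10):
    """Newton-Raphson estimate of the log hazard ratio for one covariate."""
    beta = 0.0
    for _ in range(max_iter):
        grad, info = score_and_information(beta, durations, events, x)
        if info <= 0:
            break
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Hypothetical data (illustrative only): months of follow-up, event flag,
# and a binary covariate (1 = higher-risk tumour group, 0 = lower-risk).
durations = [2, 4, 5, 7, 9, 11, 14, 18]
events = [1, 1, 1, 0, 1, 1, 0, 1]
group = [1, 1, 0, 1, 0, 1, 0, 0]

beta_hat = fit_cox_single(durations, events, group)
hazard_ratio = math.exp(beta_hat)
print(f"log hazard ratio = {beta_hat:.3f}, hazard ratio = {hazard_ratio:.2f}")
```

A hazard ratio above 1 indicates that the group coded 1 experiences the event at a higher rate throughout follow-up, which is the quantity interpreted in the model interpretation step.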
Goal 2- To create a prediction model for breast cancer which focuses on long-term outcomes.
Purpose: To elaborate the process of creating a prediction model for healthcare data in Python with an accuracy of at least 85%.
Method: Empirical analysis, i.e., creating the prediction model, will involve two stages.
Stage 1- We first perform the analysis on the training dataset using the following steps:
- Step 1- exploring the dataset using descriptive analysis and histograms
- Step 2- survival time distribution
- Step 3- treating outliers
- Step 4- proportional hazards assumption
- Step 5- Cox proportional hazards regression
- Step 6- model interpretation
- Step 7- comparison of survival curves
- Step 8- creating the prediction model
Stage 2- At this stage we will validate the model by running it on the test dataset. It will then be refined by taking other factors into account.
- Step 1- time dependent covariates
- Step 2- risk stratification
- Step 3- model assessment and validation
- Step 4- clinical utility
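For the model assessment and validation step, one commonly used measure for survival models is Harrell's concordance index (C-index). It is the share of comparable patient pairs in which the patient with the higher predicted risk has the event first. The metric behind the 85% target is not named here, so treating the C-index as that metric is an assumption. The function name and data below are hypothetical.

```python
def concordance_index(durations, events, risk_scores):
    """Harrell's C-index: 1.0 is perfect ranking, 0.5 is no better than chance."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if events[i] != 1:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical test-set data (illustrative only).
durations = [3, 6, 9, 12, 15]            # months of follow-up
events = [1, 1, 0, 1, 0]                 # 1 = event observed, 0 = censored
risk_scores = [0.9, 0.4, 0.6, 0.3, 0.1]  # higher = predicted higher risk

c_index = concordance_index(durations, events, risk_scores)
print(f"C-index = {c_index:.3f}")
```

Because the C-index only uses the ranking of the predictions, it can be computed on the test dataset without recalibrating the model.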
Pending
Methodology for breast cancer prediction model using survival analysis
Aim: This article is the first of a two-part series on the methodology which we will undertake to create a prediction model for breast cancer. The focus of this part of the series will be on explaining the variables and conceptual design of the model.
Methodology: This article cannot be made without completion of Goal 1. Language should be clear and precise. Do not use unnecessary theory to explain anything. Focus should be on conveying our approach to creation of a novel prediction model with justification. Use diagrams, charts, tables to present your ideas.
Prior experience in writing methodology is a must.
Structure:
- Introduction to the purpose of the prediction model
- Variables identified from the systematic and empirical reviews
- Conceptual diagram presenting the input and output variables and relationship between them.
- Hypothesis (if any)
- Conclusion
Findings of the proposed breast cancer prediction model
Note: This milestone cannot be achieved unless the previous milestone 'Dataset characteristics and analysis method for creating a prediction model for breast cancer' is completed.
Purpose: The aim of this milestone is to create a prediction model and present its findings. We need to show that not only is our proposed model useful in predicting breast cancer survival rate, disease recurrence, treatment response and quality of life, but also functions better (with more accuracy) than previous models.
Method: Follow the below steps to create the prediction model.
- Download the dataset from the milestone 'Dataset characteristics and analysis method for creating a prediction model for breast cancer' in Goal 2 OR contact the module creator for the same.
- Import data into Python for prediction and pre-processing
- Eliminate any missing data from the dataset
- Draw pie charts and other visual graphs for understanding selected data attributes
- Build correlation charts to assess linkage between attributes and outcome variable
- Preprocess data to derive standard form and code the attributes
- Build the machine learning and deep learning prediction models using a hyperparameter optimizer (GridSearchCV). Define the best parameters for each model
- Fit each model on their best parameters
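The missing-data and pre-processing steps above can be sketched in plain Python as follows. The column names and values are hypothetical, and reading "standard form" as z-score standardisation is an assumption; in practice these steps are usually done with pandas and scikit-learn.

```python
def drop_missing(rows):
    """Keep only rows in which no value is None or an empty string."""
    return [r for r in rows
            if all(v is not None and v != "" for v in r.values())]


def encode_category(rows, column):
    """Replace category labels with integer codes (alphabetical order)."""
    labels = sorted({r[column] for r in rows})
    codes = {label: i for i, label in enumerate(labels)}
    return [{**r, column: codes[r[column]]} for r in rows], codes


def standardise(rows, column):
    """Rescale a numeric column to zero mean and unit variance."""
    values = [r[column] for r in rows]
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [{**r, column: (r[column] - mean) / sd if sd else 0.0}
            for r in rows]


# Hypothetical records (illustrative only).
rows = [
    {"tumor_size_mm": 10.0, "cancer_type": "ductal", "survived": 1},
    {"tumor_size_mm": None, "cancer_type": "lobular", "survived": 0},
    {"tumor_size_mm": 20.0, "cancer_type": "lobular", "survived": 1},
    {"tumor_size_mm": 30.0, "cancer_type": "ductal", "survived": 0},
]

clean = drop_missing(rows)
encoded, codes = encode_category(clean, "cancer_type")
scaled = standardise(encoded, "tumor_size_mm")
print(codes)
print(scaled)
```

Dropping rows is the simplest way to handle missing values; when many rows are affected, imputation may preserve more of the sample and is worth reporting as an alternative.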
Structure/ presentation: After creating the prediction model based on the above instructions, write a 1000-1500 word article in the following structure.
- Introduction (briefly introduce the dataset, connecting it to the previous milestone, and explain the goal of our prediction model)
- Exploratory data analysis: (explain findings of visual representation of data)
- Data pre-processing (process followed and outcome)
- Model fitting (explanation of findings from steps 7 & 8 above)
- Conclusion
Note: Prior experience in creating prediction models using Python is mandatory.
Comparison of breast cancer prediction models
Note: You cannot proceed with this milestone unless all the previous milestones in Goals 1 and 2 are completed.
Purpose: Until now in this module we have reviewed the existing breast cancer prediction models and then created our own model in Python. Now in this milestone the aim is to compare the accuracy of our model to the previous ones and prove how it is better than them.
Methodology: Compare the findings of our model to the previously created ones (check milestone 'Machine learning models used in breast cancer prediction' under Goal 1 for the models) on the basis of the following parameters:
- prediction accuracy: total number of correct predictions / total number of predictions
- precision and recall (computed from true positives, false positives and false negatives)
- F1 score
- confusion matrix
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared (Coefficient of Determination)
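The error-based measures in the list above (MAE, MSE, RMSE, R-squared) apply when a model predicts a continuous quantity, such as survival time. A minimal sketch of how they can be computed, with a hypothetical function name and hypothetical survival times in months:

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical observed vs predicted survival times in months.
observed = [24, 36, 48, 60]
predicted = [26, 33, 50, 57]
reg = regression_metrics(observed, predicted)
print(reg)
```

When comparing against previously published models, these values are only directly comparable if the outcome is measured in the same unit and on a similar population.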
Structure: In order to achieve this milestone, write a 1000-1500 word article in the following structure.
- Introduction- A brief overview of our developed prediction model based on the previous milestone
- Comparison of prediction models- Explain how our proposed prediction model fares in comparison to the previously reviewed ones for each of the above listed parameters.
- Conclusion- what more can be improved in breast cancer prediction? How can it be achieved?
Note: Experience in creating prediction models is mandatory.