Project Report

Breast Cancer Detection Using Machine
Learning & XAI
Project Report Submitted

in partial fulfilment of the requirements for the degree
of
Bachelor of Technology
Computer Science Engineering
By
NAME OF STUDENT ENROLLMENT NO.
Khalid Jan 190328
Syed Owais Bashir 190330
Mehrun Nissa 190350
Zahir Ahmed 190352
Under the guidance of

Ms. Asiya Quyoum
HOD, CSE
Department of Computer Science Engineering

Government College of Engineering & Technology
Safapora, Ganderbal –193504, J&K
June, 2024
Government College of Engineering & Technology
Safapora, Ganderbal –193504, J&K (India)
CERTIFICATE
This is to certify that the project titled “Breast Cancer Detection Using
Machine Learning & XAI” has been submitted by Khalid Jan (190328), Syed Owais
Bashir (190330), Mehrun Nissa (190350) and Zahir Ahmed (190352) to
Government College of Engineering and Technology, Safapora, Ganderbal, in
partial fulfilment of the requirements for the award of the degree of Bachelor
of Technology in Computer Science and Engineering during the year 2024.
Ms. Asiya Quyoum
Head of Department
Computer Science and Engineering

Ms. Asiya Quyoum
Guide
Head of Department
Computer Science and Engineering
Prof. (Dr.) Rauf Ahmad Khan

Principal
CANDIDATE’S DECLARATION
We hereby certify that the project titled “Breast Cancer Detection Using Machine
Learning & XAI” submitted to the Department of Computer Science Engineering of
GOVERNMENT COLLEGE OF ENGINEERING AND TECHNOLOGY-SAFAPORA
GANDERBAL, is an authentic record of our work carried out during the period of
September 2023 to March 2024 under the guidance of Ms. Asiya Quyoum.
The matter presented in this Major project report has not been submitted by us to any other
Institute/ University for the award of any Degree/ Diploma.
Signature of the Students
Khalid Jan 190328
Syed Owais Bashir 190330
Mehrun Nissa 190350
Zahir Ahmed 190352
This is to certify that the above statement made by the candidates is correct to the
best of my knowledge.
Signature of the Project Guide.
Place:-
Date:-
ACKNOWLEDGEMENT
First and foremost, we thank Almighty Allah for all the blessings in the
entirety of our undertakings.
We take this opportunity to express our profound gratitude and deep regards to our
Principal, Prof. (Dr.) Rauf Ahmad Khan, for his exemplary guidance, monitoring and
constant encouragement throughout the course of engineering. We also take this
opportunity to express a deep sense of gratitude to Ms. Asiya Quyoum, Head (Department
of Computer Science Engineering), for her cordial support, valuable information and
guidance, which helped us in completing this task through various stages. Her guidance
shall carry us a long way in the journey of life on which we are about to embark.
We are obliged to Ms. Asiya Quyoum, Head of Department (Department of Computer
Science and Engineering), for the valuable information, technical help and guidance
provided from time to time in the completion of this project.
Khalid Jan
Syed Owais Bashir
Mehrun Nissa
Zahir Ahmed
Date:
Place:
ABSTRACT
Breast cancer is one of the most prevalent cancers among women globally,
representing a significant public health concern. Early diagnosis can improve
prognosis and chances of survival by enabling timely clinical treatment. Accurate
classification of benign tumors is crucial to prevent unnecessary treatments.
Consequently, correct diagnosis and classification of breast cancer as malignant or
benign are subjects of extensive research. Machine learning is widely recognized as
the methodology of choice for breast cancer pattern classification and forecast
modeling due to its unique advantages in detecting critical features from complex
datasets.
Classification and data mining methods are effective in classifying data,
particularly in the medical field, where they are widely used for diagnosis and analysis
to support decision-making. This study compares seven algorithms (SVM, Logistic
Regression, Random Forest, Decision Tree, AdaBoost, XGBoost, and Naïve Bayes)
for predicting breast cancer outcomes using various datasets. All experiments were
conducted in a simulation environment on the Jupyter platform. The proposed
work can be used to assess the performance of these methods, allowing the
selection of the most suitable approach based on the requirements.
Explainable Artificial Intelligence (XAI) techniques have been incorporated to
enhance transparency and interpretability. XAI methods provide insights into the
decision-making process, revealing the factors and features contributing to the
predictions. A user-friendly website has been developed to facilitate model
deployment and accessibility, allowing users to input data and obtain predictions,
along with explanations for the model's reasoning. The XAI implementation and
website development aim to promote understanding and trust in the model's
predictions, ultimately supporting informed decision-making in breast cancer
diagnosis and treatment.
TABLE OF CONTENTS
CERTIFICATE
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
CHAPTER 1 INTRODUCTION
  1.1 INTRODUCTION
  1.2 PROBLEM STATEMENT
  1.3 RELEVANT CONTEMPORARY ISSUES
  1.4 MOTIVATION
  1.5 OBJECTIVE
  1.6 REQUIREMENT AND SPECIFICATIONS
    1.6.1 SOFTWARE REQUIREMENTS
    1.6.2 HARDWARE REQUIREMENTS
  1.7 SCOPE
  1.8 FEASIBILITY STUDY
CHAPTER 2 LITERATURE REVIEW
  2.1 TIMELINE OF THE REPORTED PROBLEM
CHAPTER 3 METHODOLOGY
  3.1 CONCEPT GENERATION
  3.2 EVALUATION AND SELECTION OF FEATURES
    3.2.1 STAGE 1: DATA PRE-PROCESSING
    3.2.2 STAGE 2: DATA EXPLORATION
      3.2.2.a BIVARIATE DATA ANALYSIS
      3.2.2.b MULTIVARIATE DATA ANALYSIS
    3.2.3 STAGE 3: FEATURE SELECTION
    3.2.4 STAGE 4: FEATURE SCALING
    3.2.5 STAGE 5: MODEL SELECTION
      3.2.5.1 CODE SNIPPET FOR SVM
      3.2.5.2 CODE SNIPPET FOR NAÏVE BAYES
      3.2.5.3 CODE SNIPPET FOR LOGISTIC REGRESSION
      3.2.5.4 CODE SNIPPET FOR DECISION TREES
      3.2.5.5 CODE SNIPPET FOR RANDOM FORESTS
      3.2.5.6 CODE SNIPPET FOR ADABOOST
      3.2.5.7 CODE SNIPPET FOR XGBOOST
      3.2.5.8 CODE SNIPPET FOR CONFUSION MATRIX
      3.2.5.9 CODE SNIPPET FOR CLASSIFICATION REPORT
CHAPTER 4 IMPLEMENTATION OF XAI
  4.1 XAI TECHNIQUES
    4.1.1 SHAP
    4.1.2 LIME
CHAPTER 5 SYSTEM IMPLEMENTATION
  5.1 FRONT-END WEBSITE DEVELOPMENT
    5.1.1 USER INTERFACE DESIGN
    5.1.2 HTML STRUCTURE
    5.1.3 CSS STYLING
    5.1.4 JAVASCRIPT INTERACTIVITY
  5.2 BACK-END INTEGRATION WITH FLASK
    5.2.1 FLASK STRUCTURE
    5.2.2 ROUTE HANDLING
    5.2.3 DATA PROCESSING
    5.2.4 MODEL LOADING AND PREDICTION
    5.2.5 XAI EXPLANATION GENERATION
    5.2.6 RESPONSE RENDERING
  5.3 INTEGRATION DEPLOYMENT
  5.4 CONCLUSION
CHAPTER 6 RESULTS AND FINDINGS
CHAPTER 7 LOCAL DATA ANALYSIS
CHAPTER 8 FUTURE SCOPE
  7.1 DATA EXPANSION
  7.2 ADVANCED XAI TECHNIQUES
CHAPTER 9 CONCLUSION
REFERENCES
LIST OF TABLES
Table Title
1.1 Table Showing Software Requirements
1.2 Table Showing Hardware Requirements
3.1 Attribute Information
LIST OF FIGURES
Figure Title
1.1 WHO analysis of data on causes of death in women
3.1 Code Snippet of Libraries Imported
3.2 Attribute Information of Dataset
3.3 Code Snippet of Target Variable
3.4 Relationship between Two Variables
3.5 Heat Map
3.6 Code Snippet for SVM
3.7 Code Snippet for Naïve Bayes
3.8 Code Snippet for Logistic Regression
3.9 Code Snippet for Decision Trees
3.10 Code Snippet for Random Forests
3.11 Code Snippet for AdaBoost
3.12 Code Snippet for XGBoost
3.13 Code Snippet for Confusion Matrix
3.14 Confusion Matrix
3.15 Classification Report
4.1 Code Snippet for Implementation of XAI
4.2 LIME Explanation
5.1 Front-End Website Page
5.1.1 User Interface of Website
5.2 Code Snippet of Backend Integration with Flask
6.1 SKIMS Data Analysis
6.2 SMHS Data Analysis
6.3 Male Female Ratio of Cancer
6.4 Cancer Comparison
ABBREVIATIONS
SVM Support Vector Machines
AI Artificial Intelligence
ML Machine Learning
FNA Fine Needle Aspiration
FDA Food and Drug Administration
XAI Explainable Artificial Intelligence
XCYT Cytological Diagnosis and prognosis
CHAPTER-1
INTRODUCTION
1.1 Introduction
Breast cancer is one of the most common cancers among women worldwide,
according to the World Health Organization (WHO), and it is a leading cause of
death among women. In India it is the most common cancer among women and
accounts for a high share of cancer fatalities, around 14%. Breast cancer affects about
5% of Indian women, compared with about 12.5% of women in Europe and the
United States. When all cancer types are compared, breast cancer is reported as the
fifth leading cause of death among women. Breast cancer is a malignant tumor that
develops inside breast cells. A tumor is a lump or mass of extra tissue formed by a
group of dividing cells, and it can be either cancerous (malignant) or non-cancerous
(benign). Because prognosis is so critical for long-term survival, early detection of
breast cancer enables early diagnosis and treatment. Cancer can be treated most
effectively when it is detected early, so early detection reduces the chance of death
and plays a vital role in the patient's survival. Delay in diagnosing cancer, or detecting
it at a later stage, may lead to the spread of the disease and complications in treatment.
Past research on the effects of a late cancer diagnosis has found that it is very closely
linked to the disease progressing to advanced stages, lowering the likelihood of
saving the patient's life. An analysis of 87 studies found that female breast cancer
patients who begin treatment within 90 days of the onset of symptoms have a
considerably higher likelihood of surviving than those who wait more than 90 days.
Many earlier studies have found that detecting breast cancer in its early stages and
starting treatment on time increases the chances of survival by preventing malignant
(cancerous) cells from spreading throughout the body. The main contribution of this
report is an evaluation and study of the role of various machine learning approaches
in the early detection of breast cancer.
Artificial intelligence (AI) and machine learning (ML) can be applied together
to improve breast cancer detection while also avoiding overtreatment. Combining AI
with ML approaches helps achieve accurate prediction and decision-making, for
example when deciding whether or not a patient needs surgery based on biopsy
results. Although mammograms are currently the most widely used test, they can give
false-positive (high-risk) results, which can lead to unnecessary biopsies and
procedures. When surgery is performed to remove malignant cells, it is sometimes
discovered that the cells are benign, that is, non-cancerous. In such cases the patient
has been subjected to unnecessary, unpleasant, and costly surgery. ML algorithms have
a number of benefits, including their ability to perform well on healthcare-related data
such as images, X-rays, and blood samples. Some methods are better suited to small
datasets, while others are best suited to large datasets. Some methods are also
sensitive to noise in the data.
1.2 PROBLEM STATEMENT
Mammography remains a critical tool for screening and diagnosing breast
cancers. Advocates for mammography screening point to its widely documented
contribution to reducing breast cancer mortality rates. Although this reduction in
mortality is well established, screening mammography, like any examination, has an
associated false-positive rate. While only 7–12% of women are falsely recalled after
a single mammogram, over 50% of women who undergo annual mammography
screening for 10 years will be recalled incorrectly at least once. These false positives
translate into more benign biopsies, increased spending, and negative psychological
effects for the patients involved. Likewise, potentially malignant neoplasms are at risk
of being missed because of their small size or surrounding dense fibroglandular tissue.
False-negative mammogram results have a higher incidence in women aged 50–89
with previous benign biopsies. However, the rate of false-negative results is still
relatively low, reported at 1.0 to 1.5 per 1,000 women. Diagnostic mammography is
the gold standard for the evaluation of breast cancer, and its accuracy continues to
improve with technical advances. To further increase accuracy and reduce the rates of
false positives and false negatives, recent advances in machine learning (ML) and
artificial intelligence (AI) have been exploited to develop software capable of aiding
radiologists in clinical practice. Currently, many of these AI-based tools for aiding
radiologists and interpreting mammograms are developed with machine learning.
Machine learning is a specific domain of AI concerned with constructing algorithms
that computers use to perform certain tasks without explicit instructions, relying
instead on inference and patterns, and that improve their performance with experience.
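The gap between the single-screen and ten-year false-positive figures quoted in this section follows from compounding. As a rough plausibility check, and assuming (a simplification introduced here, not taken from the cited figures) that each annual screen is independent and has the same false-recall probability p, the chance of at least one false recall in n screens is 1 - (1 - p)^n. A minimal Python sketch, in which the function name is illustrative only:

```python
def cumulative_false_recall(p: float, n: int) -> float:
    """Probability of at least one false recall in n screens.

    Assumption (simplification): every screen is independent and has the
    same false-recall probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # P(no false recall in n screens) = (1 - p) ** n
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Per-screen false-recall rates quoted in the text: 7% and 12%.
    for p in (0.07, 0.12):
        risk = cumulative_false_recall(p, n=10)
        print(f"per-screen rate {p:.0%}: ten-year risk {risk:.1%}")
```

Under these assumptions the ten-year risk comes out at roughly 52% for a 7% per-screen rate and roughly 72% for a 12% rate, which is consistent with the "over 50%" figure quoted above. Real screening results are correlated from year to year, so this is only an order-of-magnitude check.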
1.3 RELEVANT CONTEMPORARY ISSUES
Machine learning is a branch of artificial intelligence and computer science
that focuses on the use of data and algorithms to imitate the way humans learn,
gradually improving in accuracy. Machine learning is an important component of the
growing field of data science. Through the use of statistical methods, algorithms are
trained to make classifications or predictions, uncovering key insights within data
mining projects. These insights subsequently drive decision-making within
applications and businesses, ideally impacting key growth metrics. As big data
continues to grow, the market demand for data scientists will increase, requiring them
to help identify the most relevant business questions and the data needed to answer
them.
The breast is made up of different tissues, ranging from fatty tissue to dense
tissue. Within this tissue are lobes, and each lobe is made up of small tube-like
structures.
Breast cancer is the second major cause of death in women. Cancer starts when
cells begin to grow out of control. Breast cancer cells usually form a tumor that can
often be seen on an X-ray or felt as a lump. Breast cancer can spread when the cancer
cells get into the blood or lymph system and are carried to other parts of the body.
The main cause of breast cancer is changes and mutations in DNA. There are many
types of breast cancer. A malignant tumor can grow and spread to other parts of the
body, whereas a benign tumor can grow but does not spread. Breast cancer mostly
spreads first to the nearby lymph nodes, in which case it is still treated as a local
disease, but it can also spread from one part of the body to another through the blood
vessels or the lymph system. It is important to understand that most breast lumps are
benign and not cancerous (malignant). Non-cancerous breast tumors are abnormal
growths, but they do not spread outside the breast. They are not life-threatening, but
some types of benign breast lumps can increase a woman’s risk of getting breast
cancer. Any breast lump or change needs to be checked by a health care professional
to determine whether it is benign or malignant (cancer) and whether it might affect
future cancer risk.
The breast is the tissue overlying the chest (pectoral) muscles. Women's breasts
are made of specialized tissue that produces milk (glandular tissue) as well as fatty
tissue. The amount of fat determines the size of the breast. The milk-producing part of
the breast is organized into 15 to 20 sections, called lobes. Within each lobe are smaller
structures, called lobules, where milk is produced. The milk travels through a network
of tiny tubes called ducts. The ducts connect and come together into larger ducts,
which eventually exit the skin in the nipple. The dark area of skin surrounding the
nipple is called the areola. In breast cancer, malignant cells multiply abnormally in
the breast and eventually spread to the rest of the body if untreated. Breast cancer
occurs almost exclusively in women, although men can be affected. Signs of breast
cancer include a lump, bloody nipple discharge, or skin changes.
The number and size of databases recording medical data are increasing
rapidly. Medical data, produced from measurements, examinations, prescriptions,
etc., are stored in different databases on a continuous basis. This enormous amount of
data exceeds the ability of traditional
methods to analyze and search for interesting patterns and information that is e (e.g.,
machine learning) and business intelligence. The book Data mining: Practical
machine learning tools and techniques with given the class variable. Based on the
maximum probability. It detects the class membership for the given tuple to a
particular class. The term breast cancer refers to a disease of the breast. A number
of factors can affect the breast and lead to breast cancer:
1. Getting older.
2. Genetic mutations.
3. Smoking and alcohol use.
4. Lack of physical activity.
5. Obesity.
6. Diet.
7. Having dense breasts.
Factors like these are used to analyze breast cancer.
In many cases, diagnosis is based on the patient’s current test results and the doctor’s
experience. Diagnosis is therefore a complex task that requires much experience and
high skill.
1.4 MOTIVATION
Breast cancer is a global problem, with 1.7 million new cases diagnosed per
year. Approximately 60% of deaths due to breast cancer occur in developing
countries. In the United States (US), an estimated 249,260 new cases of breast
cancer are diagnosed each year, and mortality due to the disease is decreasing. In
contrast, developing countries account for one-half of all breast cancer cases and
62% of the deaths. Developing countries have limited healthcare resources and use
different strategies to diagnose breast cancer. Most of the population depends on the
public healthcare system, which affects the diagnosis of the tumor. Because the
healthcare infrastructure in developing countries is deficient, the indicators observed
in developed countries cannot be directly compared with those observed in
developing countries.
Figure-1.1: WHO analysis of data on causes of death in 2018. The result shows
that deaths from breast cancer are higher than deaths from other
causes among women.
The motivation behind using machine learning for breast cancer prediction is
driven by the desire to improve early detection and diagnosis of breast cancer, which
is crucial for successful treatment and improved patient outcomes. Machine learning
algorithms have the potential to analyze large amounts of data, identify patterns, and
make accurate predictions based on the learned patterns. In the case of breast cancer,
these algorithms can analyze various factors and characteristics of breast tissue to
predict the likelihood of developing the disease.
Here are some key motivations for using machine learning in breast cancer prediction:
1. Early detection: Early detection of breast cancer is known to significantly
improve the chances of successful treatment and survival. Machine learning
models can be trained on large datasets containing information about breast tissue,
such as mammograms, medical records, genetic information, and patient
demographics. By analyzing these diverse data sources, machine learning
algorithms can potentially identify subtle patterns and indicators of early-stage
breast cancer that may not be easily recognizable to human observers.
2. Improved accuracy: Machine learning algorithms can learn complex patterns
and relationships within data, potentially leading to more accurate predictions
compared to traditional methods. By analyzing a multitude of factors and their
interactions, machine learning models can provide a more comprehensive
assessment of breast cancer risk than individual risk factors considered in
isolation. This can help healthcare professionals make more informed decisions
about screening, diagnosis, and treatment strategies.
3. Personalized medicine: Breast cancer is a heterogeneous disease, meaning that it
can vary significantly among individuals. Machine learning algorithms can
incorporate various personal factors, such as genetic information, family history,
lifestyle choices, and medical history, to tailor predictions and recommendations
to individual patients. This enables a more personalized approach to breast cancer
prevention and treatment, optimizing patient care and outcomes.
4. Handling big data: The field of healthcare generates vast amounts of data,
including medical records, imaging data, genomic data, and clinical trial results.
Machine learning algorithms are well-suited to handle and analyze such big data,
extracting meaningful insights and patterns that may not be apparent to human
analysts. By leveraging these large datasets, machine learning models can
potentially uncover new risk factors, identify novel biomarkers, and improve our
understanding of breast cancer.
5. Support for healthcare professionals: Machine learning models can serve as
decision support tools for healthcare professionals. By providing risk assessments
and predictions, these models can assist doctors in making more accurate and
timely diagnoses, optimizing treatment plans, and determining appropriate
surveillance strategies for individuals at higher risk. This can help reduce the
burden on healthcare professionals and improve overall patient care.
1.5 OBJECTIVE
The primary objective of breast cancer prediction using machine learning is to
develop accurate and reliable models that can assist in the early detection, diagnosis,
and treatment of breast cancer. Here are some specific objectives of breast cancer
prediction using machine learning:
1. Early detection: One of the main objectives is to detect breast cancer at an early
stage when it is more treatable and the chances of survival are higher. Machine
learning models can analyze various data sources, such as mammograms, patient
demographics, genetic information, and medical records, to identify patterns and
indicators of early-stage breast cancer that may not be easily detectable by human
observers.
2. Risk assessment: Machine learning algorithms can assess the risk of developing
breast cancer by considering multiple risk factors and their interactions. By
incorporating personal factors, such as genetic predisposition, family history,
lifestyle choices, and medical history, these models can provide a personalized risk
assessment for individuals. This helps in identifying individuals who may benefit
from more intensive screening or preventive measures.
3. Prediction accuracy: Machine learning models aim to improve the accuracy of
breast cancer prediction compared to traditional methods. By analyzing large
datasets and learning from historical cases, these models can identify patterns and
relationships that can aid in accurate predictions. Higher prediction accuracy can
lead to more effective screening, diagnosis, and treatment planning, ultimately
improving patient outcomes.
4. Feature identification: Machine learning algorithms can automatically identify
relevant features or biomarkers associated with breast cancer. By analyzing a large
number of data points, these models can uncover new risk factors or biomarkers
that may not have been previously recognized. This can contribute to a better
understanding of breast cancer and lead to the discovery of new diagnostic or
therapeutic targets.
5. Decision support for healthcare professionals: Machine learning models can
serve as decision support tools for healthcare professionals. By providing risk
assessments, predictions, and treatment recommendations, these models can assist
doctors in making more informed decisions. They can help optimize treatment
plans, determine appropriate surveillance strategies, and provide personalized care
for patients, ultimately improving the quality of healthcare delivery.
6. Integration with existing healthcare systems: Another objective is to develop
machine learning models that can seamlessly integrate with existing healthcare
systems. This allows for the efficient utilization of patient data and enables
real-time prediction and decision-making support for healthcare professionals.
Integration with electronic health records (EHRs) and other healthcare systems
ensures the practical applicability and scalability of machine learning models in
clinical settings.
Overall, the objective of breast cancer prediction using machine learning is to
leverage the power of data analysis and pattern recognition to improve early
detection, risk assessment, and treatment strategies for breast cancer, leading to
better patient outcomes and reduced mortality rates.
1.6 REQUIREMENT AND SPECIFICATIONS
1.6.1 Software Requirements
Table 1.1: Table Showing Software Requirements.
Software Requirements                        Version
Operating System                             Windows 10 or higher
Integrated Development Environment (IDE)     Microsoft Visual Studio 2019 or higher,
                                             Anaconda Navigator, Jupyter Notebook,
                                             PyCharm
1.6.2 Hardware Requirements
Table 1.2: Table Showing Hardware Requirements.
Hardware Requirements     Specification
Processor                 Intel Core i5 or higher
RAM                       8 GB or higher
Storage                   256 GB or higher
1.7 SCOPE
The scope of breast cancer prediction using machine learning is broad and
encompasses various aspects of detection, diagnosis, risk assessment, and treatment
planning. Here are some key areas where machine learning can make a significant
impact in breast cancer prediction:
1. Early Detection: Machine learning models can analyze mammograms, medical
imaging data, and other patient information to identify patterns and indicators of
early-stage breast cancer. By detecting cancer at an early stage, the chances of
successful treatment and improved patient outcomes can be significantly
increased.
2. Risk Assessment: Machine learning algorithms can incorporate multiple risk
factors, such as genetic information, family history, lifestyle choices, and medical
history, to provide personalized risk assessments for individuals. These models
can identify individuals at higher risk of developing breast cancer and help guide
screening and preventive strategies.
3. Image Analysis: Machine learning techniques, including computer vision and
deep learning, can be applied to medical images such as mammograms,
ultrasounds, and MRIs to automate the analysis process. These models can assist
radiologists in detecting abnormalities, segmenting tumors, and predicting the
malignancy of breast lesions.
4. Biomarker Identification: Machine learning can analyze genomic data,
proteomic data, and other molecular data to identify biomarkers associated with
breast cancer. These biomarkers can provide insights into disease progression,
treatment response, and potential therapeutic targets.
5. Treatment Planning: Machine learning models can help guide treatment
decisions by predicting the effectiveness of different treatment options for
individual patients. They can analyze patient data, treatment outcomes, and
clinical guidelines to provide personalized treatment recommendations and assist
healthcare professionals in selecting the most suitable therapeutic interventions.
6. Prognosis and Survival Prediction: Machine learning algorithms can analyze
patient data, including clinical features, imaging results, and treatment history, to
predict patient prognosis and survival rates. These predictions can aid in treatment
planning and help patients and healthcare providers make informed decisions
about care options.
7. Integration with Electronic Health Records (EHRs): Machine learning models
can be integrated with electronic health record systems to analyze large-scale
patient data. This integration enables comprehensive analysis of diverse patient
information, facilitating population-level studies, quality improvement initiatives,
and clinical decision support.
8. Public Health Applications: Machine learning techniques can be applied to
population-level data to identify patterns, trends, and risk factors associated with
breast cancer. This information can be used to develop preventive strategies,
optimize public health interventions, and allocate healthcare resources effectively.
It’s important to note that while machine learning shows promise in breast cancer
prediction, these models should always be used as decision support tools and not
as a substitute for medical professionals. The ultimate goal is to augment human
expertise and improve patient care in the field of breast cancer.
1.8 FEASIBILITY STUDY
The feasibility of breast cancer prediction using machine learning has been
widely demonstrated and holds significant potential in improving early detection and
patient outcomes. Here are several factors that contribute to the feasibility of breast
cancer prediction using machine learning:
1. Abundance of Data: There is a substantial amount of available data related to
breast cancer, including mammograms, patient demographics, genetic
information, and histopathological data. Machine learning models thrive on large
and diverse datasets, allowing them to learn patterns and make accurate
predictions. The availability of such data makes breast cancer prediction using
machine learning feasible.
2. Technological Advancements: Rapid advancements in machine learning
algorithms, particularly in deep learning, have greatly enhanced the feasibility of
breast cancer prediction. Deep learning models, such as convolutional neural
networks (CNNs), have demonstrated high accuracy in analyzing medical images
and detecting breast cancer. These advancements enable the development of more
sophisticated and accurate predictive models.
3. Increased Computing Power: The availability of powerful computing resources,
such as GPUs and cloud computing platforms, has significantly improved the
feasibility of training and deploying machine learning models. Complex
algorithms and large datasets can be processed efficiently, reducing the time and
resources required for model development.
4. Feature Selection and Dimensionality Reduction: Machine learning techniques,
including feature selection and dimensionality reduction algorithms, help identify
the most relevant features and reduce the dimensionality of the input data. This
process improves model performance and reduces computational requirements,
making breast cancer prediction more feasible.
5. Model Generalization: Machine learning models that are trained on diverse,
representative datasets can generalize well to unseen data. This allows the models
to make accurate predictions on new patient cases, increasing the feasibility of
their use in real-world clinical settings.
6. Integration with Clinical Workflows: Machine learning models can be
integrated into existing clinical workflows, such as electronic health record (EHR)
systems, to facilitate seamless adoption. This integration ensures that the
predictive models align with the existing healthcare infrastructure, making their
implementation more feasible.
7. Research and Collaborations: There is extensive ongoing research and
collaboration in the field of breast cancer prediction using machine learning.
Researchers, clinicians, and industry experts collaborate to develop and validate
machine learning models, ensuring that the feasibility of these models is
continuously improved.
8. Potential Impact on Healthcare: Breast cancer prediction using machine
learning has the potential to significantly impact healthcare outcomes. Early
detection, accurate risk assessment, and personalized treatment planning can lead
to improved patient survival rates, reduced healthcare costs, and better resource
allocation within healthcare systems.
However, it is important to note that there are challenges and limitations associated
with breast cancer prediction using machine learning, such as the need for
high-quality labeled data, interpretability of complex models, and ethical
considerations surrounding data privacy and bias. Addressing these challenges
requires ongoing research, collaboration, and careful implementation to ensure the
responsible and effective use of machine learning in breast cancer prediction.
CHAPTER 2
LITERATURE REVIEW
2.1 Timeline of the Reported Problem
With the growing development of medical science alongside machine learning,
numerous experiments and studies have been carried out in recent years, producing
several significant papers. Breast cancer is the most common cancer in women around
the world, and it has been studied extensively throughout history. In fact, research on
breast cancer has helped pave the way for breakthroughs in other types of cancer
research. How we treat breast cancer has changed in many ways since the cancer's
first discovery, while other findings and treatments have remained the same for years.
Humans have known about breast cancer for a long time. For instance, the Edwin
Smith Surgical Papyrus describes cases of breast cancer. This medical text dates back
to 2,000-2,500 B.C.E. In the first century, doctors experimented with surgical incisions
to destroy tumors. They also thought that breast cancer was linked with the end of
menstruation. This theory may have prompted the association of cancer with older age.
At the beginning of the Middle Ages, medical progress was intertwined with new
religious philosophies. Christians thought surgery was barbaric and were in favor of
faith healing. In the meantime, Islamic doctors reviewed Greek medical texts to learn
more about breast cancer. The Renaissance saw a revival of surgery, with doctors
exploring the human body. John Hunter, known as the Scottish father of investigative
surgery, identified lymph as a cause of breast cancer. Lymph is the fluid carrying white
blood cells throughout the body. Lumpectomies were also performed by surgeons, but
there was no anesthesia yet, so surgeons had to be fast and accurate to be successful.
Breast cancer research milestones: Our modern approach to breast cancer treatment
and research began forming in the 19th century. Consider these milestones:
1985: Researchers discover that women with early-stage breast cancer who were
treated with a lumpectomy and radiation have similar survival rates to women
treated with only a mastectomy.
1986: Scientists figure out how to clone the HER2 gene.
1995: Scientists can clone the tumor suppressor genes BRCA1 and BRCA2. Inherited
mutations in these genes can predict an increased risk of breast cancer.
1996: The FDA approves anastrozole (Arimidex) as a treatment for breast cancer. This
drug blocks the production of estrogen.
1998: Tamoxifen is found to decrease the risk of developing breast cancer in
at-risk women by 50 percent. It is now approved by the FDA for use as a preventive
therapy.
1998: Trastuzumab (Herceptin), a drug targeting cancer cells that over-produce
HER2, is also approved by the FDA.
2006: The SERM drug raloxifene (Evista) is found to reduce breast cancer risk for
postmenopausal women who are at higher risk. It has a lower chance of serious
side effects than tamoxifen.
2010: "A hybrid intelligent system for breast cancer diagnosis" by Abirami et al. This
paper proposed a hybrid intelligent system that combines fuzzy logic and artificial
neural networks to improve breast cancer diagnosis accuracy.
2011: A large meta-analysis finds that radiation therapy significantly reduces the
risk of breast cancer recurrence and mortality.
2012: "A novel approach for automated detection of breast cancer using SVM
classifier" by Kourou et al. The authors presented a novel approach using support
vector machine (SVM) for automated breast cancer detection, achieving promising
results.
2013: The four major subtypes of breast cancer are defined as HR+/HER2-
(“luminal A”), HR-/HER2- (“triple negative”), HR+/HER2+ (“luminal B”),
and HR-/HER2+ (“HER2-enriched”).
2014: "Deep learning for detecting breast cancer metastases on whole slide images"
by Liu et al. This study explored the application of deep learning techniques,
specifically convolutional neural networks (CNNs), for detecting breast cancer
metastases in whole slide images.
2016: "Breast cancer diagnosis using a hybrid intelligent system" by Arun Kumar et
al. The authors proposed a hybrid intelligent system that combines rough set theory,
fuzzy logic, and genetic algorithm for breast cancer diagnosis, achieving high
accuracy.
2017: The first biosimilar drug, Ogivri (trastuzumab-dkst), is approved by the FDA for
breast cancer treatment. Unlike generics, biosimilars are copies of biologic drugs
and cost less than branded drugs.
2018: A clinical trial suggests that chemotherapy after surgery doesn't benefit 70
percent of women with early-stage breast cancer.
2019: Enhertu is approved by the FDA, and this drug proves to be very effective in
treating HER2-positive breast cancer that has metastasized or can't be removed with
surgery.
2019: "Breast cancer diagnosis using a hybrid machine learning approach" by
Zormpas-Petridis et al. The authors developed a hybrid machine learning approach
combining decision trees, logistic regression, and random forests for breast cancer
diagnosis, achieving competitive performance.
2020: The drug Trodelvy is approved by the FDA for treating metastatic triple-negative
breast cancer in people who haven't responded to at least two other treatments.
2020: "Efficient breast cancer classification using a machine learning approach with
genetic algorithm-based feature selection" by Elakkiya et al. This study employed
genetic algorithm-based feature selection and machine learning techniques for
efficient breast cancer classification, demonstrating improved accuracy.
CHAPTER 3
PROPOSED WORK
3.1 Concept generation
Among breast diseases, breast cancer has become a significant concern because it
can act as a silent killer, progressing without any obvious symptoms. Early
prediction and prevention play a crucial role in reducing the mortality rate associated
with this disease. ML techniques offer promising solutions for the analysis of breast
cancer by examining various risk factors. This proposed work aims to collect and
analyze relevant data from diverse sources, classify the data under suitable headings,
and apply machine learning algorithms to predict the likelihood of breast cancer. The
objective is to empower healthcare professionals and individuals with effective tools
for early detection and prevention, ultimately reducing the mortality rates caused by
breast cancer. The work involves identifying and gathering relevant data from various
sources, including medical records, patient histories, genetic data, and lifestyle
factors, and then performing data preprocessing tasks such as data cleaning, handling
missing values, standardizing the data, and selecting the features relevant for
prediction. In this project we have used the breast cancer data from the UCI
repository []. The features of this data are computed from a digitized image of a fine
needle aspirate (FNA) of a breast mass. There are 569 instances in total, of which 357
are benign tumors and 212 are malignant tumors. 30 features have been recorded
for each instance. In this project, we use Python to implement breast cancer
classification and prediction with various machine learning algorithms: SVM, logistic
regression, decision tree, random forest, naïve Bayes, AdaBoost, and XGBoost. After
comparing all the algorithms, we use XGBoost for the further stages of this project.
3.2 Evaluation and selection of features
The system starts with the collection of data and the selection of important
attributes. The data is then pre-processed into the required format and divided into
two parts: training data and testing data. The models are trained using the training
data, and their accuracy is obtained by evaluating them on the testing data. The
following stages are used to implement the system:
Stage 1: Data pre-processing
Stage 2: Data exploration
Stage 3: Feature selection
Stage 4: Feature scaling
Stage 5: Model selection
3.2.1 Stage 1: Data Pre-Processing
We use the breast cancer dataset from the UCI Machine Learning Repository. The
dataset is publicly available and was created by Dr. William H. Wolberg, a physician
at the University of Wisconsin Hospital at Madison, Wisconsin, USA. To create the
dataset, Dr. Wolberg used fluid samples taken by fine needle aspiration (FNA) from
patients with solid breast masses, together with an easy-to-use graphical computer
program called Xcyt, which is capable of performing the analysis of cytological
features based on a digital scan. The program uses a curve-fitting algorithm to
compute ten features from each of the cells in the sample; it then calculates the
mean value, extreme value, and standard error of each feature for the image,
returning a 30-element real-valued vector.
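The aggregation from per-cell measurements to one 30-value vector can be sketched as follows. This is an illustration only, not the Xcyt implementation: the function name `summarize_cells` and the synthetic input are hypothetical, and the "extreme" (worst) value is taken here as the mean of the three largest values, which is how the dataset documentation describes it.

```python
import numpy as np


def summarize_cells(cell_features):
    """Collapse an (n_cells, 10) array into one 30-value vector.

    For each of the ten per-cell features this returns the mean, the
    standard error of the mean, and the "worst" value, taken here as the
    mean of the three largest values.
    """
    cell_features = np.asarray(cell_features, dtype=float)
    n_cells = cell_features.shape[0]
    mean = cell_features.mean(axis=0)
    std_error = cell_features.std(axis=0, ddof=1) / np.sqrt(n_cells)
    worst = np.sort(cell_features, axis=0)[-3:].mean(axis=0)
    return np.concatenate([mean, std_error, worst])


# Synthetic stand-in for the measurements of 50 cell nuclei in one image.
rng = np.random.default_rng(0)
cells = rng.uniform(1.0, 2.0, size=(50, 10))
feature_vector = summarize_cells(cells)
print(feature_vector.shape)
```

The first ten entries correspond to the "mean" columns, the next ten to the "error" columns, and the last ten to the "worst" columns of the attribute table.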
Attribute Information:
Table 3.1: Attribute Information of Dataset
mean radius | mean texture | mean perimeter | mean area | mean smoothness
mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension
radius error | texture error | perimeter error | area error | smoothness error
compactness error | concavity error | concave points error | symmetry error | fractal dimension error
worst radius | worst texture | worst perimeter | worst area | worst smoothness
worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension
Objective: The objective of this analysis is to observe which features are most helpful
in predicting malignant and benign cancer and to identify general trends that would
help us in model selection. The goal is to classify whether the breast cancer is benign
or malignant. To achieve this, we have used machine learning classification methods
to fit a function that predicts the discrete class of new inputs.
3.2.2 Stage 2: Data Exploration
For this stage we use VS Code to work on the dataset. We first import all the
necessary libraries and load our dataset into VS Code.
Figure-3.1: Code Snippet of Libraries imported
Figure-3.2: Attributes Information of dataset
We can find the dimensions of the dataset using the pandas attribute `data.shape`,
which returns (569, 31): the dataset consists of 569 rows and 31 columns.
‘target’ is the column we are going to predict. Using the code line
‘data['target'].value_counts()’ we find that out of 569 cases, 212 are labelled 0 and
357 are labelled 1. These counts correspond to the 212 malignant and 357 benign
cases of the original dataset, so 0 = malignant and 1 = benign, which is also the
encoding used by scikit-learn's copy of this dataset.
Figure-3.3: Code Snippet of target variable
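The shape and class-count checks described above can be reproduced with a short sketch. It assumes the data is loaded from scikit-learn's bundled copy of the dataset (`load_breast_cancer`) rather than from a downloaded file, which may differ from how the report loaded it. It also prints the label names stored with that copy so the 0/1 encoding can be checked directly.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: use scikit-learn's bundled copy of the Wisconsin dataset.
bunch = load_breast_cancer(as_frame=True)
data = bunch.frame  # 30 feature columns plus the 'target' column

print(data.shape)

# Number of cases per class label.
counts = data["target"].value_counts()
print(counts)

# Mapping from integer label to class name as stored with the dataset.
label_names = dict(enumerate(bunch.target_names))
print(label_names)
```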
Data visualization plays a crucial role in understanding patterns, relationships,
and trends within the dataset. In the context of breast cancer classification, effective
data visualization techniques can help uncover insights and facilitate decision-making.
Python has several visualization libraries, such as Matplotlib and Seaborn.
3.2.2.a Bivariate data analysis
Bivariate data analysis involves examining the relationship between two
variables in a dataset. In the context of breast cancer classification, bivariate data
analysis can help uncover correlations, associations, or dependencies between
different features.
Figure-3.4: Relationship Between Two Variables
3.2.2.b Multivariate data analysis
Multivariate data analysis involves examining the relationships and patterns among
three or more variables simultaneously. In the context of breast cancer classification,
multivariate data analysis can help uncover complex relationships between multiple
features and their combined impact on the classification task.
HEATMAP: A heatmap is a graphical representation of data where the values of a
matrix are represented as colors. The rows and columns of the matrix represent
variables or categories, and the colors represent the values of the data. In our
heatmap, color intensity encodes the magnitude of the correlation between attributes
of the dataset: darker cells indicate a lower magnitude of correlation and lighter cells
indicate a higher magnitude of correlation.
Figure-3.5: Heat Map of Dataset
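A correlation heatmap of this kind can be sketched as below. The sketch assumes scikit-learn's bundled copy of the dataset and a Seaborn heatmap. The helper `plot_heatmap` and the choice of colormap are illustrative and not taken from the report's code; in the `magma` colormap, darker cells correspond to lower values.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True).frame
# 30 x 30 matrix of Pearson correlations between the feature columns.
corr = data.drop(columns="target").corr()

# Locate the most strongly correlated pair of distinct features.
abs_corr = corr.abs().to_numpy().copy()
np.fill_diagonal(abs_corr, 0.0)
i, j = np.unravel_index(abs_corr.argmax(), abs_corr.shape)
strongest_pair = (corr.index[i], corr.columns[j])
print(strongest_pair, round(abs_corr[i, j], 4))


def plot_heatmap(corr_matrix):
    """Draw the absolute correlations as a heatmap (plotting libraries needed)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(12, 10))
    sns.heatmap(corr_matrix.abs(), cmap="magma", ax=ax)
    ax.set_title("Absolute feature correlations")
    return fig
```

Calling `plot_heatmap(corr)` produces the figure. Pairs with very high correlation are candidates for removal during feature selection.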
3.2.3 Stage 3: Feature Selection
Feature selection is the process of reducing the number of input variables to the
model by keeping only the relevant data and discarding noise.
The goal of feature selection is to improve the performance of a model by reducing
the dimensionality of the input data and focusing on the most informative features.
Splitting the dataset
The dataset we used is split into training and testing data. The training set
contains known outputs, and the model learns from this data so that it can
generalize to other data later on. The test set is held out so that the model's
predictions can be evaluated on data it has not seen.
Screenshot of train test split
In this phase of our machine learning pipeline, we utilized the `train_test_split`
function from scikit-learn's `model_selection` module to partition our dataset into
training and testing subsets. This crucial step ensures that we can accurately evaluate
our model's performance on unseen data. We allocated 80% of our data for training
(represented by `X_train` and `y_train`) and reserved the remaining 20% for testing
(represented by `X_test` and `y_test`). The `test_size` parameter was set to 0.2 to
achieve this 80-20 split. To maintain consistency across different runs and ensure
reproducibility of our results, we fixed the random seed to 5 using the `random_state`
parameter. This rigorous data splitting approach helps us mitigate overfitting by
training our model on one subset of data and validating its generalization capability
on a separate, previously unseen subset.
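The split described above can be written as the following minimal sketch. It uses the parameters stated in the text (`test_size=0.2`, `random_state=5`) and assumes the features and labels come from scikit-learn's bundled copy of the dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: features X and labels y come from scikit-learn's bundled dataset.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# 80% of the rows for training, 20% held out for testing, fixed seed of 5.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

print(X_train.shape, X_test.shape)
```

With 569 rows this yields 455 training rows and 114 test rows. Passing `stratify=y` would additionally keep the class proportions equal in both subsets; the text does not say whether this option was used.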
3.2.4 Stage 4: Feature Scaling
Feature scaling is a technique used in machine learning and data preprocessing
to standardize or normalize the numerical features of a dataset. It is important to scale
features when they have different scales or units of measurement, as this can improve
the performance of certain machine learning algorithms, particularly those that rely on
distances or gradient-based optimization, such as SVM and logistic regression.
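As an illustration, standardization can be applied as sketched below. The use of `StandardScaler` is an assumption, since the text does not name the scaler that was used. The scaler is fitted on the training subset only, so that no information from the test subset leaks into preprocessing.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

scaler = StandardScaler()
# Learn each feature's mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same training statistics to the test data.
X_test_scaled = scaler.transform(X_test)

print(np.round(X_train_scaled.mean(axis=0)[:3], 6))
print(np.round(X_train_scaled.std(axis=0)[:3], 6))
```

After this step every training feature has mean 0 and standard deviation 1, while the test features are only approximately standardized because they reuse the training statistics.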
3.2.5 Stage 5: Model Selection
Model selection, also known as algorithm selection, is the stage in which a machine
learning algorithm is chosen for the dataset. It involves evaluating and comparing
different models to determine which one is likely to perform best on unseen data.
Machine learning algorithms are broadly classified into two groups: supervised
learning algorithms and unsupervised learning algorithms. A brief overview of both
groups is given below.
Supervised learning algorithm: Supervised learning is a type of machine learning
where an algorithm learns from labeled training data to make predictions or decisions.
In supervised learning, the training data consists of input features (also called
independent variables) and corresponding output labels (also called dependent
variables or targets). Supervised learning is further grouped into regression and
classification problems. A regression problem is when the output variable is a real or
continuous value, such as “salary” or “weight”. A classification problem is when the
output variable is a category, such as labelling emails as “spam” or “not spam”.
Unsupervised learning algorithms: Unsupervised learning is a type of machine
learning where an algorithm learns from unlabeled data to discover patterns,
structures, or relationships without explicit guidance or predefined output labels. In
unsupervised learning, the algorithm explores the inherent structure within the data to
find interesting patterns or groupings.
In our dataset the outcome (dependent) variable Y takes only two values, 0
(malignant) or 1 (benign), so we use classification algorithms from supervised
learning. There are various types of classification algorithms in machine learning:
1. Logistic regression
2. Support Vector Machine
3. Naïve Bayes
4. Decision Tree Algorithm
5. Random Forest Classification
6. AdaBoost
7. XgBoost
After applying the different classification models, we found that XGBoost gives
the best results for our dataset. This result does not necessarily carry over to other
datasets: the choice of model should always follow from an analysis of the dataset at
hand and a comparison of candidate models on it.
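A comparison of this kind can be sketched as follows. This is a minimal illustration under several assumptions: default hyperparameters, scaling applied only to SVM and logistic regression, and scikit-learn's `GradientBoostingClassifier` used as a stand-in for XGBoost so that the sketch needs no additional package. It is not expected to reproduce the accuracies reported in this project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    # Stand-in for XGBoost (assumption): gradient boosting from scikit-learn.
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit every candidate on the training split and score it on the held-out split.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```

Because a single 80/20 split contains only 114 test cases, small differences between models are within the noise of the split. Cross-validation would give a more stable ranking.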
3.2.5.1 Code Snippet for SVM
Figure-3.6: Code Snippet for SVM
3.2.5.2 Code Snippet for Naïve Bayes
Figure-3.7: Code Snippet for Naïve Bayes
3.2.5.3 Code Snippet for Logistic Regression
Figure-3.8: Code Snippet for Logistic regression
3.2.5.4 Code Snippet for Decision Tree
Figure-3.9: Code Snippet for Decision Tree
3.2.5.5 Code Snippet for Random Forest
Figure-3.10: Code Snippet for Random Forest
3.2.5.6 Code Snippet for AdaBoost
Figure-3.11: Code Snippet for AdaBoost
3.2.5.7 Code Snippet for XgBoost
Figure-3.12: Code Snippet for XgBoost
3.2.5.8 Confusion matrix
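A confusion matrix is a table that summarizes the performance of a classification
algorithm by counting, for each actual class, how many instances were assigned to
each predicted class. For a binary problem it contains the numbers of true positives,
true negatives, false positives, and false negatives.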
Figure-3.13: Code Snippet for Confusion Matrix
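To make the quantities concrete, the following sketch builds a confusion matrix and a classification report for a small hand-made example. The eight labels are invented for illustration and are not results from this project.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels for illustration: 1 is treated as the positive class.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all cases classified correctly
precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of actual positives that were found

print(cm)
print(accuracy, precision, recall)
print(classification_report(y_true, y_pred))
```

Here the matrix is [[2, 1], [1, 4]], giving an accuracy of 0.75 and a precision and recall of 0.8 for class 1. The classification report lists the same precision and recall per class together with the F1-score.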
Figure-3.14: Confusion Matrix of Model
3.2.5.9 Classification Report
Figure-3.15: Classification Report
CHAPTER 4
IMPLEMENTATION OF XAI
Artificial Intelligence (AI) systems, particularly machine learning models,
have achieved remarkable performance in various domains, including healthcare.
However, these models often operate as black boxes, making it difficult to understand
the reasoning behind their predictions. This lack of transparency can be problematic,
especially in high-stakes decision-making scenarios like medical diagnosis and
treatment.
Explainable Artificial Intelligence (XAI) aims to address this issue by
providing interpretable and transparent explanations for the predictions made by AI
systems. XAI techniques help unveil the underlying decision-making process,
revealing the factors and features that contribute to the model's predictions.
In the context of our breast cancer detection project, incorporating XAI
techniques is crucial for building trust and confidence among healthcare professionals.
By understanding the reasons behind the model's predictions, clinicians can make
more informed decisions and provide better patient care.
4.1 XAI Techniques
To enhance the interpretability of our machine learning models, we
implemented two popular XAI techniques: SHAP (SHapley Additive exPlanations)
and LIME (Local Interpretable Model-Agnostic Explanations).
4.1.1 SHAP (SHapley Additive exPlanations)
SHAP is a game-theoretic approach that calculates the contribution of each
feature towards the model's prediction. It is based on Shapley values from cooperative
game theory and provides a consistent and locally accurate additive feature attribution
method.
In our implementation, we used the Kernel SHAP algorithm, which estimates
Shapley values by fitting a specially weighted local linear model to the model's outputs
on perturbed inputs. The SHAP value of a feature represents its contribution to the
model's prediction for a given instance relative to the average prediction: positive
values push the prediction towards the positive class, and negative values push it
towards the other class.
The SHAP explanations were visualized using two types of plots:
1. Summary Plot: This plot provides an overview of the most important features
across the entire dataset, sorted by their SHAP values.
2. Force Plot: This plot shows the contribution of each feature for a specific
instance, making it easier to understand the model's reasoning for that
particular prediction.
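The additive attribution idea behind SHAP can be illustrated with an exact Shapley value computation on a toy function. This sketch does not use the `shap` library or the Kernel SHAP approximation: it enumerates all feature subsets, which is only practical for a handful of features. The function `toy_model`, the instance, and the all-zero baseline are hypothetical.

```python
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance by enumerating feature subsets.

    Features outside a subset are replaced by their baseline value.
    """
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                z_without = baseline.copy()
                for j in subset:
                    z_without[j] = x[j]
                z_with = z_without.copy()
                z_with[i] = x[i]
                # Marginal contribution of feature i given this subset.
                phi[i] += weight * (predict(z_with) - predict(z_without))
    return phi


def toy_model(z):
    # Hypothetical model: an interaction between features 0 and 1 plus a linear term.
    return z[0] * z[1] + 2.0 * z[2]


x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The three values are 1, 1, and 6: the interaction term is shared equally between the first two features, and the values sum to the difference between the prediction for `x` (8) and the prediction for the baseline (0). This additivity is what a force plot visualizes for a single instance.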
4.1.2 LIME (Local Interpretable Model-Agnostic Explanations)
LIME is a model-agnostic approach that aims to explain the predictions of any
machine learning model, regardless of its complexity or underlying architecture. It
does this by approximating the model's behaviour locally, around the instance being
explained, with an interpretable model.
The key idea behind LIME is to generate perturbed samples around the
instance of interest and train an interpretable model (e.g., a linear regression model)
on these perturbed samples, using the original model's predictions as the target
variable. The coefficients of this local interpretable model can then be used to explain
the original model's prediction for the instance being explained.
Here's a step-by-step breakdown of how LIME works:
1. Instance Selection: The first step is to select the instance for which an
explanation is desired. This could be a specific data point from the test set or any
other instance of interest.
2. Data Perturbation: LIME generates a set of perturbed samples by randomly
perturbing the feature values around the instance being explained. These
perturbed samples are created by introducing changes to the feature values,
while ensuring that the perturbed instances remain valid and interpretable (e.g.,
categorical features should still have valid categories).
3. Model Evaluation: The original machine learning model (the one being
explained) is then evaluated on these perturbed samples, and the model's
predictions are obtained for each perturbed instance.
4. Local Surrogate Model: LIME trains an interpretable model (e.g., a linear
regression model) on the perturbed samples, using the original model's
predictions as the target variable and weighting each sample by its proximity to
the instance being explained. The interpretable model is thereby trained to
approximate the original model's behaviour locally, around that instance.
5. Explanation Generation: The coefficients of this local interpretable model
provide an explanation for the original model's prediction for the instance being
explained. Features with larger positive coefficients contribute more to a higher
prediction value (e.g., a higher probability of malignancy in our breast cancer
detection case), while features with larger negative coefficients contribute more
to a lower prediction value (e.g., a higher probability of a benign result).
The explanations generated by LIME are local in nature, meaning they are
specific to the instance being explained and may not generalize to other instances or
the entire dataset. However, this locality is a strength of LIME, as it allows for
capturing the model's behaviour in the vicinity of the instance of interest, which can
be particularly useful for understanding individual predictions.
It's important to note that the interpretable model used by LIME (e.g., linear
regression) is an approximation of the original model's behavior, and the quality of
the explanations depends on how well the interpretable model can approximate the
original model locally.
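The steps above can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the `lime` library's implementation: `black_box` is a hypothetical stand-in for a trained classifier's probability output, and the Gaussian perturbation scale and kernel width are arbitrary choices.

```python
import numpy as np


def black_box(X):
    # Hypothetical model: probability rises with feature 0, falls with
    # feature 1, and ignores feature 2.
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 1.0 * X[:, 1])))


def lime_style_explanation(predict, x, n_samples=2000, scale=0.5,
                           kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around the instance x."""
    rng = np.random.default_rng(seed)
    # Step 2: perturb the instance with Gaussian noise.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    # Step 3: query the original model on the perturbed samples.
    y = predict(Z)
    # Weight samples by closeness to x using an exponential kernel.
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Step 4: weighted least squares with an intercept column.
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * sw[:, 0], rcond=None)
    # Step 5: the slopes are the local feature contributions.
    return coef[0], coef[1:]


instance = np.zeros(3)
intercept, contributions = lime_style_explanation(black_box, instance)
print(intercept, contributions)
```

For this instance the surrogate assigns a positive coefficient to the first feature, a negative one to the second, and a coefficient close to zero to the third, mirroring how the stand-in model uses its inputs near that point.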
In our breast cancer detection project, we used LIME to generate local
explanations for each instance in the test set. These explanations were presented as a
list of feature importance scores, indicating the contribution of each feature to the
model's prediction for that specific instance. By integrating these LIME explanations
into our frontend website, healthcare professionals can gain insights into the factors
influencing the model's predictions for individual patients, aiding in more transparent
and informed decision-making.
Figure-4.1 Code Snippet for Implementation of XAI
Figure-4.2 LIME Explanation.
CHAPTER 5
SYSTEM IMPLEMENTATION
In this project, we have developed a user-friendly frontend website that would
allow healthcare professionals to access our breast cancer detection system and its
corresponding XAI (Explainable Artificial Intelligence) explanations. The website
serves as an interface for users to input patient data, receive predictions from the
machine learning model, and visualize the XAI explanations that provide insights into
the model's decision-making process.
5.1 Frontend Website Development
The frontend website was developed using HTML, CSS, and JavaScript.
These technologies were chosen for their widespread adoption, cross-platform
compatibility, and ease of integration with the backend components.
Figure-5.1 Frontend Website Page

5.1.1 User Interface Design
The user interface was designed with a focus on simplicity and intuitive
navigation. The website features a clean and modern layout, with clear sections for
data input, prediction display, and XAI explanations.
Figure-5.1.1 User Interface
5.1.2 HTML Structure
The HTML structure of the website is organized into logical sections, including:
• Header: Displays the website's title and navigation menu.
• Data Input: A form or input fields for users to enter patient data, such as age,
tumor size, and other relevant features.
• Prediction Display: A section to display the model's prediction (benign or
malignant) based on the input data.
• LIME Feature Importance: A section to display the LIME feature importance
scores as a list or table.
5.1.3 CSS Styling
CSS was used to style the website's appearance, ensuring a visually appealing
and consistent design across different screen sizes and devices. Responsive design
principles were implemented to provide an optimal viewing experience on various
devices, including desktops, tablets, and mobile phones.
5.1.4 JavaScript Interactivity
JavaScript was utilized to enhance the user experience and provide interactive
features. For example, users can hover over bars or points in the SHAP plots to see
additional information or tooltips. The LIME feature importance list can be sorted or
filtered based on user preferences.
5.2 Backend Integration with Flask
The backend integration was accomplished using Flask, a lightweight Python
web framework. Flask was chosen for its simplicity, flexibility, and seamless
integration with Python-based machine learning and XAI libraries.
Figure-5.2 Code Snippet of backend integration with flask
5.2.1 Flask Application Structure
The Flask application follows a modular structure, with separate components
for handling routes, data processing, model loading, and XAI explanation generation.
5.2.2 Route Handling
Flask routes were defined to handle user requests, such as submitting patient
data and retrieving predictions and XAI explanations. The /predict route, for instance,
receives the user input data, passes it to the machine learning model, and generates
the prediction and XAI explanations.
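A route of this kind can be sketched as follows. The sketch is illustrative and is not the project's backend code: the two-feature input, the JSON field names, and the `predict_label` placeholder rule are hypothetical. In the real application the placeholder would be replaced by the trained model and the XAI explanation step.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical subset of input features, used only for this illustration.
FEATURES = ["mean radius", "mean texture"]


def predict_label(values):
    # Placeholder rule standing in for the trained model. It is NOT the
    # project's classifier and has no clinical meaning.
    return "malignant" if values[0] > 15.0 else "benign"


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True) or {}
    # Reject requests that do not supply every expected feature.
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return jsonify({"error": f"missing features: {missing}"}), 400
    try:
        values = [float(payload[name]) for name in FEATURES]
    except (TypeError, ValueError):
        return jsonify({"error": "feature values must be numeric"}), 400
    return jsonify({"prediction": predict_label(values)})
```

The frontend can send the form values to this route as JSON and render the returned prediction. Explanation scores could be added to the same JSON response as an additional field.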
5.2.3 Data Processing
User input data received from the frontend website is processed and formatted
to match the requirements of the machine learning model. This includes handling
missing values, scaling numerical features, and encoding categorical features.
5.2.4 Model Loading and Prediction
The trained machine learning model (XGBoost in this project; the same approach
applies to logistic regression, random forest, or SVM) is loaded into memory during
the Flask application's initialization. When a user submits data, the preprocessed
input is fed into the model to obtain the prediction (benign or malignant).
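Loading a trained model once at start-up can be sketched as below. The sketch assumes the model was serialized with Python's `pickle` module, which the text does not state. It uses a scaled logistic regression pipeline as a stand-in for the project's model and an in-memory byte string in place of a model file.

```python
import pickle

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Training side: fit a stand-in model and serialize it. A file on disk
# would normally be used instead of an in-memory byte string.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
blob = pickle.dumps(model)

# Application start-up: restore the model once and reuse it for every request.
loaded_model = pickle.loads(blob)

# Request time: predict for one preprocessed input row.
prediction = loaded_model.predict(X[:1])
print(prediction)
```

Storing the scaler and the classifier together in one pipeline ensures that the same preprocessing is applied at prediction time as during training. Pickled files should only be loaded from trusted sources.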
5.2.5 XAI Explanations Generation
The XAI explanations are generated using the SHAP and LIME techniques
implemented in Python libraries such as shap and lime. The SHAP summary plot,
SHAP force plot, and LIME feature importance scores are calculated based on the
user input data and the model's prediction.
5.2.6 Response Rendering
The Flask application generates a response containing the prediction and XAI
explanations in a format suitable for rendering on the frontend website. This response
is typically in the form of JSON or HTML, depending on the request type.
5.3 Integration and Deployment
The frontend website and the Flask backend were integrated seamlessly,
ensuring a smooth flow of data and communication between the two components. The
website was deployed on a web server or cloud platform, allowing healthcare
professionals to access the breast cancer detection system and its XAI explanations
from any device with an internet connection.
5.4 Conclusion
The development of the frontend website and the integration with the machine
learning and XAI models using Flask have resulted in a user-friendly and transparent
breast cancer detection system. Healthcare professionals can now access the system
through an intuitive interface, input patient data, and receive predictions along with
valuable XAI explanations that shed light on the model's decision-making process.
This project demonstrates the successful integration of advanced machine learning
techniques with modern web technologies, enabling better decision-making and
increased trust in AI-powered healthcare solutions.
This report provides a detailed overview of the frontend website development,
the backend integration using Flask, and the implementation of XAI techniques for
breast cancer detection.
CHAPTER 6
RESULTS AND FINDINGS
In our quest to develop an optimal classification model, we embarked on a
comprehensive evaluation of various machine learning algorithms. Our primary
objective was to identify the most effective model that not only yields high accuracy
but also provides transparency in its decision-making process. This dual focus on
performance and explainability is crucial in today's AI-driven world, where
stakeholders demand not just precise predictions but also a clear understanding of how
these predictions are derived.
We then proceeded to train and evaluate a diverse array of classification models:
1. Support Vector Machines (SVM): Known for their effectiveness in
high-dimensional spaces and their versatility with different kernel functions, SVM
achieved an accuracy of 93.85%. This respectable performance reflects
SVM's ability to find maximum-margin hyperplanes that separate classes, even in
complex feature spaces.
2. Naive Bayes: This probabilistic classifier, based on Bayes' theorem, surprised us
with its 94.73% accuracy. Despite its "naive" assumption of feature independence,
Naive Bayes' performance highlights its suitability for our dataset. Its efficiency
and speed make it an excellent baseline model.
3. Logistic Regression: A cornerstone of binary classification, Logistic Regression
models the log-odds of the target class as a linear function of the features. It
achieved 94.32% accuracy, demonstrating robust performance when the relationship
between features and the target is approximately linear.
4. Decision Trees: With an accuracy matching Naive Bayes at 94.73%, Decision
Trees brought the added advantage of interpretability. Their flowchart-like
structure allows stakeholders to follow the decision path, making them invaluable
when transparency is key.
5. Random Forest: By aggregating multiple decision trees, Random Forest
harnesses the wisdom of the crowd. Its 97.36% accuracy is a clear improvement
over the individual models above. This ensemble method's strength lies in its ability to
reduce overfitting by averaging many trees, each trained on a different bootstrap
sample of the data.
6. AdaBoost (Adaptive Boosting): Another ensemble method, AdaBoost
sequentially trains weak learners (often decision stumps), focusing more on
previously misclassified samples. Its 94.73% accuracy, while commendable,
suggests that our data might not have many "hard-to-classify" instances that
AdaBoost excels at identifying.
7. XGBoost (eXtreme Gradient Boosting): The best performer in our analysis,
XGBoost delivered 98.24% accuracy. This widely used implementation of gradient
boosting combines a regularized model formulation with systems-level optimization.
Its superior performance can be attributed to its ability to handle complex feature
interactions, its built-in cross-validation utilities, and its resilience to
overfitting through regularization and tree pruning.
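The comparison above can be sketched in a few lines of Python. The block below is a minimal illustration, not the project's actual training script. It assumes scikit-learn is installed, uses scikit-learn's bundled copy of the Wisconsin diagnostic breast cancer data as a stand-in for the project dataset, picks an arbitrary stratified 80/20 split, and substitutes scikit-learn's GradientBoostingClassifier for XGBoost so that no extra package is required. The accuracies it prints will therefore differ from the figures reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's copy of the Wisconsin diagnostic set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale-sensitive models are wrapped in a pipeline with a StandardScaler.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    # Stand-in for XGBoost (assumption): scikit-learn's own gradient boosting.
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit every model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {acc:.4f}")
```

A single train/test split gives a noisy ranking on a dataset of this size, so repeating the comparison with cross-validation would be a natural next step before drawing firm conclusions.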
Given these results, XGBoost is the most accurate of the models we evaluated.
However, in the realm of high-stakes decision-making, accuracy alone is insufficient.
Explainable AI (XAI) addresses this gap by demystifying the often
opaque nature of complex models like XGBoost.
Implementing XAI techniques, such as SHAP (SHapley Additive
exPlanations) values or LIME (Local Interpretable Model-agnostic Explanations), we
peeled back the layers of our XGBoost model. SHAP values, for instance, quantify
each feature's contribution to a prediction, based on cooperative game theory. This
allows us to understand not just which features are important globally (through feature
importances) but how each feature contributes to individual predictions.
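To make the game-theoretic idea concrete, the sketch below computes exact Shapley values for a single prediction by enumerating every coalition of features. It is an illustration under stated assumptions: `toy_model`, its coefficients, the three feature names, and the baseline values are all hypothetical, and brute-force enumeration is only practical for a handful of features. Libraries such as SHAP use much faster algorithms for tree models.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are replaced by their baseline value,
    and each feature's marginal contribution is averaged over all
    coalitions with the standard Shapley weights.
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(z)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical risk score with one interaction term (illustration only).
    radius, texture, concavity = z
    return 0.1 * radius + 0.05 * texture + 0.5 * concavity + 0.02 * radius * concavity


x = [20.0, 15.0, 0.3]          # instance to explain (hypothetical values)
baseline = [14.0, 19.0, 0.1]   # reference instance (hypothetical values)
phi = shapley_values(toy_model, x, baseline)

for name, contribution in zip(["radius", "texture", "concavity"], phi):
    print(f"{name:10s} {contribution:+.4f}")
print("sum of contributions:", round(sum(phi), 4))
print("f(x) - f(baseline):  ", round(toy_model(x) - toy_model(baseline), 4))
```

The last two printed values agree: the contributions always add up to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity property that makes SHAP explanations easy to read.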
The implications of this are profound. When our model classifies an instance,
we can now provide a narrative: "This instance was classified as Class A primarily
because Feature X had a high value, which typically correlates with Class A.
However, Feature Y, which usually indicates Class B, had a moderating effect." Such
explanations transform our model from a black box to a transparent advisor, fostering
trust among stakeholders.
Moreover, XAI helps in model refinement. By understanding which features
drive predictions, we can engage domain experts to validate these relationships. If a
feature's importance contradicts domain knowledge, it might indicate data leakage or
the need for feature engineering. XAI also aids in fairness and bias detection. If
sensitive attributes like gender or race disproportionately influence predictions, we
can take corrective actions, aligning our model with ethical AI principles.
In conclusion, our evaluation of various classification models culminated
in the selection of XGBoost, not just for its superior 98.24% accuracy but also for its
compatibility with XAI techniques. This combination delivers a solution that is both highly
accurate and interpretable. In an age where AI increasingly informs critical decisions,
the combination of performance and explainability is not just beneficial but imperative.
Our XGBoost model, explained through XAI, is designed
to make precise, transparent, and fair predictions that can drive informed
decision-making and foster stakeholder confidence.
CHAPTER 7
LOCAL DATA ANALYSIS
Kashmir faces a growing silent threat: a surge in cancer cases. Data from
two prominent hospitals, Sher-i-Kashmir Institute of Medical Sciences (SKIMS)
and Shri Maharaja Hari Singh (SMHS) Hospital, paints a concerning picture.
SKIMS alone has documented a staggering 44,112 cancer cases from 2013 to
2023. This immense number, alongside the 6,379 cases reported by SMHS
hospital from 2017 to 2023, underscores the magnitude of the public health crisis
unfolding in the Valley.
The sheer volume of cancer patients seeking treatment at these hospitals
underscores the scale of the problem. The concentration of cases at SKIMS further
emphasizes the severity of the issue. A detailed breakdown of their data exposes a
troubling trend. The number of registered cases has trended upward over the past
decade, though not in every year. In 2014, 3,930 patients were diagnosed, followed
by 4,417 in 2015, 4,320 in 2016, 4,352 in 2017, and 4,816 in 2018. After a dip
in 2019 (3,814 cases), the numbers climbed again, reaching 4,727 in 2021
and 5,294 in 2022. As of September 2023, the year already accounts
for 4,095 registered cases. This upward trajectory signifies a pressing need for
in-depth research to understand the factors contributing to this trend.
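The year-on-year movement in the SKIMS figures quoted above can be checked with a short calculation. The sketch below uses only the yearly counts stated in the text. Figures for 2013 and 2020 are not given there, so those years are left out, and the partial 2023 count is excluded because it covers only part of the year.

```python
# Yearly SKIMS registrations as quoted in the text.
# 2013 and 2020 are not given there and are omitted; partial 2023 is excluded.
cases = {2014: 3930, 2015: 4417, 2016: 4320, 2017: 4352,
         2018: 4816, 2019: 3814, 2021: 4727, 2022: 5294}

# Change from the previous year, only where the previous year is available.
changes = {year: cases[year] - cases[year - 1]
           for year in cases if year - 1 in cases}

# Overall relative change between the first and last full years listed.
overall_pct = 100 * (cases[2022] - cases[2014]) / cases[2014]

for year, delta in changes.items():
    print(f"{year - 1} -> {year}: {delta:+d}")
print(f"2014 -> 2022: {overall_pct:+.1f}%")
```

The calculation makes the sign and size of each yearly change explicit, which is useful when describing the series as a trend rather than a uniform increase.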
The human cost of this crisis is immense. These statistics translate to
thousands of individuals and families grappling with the emotional and physical
toll of cancer. The surge puts immense pressure on Kashmir's already strained
healthcare infrastructure. Several factors could be at play in this rise.
Environmental pollution, exposure to carcinogens, lifestyle changes like
increased tobacco use and unhealthy dietary habits, and potential limitations in
early detection and preventative measures all warrant investigation. Additionally,
genetic predispositions and the aging population could be contributing factors.
Combating this crisis requires a multi-pronged approach. Strengthening cancer
surveillance through robust registration systems is essential for understanding the
true burden of the disease and identifying trends. Enhanced data collection and
analysis can guide informed decision-making for prevention and treatment
strategies.
Investing in public awareness campaigns about cancer risk factors, early
detection methods, and the importance of healthy lifestyles is crucial. Educational
campaigns can empower individuals to make informed choices and reduce their
risk of developing cancer.
Expanding access to screening programs, particularly for high-risk populations,
can lead to earlier interventions and potentially improve patient outcomes.
Additionally, equipping healthcare facilities with advanced diagnostic tools and
ensuring access to specialized cancer treatment, including qualified oncologists,
radiologists, and nurses, is essential.
Finally, dedicated research efforts are needed to investigate the specific
causes of the rise in cancer cases in Kashmir. Understanding the regional factors
at play is critical for developing targeted prevention and treatment strategies.
By acknowledging the severity of the issue, investing in research,
strengthening healthcare infrastructure, and promoting public awareness, stakeholders
can work together to combat this silent epidemic. This collaborative effort can pave
the way for a healthier future for the people of Kashmir.
Figure-6.1 SKIMS Data Analysis
A silent and deadly threat is gripping Jammu and Kashmir: a surge in
cancer cases. Data from a leading medical facility, the Department of Radiation
Oncology at SMHS Hospital GMC Srinagar, paints a concerning picture. Since
2017, the department has witnessed a staggering increase in new patient
registrations, culminating in a record-breaking 1,640 cases in 2023. This data
reveals a deeply worrying trend – a steady year-on-year climb from a mere 491
cases in 2017 to surpassing the 1,000 mark in both 2021 and 2022, with a
significant jump in 2023. The high number of daily registrations, averaging
around 5-6 new patients, further emphasizes the urgency of the situation.
This trend suggests a potential explosion of cancer cases across Jammu
and Kashmir. With an increasing number of patients requiring treatment, the
region's healthcare infrastructure, already struggling, could face immense
pressure. While the data from SMHS paints a concerning picture, it's just one
piece of the puzzle. A more comprehensive understanding necessitates data from
other hospitals across the region.
In-depth research is critical to identify the root causes behind this alarming rise.
Environmental factors like pollutants, unhealthy lifestyle choices like smoking
and poor diet, limitations in early detection programs, and even genetic
predispositions could all be playing a role.
Combating this growing crisis requires a multi-pronged approach.
Strengthening cancer surveillance systems across hospitals is essential to
understand the true scope of the issue and guide prevention and treatment
strategies. Public awareness campaigns can educate people about risk factors,
early detection methods, and healthy lifestyle choices. Expanding access to
screening programs, particularly for high-risk populations, can lead to earlier
interventions and potentially improve patient outcomes. Upgrading healthcare
facilities with advanced diagnostic tools and ensuring access to specialized cancer
treatment is crucial. Finally, dedicated research efforts are critical to investigate
the specific causes unique to Jammu and Kashmir for developing targeted
prevention and treatment strategies.
By acknowledging the severity of the situation, investing in research,
strengthening healthcare infrastructure, promoting public awareness, and
implementing effective prevention and early detection strategies, stakeholders
can work together to combat this silent epidemic. A collaborative effort involving
healthcare professionals, government agencies, community leaders, and research
institutions is crucial to pave the way for a healthier future for the people of
Jammu and Kashmir. Professor Manzoor Ahmad's statement highlights the
urgency further, emphasizing the record number of new cases. The fight against
cancer requires immediate and collective action to ensure better care for countless
patients and a healthier future for all residents of Jammu and Kashmir.
Figure-6.2 SMHS Data Analysis
Prof. Manzoor Ahmad, Head of the Department of Radiation Oncology at
GMC Srinagar, said there has been a sharp increase in new cancer cases in J&K, with
the department registering a record 1,640 cases in 2023, of which 911
were males and 729 were females.
Figure-6.3 Male Female Ratio
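The split quoted above can be turned into shares with a few lines of arithmetic. The sketch below uses only the three numbers from the statement (1,640 total, 911 males, 729 females) and first confirms that the two groups add up to the reported total.

```python
# 2023 registrations at the Department of Radiation Oncology, as quoted in the text.
total, males, females = 1640, 911, 729

# Consistency check: the two groups should account for every registered case.
assert males + females == total

male_share = 100 * males / total
female_share = 100 * females / total
male_to_female = males / females

print(f"Male share:   {male_share:.1f}%")
print(f"Female share: {female_share:.1f}%")
print(f"Male-to-female ratio: {male_to_female:.2f} : 1")
```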
Professor Manzoor Ahmad's comments shed light on the specific types of
cancers most commonly affecting males and females in Jammu and Kashmir, and
the potential contributing factors. Here's a breakdown of his insights:
Lung Cancer in Males: This is the most concerning finding. Lung cancer has a
strong link to tobacco use, and its prevalence suggests widespread use of
tobacco products in society. Prof. Ahmad emphasizes the need for extensive
public awareness campaigns to educate people about the dangers of tobacco use
and encourage them to quit.
Breast Cancer in Females: While concerning, breast cancer might be partially
preventable through lifestyle changes. Prof. Ahmad suggests that lack of
physical activity and obesity could be contributing factors. Encouraging healthy
habits like regular exercise and maintaining a healthy weight could potentially
help reduce the risk of breast cancer in women.
Preventability Through Lifestyle Changes: Prof. Ahmad's message is clear:
both lung cancer in males and breast cancer in females might be preventable
through lifestyle modifications. This highlights the importance of promoting
healthy habits across the population.
Professor Ahmad's comments offer valuable insights into the types of
cancers prevalent in Jammu and Kashmir, and crucially, they point towards
potentially modifiable risk factors. By addressing these factors through public
awareness campaigns and promoting healthy lifestyles, there's a chance to make
a significant impact on the rising cancer burden in the region.
Figure-6.4 Cancer Comparison
CHAPTER 8
FUTURE SCOPE
7.1 Data Expansion
Current breast cancer detection and risk assessment models are valuable tools,
but they can potentially become even more accurate and effective by incorporating
additional data sources. This is like casting a wider net – the more information we
have, the better we can understand an individual's risk and ultimately improve their
chances of early detection and successful treatment.
Unveiling the Lifestyle Connection: Many of our daily habits and choices can
influence our health, and breast cancer is no exception. By incorporating data on
lifestyle factors like diet, physical activity levels, smoking history, and even exposure
to environmental toxins, the model can paint a more complete picture of an
individual's risk profile. Imagine the model as a detective – the more clues it has (like
dietary habits and exercise routines), the better it can solve the mystery of a person's
susceptibility to breast cancer.
Diet: What we eat can play a role in our overall health, and research suggests that
certain dietary patterns might be linked to breast cancer risk. By including information
about a person's diet, the model can potentially identify individuals who might benefit
from dietary adjustments to lower their risk.
Physical Activity: Regular exercise has numerous health benefits, and studies have
shown a link between physical activity and a reduced risk of breast cancer. The model
can factor in a person's activity level to provide a more personalized risk assessment.
Smoking: The dangers of smoking are well-documented. It is a significant risk
factor for several cancers, and studies have associated it with an increased risk of
breast cancer as well. Including smoking history in the
data set allows the model to account for this risk factor.
Environmental Exposures: Exposure to certain environmental toxins has been
linked to an increased risk of cancer. By incorporating data on environmental factors,
the model can potentially identify individuals who might be at higher risk due to their
surroundings.
With this additional lifestyle data, the model can become more sophisticated in its risk
assessments, potentially leading to earlier detection and better preventive measures
for individuals at higher risk.
Decoding the Genetic Fingerprint: Advances in genetics have revealed the
significant role genes play in cancer susceptibility. Certain mutations in genes like
BRCA1 and BRCA2 are known to greatly increase the risk of breast cancer. By
integrating genetic sequencing data into the model, it can identify individuals with
these genetic predispositions and classify them as high-risk, allowing for more
targeted screening and prevention strategies.
Imagine the model being able to read an individual's genetic code, looking for specific
red flags that might indicate a higher risk. This allows for a more personalized
approach to breast cancer prevention and early detection.
The Power of Combining Forces: By combining data from a person's medical
history, lifestyle choices, and genetic makeup, the machine learning model can build
a more comprehensive picture of their individual risk profile. This allows for a more
nuanced and potentially more accurate assessment compared to relying solely on
traditional methods. With this enhanced understanding, doctors can tailor screening
and prevention strategies for each patient, potentially leading to earlier detection,
better treatment outcomes, and ultimately, saving lives.
7.2 Advanced XAI Techniques
While current methods like SHAP and LIME offer a window into how the AI
model makes decisions about breast cancer detection, there's room for further
exploration. By delving into more advanced Explainable Artificial Intelligence (XAI)
techniques, we can unlock a deeper understanding of the model's reasoning, fostering
greater trust and collaboration between healthcare professionals, patients, and the AI
system itself. Here's how these advanced techniques can illuminate the "black box" of
AI:
Simulating Change: Counterfactual Explanations. Imagine asking the AI model,
"What if...?" Counterfactual explanations allow us to do just that. They provide
insights into how a specific patient's cancer risk prediction would change if certain
factors were altered. This is like playing a game of "what-if" with the model's
predictions. For example, the model could answer the question: "If a patient with a
high-risk prediction exercised regularly and maintained a healthy weight, how might
their risk change?" These explanations are powerful because they point towards potential
interventions or preventive measures. By understanding how adjustments to lifestyle
factors or other variables might influence the prediction, doctors can make more
informed decisions about a patient's care plan.
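The "what-if" idea can be illustrated with a very small search. The sketch below is an assumption-laden toy, not part of the project: the logistic risk model, its weights, the feature names `bmi` and `weekly_exercise_hours`, and the patient values are all hypothetical. It changes one feature in the risk-lowering direction, step by step, until the predicted risk falls below a threshold. Dedicated counterfactual methods search over several features at once and penalize implausible changes.

```python
import math


def risk_score(features, weights, bias):
    """Logistic risk score in [0, 1] for a dict of feature values."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def single_feature_counterfactual(features, weights, bias, name,
                                  step, max_steps=1000, threshold=0.5):
    """Move one feature in the risk-lowering direction until the score
    drops below the threshold. Returns the altered value, or None if the
    threshold is not reached within max_steps."""
    direction = -1.0 if weights[name] > 0 else 1.0
    candidate = dict(features)
    for _ in range(max_steps):
        if risk_score(candidate, weights, bias) < threshold:
            return candidate[name]
        candidate[name] += direction * step
    return None


# Hypothetical model and patient (illustration only).
weights = {"bmi": 0.15, "weekly_exercise_hours": -0.3}
bias = -4.0
patient = {"bmi": 32.0, "weekly_exercise_hours": 1.0}

original = risk_score(patient, weights, bias)
needed_hours = single_feature_counterfactual(
    patient, weights, bias, "weekly_exercise_hours", step=0.5)

print(f"Original risk score: {original:.3f}")
print(f"Exercise hours at which the score drops below 0.5: {needed_hours}")
```

The output reads as a counterfactual statement: holding everything else fixed, the prediction would change class if the chosen feature reached the reported value.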
Speaking the Doctor's Language: Currently, the model might explain its reasoning
based on individual data points like tumor size or cell measurements. While valuable,
this can be technical jargon for some healthcare professionals. Concept-based
explanations bridge this gap. They translate the model's decision-making process into
human-understandable concepts. Instead of focusing on raw numbers, the explanation
might say something like: "The model predicts a high risk due to the presence of
aggressive tumor characteristics." This shift towards concepts like "tumor
aggressiveness" or "cellular abnormalities" allows doctors to grasp the model's
reasoning more intuitively. This can facilitate better communication and collaboration
between healthcare professionals and the AI system, ultimately leading to better
patient care.
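One simple way to move from raw features towards concepts is to add up per-feature attributions within clinician-defined groups. The sketch below shows only that aggregation step and is not a full concept-based method such as concept activation vectors. The attribution values, the feature names, and the mapping from features to concepts are hypothetical.

```python
from collections import defaultdict


def concept_attributions(feature_attributions, concept_map):
    """Sum per-feature attributions within each concept group and
    return the groups ordered by absolute contribution."""
    totals = defaultdict(float)
    for feature, contribution in feature_attributions.items():
        totals[concept_map.get(feature, "other")] += contribution
    return dict(sorted(totals.items(), key=lambda kv: abs(kv[1]), reverse=True))


# Hypothetical per-feature attributions for one prediction (e.g. SHAP-style values).
feature_attributions = {
    "radius_mean": 0.5,
    "area_mean": 0.25,
    "concavity_mean": 0.75,
    "concave_points_mean": 0.5,
    "texture_mean": -0.25,
}

# Hypothetical grouping of measurements into human-readable concepts.
concept_map = {
    "radius_mean": "tumor size",
    "area_mean": "tumor size",
    "concavity_mean": "shape irregularity",
    "concave_points_mean": "shape irregularity",
    "texture_mean": "texture",
}

summary = concept_attributions(feature_attributions, concept_map)
for concept, contribution in summary.items():
    print(f"{concept:20s} {contribution:+.2f}")
```

Because attributions of the SHAP kind are additive, the grouped totals still sum to the same overall effect, so the concept-level view loses detail but not consistency.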
Seeing is Believing: Sometimes, a picture is worth a thousand words. Advanced
visual explanation techniques, like saliency maps or attention visualization, can
provide just that. Imagine a heatmap highlighting specific areas on a mammogram
that the model deems crucial for its prediction. This allows doctors to see visually
which regions of the image the AI is focusing on when making its decision. These
visual explanations are particularly valuable when dealing with image data like
mammograms or pathology slides. By pinpointing critical areas or patterns, they can
help doctors understand the AI's reasoning and potentially identify features they might
have missed. This fosters trust and collaboration, as doctors can see the thought
process behind the AI's predictions.
The Power of Transparency: By incorporating these advanced XAI techniques, the
breast cancer detection system can become more transparent. This fosters trust in the
AI system among healthcare professionals and patients alike. With a deeper
understanding of the model's reasoning, doctors can make more informed decisions
about patient care. This transparency can also empower patients to engage in more
informed discussions about their health and treatment options. Ultimately, these
advancements in XAI hold the potential to improve decision-making, patient
outcomes, and communication in the fight against breast cancer.
CHAPTER 9
CONCLUSION
Breast cancer looms large as a global threat to women's health. The World
Health Organization paints a concerning picture, highlighting the immense challenge
with staggering statistics. Early detection is crucial, significantly improving patient
prognosis and survival rates. Traditional screening methods like mammograms, while
demonstrably reducing mortality, have limitations. False positives, leading to
unnecessary biopsies and psychological distress, are a significant drawback.
This report dives into the transformative potential of Artificial Intelligence
(AI) and Machine Learning (ML) in revolutionizing breast cancer detection. By
harnessing the power of AI and ML algorithms, we stand at the precipice of a
paradigm shift, offering a more nuanced, accurate, and ultimately life-saving approach
to early detection.
The cornerstone of successful cancer management lies in early diagnosis.
Early detection allows for less invasive treatment options and significantly improves
patient outcomes. Studies have shown a direct correlation between the stage of cancer
at diagnosis and survival rates. Cancers detected in their early stages have a much
higher chance of successful treatment compared to those detected in later stages.
However, traditional screening methods often fall short in their ability to
consistently identify early-stage breast cancer. Mammography, the most widely used
screening tool, has limitations. While it has demonstrably reduced breast cancer
mortality rates, it is not without drawbacks. False positives, leading to unnecessary
biopsies and associated psychological distress, are a significant concern. Additionally,
mammograms can be less effective in women with dense breast tissue, a common
occurrence in younger women.
This report presents AI and ML as a game-changer in the fight against breast
cancer. AI algorithms are computer programs designed to mimic human intelligence,
while ML algorithms can learn and improve from data without explicit programming.
By leveraging these powerful tools, we can unlock a new era of breast cancer
detection.
AI algorithms can analyze large healthcare data sets encompassing
medical images, such as X-rays, and clinical information. This multifaceted analysis
allows them to identify subtle patterns indicative of early-stage breast cancer that
might be missed by traditional methods. Imagine an AI algorithm trained on millions of
mammogram images and patient records. This AI can then analyze a new patient's data,
searching for even the most minute anomalies that could be indicative of cancer. By
identifying these subtle patterns, AI has the potential to detect breast cancer at its
earliest stages, significantly improving patient outcomes.
A critical strength of this research lies in its meticulous selection of the most
effective ML algorithm. Not all algorithms are created equal, and high accuracy is
paramount in cancer screening. This report emphasizes the importance of selecting an
algorithm that surpasses even the most stringent reliability standards. This meticulous
approach fosters trust in the technology and paves the way for its widespread adoption
in clinical settings.
The selection process involves rigorous testing and validation of various ML
algorithms on large datasets of mammograms and patient data. The chosen algorithm
must demonstrate exceptional accuracy in differentiating between healthy and
cancerous tissue. Furthermore, the algorithm's performance needs to be consistent
across diverse patient populations, ensuring equitable access to this potentially life-
saving technology.
The report acknowledges the transformative potential of AI and ML in
mitigating the challenges associated with false positives and negatives in
mammography. However, as the field progresses, Explainable AI (XAI) becomes
crucial. XAI ensures transparency in decision making, fostering trust among
healthcare professionals. With XAI, healthcare providers can understand the rationale
behind the AI's recommendations, leading to more informed clinical decisions.
Imagine a scenario where an AI system flags a patient's mammogram as
suspicious. Traditionally, this might lead to an automatic recommendation for a
biopsy. However, with XAI, the doctor can delve deeper. The XAI system can explain
the AI's reasoning, highlighting the specific features in the mammogram that triggered
the alert. This transparency empowers the doctor to make a more informed decision,
potentially avoiding unnecessary biopsies while ensuring early detection of true
positives.
The envisioned user-friendly interface acts as a bridge between healthcare
professionals and the ML models. This interface seamlessly integrates AI into clinical
workflows, empowering healthcare providers with real-time feedback and
visualizations. Imagine a doctor during a patient consultation, armed with real-time
insights from the AI model. The interface can display the mammogram alongside the
AI's analysis, highlighting potential areas of concern. This empowers the doctor to
have a more informed discussion with the patient and tailor a personalized approach
to care.
This initiative aligns perfectly with the evolving healthcare landscape,
championing personalized medicine and data-driven decision-making. By addressing
contemporary challenges like false positives, the project positions itself as a key
player in the global fight against breast cancer. Improved detection translates to better
prognoses and ultimately, saves lives. The integration of AI and ML, coupled with the
best-performing algorithm selection, XAI implementation, and a user-friendly
interface, represents a revolutionary step in breast cancer diagnostics. This is not just
a technological advancement; it's a critical leap towards alleviating the global burden
of breast cancer and offering hope for a healthier future for women around the world.
