INTEGRATED DATASET METHOD FOR BREAST CANCER PREDICTION
Tongjai Yampaka, Prabhas Chongstitvatana
Department of Computer Engineering
Chulalongkorn University
Keywords: integrated cancer data, breast cancer prediction
Abstract
Background
The diagnostic or investigational procedure for breast cancer generally
starts with a mammogram, an ultrasound, or both. The doctor reading the
mammogram looks for different types of breast tumor. If the mammogram shows
a suspicious mass, the doctor orders a biopsy. If the biopsy finds invasive
breast cancer, the stage is estimated to help determine the prognosis and
guide the treatment plan. This work proposes integrating a mammogram dataset
and a biopsy dataset. Latent variables are used to link the two datasets so
that predictive models can be inferred from the combined data.
Methodology
Latent variables are not directly observed but are inferred from other
variables that are observed. The latent variable in the mammogram dataset is
evaluated from the severity of the tumor factors and classified into three
intervals. The latent variable in the biopsy dataset, called staging, is
evaluated from tumor size and lymph node status.
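As an illustration of classifying a severity value into three intervals, the following is a minimal sketch only. It assumes the severity has already been reduced to a single score between 0 and 1; the cut points, the level names, and the function `severity_level` are hypothetical and are not taken from this work.

```python
from bisect import bisect_right

# Hypothetical cut points on a 0-1 severity score; the paper does not state them.
CUT_POINTS = (1 / 3, 2 / 3)
# Hypothetical names for the three intervals.
LEVELS = ("low", "medium", "high")


def severity_level(score, cut_points=CUT_POINTS):
    """Map a numeric severity score to one of three ordered intervals."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # bisect_right counts how many cut points are <= score,
    # which is exactly the index of the interval the score falls in.
    return LEVELS[bisect_right(cut_points, score)]


for s in (0.1, 0.5, 0.9):
    print(s, severity_level(s))
```

Any monotone binning rule would serve the same purpose; the point is only that each record ends up with one discrete latent level.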
The mammogram dataset alone has 445 elements and the biopsy dataset alone
has 198 elements. The combined dataset has 6,326 elements. The mammogram
factors are the patient’s age, tumor shape, margin, density, and BI-RADS.
The biopsy factors are ten cell characteristics (used for grading), tumor
size, lymph node status, outcome, and time (time to recurrence or
disease-free survival).
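One way a combined dataset can become larger than either input is a many-to-many pairing of records that share the same latent level. The sketch below illustrates that idea under stated assumptions: the joining rule, the field names, the `latent` key, and the toy records are all invented for illustration, and the exact integration procedure of this work is not described here.

```python
from collections import defaultdict


def join_on_latent(mammogram_rows, biopsy_rows, key="latent"):
    """Pair every mammogram record with every biopsy record
    that carries the same latent level (hypothetical joining rule)."""
    by_level = defaultdict(list)
    for row in biopsy_rows:
        by_level[row[key]].append(row)

    combined = []
    for m in mammogram_rows:
        for b in by_level.get(m[key], []):
            # Prefix the fields so the origin of each factor stays visible.
            merged = {f"mammo_{k}": v for k, v in m.items() if k != key}
            merged.update({f"biopsy_{k}": v for k, v in b.items() if k != key})
            merged[key] = m[key]
            combined.append(merged)
    return combined


# Toy records, invented for illustration only.
mammogram = [
    {"age": 45, "latent": "low"},
    {"age": 62, "latent": "high"},
    {"age": 58, "latent": "high"},
]
biopsy = [
    {"tumor_size": 1.2, "latent": "low"},
    {"tumor_size": 3.5, "latent": "high"},
    {"tumor_size": 4.1, "latent": "high"},
]

combined = join_on_latent(mammogram, biopsy)
print(len(combined))  # 1 "low" pair plus 2 x 2 "high" pairs
```

With three records on each side, the toy join yields five combined records, which shows how pairing on a shared level can multiply the row count.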
The predictive model is divided into four modelling steps: 1) staging
prediction, 2) lymph node status prediction, 3) outcome prediction, and 4)
time prediction.
Results using machine learning
Rule induction is used to infer the prediction models. The measurements are
accuracy, sensitivity, specificity, positive predictive value (PPV), and
negative predictive value (NPV).
1) The early staging prediction model from mammogram screening produces 11
decision rules, with specificity 95.60%, sensitivity 90.73%, PPV 91.83%, and
NPV 95%.
2) The lymph node status prediction model produces 13 decision rules, with
specificity 97.34%, sensitivity 94.69%, PPV 94.35%, and NPV 97.30%.
3) The relapse prediction model produces 23 rules, with specificity 82.49%,
sensitivity 99.42%, PPV 93.61%, and NPV 98.24%.
4) The time-to-recurrence model produces 23 rules, with specificity 96.64%,
sensitivity 90.71%, PPV 90.71%, and NPV 96.64%.
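For reference, the five measurements follow from the standard confusion-matrix definitions. The sketch below computes them for a binary outcome; the function name `diagnostic_metrics` and the example counts are hypothetical and are not results from this work.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity, PPV and NPV as percentages.

    tp, fp, tn, fn are the counts of true positives, false positives,
    true negatives and false negatives.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),  # true positive rate
        "specificity": 100.0 * tn / (tn + fp),  # true negative rate
        "ppv": 100.0 * tp / (tp + fp),          # precision
        "npv": 100.0 * tn / (tn + fn),
    }


# Invented counts, for illustration only.
metrics = diagnostic_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

For a prediction target with more than two classes, these quantities would need to be computed per class (one class against the rest); how the multi-class case was handled is not stated here.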