Tongjai Yampaka, Prabhas Chongstitvatana

Department of Computer Engineering
Chulalongkorn University

Keywords: integrated cancer data, breast cancer prediction



Diagnostic investigation of breast cancer generally starts with a mammogram, an ultrasound, or both. The doctor reading the mammogram looks for different types of breast tumor. If the mammogram shows a suspicious mass, the doctor orders a biopsy. If the biopsy finds invasive breast cancer, the stage is estimated to help determine the prognosis and guide the treatment plan. This work proposes integrating a mammogram dataset with a biopsy dataset. Latent variables are used to link the two datasets so that predictive models can be inferred from the combined data.


Latent variables are not directly observed but are inferred from other variables that are observed. The latent variable in the mammogram dataset is evaluated from the severity of the tumor factors and classified into three intervals. The latent variable in the biopsy dataset, called staging, is evaluated from tumor size and lymph node status.
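As an illustration of classifying a score into three intervals, the following minimal sketch discretizes a numeric severity value with two cut points. The cut points, the label names, and the function names are hypothetical; the abstract does not state how the severity score is computed or where the interval boundaries lie.

```python
from bisect import bisect_right

# Hypothetical cut points: they split the score axis into three intervals,
# score < 2.0, 2.0 <= score < 4.0, and score >= 4.0.
CUTS = (2.0, 4.0)
LABELS = ("low", "medium", "high")  # hypothetical interval names


def to_interval(score, cuts=CUTS):
    """Return the index of the interval that contains `score`."""
    # bisect_right counts how many cut points are <= score.
    return bisect_right(cuts, score)


def severity_class(score, cuts=CUTS, labels=LABELS):
    """Map a numeric severity score to one of the interval labels."""
    return labels[to_interval(score, cuts)]


if __name__ == "__main__":
    for s in (1.0, 2.0, 4.5):
        print(s, severity_class(s))
```

A value that falls exactly on a cut point is assigned to the upper interval; that choice is also an assumption.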

The mammogram dataset alone has 445 elements and the biopsy dataset alone has 198 elements; the combined dataset has 6,326 elements. The mammogram factors are the patient’s age, tumor shape, margin, density, and BI-RADS. The biopsy factors are ten cell characteristics (used for grading), tumor size, lymph node status, outcome, and time (time to recurrence or disease-free survival).
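One possible reading of the integration step is a join in which each mammogram record is paired with every biopsy record that shares its latent class. The sketch below shows that reading on toy data; the field names, the `latent` key, and the pairing rule are assumptions for illustration, not the procedure confirmed by this work, and the toy records are not taken from either dataset.

```python
from collections import defaultdict


def join_on_latent(mammo_rows, biopsy_rows, key="latent"):
    """Pair every mammogram row with every biopsy row of the same latent class."""
    # Index biopsy rows by latent class for fast lookup.
    by_class = defaultdict(list)
    for b in biopsy_rows:
        by_class[b[key]].append(b)

    combined = []
    for m in mammo_rows:
        for b in by_class.get(m[key], []):
            # Prefix field names so the two sources cannot collide.
            row = {f"mammo_{k}": v for k, v in m.items() if k != key}
            row.update({f"biopsy_{k}": v for k, v in b.items() if k != key})
            row[key] = m[key]
            combined.append(row)
    return combined


if __name__ == "__main__":
    # Toy records with hypothetical fields and latent classes "A" and "B".
    mammo = [
        {"age": 45, "shape": "round", "latent": "A"},
        {"age": 60, "shape": "irregular", "latent": "A"},
        {"age": 52, "shape": "oval", "latent": "B"},
    ]
    biopsy = [
        {"tumor_size": 1.5, "lymph_nodes": 0, "latent": "A"},
        {"tumor_size": 3.0, "lymph_nodes": 2, "latent": "B"},
        {"tumor_size": 2.5, "lymph_nodes": 1, "latent": "B"},
    ]
    print(len(join_on_latent(mammo, biopsy)))
```

Under this pairing rule the size of the combined set depends on how many records of each source fall into each latent class, so it is generally neither the sum nor the full product of the two dataset sizes.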

The predictive model is divided into four modelling steps: 1) staging prediction, 2) lymph node status prediction, 3) outcome prediction, and 4) time prediction.

Results using machine learning

Rule induction is used to infer the prediction models. The measurements are accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). 1) The early staging prediction model, built from mammogram screening, produces 11 decision rules with specificity 95.60, sensitivity 90.73, PPV 91.83, and NPV 95. 2) The lymph node status prediction model produces 13 decision rules with specificity 97.34, sensitivity 94.69, PPV 94.35, and NPV 97.30. 3) The relapse prediction model produces 23 rules with specificity 82.49, sensitivity 99.42, PPV 93.61, and NPV 98.24. 4) The time-to-recurrence model produces 23 rules with specificity 96.64, sensitivity 90.71, PPV 90.71, and NPV 96.64.
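For reference, the five measurements follow from the four cells of a binary confusion matrix using their standard definitions. The sketch below computes them as fractions; the function name and the example counts are hypothetical and are not taken from the experiments reported here.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification measurements from confusion counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives, and false negatives. Values are returned as fractions.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in binary_metrics(tp=90, fp=10, tn=190, fn=10).items():
        print(f"{name}: {100 * value:.2f}")
```

Sensitivity and specificity describe the model conditioned on the true class, while PPV and NPV are conditioned on the predicted class, so the latter two also depend on how common the positive class is in the evaluated data.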