Big data and health sciences: Machine learning in chronic illness by Huiyu Deng

Abstract:- Big data has become the new hot topic in recent years. It promotes the understanding of the exploit of data and directs the decision guidance in many sectors. The health science field is also shaped by the innovative idea of big data application. Our study group from the department of preventive medicine of the Keck school of medicine of the University of Southern California aims to build a big data architecture that combines and analyzes data of people from difference sources and provide health related assessments back to them. Specifically, ecological momentary assessments (EMAs), electronic medical records (EMRs), and real-time air quality monitor data of children with pre-existing asthma diagnosis are collected and fed into the machine learning models. Asthma exacerbation alert is generated and delivered back to the children before it happens. The machine learning model was tested and built in a similar study. The study population consists of children from a cohort of the prospective, population-based Children’s Health Study followed from 2003-2012 in 13 Southern California communities. Potential risk factors were grouped into five broad categories: sociodemographic factors, indoor/home exposures, traffic/air pollution exposures, symptoms/medication use, and asthma/allergy status. The outcome of interest, assessed via annual questionnaire, was the presence of bronchitic symptoms over the prior 12 months. A gradient boosting model (GBM) was trained on data consisting of one observation per participant in a random study year, for a randomly selected half of the study participants. The model was validated using hold-out test data obtained in two complementary approaches: (within-participant) a random (later) year in the same participants and (across-participant) a random year in participants not included in the training data. The predictive ability of risk factor groupings was evaluated using the area under receiver operating characteristic curve (AUC) and accuracy. The predictive ability of individual risk factors was evaluated using the relative variable importance. Graphical visualization of the predictor-outcome relationship was displayed using partial dependency plots. Interaction effects were identified using the H-statistic. Gradient boosting model offers a novel approach to better understand predictive factors for chronic upper respiratory illness such as bronchitic symptoms.