Biomedical Data Mining for Information Retrieval. Group of authors.
for mortality risk prediction has been reported in Ref. [16]. For the clinical rules, the authors used fuzzy rule-based systems. A genetic algorithm is used as the optimizer, which generates the coefficients of the final solution. The FIS model achieves a score of 0.39 for event 1 and 94 for event 2. A new method for predicting mortality in an ICU is proposed in Ref. [17]. The method, Simple Correspondence Analysis (SCA), is based on both clinical and laboratory data together with the two earlier models APACHE-II and SAPS-II. It uses the PhysioNet Challenge 2012 data, a total of 12,000 records in Sets A, B and C, for which 37 time series variables are recorded. SCA is applied to select variables and combines them using the traditional APACHE and SAPS methods. The method predicts whether the patient will survive or not. The model obtained a score 1 of 43.50% for set A, 42.25% for set B and 42.73% for set C. The Naive Bayesian Classifier is used in Ref. [18] to predict mortality in an ICU, aiming for a high S1 and a small S2. S1 is defined in terms of sensitivity and positive predictive value, and S2 in terms of the Hosmer–Lemeshow H statistic. Missing values are replaced by NaN (Not-a-Number) if a variable is not measured. On set B, the model achieves 0.475 for S1, the eighth best solution, and 12.820 for S2, the best solution. On set C, the model achieved a score of 0.4928 for event 1 (fourth best solution) and 0.247 for event 2 (third best solution). Di Marco et al. [19] proposed a new algorithm for mortality prediction with better accuracy for data collected during the first 48 h of ICU admission. A binary classifier is applied to obtain the result for event 1. Set A, which contains 41 variables for 4,000 patients, is selected. Forward sequential feature selection with a logistic cost function is used. For classification, a logistic regression model is used, which obtained a score of 54.9% on set A and 44.0% on test set B.
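The event 1 score (S1) mentioned above can be illustrated with a short sketch. It assumes that S1 is the smaller of sensitivity and positive predictive value, as in the PhysioNet 2012 challenge scoring; the function name `event1_score` and the toy labels are hypothetical and are not taken from any of the cited papers.

```python
def event1_score(y_true, y_pred):
    """Return (sensitivity, ppv, s1) for binary labels, where 1 = died in hospital.

    Assumption: s1 is the minimum of sensitivity and positive predictive value.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # deaths predicted as deaths
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # deaths that were missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # survivors predicted as deaths
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv, min(sensitivity, ppv)


# Toy example: 3 deaths and 5 survivors, 2 deaths detected, 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
sensitivity, ppv, s1 = event1_score(y_true, y_pred)
```

Taking the minimum means that a classifier cannot obtain a good score by trading one of the two measures against the other, which matters when deaths are the minority class.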
To predict the mortality rate, Ref. [20] developed a model based on the Support Vector Machine (SVM). An SVM is a machine learning algorithm that tries to minimize the error and find the hyperplane with the maximum margin. The two classes are 0 for a survivor and 1 for a patient who died in hospital. The authors used 3,000 records for training and 1,000 records for testing. They observed over-fitting of the SVM on set A and obtained a score of 0.8158 for event 1 and 0.3045 for event 2. In phase 2 they set out to improve the SVM training strategy and reduce the over-fitting. The final score obtained for event 1 is 0.530, for set B it is 0.350 and for set C it is 0.333. An algorithm based on an artificial neural network is employed in Ref. [21] to predict in-hospital patient mortality. Features are extracted from the PhysioNet data using a method for detecting solar ‘nanoflares’, owing to the similarity between solar data and these time series. Data preprocessing is performed to remove outliers, and missing values are replaced by the mean value for each patient. The model is then trained and yields a score of 22.83 for event 2 on set B and 38.23 on set C. A logistic regression model is suggested in Ref. [22] for the same purpose. It follows three phases. Phase 1 selects derived variables on set A by calculating each variable’s first value, average, minimum value, maximum value, total time, first difference and last value. Phase 2 applies a logistic regression model to predict in-hospital death (0 for survivor, 1 for died) on set A. Phase 3 applies the logistic regression model to obtain the scores for events 1 and 2. The results obtained are 0.4116 for score 1 and 8.843 for score 2. Ref. [23] also reported a logistic regression model for mortality prediction. The experiment uses 4,000 ICU patients in set A for training and 4,000 patients in set B for testing. The filtering process identifies 30 variables for building the model.
The results obtained are a score of 0.451 for event 1 and 45.010 for event 2. A novel cluster analysis technique is used in Ref. [24] to test the similarities between time series data for mortality prediction. For data preprocessing, it uses a segmentation-based approach that divides the variables into several segments. The maximal and minimal values are used to preserve the statistical features. Weighted Euclidean distance-based clustering and rule-based classification are used. The average result obtained is 22.77 to 33.08% for death prediction and 75 to 86% for survival prediction.
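The derived variables listed for Ref. [22] can be sketched for a single time series as follows. This is only an illustration: the function name `summarize_series` is hypothetical, "total time" is interpreted here as the span between the first and last measurement, and "first difference" as the second value minus the first. Neither interpretation is confirmed by the source.

```python
def summarize_series(times, values):
    """Summarize one variable of one patient.

    times  -- measurement times in hours, sorted in ascending order
    values -- measured values, same length as times
    Returns None when the variable was never measured.
    """
    if not values:
        return None
    return {
        "first": values[0],
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
        # Assumed meaning: time elapsed between first and last measurement.
        "total_time": times[-1] - times[0],
        # Assumed meaning: change between the first two measurements.
        "first_diff": values[1] - values[0] if len(values) > 1 else 0.0,
        "last": values[-1],
    }


# Toy heart-rate series with three measurements in the first four hours.
features = summarize_series([0.0, 1.5, 4.0], [80.0, 90.0, 85.0])
```

Each patient then contributes one fixed-length feature vector per variable, which is the form a logistic regression model requires.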
In Ref. [25], the main goal is to improve mortality prediction for ICU patients using the PhysioNet Challenge 2012 dataset. Three main objectives are accomplished: (i) reduction of dimensions, (ii) reduction of uncontrolled variance and (iii) less dependency on the training set. Feature reduction techniques such as Principal Component Analysis, Spectral Clustering, Factor Analysis and Tukey’s HSD Test are used. Classification is done using an SVM, which achieved an accuracy of 0.73, better than the previous work. The authors of Ref. [26] extracted 61,533 data records from MIMIC-III v1.4 and excluded patients younger than 16, patients who stayed less than 4 h and patients whose data are not present in the flow sheet. A final cohort of 50,488 ICU stays is used for the experiments. Features are extracted using a fixed-length window. The machine learning models used are Logistic Regression (LR), LR with an L1 regularization penalty (Least Absolute Shrinkage and Selection Operator, LASSO), LR with an L2 regularization penalty and Gradient Boosting Decision Trees. Severity of illness is calculated using several scores: APS III, SOFA, SAPS, LODS, SAPS II and OASIS. Two types of experiments are conducted, a benchmarking experiment and a real-time experiment. Among the models compared, the Gradient Boosting algorithm obtained a high AUROC of 0.920. Early-stage prediction of hospital mortality, at admission, through time series analysis of intensive care unit patients using different data mining techniques is carried out in Ref. [27]. Traditional scoring systems such as APACHE, SAPS and SOFA are used to obtain scores. A total of 4,000 ICU patients are selected from the MIMIC database and 37 time series variables are selected from the first 48 h of admission. The Synthetic Minority Oversampling Technique (SMOTE) is used to modify the datasets (original and smote); missing data are handled by replacing them with the mean (rep1), after which SMOTE is applied (rep1 and smote).
After replacing the missing data, the EM-Imputation algorithm (rep2) is applied. Finally, results are obtained using different classifiers: Random Forest (RF), Partial Decision Tree (PART) and Bayesian Network (BN). Of these three classifiers, Random Forest obtained the best results, with an AUROC of 0.83 ± 0.03 at 48 h on rep1, an AUROC of 0.82 ± 0.03 on original, rep1 and smote at 40 h, and an AUROC of 0.82 ± 0.03 on rep2 and smote at 48 h.
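The two preprocessing steps described for Ref. [27], mean replacement and SMOTE, can be sketched in a few lines. This is a simplified illustration and not the authors' pipeline: it assumes that missing entries are marked with `None`, that the replacement value is the column mean, and that synthetic samples are drawn by linear interpolation between a minority sample and one of its `k` nearest minority neighbours. The names `impute_mean` and `smote` are hypothetical.

```python
import math
import random


def impute_mean(rows):
    """Replace None entries by the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples (needs at least two samples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base sample, excluding itself.
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to target
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


rows = impute_mean([[1.0, None], [3.0, 4.0], [None, 6.0]])
extra = smote([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]], n_new=4)
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region already occupied by the observed deaths.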
Sepsis is one of the causes of a high mortality rate, and quick recovery is important because sepsis increases the risk of death after discharge from hospital [28]. The objective of the paper is to develop a model for one-year mortality prediction. A total of 5,650 admitted patients with sepsis were selected from the MIMIC-III database and divided into 70% for training and 30% for testing. The Stochastic Gradient Boosting method is used to develop the one-year mortality prediction model. Variables are selected using the Least Absolute Shrinkage and Selection Operator (LASSO) and the AUROC is calculated. An AUROC of 0.8039 (95% confidence interval: 0.8033–0.8045) is obtained on the testing set. Finally, it is observed that the Stochastic Gradient Boosting ensemble algorithm is more accurate for one-year mortality prediction than the traditional scoring systems SAPS, OASIS, MPM and SOFA.
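The evaluation protocol described above, a 70/30 split followed by an AUROC calculation, can be sketched without any external library. The helper names `train_test_split` and `auroc` are hypothetical, and the AUROC is computed here through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
import random


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of items and split it into training and testing parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def auroc(y_true, scores):
    """Area under the ROC curve from pairwise comparisons of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    # A positive/negative pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


train_ids, test_ids = train_test_split(list(range(10)))
value = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
```

The pairwise form is quadratic in the number of cases, so it is suitable for illustration only; library implementations sort the scores instead.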
Deep learning has been applied successfully to various large and complex datasets. It is one of the newer techniques and has outperformed traditional techniques. A multi-scale deep convolutional neural network (ConvNet) model for mortality prediction is proposed in Ref. [29]. The dataset is taken from the MIMIC-III database, and 22 different variables are extracted from the measurements of the first 48 h for each patient. A ConvNet is a multilayer neural network in which a discrete convolution operation is applied. The convolutional neural network models were developed in Python using the Keras and TensorFlow packages as the backend. The proposed model achieves a ROC AUC of 0.8735 (± 0.0025), which is in line with the state of the art among deep learning models.
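The discrete convolution operation at the core of a ConvNet can be illustrated on a one-dimensional time series. The sketch below is not the architecture of Ref. [29]: it uses plain Python instead of Keras or TensorFlow, applies the kernel without flipping it (as is customary in ConvNets), and only hints at the multi-scale idea by using kernels of different lengths followed by a maximum. The names `conv1d` and `multi_scale_features` are hypothetical.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal and return the 'valid' responses."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def multi_scale_features(signal, kernels):
    """Apply kernels of several lengths and keep the largest response of each."""
    return [max(conv1d(signal, kern)) for kern in kernels]


# A difference kernel responds to changes in the series at a given time scale.
responses = conv1d([1, 2, 3, 4], [1, 0, -1])
features = multi_scale_features([1, 2, 4, 7], [[-1, 1], [-1, 0, 1]])
```

In a trained network the kernel weights are learned from the data rather than fixed by hand, and many kernels are applied in parallel in each layer.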
1.3 Materials and Methods
1.3.1 Dataset
The dataset is collected from the PhysioNet Challenge 2012 and consists of three sets, A, B and C [6]. A total of 12,000 patient records are available. Each set consists of 4,000 patient records, of which only the 4,000 records of set A are used for simulation in this chapter. There are 41 variables recorded in the dataset: five of these (age, gender, height, ICU type and initial weight) are general descriptors and 36 are time series variables, as described in Table 1.1.
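A patient record of this kind can be separated into general descriptors and time series with a few lines of code. The sketch below rests on assumptions that are not stated in this chapter: that each record is a text file with a `Time,Parameter,Value` header, that times are given as `hh:mm` since admission, that the descriptors carry the parameter names shown in `GENERAL_DESCRIPTORS`, and that a negative descriptor value marks a missing entry. The function name `parse_record` and the sample values are hypothetical.

```python
# Assumed parameter names for the five general descriptors.
GENERAL_DESCRIPTORS = {"Age", "Gender", "Height", "ICUType", "Weight"}


def parse_record(text):
    """Split one record into (descriptors, series).

    descriptors -- dict name -> value, or None when marked as missing
    series      -- dict name -> list of (time in hours, value) tuples
    """
    descriptors, series = {}, {}
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header line
        time_str, name, value_str = line.split(",")
        hours, minutes = time_str.split(":")
        t = int(hours) + int(minutes) / 60.0
        value = float(value_str)
        if name in GENERAL_DESCRIPTORS:
            # Assumption: a negative value means the descriptor was not recorded.
            descriptors[name] = None if value < 0 else value
        elif name != "RecordID":
            series.setdefault(name, []).append((t, value))
    return descriptors, series


sample = """Time,Parameter,Value
00:00,RecordID,132539
00:00,Age,54
00:00,Height,-1
00:07,HR,73
01:37,HR,77
"""
descriptors, series = parse_record(sample)
```

Keeping the measurement time with each value preserves the irregular sampling of the series, which later feature extraction steps can then summarize.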
Of the above 36 variables, only 15 are selected for mortality prediction. These variables are listed below in Table