Introduction: Recent advances in machine learning provide new possibilities to process and analyse observational patient data to predict patient outcomes. In this paper, we introduce a data processing pipeline for cardiogenic shock (CS) prediction from the MIMIC III database of intensive cardiac care unit patients with acute coronary syndrome. The ability to identify high-risk patients could possibly allow taking pre-emptive measures and thus prevent the development of CS. Methods: We mainly focus on techniques for the imputation of missing data by generating a pipeline for imputation and comparing the performance of various multivariate imputation algorithms, including k-nearest neighbours, two singular value decomposition (SVD)-based methods, and Multiple Imputation by Chained Equations. After imputation, we select the final subjects and variables from the imputed dataset and showcase the performance of the gradient-boosted framework that uses a tree-based classifier for cardiogenic shock prediction. Results: We achieved good classification performance thanks to data cleaning and imputation (cross-validated mean area under the curve 0.805) without hyperparameter optimization. Conclusion: We believe our pre-processing pipeline would prove helpful also for other classification and regression experiments.

Data processing pipeline for cardiogenic shock prediction using machine learning

Tavazzi, Guido;
2023-01-01

Abstract

Introduction: Recent advances in machine learning provide new possibilities to process and analyse observational patient data to predict patient outcomes. In this paper, we introduce a data processing pipeline for cardiogenic shock (CS) prediction from the MIMIC III database of intensive cardiac care unit patients with acute coronary syndrome. The ability to identify high-risk patients could possibly allow taking pre-emptive measures and thus prevent the development of CS. Methods: We mainly focus on techniques for the imputation of missing data by generating a pipeline for imputation and comparing the performance of various multivariate imputation algorithms, including k-nearest neighbours, two singular value decomposition (SVD)-based methods, and Multiple Imputation by Chained Equations. After imputation, we select the final subjects and variables from the imputed dataset and showcase the performance of the gradient-boosted framework that uses a tree-based classifier for cardiogenic shock prediction. Results: We achieved good classification performance thanks to data cleaning and imputation (cross-validated mean area under the curve 0.805) without hyperparameter optimization. Conclusion: We believe our pre-processing pipeline would prove helpful also for other classification and regression experiments.
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11571/1476930
Citazioni
  • ???jsp.display-item.citation.pmc??? 0
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact