Deep Semantic Segmentation of Echocardiographic Images using Vision Transformers
Bosco E.; Cotogni M.; Cusano C.; Matrone G.
2023-01-01
Abstract
Deep learning (DL) methods have revolutionized image segmentation by providing tools to automatically identify structures within images with high levels of accuracy. In particular, Convolutional Neural Networks (CNNs), such as the U-Net and its variants, have achieved remarkable results in many segmentation tasks. More recently, Vision Transformer (ViT)-based models have emerged and, in some cases, have been shown to outperform CNNs in semantic segmentation tasks. However, transformers typically require larger amounts of training data than CNNs. This can be a significant drawback given the time-consuming nature of collecting and annotating data, especially in a clinical setting. To date, only a few studies have applied ViT networks to ultrasound image segmentation.

In this study, we present one of the earliest applications of ViT-based architectures for segmenting the left heart chambers in 2D echocardiographic images. The identification of cardiac structures, e.g. the heart chambers, can be used to derive relevant quantitative parameters, such as atrial and ventricular volumes and the ejection fraction.

We trained and tested several ViT models using the publicly available CAMUS dataset, composed of cardiac image sequences from 500 patients with the corresponding mask labels of the left ventricle and left atrium. The performance of the ViT networks was then compared to that of a U-Net implementation. We demonstrate that, on this type of data, recent ViT variants can match and even outperform CNNs, despite the limited data availability.
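The abstract names the ejection fraction among the quantitative parameters derivable from chamber segmentations. As a minimal illustration (not code from the paper), the standard clinical formula computes it from the end-diastolic and end-systolic left-ventricular volumes, which can in turn be estimated from the segmented masks:

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Left-ventricular ejection fraction (%) from the end-diastolic
    volume (EDV) and end-systolic volume (ESV), both in milliliters."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml

# Example: EDV = 120 ml, ESV = 50 ml -> EF of about 58.3 %
print(ejection_fraction(120.0, 50.0))
```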
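The abstract does not name the specific ViT variants that were trained. As a hedged sketch of what fine-tuning a transformer-based segmentation model on CAMUS-style masks could look like, here is a setup using the Hugging Face transformers library with a SegFormer backbone; the choice of SegFormer, the "nvidia/mit-b0" checkpoint, and the three-class label map are assumptions for illustration, not necessarily the paper's configuration:

```python
import torch
from transformers import SegformerForSemanticSegmentation

# Hypothetical setup: 3 classes (background, left ventricle, left atrium).
# The decoder head is re-initialized to match the new number of labels.
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/mit-b0",
    num_labels=3,
    ignore_mismatched_sizes=True,
)

pixel_values = torch.randn(1, 3, 512, 512)       # one dummy echo frame
labels = torch.randint(0, 3, (1, 512, 512))      # dummy segmentation mask

outputs = model(pixel_values=pixel_values, labels=labels)
print(outputs.loss)          # cross-entropy loss against the mask
print(outputs.logits.shape)  # (1, 3, 128, 128): logits at 1/4 input resolution
```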
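The abstract compares ViT variants against a U-Net but does not state the evaluation metric; the Dice coefficient is the standard choice for this kind of segmentation comparison and is commonly reported on CAMUS. A minimal NumPy sketch, with the class labels assumed for illustration:

```python
import numpy as np

# Hypothetical label convention: 0 = background, 1 = left ventricle, 2 = left atrium.
def dice_coefficient(pred: np.ndarray, target: np.ndarray, label: int) -> float:
    """Dice similarity for one class between two integer-label masks."""
    pred_mask = pred == label
    target_mask = target == label
    denom = pred_mask.sum() + target_mask.sum()
    if denom == 0:
        return 1.0  # class absent from both masks: treat as perfect agreement
    return 2.0 * np.logical_and(pred_mask, target_mask).sum() / denom

# Usage: compare a predicted mask against the ground-truth mask per structure.
pred = np.random.randint(0, 3, (256, 256))
gt = np.random.randint(0, 3, (256, 256))
print("LV Dice:", dice_coefficient(pred, gt, label=1))
print("LA Dice:", dice_coefficient(pred, gt, label=2))
```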