Forecasting dominance of SARS-CoV-2 lineages by anomaly detection using deep AutoEncoders
Rancati, Simone (Writing – Original Draft Preparation)
; Nicora, Giovanna (Writing – Review & Editing)
; Bellazzi, Riccardo (Supervision)
; Salemi, Marco (Supervision)
; Marini, Simone (Conceptualization)
2024-01-01
Abstract
The coronavirus disease of 2019 (COVID-19) pandemic is characterized by the sequential emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants and lineages that outcompete previously circulating ones because of, among other factors, increased transmissibility and immune escape [1-3]. We devised an unsupervised deep learning AutoEncoder for anomaly detection in viral genomes to predict future dominant lineages (FDLs), i.e., lineages or sublineages comprising ≥10% of the viral sequences added to the GISAID database in a given week [4]. The algorithm was trained and validated by assembling global and country-specific data sets from 16,187,950 Spike protein sequences sampled between December 24th, 2019, and November 8th, 2023. The AutoEncoder flags FDLs while they are still at low frequency (0.01% - 3%), with median lead times of 4-16 weeks. Positive predictive values oscillate over time and decrease linearly with the number of unique sequences per data set; average performance is up to 30 times better than that of baseline approaches. The B.1.617.2 vaccine reference strain was flagged as an FDL when its frequency was only 0.01%, more than one year before it was considered for an updated COVID-19 vaccine. Our AutoEncoder, applicable in principle to any pathogen, also pinpoints specific mutations potentially linked to increased fitness, and may provide significant insights for optimizing pre-emptive public health intervention strategies.
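The FDL definition used in the abstract (a lineage reaching at least 10% of the sequences added in a given week) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the input format (pairs of week number and lineage label) and the function names `weekly_shares` and `first_dominant_week` are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Share of weekly sequences a lineage must reach to count as dominant,
# taken from the abstract's definition (>= 10%).
FDL_THRESHOLD = 0.10


def weekly_shares(records):
    """Return {week: {lineage: share}} from (week, lineage) pairs.

    `records` is a hypothetical input: one pair per submitted sequence.
    """
    counts = defaultdict(Counter)
    for week, lineage in records:
        counts[week][lineage] += 1
    return {
        week: {lin: n / sum(c.values()) for lin, n in c.items()}
        for week, c in counts.items()
    }


def first_dominant_week(records, threshold=FDL_THRESHOLD):
    """Return {lineage: first week in which its share reached the threshold}."""
    shares = weekly_shares(records)
    first_week = {}
    for week in sorted(shares):
        for lineage, share in shares[week].items():
            if share >= threshold and lineage not in first_week:
                first_week[lineage] = week
    return first_week


# Toy data: lineage "B" holds 5% of week 1 and 20% of week 2.
records = [(1, "A")] * 19 + [(1, "B")] + [(2, "A")] * 8 + [(2, "B")] * 2
dominance = first_dominant_week(records)
```

Under these assumptions, a lead time in the abstract's sense would be the difference between a lineage's first dominant week and the earlier week in which a detector flagged it.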