Dynamic Prediction of Non-Neutral SARS-CoV-2 Variants Using Incremental Machine Learning
Nicora G. (Writing – Original Draft Preparation); Bellazzi R. (Conceptualization)
2022-01-01
Abstract
In this work we show that Incremental Machine Learning can be used to predict the classification of emerging SARS-CoV-2 lineages, dynamically distinguishing between neutral variants and non-neutral ones, i.e., variants of interest or variants of concern. Starting from the Spike protein primary sequences collected in the GISAID database, we derived a set of k-mer features, i.e., amino acid subsequences of fixed length k. We then implemented a Logistic Regression Incremental Learner that was tested monthly on the variants collected from February 2020 to October 2021. The classifier achieved an average balanced accuracy of 0.72 ± 0.2, which increased to 0.78 ± 0.16 in the last 12 months. The alpha, beta, gamma, eta, kappa and delta variants were recognized as non-neutral variants with a mean recall of 90%. In summary, incremental learning proved to be a useful instrument for pandemic surveillance, given its capability to update the model on new data over time.
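The pipeline summarized in the abstract (k-mer features, a logistic regression updated incrementally, and evaluation by balanced accuracy) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `kmer_counts`, `OnlineLogisticRegression` and `balanced_accuracy`, the choice k = 3, the learning rate, and the plain stochastic gradient update are all assumptions made for illustration only.

```python
import math
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count every overlapping length-k substring of a protein sequence.

    k=3 is an illustrative default, not a value taken from the paper.
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


class OnlineLogisticRegression:
    """Hypothetical incremental learner: logistic regression updated one
    sample at a time with stochastic gradient descent on the log loss."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate  # assumed value
        self.weights = {}                   # sparse weights keyed by k-mer
        self.bias = 0.0

    def predict_proba(self, features):
        """Return the estimated probability of the non-neutral class (label 1)."""
        z = self.bias + sum(self.weights.get(f, 0.0) * v
                            for f, v in features.items())
        z = max(-30.0, min(30.0, z))        # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, features, label):
        """Update the model with a single labelled sample (label is 0 or 1)."""
        error = self.predict_proba(features) - label
        self.bias -= self.learning_rate * error
        for f, v in features.items():
            self.weights[f] = (self.weights.get(f, 0.0)
                               - self.learning_rate * error * v)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on class 1 and the recall on class 0."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)
```

In a monthly surveillance setting, one plausible use of such a learner is to score the sequences of each new month with the current model, compute the balanced accuracy once labels are available, and only then call `partial_fit` on that month's data, so the model is always evaluated on variants it has not yet seen.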