Dynamic Prediction of Non-Neutral SARS-Cov-2 Variants Using Incremental Machine Learning

Nicora, G.; Marini, S.; Salemi, M.; Bellazzi, R.

doi:10.3233/SHTI220550

In this work we show that Incremental Machine Learning can be used to predict the classification of emerging SARS-CoV-2 lineages, dynamically distinguishing between neutral variants and non-neutral ones, i.e. variants of interest or variants of concerns. Starting from the Spike protein primary sequences collected in the GISAID db, we have derived a set of k-mers features, i.e., aminoacid subsequences with fixed length k. We have then implemented a Logistic Regression Incremental Learner that was monthly tested on the variants collected since February 2020 until October 2021. The average value of balanced accuracy of the classifier is 0.72 ± 0.2, which increased to 0.78 ± 0.16 in the last 12 months. The alpha, beta, gamma, eta, kappa and delta variants were recognized as non-neutral variants with mean recall 90%. In summary, incremental learning proved to be a useful instrument for pandemic surveillance, given its capability to update the model on new data over time