Essays on Statistical Learning Models for Environmental Applications with a Focus on Explainability

Patelli, Luca

Statistical Learning (SL) is a framework that integrates traditional statistical techniques with Machine Learning (ML) algorithms, enhancing the latter by embedding statistical rigour. This approach addresses inherent limitations of standard ML applications and provides new insights into the field of statistics. A principal aspect of this thesis is the consistent application of SL to environmental data and challenges, which establishes a common theme throughout the work. The thesis investigates the SL framework with a particular emphasis on the Random Forest (RF) algorithm, a well-known ML algorithm valued for its predictive accuracy and versatility across various domains. However, RF has certain limitations, especially when the data are not independent and identically distributed (i.i.d.) and in terms of its lack of interpretability. The first issue addressed in the thesis is the application of RF to spatially dependent data, a frequent scenario in environmental applications. An extensive literature review is conducted in accordance with the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines, ensuring a systematic and transparent review process. A taxonomy is developed to categorise the existing contributions, identified via PRISMA, according to whether they incorporate spatial information into RF models pre-, in-, or post-computation. This structured categorisation facilitates a deeper understanding of current methodologies and reveals potential directions for future research. Beyond predictive accuracy, there is a growing demand for ML results that can be readily interpreted, which represents another challenge for RF. In this thesis, RF's lack of interpretability is addressed by first defining the concepts of interpretability and explainability and distinguishing between them.
Explainable Artificial Intelligence (XAI) is then introduced as a potential solution to the opacity of ML models, with particular emphasis on the strategies that can be applied to explain RF predictions. A review of the latest techniques for explaining RF models is presented, underscoring the need for tailored explainability methods. Methodologically, the thesis introduces Spatial SIRUS (S-SIRUS), an explainability algorithm specifically designed for geostatistical applications in regression tasks. To ascertain the conditions under which S-SIRUS should be preferred to its non-spatial counterpart, SIRUS, a series of simulations using pseudo-real environmental data is conducted. The simulations demonstrate S-SIRUS's capacity to enhance the interpretability of spatial predictions, thereby addressing the interpretability challenges associated with RF in spatial contexts and providing clarity in applications such as complex environmental analyses. In the third and final key chapter, an application of SL to the early estimation of seismic intensity in Italy is presented. In this application, RF is employed as the predictive model for a classification task. To tackle the black-box nature of RF, surrogate decision trees are employed to elucidate the underlying prediction mechanisms. Furthermore, to address uncertainty quantification, a well-known limitation of ML models, dissimilarity indices are applied to describe the variability of the observed response values in the terminal nodes that provide the predictions. This approach makes it possible to balance predictive performance with explainability and uncertainty quantification. The result is a framework that equips decision-makers with a comprehensive toolset to support their decision-making processes.
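
As a rough illustration of the uncertainty idea described above, the following Python sketch computes a normalised Gini heterogeneity index for the observed class labels that fall into one terminal node. The abstract does not say which dissimilarity indices the thesis uses, so the choice of index, the function name `normalised_gini`, and the example labels are assumptions made purely for illustration, not the thesis's method.

```python
from collections import Counter


def normalised_gini(labels, n_classes):
    """Normalised Gini heterogeneity of the labels in one terminal node.

    Returns 0.0 when all observations share a single class (no variability)
    and 1.0 when they are spread evenly over all `n_classes` classes.
    Hypothetical helper: the index used in the thesis may differ.
    """
    if n_classes < 2:
        raise ValueError("n_classes must be at least 2")
    if not labels:
        raise ValueError("labels must not be empty")

    counts = Counter(labels)
    if len(counts) > n_classes:
        raise ValueError("more distinct labels than n_classes")

    n = len(labels)
    # Gini heterogeneity: one minus the sum of squared class proportions.
    gini = 1.0 - sum((c / n) ** 2 for c in counts.values())
    # Divide by the maximum attainable value, (K - 1) / K, to map onto [0, 1].
    return gini / (1.0 - 1.0 / n_classes)


# Illustrative terminal nodes with made-up intensity classes (assumed data).
pure_node = [5, 5, 5, 5]    # every observation has the same class
mixed_node = [3, 4, 5, 6]   # observations spread evenly over four classes

pure_score = normalised_gini(pure_node, n_classes=4)
mixed_score = normalised_gini(mixed_node, n_classes=4)
```

Under these assumptions, a value near 0 suggests that the prediction issued by the node rests on homogeneous observations, while a value near 1 signals that the node mixes many classes and its prediction should be read with more caution.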