Dealing with the biased effects issue when handling huge datasets: the case of INVALSI data
E. Raffinetti
2015-01-01
Abstract
The increasing prevalence of huge datasets directs research toward statistical methods suited to the problems caused by their complexity. On the one hand, the literature offers several techniques, especially for reducing computation time and the number of variables. On the other hand, the statistical inference issue has received less attention. Indeed, a large number of statistical units may lead variables with no actual impact on the phenomenon under study to be wrongly considered significant. This paper proposes a subsampling procedure for reducing the number of statistical units and provides a novel index for assessing the significance of effects. The proposal is validated by comparing the results obtained from the analysis of the original data with those obtained from the proposed subsampling approach. The illustrative application focuses on the educational dataset made available by the National Committee for the Evaluation of the Italian Education Systems (INVALSI). This dataset collects information on student characteristics and achievement in Maths in the lower secondary schools of the Lombardy region (Italy). Because the data are hierarchically structured, a multilevel model is fitted to investigate the effects of both individual and school factors on students' Maths scores.