Use and misuse of P-values: a conditional approach to post-model-selection inference

GIOÈ, MAURO
2021-03-12

Abstract

Adaptive generation of hypotheses is among the main culprits behind the lack of replicability in science. Under conditions of uncertainty, statements, or the process that generates them, can be trusted only if the reported error rates are borne out in replication attempts. The discrepancy between the two has many causes, but interactive data analysis plays a major role in inflating the type I error rate. In this regard, inference after model selection is of particular interest because its misuse can be analyzed through a Monte Carlo simulation. As the findings of this thesis show, inflation of the type I error rate can be severe even in low-dimensional scenarios, with up to 40% false positives in the selected set of variables. This percentage varies greatly with the model selection strategy and the structure of the true data-generating mechanism. The simulation results show that the Least Absolute Shrinkage and Selection Operator (LASSO) and Forward Selection (FS) perform differently: the LASSO yields a lower type I error rate than FS when the true data-generating mechanism is additive, and a higher one when it is multiplicative. The results also provide additional empirical evidence that, across an extensive class of problems, most methods provide comparable solutions on average. As shown in this thesis, the conditional probability approach to selective inference is a viable way to control the type I error rate while avoiding the loss of data that data splitting entails. In the current research environment, incentives and funding policies need to be reshaped to bring about effective changes in the overall reliability of published papers, but the tools to provide rigorous results while meeting the needs of researchers are available to anyone conscientious enough.
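The kind of inflation described in the abstract can be illustrated with a small Monte Carlo experiment. The sketch below is not the simulation design used in the thesis; it is a minimal toy setup under stated assumptions: all predictors are pure noise (a global null), the noise variance is known and equal to 1, selection is a single step of forward selection, and the selected variable is then tested with an ordinary two-sided z-test as if it had been chosen in advance. The function name `naive_type_one_error` and all parameter values are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def naive_type_one_error(n=100, p=10, reps=5000, alpha=0.05, seed=0):
    """Estimate how often a naive z-test rejects for the variable picked by
    one forward-selection step when no predictor is truly active."""
    rng = np.random.default_rng(seed)
    # Two-sided critical value of the standard normal distribution.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        # Global null (assumption): y is independent of X, with known sigma = 1.
        y = rng.standard_normal(n)
        # z-statistic of each single-predictor, no-intercept regression:
        # beta_hat / se = (x'y / x'x) / (1 / sqrt(x'x)) = x'y / sqrt(x'x).
        z = X.T @ y / np.sqrt((X ** 2).sum(axis=0))
        # First forward-selection step: the largest drop in the residual sum
        # of squares equals z**2, so the step picks the largest |z|.
        selected = int(np.argmax(np.abs(z)))
        # Naive inference: test the selected variable, ignoring the selection.
        rejections += int(abs(z[selected]) > z_crit)
    return rejections / reps


if __name__ == "__main__":
    print("p = 1 :", naive_type_one_error(p=1))
    print("p = 10:", naive_type_one_error(p=10))
```

With a single candidate predictor no selection takes place, so the estimated rejection rate should stay close to the nominal level `alpha`. With several candidates the same test rejects far more often, because the test statistic is the maximum of several nearly independent ones. The conditional approach discussed in the thesis addresses this by conditioning the inference on the selection event, which this sketch does not implement.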
Files in this item:
File: Use and misuse of P-values.pdf
Access: open access
Description: Use and misuse of P-values: a conditional approach to post-model-selection inference
Type: Doctoral thesis
Size: 1.75 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11571/1422614