Use and misuse of P-values: a conditional approach to post-model-selection inference

GIOÈ, MAURO

2021-03-12

Abstract

Adaptive generation of hypotheses is among the main culprits of the lack of replicability in science. Under conditions of uncertainty, statements, or the process that generates them, can only be trusted if the reported error rates are borne out in replication attempts. The discrepancy between the two has many causes, but interactive data analysis plays a major role in the inflation of type I error. In this regard, inference after model selection is of particular interest because its misuse can be analyzed through Monte Carlo simulation. As the findings of this thesis show, inflation of type I error can be quite severe even in low-dimensional scenarios, with up to 40% false positives in the selected set of variables. This percentage varies greatly depending on the model selection strategy and the structure of the true data-generating mechanism. The simulation results show that the Least Absolute Shrinkage and Selection Operator (LASSO) and Forward Selection (FS) perform differently: the LASSO yields a lower type I error than FS when the structure of the true data-generating mechanism is additive and a higher one when the structure is multiplicative. The results also provide additional empirical evidence that, over an extensive class of problems, most methods provide comparable solutions on average. As shown in this thesis, the conditional probability approach to selective inference is a viable way to control type I error while avoiding the data loss incurred by data splitting. In the current research environment, incentives and funding policies need to be reshaped to bring about effective changes in the overall reliability of published papers, but the tools to provide rigorous results, while meeting the needs of researchers, are available to anyone conscientious enough.
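To make the inflation mechanism concrete, the following is a minimal Monte Carlo sketch in Python. It is not the simulation design used in the thesis: the sample size, number of predictors, known noise variance, use of only the first step of forward selection, and the function names `naive_p_value` and `simulate` are all assumptions chosen for illustration. Under a global null (no predictor affects the response), it compares the rejection rate of a naive test on the selected variable with the rate obtained by data splitting.

```python
import math
import numpy as np


def naive_p_value(x, y):
    """Two-sided z-test p-value for the slope of y on x (noise sd assumed known and equal to 1)."""
    z = x @ y / np.linalg.norm(x)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2.0))


def first_forward_step(X, y):
    """First step of forward selection: index of the predictor with the largest |z| statistic."""
    scores = np.abs(X.T @ y) / np.linalg.norm(X, axis=0)
    return int(np.argmax(scores))


def simulate(n=100, p=10, reps=2000, alpha=0.05, seed=0):
    """Return (naive rejection rate, data-splitting rejection rate) under the global null."""
    rng = np.random.default_rng(seed)
    half = n // 2
    naive_rejections = 0
    split_rejections = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # global null: y is independent of X

        # Naive analysis: select and test on the same data.
        j = first_forward_step(X, y)
        naive_rejections += naive_p_value(X[:, j], y) < alpha

        # Data splitting: select on the first half, test on the second half.
        k = first_forward_step(X[:half], y[:half])
        split_rejections += naive_p_value(X[half:, k], y[half:]) < alpha
    return naive_rejections / reps, split_rejections / reps


naive_rate, split_rate = simulate()
print(f"naive rejection rate: {naive_rate:.3f}")
print(f"split rejection rate: {split_rate:.3f}")
```

With ten nearly independent predictors, the naive rate should be close to 1 - 0.95^10, roughly 0.40, while the split rate should stay near the nominal 0.05. Data splitting restores validity at the cost of testing on half the sample, which is the loss the conditional approach is meant to avoid.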

Date of defense: 12 March 2021

Appears in categories: 8.01 Doctoral thesis

Files in this item:

File	Size	Format
Use and misuse of P-values.pdf (open access; type: doctoral thesis)	1.75 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11571/1422614
