# Selective inference in regression

Jonathan Taylor (Stanford)

Inference for Large Scale Data, April 20, 2015

## Outline

• Selective inference

• Running example: model selection with the LASSO (arxiv.org/1311.6238)

• A general framework for selective inference (arxiv.org/1410.2597)

• Further examples of selective inference.

## Acknowledgements

### This is joint work with many:

• Yunjin Choi
• Will Fithian
• Jason Lee
• Richard Lockhart
• Joshua Loftus
• Stephen Reid
• Dennis Sun
• Yuekai Sun
• Xiaoying Tian
• Rob Tibshirani
• Ryan Tibshirani
• Others in progress...

# Selective inference

• Arguably, in modern science there is often no hypothesis specified before collecting data.
• Screening in *omics
• Peak / bump hunting in neuroimaging
• Model selection in regression
• Frequentist inference requires specifying hypotheses before collecting data.

• We describe a version of selective inference which allows for valid inference after some exploration.

## Tukey and Exploratory Data Analysis (EDA)

Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test.

## Tukey and Exploratory Data Analysis (EDA)

... confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.

# Selective inference

• Today, I will focus on testing hypotheses suggested by the data.

• The answer here is parametric; interesting asymptotic questions remain.

# Running example

• In vitro measurement of resistance of sample of HIV viruses to NRTI drug 3TC.

• 633 cases, and 91 different mutations occurring more than 10 times in the sample.

• Source: HIVDB

• Goal: to build an interpretable predictive model of resistance based on mutation pattern.

In [39]:
# Design matrix
# Columns are site / amino acid pairs
X.shape

Out[39]:
(633, 91)

In [40]:
# Variable names
NRTI_muts[:10], len(NRTI_muts)

Out[40]:
(['P6D',
'P20R',
'P21I',
'P35I',
'P35M',
'P35T',
'P39A',
'P41L',
'P43E',
'P43N'],
91)

In [42]:
fig_3TC

Out[42]:

# Model selection with the LASSO

• Many coefficients seem small.

• Use the LASSO to select variables: $\hat{\beta}_{\lambda} = \text{argmin}_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y-X\beta\|^2_2 + \lambda \|\beta\|_1.$

• Theoretically motivated choice of $\lambda$ (Negahban et al., 2012) $\lambda = \kappa \cdot \mathbb{E}( \|X^T\epsilon\|_{\infty}), \qquad \epsilon \sim N(0, \sigma^2 I).$

• Used $\kappa=1$ below, $\sigma^2$ the usual estimate from full model.
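As a sketch of this recipe (a synthetic Gaussian design stands in for the HIV mutation matrix; `sigma` and `kappa` follow the slide), the theoretical $\lambda$ can be estimated by Monte Carlo and handed to scikit-learn's `Lasso`, whose objective divides the quadratic term by $n$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 633, 91
X = rng.standard_normal((n, p))             # stand-in for the mutation design
y = 2.0 * X[:, 0] + rng.standard_normal(n)  # one strong synthetic signal

sigma, kappa = 1.0, 1.0   # sigma: the usual full-model estimate in practice

# lambda = kappa * E ||X^T eps||_inf, eps ~ N(0, sigma^2 I), by Monte Carlo
draws = [np.abs(X.T @ (sigma * rng.standard_normal(n))).max() for _ in range(200)]
lam = kappa * np.mean(draws)

# sklearn minimizes (1/2n)||y - X b||^2 + alpha ||b||_1, so alpha = lam / n
fit = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y)
active = np.nonzero(fit.coef_)[0]
print(lam, active)
```

On the real data this step is what produced `lambda_theoretical` $\approx 43$ and the 16-variable active set shown below.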

## Variables chosen for 3TC

In [44]:
lambda_theoretical

Out[44]:
43.0

In [46]:
active_3TC

Out[46]:
['P62V',
'P65R',
'P67N',
'P69i',
'P75I',
'P77L',
'P83K',
'P90I',
'P115F',
'P151M',
'P181C',
'P184V',
'P190A',
'P215F',
'P215Y',
'P219R']

In [48]:
fig_3TC

Out[48]:

## Inference after LASSO

• The LASSO selected $\hat{E} \subset$ NRTI_muts of size 16 at $\lambda \approx 43$.

• What to report?

• Naive inference after selection is wrong.

• The reference distribution of the selected model is biased because of cherry-picking.

• Why not fix it?

In [51]:
fig_select

Out[51]:
In [53]:
fig_select

Out[53]:

## What are these intervals?

In [54]:
fig_select

Out[54]:

Intervals consistent with the data, having observed the active set.

In [55]:
fig_select

Out[55]:

Intervals seem to be long.

# Setup for selective inference

• Laid out formally in arxiv.org/1410.2597.

• Data $y \sim F$. (We have no model for $F$ at this point!)

• Set of questions ${\cal Q}$ we might ask about $F$.

• Use some exploratory technique to generate questions $\widehat{\cal Q}(y) \subseteq {\cal Q}.$
• Solve LASSO at some fixed $\lambda$ and look at active set.
• Choose a model by BIC and forward stepwise or best subset.
• Marginal screening.
• Test some or all of the hypotheses suggested by the point process $\widehat{\cal Q}(y)$.

# LASSO path

Instead of a fixed $\lambda$, we might look at the LASSO path.

In [56]:
%%R -i X,Y
library(lars)
plot(lars(X, as.numeric(Y), type='lar'))


# LASSO path

• A sequential procedure might consider "event times" $\lambda_j$.

• Might take ${\cal Q} = \{j : 1 \leq j \leq p\}$

• The selection procedure is $j^*(y)= \widehat{\cal Q}(y) = \text{argmax}_{1 \leq j \leq p} |X_j^Ty|.$

• Note $|X_{j^*(y)}^Ty| = \lambda_1.$

• Under $H_0: \beta \equiv 0$ (and normalization) (covTest) $\lambda_1(\lambda_1 - \lambda_2) \overset{D}{\to} \text{Exp}(1).$

• In fact, under $H_0:\beta \equiv 0$ (Kac-Rice) $\frac{1 - \Phi(\lambda_1)}{1 - \Phi(\lambda_2)} \overset{D}{=} \text{Unif}(0,1).$

• Sequential aspect makes the multi-step procedure more complicated than fixed $\lambda$...
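A sketch of this first-knot test, assuming unit-norm columns and $\sigma = 1$ under the global null (synthetic data; variable names are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).sum(axis=0))   # unit-norm columns: X_j^T y ~ N(0,1) under H0
y = rng.standard_normal(n)           # global null: beta = 0, sigma = 1

Z = np.abs(X.T @ y)
lam1, lam2 = np.sort(Z)[::-1][:2]    # first two event times of the path

# exact Kac-Rice pivot: Unif(0,1) under the global null
pval_exact = norm.sf(lam1) / norm.sf(lam2)
# covTest approximation: lam1 * (lam1 - lam2) -> Exp(1)
pval_cov = np.exp(-lam1 * (lam1 - lam2))
print(pval_exact, pval_cov)
```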

# What are the questions?

## Linear regression

• Define ${\cal Q} = \left\{(j,E): E \subset \{1, \dots, p\}, j \in E\right\}.$

• Indexes the set of OLS functionals, i.e. partial regression coefficients: $\beta_{j|E}(\mu) = e_j^TX_E^{\dagger}\mu, \qquad (j,E) \in {\cal Q}.$
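These functionals are just coordinates of a pseudoinverse applied to the mean vector; a minimal numerical sketch (the set $E$ and mean $\mu$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.standard_normal((n, p))
mu = 1.5 * X[:, 1]                      # mean vector with one nonzero partial effect

E = [1, 3, 7]                           # an illustrative selected set
beta_E = np.linalg.pinv(X[:, E]) @ mu   # X_E^dagger applied to mu
beta_1_given_E = beta_E[E.index(1)]     # the functional beta_{1|E}(mu)
print(beta_1_given_E)
```

Since $\mu$ lies in the span of $X_E$ with coefficient 1.5 on column 1, the functional recovers 1.5 exactly.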

# Simultaneous vs. selective inference

• Functionals $\beta_{j|E}$ also appear in POSI.

• Simultaneous inference over all questions $(j,E) \in {\cal Q}$.

• Achieved by controlling FWER: find $K_{\alpha}$ s.t. $\mathbb{P} \left(\sup_{(j,E) \in \cal Q} \frac{\beta_{j|E}(\epsilon)}{\|\beta_{j|E}\|_2} > K_{\alpha}\right) \leq \alpha$ where $\epsilon \sim N(0, \sigma^2 I)$.

### Selective inference uses the LASSO to select

$\widehat{\cal Q}(y) \subset {\cal Q}.$

We control coverage and type I error on selected questions.

# Selective inference for LASSO

• Our inference is based on the distribution ${\mathbb{Q}}_{E,z_E}(\cdot) = {\mathbb{P}}\left( \ \cdot \ \big\vert (\hat{E}, z_{\hat{E}}) = (E, z_E) \right)$ where $\mathbb{P} = \mathbb{P}_{\mu} = N(\mu, \sigma^2 I).$

• We derive exact pivots for $\beta_{j|E}(\mu)$ under $\mathbb{Q}_{E,z_E}$.

• Report intervals based on $\mathbb{Q}_{\hat{E}, z_{\hat{E}}}$.

# Selective inference for the LASSO

• Selection event: \begin{aligned} S(E,z_E) &= \{ (\hat{E}, \hat z_{\hat{E}}) = (E, z_E)\}\\ &= \left\{y: A(E, z_E)y \leq b(E,z_E) \right\} \end{aligned}

• $A(E,z_E)$ and $b(E,z_E)$ come from the KKT conditions.

• Active block $\text{diag}(z_E)\left(X_E^{\dagger}y - \lambda (X_E^TX_E)^{-1}z_E\right) \geq 0.$

• Inactive block $\left\|X_{-E}^T\left((I-X_EX_E^{\dagger})y + \lambda (X_E^T)^{\dagger} z_E \right) \right\|_{\infty} \leq \lambda.$
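To make the affine representation concrete, here is a sketch (not the authors' code; function and variable names are hypothetical) that assembles $A(E,z_E)$ and $b(E,z_E)$ from the two KKT blocks above:

```python
import numpy as np

def lasso_selection_event(X, E, z_E, lam):
    """Return (A, b) with {y : A y <= b} = {LASSO at lam selects (E, z_E)}."""
    n, p = X.shape
    XE = X[:, E]
    XEi = np.linalg.pinv(XE)                 # X_E^dagger
    P = XE @ XEi                             # projection onto col(X_E)
    Xm = X[:, [j for j in range(p) if j not in E]]

    # active block: diag(z_E)(X_E^dagger y - lam (X_E^T X_E)^{-1} z_E) >= 0
    C = np.linalg.inv(XE.T @ XE) @ z_E
    A1 = -np.diag(z_E) @ XEi
    b1 = -lam * np.diag(z_E) @ C

    # inactive block: |X_{-E}^T((I - P) y + lam (X_E^T)^dagger z_E)| <= lam
    R = Xm.T @ (np.eye(n) - P)
    c = lam * Xm.T @ (XEi.T @ z_E)           # (X_E^T)^dagger = (X_E^dagger)^T
    A2 = np.vstack([R, -R])
    b2 = np.concatenate([lam - c, lam + c])

    return np.vstack([A1, A2]), np.concatenate([b1, b2])
```

One can check that the observed $y$ satisfies $A y \le b$ (up to solver tolerance) whenever the LASSO at $\lambda$ actually selects $(E, z_E)$.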

## Visualizing LASSO partition

(Credit Naftali Harris)

# Inference

Condition on sufficient statistic $X_1^Ty$.

# Inference

Allowing for effect of $X_1$.

# Reduction to univariate problem

• Law of $X_3^Ty$ restricted to slice is a 1-parameter exponential family $\frac{f_{\theta}(z)}{f_0(z)} \propto \exp(\theta z) \cdot 1_{[{\cal V}^-(z^{\perp}),{\cal V}^+(z^{\perp})]}(z)$ with $\theta = \beta_{3|\{1,3\}}(\mu)/\sigma^2$.

• Reference measure: $f_0={\cal L}(\hat{\beta}_{3|\{1,3\}}(y))$.
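The truncation limits ${\cal V}^{\pm}$ come from intersecting the affine selection event with the line $\{r + cz\}$; a sketch of the resulting truncated-Gaussian pivot (generic helper, not the authors' code):

```python
import numpy as np
from scipy.stats import norm

def truncnorm_pivot(y, A, b, eta, sigma=1.0):
    """Pivot for eta^T mu under {A y <= b}; Unif(0,1) when eta^T mu = 0."""
    c = eta / (eta @ eta)                # direction of eta (Sigma = sigma^2 I)
    z = eta @ y                          # the statistic of interest
    r = y - c * z                        # the conditioned-on remainder z^perp
    Ac, slack = A @ c, b - A @ r         # constraints become (A c) z <= b - A r
    pos, neg = Ac > 1e-12, Ac < -1e-12
    vplus = np.min(slack[pos] / Ac[pos]) if pos.any() else np.inf
    vminus = np.max(slack[neg] / Ac[neg]) if neg.any() else -np.inf
    tau = sigma * np.sqrt(eta @ eta)     # sd of eta^T y
    num = norm.cdf(z / tau) - norm.cdf(vminus / tau)
    den = norm.cdf(vplus / tau) - norm.cdf(vminus / tau)
    return num / den
```

For example, with $\eta = e_1$, truncation $[-1, 1]$ and observed $z = 0$, the pivot is exactly $(\Phi(0)-\Phi(-1))/(\Phi(1)-\Phi(-1)) = 1/2$.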

# Saturated model

• Nuisance parameter is $P_{\eta}^{\perp}\mu$.

# Saturated model

• Each $\eta=\eta(E,z_E)$ determines a truncated univariate Gaussian.
In [57]:
fig_select

Out[57]:

# Selective hypothesis tests

• Under $\mathbb{Q}_{E,z_E}$, can construct tests $\phi_{(j|E)}$ of $H_{0,(j|E)} : \beta_{j|E}(\mu) = 0.$

• Tests satisfy selective type I error guarantee $\mathbb{Q}_{E,z_E}(\phi_{(j|E)}) \overset{H_{0,(j|E)}}{\leq} \alpha.$

• Conditional control implies marginal control.

• We report results $\phi_{(j| \hat{E})}(y), j \in \hat{E}$.

In [59]:
pvalue_table

Out[59]:
| Mutation | Naive OLS | Selective |
|----------|-----------|-----------|
| P62V | 0.137 | 0.369 |
| P65R | 0.000 | 0.000 |
| P67N | 0.000 | 0.000 |
| P69i | 0.000 | 0.000 |
| P75I | 0.484 | 0.553 |
| P77L | 0.270 | 0.469 |
| P83K | 0.003 | 0.051 |
| P90I | 0.000 | 0.014 |
| P115F | 0.014 | 0.168 |
| P151M | 0.008 | 0.081 |
| P181C | 0.000 | 0.002 |
| P184V | 0.000 | 0.000 |
| P190A | 0.063 | 0.309 |
| P215F | 0.000 | 0.016 |
| P215Y | 0.000 | 0.000 |
| P219R | 0.001 | 0.099 |

## What does it all mean?

### Unconditional viewpoint

• Intervals cover random variables (prediction intervals).

• Hypotheses seem to be events! But they're not...

## What does it all mean?

### Conditional viewpoint

• $\hat{E}$ is constant. Forming confidence intervals for fixed parameters.

• Hypotheses concern fixed parameters of $\mathbb{Q}_{E,z_E}$.

## Who's afraid of random hypotheses?

### Exploratory and confirmatory study

• Measure pilot data $S_1 = y_1 | X_1$

• Build a model: $\hat{E}(y_1)$.

• Measure a confirmatory sample $S_2 = y_2 | X_2$

• Form usual $t$-statistics based on $\hat{\beta}_2 = X_{2,\hat{E}(y_1)}^{\dagger}y_2.$

## Who's afraid of random hypotheses?

### Exploratory and confirmatory study

• We all know that this is valid....

• BUT, the probability space is really the joint law $(y_1, y_2)|(X_1, X_2)$.

# Data splitting

• Use some portion of the data to form a model $\hat{E}(y_1)$.

• Perform usual inference for $X_{2, \hat{E}(y_1)}^{\dagger}\mu_2.$

• Less data for selection, and less data for inference.
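The data-splitting recipe above can be sketched end to end (synthetic data; the split, penalty, and variable names are illustrative): select on the first half with the LASSO, then run classical OLS $t$-tests on the held-out half.

```python
import numpy as np
from scipy.stats import t as tdist
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 400, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 1.0                           # three real signals
y = X @ beta + rng.standard_normal(n)

# stage 1: model selection on the first half only
X1, y1 = X[: n // 2], y[: n // 2]
E = np.nonzero(Lasso(alpha=0.1, fit_intercept=False).fit(X1, y1).coef_)[0]

# stage 2: classical OLS t-tests on the held-out half
X2, y2 = X[n // 2 :, E], y[n // 2 :]
bhat = np.linalg.lstsq(X2, y2, rcond=None)[0]
df = X2.shape[0] - X2.shape[1]
s2 = ((y2 - X2 @ bhat) ** 2).sum() / df  # residual variance estimate
se = np.sqrt(s2 * np.diag(np.linalg.inv(X2.T @ X2)))
pvals = 2 * tdist.sf(np.abs(bhat / se), df)
print(dict(zip(E.tolist(), pvals)))
```

The second-stage inference is valid because $y_2$ is independent of the selected set $\hat{E}(y_1)$; the price is that only half the data drives each stage.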

### Selective inference (can) use all data for exploration and confirmation

Might use only part of the data for exploration.

In [61]:
fig_carve

Out[61]:
In [63]:
fig_carve

Out[63]:

Data carving intervals are shorter than data splitting intervals.

In [64]:
fig_carve

Out[64]:

Inference cannot be reduced to a simple univariate problem.

In [66]:
carve_pvalue_table

Out[66]:
| Mutation | Data splitting | Data carving |
|----------|----------------|--------------|
| P41L | 0.155 | 0.048 |
| P62V | 0.555 | 0.204 |
| P65R | 0.261 | 0.000 |
| P67N | 0.173 | 0.000 |
| P69i | 0.479 | 0.000 |
| P77L | 0.025 | 0.231 |
| P83K | 0.876 | 0.013 |
| P115F | 0.068 | 0.066 |
| P116Y | 0.220 | 0.204 |
| P181C | 0.000 | 0.000 |
| P184V | 0.000 | -0.000 |
| P215F | 0.000 | 0.009 |
| P215Y | 0.284 | 0.000 |
| P219R | 0.090 | 0.035 |

# Data carving

Holding out more data, data carving still beats data splitting.

# A general framework for selective inference

• Nothing was directly tied to the LASSO, so long as we can describe the selection event, i.e. the event on which the selected model became interesting.

• Nothing was directly tied to the model $N(\mu, \sigma^2 I)$ either.

• Can carry out inference with known (or unknown) $\sigma$ for the selected model: $\mathbb{Q}_{\beta_E;(E,z_E)}(\cdot) = \mathbb{P}_{\beta_E}\left( \ \cdot \ \big\vert (\hat{E}, z_{\hat{E}}) = (E, z_E) \right)$ where $\mathbb{P}_{\beta_E} \overset{D}{=} N(X_E\beta_E, \sigma^2 I).$

• Typically requires Monte Carlo inference.

# Data carving with known variance

• Split the data $y=(y_1,y_2)$, $X=(X_1,X_2)$.

• Run LASSO on $(y_1,X_1,\lambda)$.

• Selection event: affine constraints on $y_1$.

# Data carving with known variance

• Unconditional distribution: $\frac{d\mathbb{P}_{\beta_E}}{d\mathbb{P}_0}(y) \propto \exp\left( \frac{1}{\sigma^2}\beta_E^TX_E^Ty \right), \qquad \mathbb{P}_0 = N(0, \sigma^2 I).$

• Selective distribution: $\frac{d\mathbb{Q}_{\beta_E;(E,z_E)}}{d\mathbb{P}_{\beta_E}}(y) \propto 1_{S(E,z_E)}(y).$

• Inference for $\beta_{j|E}$: condition $\mathbb{Q}_{\beta_E;(E,z_E)}$ on $X_{E\setminus\{j\}}^Ty$.

• Monte Carlo sampling from multivariate Gaussian subject to affine constraints.
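One standard scheme for this Monte Carlo step is hit-and-run sampling, sketched here for $N(\mu, \sigma^2 I)$ restricted to $\{y : Ay \leq b\}$ (a generic sketch, not necessarily the implementation used for the figures):

```python
import numpy as np
from scipy.stats import norm

def hit_and_run(A, b, y0, mu, sigma, nsteps=500, rng=None):
    """One hit-and-run chain for N(mu, sigma^2 I) on {y : A y <= b}.

    y0 must be feasible; returns the final state of the chain.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = np.array(y0, dtype=float)
    for _ in range(nsteps):
        d = rng.standard_normal(len(y))
        d /= np.linalg.norm(d)                 # random unit direction
        # feasible chord {t : A(y + t d) <= b} is the interval [lo, hi]
        Ad, slack = A @ d, b - A @ y
        pos, neg = Ad > 1e-12, Ad < -1e-12
        lo = np.max(slack[neg] / Ad[neg]) if neg.any() else -np.inf
        hi = np.min(slack[pos] / Ad[pos]) if pos.any() else np.inf
        # along the chord, t ~ N(m, sigma^2) truncated to [lo, hi];
        # sample by inverse CDF
        m = d @ (mu - y)
        a, c = norm.cdf((lo - m) / sigma), norm.cdf((hi - m) / sigma)
        u = np.clip(rng.uniform(a, c), 1e-12, 1 - 1e-12)
        y = y + (m + sigma * norm.ppf(u)) * d
    return y
```

Each step stays inside the polyhedron by construction, and the one-dimensional truncated-Gaussian draw along the chord is exact.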

In [67]:
fig_carve

Out[67]:
In [68]:
carve_pvalue_table

Out[68]:
| Mutation | Data splitting | Data carving |
|----------|----------------|--------------|
| P41L | 0.155 | 0.048 |
| P62V | 0.555 | 0.204 |
| P65R | 0.261 | 0.000 |
| P67N | 0.173 | 0.000 |
| P69i | 0.479 | 0.000 |
| P77L | 0.025 | 0.231 |
| P83K | 0.876 | 0.013 |
| P115F | 0.068 | 0.066 |
| P116Y | 0.220 | 0.204 |
| P181C | 0.000 | 0.000 |
| P184V | 0.000 | -0.000 |
| P215F | 0.000 | 0.009 |
| P215Y | 0.284 | 0.000 |
| P219R | 0.090 | 0.035 |

## Other examples in the literature

• Selective intervals (Benjamini & Yekutieli; Weinstein, Fithian and Benjamini; others)

• Drop-the-losers binomial designs (Sampson & Sill)

• $p$-values for maxima of random fields (Schwartzman & Chen)

• Effect size estimation (Benjamini & Rosenblatt; Zhong & Prentice)

# Least Angle Regression

## Least Angle Regression

• Asymptotic analysis of first step of LAR / LASSO / FS was considered in covTest.

• Selective framework provides exact test of global null $\exp(-\lambda_2(\lambda_1-\lambda_2)) \approx \frac{1 - \Phi(\lambda_1)}{1 - \Phi(\lambda_2)} \overset{H_0:\beta\equiv 0}{\sim} \text{Unif}(0,1)$

• LAR sequence up to $k$ steps (tracking signs on entering) can be expressed as a set of affine inequalities (including AIC stopping).

• Exact extension of covTest beyond first step.

## Categorical variables

• The LAR approach does not generally allow grouped (i.e. categorical) variables.

• Extension of covTest to grouped variables.

• Coming soon: inference for complete FS path for grouped variables.

## Square-root LASSO

$\hat{\beta}_{\gamma} = \text{argmin}_{\beta \in \mathbb{R}^p} \|y-X\beta\|_2 + \gamma \|\beta\|_1.$

• Tuning parameter is free of $\sigma$.

• Selection event is no longer convex in general, though selective inference still possible.

• Also coming soon.

## Estimation

• Selective distribution used for hypothesis tests, intervals.

• Selective pseudo MLE: $\int_{\mathbb{R}} X_j^Tz \; \mathbb{Q}_{\hat{\beta}_{j|E}(y);(E,z_E)}\left(dz \big\vert X_{E\setminus j}^Tz=X_{E\setminus j}^Ty \right) = X_j^Ty.$

• Mean estimation in orthogonal design after BH.

• Provides estimate of $\sigma^2$ in $\sqrt{\text{LASSO}}$.

## Asymptotics

• Inference so far is very parametric.

• We have some partial results: arxiv.org/abs/1501.03588.

• Ryan Tibshirani and others at CMU also have something coming.

• What about GLMs? No explicit results yet.

## Peak inference in neuroimaging & critical points

(Credit Wikipedia)

In some sense, this is where it all started...

## Peak inference in neuroimaging

• Let $(T(x))_{x \in B}$ be an image of test statistics (SPM).

• Set ${\cal Q}=B$, and $\hat{\cal Q}(T)$ to be the set of local maxima / minima.

• Report p-value for each critical point in $\hat{\cal Q}(T)$.

## Peak inference in neuroimaging

• Long history in brain imaging (work of Worsley, Friston, Benjamini).

• Goal was simultaneous inference.

• Simultaneous tools can be converted to selective tools (arxiv.org/1308.3020).

• Similar approach can be used for testing in PCA (arxiv.org/1410.8260).

• Recent work of Schwartzman & Chen (arxiv.org/1405.1400).

• Selective distributions: Slepian models / Palm distributions.

## Thanks

• NSF-DMS 1208857 and AFOSR-113039.

• Many collaborators.