# Statistical Data Mining: Streaming searching

• Collect HW 3
• HW 4 due next Thursday.

## Biostat: alpha spending rules

• Stopping in clinical trials.
• multiple endpoints / tests.

## Alpha spending in variable selection

• Sequentially look at each variable
• Spend some alpha on it to see if it should enter
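
A minimal sketch of the idea, under assumptions the notes do not fix: each variable arrives with a p-value for entering the model, and the budget is spent on a geometric schedule (the `decay` parameter and the function name are hypothetical). Any schedule whose levels sum to at most alpha would serve equally well.

```python
def alpha_spending(p_values, alpha=0.05, decay=0.5):
    """Return indices of variables that enter under a fixed alpha budget.

    Assumed schedule: variable j is tested at level
    alpha * (1 - decay) * decay**j, which sums to at most alpha over all j.
    """
    selected = []
    for j, p in enumerate(p_values):
        level = alpha * (1.0 - decay) * decay ** j  # alpha spent on variable j
        if p <= level:
            selected.append(j)  # variable j enters the model
    return selected


if __name__ == "__main__":
    print(alpha_spending([0.001, 0.02, 0.004, 0.5]))
```

Note that the budget only ever shrinks: a variable that appears late in the stream faces a much stricter level than one that appears early.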

## FWER

• Probability of a union is at most the sum of the probabilities (union bound)
• The union is the event of making even one mistake
• FWER = family-wise error rate = chance of making at least one false rejection
• Theorem: alpha spending controls FWER at level alpha.
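
The argument behind the theorem can be written in one line. The notation is assumed here, not taken from the handout: A_j is the event that step j falsely rejects a true null, and alpha_j is the amount of alpha spent at step j, with the alpha_j summing to at most alpha.

$$
\mathrm{FWER} \;=\; P\Big(\bigcup_j A_j\Big) \;\le\; \sum_j P(A_j) \;\le\; \sum_j \alpha_j \;\le\; \alpha
$$

The first inequality is the union bound, and the second holds because each test is run at level alpha_j.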

## But do we want FWER for prediction?

• FWER guarantees not overfitting
• Draw typical out-of-sample MSE graph
• Out-of-sample graph version of FWER: it guarantees that the chance of the curve going up through bad luck is at most alpha
• We want the minimum, not a conservative left point
• Note: Some people argue we don't even want FWER for multiple testing
• Makes more sense to trade off type I against type II error

## Better scheme

• For each rejection, give out new alpha to spend
• Called alpha investing rule
• amount given out controls tradeoff between type I and type II error
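
A rough sketch of such a rule follows. The details are assumptions rather than the rule from class: the procedure holds a running alpha "wealth", stakes a fixed fraction of it on each test, loses the stake when the test fails to reject, and receives `payout` when it rejects. The names `payout` and `bid_fraction` are hypothetical.

```python
def alpha_investing(p_values, initial_wealth=0.05, payout=0.05, bid_fraction=0.5):
    """Return indices rejected by a simple alpha-investing sketch.

    Assumed update: a test at level a costs a / (1 - a) when it fails to
    reject and earns `payout` when it rejects.
    """
    wealth = initial_wealth
    rejected = []
    for j, p in enumerate(p_values):
        if wealth <= 0.0:
            break  # nothing left to spend
        stake = bid_fraction * wealth   # amount lost if test j does not reject
        level = stake / (1.0 + stake)   # chosen so level / (1 - level) == stake
        if p <= level:
            rejected.append(j)
            wealth += payout            # new alpha handed out for the rejection
        else:
            wealth -= stake
    return rejected


if __name__ == "__main__":
    stream = [0.5, 0.001, 0.03, 0.9]
    print(alpha_investing(stream))              # with payout
    print(alpha_investing(stream, payout=0.0))  # no payout
```

With `payout=0.0` the wealth can only stay level or shrink, so the procedure cannot afford looser tests later in the stream. A larger payout buys more power after each discovery at the cost of more false rejections.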

## Better analysis

• 2 x 2 table (see handout): cells U/V/T/S. V + S = R = number of rejections. U + V = m0 = number of true nulls
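
The two identities pin down the cells: V counts true nulls that were rejected, S counts false nulls that were rejected, U counts true nulls that were kept, and T is the remaining cell. A small sketch for tallying them in a simulation where the truth is known (the function name and inputs are hypothetical):

```python
def table_counts(is_true_null, rejected):
    """Tally the 2 x 2 table from parallel lists of booleans."""
    U = V = T = S = 0
    for null, rej in zip(is_true_null, rejected):
        if null and not rej:
            U += 1  # true null, not rejected
        elif null and rej:
            V += 1  # true null, rejected (false discovery)
        elif not null and not rej:
            T += 1  # false null, not rejected (missed)
        else:
            S += 1  # false null, rejected (true discovery)
    return {"U": U, "V": V, "T": T, "S": S, "R": V + S, "m0": U + V}


if __name__ == "__main__":
    truth = [True, True, False, False, True]
    reject = [False, True, True, False, False]
    print(table_counts(truth, reject))
```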

## FDR: False Discovery Rate

• FDR = E(V/R), with V/R taken to be 0 when R = 0
• FDR ≤ alpha is the target
• Simes procedure will control FDR
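
For concreteness, here is a sketch of a step-up rule built on the Simes critical values k * alpha / m, in the form usually attributed to Benjamini and Hochberg. This assumes that is the procedure the notes have in mind. Unlike the streaming rules, it needs all m p-values at once.

```python
def simes_step_up(p_values, alpha=0.05):
    """Return indices rejected by the step-up rule with levels k * alpha / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank  # largest rank whose p-value clears its level
    return sorted(order[:k])  # reject the k smallest p-values


if __name__ == "__main__":
    print(simes_step_up([0.012, 0.045, 0.025, 0.004, 0.5]))
    print(simes_step_up([0.02, 0.03, 0.9]))
```

In the second call the smallest p-value misses its own level but is still rejected, because the second smallest clears the looser level for rank 2. That is the "step-up" part.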

## EDC: Excess discovery count

• EDC = E(S - gamma R) + alpha
• EDC ≥ 0 is the target
• alpha controls FWER(0)
• gamma controls FDR at rate about 1-gamma
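
To see where the "about 1-gamma" comes from, substitute S = R - V from the 2 x 2 table into the definition and require the EDC to be non-negative. This is a rearrangement of the definition above, not an additional result:

$$
E(S - \gamma R) + \alpha \ge 0
\;\Longleftrightarrow\;
E(V) \le (1-\gamma)\,E(R) + \alpha
\;\Longleftrightarrow\;
\frac{E(V)}{E(R)} \le (1-\gamma) + \frac{\alpha}{E(R)} \quad (E(R) > 0)
$$

The bound is on the ratio of expectations E(V)/E(R) rather than on E(V/R), which is why the rate is only approximately 1-gamma. The alpha term matters most when few rejections are expected.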

## Theorem: alpha investing controls EDC

• proof next time?

dean@foster.net