Last modified: Tue Nov 1 14:48:15 EST 2005
by Dean Foster

# Statistical Data mining: Streaming searching

## Administrivia

- Collect HW 3
- HW 4 due next Thursday.

## Biostat: alpha spending rules

- Stopping in clinical trials.
- multiple endpoints / tests.

## Alpha spending in variable selection

- Sequentially look at each variable
- Spend some alpha on it to see if it should enter
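
A minimal sketch of the sequential rule above, under stated assumptions: the p-values are hypothetical, one per candidate variable in the order considered, and the geometric schedule alpha_j = alpha / 2^j is only an illustrative choice. Any nonnegative schedule that sums to at most alpha fits the same outline.

```python
def alpha_spending(p_values, alpha=0.05):
    """Look at each variable in order, spending a share of alpha on it.

    Assumption: alpha_j = alpha / 2**j is an illustrative spending
    schedule, not one prescribed by the notes.
    """
    selected = []
    spent = 0.0
    for j, p in enumerate(p_values, start=1):
        alpha_j = alpha / 2 ** j   # share of alpha spent on variable j
        spent += alpha_j
        if p <= alpha_j:           # variable j enters the model
            selected.append(j - 1)
    return selected, spent


# Hypothetical p-values for four candidate variables.
selected, spent = alpha_spending([0.001, 0.20, 0.004, 0.03])
```

The total spent never exceeds alpha, no matter how many variables are examined.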

## FWER

- Probability of a union is at most the sum of the probabilities (Boole's inequality)
- Union is the chance of making even one mistake
- FWER = family-wise error rate = worst-case chance of making any error
- Theorem: alpha spending controls FWER at level alpha.
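
The theorem follows from the union bound in one line. The sketch below assumes notation that is not in the notes: alpha_j is the amount spent on test j, fixed in advance with the alpha_j summing to at most alpha, and each p-value p_j is valid under its null, so that P(p_j <= alpha_j) <= alpha_j.

```latex
\mathrm{FWER}
  = P\Bigl(\bigcup_{j \,:\, H_j \text{ true}} \{p_j \le \alpha_j\}\Bigr)
  \le \sum_{j \,:\, H_j \text{ true}} P(p_j \le \alpha_j)
  \le \sum_{j} \alpha_j
  \le \alpha .
```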

## But do we want FWER for prediction?

- FWER guarantees not overfitting
- Draw typical out-of-sample MSE graph
- Out-of-sample graph version of FWER: it guarantees never going up by alpha by bad luck
- We want the minimum, not a conservative left point
- Note: Some people argue we don't even want FWER for multiple testing
- Makes more sense to trade off type I against type II error

## Better scheme

- For each rejection, give out new alpha to spend
- Called alpha investing rule
- amount given out controls tradeoff between type I and type II error
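
A rough sketch of the idea, not the rule as derived in lecture. Several details are assumptions added for illustration: the names `w0`, `payout`, and `bid_fraction`, the choice to bid a fixed fraction of the largest affordable level, and the charge alpha_j / (1 - alpha_j) for a test that fails to reject, which follows one commonly described form of alpha investing.

```python
def alpha_investing(p_values, w0=0.05, payout=0.05, bid_fraction=0.5):
    """Sequential testing in which each rejection earns more alpha.

    Assumptions (hypothetical, for illustration only):
      w0           initial alpha-wealth
      payout       alpha given out after each rejection
      bid_fraction share of the largest affordable level bid on each test
    """
    wealth = w0
    rejected = []
    for j, p in enumerate(p_values):
        if wealth <= 0:
            break
        # Bidding below wealth / (1 + wealth) keeps the charge for a
        # failed test under the current wealth.
        alpha_j = bid_fraction * wealth / (1.0 + wealth)
        if p <= alpha_j:
            rejected.append(j)
            wealth += payout                      # rejection: new alpha to spend
        else:
            wealth -= alpha_j / (1.0 - alpha_j)   # no rejection: pay for the test
    return rejected, wealth


# Hypothetical p-values, in the order the variables are considered.
rejected, wealth = alpha_investing([0.001, 0.5, 0.01, 0.9])
```

A larger `payout` makes later tests easier to pass (fewer type II errors, more type I errors); a smaller one pushes the rule back toward plain alpha spending.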

## Better analysis

- 2 x 2 table (see handout): U / V / T / S, with V + S = R and U + V = m_0 = number of true nulls

## FDR: False Discovery Rate

- FDR = E(V/R)
- FDR < alpha is target
- Simes procedure will control FDR
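
A sketch of the step-up procedure built on Simes's critical values i * alpha / m (often called the Benjamini-Hochberg procedure). The function name and the example p-values are hypothetical, and the FDR guarantee is usually stated for independent (or positively dependent) tests.

```python
def simes_step_up(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank i
    with p_(i) <= i * alpha / m. Returns the rejected indices, sorted."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank               # keep the largest rank that passes
    return sorted(order[:k])


# Hypothetical p-values; with m = 5 the thresholds are 0.01, 0.02, ..., 0.05.
few = simes_step_up([0.001, 0.008, 0.039, 0.041, 0.6])
more = simes_step_up([0.001, 0.008, 0.039, 0.0395, 0.6])
```

In the second call the fourth-smallest p-value passes its threshold, so the third is rejected as well even though it fails its own threshold: that is the step-up feature.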

## EDC: Excess discovery count

- EDC = E(S - gamma R) + alpha
- EDC > 0 is target
- alpha controls FWER(0)
- gamma controls FDR at rate about 1-gamma
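
Why gamma gives an FDR-like rate of about 1 - gamma: a short derivation from the definition above, using S = R - V from the 2 x 2 table. It assumes the target is written as EDC >= 0 and that E(R) > 0.

```latex
\begin{aligned}
\mathrm{EDC} = E(S) - \gamma\,E(R) + \alpha \ge 0
  &\iff E(R) - E(V) - \gamma\,E(R) + \alpha \ge 0 \\
  &\iff E(V) \le (1 - \gamma)\,E(R) + \alpha \\
  &\iff \frac{E(V)}{E(R)} \le (1 - \gamma) + \frac{\alpha}{E(R)} .
\end{aligned}
```

So when many rejections are expected, the ratio of expected false discoveries to expected discoveries is at most about 1 - gamma; when few are expected, the alpha term dominates.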

## Theorem: alpha investing controls EDC

dean@foster.net