Last modified: Thu Oct 20 11:26:25 EDT 2005
by Dean Foster

# Statistical Data mining: Intrinsically large data (aka graphs)

## Admistrivia

- Homework 3 is due next week. Any questions on it so far?

## n large is often not a problem

- Database people say, "big data is that which doesn't fit in
memory."
- When do we have big data in statistics?
- If n is large we can subsample to generate resonable sized data
- Random sampling is unbiased
- Larry does this in his call center data
- loses signal, so only finds big stuff

- Sampling on Y can be more efficient
- Simillar to matched data in obseravational studies
- If many "no"'s and few "yes"'s use subsample of "no"'s and
all the "yes"
- Eg: direct marketing, fraud detection, targeted advertisements

## When isn't sampling available?

- Graph like data can't be sub-sampled
- examples:
- Citation graph (our working example)
- Wikipedia
- WWW links
- phone call database (long distance say)

- Must keep everything in database
- But not our problem--we only need a vector to represent each X

## Model: Stat engine attached to database query engine

- Try to keep things seperate as much as possible
- query database for a vector of X's
- Collect up a bunch of X's and drop them into a statistical
model
- Now we don't have to learn data structures, database
optimizations, etc. Leave that to the computer scientists
- But we do need to understand the interface.

## What do graph variables look like?

- easy stuff: Cites(x,y), author(a,b)
- meta-cites(a,b) = sum cites(a,x) cites(x,b)
- selfcites(a) = sum(x,y) author(a,x) and cites(a,y) and author(y,x)
- Lots of similar variables

## Searching an infinite set of variables

- Can't list them all like regular regression
- Need a way of walking through them one at at time
- Think of building them up from more primative ones

## Search methods

- Model: finding solution to 16 puzzle
- history: reverse order of tiles
- All the rage in 1700's
- Easy to prove it is impossible

- Depth first search
Nice recursive structure
- Breadth first search
- Doesn't use topology of variables
- lots of memory

- A* algorithm
- Guess where you are so far
- Expand the best node first
- Will find answer if A* is bound on remaining search depth

- Itterative deaping A* (IDA)
- Cool trick: Top of tree is small, so just recompute it!

dean@foster.net