Last modified: Thu Oct 20 11:26:25 EDT 2005 by Dean Foster

Statistical Data mining: Intrinsically large data (aka graphs)

Administrivia

Database people say, "big data is that which doesn't fit in memory."
When do we have big data in statistics?
If n is large we can subsample to generate reasonably sized data
Random sampling is unbiased
- Larry does this in his call center data
- loses signal, so only finds big stuff
Sampling on Y can be more efficient
- Similar to matched data in observational studies
- If there are many "no"s and few "yes"es, use a subsample of the "no"s and all of the "yes"es
- Eg: direct marketing, fraud detection, targeted advertisements
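
Sampling on Y can be sketched roughly as follows. This is only an illustration, not the method used in any of the examples above: the function name `subsample_on_y`, the `keep_rate` parameter, and the toy data are all made up here. Each kept "no" carries a weight of 1/keep_rate so that weighted counts still estimate the full data.

```python
import random


def subsample_on_y(rows, keep_rate, seed=0):
    """Keep every row with y == 1 and a random fraction of rows with y == 0.

    rows: list of (x, y) pairs with y in {0, 1}.
    Returns (x, y, weight) triples. Kept "no" rows get weight 1 / keep_rate
    so that weighted totals still estimate the totals in the full data.
    """
    rng = random.Random(seed)
    sample = []
    for x, y in rows:
        if y == 1:
            sample.append((x, y, 1.0))               # keep all the "yes"es
        elif rng.random() < keep_rate:
            sample.append((x, y, 1.0 / keep_rate))   # keep a few "no"s, upweighted
    return sample


# Toy data (assumed): 10,000 rows, 1% of them "yes".
rows = [(i, 1 if i % 100 == 0 else 0) for i in range(10_000)]
sample = subsample_on_y(rows, keep_rate=0.05)
n_yes = sum(1 for _, y, _ in sample if y == 1)
weighted_no = sum(w for _, y, w in sample if y == 0)
```

The sample is far smaller than the original data but still contains every rare "yes"; a model fit to it would need to use the weights (or an intercept correction) to undo the deliberate imbalance.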

Graph-like data can't be sub-sampled (dropping nodes breaks the links between them)
examples:
- Citation graph (our working example)
- Wikipedia
- WWW links
- phone call database (long distance say)
Must keep everything in database
But not our problem--we only need a vector to represent each X

Try to keep things separate as much as possible
Query the database for a vector of X's
Collect up a bunch of X's and drop them into a statistical model
Now we don't have to learn data structures, database optimizations, etc. Leave that to the computer scientists.
But we do need to understand the interface.
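
A minimal sketch of what such an interface might look like, assuming a tiny in-memory citation graph stands in for the real database. The names `CITES`, `in_degree`, and `query_x`, and the three features chosen, are hypothetical and only for illustration. The point is that the statistical model sees only fixed-length vectors, never the graph.

```python
# Hypothetical stand-in for the database: paper -> list of papers it cites.
CITES = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["A", "C"],
}


def in_degree(paper, cites):
    """Number of papers that cite `paper`."""
    return sum(paper in refs for refs in cites.values())


def query_x(paper, cites=CITES):
    """Return one fixed-length feature vector for `paper`.

    Illustrative features:
      [papers it cites, papers citing it, citations received by the papers it cites]
    """
    refs = cites[paper]
    return [
        len(refs),
        in_degree(paper, cites),
        sum(in_degree(r, cites) for r in refs),
    ]


# Collect a bunch of X's; this matrix is all the statistical model needs.
X = [query_x(p) for p in sorted(CITES)]
```

Everything about storage and graph traversal stays behind `query_x`, so the database side could be swapped out without touching the modeling code.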

Model: finding a solution to the 15 puzzle (15 tiles on 16 squares)
- History: the challenge of reversing the order of the tiles
- All the rage in the 1880s
- Easy to prove it is impossible (a parity argument)
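
The impossibility comes from a parity invariant: every move swaps the blank with one tile (flipping the parity of the permutation) and moves the blank one square (flipping the parity of its distance from the corner), so the sum of the two never changes parity. A rough sketch of the check follows; the function name `is_solvable` and the board encoding (row-major tuple, 0 for the blank, goal with the blank in the last square) are assumptions made for this example.

```python
def is_solvable(board, width=4):
    """Parity check for a sliding-tile board.

    board: tuple of width*width ints in row-major order, 0 = blank.
    The goal is assumed to be 1, 2, ..., with the blank in the last square.
    """
    n = len(board)
    # Treat the blank as the last tile so the goal is the identity permutation.
    perm = [(v - 1) if v != 0 else n - 1 for v in board]

    # Permutation parity: a cycle of length k is k - 1 transpositions.
    seen = [False] * n
    transpositions = 0
    for start in range(n):
        length = 0
        j = start
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        if length:
            transpositions += length - 1

    # Distance of the blank from its goal corner (bottom right).
    row, col = divmod(board.index(0), width)
    blank_dist = (width - 1 - row) + (width - 1 - col)

    # The goal has even total parity, and no move can change it.
    return (transpositions + blank_dist) % 2 == 0


GOAL = tuple(range(1, 16)) + (0,)
REVERSED = tuple(range(15, 0, -1)) + (0,)       # tiles in reverse order
ONE_MOVE = tuple(range(1, 15)) + (0, 15)        # blank slid one square left
```

Here `is_solvable(REVERSED)` is false: reversing 15 tiles takes 7 swaps, an odd number, while the blank ends where it started.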
Search strategies:
- Depth first search
- Breadth first search
  - Doesn't use topology of variables
  - lots of memory
- A* algorithm
  - Score each node: cost so far plus a guess at the cost remaining
  - Expand the best node first
  - Will find the shortest answer if the guess is a lower bound on the remaining search depth
- Iterative deepening A* (IDA*)
  - Cool trick: Top of tree is small, so just recompute it!
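
A rough sketch of iterative deepening A* on the sliding-tile puzzle, under some assumptions not in the notes: the Manhattan distance is used as the lower bound on remaining moves, boards are row-major tuples with 0 for the blank, and the names `manhattan`, `neighbors`, and `ida_star` are made up here. Each pass is a depth first search cut off at a cost bound; the bound then rises and the top of the tree is simply searched again.

```python
WIDTH = 4
GOAL = tuple(range(1, WIDTH * WIDTH)) + (0,)


def manhattan(board):
    """Lower bound on remaining moves: sum of tile distances to their goals."""
    total = 0
    for pos, tile in enumerate(board):
        if tile == 0:
            continue
        goal_pos = tile - 1
        total += abs(pos // WIDTH - goal_pos // WIDTH)
        total += abs(pos % WIDTH - goal_pos % WIDTH)
    return total


def neighbors(board):
    """Boards reachable by sliding one tile into the blank."""
    blank = board.index(0)
    row, col = divmod(blank, WIDTH)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < WIDTH and 0 <= c < WIDTH:
            swap = r * WIDTH + c
            nxt = list(board)
            nxt[blank], nxt[swap] = nxt[swap], nxt[blank]
            yield tuple(nxt)


def ida_star(start):
    """Return a shortest list of boards from start to GOAL.

    Assumes start is solvable; an unsolvable start would effectively never finish.
    """
    path = [start]

    def search(g, bound):
        node = path[-1]
        f = g + manhattan(node)          # cost so far + guess at the rest
        if f > bound:
            return f                     # report how far over the bound we went
        if node == GOAL:
            return True
        smallest = float("inf")
        for nxt in neighbors(node):
            if nxt in path:              # don't loop back along the current path
                continue
            path.append(nxt)
            result = search(g + 1, bound)
            if result is True:
                return True
            smallest = min(smallest, result)
            path.pop()
        return smallest

    bound = manhattan(start)
    while True:
        result = search(0, bound)        # recompute the whole tree each pass
        if result is True:
            return list(path)
        if result == float("inf"):
            return None
        bound = result                   # deepen to the smallest f that was cut off


# A board three moves away from the goal (assumed test case).
start = (1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 11, 12, 13, 10, 14, 15)
solution = ida_star(start)
```

Only the current path is kept in memory, in contrast to breadth first search or plain A*, which hold a whole frontier of boards.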
dean@foster.net