Last modified: Thu Oct 20 11:26:25 EDT 2005
by Dean Foster
Statistical Data mining: Intrinsically large data (aka graphs)
Admistrivia
- Homework 3 is due next week. Any questions on it so far?
n large is often not a problem
- Database people say, "big data is that which doesn't fit in
memory."
- When do we have big data in statistics?
- If n is large we can subsample to generate resonable sized data
- Random sampling is unbiased
- Larry does this in his call center data
- loses signal, so only finds big stuff
- Sampling on Y can be more efficient
- Simillar to matched data in obseravational studies
- If many "no"'s and few "yes"'s use subsample of "no"'s and
all the "yes"
- Eg: direct marketing, fraud detection, targeted advertisements
When isn't sampling available?
- Graph like data can't be sub-sampled
- examples:
- Citation graph (our working example)
- Wikipedia
- WWW links
- phone call database (long distance say)
- Must keep everything in database
- But not our problem--we only need a vector to represent each X
Model: Stat engine attached to database query engine
- Try to keep things seperate as much as possible
- query database for a vector of X's
- Collect up a bunch of X's and drop them into a statistical
model
- Now we don't have to learn data structures, database
optimizations, etc. Leave that to the computer scientists
- But we do need to understand the interface.
What do graph variables look like?
- easy stuff: Cites(x,y), author(a,b)
- meta-cites(a,b) = sum cites(a,x) cites(x,b)
- selfcites(a) = sum(x,y) author(a,x) and cites(a,y) and author(y,x)
- Lots of similar variables
Searching an infinite set of variables
- Can't list them all like regular regression
- Need a way of walking through them one at at time
- Think of building them up from more primative ones
Search methods
- Model: finding solution to 16 puzzle
- history: reverse order of tiles
- All the rage in 1700's
- Easy to prove it is impossible
- Depth first search
Nice recursive structure
- Breadth first search
- Doesn't use topology of variables
- lots of memory
- A* algorithm
- Guess where you are so far
- Expand the best node first
- Will find answer if A* is bound on remaining search depth
- Itterative deaping A* (IDA)
- Cool trick: Top of tree is small, so just recompute it!
dean@foster.net