# Statistical Data mining: Intrinsically large data (aka graphs)

• Homework 3 is due next week. Any questions on it so far?

## n large is often not a problem

• Database people say, "big data is that which doesn't fit in memory."
• When do we have big data in statistics?
• If n is large we can subsample to generate resonable sized data
• Random sampling is unbiased
• Larry does this in his call center data
• loses signal, so only finds big stuff
• Sampling on Y can be more efficient
• Simillar to matched data in obseravational studies
• If many "no"'s and few "yes"'s use subsample of "no"'s and all the "yes"

## When isn't sampling available?

• Graph like data can't be sub-sampled
• examples:
• Citation graph (our working example)
• Wikipedia
• phone call database (long distance say)
• Must keep everything in database
• But not our problem--we only need a vector to represent each X

## Model: Stat engine attached to database query engine

• Try to keep things seperate as much as possible
• query database for a vector of X's
• Collect up a bunch of X's and drop them into a statistical model
• Now we don't have to learn data structures, database optimizations, etc. Leave that to the computer scientists
• But we do need to understand the interface.

## What do graph variables look like?

• easy stuff: Cites(x,y), author(a,b)
• meta-cites(a,b) = sum cites(a,x) cites(x,b)
• selfcites(a) = sum(x,y) author(a,x) and cites(a,y) and author(y,x)
• Lots of similar variables

## Searching an infinite set of variables

• Can't list them all like regular regression
• Need a way of walking through them one at at time
• Think of building them up from more primative ones

## Search methods

• Model: finding solution to 16 puzzle
• history: reverse order of tiles
• All the rage in 1700's
• Easy to prove it is impossible
• Depth first search
Nice recursive structure
Can get carried away