The bonus sections for the following problems are optional. If you have time, do at least one of them. If you complete a bonus problem, no late penalty will be assessed for the first week late. A one-week extension will be added for each further bonus problem completed.
Use this information to write a one-paragraph description of a d-dimensional cube. If you have figured out the simplex, which seems more dangerous: a d-dimensional cube or a d-dimensional simplex?
q_{z,d}(u) ≤ z + d + 2 sqrt((2z + d) log(1/u)) + 2 log(1/u)
q_{z,d}(1-u) ≥ z + d - 2 sqrt((2z + d) log(1/u)).
Using the asymptotic distribution you computed in the very first homework, see how closely the normal approximation agrees with these bounds.
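To get numbers to compare against, the right-hand sides of the two inequalities can be evaluated directly. The following is a minimal sketch in Python; the function names are made up for illustration and are not part of the assignment, and the values z = 3, d = 10 in the demo are arbitrary.

```python
import math


def upper_quantile_bound(z, d, u):
    """Right-hand side of the first inequality (the bound on q_{z,d}(u))."""
    x = math.log(1.0 / u)
    return z + d + 2.0 * math.sqrt((2.0 * z + d) * x) + 2.0 * x


def lower_quantile_bound(z, d, u):
    """Right-hand side of the second inequality (the bound on q_{z,d}(1-u))."""
    x = math.log(1.0 / u)
    return z + d - 2.0 * math.sqrt((2.0 * z + d) * x)


if __name__ == "__main__":
    # Arbitrary illustrative values of z and d; u must lie in (0, 1].
    for u in (0.1, 0.05, 0.01):
        print(u, lower_quantile_bound(3.0, 10, u), upper_quantile_bound(3.0, 10, u))
```

Both bounds collapse to z + d when u = 1, since log(1/u) is then zero, which is a quick sanity check on the code.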
INSTRUCTIONS FOR DATA CLEANING:
bunzip2 20021010_easy_ham.tar.bz2
tar -xvf 20021010_easy_ham.tar
cat all | tr " " "\n" | sort | uniq -c | sort -n | tail -1000
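The cut command can be used to remove the numbers (the counts that uniq -c puts in front of each word) if you need to. Note: you can use tr to remove the punctuation via "tr -d '[:punct:]'"; quote the character class so the shell does not try to expand it as a file name pattern. This is a trick from Kenny.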
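If you would rather not do this step in the shell, the same idea (delete punctuation, split into words, count, keep the most frequent) can be sketched in Python. This is only an illustration: the function name top_words is made up, and unlike tr " " "\n" it splits on any whitespace, not just spaces.

```python
import string
from collections import Counter


def top_words(text, n=1000):
    """Return the n most frequent tokens after deleting ASCII punctuation."""
    # string.punctuation holds the ASCII punctuation characters,
    # which is what tr -d '[:punct:]' deletes in the C locale.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    # Counting is case-sensitive, as in the sort | uniq -c pipeline.
    return Counter(cleaned.split()).most_common(n)


if __name__ == "__main__":
    print(top_words("Hello, hello world. Hello!", n=2))
```

The result is a list of (word, count) pairs, so there are no leading numbers to strip off with cut.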
How you want to do this 10000 times is up to you. You can write a little perl script, a C++ program, or a tcsh shell script. For example, in tcsh syntax it might look like:
mkdir ../word_counts
foreach foo ( `cat ../word_list` )
egrep --no-filename --count "\W$foo\W" * > ../word_counts/$foo
end
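For those who prefer a script to the tcsh loop, here is a rough Python sketch of the same computation: for each word, count in each file the number of lines matching \W<word>\W, which is what egrep --count reports. The function names are made up, the files are read as latin-1 bytes, and they are taken in sorted path order, which is an assumption and may not match the order your shell uses for *.

```python
import re
from pathlib import Path


def word_pattern(word):
    """Compile \\W<word>\\W; re.ASCII roughly mimics grep in the C locale."""
    return re.compile(r"\W" + re.escape(word) + r"\W", re.ASCII)


def matching_line_count(pattern, lines):
    """Number of lines containing a match (lines, not occurrences)."""
    return sum(1 for line in lines if pattern.search(line))


def count_words_in_dir(words, message_dir):
    """Map each word to a list of per-file matching-line counts."""
    files = sorted(p for p in Path(message_dir).iterdir() if p.is_file())
    # Read every file once and split on "\n" only, as grep does.
    lines_per_file = [p.read_bytes().decode("latin-1").split("\n") for p in files]
    counts = {}
    for word in words:
        pattern = word_pattern(word)
        counts[word] = [matching_line_count(pattern, lines) for lines in lines_per_file]
    return counts
```

Like the egrep command, this pattern requires a non-word character on both sides, so a word that sits at the very start or very end of a line is not counted.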
If you are using rafisher/ljsavage/dpfoster then the paste command will only take 12 words. Instead you might want to convert to row order and then do a transpose when you get it inside of R. The command "echo $filename `cat $filename`" will string out a file onto one line and put the file name in the first position.
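The row-order layout and the later transpose can be sketched in a few lines of Python. The helper names are made up for illustration; inside R the transpose itself would be done with t().

```python
def to_row(name, values):
    """One output line: the name first, then its values, space-separated."""
    return " ".join([name] + [str(v) for v in values])


def transpose(rows):
    """Turn a list of equal-length rows into a list of columns."""
    return [list(column) for column in zip(*rows)]


if __name__ == "__main__":
    # Made-up counts for two words over three files.
    print(to_row("free", [0, 2, 1]))
    print(transpose([[0, 2, 1], [3, 0, 0]]))
```

Each call to to_row produces the same kind of line as the echo command: the name in the first position, followed by everything that was in the file.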
Now that we have the data in a table that we can read, we need to start fitting it.
The sandwich estimator estimates the variance of epsilon_i by (y_i - hat(y)_i)^2. But we have two different hat(y)'s. The first is the forecast under the null that both alpha and beta are zero. This hat(y) is simply zero. The second is hat(y) under the alternative. This works out to be hat(beta). We will use both hat(y)'s to compute our sandwich estimator.
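As a rough illustration of using both sets of squared residuals, here is a Python sketch. It rests on assumptions that are not spelled out above: a single regressor x, a slope fitted without an intercept, and the plain (HC0-style) sandwich formula sum(x_i^2 e_i^2) / (sum(x_i^2))^2. The function name is made up. With a 0/1 regressor the fitted value is hat(beta) whenever x_i = 1.

```python
def sandwich_variances(x, y):
    """Return (beta_hat, variance using null residuals, variance using alternative residuals)."""
    sxx = sum(xi * xi for xi in x)
    # Least-squares slope with no intercept (an assumption of this sketch).
    beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    # Under the null, hat(y) = 0, so the squared residual is y_i^2.
    sq_null = [yi ** 2 for yi in y]
    # Under the alternative, hat(y)_i = beta_hat * x_i.
    sq_alt = [(yi - beta_hat * xi) ** 2 for xi, yi in zip(x, y)]
    var_null = sum(xi * xi * e2 for xi, e2 in zip(x, sq_null)) / sxx ** 2
    var_alt = sum(xi * xi * e2 for xi, e2 in zip(x, sq_alt)) / sxx ** 2
    return beta_hat, var_null, var_alt


if __name__ == "__main__":
    # Made-up toy data: x marks whether a word appears, y is the response.
    print(sandwich_variances([1, 1, 0, 0], [1, 0, 0, 1]))
```

The two variances differ only in which residuals are squared, so comparing them shows how much the choice of hat(y) matters for a given word.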