Statistics 540 overview

Instructor: Dean Foster.
email: foster@diskworld.wharton.upenn.edu
class homepage: http://diskworld.wharton.upenn.edu/teaching/540

Objectives.

Learn a programming language. Perl.
Learn web based computing (CGI, periodic and recursive web agents).
Process text to extract information.
Classical linear algebra and optimization techniques. (Thisted)
Computation for Bayesian methods.
Overview of other computing tools. Mathematica, S-Plus.

Structure.

The course will be split into 5 modules. Each module will focus on constructing a program to perform a specific task. The task itself will be broken down into components, and a set of components will be addressed in each class.

Modules.

1. Web based minesweeper. Perl and CGI

i. Hello wide world. Perl, HTML, HTTP and CGI basics. Client/Server paradigm. Take a protocol (HTTP), an interface (CGI), a markup language (HTML) and mix in a little Perl and you can do anything!
ii. Data structures in Perl. Scalars, lists and associative arrays.
iii. Subroutines, randomization.
iv. Saving "state"; "cookies".
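
As a rough illustration of how items i to iii fit together, here is a minimal sketch of a CGI script that places mines at random in an associative array and prints the board as an HTML table. The subroutine names place_mines and board_html, the "row,col" key scheme and the 4 x 4 board size are assumptions made for this example, not the design used in class. The sketch also assumes the number of mines does not exceed the number of cells.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Place $mines mines at random on a $rows x $cols board.
# The board is an associative array (hash) keyed by "row,col".
# Assumes $mines <= $rows * $cols, otherwise the loop never ends.
sub place_mines {
    my ($rows, $cols, $mines) = @_;
    my %board;
    while (keys(%board) < $mines) {
        my $r = int(rand($rows));
        my $c = int(rand($cols));
        $board{"$r,$c"} = 1;    # drawing an occupied cell again changes nothing
    }
    return %board;
}

# Render the board as an HTML table with every mine revealed (debug view).
sub board_html {
    my ($rows, $cols, %board) = @_;
    my @lines = ('<table border="1">');
    for my $r (0 .. $rows - 1) {
        my $row = "";
        for my $c (0 .. $cols - 1) {
            $row .= $board{"$r,$c"} ? "<td>*</td>" : "<td>&nbsp;</td>";
        }
        push @lines, "<tr>$row</tr>";
    }
    push @lines, "</table>";
    return join("\n", @lines) . "\n";
}

my %board = place_mines(4, 4, 3);

# A CGI response is a header block, a blank line, then the document body.
print "Content-Type: text/html\r\n\r\n";
print "<html><body>\n", board_html(4, 4, %board), "</body></html>\n";
```

The web server runs the script and relays whatever it prints to the browser. Nothing here survives between requests, which is the problem that item iv (saving "state") addresses.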

2. Classification potpourri.

Set up a web based interface to offer a user the ability to select a classification method for data analysis. AKA, how to make your methodology available to the world.

We will learn the algorithms behind the classification techniques and construct a web based interface which allows a remote user to implement them.

i. Logistic regression.
ii. Neural networks.
iii. Boosting.

3. Web based agents for mining online patent databases.

We will design recursive web agents to mine technology patent databases. We will construct representations of patent "family trees". We will represent these trees efficiently and discuss models for tree features.

i. Regular expressions and pattern matching. What does

m@(\w+)://([^/:]+)(:\d*)?([^#]*)@

do for you?
ii. Agent construction. The LWP module. OOP in Perl.
iii. Tree representation and analysis.
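
One way to explore the pattern from item i is to run it on a URL and print the four captured groups. The sketch below wraps the pattern in a subroutine; the name split_url is hypothetical, and the test URL is the class homepage given above.

```perl
use strict;
use warnings;

# split_url() is a hypothetical helper around the pattern from the outline.
# It returns (scheme, host, port, path), or an empty list if nothing matches.
sub split_url {
    my ($url) = @_;
    if ($url =~ m@(\w+)://([^/:]+)(:\d*)?([^#]*)@) {
        return ($1, $2, $3, $4);
    }
    return;
}

my ($scheme, $host, $port, $path) =
    split_url("http://diskworld.wharton.upenn.edu/teaching/540");

# The port group is optional, so $port is undef when the URL has no ":port".
printf "%s | %s | %s | %s\n",
    $scheme, $host, defined $port ? $port : "(none)", $path;
# prints: http | diskworld.wharton.upenn.edu | (none) | /teaching/540
```

On a URL with an explicit port, such as a made-up http://host:8080/page#top, the third group would be ":8080" (colon included), and the fragment after the # is left out of the fourth group.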

4. Click by click - collecting, representing and analyzing web click data.

We will customize the Apache web server's log files to allow a click by click analysis of a user's progression and actions through a web site.

i. Massive data set issues.
ii. Graphs and their representations.
iii. Algorithms.
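
As one possible representation for item ii, the sketch below turns log lines into a directed graph of page-to-page transitions, stored as a hash of hashes with counts on the edges. It assumes lines in Apache's Common Log Format and identifies a visitor by host alone; the customized log format used in class may differ. The subroutine name click_graph and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Build a directed graph of page transitions: $graph{from}{to} = count.
# Assumes Common Log Format and treats each host as a single visitor.
sub click_graph {
    my (@lines) = @_;
    my (%last_page, %graph);
    for my $line (@lines) {
        next unless $line =~ m/^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|POST) (\S+)/;
        my ($host, $page) = ($1, $2);
        # Add an edge from this visitor's previous page, if there was one.
        $graph{ $last_page{$host} }{$page}++ if defined $last_page{$host};
        $last_page{$host} = $page;
    }
    return \%graph;
}

# Made-up sample lines, for illustration only.
my @sample = (
    '10.0.0.1 - - [03/Sep/2001:11:00:00 -0400] "GET /index.html HTTP/1.0" 200 1024',
    '10.0.0.2 - - [03/Sep/2001:11:00:05 -0400] "GET /index.html HTTP/1.0" 200 1024',
    '10.0.0.1 - - [03/Sep/2001:11:00:09 -0400] "GET /teaching/540 HTTP/1.0" 200 2048',
    '10.0.0.2 - - [03/Sep/2001:11:00:12 -0400] "GET /teaching/540 HTTP/1.0" 200 2048',
);

my $g = click_graph(@sample);
for my $from (sort keys %$g) {
    for my $to (sort keys %{ $g->{$from} }) {
        print "$from -> $to: $g->{$from}{$to}\n";
    }
}
```

A hash of hashes stores only the edges that actually occur, which matters for item i: a large site's full page-by-page transition matrix would be mostly zeros.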

5. The tower of Babel - Yahoo's stock bulletin boards.

Is there any useful information out there? We will design periodic and adaptive agents to retrieve all posts, then extract and statistically analyze their content. We will obtain real-time stock quotes and correlate boards with markets.

i. Language feature extraction.
ii. Document classification - Bayesian models & Markov Chain Simulation.

Grading

Grading will be based on homework. There will be about 10 to 15 assignments. No late assignments will be accepted.

Readings and resources:

Perl
- Perl is the programming language we will learn this semester.
- Elements of Programming With Perl by Andrew L. Johnson. This is said to be a good tutorial for beginners.
- Programming Perl, 3rd edition, by Larry Wall, Tom Christiansen and Jon Orwant. The definitive book on Perl. We will be using it as the text for the class.
- Wiki page on learning perl.
- A more structured tutorial
- Other books recommended by Wired: Perl: The Programmer's Companion, Perl Cookbook.
- Comprehensive Perl Archive Network (CPAN)
HTML
- HTML is the language web pages are written in. It is how you present material to the world.
- There are hundreds of books on HTML, many web pages, etc. So find something that works for you. (Using your browser's view-source feature is a good way of seeing how somebody did something.)
- See here for an introduction to HTML.
- Another online source is here.
- Here is a quick list of tags.
LaTeX (self taught)
- LaTeX is how almost all papers are written in statistics. All homework should be written up using LaTeX.
- My favorite book is "A Guide to LaTeX2e" by Helmut Kopka and Patrick Daly.
CVS or equivalent (self taught)
- CVS is a way of backing up your work as you go along. It also provides an audit trail of what you did as you do it. I won't actually teach how to use it. You will have to learn on your own.
- An FAQ on CVS
- Lots of pointers to other resources.
- Also try "info cvs" from inside emacs (C-h i m cvs)
- Book reference?
Emacs (self taught / optional)
- Emacs is a good editor from the 70s. If you like software that isn't older than you are, you may use some PC-based editor.
- Fire up emacs and run the emacs tutorial: C-h t.
- After you have worked through that learn how to use info: C-h i.

Last modified: Mon Sep 3 11:24:15 2001