Statistics 540 overview

Instructor: Dean Foster.
email: foster@diskworld.wharton.upenn.edu
class homepage: http://diskworld.wharton.upenn.edu/teaching/540

Objectives.

  1. Learn a programming language: Perl.
  2. Learn web-based computing (CGI, periodic and recursive web agents).
  3. Process text to extract information.
  4. Classical linear algebra and optimization techniques. (Thisted)
  5. Computation for Bayesian methods.
  6. Overview of other computing tools: Mathematica, S-Plus.

Structure.

The course will be split into five modules. Each module will focus on constructing a program to perform a specific task. The task itself will be broken down into components, and each class will address a set of those components.

Modules.

1. Web-based Minesweeper. Perl and CGI.

i. Hello wide world. Perl, HTML, HTTP and CGI basics. Client/server paradigm. Take a protocol (HTTP), an interface (CGI), a markup language (HTML), mix in a little Perl, and you can do anything!
ii. Data structures in Perl. Scalars, lists and associative arrays.
iii. Subroutines, randomization.
iv. Saving "state"; "cookies".
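As a rough illustration of items i through iv, here is a minimal sketch of a CGI script that says hello and keeps "state" in a cookie. It assumes the web server passes the request's Cookie header in the HTTP_COOKIE environment variable, as the CGI interface specifies. The cookie name "visits" and the helper names parse_cookies and build_response are made up for this example and are not part of the course materials.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse a Cookie header ("a=1; b=2") into an associative array (hash).
sub parse_cookies {
    my ($header) = @_;
    my %cookies;
    for my $pair (split /;\s*/, $header // '') {
        my ($name, $value) = split /=/, $pair, 2;
        $cookies{$name} = $value if defined $name && defined $value;
    }
    return %cookies;
}

# Build the full CGI response: headers, a blank line, then the HTML body.
sub build_response {
    my ($cookie_header) = @_;
    my %cookies = parse_cookies($cookie_header);

    # "visits" is a hypothetical cookie name; ignore anything non-numeric.
    my $visits = ($cookies{visits} // '') =~ /^(\d+)$/ ? $1 + 1 : 1;

    return "Content-Type: text/html\r\n"
         . "Set-Cookie: visits=$visits\r\n"
         . "\r\n"
         . "<html><body><p>Hello, wide world! Visit number $visits.</p></body></html>\n";
}

# The server sets HTTP_COOKIE when the browser sends cookies back.
print build_response($ENV{HTTP_COOKIE});
```

Because HTTP itself is stateless, the counter lives in the browser: each response sets the cookie, and the next request hands it back. A game such as Minesweeper could carry its board the same way, although the sketch above only counts visits.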

2. Classification potpourri.

Set up a web-based interface that lets a user select a classification method for data analysis. In other words: how to make your methodology available to the world.

We will learn the algorithms behind the classification techniques and construct a web-based interface which allows a remote user to run them.

i. Logistic regression.
ii. Neural networks.
iii. Boosting.

3. Web based agents for mining online patent databases.

We will design recursive web agents to mine technology patent databases. We will construct representations of patent "family trees". We will represent these trees efficiently and discuss models for tree features.

i. Regular expressions and pattern matching. What does

    m@(\w+)://([^/:]+)(:\d*)?([^#]*)@

do for you?
ii. Agent construction. The LWP module. OOP in Perl.
iii. Tree representation and analysis.
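One way to explore the question in item i is to run the pattern against a URL and look at what each group captures. The sketch below wraps the match in a hypothetical helper, split_url. The port number and the "#top" fragment in the sample URL are invented for illustration.

```perl
use strict;
use warnings;

# Split a URL into scheme, host, port, and path using the pattern from item i.
sub split_url {
    my ($url) = @_;
    if ($url =~ m@(\w+)://([^/:]+)(:\d*)?([^#]*)@) {
        my ($scheme, $host, $port, $path) = ($1, $2, $3, $4);
        $port =~ s/^:// if defined $port;   # group 3 captures the colon too
        return ($scheme, $host, $port, $path);
    }
    return;   # empty list when the string does not look like a URL
}

# Sample URL: the port and the fragment are made up for this example.
my ($scheme, $host, $port, $path) =
    split_url('http://diskworld.wharton.upenn.edu:8080/teaching/540#top');

print "scheme=$scheme host=$host port=$port path=$path\n";
# prints: scheme=http host=diskworld.wharton.upenn.edu port=8080 path=/teaching/540
```

The four groups are the scheme, the host, an optional port, and everything after the host up to (but not including) a "#" fragment. A web agent needs these pieces to decide which server to contact and what to request from it.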

4. Click by click - collecting, representing and analyzing web click data.

We will customize the Apache web server's log files to allow a click-by-click analysis of a user's progression and actions through a web site.

i. Massive data set issues.
ii. Graphs and their representations.
iii. Algorithms.

5. The tower of Babel - Yahoo's stock bulletin boards.

Is there any useful information out there? We will design periodic and adaptive agents to retrieve all posts, then extract and statistically analyze their content. We will obtain real-time stock quotes and correlate boards with markets.

i. Language feature extraction.
ii. Document classification - Bayesian models & Markov Chain Simulation.

Grading

Grading will be based on homework. There will be about 10 to 15 assignments. No late assignments will be accepted.

Readings and resources:


Last modified: Mon Sep 3 11:24:15 2001