Instructor: Dean Foster.
email: foster@diskworld.wharton.upenn.edu
class homepage: http://diskworld.wharton.upenn.edu/teaching/540
The course will be split into 5 modules. Each module will focus on constructing a program to perform a specific task. The task itself will be broken down into components, and a set of components will be addressed in each class.
i. Hello wide world. Perl, HTML, HTTP and CGI basics. Client/Server paradigm.
Take a protocol (HTTP), an interface (CGI), and a markup language (HTML),
mix in a little Perl, and you can do anything!
ii. Data structures in Perl. Scalars, lists and associative arrays.
iii. Subroutines, randomization.
iv. Saving "state"; "cookies".
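As a rough illustration of how items i and ii fit together, here is a minimal CGI sketch in Perl. It assumes the web server runs the script as a CGI program and passes the query string in the QUERY_STRING environment variable. The subroutine names parse_query and hello_page and the "name" parameter are hypothetical, chosen only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse a query string such as "name=Ada&lang=Perl" into an
# associative array (hash). parse_query is a hypothetical helper.
sub parse_query {
    my ($query) = @_;
    my %param;
    for my $pair (split /[&;]/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && length $key;
        $value = "" unless defined $value;
        for ($key, $value) {
            tr/+/ /;                              # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # decode %XX escapes
        }
        $param{$key} = $value;
    }
    return %param;
}

# Build the full CGI response: one header, a blank line, then HTML.
sub hello_page {
    my (%param) = @_;
    my $name = exists $param{name} ? $param{name} : "wide world";
    # Escape characters that are special in HTML.
    $name =~ s/&/&amp;/g;
    $name =~ s/</&lt;/g;
    $name =~ s/>/&gt;/g;
    return "Content-type: text/html\r\n\r\n"
         . "<html><body><h1>Hello, $name!</h1></body></html>\n";
}

print hello_page(parse_query($ENV{QUERY_STRING} // ""));
```

Requesting the script with no query string greets the "wide world"; appending ?name=Dean+Foster to the URL greets that name instead. The hash %param is the associative array from item ii, and the single header line followed by a blank line is the part of the response that CGI expects the program to supply.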
We will learn the algorithms behind the classification techniques and construct a web-based interface that lets a remote user apply them.
i. Logistic regression.
ii. Neural networks.
iii. Boosting.
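To give a flavor of item i, the sketch below fits a one-variable logistic regression by batch gradient ascent on the log-likelihood. It is one simple way to fit the model, not necessarily the method used in class. The data, the learning rate, the step count, and the names sigmoid and fit_logistic are all made up for illustration.

```perl
use strict;
use warnings;

# Logistic function: maps any real number into (0, 1).
sub sigmoid {
    my ($z) = @_;
    return 1 / (1 + exp(-$z));
}

# Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by batch gradient ascent
# on the average log-likelihood. fit_logistic is a hypothetical name.
sub fit_logistic {
    my ($xs, $ys, $rate, $steps) = @_;
    my $n = scalar @$xs;
    my ($b0, $b1) = (0, 0);
    for (1 .. $steps) {
        my ($g0, $g1) = (0, 0);
        for my $i (0 .. $n - 1) {
            # Residual: observed label minus predicted probability.
            my $err = $ys->[$i] - sigmoid($b0 + $b1 * $xs->[$i]);
            $g0 += $err;
            $g1 += $err * $xs->[$i];
        }
        $b0 += $rate * $g0 / $n;
        $b1 += $rate * $g1 / $n;
    }
    return ($b0, $b1);
}

# Made-up toy data: larger x tends to go with y = 1.
my @x = (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0);
my @y = (0,   0,   0,   1,   0,   1,   1,   1);

my ($b0, $b1) = fit_logistic(\@x, \@y, 0.1, 5000);
printf "P(y=1 | x=1) = %.3f\n", sigmoid($b0 + $b1 * 1);
printf "P(y=1 | x=4) = %.3f\n", sigmoid($b0 + $b1 * 4);
```

The fitted slope comes out positive, so the predicted probability rises with x. A web interface of the kind described above would wrap a routine like fit_logistic behind a form that accepts the user's data.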
We will design recursive web agents to mine technology patent databases. We will construct representations of patent "family trees". We will represent these trees efficiently and discuss models for tree features.
i. Regular expressions and pattern matching. What does
m@(\w+)://([^/:]+)(:\d*)?([^#]*)@ do for you?
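One way to explore the question above is to run the pattern against a URL and print what each pair of parentheses captures. The sketch below does that. The helper name split_url and the example.com URL are hypothetical; only the pattern itself comes from the syllabus.

```perl
use strict;
use warnings;

# Split a URL with the pattern from the syllabus.
# split_url is a hypothetical helper name.
sub split_url {
    my ($url) = @_;
    return unless $url =~ m@(\w+)://([^/:]+)(:\d*)?([^#]*)@;
    # $1 = scheme, $2 = host, $3 = optional ":port", $4 = path up to '#'
    my ($scheme, $host, $port, $path) = ($1, $2, $3, $4);
    $port = defined $port ? substr($port, 1) : "";  # strip the leading ':'
    return ($scheme, $host, $port, $path);
}

# Made-up URL for illustration.
my $url = 'http://www.example.com:8080/teaching/540/index.html#grading';
my ($scheme, $host, $port, $path) = split_url($url);
print "scheme=$scheme host=$host port=$port path=$path\n";
```

For this URL the script prints scheme=http host=www.example.com port=8080 path=/teaching/540/index.html. The fragment after # is left out because the last group stops at the first # character. Using @ as the delimiter instead of the usual / avoids having to escape the slashes in ://. The pattern is not anchored, so it will also find a URL embedded in surrounding text, which is what a web agent scanning pages needs.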
We will customize the Apache web server's log files to allow a click-by-click analysis of a user's progression and actions through a web site.
i. Massive data set issues.
ii. Graphs and their representations.
iii. Algorithms.
Is there any useful information out there? We will design periodic and adaptive agents to retrieve all posts, then extract and statistically analyze their content. We will obtain real-time stock quotes and correlate boards with markets.
i. Language feature extraction.
ii. Document classification - Bayesian models & Markov Chain Simulation.
Grading will be based on homework. There will be about 10 to 15 assignments. No late assignments will be accepted.
Last modified: Mon Sep 3 11:24:15 2001