CSE 517A: Machine Learning
Spring 2007: Homework 2
Due 19:15pm, February 27, 2007
Purpose
The goal of this homework is to give you some experience implementing
and experimenting with decision trees.
Preliminaries
Make sure you review your decision tree notes, to make sure you
understand the concepts of entropy and information gain, and how to
construct trees with the basic ID3 algorithm.
Download the mushroom data set from the UCI repository and have a look
at it. The first element in each of the lines is the class label
(poisonous or edible), and the remaining items are the attributes.
We're going to learn to classify mushrooms in this homework. There
are two versions of this data set, one with short names and one with
long names. They are the same. You should feel free to transform
this data set into any format you like, to make it easy to work with.
Assignment
-
Build a decision tree from Mitchell's tennis data set. Since we
talked about learning a tree to predict PlayTennis in class, for
this question, you should build a tree to predict the Outlook, based
on the other attributes. Show all of your working, calculations,
and decisions as you build the tree. You should not use
the Day attribute in your tree. You do not have to typeset your
answer to this question. Feel free to set out your answer by hand
(but, please take time to make it legible and clear). [20 points]
| Day |
Outlook |
Temperature |
Humidity |
Wind |
PlayTennis |
| D1 |
Sunny |
Hot |
High |
Weak |
No |
| D2 |
Sunny |
Hot |
High |
Strong |
No |
| D3 |
Overcast |
Hot |
High |
Weak |
Yes |
| D4 |
Rain |
Mild |
High |
Weak |
Yes |
| D5 |
Rain |
Cool |
Normal |
Weak |
Yes |
| D6 |
Rain |
Cool |
Normal |
Strong |
No |
| D7 |
Overcast |
Cool |
Normal |
Strong |
Yes |
| D8 |
Sunny |
Mild |
High |
Weak |
No |
| D9 |
Sunny |
Cool |
Normal |
Weak |
Yes |
| D10 |
Rain |
Mild |
Normal |
Weak |
Yes |
| D11 |
Sunny |
Mild |
Normal |
Strong |
Yes |
| D12 |
Overcast |
Mild |
High |
Strong |
Yes |
| D13 |
Overcast |
Hot |
Normal |
Weak |
Yes |
| D14 |
Rain |
Mild |
High |
Strong |
No |
-
Implement the ID3 algorithm. Your implementation should be as
general-purpose as possible, and not tied to any particular database
(since you'll be using it to do more classifications in the future),
although you might want to assume some particular input format..
You should verify that your code works by learning the decision tree
you built in the previous question. [30 points]
-
Use your decision tree code to learn to classify edible and
poisonous mushrooms using the data set you downloaded. Evaluate
your results by doing a 10-fold cross validation of your learning
algorithm (build the tree on 90% of the data, test it on the
remaining 10%, being sure to randomize the order of the data
points). [20 points]
-
Extra credit: Pick another large data set from the
repository, and repeat your experiments on it. Again, do a 10-fold
cross-validation to test the quality of your decision tree
algorithm.
-
Extra credit: Implement the ability to handle continuous
attributes in your decision tree algorithm. Download the iris data
set, and try to learn a classifier for the three types of iris in
the file. [10 points]
-
Extra credit: Make your implementation capable of handling
both discrete and continuous attributes in the same data set. This
is tricky, since you have to come up with some way of informing the
algorithm which attributes are continuous and which are discrete.
You might consider doing this in the data file (which you're free to
modify), or when you initialize your algorithm. We'll leave the
implementation details to you. Pick a data set from the repository
(with a mix of continuous and discrete attributes), and build a
decision tree classifier for it. [10 points]
What to Hand In
You should create a short document reporting the results of your work.
The report should include all the details of your experiments, your
analysis of the results, and any relevant figures and graphs. Print
out your report (and include any hand-written bits), and drop it off
in the CSE 517 submission bin on the 5th floor of Lopata. Also, email your code to us, so that
we have a reference copy.
For the first assignment, we didn't strictly enforce the deadline.
However, this time, we will. Any assignments that arrive past the
deadline will suffer late penalties. We'll use the email timestamp
onthe email you send, and the TA's wristwatch to measure time. For
the physical hand-in, we will consider the TA's wristwatch as the
ground truth. Since this may be different from your wristwatch, you
might want to consider getting things in a couple of minutes before
the deadline.