CSE 517A: Machine Learning

Spring 2007: Homework 2

Due 19:15pm, February 27, 2007

Purpose

The goal of this homework is to give you some experience implementing and experimenting with decision trees.

Preliminaries

Make sure you review your decision tree notes, to make sure you understand the concepts of entropy and information gain, and how to construct trees with the basic ID3 algorithm.

Download the mushroom data set from the UCI repository and have a look at it. The first element in each of the lines is the class label (poisonous or edible), and the remaining items are the attributes. We're going to learn to classify mushrooms in this homework. There are two versions of this data set, one with short names and one with long names. They are the same. You should feel free to transform this data set into any format you like, to make it easy to work with.

Assignment

  1. Build a decision tree from Mitchell's tennis data set. Since we talked about learning a tree to predict PlayTennis in class, for this question, you should build a tree to predict the Outlook, based on the other attributes. Show all of your working, calculations, and decisions as you build the tree. You should not use the Day attribute in your tree. You do not have to typeset your answer to this question. Feel free to set out your answer by hand (but, please take time to make it legible and clear). [20 points]
    Day Outlook Temperature Humidity Wind PlayTennis
    D1 Sunny Hot High Weak No
    D2 Sunny Hot High Strong No
    D3 Overcast Hot High Weak Yes
    D4 Rain Mild High Weak Yes
    D5 Rain Cool Normal Weak Yes
    D6 Rain Cool Normal Strong No
    D7 Overcast Cool Normal Strong Yes
    D8 Sunny Mild High Weak No
    D9 Sunny Cool Normal Weak Yes
    D10 Rain Mild Normal Weak Yes
    D11 Sunny Mild Normal Strong Yes
    D12 Overcast Mild High Strong Yes
    D13 Overcast Hot Normal Weak Yes
    D14 Rain Mild High Strong No
  2. Implement the ID3 algorithm. Your implementation should be as general-purpose as possible, and not tied to any particular database (since you'll be using it to do more classifications in the future), although you might want to assume some particular input format.. You should verify that your code works by learning the decision tree you built in the previous question. [30 points]
  3. Use your decision tree code to learn to classify edible and poisonous mushrooms using the data set you downloaded. Evaluate your results by doing a 10-fold cross validation of your learning algorithm (build the tree on 90% of the data, test it on the remaining 10%, being sure to randomize the order of the data points). [20 points]
  4. Extra credit: Pick another large data set from the repository, and repeat your experiments on it. Again, do a 10-fold cross-validation to test the quality of your decision tree algorithm.
  5. Extra credit: Implement the ability to handle continuous attributes in your decision tree algorithm. Download the iris data set, and try to learn a classifier for the three types of iris in the file. [10 points]
  6. Extra credit: Make your implementation capable of handling both discrete and continuous attributes in the same data set. This is tricky, since you have to come up with some way of informing the algorithm which attributes are continuous and which are discrete. You might consider doing this in the data file (which you're free to modify), or when you initialize your algorithm. We'll leave the implementation details to you. Pick a data set from the repository (with a mix of continuous and discrete attributes), and build a decision tree classifier for it. [10 points]

What to Hand In

You should create a short document reporting the results of your work. The report should include all the details of your experiments, your analysis of the results, and any relevant figures and graphs. Print out your report (and include any hand-written bits), and drop it off in the CSE 517 submission bin on the 5th floor of Lopata. Also, email your code to us, so that we have a reference copy.

For the first assignment, we didn't strictly enforce the deadline. However, this time, we will. Any assignments that arrive past the deadline will suffer late penalties. We'll use the email timestamp onthe email you send, and the TA's wristwatch to measure time. For the physical hand-in, we will consider the TA's wristwatch as the ground truth. Since this may be different from your wristwatch, you might want to consider getting things in a couple of minutes before the deadline.

Page written by Bill Smart.