CSE 517A: Machine Learning

Spring 2007: Homework 1

Due 11:59pm, February 12, 2007

Purpose

The main purpose of this first homework is to give you some experience implementing a simple machine learning algorithm, performing experiments with it, and reporting on the results. You will implement some of the instance-based algorithms that we covered in class on some standard data sets, and write a short report on the results of your experiments.

Preliminaries

Go to the UCI Machine Learning Repository and look around. This is a collection of standard data sets used in machine learning. Download the Auto-Mpg database, and edit it to remove the 6 instances that have missing data values. We will be using this data set for most of this homework. Take a look at the data set and make sure you know what the attributes are.

Assignment

  1. Since you're going to be reporting statistics of your algorithm's performance, you want to make sure that you are calculating these statistics accurately. Implement some code to calculate the mean and variance of a set of numbers. Convince us that you are calculating the correct numbers. [10 points]
  2. The first step in many machine learning studies is to get a feel for the data. For this assignment, we're going to try to predict the mpg attribute based on some of the others. Make a set of graphs that plot the other attributes (on the x-axis) against mpg (on the y-axis). Add these plots to your report, and discuss them. If you had to pick on attribute to use when predicting mpg, which one would you choose, based on your graphs? Why? [10 points]
  3. Implement k-nearest neighbor, using a single attribute as input, and mpg as output. Use the single attribute you identified above. Evaluate k-nearest neighbor for a range of k-values, from 1 up to the size of your training set. Graph the results, showing how the mean squared error changes with k. Discuss these results, and try to explain the shape of the graph. [20 points]
  4. Modify your implementation of k-nearest neighbor, so that it deals with more than one input attribute. Rerun your experiments using all of the appropriate attributes (other than mpg) as the input. For each attribute you omit, explain why you omitted it. Graph your results, and discuss them. [10 points]
  5. Implement locally weighted averaging, and rerun your experiments with this algorithm. Graph the mean squared error as a function of the kernel bandwidth, h, and discuss your results. How does the graph for locally weighted averaging relate to the graph for k-nearest neighbor. [20 points]
  6. Extra credit: Combine k-nearest neighbor and locally weighted averaging. Calculate the distance-weighted average using the k nearest points to the query point. Show your results in whatever graphs you think are most appropriate, and discuss them. [10 points]
  7. Extra credit: Pick another data set from the repository, and rerun both the k-nearest neighbor and LWA experiments. Justify your choice of data set, and add the results and discussion to your report. [10 points]
  8. Extra credit: Implement locally weighted regression, and rerun your experiments. Discuss how these results compare to LWA. [10 points]

What to Hand In

You should create a short document reporting the results of your work. The report should include all the details of your experiments, your analysis of the results, and any relevant figures and graphs. You should email your report before the deadline for this homework. Since email can take a little time to arrive, you should make sure that you send it early enough to arrive before the deadline. We will use the timestamp on the email as the canonical time. There will be no grace period for submissions that are timestamped after the deadline. The only exceptions to this rule will be in the case of a documented failure of the CEC email system.

Thoughts

  1. For this lab, you should implement the algorithms yourself, since they're straightforward. However, you should feel free to use anything you like to calculate the mean and variance, and to generate your graphs. Remember to cite any code that you use in your report.
  2. Yes, the first question is annoying. Sorry, but experience has taught us that even the simple things can go wrong if you don't check them. We'll leave it up to you to convince us that you've calculating the correct numbers, but a reasonable way to do this would be to generate a large set of numbers (say 1,000,000) from a known distribution (say, uniform from 0 to 1), and compare your calculated values to the (known) mean and variance of your chosen distribution. If they're close to each other, then we'll probably be convinced.
  3. The best attribute to use to predict mpg is mpg itself. You're not allowed to suggest this, obviously. Also, at least one of the attributes is tricky to plot, and you don't have to graph it. In general, when something like this comes up, try to follow the spirit of the question(try to get a feel for the data), rather than the letter (plot all of the other attributes). If in doubt, ask us.
  4. Since the data sets you're going to be using for this lab are quite small, you can get away with using a linear data structure to store the data points. You will get substantially faster performance if you implement something more intelligent, but there's no extra credit for it. Only the snug glow of a job well done.
  5. When you're graphing your results, you should graph both the mean performance and the 95% error error bounds on the mean, as we discussed in class. If you don't include the error bounds, the results are less meaningful, and we'll deduct some points to encourage you to do it in the future. Try to make these plots easy to read, and be sure to include labels on the axes. You can use whatever program you like to plot the results, but we recommend gnuplot. Excel, or some other spreadsheet program, is an alternative.
  6. Spell check your report. We will deduct 5 points for every spelling error we find. Yes, this is a lot of points. No, there will be no exceptions to this rule. Yes, we'll accept either US or British spelling. We will also deduct points for grammar that is bad enough to obscure your meaning.
  7. When you write your algorithms, test them on a small dataset, where you can verify the results by hand. Trust us, you'll be much happier debugging your code with a small, hand-written data set than with the real thing.
  8. When you evaluate your algorithms, you should use an n-fold cross-validation, with n of at least 10. You should write the code that handles this evaluations in as generic a way as possible, since you'll be reusing it later on in the class. You can either create a file with only the selected attributes in it, or pick the appropriate attributes in your code.
  9. Bandwidth is a continuous quantity, so you will need to choose a number of values to use when generating your graph. Remember that the bandwidth can go from 0 to infinity.
  10. When graphing the results from combining k-NN and LWA, you might want to consider a surface plot, rather than a graph. Take some time to make sure that it shows what you want it to show. For surface plots, you can omit the error bounds, since they're difficult to show clearly.
Page written by Bill Smart.