The main purpose of this first homework is to give you some experience
implementing a simple machine learning algorithm, performing
experiments with it, and reporting on the results. You will implement
some of the instance-based algorithms that we covered in class on some
standard data sets, and write a short report on the results of your
experiments.
Preliminaries
Go to the UCI Machine
Learning Repository and look around. This is a collection of
standard data sets used in machine learning. Download the Auto-Mpg
database, and edit it to remove the 6 instances that have missing data
values. We will be using this data set for most of this homework.
Take a look at the data set and make sure you know what the attributes
are.
Assignment
Since you're going to be reporting statistics of your algorithm's
performance, you want to make sure that you are calculating these
statistics accurately. Implement some code to calculate the mean
and variance of a set of numbers. Convince us that you are
calculating the correct numbers. [10 points]
The first step in many machine learning studies is to get a feel for
the data. For this assignment, we're going to try to predict the
mpg attribute based on some of the others. Make a set of graphs
that plot the other attributes (on the x-axis) against mpg (on the
y-axis). Add these plots to your report, and discuss them. If you
had to pick on attribute to use when predicting mpg, which one would
you choose, based on your graphs? Why? [10 points]
Implement k-nearest neighbor, using a single attribute as input, and
mpg as output. Use the single attribute you identified above.
Evaluate k-nearest neighbor for a range of k-values, from 1 up to
the size of your training set. Graph the results, showing how the
mean squared error changes with k. Discuss these results, and try
to explain the shape of the graph. [20 points]
Modify your implementation of k-nearest neighbor, so that it deals
with more than one input attribute. Rerun your experiments using
all of the appropriate attributes (other than mpg) as the input.
For each attribute you omit, explain why you omitted it. Graph your
results, and discuss them. [10 points]
Implement locally weighted averaging, and rerun your experiments
with this algorithm. Graph the mean squared error as a function of
the kernel bandwidth, h, and discuss your results. How does the
graph for locally weighted averaging relate to the graph for
k-nearest neighbor. [20 points]
Extra credit: Combine k-nearest neighbor and locally weighted
averaging. Calculate the distance-weighted average using the k
nearest points to the query point. Show your results in whatever
graphs you think are most appropriate, and discuss them. [10
points]
Extra credit: Pick another data set from the repository, and
rerun both the k-nearest neighbor and LWA experiments. Justify your
choice of data set, and add the results and discussion to your
report. [10 points]
Extra credit: Implement locally weighted regression, and
rerun your experiments. Discuss how these results compare to LWA.
[10 points]
What to Hand In
You should create a short document reporting the results of your work.
The report should include all the details of your experiments, your
analysis of the results, and any relevant figures and graphs. You
should email your report
before the deadline for this homework. Since email can take a little
time to arrive, you should make sure that you send it early enough to
arrive before the deadline. We will use the timestamp on the email as
the canonical time. There will be no grace period for submissions
that are timestamped after the deadline. The only exceptions to this
rule will be in the case of a documented failure of the CEC email
system.
Thoughts
For this lab, you should implement the algorithms yourself, since
they're straightforward. However, you should feel free to use
anything you like to calculate the mean and variance, and to
generate your graphs. Remember to cite any code that you use in
your report.
Yes, the first question is annoying. Sorry, but experience has
taught us that even the simple things can go wrong if you don't
check them. We'll leave it up to you to convince us that you've
calculating the correct numbers, but a reasonable way to do this
would be to generate a large set of numbers (say 1,000,000) from a
known distribution (say, uniform from 0 to 1), and compare your
calculated values to the (known) mean and variance of your chosen
distribution. If they're close to each other, then we'll probably
be convinced.
The best attribute to use to predict mpg is mpg itself. You're not
allowed to suggest this, obviously. Also, at least one of the
attributes is tricky to plot, and you don't have to graph it. In
general, when something like this comes up, try to follow the spirit
of the question(try to get a feel for the data), rather than the
letter (plot all of the other attributes). If in doubt, ask us.
Since the data sets you're going to be using for this lab are quite
small, you can get away with using a linear data structure to store
the data points. You will get substantially faster performance if
you implement something more intelligent, but there's no extra
credit for it. Only the snug glow of a job well done.
When you're graphing your results, you should graph both the mean
performance and the 95% error error bounds on the mean, as we
discussed in class. If you don't include the error bounds, the
results are less meaningful, and we'll deduct some points to
encourage you to do it in the future. Try to make these plots easy
to read, and be sure to include labels on the axes. You can use
whatever program you like to plot the results, but we recommend gnuplot. Excel, or some other
spreadsheet program, is an alternative.
Spell check your report. We will deduct 5 points for every spelling
error we find. Yes, this is a lot of points. No, there will be no
exceptions to this rule. Yes, we'll accept either US or British
spelling. We will also deduct points for grammar that is bad enough
to obscure your meaning.
When you write your algorithms, test them on a small dataset, where
you can verify the results by hand. Trust us, you'll be much
happier debugging your code with a small, hand-written data set than
with the real thing.
When you evaluate your algorithms, you should use an n-fold
cross-validation, with n of at least 10. You should write the code
that handles this evaluations in as generic a way as possible, since
you'll be reusing it later on in the class. You can either create a
file with only the selected attributes in it, or pick the
appropriate attributes in your code.
Bandwidth is a continuous quantity, so you will need to choose a
number of values to use when generating your graph. Remember that
the bandwidth can go from 0 to infinity.
When graphing the results from combining k-NN and LWA, you might
want to consider a surface plot, rather than a graph. Take some
time to make sure that it shows what you want it to show. For
surface plots, you can omit the error bounds, since they're
difficult to show clearly.