CSE 517A: Machine Learning

Spring 2007: Homework 3

Due 16:59, March 26, 2007

Purpose

Make sure you number your answer to each question.

Preliminaries

Make sure you review locally-weighted averaging, since we'll be referring to it in this homework. Make sure that you look over decision trees, too.

Assignment

This homework is slightly different than the previous ones. For this one, you should attempt at least 70 points worth of questions. Anything else you attempt over and above 70 points will be considered to be extra credit.
  1. What is the maximum likelihood hypothesis (classification) for the mushroom data? This is the single ML hypothesis, considering all of the data. Why is just using the ML hypothesis a bad idea, especially for this data set? What is the MAP hypothesis? What effect does the prior probability of each hypothesis have on the MAP hypothesis? Draw a graph of this effect. [5 points]
  2. Implement Naive Bayes algorithm, and test it on both the tennis example (from the previous homework (using the play tennis attribute as the target) and on the mushroom data set. Compare your results to the results from using a decision tree, and comment on any differences you observe. In particular, you should examine the role of your prior probabilities in the classification of new examples. For this problem, you can remove any data points with unknown attributes from the data set. You should draw graphs of your classification accuracy as a function of your priors. For example, for the mushroom data, plot the classification accuracy of the naive bayes algorithm for P(poisonous) from 0 to 1. As usual, you should do repeated cross-validation trials to get error bounds on your performance. [20 points]
  3. The Naive Bayes algorithm, as presented in class, deals only with discrete-valued attributes. Why do you think this is? (Hint: think about how you calculate the probabilities.) What do you think is the fundamental difficulty in adapting it to continuous values? What assumptions could you make to make the problem of continuous values less problematic? (Hint: we've made this assumption, in one particular form, about a number of random quantities in other algorithms.) [5 points]
  4. Implement the simple k-means algorithm, estimating the mean of the clusters only. Your implementation should handle multi-dimensional real inputs. Test your implementation on well-separated, two-dimensional gaussian data (see below). Show that your algorithm converges to the correct classification, given the correct number of classes as input. Remember that, since you're generating the data, you have access to the ground truth classification (which gaussian generated the data point). You should show how the means vary with the iterations of your algorithm and (hopefully) converge on the correct numbers. Vary the number of clusters and their separation, and discuss how this affects the convergence and classification found by the algorithm. [20 points]
  5. Implement simulated annealing for your EM algorithm. Use the function covered in class (the Boltzmann distribution), and experiment with different temperature schedules. You should record the best parameter setting (in terms of sum of log probabilities) over all iterations, and return this at the end of the algorithm. Note that this might be a different parameter setting than the final one, since simulated annealing can cause things to get worse. Plot the sum of the log probabilities with iterations, and comment on the graph. Compare the results from simulated annealing with those from randomly restarting the algorithm, and also with your basic EM implementation. To be fair, you should allow both algorithms to same total number of iterations. Make sure you test your algorithm on challenging data (either generated by you, or from the UCI repository). [20 points]
  6. Implement the full EM algorithm for one-dimensional inputs. Your algorithm should estimate the means and variances of the clusters, just like the one we demonstrated in class. Verify that your algorithm works on the well-separated data of the previous question. Generate some new data sets, where the gaussians have different variances, and show that your algorithm still works. You can do this by showing that the means and variances converge to the correct values, and by plotting the sum of log probabilities over the data points (showing that it increases, and then levels out). [20 points]
  7. Locally-weighted averaging weights the contributions of the training points by a function of their distance. Interpret this algorithm from a Bayesian point of view. (Hint: think about a probabilistic match between the query point and all of the training points.) What do the kernel functions and bandwidths correspond to in this view? [5 points]

Generating Data

To generate normally-distributed numbers, you can use the following magic.
double generate_normal_rv(const double mean, const double std_dev) {
    const double theta = generate_uniform_rv(0.0, 2.0 * M_PI);
    const double r = sqrt(-2.0 * log(generate_uniform_rv(0.0, 1.0)));
    return (std_dev * r * cos(theta)) + mean;
}
Select a mean for each gaussian, and a common variance. For multi-dimensional gaussians, the mean will be a vector. To generate a point, pick one of these gaussians at random, and then generate each dimension of the point by calling a function like the one above (with the mean value for that dimension). Make sure you plot your data to make sure it looks right before you use it with your learning algorithm.

What to Hand In

Turn in a paper copy of your homework, to the hand-in bin. Also turn in an electronic copy of your code. Please, please, please save the trees. Don't include reams of output from a program, since we will not read it. Only include data that are useful and readable. If you really feel driven to let us see the verbose output of your code, send it in electronic format. That way, at least, we can search in it electronically.
Page written by Bill Smart.