Evaluating the Robustness of Competing Clustering Algorithms
When presented with a dataset, it is beneficial to identify any relationships or trends. One way in which
we can accomplish this is through the application of cluster analysis, a method for developing taxonomies
within a set of observations. While this technique is beneficial in marketing, research, or any profession
requiring data analysis, there are many algorithms for defining clusters in a dataset. As a result, we raise
the question: which clustering algorithm performs best in various scenarios? In this project we examine three
such clustering techniques (k-means, hierarchical clustering, and MCLUST) and analyze how well they cluster
self-generated datasets for which the number of clusters is known.
We begin by investigating how each of these clustering techniques works. With that understanding in place,
we assess the performance of each algorithm when the correct number of clusters is used as well as
with various incorrect numbers of clusters. Furthermore, we contrast how
each performs when the number of dimensions and the degree of the cluster separations are varied. For
comparison between the different clustering algorithms in these areas, we implement the adjusted Rand
statistic. After these analyses, we apply the best-performing algorithm to group 2008-2009 Eastern College Athletic
Conference (ECAC) men's and women's hockey player data, including information on games played, goals,
assists, penalties, and plus/minus (+/-). The intent is to identify interesting groups of players.
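
As an illustration of the comparison measure mentioned above, the following is a minimal sketch of the adjusted Rand statistic in plain Python. It is not the implementation used in this project; the function name adjusted_rand_index and the toy label vectors are hypothetical and chosen only for this example. The statistic compares a known partition with a computed one by counting pairs of observations that are grouped together in both, then corrects for the agreement expected by chance, so that identical partitions score 1 and chance-level agreement scores about 0.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand statistic between two partitions of the same observations.

    Hypothetical helper for illustration; labels may be any hashable values.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two observations: the partitions agree trivially.
        return 1.0

    # Contingency counts: how many observations fall in each (true, predicted) pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of pairs placed together in both, in the true, and in the predicted partition.
    together_both = sum(comb(c, 2) for c in cells.values())
    together_true = sum(comb(c, 2) for c in rows.values())
    together_pred = sum(comb(c, 2) for c in cols.values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy example: six observations with two known clusters.
truth = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]   # same grouping, different cluster names
one_cluster = [0, 0, 0, 0, 0, 0]  # too few clusters

score_relabelled = adjusted_rand_index(truth, relabelled)
score_one_cluster = adjusted_rand_index(truth, one_cluster)
```

Because the statistic depends only on which observations are grouped together, it is unaffected by how the clusters are named, and it remains defined when the two partitions contain different numbers of clusters. This is what makes it usable for the comparisons at incorrect numbers of clusters described above.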