Evaluating the Robustness of Competing Clustering Algorithms

When presented with a dataset, it is often useful to identify relationships or trends within it. One way to accomplish this is cluster analysis, a method for developing taxonomies within a set of observations. While this technique is useful in marketing, research, and any profession requiring data analysis, there are many algorithms for defining clusters in a dataset. This raises the question: which clustering algorithm performs best in which scenarios? In this project we examine three clustering techniques (k-means, hierarchical clustering, and MCLUST) and analyze their effectiveness on self-generated datasets in which the number of clusters is known. We begin by investigating what each technique is and how it works. We then assess the performance of each algorithm when the correct number of clusters is specified, as well as when various incorrect numbers of clusters are specified. Furthermore, we contrast how each performs as the number of dimensions and the degree of cluster separation are varied. To compare the algorithms in these settings, we use the adjusted Rand statistic. Following these analyses, we apply the best-performing algorithm to group 2008-2009 Eastern College Athletic Conference (ECAC) men's and women's hockey player data, including games played, goals, assists, penalties, and plus/minus (+/-). The intent is to identify interesting groups of players.
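To make the comparison measure concrete, the following is a minimal sketch of the adjusted Rand statistic, assuming the commonly used Hubert-Arabie form computed from a contingency table of pair counts. The function name `adjusted_rand_index` and the list-of-labels input are illustrative assumptions, not the project's own code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings (hypothetical helper).

    Assumes the Hubert-Arabie form: (index - expected) / (max - expected).
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two points: the labelings agree trivially.
        return 1.0

    # Contingency counts: points sharing each (true, predicted) label pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Point pairs placed together within each cell, true cluster, and
    # predicted cluster.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings use a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

A value of 1 indicates that the two partitions agree exactly (up to relabeling of clusters), values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance. Comparing each algorithm's output against the known generating labels with such a function gives a single score per run.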