To request a post on a specific topic or if you have any questions email James@StatisticsSolutions.com

Friday, June 26, 2009

Cluster Analysis

The term cluster analysis was first used by Tryon in 1939. Cluster analysis is a multivariate technique used for segmentation in research. In cluster analysis, a cluster is a group of similar objects. In marketing research, cluster analysis is used to segment consumers into groups that are similar in buying habits, demographic characteristics, or psychographics.

Statistics Solutions can assist with cluster analysis and additional statistical analysis. Contact Statistics Solutions today for a free 30-minute consultation.

Cluster analysis is also known as a data reduction technique. In other words, cluster analysis is an exploratory data analysis technique that groups similar objects in such a way that the distance between objects in the same group is minimal or, when a similarity measure is used, the association between objects in the same group is maximized. Cluster analysis seeks to minimize the within-group variance and maximize the between-group variance. For example, it can be used if an FMCG (fast-moving consumer goods) company wants to profile its target audience in terms of lifestyle, attitude and perception. In this case, the marketing manager prepares a questionnaire with 20 statements and then performs a cluster analysis on the responses to the 20 statements, grouping the respondents into three clusters. Researchers have developed similar techniques under different names, such as numerical taxonomy, Q-analysis, typology analysis, and classification analysis. A number of techniques are available for performing cluster analysis; they differ in the measure used to compute distance and in the clustering algorithm.
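The idea of minimizing within-group variance while maximizing between-group variance can be illustrated with a small sketch. The data points and the function names (`mean_point`, `within_between_ss`) are hypothetical and chosen only for illustration. The sketch compares two ways of splitting the same four points into two clusters, using sums of squares as the measure of variation.

```python
def mean_point(points):
    """Return the centroid (coordinate-wise mean) of a list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_between_ss(clusters):
    """Return (within-group SS, between-group SS) for a list of clusters."""
    all_points = [p for cluster in clusters for p in cluster]
    grand_mean = mean_point(all_points)
    within = 0.0
    between = 0.0
    for cluster in clusters:
        centroid = mean_point(cluster)
        # Spread of the members around their own cluster centroid.
        within += sum(squared_distance(p, centroid) for p in cluster)
        # Spread of the cluster centroid around the overall mean.
        between += len(cluster) * squared_distance(centroid, grand_mean)
    return within, between


# Hypothetical data: the same four points, grouped in two different ways.
good_split = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (8.0, 9.0)]]
poor_split = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (8.0, 9.0)]]

good_within, good_between = within_between_ss(good_split)
poor_within, poor_between = within_between_ss(poor_split)
print("good split:", good_within, good_between)
print("poor split:", poor_within, poor_between)
```

The total sum of squares is the same for both splits, so a grouping that lowers the within-group part necessarily raises the between-group part.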

Assumptions in cluster analysis:

1. The sample taken for cluster analysis should be representative of the population.
2. In cluster analysis, it is assumed that multicollinearity among the variables is minimal.
3. Outliers affect the results of cluster analysis. Thus, cluster analysis assumes that outliers are absent.
4. In cluster analysis, data may be metric, non-metric, or a combination of both.
5. Naturally occurring groups must be present in the data.

The process of cluster analysis:

1. A cluster analysis starts with an N×K data matrix, where N is the number of cases and K is the number of variables.
2. In the second step, a distance or similarity measure is used to create an N×N matrix. Each entry of this matrix indicates how similar or dissimilar one case is to another, based on the K variables.
3. In the third step, a clustering algorithm sorts the subjects into groups. Within each group, subjects are as homogeneous as possible, and the groups are as different from each other as possible.
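The first two steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the data values and the function names (`squared_euclidean`, `distance_matrix`) are assumptions made for the example, and squared Euclidean distance is just one possible choice of measure.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distance_matrix(data, distance=squared_euclidean):
    """Turn an N x K data matrix into an N x N matrix of pairwise distances."""
    n = len(data)
    return [[distance(data[i], data[j]) for j in range(n)] for i in range(n)]


# Hypothetical N x K data: N = 4 cases (rows), K = 3 variables (columns).
data = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 5.0],
    [6.0, 7.0, 8.0],
    [6.0, 9.0, 8.0],
]

matrix = distance_matrix(data)
for row in matrix:
    print(row)
```

The resulting matrix is symmetric with zeros on the diagonal, so only the N(N-1)/2 entries above the diagonal carry distinct information.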

In cluster analysis, the N×N matrix is created by using one of the following measures: squared Euclidean distance, Pearson correlation coefficient, cosine of vectors of variables, Minkowski metric, Mahalanobis D², city block (Manhattan) distance, Jaccard’s coefficient, Chebychev distance metric, Gower’s coefficient, etc. Most of these measures are available in SPSS.
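A few of the measures listed above can be written out in a short sketch using their standard textbook definitions. The function names are hypothetical, and the treatment of two all-zero vectors in the Jaccard coefficient (returning 0.0) is an assumption made here, since the coefficient is undefined in that case.

```python
import math


def minkowski(a, b, p):
    """Minkowski metric of order p (p=1 is city block, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def city_block(a, b):
    """City block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Chebychev distance: largest absolute difference on any variable."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of variables."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Jaccard's coefficient for binary (0/1) variables.

    Shared presences divided by the variables present in at least one case.
    Returning 0.0 when nothing is present in either case is an assumption.
    """
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


p1, p2 = (0.0, 0.0), (3.0, 4.0)
print(city_block(p1, p2), minkowski(p1, p2, 2), chebyshev(p1, p2))
print(jaccard_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))
```

Correlation and cosine are similarity measures (larger means more alike), while the others are distances (larger means less alike), so they should not be mixed in the same N×N matrix without conversion.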

Clustering algorithms: In cluster analysis, clustering algorithms are of two types:

1. Hierarchical methods: Hierarchical algorithms include single linkage (or nearest neighbor), complete linkage (or furthest neighbor), average linkage, centroid methods, etc.
2. Non-hierarchical methods: Non-hierarchical algorithms include sequential threshold methods, parallel threshold methods, optimization methods, etc.
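To make the hierarchical approach concrete, the following is a minimal sketch of agglomerative clustering. It is written for readability rather than speed, and the function name `agglomerate`, its parameters, and the data points are hypothetical. Passing `min` as the linkage gives single linkage (nearest neighbor) and passing `max` gives complete linkage (furthest neighbor).

```python
def euclidean(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerate(points, n_clusters, linkage=min):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    linkage=min -> single linkage (nearest neighbor)
    linkage=max -> complete linkage (furthest neighbor)
    Returns a list of clusters, each a list of indices into points.
    """
    # Every case starts in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a and drop b.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (12.0, 0.0)]
three_clusters = agglomerate(points, 3, linkage=min)
two_clusters = agglomerate(points, 2, linkage=min)
print(three_clusters)
print(two_clusters)
```

A non-hierarchical method differs in that the number of clusters is fixed in advance and cases are assigned (and possibly reassigned) to cluster centers, rather than built up through a sequence of merges.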

Determining the number of clusters in the data: In cluster analysis, there is no single definitive procedure for determining the number of clusters in the data. The following aids are commonly used:

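1. Clustering coefficient: In cluster analysis, the size of the coefficient shows the homogeneity of the objects being merged. A small coefficient means that fairly homogeneous clusters are being joined, while a large jump in the coefficient suggests that dissimilar clusters are being merged.
2. Dendrogram: A dendrogram is a pictorial representation of the clusters in the data. In a dendrogram, we can see how the observations are combined into clusters at each step.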
3. Vertical icicle: A vertical icicle plot is another pictorial way to find the number of clusters in the data. In a vertical icicle plot, X’s mark the cases that are joined into a cluster, and blanks separate the clusters.
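The use of the clustering coefficient can be sketched as a simple rule of thumb: look for the largest jump between successive merge coefficients and stop just before it. This is only a heuristic aid, not a definitive procedure. The function name `suggest_cluster_count` and the coefficient values are hypothetical.

```python
def suggest_cluster_count(coefficients):
    """Suggest a number of clusters from a list of merge coefficients.

    coefficients[i] is the coefficient recorded at merge step i + 1, so
    N cases produce N - 1 coefficients. The heuristic stops just before
    the merge that causes the largest increase in the coefficient.
    """
    n_cases = len(coefficients) + 1
    jumps = [
        coefficients[i + 1] - coefficients[i]
        for i in range(len(coefficients) - 1)
    ]
    # Index of the last merge made before the largest jump (0-based).
    last_good_step = max(range(len(jumps)), key=lambda i: jumps[i])
    merges_done = last_good_step + 1
    # Each merge reduces the number of clusters by one.
    return n_cases - merges_done


# Hypothetical coefficients from clustering 6 cases (5 merge steps).
coefficients = [1.0, 1.2, 1.5, 6.0, 9.0]
suggested = suggest_cluster_count(coefficients)
print(suggested)
```

In this example the coefficient rises sharply at the fourth merge, so the sketch stops after three merges, which leaves three clusters. In practice the result should be checked against the dendrogram and against what is interpretable for the research question.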

Cluster analysis and SPSS: Cluster analysis can be conducted using SPSS. To conduct cluster analysis in SPSS, click on the “Analyze” menu, select “Classify,” and then select the required cluster analysis procedure.