Cluster Analysis in Stata
The first thing to note about cluster analysis is that it is more useful for generating hypotheses than for confirming them. Unlike the vast majority of statistical procedures, cluster analyses do not even provide p-values. In fact, while there is some reluctance to say exactly what cluster analysis does, the general idea is to take observations and break them into groups.
Although there is a practically unlimited number of ways to do this, there are three main families of methods, and Stata has built-in commands for two of them. The three families are partitioning, agglomerative, and divisive methods. Partitioning methods assign each observation to the group whose center (often the mean or median) is nearest. Agglomerative methods begin with each observation in its own group, then join the two closest observations into one group of two (all other groups remain single), then join the next closest pair so that there may be two groups of two (plus all the remaining single groups), and continue the process until the desired number of clusters is reached. Divisive methods work roughly in reverse, starting with one group containing all observations and splitting it until each group contains a single observation. Divisive methods are very uncommon in the literature because they are so time consuming, and as a result Stata has no command for performing them.
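As a rough sketch of the two families Stata supports, the commands below run a k-means partition and a Ward's linkage agglomerative analysis. Everything specific here is an assumption made for illustration: the bundled auto dataset, the three variables, the choice of three groups, the random seed, and the names km3, ward, and ward3.

```stata
* Illustrative sketch only: dataset, variables, k = 3, seed, and names are assumptions
sysuse auto, clear

* Standardize so that no single variable dominates the distance calculation
egen zmpg = std(mpg)
egen zweight = std(weight)
egen zlength = std(length)

* Partitioning: k-means with 3 groups; the grouping variable is named km3
cluster kmeans zmpg zweight zlength, k(3) name(km3) start(krandom(12345))
tabulate km3

* Agglomerative: Ward's linkage builds the full hierarchy of merges
cluster wardslinkage zmpg zweight zlength, name(ward)

* Cut the hierarchy at 3 groups and compare the two solutions
cluster generate ward3 = groups(3), name(ward)
tabulate km3 ward3
```

The cross-tabulation at the end is just one simple way to see how far the two approaches agree; the two methods number their groups independently, so matching groups need not share the same number.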
Once you have run a cluster analysis, you can attach notes to it using cluster note [cluster name] : [note content]
and list the notes attached to all of your cluster analyses by typing cluster notes
without any arguments. You can also generate new grouping variables based on your clusters by running the cluster generate [new variable name]
command after a cluster command. For more on this ability see help cluster generate
or the [MV] cluster generate entry in Stata's Multivariate Statistics manual.
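A minimal sketch of these housekeeping commands follows. It assumes the bundled auto dataset and a complete linkage analysis; the names cl and grp4, the note text, and the choice of four groups are hypothetical.

```stata
* Illustrative sketch only: dataset, variables, and names are assumptions
sysuse auto, clear
cluster completelinkage mpg weight length, name(cl)

* Attach a note to the analysis named cl, then list all notes
cluster notes cl : complete linkage on mpg weight length
cluster notes

* Create a grouping variable that cuts the hierarchy into 4 groups
cluster generate grp4 = groups(4), name(cl)
tabulate grp4
```

Naming the analysis with name() is what lets the later cluster notes and cluster generate commands refer back to it.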
- Partition Analysis Commands
- Agglomerative Hierarchical Analysis Commands
- Clustering Variables Instead of Observations
- Pre- & Post-Cluster Visualizations