[EM] cluster analysis

Forest Simmons fsimmons at pcc.edu
Fri Feb 27 17:17:01 PST 2004


Thanks for giving it a try.  I'm not familiar with GenStat, but most
hierarchal cluster analysis programs require (1) for input a matrix whose
rows represent the measured attributes of the objects to be "clustered"
and (2) a specification of how you want to measure distance between row
vectors.

(3) Also you are usually given a choice of how to measure disparity of
clusters ... whether by nearest neighbors, furthest points, median points,
etc. Ward's criterion based on minimization of additional variance is
another possible criterion for choice of previous clusters to agglomerate.

Remember that cluster analysis has a goal of giving an hierarchy of
clusters of clusters of clusters, similar to the Cantor fractal
distribution of matter in the universe as Peebles first pointed out
[modern terminology and additional insights from Mandelbroit].

Whether or not you use the nearest neighbor choice for generating the
cluster hierarchy, you need to connect the clusters by nearest neighbors
to generate the tree structure for the ballots (or factions, depending on
which of my suggestions you wanted to try).

For (1) you have to convert the ballots to vector form, so that column k
of the input matrix corresponds to rankings (or average rankings in the
other approach) of candidate k.

For (2) you could use the hamming metric, or the Euclidean metric. The max
metric wouldn't be sensitive enough in this application.  The best metrics
for this application are not given as options in any general purpose stat
software, so these two will have to do for now.

For (3) there are apt to be many ties in the ballot version (if not in the
faction version) no matter the choice of agglomeration methods.

Ties in choices of connecting edges between clusters should be broken by
going with the two vertices with the highest sum of weights.

It may be a while before I have a chance to try this out on the data you
suggested, so I apologize on the one hand, while thanking you for your
energetic initiative on the other hand.

Stepping back for a moment, the important thing is to find a natural tree
structure on the ballots (or factions in the other case) if possible.

[Note that in this context a tree without branches is considered a tree.
The important thing is a connected graph without cycles.]

A minimal spanning tree might be good enough, or at least a start. Cluster
analysis is just one of many possible methods for generating trees in this
context.

If the ballots reflect the tree structure consistently, then there will be
a Condorcet Winner among the ballots (or factions, as the case may be) if
not among the candidates themselves, as shown by the argument in my
original posting on this subject below.

To be more precise about "consistently" reflecting the tree structure:

If removal of vertex A puts vertices B and C in distinct components of the
disconnected graph, and some ballot ranks A below both B and C, then that
ballot is inconsistent with the tree structure.

In other words, the ballots must be consistent with issue space distance
being measured along the edges of the graph before we can expect the
winning faction to give us an actual CW for the original ballot set.

It seems reasonable that when the voters are highly clustered in issue
space, this condition might hold for some tree representing the
clustering.

If you understand the constructive existence proof below, you can see that
this condition can be relaxed, and that cycles can be permitted as long as
they do not keep the "center" from disconnecting the graph into components
each of which has weight no more than half of the total.

All in all this is encouraging for the existence of CW's in actual
political races.

I hope that some of this is interesting to some of our readers.

Forest

On Thu, 26 Feb 2004, James Gilmour wrote:

> Forest wrote:
> > I can think of two general approaches to applying cluster analysis to
> > election methods:   <details cut>
>
> I should be very interested to see how you would apply this to the STV-PR ballots from the 2002 Dáil
> Éireann election.  Full ballot "papers" are available for the three constituencies that used
> electronic voting:
> http://www.meath.ie/election.html
> http://www.dublincountyreturningofficer.com/Results_Download.html
>
> I started to look at this, using GenStat for Windows, but the "missing values" (truncated ballots)
> cause rejection of large numbers of ballot papers or total rejection, depending on the analysis
> attempted.  (NB I have used GenStat for statistical analysis for many years, but this is for me,
> both a new area of statistical analysis and a new area of GenStat use.)
> James
>
> ----
> Election-methods mailing list - see http://electorama.com/em for list info
>




More information about the Election-Methods mailing list