[EM] Dectecting Clone Sets

Mon Nov 16 23:41:55 PST 2009

fsimmons at pcc.edu wrote:
> Here's a suggestion for detecting clone sets based on Range Ballots:
> 
> Define the distance between two candidates as the square root of the
> sum (over the ballots) of the squared diffference of their respective
> ratings.
> 
> If the ballots are approval style, this becomes the square root of
> the number of ballots on which just one of the two candidates is
> approved.
> 
> Use these distances to do a cluster analysis of the candidates.
> 
> The tree structure of the clusters gives the tree structure of the
> clone sets.
> 
> If you have a clone dependent method, like Copeland, that you would
> like to "de-clone," you might find this clone set detectiion method
> to be useful.

I wonder if that could be used to fix the problem of my Kemeny-based 
forced clustering method, which is that it degrades when the "clusters" 
(supporters of various candidates) overlap. It still beats STV (and even 
QPQ) according to my simulations, but improvement is improvement, no?

The problem is this: If you have 2 winners, and everybody votes bloc 
style for one of two "virtual parties", then the clustering method 
minimizes sum distance by allocating every voter to the right bloc.

However, if we add another seat and candidate, call the candidate A, and 
A is unanimously voted top, then we'd ideally want the method to pick A, 
and then pick the other seats as if A had not been in the running*. 
Plain clustering doesn't do that: instead, one of the seats will have A 
elected (as he deserves), but the minimization won't pick randomly from 
later preferences, so the set of ballots not allocated to that seat 
won't be representative of the ballots, had not A run.

One way to fix this is to use elimination, but that's hardly monotonic. 
I've been considering adding a hierarchy to the clustering, which is 
similar to what you're saying above, but I haven't quite found out how 
to do that yet.

Since my multiwinner method uses rankings, I use Kemeny distance to 
determine the distance between ballots and a "representative" ballot for 
a cluster. Then I simply find the orderings that minimize the total 
distance between ballots allocated to the clusters, and to the relevant 
cluster's representative ballot, through brute force (which is why the 
temporal complexity is so horrible).

In any case, even if your tree clustering idea can't be used for my 
method, perhaps the idea of clustering by distance to remove clones 
could be adapted to ranked methods by using the Kemeny distance between 
ballot orders.

-

* This is true only if the distribution of additional preferences for 
those who vote for A is equal to that of the voters in general, A 
notwithstanding. There might be additional constraints as well, which is 
why I'm using "unanimous A top" as an example - since removing A doesn't 
alter the other candidate distributions in any way.