[EM] Simple voting methods

Tue Oct 26 14:10:04 PDT 2021

This post summarises an empirical evaluation of some simple voting 
methods, seeking to fill gaps in the evaluations I've seen. Rather than 
present results here (where the formatting might get mangled) I've put 
them on the web at http://www.masterlyinactivity.com/condorcet.html. 
The results are offered in good faith, but I can't guarantee the
absence of bugs.

The generative process is a spatial model. Voters and candidates are 
drawn from the same mixture of 3 two-dimensional gaussian components. 
This allows quite a high degree of asymmetry, yet Condorcet cycles arise 
in only 0.7% of elections. I suspect that this is an upper bound on 
their frequency under sincere voting in real life. There are a million 
elections each with 10001 voters and 9 candidates.
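
As a rough illustration of the generative process, here is a minimal
Python sketch. The component means and standard deviations, the equal
mixture weights, the axis-aligned gaussians and the names draw_points,
sincere_ballot and COMPONENTS are all assumptions made for the example;
they are not the values or names used in the software on the web page.

```python
import math
import random

def draw_points(n, components, rng):
    """Draw n points from an equal-weight mixture of 2-D gaussians.

    components is a list of (mean_x, mean_y, sd_x, sd_y) tuples.
    Equal weights and axis-aligned components are assumptions.
    """
    points = []
    for _ in range(n):
        mx, my, sx, sy = rng.choice(components)
        points.append((rng.gauss(mx, sx), rng.gauss(my, sy)))
    return points

def sincere_ballot(voter, candidates):
    """Rank candidate indices from nearest to farthest."""
    return sorted(range(len(candidates)),
                  key=lambda c: math.dist(voter, candidates[c]))

# Hypothetical component parameters, chosen only to be asymmetric.
COMPONENTS = [(0.0, 0.0, 1.0, 1.0),
              (2.0, 1.0, 0.5, 1.5),
              (-1.0, 2.0, 1.5, 0.5)]

rng = random.Random(1)
candidates = draw_points(9, COMPONENTS, rng)
voters = draw_points(101, COMPONENTS, rng)  # the study uses 10001
ballots = [sincere_ballot(v, candidates) for v in voters]
```

Each ballot is a complete ranking of the 9 candidate indices, most
preferred first.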

'sptp' is the sum of first and second preferences; 'av'=IRV=Hare; 'mj' 
is an idealised highest median method in which a voter's rating of a 
candidate is minus the Euclidean distance between them. This is 
unrealistically favourable to MJ since it eliminates the requirement on 
voters to quantise their ratings, which they will do inconsistently. 
MJ's performance is estimated from a tenth as many elections as the
other methods.
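
The two less familiar methods can be sketched as follows. This is a
simplified reading of the definitions above, not the code used for the
results: the function names are hypothetical, ties are broken by lowest
candidate index, and the real MJ tie-breaking procedure is ignored.

```python
import math
import statistics

def sptp_winner(ballots, n_candidates):
    """Sum of first and second preferences: one point for each ballot
    that places the candidate first or second."""
    scores = [0] * n_candidates
    for ballot in ballots:
        for c in ballot[:2]:
            scores[c] += 1
    return max(range(n_candidates), key=lambda c: scores[c])

def ideal_mj_winner(voters, candidates):
    """Idealised highest-median method: a voter's rating of a candidate
    is minus the Euclidean distance between them, unquantised."""
    def median_rating(candidate):
        return statistics.median(-math.dist(v, candidate) for v in voters)
    return max(range(len(candidates)),
               key=lambda i: median_rating(candidates[i]))
```

Note that sptp_winner works from ranked ballots, whereas ideal_mj_winner
needs the voter and candidate positions themselves.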

'condorcet' is the bare Condorcet method, which elects the Condorcet
winner when there is one and is scored as giving the wrong result
whenever there is a cycle. 'condorcet+X' uses X as a cycle-breaker, so
'condorcet+borda' = Black.
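
A minimal sketch of the 'condorcet+X' construction, with Borda as the
cycle-breaker (i.e. Black). The function names are hypothetical, ballots
are assumed to be complete rankings, and Borda ties go to the lowest
candidate index.

```python
def pairwise_matrix(ballots, n):
    """wins[a][b] = number of ballots ranking a above b."""
    wins = [[0] * n for _ in range(n)]
    for ballot in ballots:
        for i, a in enumerate(ballot):
            for b in ballot[i + 1:]:
                wins[a][b] += 1
    return wins

def condorcet_winner(wins):
    """Return the candidate who beats every other, or None if there is
    no such candidate (e.g. because of a cycle)."""
    n = len(wins)
    for a in range(n):
        if all(wins[a][b] > wins[b][a] for b in range(n) if b != a):
            return a
    return None

def borda_winner(ballots, n):
    """Borda count over the whole field."""
    scores = [0] * n
    for ballot in ballots:
        for pos, c in enumerate(ballot):
            scores[c] += n - 1 - pos
    return max(range(n), key=lambda c: scores[c])

def condorcet_plus(ballots, n, breaker):
    """Elect the Condorcet winner if any, else defer to breaker."""
    winner = condorcet_winner(pairwise_matrix(ballots, n))
    return winner if winner is not None else breaker(ballots, n)
```

Calling condorcet_plus(ballots, n, borda_winner) then gives Black's
method under these assumptions.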

Tiebreaks may exist in 'full' and 'restricted' forms. If Llull's method 
has a full Borda tiebreak, then a Borda ranking is computed on the 
entire field, and the winner is the Llull winner who comes highest in 
the Borda ranking. A restricted Borda tiebreak is computed by applying 
the Borda count to the Llull tie set – that is, ballots are compressed 
by squeezing out candidates not belonging to the set, and Borda's method 
applied. (Llull=Copeland.) SPTP can only be full and an AV tiebreak is 
generally understood to be restricted.
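
The difference between the two forms can be sketched as below, for
Llull's method with a Borda tiebreak. The names are hypothetical, and
the Llull score is taken to be the number of pairwise wins, which
assumes there are no pairwise ties (true for an odd number of voters
with complete rankings).

```python
def pairwise_matrix(ballots, n):
    """wins[a][b] = number of ballots ranking a above b."""
    wins = [[0] * n for _ in range(n)]
    for ballot in ballots:
        for i, a in enumerate(ballot):
            for b in ballot[i + 1:]:
                wins[a][b] += 1
    return wins

def llull_tie_set(wins):
    """Candidates with the greatest number of pairwise wins."""
    n = len(wins)
    score = [sum(wins[a][b] > wins[b][a] for b in range(n) if b != a)
             for a in range(n)]
    best = max(score)
    return [a for a in range(n) if score[a] == best]

def borda_scores(ballots, field):
    """Borda scores on ballots compressed to the candidates in field."""
    keep = set(field)
    scores = {c: 0 for c in field}
    for ballot in ballots:
        compressed = [c for c in ballot if c in keep]
        for pos, c in enumerate(compressed):
            scores[c] += len(compressed) - 1 - pos
    return scores

def llull_borda(ballots, n, restricted):
    """Llull with a restricted or full Borda tiebreak."""
    tie_set = llull_tie_set(pairwise_matrix(ballots, n))
    field = tie_set if restricted else list(range(n))
    scores = borda_scores(ballots, field)
    return max(tie_set, key=lambda c: scores[c])

# A 3-way cycle among 0, 1, 2, with candidate 3 losing to all of them.
ballots = [[0, 1, 3, 2]] * 5 + [[1, 2, 3, 0]] * 3 + [[2, 0, 1, 3]] * 3
full = llull_borda(ballots, 4, restricted=False)
restr = llull_borda(ballots, 4, restricted=True)
```

In this small example the two forms disagree: the position of the
losing candidate 3 on the ballots changes the full Borda ranking but
has no effect on the restricted one.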

I don't place much faith in Darlington's model, which draws voters from 
a symmetric distribution. Such a model relies on small sample sizes to
prise apart the Condorcet methods. There is no reason to suppose that the
behaviour seen in small samples (which is liable to be influenced by 
numerical ties) will be representative of large electorates.

I find it to be very rare for Llull's method to produce a unique winner 
when there are Condorcet cycles (<3% of the time). Perhaps I should be 
suspicious of this. Smith's method never gives a unique winner in the 
presence of cycles.

There are two metrics. The first is the percentage of elections in which 
each method elects the rightful winner, defined as the candidate whose 
mean distance from voters is minimised. The second is mean loss. If a 
voting method elects a candidate whose mean distance from voters is y, 
and if the minimum mean distance from voters over candidates is x, then 
the loss is y-x. I consider the loss values to be meaningless in
themselves, but they provide a sounder basis for comparison than
percentage correctness.
The average distance from a voter to the best candidate in any election 
is around 1.7, so the differences in performance between competitive 
methods are very small.
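
The two metrics can be sketched for a single election as follows. The
function names are hypothetical; averaging the first value over
elections gives the percentage correct, and averaging the second gives
the mean loss.

```python
import math

def mean_distance(candidate, voters):
    """Mean Euclidean distance from the voters to one candidate."""
    return sum(math.dist(v, candidate) for v in voters) / len(voters)

def score_election(winner, candidates, voters):
    """Return (correct, loss) for the candidate index a method elected."""
    d = [mean_distance(c, voters) for c in candidates]
    x = min(d)                    # mean distance of the rightful winner
    y = d[winner]                 # mean distance of the elected candidate
    return winner == d.index(x), y - x
```

For example, with voters at (0,0) and (2,0) and candidates at (1,0) and
(4,0), electing the second candidate is scored as incorrect with a loss
of 3 - 1 = 2.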

Table 5 (measuring percentage correct for a particular form of tactical 
voting) is the sole case in which Condorcet+AV (i.e. Condorcet/Hare)
outperforms Minimax, and when we look at the corresponding figures under 
a loss metric the tables are turned: evidently Condorcet/Hare makes 
fewer but larger mistakes than Minimax, and is weaker overall.

The best result in each table has a solid underline; all other results 
which are better than Minimax have broken underlines. There are various 
interesting points of detail, but overall the striking feature is the 
consistently good performance of Minimax. Smith+BordaR, Smith+MinimaxR 
and Smith+MinimaxF run it close.

The 8 tables apply 2 metrics to sincere voting and to 3 forms of 
tactical voting. In all 3 types of tactical voting, voters whose sincere
first preference is for a certain candidate (c0) place another candidate 
(c1) at an insincere position in their lists. Type 1 (compromising) 
moves c1 to the top; type 2 (false cycles) puts c1 second; and type 3
(burying) puts c1 at the bottom.
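
The three ballot transformations can be sketched like this. The function
name and the string labels are hypothetical, and c1 is assumed to differ
from c0.

```python
def tactical_ballot(ballot, c0, c1, kind):
    """Return the ballot a voter casts under one type of tactical voting.

    Only voters whose sincere first preference is c0 change their
    ballot; everyone else votes sincerely.
    """
    if ballot[0] != c0:
        return list(ballot)
    rest = [c for c in ballot if c != c1]   # sincere order without c1
    if kind == "compromising":
        pos = 0                             # c1 moves to the top
    elif kind == "false_cycles":
        pos = 1                             # c1 goes second, behind c0
    elif kind == "burying":
        pos = len(rest)                     # c1 goes to the bottom
    else:
        raise ValueError("unknown kind: " + kind)
    return rest[:pos] + [c1] + rest[pos:]
```

So a sincere ballot [0, 1, 2, 3, 4] with c0=0 and c1=3 becomes
[3, 0, 1, 2, 4] under compromising and [0, 3, 1, 2, 4] under false
cycles.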

The number of candidates is reduced to 5 for tactical voting. Each of 
the 20 (c0,c1) pairs corresponds to an attempt at subversion, which is 
considered to be successful if the winning candidate is closer to c0, 
and further from the average voter, than the winner under sincere 
voting. The result attributed to each method is the worst amongst its 
sincere result and all successful subversions of the given type. I have 
not considered the effect of simultaneous tactical voting by supporters 
of more than one candidate, either in the same or in different ways.
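
One possible reading of the success test and of the "worst result" rule
is sketched below. It assumes that "closer to c0" means Euclidean
distance between candidate positions and that "worst" means the largest
mean distance from voters; the function names are hypothetical.

```python
import math

def mean_distance(point, voters):
    """Mean Euclidean distance from the voters to a point."""
    return sum(math.dist(v, point) for v in voters) / len(voters)

def subversion_succeeds(sincere_w, tactical_w, c0, candidates, voters):
    """True if the tactical winner is closer to c0, and further from
    the average voter, than the sincere winner."""
    closer_to_c0 = (math.dist(candidates[tactical_w], candidates[c0])
                    < math.dist(candidates[sincere_w], candidates[c0]))
    worse_for_voters = (mean_distance(candidates[tactical_w], voters)
                        > mean_distance(candidates[sincere_w], voters))
    return closer_to_c0 and worse_for_voters

def worst_winner(sincere_w, attempts, candidates, voters):
    """attempts holds (c0, tactical_winner) pairs for one type of
    tactical voting; return the worst winner among the sincere result
    and the successful subversions."""
    outcomes = [sincere_w]
    for c0, tactical_w in attempts:
        if subversion_succeeds(sincere_w, tactical_w, c0,
                               candidates, voters):
            outcomes.append(tactical_w)
    return max(outcomes, key=lambda w: mean_distance(candidates[w], voters))
```

An attempt that makes things worse for the average voter without moving
the result towards c0 does not count as a success under this reading,
so it does not affect the result attributed to the method.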

A choice between voting methods should not be based purely on figures 
such as those given. An important factor is a method's sensitivity to 
'irrelevant' candidates (e.g. through strategic nomination). This is not
tested here, but the Borda count is notoriously weak in this respect,
and I would be inclined to reject any use of it (even as a tiebreak) in
consequence.

Another consideration is the pragmatic acceptability of the various 
methods (with simplicity the main aspect). For my part, I would be 
reluctant to accept any use of the Smith set on account of its lack of 
conceptual simplicity.

The last additional consideration is vulnerability to forms of tactical 
voting not studied here. I don't know whether looking at more 
complicated forms would change anything.

Taking everything together, there really seems to be nothing better than 
Minimax. My instinctive preference beforehand was for Llull's method; in 
particular, I found Darlington's reasons for excluding it to be 
unpersuasive. In fact Llull+SPTP performs reasonably well (and can't be 
accused of trading on the strength of its tiebreak), but it doesn't 
offer any advantage over pure Minimax.

And my final conclusion is that it's ridiculously easy to perform an 
evaluation of this sort. 300 lines of code isn't very much. I would 
expect dozens of such evaluations to have been performed so that correct 
ones would confirm each other while buggy ones stood out. Have they? I
haven't seen them. For what it's worth, when there's an overlap my 
results are generally consistent with Darlington's and with a much older 
evaluation by Chamberlin and Cohen (but the overlap is small).

My software is on the same web page as the results - not because I think 
it's of interest, but because I think reproducible results are of more 
value than unreproducible ones.

CJC