[EM] Some strategy simulation results and comments
Kevin Venzke
stepjak at yahoo.fr
Sun Mar 13 14:19:22 PDT 2011
Hi all,
Here are some results from my strategy simulation.
Let me preface by reiterating what the interesting thing about this simulation
is: The AI voters have to learn each method, with no preprogrammed
understanding of strategy or even what the ballot possibilities are supposed
to mean. Furthermore the pre-election polling that informs the voters about
each other is conducted using the method itself.
Compare this to the past, where the voters in my simulations have been:
1. completely sincere,
2. basing strategy on approval polling, or
3. using method-specific strategy hard-coded (perhaps unconvincingly) by me.
So, some intro notes on methods and names:
Methods like CWP, Range, and IBIFA use 3-slot ballots. Nothing finer is
supported.
AWP and CWP are the Approval- and Cardinal-Weighted Pairwise methods by JGA.
For AWP "ex" and "im" versions are included based on whether approval is
explicit or implied by ranking.
RB is Random Ballot.
RndP means Random Pair (pairwise comparison between two candidates picked
randomly).
CdlA is my Conditional Approval. QR (Quick Runoff) and KH (King of the Hill)
are also my methods.
VDP (Venzke Disq Plurality) is what I will call VFA, so that VFA can refer to
a ballot format.
VFAR3 is VFA ballot runoff type 3, the only one included in this batch. It's
like TTR but with the ability for "against" votes to block a candidate from
the final two.
C//A is Condorcet//Approval. There is C//Aex and C//Aim depending on whether
the approval cutoff is explicitly placed, or implied from the fact of being
ranked.
ICA is my Improved Condorcet Approval. Also included are the FBC/SDSC/SFC-
satisfying MDDA and MAMPO methods.
SMDTR and IBIFA are Chris Benham methods. There is "IBIFAcb" (equal-ranking
(ER) allowed and opposition is defined as Chris originally defined
it), "IBIFApw" (ER permitted but pairwise opposition is used), and "IBIFAst"
(ER not permitted; pairwise opposition is used).
DMC is Definite Majority Choice, which was popular here for a while.
TACC is Forest's Condorcet method. I wasn't sure, but approval is implicit.
VBV is "Venzke Bucklin variant." There is "VBVer" and "VBVst" based on whether
equal ranking is allowed or not.
QLTD is Woodall's Bucklin variant. Others are DAC, DHSC, and DSC. DSC also has
a "DSCer" version allowing equal ranking. I did allow equal ranking for DAC
and DHSC.
IRNR is Instant Runoff using Normalized Ratings.
For Borda, Black, and Baldwin, ER and truncation were not allowed.
C// anything means a Condorcet winner autowins before attempting the rest of
the method.
MAP is Majority Favorite//Antiplurality.
IFPP is Craig Carey's Improved FPP. It is IRV, except that each round
eliminates every candidate with a below-average first-preference count.
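The elimination rule can be sketched like this (the function name, the ballot
representation, and the generalization beyond three candidates are mine, not
the simulator's):

```python
from collections import Counter

def ifpp_round(ballots):
    """One elimination round of (generalized) IFPP: count first
    preferences, eliminate every candidate below the average count,
    and transfer ballots past the eliminated candidates as in IRV.
    `ballots` is a list of non-empty rankings, best candidate first."""
    counts = Counter(ballot[0] for ballot in ballots)
    candidates = {c for ballot in ballots for c in ballot}
    # Candidates with no first preferences count as zero.
    average = len(ballots) / len(candidates)
    survivors = {c for c in candidates if counts[c] >= average}
    return [tuple(c for c in ballot if c in survivors) for ballot in ballots]
```

With 11 voters and 3 candidates the average is about 3.67 first preferences,
so a candidate with only 2 is eliminated immediately.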
2sMMPO is MMPO on approval ballots. Also "MMPOFPP" is included which is MMPO
but breaking a tie with (original) FPs.
Basic Condorcet methods: WV and margins (which are Minmax or whatever you
have). Plus Raynaud(wv).
AER is Approval-Elim Runoff aka Approval AV.
Also included: Antip(lurality), FPP, IRV, TTR, Coombs, Appr(oval), Range3,
Bucklin, MCA. It is possible that Range fails to turn into Approval
strategically, not because of poor AI, but because with a small number of
voting blocs, the methods actually aren't the same strategically.
In this batch there were eight blocs allocated evenly across a 1D spectrum,
and for each "trial" the three candidates were placed randomly. The "base"
size of each bloc is the same, but only 50% of any is guaranteed to show up
for a given poll/election. There are only 1500 trials, which took about eight
hours on a single thread. Each trial is a scenario (candidate allocation) fed
to each of 52 included methods in turn, and then each method required about 3-
10 thousand polls (as needed to achieve a certain level of confidence of
understanding). In total there were 286 million polls.
Eight is not many, but there's no way around an unrealistically small number,
because the AI needs to be able to perceive that it can affect the outcome.
Simultaneously I ran batches for 1) the same scenario except one candidate is
guaranteed to be dead center on the spectrum, and 2) a completely random
scenario not based on a spectrum. But I won't discuss those at this time. We
have enough to go through! So let's start:
COMPETITIVENESS
A curse of many sims I've run is that most scenarios attempted aren't
competitive: One candidate will win virtually all the time. This suggests I am
mainly simulating poor nomination choices and it isn't very interesting to
study.
Fortunately, this "evenly spread voters" scenario, combined with more
intelligent voters, tended to be more competitive with around 40% of trials
having this property: "the candidate who won the most often, won fewer than
85% of the polls."
The most competitive: RndP (100%), RB (98.3%), QR (42.7%), AWP (implicit or
explicit cutoff) (42.4%). Skimming down the list other methods had: margins
41.9%, MCA 41.9%, WV 41.7%, TTR 41.1%, Range 40.9%, and then approaching the
bottom: Approval 39%, IRV 38.3%, FPP 37.2%, VDP 36.3%, 2-slot MMPO 24.9%,
Borda 22.1%, Antip 0.2%(!).
COMPROMISE
A voter used "compromise" strategy if they voted their second-favorite
candidate first, above the true favorite. Methods satisfying FBC are not
supposed to require this strategy, and we can check whether the AI perceived
this.
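As I read this definition, the classifier amounts to something like the
following (a sketch with my own names, assuming strict rankings with the best
candidate first, not the simulator's actual code):

```python
def used_compromise(cast_ranking, sincere_ranking):
    # 'Compromise': the voter put their sincere second favorite in
    # first place. With strict rankings this necessarily places it
    # above the true favorite.
    second_favorite = sincere_ranking[1]
    return len(cast_ranking) > 0 and cast_ranking[0] == second_favorite
```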
The "worst" methods for compromise were FPP (avg 20.7% of voters), then VDP
19.4%, Borda 17.4%, IFPP 17.1%, TTR 17.0%, RndP(!!) 16.0%, DSC 15.2%, IRV
15.0%, QR 13.4%, VFAR3 13.0%, C//IRV 11.9%, KH 10.6%, Black 10.2%, QLTD 9.9%,
VBVst 9.4%, C//KH 9.3%, Baldwin 9.2%, Bucklin 8.8%, MAP 8.5%, DAC 8.3%,
margins 7.6%, IBIFAs (all three; the ER scores may seem high) 6.3-6.6%, IRNR
6.1%, TACC 5.0%, C//Aex 4.5%, AER 3.7%, CdlA 3.7%, RB(!) 3.5%, Coombs 3.3%...
And now we are getting into some FBC methods. MDDA 3.3%, ICA 3.0%, Range 2.7%,
MCA 2.4%, MAMPO 2.0%, MMPO 2.0%, SMDTR 1.9%, Approval 1.2%, 2sMMPO 1.0%,
Antiplurality 0.1%.
Some FBC-failing methods also did as well as this group: DMC 3.1%, C//Aim
3.0%, WV 1.9%, AWPex 1.9%, VBVer 1.9%, AWPim 1.8%, CWP 1.5%, Raynaud 1.2%.
Congratulations to Raynaud.
Already we see an oddity here and there. I do think some of it can be blamed
on methods with more ballot types being harder for the AI to learn. It's
possible that sometimes the sincere vote has no detectable benefit over the
strategic one, so the choice is basically arbitrary. The random methods also
seem harder, although curiously the AI doesn't request very many polls before
it believes it has understood how they work.
COMPRESSION
A voter used "compression" strategy if they rated another candidate equal to
their favorite, tied at the top. Around half the methods don't even allow this
type of ballot.
The methods with the most are rather predictable. 2sMMPO 44.0%, Approval
33.2%, Range 28.3% (interesting drop), SMDTR 23.4%, MCA 20.3%, MMPO 15.8%,
MAMPO 14.6%, Raynaud 12.4%, VBVer 11.8%, TACC 10.9%, WV 10.7%, CWP 10.0%, MDDA
9.7%, IRNR 8.6%, ICA 8.3%, CdlA 6.9%, C//Aim 6.4%, AWPim 5.6%, margins 4.7%,
C//Aex 4.5%, IBIFApw & IBIFAcb 4.3%, AWPex 4.3%, DMC 4.1% (!), DAC 3.5%, DHSC
2.3%, DSCer (1.4% and predicted to be none, so I'm glad).
TRUNCATION
A voter used "truncation" strategy if they bullet-voted for their favorite
(and had the option to vote for more candidates).
Again, it starts predictably. Appr 65.6%, Range 57.5%, 2sMMPO 54.9%, Bucklin
54.5%, DAC 50.7%, QLTD 50.5%, MCA 43.6%, MDDA 43.1%, IBIFA (all!) 35.7-38.8%,
VBVst 30.0%, TACC 25.8%, KH 25.7%, ICA 25.6%, IRNR 25.5%, VBVer 25.0%, SMDTR
22.9%, C//KH 22.2%, IFPP 22.2% (??), VDP 21.7%, AER 20.8%, VFAR3 20.0%, C//Aim
19.7%, C//IRV 19.3%, CdlA 18.3%, IRV 18.1% (!), QR 17.1%, DSC 15.7%, MAMPO
14.7%, DMC 13.7%, CWP 13.1%, AWPim 13.0%, WV 11.1%, margins 10.7%, C//Aex
10.6%, AWPex 10.0%, DSCer 8.5% (?), Raynaud 8.0%, MMPO 7.8%.
There is some oddness with the LNHarm methods. In theory there is no need to
truncate, and only benefit in listing more options, but truncation is fairly
high under IFPP, IRV, QR, and DSC. Only much lower in the list do we see
MMPO. My theory is that these five methods are listed roughly in increasing
order of how likely a lower preference is to be useful. In IFPP lower
preferences are most frequently unused, so the AI (I theorize) doesn't learn
to provide the preferences. In IRV the preferences are used more often, but
they can't help a higher preference. QR, DSC, and MMPO all seem to provide
increasing benefit to listing additional preferences, not just insulation from
harm from those preferences.
BURIAL
A voter used "burial" strategy if they ranked their second choice strictly
last. A narrow definition, I know.
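Under this narrow reading, the check might look like the following sketch
(names mine; I'm assuming strict rankings, best candidate first, and that
leaving the second favorite off a truncated ballot counts as truncation
rather than burial):

```python
def used_burial(cast_ranking, sincere_ranking):
    # 'Burial' (narrow sense): the sincere second favorite is ranked
    # strictly last on a full cast ballot.
    second_favorite = sincere_ranking[1]
    return (len(cast_ranking) == len(sincere_ranking)
            and cast_ranking[-1] == second_favorite)
```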
This one's not as predictable. VFAR3 was worst with 33.9% voting "against"
the "wrong" candidate (which actually is how the method is meant to work).
Borda was next with 27.9%. VDP 25.8% (ditto comment on VFAR3). Baldwin 21.6%,
DSC 21.3% (surprised), MMPO 20.1%, C//Aex 20.0% (vs 7.7% for the implicit
version, rather confirming my suspicions), Black 18.8%, margins 18.6%, MAP
17.1%, DSCer(why better?) 16.5%, WV 16.3%, MAMPO 16.0%, AWPex 15.2%, Coombs
15.1%, SMDTR 15.1%, Antip 15.1%, IFPP (see paragraph above!) 14.9%, C//IRV
14.7%, Raynaud 14.4%, IRNR 13.6%, QR 12.4%, RndP(!!) 11.8%, C//KH 11.6%,
DMC 11.2%, CWP 10.6%, AER 10.5%, IRV 10.0%, KH 9.7%, CdlA 9.3%, AWPim 8.9%,
VBVst 8.4%, TACC 7.8%, ICA 7.8%, C//Aim 7.8%, VBVer 6.5%, IBIFAs 4.4-6.4%, MCA
6.3%, MDDA 5.2%, QLTD 4.5%, Range 3.8%, 2sMMPO 3.6%, Bucklin 3.1%, DAC 2.9%,
Approval 1.0%.
I expect most readers are skimming the above for the best Condorcet methods.
The best ones are apparently C//A implicit and TACC. C//KH did indeed see less
burial than C//IRV, and WV than margins. But congrats to C//A and TACC.
PUSH-OVER
I define this unusually. Push-over means you gave your least favorite
candidate a top ranking/rating. Normally push-over is defined in terms of what
it can accomplish in IRV, but since my voters don't have that kind of
mentality, it can't be defined that way.
This strategy was not very popular. In Antip it is simultaneously considered
burial and used by 15.2%. 2sMMPO 3.7%, RndP (!) 2.2%, TTR 1.5%, VFAR3 1.5%,
CdlA 1.1%, IRNR 1.1%, Appr (!) 1.0%, Raynaud 0.8%, CWP 0.6%, DMC 0.6%, margins
0.5%...
And we start to get into very small values. IRV was actually one of the "best"
methods with 0.07%. It is possible that IRV would do worse in a different
setup of the simulation. But push-over doesn't seem to be a big concern.
SINCERITY
I called a vote "sincere" if it didn't use any of the above strategies, *or*
the ballot used was the one I called "the" sincere one for the voter. I
wouldn't read too much into this figure.
The best method was Random Ballot, 96.5%. Antip 84.5%, TTR 81.6%, Coombs
81.6%, FPP 79.2%... Hm, do I need to stop once I say FPP was the fifth best?
I'll skip over some then. RndP was only 71.7%. AWPim 71.1%, AWPex 68.8%, DMC
68.3%, CWP 65.4%, C//Aim 63.6%, Approval 61.2%, WV 60.4%, margins 58.9%, QR
57.1%, C//KH 56.8%, IRV 56.8%, ICA 55.7%, VBVer 55.3%, MMPO 54.8%, Borda
54.7%, C//IRV 54.0%, KH 54.0%, TACC 51.0%, DSC 47.8%, IRNR 47.1%, IBIFA er
versions 46.3-46.7%, IFPP 45.8%, MDDA 39.1%, QLTD 35.1%, DAC 34.7%, SMDTR
34.2%, Bucklin 33.6%, VDP 33.0%, VFAR3 32.5%, MCA 27.9%, Range 8.0%. (Range
has it hard with that middle slot.)
SINCERE CONDORCET WINNERS
How often did the election method elect the sincere CW when there was one? The
figures here were quite good. Well done AI voters!
As the numbers were close I'll give brackets:
99+%: AWPim (99.87%), AWPex, CdlA, C//Aim, VFAR3, C//Aex, ICA, VBVer, DMC
98+%: margins, IBIFA (all 3 in a row), MDDA, MCA, AER, Black, MAP, CWP, DAC,
MAMPO, MMPO, WV
97+%: QR, Bucklin, QLTD, VBVst, Baldwin, IRNR, Range
96+%: Raynaud, TTR, SMDTR, DHSC, TACC, Coombs, KH
95+%: Approval, C//IRV
94+%: DSCer(??), C//KH, IRV
rest: IFPP 91.5%, DSC 90.3%, 2sMMPO 87.7%, VDP 82.5%, Borda 81.2%, FPP 80.8%,
RndP 66.2%, Antip 55.9%, RB 45.8%.
Before we throw AWP a parade I want to note that I discovered it can fail the
Plurality criterion. I'm not sure this will be acceptable to everyone.
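For reference, detecting the sincere CW comes down to a pairwise check like
this (my sketch, not the simulator's code, assuming strict complete
rankings):

```python
def condorcet_winner(ballots):
    """Return the candidate who beats every rival head-to-head on the
    given strict, complete rankings, or None if no such candidate
    exists."""
    candidates = {c for ballot in ballots for c in ballot}

    def beats(x, y):
        # x beats y if more than half the voters rank x above y.
        wins = sum(ballot.index(x) < ballot.index(y) for ballot in ballots)
        return wins > len(ballots) / 2

    for c in candidates:
        if all(beats(c, rival) for rival in candidates - {c}):
            return c
    return None
```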
SINCERE CONDORCET LOSERS
The results for this were probably even better. The only methods above 1% were
RB 27.9%, Borda 1.6%, and Antip 1.0%. The best methods were DMC and VFAR3 (I'm
seeing 0%), followed by WV, TTR, SMDTR, IRV, MAMPO, margins, DAC... Very tiny
numbers.
UTILITY MAXIMIZERS
How often did the method elect the utility maximizing candidate? This can be
easily bracketed also.
best: Borda 82.5%.
79+%: VFAR3, CdlA, MAP, C//Aim, AWPim, AWPex, 2sMMPO, VBVer, ICA
78+%: C//Aex, DMC, MAMPO, IBIFA (all), QR, MMPO, MDDA, CWP, margins, Black,
MCA, VBVst
77+%: SMDTR, AER, IRV, Range, DAC, IRNR, TACC, Bucklin, WV, KH, QLTD, TTR,
Baldwin, Raynaud, C//KH, C//IRV
76+%: Approval
75+%: Coombs, DSCer(? again unsure why better than DSC)
rest: Antip 74.3%, IFPP 72.4%, DSC 71.8%, VDP 64.8%, FPP 62.1%, RndP 59.5%, RB
39.7%.
UTILITY MINIMIZERS
How often did the method elect the worst candidate wrt utility? This is not
exactly the reverse of the previous list.
worst: RB 26.3%, RndP 4.2%, FPP 4.0%, VDP 3.3%, Borda 2.6%, IFPP 2.2%, DSC 2.0%
1.0+%: IRV 1.6%, C//KH, C//IRV, TTR, KH, Coombs, Antip, TACC
0.5+%: Raynaud, Approval, IRNR, DSCer, SMDTR, 2sMMPO, Baldwin, Range, QLTD
0.3+%: WV, MMPO, CWP, Bucklin, MAMPO, DAC, MCA
0.2+%: margins, AER, VBVst, MDDA, Black, DMC, IBIFA (er and cb)
rest: QR, IBIFAst, VFAR3, C//Aex, MAP, C//Aim, ICA, CdlA, VBVer, AWPex, AWPim.
No perfect scores.
UTILITY AND REGRET
I defined regret as the winner's absolute shortfall from the best available
option. Sorting by regret gives the same order as sorting by average utility,
so one list suffices. Results are the average utility of the winning
candidate.
best: Borda (65.2), MAP (65.1), CdlA, AWPex, AWPim, VBVer, VFAR3, C//Aim, ICA,
C//Aex, DMC (65.0), IBIFAcb, MDDA, MAMPO, IBIFAst, margins, CWP
64+: Black, IBIFApw, Range, 2sMMPO, MMPO, TACC, SMDTR, MCA, VBVst, DAC, AER,
QR, QLTD, IRNR, WV, Bucklin, IRV, Raynaud, Baldwin, KH, TTR, C//IRV, C//KH,
Approval, DHSC, DSCer, Coombs, IFPP, DSC
rest: Antip (63.2), VDP, FPP (62.5), RndP (61.0), RB (54.4).
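As a worked sketch of the regret definition above (the function name and the
utility mapping are mine; utilities are taken as totals over all voters):

```python
def regret(total_utilities, winner):
    """Regret: the winner's absolute shortfall from the best available
    candidate, measured in total voter utility. Illustrative only."""
    return max(total_utilities.values()) - total_utilities[winner]
```

A method that always elects the utility maximizer would score zero regret in
every trial.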
VOTE STABILITY
I tracked the avg percentage of voters who wanted to change their vote from
poll to poll. At the high end I think we have some garbage values due to there
being many ways to vote that don't really change anything. For example, IFPP
scored 45.1%, which is hard to understand. The voters presumably see a
difference between different lower preferences when actually they are not
affecting anything.
The low end may be more informative. RB had .04%, FPP .25%, Antip .28%,
RndP .29%, Borda .88%, Approval 3.1% (a bit surprisingly low, I thought),
Coombs 3.8%, MAP 5.2%, Black and Baldwin 6ish%, Range 9.5%, IBIFAs 12.3%, TACC
12.6%, DAC 12.7%, ICA 13.3%.
SPOILERS
I wanted to know what percentage of voters perceived that one of the losing
candidates had spoiled the outcome. The voter assumes that, without that
candidate, it would've been a two-candidate FPP race.
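One plausible reading of this check, sketched in code (the names and details
are my guesses at the semantics, not the simulator's actual logic; I assume
strict, complete rankings):

```python
def perceives_spoiler(voter_ranking, sincere_rankings, actual_winner, removed):
    """Remove one losing candidate, re-run the race as plain FPP on
    sincere favorites, and ask whether this voter prefers the
    hypothetical winner to the actual one."""
    counts = {}
    for ranking in sincere_rankings:
        # Each voter's favorite among the remaining candidates.
        top = next(c for c in ranking if c != removed)
        counts[top] = counts.get(top, 0) + 1
    new_winner = max(counts, key=counts.get)
    return (new_winner != actual_winner
            and voter_ranking.index(new_winner) < voter_ranking.index(actual_winner))
```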
RB 34.7%, Antip 26.0%, RndP 21.1%, FPP 11.8%, Borda 10.6%, VDP 10.6%,
2sMMPO 6.8%, DSC 5.9%, IFPP 5.2%, IRV 3.3%, C//KH 3.2%, C//IRV 2.9%, Approval
2.7%, KH 2.6%, Coombs 2.4%, TACC 2.2%, SMDTR 2.0%, TTR 2.0%, Raynaud 1.9%,
Range 1.7%, IRNR 1.7%, QLTD 1.4%, VBVst 1.4%, Bucklin 1.3%, QR 1.3%, WV 1.1%,
MMPO 1.1%, DAC 1.1%, MAMPO 1.0%, CWP .95%, MCA .79%, MDDA .77%, IBIFAs .6-
.69%, margins .61%, DMC .53%, VBVer .35%, C//Aex .27%, ICA .27%, C//Aim .22%,
VFAR3 .21%, CdlA .15%, AWPex .13%, AWPim .07%.
SPOILERS BY 3RD PLACE ONLY
I also wanted to see how many voters felt the race had been spoiled by the
candidate who placed last in top rankings/ratings (and who also did not win
himself).
Worst was Antip 20.4%, RB 19.5%, RndP 11.5%, Borda 6.4%, 2sMMPO 4.0%,
Approval .66%, SMDTR .64%, IRV .57%, Range .35%, C//KH .33%,
MAMPO .32%, ...skip a bit... MCA .096%, ICA, CWP, QLTD, DSC, Bucklin, QR, DAC,
VBVst, FPP (!), VFAR3, C//Aex, AER, C//Aim, CdlA, WV, DMC, IFPP (.029%), AWPex
(.0055%), AWPim, margins (!), TTR.
As far as IRV having *any*: This should be because the presence of the third
candidate affected others' strategy, so it wasn't as simple as eliminating the
3rd candidate and being done with him. As far as FPP doing really well: Surely
this is because nobody was foolish enough to try voting for 3rd place, so,
spoiler averted.
You may ask then: If voters are smart enough to try to avoid spoilers, are
these metrics any good? It's little consolation that last place isn't spoiling
the election when this occurs because he isn't getting any votes. So let's try
one final metric:
TOP RANKINGS/RATINGS OF THIRD PLACE
In other words, for the candidate who received the fewest "TR"s, how many did
he get on average?
Antip was the highest at 47.1%, which is not surprising since you aren't
allowed to vote for fewer than two candidates.
Next were 2sMMPO 33.2%, Approval 29.5%, Range 27.2%, SMDTR 26.2%, MCA 24.6%,
MMPO 23.3%, MAMPO 22.6%, Raynaud 21.9%, CWP 21.6%, WV 20.6%, VBVer 20.3%, MDDA
20.4%, ICA 19.1%, TACC 19.1%, AWPim 18.9%, IRNR 18.9%, CdlA 18.7%, AWPex
18.4%, C//Aim 18.2%, DMC 17.4%, C//Aex 16.9%, RB 16.9%, Coombs 15.8%, AER
15.3%, IBIFAcb/pw 15.1%, MAP 14.8%, DAC 14.5%, margins 14.3%, Bucklin 13.6%,
IBIFAst 13.5%, C//KH 13.5%, KH 12.9%, QLTD 12.4%, Black 12.3%, C//IRV 12.0%,
VBVst 11.8%, VFAR3 11.6%, QR 10.8%, IRV 10.4%, TTR 9.7%, Borda 9.2%, DSC 6.7%,
IFPP 6.8%, VDP 3.5%, FPP 2.3%.
Some of the methods allow compression, and multiple top rankings. It is a bit
unclear what this percentage "should" be. My hunch is that it should be at
least as high as the RB value of 16.9%. If it's below that then it must be
that some candidates are not getting the full top ratings/rankings they would
expect, most probably because their supporters view it as unhelpful to vote
for them.
----
Note that these figures are only gathered for the trials where the AI has
supposedly already learned how the method works. Practice polls and
hypothetical polls aren't included.
Also, I can already say that the rankings I got out of a random (non-spectrum-
based) scenario are not quite the same as the ones here. I'll have to talk
about that later.
Thanks for reading, and for any thoughts.
Kevin Venzke