[EM] IRV counting with integers to handle multiple candidates at the same preference level

Mon Jan 31 13:04:10 PST 2022

On 31.01.2022 03:12, Richard, the VoteFair guy wrote:
> On 1/29/2022 5:28 AM, Kristofer Munsterhjelm wrote:

>> If you want to put every bit to maximum use, then instead of the string
>> "ABCDEFG", you can use a 13 bit integer field; in this case, the field
>> would have value 0. GFEDCBA would correspond to value 5039. Truncation
>> and equal rank make it a bit harder; I tried to construct a bijection
>> between truncated/equal-rank ballots and integers once but didn't quite
>> manage to do it.
> 
> Thank you Forest and Kristofer for answering my question.
> 
> I'm interpreting your replies to mean that of course we can summarize
> the ranked choice ballot data in the way that IRV counting requires.  Is
> that right?

Essentially, what we're saying is that IRV maximally fails the
summability criterion, but computers make storagge cheap.

Summability is still better (it's easier to keep control of a paper
tally than all the computers that the SD card will be visiting), but if
summability is off the table for some reason, it's not the end of the
world. At least not as far as storage requirements are concerned.

> As Kristofer says" "Truncation and equal rank make it a bit harder."
> 
> I believe it might be useful to summarize the ballot info in four ways:
> 
> 1. Counts and encoded rankings for ballots that rank each candidate at a
> different preference level.  Using the method above.
> 
> 2. Separate counts and encoded rankings for the ballots that have shared
> ranking levels, but just sending a subset of these, and truncated to
> only show the ranks for the most popular candidates (omitting the
> clearly can't-win candidates)
> 
> 3. Sending the pairwise vote counts -- both for use in finding pairwise
> losing candidates, and as a complete summary of all the ballot data (but
> not sufficient to track IRV or STV vote transfers).
> 
> 4. Maybe also sending counts for how many ballots mark each oval in the
> pattern.  This is like saying 236 ballots have a mark in column 1 and
> row 1, 932 ballots have a mark in column 2 and row 1, etc.  (If a ballot
> marks more than one oval for the same candidate, only the higher-ranked
> oval would be counted.)  Of course this is equivalent to how Score
> ballots can be summarized.

This is called, I think, a positional matrix. Once you have it, you can
calculate the result according to any weighted positional method (like
Borda, Plurality, Antiplurality, etc.) and also Bucklin.

> Regarding type 2, namely shared-preference-level ballots, if a precinct
> has say 50,000 voters, is there a way to estimate whether the upload
> time is hours versus minutes (at normal internet speeds)?

The number of distinct ballots when truncation and equal-rank are both
allowed is, I think, https://oeis.org/A000629. To store an index of
maximum value n, we need ceil(log_2(n)) bits[1]. So the ballot format
that just gives the first voter's ballot, the second voter's ballot, and
so on, requires v * ceil(log2(n)) bits in total. This is a worst case
format: in most cases there will be more than one voter voting a
particular way, in which case the duplication can be compressed.

(https://electowiki.org/wiki/Space_of_possible_elections#Full_rankings,_equality_allowed
has information about how many distinct ballot types there are for
various ballot types, but it doesn't yet have both truncation and equal
rank at the same time.)

So let's take an example with 50k voters -- and 10 candidates.

With 10 candidates, the table says that there are 204495126 different
ballots, so we need 28 bits per voter. The total space required is 1 400
000 bits, or 175 KB. This would take less than a second to transfer on a
normal internet connection. You could easily fit the data on an 1.44MB
floppy, for that matter.

With 20 candidates, you'd need 73 bits per voter and the total election
image would come to 457 KB. That's the worst case. A global election
(7.9 billion voters) with 20 candidates is 72 GB.

So if a lack of summability is a problem, it's not raw storage that
makes it troublesome: it's keeping the people who'd want to corrupt the
results from doing so, and in particular from doing so undetectably.

> Obviously there can be huge numbers of possible marking patterns when
> shared-preference-level ballots are involved, but how many of those
> patterns would appear on more than just a few ballots?
> 
> Would it be reasonable to allow the precinct software to estimate the
> possible winners and truncate the detailed pattern data to only include
> the possible winners (based on this precinct-level estimate).  Or should
> the central counting location be the only source for requesting this
> kind of truncation?
> 
> Does anyone recognize whether summary method 4 offers any meaningful way
> to eliminate some of the candidates as can't win candidates?

IRV advocates often use this observation to propose a method that (IIRC)
is used in Australia. Basically, each round of IRV is summable. So the
precincts can send the center the Plurality counts (one number per
candidate); then the center says "X lost, eliminate him", and they do
and recalculate, send their new counts to the center, and the process
repeats.

Thus if you allow interactivity, not much storage is required at all.

I don't think it's possible to definitely exclude any candidates on a
precinct level when doing IRV. Imagine that the first precinct has
everybody voting for candidate A (i.e. an unanimous result). Meanwhile,
A is the loser in every other precinct but the first one doesn't know
this. Then there's no way for the first precinct to know it should
eliminate A.

So however you do this early elimination, it has to be from the center's
point of view. Hence there must be communication between the precincts
(telling the center what it needs to know in order to know who to
eliminate), and then communication back to the precinct about who should
be eliminated. But at that point you have pretty much the Australian
protocol and you can just use that.

Determining whether the election is close can be done by the kind of
statistics used in polls in general. Suppose the precincts record a
thousand ballots at random. At each stage of the count, they will know
what the likely margin for the true count will be. If it's possible for
the counts to be somewhere inside the margin and X be eliminated, or for
them to be somewhere else inside the margin and Y be eliminated, then
it's too close to tentatively call. Note that X may be eliminated no
matter what the results are within the margin, and yet by bad luck the
sample just happened to be unrepresentative: so this can't be used to
definitively call an election, only to tentatively do so.

(One problem with IRV is that ties may amplify in unpredictable ways.
If, at any stage in the elimination process, either X or Y may plausibly
be eliminated, the differences in ballots can cascade down and change
the result completely depending on whether X or Y was actually
eliminated. In some cases, e.g. when there's a CW with >1/3 first
preferences, or a majority favorite, IRV is stable, but it's not in full
generality.)

-km

[1] As I mentioned earlier, I think it's possible to get this down to
ceil(log_2(n)*v) by using arithmetic coding, but that would just
complicate my argument so I don't bother.