[EM] Identifying ranked ballot sets

Thu Oct 4 07:15:36 PDT 2018

On Thu, Oct 4, 2018 at 6:56 AM Kristofer Munsterhjelm
<km_elmet at t-online.de> wrote:
>
> On 2018-10-02 01:18, John wrote:
> > [Note:  I'm not on-list; please CC me on replies]
> >
> > Somebody suggested a two-ballot set to me with three candidates, and
> > it turns out you can mutate this set and achieve the same pairwise
> > votes.
>
> [...]
> > I'm trying to identify a concise representation of a set of ranked
> > ballots.  At this time, I believe this may be:
> >
> > 1.  The pairwise race matrix (Win-Lose)
> > 2.  The matrix of appearances at each ordinal rank
> >
> > For BACD and DCAB, #2 is as such:
> >
> >    1  2  3  4
> > A 0  1  1  0
> > B 1  0  0  1
> > C 0  1  1  0
> > D 1  0  0  1
> >
> > It probably doesn't matter if you rank the truncated ballots by
> > counting all unranked as ties at that rank or by not counting them at
> > that rank.
>

[...]

>
>
> Recording a single ballot then requires log(c!) = O(c log c) bits. (All
> logarithms from this point on are base-2 logarithms.) If the order of
> the candidates and the voters matter, and there are v voters, then the
> best you can do is to record each ballot, which will require v * c *
> log(c) bits (up to some constant factor). This is exponential in log(c),
> and thus neither positional nor pairwise matrices can accomplish this.
>

Order of voters doesn't matter; only that the set of ballots is
mathematically-identical.

>
> Another possibility is to have a c!-tuple, where each element counts how
> many voters voted that way. Such a structure would require c! * log(v)
> bits; so it's possible to get something that is polynomial in log(v),
> but at the cost of no longer being polynomial in c. (Strictly speaking,
> the log(v) term can be reduced by noting that not all the elements need
> the full bit length because the sum over all of them is supposed to be v.)
>

[...]

>
> So we can either have a data structure that is exponential in c (being
> O(c! log c!)) or one that's exponential in log(v) (being O(v log v)),
> but we can't have one that's polynomial in both c and log(v).

Damn, okay, the math folks were right:  I can only get to NP-hard and
unlikely, not perfect and recoverable.

There's one other saving throw then.

We can generate an SHA-512 hash from a representation constructed by
strict, consistent rules.  Some states of this representation would be
illegal, so many collisions are eliminated.

Take a simple list of positional ballots (c! variations).  You might
indicate 23 voters A>B>C and 17 C>D>B.  If your representation as such
includes more than 64 bits of information, then you have SHA512
collisions.

If all bit combinations are possible, then any collision is valid.  If
vote counts are 48 bits, then there are obviously a lot of 0 bits
because there aren't 300 trillion voters; likewise, if candidates are
8 bits, there obviously aren't 255 different candidates.  The
non-modifiable bits impact the hash, reducing the likelihood of a
state having a collision with a valid state.

You can further reduce collisions by specifying the data:

 * Pairwise results (polynomial);
 * Number of times each candidate appears in each position (polynomial);
 * Number of times each candidate is truncated (linear);
 * Total number of ballots (non-scaling);
 * SHA512 hash (non-scaling)

That gives you 2c^2+c+1+1 values to store, although the values aren't
all the same size (the hash is obviously bigger than the total number
of ballots).

2c^2 + c.  If you want, you can use the number of times each candidate
is in a position by truncation (A>B>C, candidates D and on are in
position 4 by truncation), which is polynomial.  3c^2.

The likelihood of having similar mathematical relationships rapidly diminishes.

Of course, you could also just provide SHA2, SHA256, SHA512, and MD5,
since at this point we're relying on simultaneous collision as a
defense.

Still, for 10 candidates, with 65,000 voters per polling place, in
binary mode, a 10x10 matrix is 200 bytes.  You're looking at 306 bytes
(assuming you don't eliminate the 20 bytes for the X vs. X positions
in the matrix, always zero) to store all five of those things, or QR
18 error correction level H (30% loss recoverable),  16 Q (25%), 13 M
(15%).  Dense enough to display MANY on a 55-inch 4K display for a few
minutes while people take cell phone pictures, even with the text
SHA512 hash below.

Thanks for clarifying.  I'll have to make some concessions, but this is fine.