[EM] Fast Condorcet-Kemeny calculations -- in polynomial time

Wed Dec 21 12:42:07 PST 2011

As previously promised, I am revealing how Condorcet-Kemeny calculations 
can be done fast.

How fast?  Calculating a "complex" case with 12 choices is so fast that 
I haven't bothered to time the calculations.  As an estimate, it is on 
the order of a couple of seconds or less.

The estimated time for calculating results for 50 choices (!) is less 
than a minute -- which is far less than a "lifetime".

The algorithm is described in the documentation that accompanies the 
just-announced VoteFair ranking software.

The full explanation of the algorithm is described on these three web pages:

     http://www.votefair.org/calculation_details_popularity.html

     http://www.votefair.org/insertion_sort_ranking.html

     http://www.votefair.org/choice_specific_pairwise_score.html

For convenience, below is the description for the "insertion-sort" part 
of the algorithm, which is the lowest-level part of the algorithm.  It 
is followed by a brief description of the full algorithm (excluding the 
initial estimation algorithm).

As a reminder, the Condorcet-Kemeny method calculates a full ranking of 
all choices; it does not just find the single winner.

Also consider that the Condorcet-Kemeny method does not specify how to 
resolve sequence-score ties.  The VoteFair ranking software does resolve 
such sequence-score ties by identifying which choices are tied at which 
ranking levels.

Richard Fobes

-------- Insertion-sort part of the algorithm ---------

This explanation clarifies how the
well-known insertion-sort algorithm is applied
to VoteFair popularity ranking in a way that
greatly reduces the number of calculations
needed to find maximum sequence scores.
(This  method is just part of the full
algorithm, which is explained [elsewhere].)

Consider an example in which there are five
choices named A, B, C, D, and E, with a final
sort order that matches this alphabetical
order.

Notation: The notation A>B refers to how
many voters pairwise prefer choice A over
choice B, and the notation B>A refers to how
many voters pairwise prefer choice B over
choice A.  This notation always uses the
"greater-than" symbol ">", and never uses
the "less-than" symbol "<".

At an intermediate stage in this sorting
example, suppose that the choices A, C, and E
have been sorted -- into this correct
order -- and choice B is about to be
sorted, and choice D remains unsorted.
The pairwise counts for this arrangement are
shown below.  The asterisks show the
separation between sorted choices and unsorted
choices.

       |       |       |       |       |       |
       |   A   |   C   |   E   |   B   |   D   |
       |       |       |       |       |       |
  -----*************************-------+-------+
       * \     |       |       *       |       |
   A   *   \   |  A>C  |  A>E  *  A>B  |  A>D  |
       *     \ |       |       *       |       |
  -----+-------+-------+-------+-------+-------+
       *       | \     |       *       |       |
   C   *  C>A  |   \   |  C>E  *  C>B  |  C>D  |
       *       |     \ |       *       |       |
  -----+-------+-------+-------+-------+-------+
       *       |       | \     *       |       |
   E   *  E>A  |  E>C  |   \   *  E>B  |  E>D  |
       *       |       |     \ *       |       |
  -----*************************-------+-------+
       |       |       |       | \     |       |
   B   |  B>A  |  B>C  |  B>E  |   \   |  B>D  |
       |       |       |       |     \ |       |
  -----+-------+-------+-------+-------+-------+
       |       |       |       |       | \     |
   D   |  D>A  |  D>C  |  D>E  |  D>B  |   \   |
       |       |       |       |       |     \ |
  -----+-------+-------+-------+-------+-------+

The diagonal line passes through empty cells
-- that would otherwise represent a
choice's comparison with itself, such as
A>A.

The diagonal line also is the border between
the upper-right triangular area and the
lower-left triangular area.  The sequence
score for the current sequence is the sum of
all the pairwise counts in the upper-right
triangular area (currently A>C + A>E +
A>B + A>D + C>E + C>B + C>D +
E>B + E>D + B>D).

The goal of these calculations is to find
the maximum sequence score, which means the
goal is to change the sequence so that the
largest pairwise counts move into the
upper-right triangular area, leaving the
smallest pairwise counts in the lower-left
triangular area.

The first step in sorting choice B is to
consider the possibility of moving it to the
left of choice E, which would form the sequence
A, C, B, E.  Here is the pairwise-count
matrix for this sequence.  The asterisks
now include choice B because this is a possible
sort order.

       |       |       |       |       |       |
       |   A   |   C   |   B   |   E   |   D   |
       |       |       |       |       |       |
  -----*********************************-------+
       * \     |       |       |       *       |
   A   *   \   |  A>C  |  A>B  |  A>E  *  A>D  |
       *     \ |       |       |       *       |
  -----*-------+-------+-------+-------*-------+
       *       | \     |       |       *       |
   C   *  C>A  |   \   |  C>B  |  C>E  *  C>D  |
       *       |     \ |       |       *       |
  -----*-------+-------+-------+-------*-------+
       *       |       | \     |       *       |
   B   *  B>A  |  B>C  |   \   |  B>E  *  B>D  |
       *       |       |     \ |  ---  *       |
  -----*-------+-------+-------+-------*-------+
       *       |       |       | \     *       |
   E   *  E>A  |  E>C  |  E>B  |   \   *  E>D  |
       *       |       |  ---  |     \ *       |
  -----*********************************-------+
       |       |       |       |       | \     |
   D   |  D>A  |  D>C  |  D>B  |  D>E  |   \   |
       |       |       |       |       |     \ |
  -----+-------+-------+-------+-------+-------+

The only pairwise counts that crossed the
diagonal line are the (underlined) counts
B>E and E>B, which swapped places.
All the other pairwise counts that move do not
cross the diagonal line; they stay on the same
side of the diagonal line.

As a result, the score for this sequence,
compared to the score for the previous
sequence, increases (or decreases if negative)
by the amount B>E minus E>B.  In
this case (assuming there are no complications
that are explained later) the sequence score
has increased because in the final
(alphabetical) sort order, choice B appears
before choice E.

The next step in sorting choice B is to
consider the possibility of moving it to the
left of choice C, which would form the sequence
A, B, C, E.  Here is the pairwise-count
matrix for this sequence.

       |       |       |       |       |       |
       |   A   |   B   |   C   |   E   |   D   |
       |       |       |       |       |       |
  -----*********************************-------+
       * \     |       |       |       *       |
   A   *   \   |  A>B  |  A>C  |  A>E  *  A>D  |
       *     \ |       |       |       *       |
  -----*-------+-------+-------+-------*-------+
       *       | \     |       |       *       |
   B   *  B>A  |   \   |  B>C  |  B>E  *  B>D  |
       *       |     \ |  ---  |       *       |
  -----*-------+-------+-------+-------*-------+
       *       |       | \     |       *       |
   C   *  C>A  |  C>B  |   \   |  C>E  *  C>D  |
       *       |  ---  |     \ |       *       |
  -----*-------+-------+-------+-------*-------+
       *       |       |       | \     *       |
   E   *  E>A  |  E>B  |  E>C  |   \   *  E>D  |
       *       |       |       |     \ *       |
  -----*********************************-------+
       |       |       |       |       | \     |
   D   |  D>A  |  D>B  |  D>C  |  D>E  |   \   |
       |       |       |       |       |     \ |
  -----+-------+-------+-------+-------+-------+

The only pairwise counts that crossed the
diagonal line are the (underlined) counts
B>C and C>B, which swapped places.
The other pairwise counts that moved remained
on the same side of the diagonal line.

The score for this sequence increases (or
decreases if negative) by the amount B>C
minus C>B.  In this case the sequence
score has increased because (in the final
alphabetical order) choice B appears before
choice C.

The final step in sorting choice B is to
consider the possibility of moving it to the
left of choice A, which would form the sequence
B, A, C, E.  Here is the matrix for this
sequence.

       |       |       |       |       |       |
       |   B   |   A   |   C   |   E   |   D   |
       |       |       |       |       |       |
  -----*********************************-------+
       * \     |       |       |       *       |
   B   *   \   |  B>A  |  B>C  |  B>E  *  B>D  |
       *     \ |  ---  |       |       *       |
  -----*-------+-------+-------+-------*-------+
       *       | \     |       |       *       |
   A   *  A>B  |   \   |  A>C  |  A>E  *  A>D  |
       *  ---  |     \ |       |       *       |
  -----*-------+-------+-------+-------*-------+
       *       |       | \     |       *       |
   C   *  C>B  |  C>A  |   \   |  C>E  *  C>D  |
       *       |       |     \ |       *       |
  -----*-------+-------+-------+-------*-------+
       *       |       |       | \     *       |
   E   *  E>B  |  E>A  |  E>C  |   \   *  E>D  |
       *       |       |       |     \ *       |
  -----*********************************-------+
       |       |       |       |       | \     |
   D   |  D>B  |  D>A  |  D>C  |  D>E  |   \   |
       |       |       |       |       |     \ |
  -----+-------+-------+-------+-------+-------+

The only pairwise counts that crossed the
diagonal line are the (underlined) counts
B>A and A>B, which swapped places.
The other pairwise counts that moved remained
on the same side of the diagonal line.

The score for this sequence increases (or
decreases if negative) by the amount B>A
minus A>B.  In this case the sequence
score has decreased because (in the final
alphabetical order) choice B appears after, not
before, choice A.

At this point choice B has been tested at
each position within the sorted portion.
The  maximum sequence score (for the sorted
portion) occurred when it was between choices A
and C.  As a result, choice B will be
moved to the position between choices A and
C.

Notice that the full sequence score did not
need to be calculated in order to find this
"local" maximum.  These calculations only
need to keep track of increases and decreases
that occur as the being-sorted choice swaps
places with successive already-sorted
choices.

The pairwise-count matrix with choice B in
the second sort-order position (between A and
C) is shown below.  Now choice D is the
only unsorted choice.

       |       |       |       |       |       |
       |   A   |   B   |   C   |   E   |   D   |
       |       |       |       |       |       |
  -----*********************************-------+
       * \     |       |       |       *       |
   A   *   \   |  A>B  |  A>C  |  A>E  *  A>D  |
       *     \ |       |       |       *       |
  -----*-------+-------+-------+-------*-------+
       *       | \     |       |       *       |
   B   *  B>A  |   \   |  B>C  |  B>E  *  B>D  |
       *       |     \ |       |       *       |
  -----*-------+-------+-------+-------*-------+
       *       |       | \     |       *       |
   C   *  C>A  |  C>B  |   \   |  C>E  *  C>D  |
       *       |       |     \ |       *       |
  -----*-------+-------+-------+-------*-------+
       *       |       |       | \     *       |
   E   *  E>A  |  E>B  |  E>C  |   \   *  E>D  |
       *       |       |       |     \ *       |
  -----*********************************-------+
       |       |       |       |       | \     |
   D   |  D>A  |  D>B  |  D>C  |  D>E  |   \   |
       |       |       |       |       |     \ |
  -----+-------+-------+-------+-------+-------+

Choice D would be sorted in the same
way.  Of course the maximum sequence score
would occur when choice D is between choices C
and E, so D is moved there.

       |       |       |       |       |       |
       |   A   |   B   |   C   |   D   |   E   |
       |       |       |       |       |       |
  -----*****************************************
       * \     |       |       |       |       *
   A   *   \   |  A>B  |  A>C  |  A>D  |  A>E  *
       *     \ |       |       |       |       *
  -----*-------+-------+-------+-------+-------*
       *       | \     |       |       |       *
   B   *  B>A  |   \   |  B>C  |  B>D  |  B>E  *
       *       |     \ |       |       |       *
  -----*-------+-------+-------+-------+-------*
       *       |       | \     |       |       *
   C   *  C>A  |  C>B  |   \   |  C>D  |  C>E  *
       *       |       |     \ |       |       *
  -----*-------+-------+-------+-------+-------*
       *       |       |       | \     |       *
   D   *  D>A  |  D>B  |  D>C  |   \   |  D>E  *
       *       |       |       |     \ |       *
  -----*-------+-------+-------+-------+-------*
       *       |       |       |       | \     *
   E   *  E>A  |  E>B  |  E>C  |  E>D  |   \   *
       *       |       |       |       |     \ *
  -----*****************************************

Now there are no more choices to sort, so
the resulting sequence is A, B, C, D, E.
In this sequence the full sequence score
-- which equals A>B + A>C + A>D +
A>E + B>C + B>D + B>E + C>D +
C>E + D>E -- is likely to be the
highest possible sequence score.

Additional calculations, as described below,
are needed because in rare cases it is possible
that moving two or more choices at the same
time could produce a higher sequence
score.  This concept is analogous to
climbing a mountain in foggy conditions by
always heading in the locally higher direction
and ending up at the top of a peak and then,
when the fog clears, seeing a higher peak.

------- Full calculation method for VoteFair popularity ranking -------

This is a description of the full algorithm
used to calculate VoteFair popularity ranking
results.

The algorithm begins by calculating the
Choice-Specific Pairwise-Score ranking.
This pre-sort is a required part of the
process.  Without it, some unusual cases
can cause the calculations to fail to find the
sequence with the highest score.  This
pre-sort is analogous to starting a search for
the highest mountain peak within a mountain
range instead of starting the search within a
large valley.

The next step is to apply the insertion-sort
method as described in the section above,
including starting at the left/highest end.

To ensure that all possible moves of each
choice are considered, the insertion-sort
method is done in both directions.
Sorting in both directions means that in some
sorting passes sorting moves choices to the
left, as explained in the above example.
In other sorting passes sorting starts by
considering the right-most choice as the first
sorted choice, and choices move to the right,
into the sorted portion.  This convention
ensures movement for choices that need to move
right, instead of left, in order to cause an
increase in the score.

Complications can arise when there is
"circular ambiguity", so additional steps are
used.  The most common cases of circular
ambiguity involve several choices that are tied
for the same sort-order position.

A key part of dealing with circular
ambiguity is to follow this convention:
whenever a choice can produce the same,
highest, sequence score at more than one
position, the choice is moved to the farthest
of those highest-sequence-score positions.

Another part of dealing with these
complications is to sort the sequence multiple
times.

During the second sorting pass, if there is
no circular ambiguity, the sequence of the
choices in the pairwise matrix remains the
same.  This lack of movement (when there
is no circular ambiguity) occurs because the
sorted and unsorted portions are
adjacent.  Specifically, each choice to be
sorted is already at the top (for left-movement
sorting) or bottom (for right-movement sorting)
of the "unsorted" portion, and it is being
moved to the bottom (for left-movement sorting)
or top (for right-movement sorting) of
the "sorted" portion.  In such cases
the only thing that moves is the boundary
between the sorted choices and unsorted
choices.

However, in cases that involve circular
ambiguity, the positions of some choices will
change during the second and later sorting
passes.  This happens because the
convention (as explained above) is to move each
choice as far as it will go, within the limits
of maximizing the sequence score.

During the sorting passes the highest
sort-order (sequence) position of each choice
is tracked, and the lowest sort-order position
of each choice is tracked.  These highest
and lowest positions are reset (to current
positions) whenever the sequence score
increases to a higher score.  At the end
of the sorting process the highest and lowest
positions reveal which choices are tied at the
same popularity ranking level.

Using the insertion-sort example, if choices
B, C, and D can be in any order and still
produce the same highest sequence score, then
each of these choices would move to the left of
the other two each time it is sorted, and each
of these choices would have the same
highest-ranked position of second place, and
each would have the same lowest-ranked position
of fourth place. Because these three choices
have the same highest and lowest positions,
they are easily identified as tied (at the same
popularity ranking).

More complex cases of circular ambiguity can
occur.  To deal with these cases, and to
ensure the correctness of the "winner" (the
most popular choice), the sorting process is
repeated for the top half (plus one) of the
highest-ranked choices, and this sub-set
sorting is repeated until there are just three
choices.  For example, if there are 12
choices, the sorting process is done for 12
choices, then the top 7 choices, then the top 4
choices, and finally the top 3 choices.
Then the highest-ranked choice (or the choices
that are tied at the top) is kept at the
highest rank while the other choices are sorted
a final time.  (If, instead, the
least-popular choice is the most important one
to identify correctly, the data supplied to
this algorithm can be inverted according to
preference levels, and then the calculated
ranking can be reversed.)

As a clarification, the extra sub-set
sorting is done only if more than one sequence
has the same highest sequence score.  This
point is significant if the distinction between
VoteFair popularity ranking and the
Condorcet-Kemeny method is relevant.
Specifically, the Condorcet-Kemeny method does
not indicate how such "tied" sequence scores
should be resolved, whereas VoteFair popularity
ranking resolves such "tied" sequence scores as
part of its calculation process.

After all the sorting has been done, the
highest and lowest ranking levels are used to
determine the results.  For each choice
its highest and lowest ranking levels
are added together (which equals twice their
average) and then multiplied times a
constant.  The constant equals 10 times
the number of choices minus one.  These
numbers are converted to integers, and then
these "averaged scaled integerized" values are
used as the non-normalized ranking
levels.  Two or more choices are ranked at
the same level if they have the same
"averaged-scaled-integerized" ranking
values.

The final calculation step is to normalize
the "averaged-scaled-integerized" ranking
levels so that the normalized ranking levels
are consecutive, namely 1, 2, 3, etc. (so that
no ranking levels are skipped).

The result is a ranking that identifies
which choice is first-most popular, which
choice is second-most popular, and so on down
to which choice is least popular.  Ties
can occur at any level.

------ Calculation time ------

The full algorithm used to calculate
VoteFair popularity ranking results  has a
calculation time that is less than or equal to
the following polynomial function:

   T = A + ( B * N ) + ( C * ( N * N ) )

where T is the calculation time, N is the
number of choices, and A and B and C are
constants.  (In mathematical notation, N *
N would be written as N squared.)  This
function includes the calculation time required
for the Choice-Specific Pairwise-Score (CSPS)
pre-sort calculations.

This time contrasts with the slow execution
times  of the "NP-hard" approach, in which
every sequence score is calculated in order to
find the sequence with the highest score.
If every sequence score were calculated (from
scratch), the calculation time would be
proportional to:

   N! * N * N

where N is the number of choices, N! is N
factorial (2 * 3 * 4 * ... * N), and N * N
equals N squared.  Note that N factorial
equals the number of possible sequences, and N
squared times one-half approximately equals the
number of pairwise counts that are added to
calculate each sequence score.

This clarification about calculation time is
included because there is an academically
common -- yet mistaken -- belief that
calculating the "Condorcet-Kemeny method" is
"NP-hard" and cannot be calculated in a time
that is proportional to a polynomial function
of N (the number of choices).

(c) Copyright 2011 Richard Fobes at VoteFair.org

(This description copied from VoteFair.org
with permission.)