[EM] Fast Condorcet-Kemeny calculations -- in polynomial time

Wed Dec 21 13:39:19 PST 2011

This algorithm, at first read, seems that it could still be NP-hard if all
candidates were part of a (three-candidate???) cycle with all other
candidates. Which is, of course, a case which would hardly ever arise in
real life; but one which must be considered in giving a worst-case
execution time order for the algorithm.

Jameson

2011/12/21 Richard Fobes <ElectionMethods at votefair.org>

> As previously promised, I am revealing how Condorcet-Kemeny calculations
> can be done fast.
>
> How fast?  Calculating a "complex" case with 12 choices is so fast that I
> haven't bothered to time the calculations.  As an estimate, it is on the
> order of a couple of seconds or less.
>
> The estimated time for calculating results for 50 choices (!) is less than
> a minute -- which is far less than a "lifetime".
>
> The algorithm is described in the documentation that accompanies the
> just-announced VoteFair ranking software.
>
> The full explanation of the algorithm is described on these three web
> pages:
>
>    http://www.votefair.org/**calculation_details_**popularity.html<http://www.votefair.org/calculation_details_popularity.html>
>
>    http://www.votefair.org/**insertion_sort_ranking.html<http://www.votefair.org/insertion_sort_ranking.html>
>
>    http://www.votefair.org/**choice_specific_pairwise_**score.html<http://www.votefair.org/choice_specific_pairwise_score.html>
>
> For convenience, below is the description for the "insertion-sort" part of
> the algorithm, which is the lowest-level part of the algorithm.  It is
> followed by a brief description of the full algorithm (excluding the
> initial estimation algorithm).
>
> As a reminder, the Condorcet-Kemeny method calculates a full ranking of
> all choices; it does not just find the single winner.
>
> Also consider that the Condorcet-Kemeny method does not specify how to
> resolve sequence-score ties.  The VoteFair ranking software does resolve
> such sequence-score ties by identifying which choices are tied at which
> ranking levels.
>
> Richard Fobes
>
>
> -------- Insertion-sort part of the algorithm ---------
>
> This explanation clarifies how the
> well-known insertion-sort algorithm is applied
> to VoteFair popularity ranking in a way that
> greatly reduces the number of calculations
> needed to find maximum sequence scores.
> (This  method is just part of the full
> algorithm, which is explained [elsewhere].)
>
> Consider an example in which there are five
> choices named A, B, C, D, and E, with a final
> sort order that matches this alphabetical
> order.
>
> Notation: The notation A>B refers to how
> many voters pairwise prefer choice A over
> choice B, and the notation B>A refers to how
> many voters pairwise prefer choice B over
> choice A.  This notation always uses the
> "greater-than" symbol ">", and never uses
> the "less-than" symbol "<".
>
> At an intermediate stage in this sorting
> example, suppose that the choices A, C, and E
> have been sorted -- into this correct
> order -- and choice B is about to be
> sorted, and choice D remains unsorted.
> The pairwise counts for this arrangement are
> shown below.  The asterisks show the
> separation between sorted choices and unsorted
> choices.
>
>      |       |       |       |       |       |
>      |   A   |   C   |   E   |   B   |   D   |
>      |       |       |       |       |       |
>  -----***************************-------+-------+
>      * \     |       |       *       |       |
>  A   *   \   |  A>C  |  A>E  *  A>B  |  A>D  |
>      *     \ |       |       *       |       |
>  -----+-------+-------+-------+**-------+-------+
>      *       | \     |       *       |       |
>  C   *  C>A  |   \   |  C>E  *  C>B  |  C>D  |
>      *       |     \ |       *       |       |
>  -----+-------+-------+-------+**-------+-------+
>      *       |       | \     *       |       |
>  E   *  E>A  |  E>C  |   \   *  E>B  |  E>D  |
>      *       |       |     \ *       |       |
>  -----***************************-------+-------+
>      |       |       |       | \     |       |
>  B   |  B>A  |  B>C  |  B>E  |   \   |  B>D  |
>      |       |       |       |     \ |       |
>  -----+-------+-------+-------+**-------+-------+
>      |       |       |       |       | \     |
>  D   |  D>A  |  D>C  |  D>E  |  D>B  |   \   |
>      |       |       |       |       |     \ |
>  -----+-------+-------+-------+**-------+-------+
>
> The diagonal line passes through empty cells
> -- that would otherwise represent a
> choice's comparison with itself, such as
> A>A.
>
> The diagonal line also is the border between
> the upper-right triangular area and the
> lower-left triangular area.  The sequence
> score for the current sequence is the sum of
> all the pairwise counts in the upper-right
> triangular area (currently A>C + A>E +
> A>B + A>D + C>E + C>B + C>D +
> E>B + E>D + B>D).
>
> The goal of these calculations is to find
> the maximum sequence score, which means the
> goal is to change the sequence so that the
> largest pairwise counts move into the
> upper-right triangular area, leaving the
> smallest pairwise counts in the lower-left
> triangular area.
>
> The first step in sorting choice B is to
> consider the possibility of moving it to the
> left of choice E, which would form the sequence
> A, C, B, E.  Here is the pairwise-count
> matrix for this sequence.  The asterisks
> now include choice B because this is a possible
> sort order.
>
>      |       |       |       |       |       |
>      |   A   |   C   |   B   |   E   |   D   |
>      |       |       |       |       |       |
>  -----***********************************-------+
>      * \     |       |       |       *       |
>  A   *   \   |  A>C  |  A>B  |  A>E  *  A>D  |
>      *     \ |       |       |       *       |
>  -----*-------+-------+-------+**-------*-------+
>      *       | \     |       |       *       |
>  C   *  C>A  |   \   |  C>B  |  C>E  *  C>D  |
>      *       |     \ |       |       *       |
>  -----*-------+-------+-------+**-------*-------+
>      *       |       | \     |       *       |
>  B   *  B>A  |  B>C  |   \   |  B>E  *  B>D  |
>      *       |       |     \ |  ---  *       |
>  -----*-------+-------+-------+**-------*-------+
>      *       |       |       | \     *       |
>  E   *  E>A  |  E>C  |  E>B  |   \   *  E>D  |
>      *       |       |  ---  |     \ *       |
>  -----***********************************-------+
>      |       |       |       |       | \     |
>  D   |  D>A  |  D>C  |  D>B  |  D>E  |   \   |
>      |       |       |       |       |     \ |
>  -----+-------+-------+-------+**-------+-------+
>
> The only pairwise counts that crossed the
> diagonal line are the (underlined) counts
> B>E and E>B, which swapped places.
> All the other pairwise counts that move do not
> cross the diagonal line; they stay on the same
> side of the diagonal line.
>
> As a result, the score for this sequence,
> compared to the score for the previous
> sequence, increases (or decreases if negative)
> by the amount B>E minus E>B.  In
> this case (assuming there are no complications
> that are explained later) the sequence score
> has increased because in the final
> (alphabetical) sort order, choice B appears
> before choice E.
>
> The next step in sorting choice B is to
> consider the possibility of moving it to the
> left of choice C, which would form the sequence
> A, B, C, E.  Here is the pairwise-count
> matrix for this sequence.
>
>      |       |       |       |       |       |
>      |   A   |   B   |   C   |   E   |   D   |
>      |       |       |       |       |       |
>  -----***********************************-------+
>      * \     |       |       |       *       |
>  A   *   \   |  A>B  |  A>C  |  A>E  *  A>D  |
>      *     \ |       |       |       *       |
>  -----*-------+-------+-------+**-------*-------+
>      *       | \     |       |       *       |
>  B   *  B>A  |   \   |  B>C  |  B>E  *  B>D  |
>      *       |     \ |  ---  |       *       |
>  -----*-------+-------+-------+**-------*-------+
>      *       |       | \     |       *       |
>  C   *  C>A  |  C>B  |   \   |  C>E  *  C>D  |
>      *       |  ---  |     \ |       *       |
>  -----*-------+-------+-------+**-------*-------+
>      *       |       |       | \     *       |
>  E   *  E>A  |  E>B  |  E>C  |   \   *  E>D  |
>      *       |       |       |     \ *       |
>  -----***********************************-------+
>      |       |       |       |       | \     |
>  D   |  D>A  |  D>B  |  D>C  |  D>E  |   \   |
>      |       |       |       |       |     \ |
>  -----+-------+-------+-------+**-------+-------+
>
> The only pairwise counts that crossed the
> diagonal line are the (underlined) counts
> B>C and C>B, which swapped places.
> The other pairwise counts that moved remained
> on the same side of the diagonal line.
>
> The score for this sequence increases (or
> decreases if negative) by the amount B>C
> minus C>B.  In this case the sequence
> score has increased because (in the final
> alphabetical order) choice B appears before
> choice C.
>
> The final step in sorting choice B is to
> consider the possibility of moving it to the
> left of choice A, which would form the sequence
> B, A, C, E.  Here is the matrix for this
> sequence.
>
>      |       |       |       |       |       |
>      |   B   |   A   |   C   |   E   |   D   |
>      |       |       |       |       |       |
>  -----***********************************-------+
>      * \     |       |       |       *       |
>  B   *   \   |  B>A  |  B>C  |  B>E  *  B>D  |
>      *     \ |  ---  |       |       *       |
>  -----*-------+-------+-------+**-------*-------+
>      *       | \     |       |       *       |
>  A   *  A>B  |   \   |  A>C  |  A>E  *  A>D  |
>      *  ---  |     \ |       |       *       |
>  -----*-------+-------+-------+**-------*-------+
>      *       |       | \     |       *       |
>  C   *  C>B  |  C>A  |   \   |  C>E  *  C>D  |
>      *       |       |     \ |       *       |
>  -----*-------+-------+-------+**-------*-------+
>      *       |       |       | \     *       |
>  E   *  E>B  |  E>A  |  E>C  |   \   *  E>D  |
>      *       |       |       |     \ *       |
>  -----***********************************-------+
>      |       |       |       |       | \     |
>  D   |  D>B  |  D>A  |  D>C  |  D>E  |   \   |
>      |       |       |       |       |     \ |
>  -----+-------+-------+-------+**-------+-------+
>
> The only pairwise counts that crossed the
> diagonal line are the (underlined) counts
> B>A and A>B, which swapped places.
> The other pairwise counts that moved remained
> on the same side of the diagonal line.
>
> The score for this sequence increases (or
> decreases if negative) by the amount B>A
> minus A>B.  In this case the sequence
> score has decreased because (in the final
> alphabetical order) choice B appears after, not
> before, choice A.
>
> At this point choice B has been tested at
> each position within the sorted portion.
> The  maximum sequence score (for the sorted
> portion) occurred when it was between choices A
> and C.  As a result, choice B will be
> moved to the position between choices A and
> C.
>
> Notice that the full sequence score did not
> need to be calculated in order to find this
> "local" maximum.  These calculations only
> need to keep track of increases and decreases
> that occur as the being-sorted choice swaps
> places with successive already-sorted
> choices.
>
> The pairwise-count matrix with choice B in
> the second sort-order position (between A and
> C) is shown below.  Now choice D is the
> only unsorted choice.
>
>      |       |       |       |       |       |
>      |   A   |   B   |   C   |   E   |   D   |
>      |       |       |       |       |       |
>  -----***********************************-------+
>      * \     |       |       |       *       |
>  A   *   \   |  A>B  |  A>C  |  A>E  *  A>D  |
>      *     \ |       |       |       *       |
>  -----*-------+-------+-------+**-------*-------+
>      *       | \     |       |       *       |
>  B   *  B>A  |   \   |  B>C  |  B>E  *  B>D  |
>      *       |     \ |       |       *       |
>  -----*-------+-------+-------+**-------*-------+
>      *       |       | \     |       *       |
>  C   *  C>A  |  C>B  |   \   |  C>E  *  C>D  |
>      *       |       |     \ |       *       |
>  -----*-------+-------+-------+**-------*-------+
>      *       |       |       | \     *       |
>  E   *  E>A  |  E>B  |  E>C  |   \   *  E>D  |
>      *       |       |       |     \ *       |
>  -----***********************************-------+
>      |       |       |       |       | \     |
>  D   |  D>A  |  D>B  |  D>C  |  D>E  |   \   |
>      |       |       |       |       |     \ |
>  -----+-------+-------+-------+**-------+-------+
>
> Choice D would be sorted in the same
> way.  Of course the maximum sequence score
> would occur when choice D is between choices C
> and E, so D is moved there.
>
>      |       |       |       |       |       |
>      |   A   |   B   |   C   |   D   |   E   |
>      |       |       |       |       |       |
>  -----*******************************************
>      * \     |       |       |       |       *
>  A   *   \   |  A>B  |  A>C  |  A>D  |  A>E  *
>      *     \ |       |       |       |       *
>  -----*-------+-------+-------+**-------+-------*
>      *       | \     |       |       |       *
>  B   *  B>A  |   \   |  B>C  |  B>D  |  B>E  *
>      *       |     \ |       |       |       *
>  -----*-------+-------+-------+**-------+-------*
>      *       |       | \     |       |       *
>  C   *  C>A  |  C>B  |   \   |  C>D  |  C>E  *
>      *       |       |     \ |       |       *
>  -----*-------+-------+-------+**-------+-------*
>      *       |       |       | \     |       *
>  D   *  D>A  |  D>B  |  D>C  |   \   |  D>E  *
>      *       |       |       |     \ |       *
>  -----*-------+-------+-------+**-------+-------*
>      *       |       |       |       | \     *
>  E   *  E>A  |  E>B  |  E>C  |  E>D  |   \   *
>      *       |       |       |       |     \ *
>  -----*******************************************
>
> Now there are no more choices to sort, so
> the resulting sequence is A, B, C, D, E.
> In this sequence the full sequence score
> -- which equals A>B + A>C + A>D +
> A>E + B>C + B>D + B>E + C>D +
> C>E + D>E -- is likely to be the
> highest possible sequence score.
>
> Additional calculations, as described below,
> are needed because in rare cases it is possible
> that moving two or more choices at the same
> time could produce a higher sequence
> score.  This concept is analogous to
> climbing a mountain in foggy conditions by
> always heading in the locally higher direction
> and ending up at the top of a peak and then,
> when the fog clears, seeing a higher peak.
>
> ------- Full calculation method for VoteFair popularity ranking -------
>
> This is a description of the full algorithm
> used to calculate VoteFair popularity ranking
> results.
>
> The algorithm begins by calculating the
> Choice-Specific Pairwise-Score ranking.
> This pre-sort is a required part of the
> process.  Without it, some unusual cases
> can cause the calculations to fail to find the
> sequence with the highest score.  This
> pre-sort is analogous to starting a search for
> the highest mountain peak within a mountain
> range instead of starting the search within a
> large valley.
>
> The next step is to apply the insertion-sort
> method as described in the section above,
> including starting at the left/highest end.
>
> To ensure that all possible moves of each
> choice are considered, the insertion-sort
> method is done in both directions.
> Sorting in both directions means that in some
> sorting passes sorting moves choices to the
> left, as explained in the above example.
> In other sorting passes sorting starts by
> considering the right-most choice as the first
> sorted choice, and choices move to the right,
> into the sorted portion.  This convention
> ensures movement for choices that need to move
> right, instead of left, in order to cause an
> increase in the score.
>
> Complications can arise when there is
> "circular ambiguity", so additional steps are
> used.  The most common cases of circular
> ambiguity involve several choices that are tied
> for the same sort-order position.
>
> A key part of dealing with circular
> ambiguity is to follow this convention:
> whenever a choice can produce the same,
> highest, sequence score at more than one
> position, the choice is moved to the farthest
> of those highest-sequence-score positions.
>
> Another part of dealing with these
> complications is to sort the sequence multiple
> times.
>
> During the second sorting pass, if there is
> no circular ambiguity, the sequence of the
> choices in the pairwise matrix remains the
> same.  This lack of movement (when there
> is no circular ambiguity) occurs because the
> sorted and unsorted portions are
> adjacent.  Specifically, each choice to be
> sorted is already at the top (for left-movement
> sorting) or bottom (for right-movement sorting)
> of the "unsorted" portion, and it is being
> moved to the bottom (for left-movement sorting)
> or top (for right-movement sorting) of
> the "sorted" portion.  In such cases
> the only thing that moves is the boundary
> between the sorted choices and unsorted
> choices.
>
> However, in cases that involve circular
> ambiguity, the positions of some choices will
> change during the second and later sorting
> passes.  This happens because the
> convention (as explained above) is to move each
> choice as far as it will go, within the limits
> of maximizing the sequence score.
>
> During the sorting passes the highest
> sort-order (sequence) position of each choice
> is tracked, and the lowest sort-order position
> of each choice is tracked.  These highest
> and lowest positions are reset (to current
> positions) whenever the sequence score
> increases to a higher score.  At the end
> of the sorting process the highest and lowest
> positions reveal which choices are tied at the
> same popularity ranking level.
>
> Using the insertion-sort example, if choices
> B, C, and D can be in any order and still
> produce the same highest sequence score, then
> each of these choices would move to the left of
> the other two each time it is sorted, and each
> of these choices would have the same
> highest-ranked position of second place, and
> each would have the same lowest-ranked position
> of fourth place. Because these three choices
> have the same highest and lowest positions,
> they are easily identified as tied (at the same
> popularity ranking).
>
> More complex cases of circular ambiguity can
> occur.  To deal with these cases, and to
> ensure the correctness of the "winner" (the
> most popular choice), the sorting process is
> repeated for the top half (plus one) of the
> highest-ranked choices, and this sub-set
> sorting is repeated until there are just three
> choices.  For example, if there are 12
> choices, the sorting process is done for 12
> choices, then the top 7 choices, then the top 4
> choices, and finally the top 3 choices.
> Then the highest-ranked choice (or the choices
> that are tied at the top) is kept at the
> highest rank while the other choices are sorted
> a final time.  (If, instead, the
> least-popular choice is the most important one
> to identify correctly, the data supplied to
> this algorithm can be inverted according to
> preference levels, and then the calculated
> ranking can be reversed.)
>
> As a clarification, the extra sub-set
> sorting is done only if more than one sequence
> has the same highest sequence score.  This
> point is significant if the distinction between
> VoteFair popularity ranking and the
> Condorcet-Kemeny method is relevant.
> Specifically, the Condorcet-Kemeny method does
> not indicate how such "tied" sequence scores
> should be resolved, whereas VoteFair popularity
> ranking resolves such "tied" sequence scores as
> part of its calculation process.
>
> After all the sorting has been done, the
> highest and lowest ranking levels are used to
> determine the results.  For each choice
> its highest and lowest ranking levels
> are added together (which equals twice their
> average) and then multiplied times a
> constant.  The constant equals 10 times
> the number of choices minus one.  These
> numbers are converted to integers, and then
> these "averaged scaled integerized" values are
> used as the non-normalized ranking
> levels.  Two or more choices are ranked at
> the same level if they have the same
> "averaged-scaled-integerized" ranking
> values.
>
> The final calculation step is to normalize
> the "averaged-scaled-integerized" ranking
> levels so that the normalized ranking levels
> are consecutive, namely 1, 2, 3, etc. (so that
> no ranking levels are skipped).
>
> The result is a ranking that identifies
> which choice is first-most popular, which
> choice is second-most popular, and so on down
> to which choice is least popular.  Ties
> can occur at any level.
>
> ------ Calculation time ------
>
> The full algorithm used to calculate
> VoteFair popularity ranking results  has a
> calculation time that is less than or equal to
> the following polynomial function:
>
>  T = A + ( B * N ) + ( C * ( N * N ) )
>
> where T is the calculation time, N is the
> number of choices, and A and B and C are
> constants.  (In mathematical notation, N *
> N would be written as N squared.)  This
> function includes the calculation time required
> for the Choice-Specific Pairwise-Score (CSPS)
> pre-sort calculations.
>
> This time contrasts with the slow execution
> times  of the "NP-hard" approach, in which
> every sequence score is calculated in order to
> find the sequence with the highest score.
> If every sequence score were calculated (from
> scratch), the calculation time would be
> proportional to:
>
>  N! * N * N
>
> where N is the number of choices, N! is N
> factorial (2 * 3 * 4 * ... * N), and N * N
> equals N squared.  Note that N factorial
> equals the number of possible sequences, and N
> squared times one-half approximately equals the
> number of pairwise counts that are added to
> calculate each sequence score.
>
> This clarification about calculation time is
> included because there is an academically
> common -- yet mistaken -- belief that
> calculating the "Condorcet-Kemeny method" is
> "NP-hard" and cannot be calculated in a time
> that is proportional to a polynomial function
> of N (the number of choices).
>
> (c) Copyright 2011 Richard Fobes at VoteFair.org
>
> (This description copied from VoteFair.org
> with permission.)
>
> ----
> Election-Methods mailing list - see http://electorama.com/em for list info
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.electorama.com/pipermail/election-methods-electorama.com/attachments/20111221/20ae6424/attachment-0004.htm>