[EM] Fast Condorcet-Kemeny calculation times, clarification of NP-hardness issue

Sun Mar 4 12:44:58 PST 2012

Finally, after reading the articles cited by Warren Smith (listed at the 
bottom of this reply) plus some related articles, I can reply to his 
insistence that Condorcet-Kemeny calculations take too long to 
calculate.  Also, this reply addresses the same claim that appears in 
Wikipedia both in the "Kemeny-Young method" article and in the 
comparison table within the Wikipedia "Voting systems" article (in the 
"polynomial time" column that Markus Schulze added).

One source of confusion is that Warren, and perhaps others, regard the 
Condorcet-Kemeny problem as a "decision problem" that only has a "yes" 
or "no" answer.  This view is suggested by Warren's reference (below and 
in other messages) to the problem as being NP-complete, which only 
applies to decision problems.  Although it is possible to formulate a 
decision problem based on one or more specified characteristics of the 
Condorcet-Kemeny method, that is a different problem than the 
Condorcet-Kemeny problem.

In the real world of elections, the Condorcet-Kemeny problem is to 
calculate a ranking of all choices (e.g. candidates) that maximizes the 
sequence score (or minimizes the "Kemeny score").

Clearly the Condorcet-Kemeny problem is an optimization problem, not a 
decision problem (and not a search problem).  It is an optimization 
problem because we have a way to measure how closely the solution 
reaches its goal.

(For contrast, consider the NP-hard "subset sum problem" in which the 
goal is to determine whether a specified list of integers contains a 
subset that can be added and/or subtracted to yield zero.  Any subset 
either sums to zero or it doesn't sum to zero.  This makes it easy to 
formulate the related decision (yes/no) problem that asks whether such a 
subset exists for a given set of numbers.)

Because the Condorcet-Kemeny problem is an optimization problem, the 
solution to the Condorcet-Kemeny problem can be an approximation.  If 
this approach is used, it becomes relevant to ask how closely the 
approximation reaches the ranking that has the highest sequence score. 
Yet even this question -- of "how close?" -- is not a decision problem 
(because it goes beyond a yes or no answer).

Keeping in mind that VoteFair popularity ranking calculations are 
mathematically equivalent to the Condorcet-Kemeny method, my claim is 
that VoteFair popularity ranking calculations yield, at the least, the 
same top-ranked choice, and the same few top-ranked choices, as the 
solution produced by examining every sequence score -- except (and this 
is the important part) in cases where the voter preferences are so 
convoluted that any top-ranked choice and any few top-ranked choices 
would be controversial.  As one academic paper elegantly put it: 
"garbage in, garbage out".

More specifically, here is a set of claims that more rigorously state 
the above ambiguous claim.

Claim 1: For _some_ _instances_, a polynomial-time calculation can 
identify the full ranking that produces the highest Condorcet-Kemeny 
sequence score.

Claim 2: For _some_ _instances_, a polynomial-time calculation can rank 
the top most-popular candidates/choices and this partial ranking will be 
the same as the top portion of the full ranking as determined by 
identifying the highest Condorcet-Kemeny sequence score.

Claim 3: For the _remaining_ _instances_ (not covered in Claims 1 and 
2), an approximation of the full Condorcet-Kemeny ranking can be 
calculated in polynomial time.

Claim 4: For any cases in which the top-ranked candidate/choice 
according to the VoteFair popularity ranking algorithm differs from the 
top-ranked candidate/choice according to a full calculation of all 
sequence scores, the outcome of a runoff election between the two 
candidates/choices would be difficult to predict.

As done in the academic literature, I am excluding the cases in which 
more than one sequence has the same highest sequence score.

To help clarify the validity of these claims, I'll use an analogy.

Consider a special case of the rigorously studied Traveling Salesman 
Problem (TSP), which is NP-hard to solve.  (The TSP also can be 
expressed as a decision problem, in which case the decision problem is 
NP-complete, but that variation is not the problem discussed here.)

The special case -- which I will refer to as the non-returning Traveling 
Salesman Problem -- is that we want to know which city the salesman 
visits first, and we want to know, with successively less interest, 
which city the salesman visits second, third, and so on.  Additionally, 
for this special case, we specify that the cities to be visited are 
roughly located between a beginning point "B" and and ending point "E".

To make this special case mathematically equivalent to the normal 
Traveling Salesman Problem in which the salesman returns to the starting 
city, we create a path of closely spaced cities (labeled "+" below) that 
lead back to the starting city "B".

Here is a diagram of this problem.  Remember that the most important 
thing we want to know is which city ("*") the salesman visits first.

B = Beginning city
* = City to visit
E = Ending city for main portion
+ = City on path back to beginning
(periods = background; assumes monospace font)

Instance 1:
.................................................B.
.....................................*............+
..................................................+
.....................................*............+
...................................*..............+
..............................*...................+
..................................................+
................................*.................+
.........................*........................+
......................*.....*.....................+
..................................................+
..................*..*.....*......................+
..........*....*..................................+
.......*...............*..........................+
..........*......*................................+
.....*...............*............................+
.........*....*.........*.........................+
..........*........*..............................+
.............*....................................+
E.................................................+
+.................................................+
+.................................................+
+++++++++++++++++++++++++++++++++++++++++++++++++++

In this case it is obvious which city is the first one on the path from 
B to E.  And it is obvious which are the next four cities on the path.

What we do not know is the sequence of cities after that (for the path 
that is shortest).

Now let's consider a different instance of this non-returning Traveling 
Salesman Problem.

Instance 2:
.................................................B.
..........................*.......................+
........................*....*....................+
................*.........*...*...................+
.............*.........*....*...*.*...............+
................*...*......*.....*...*............+
.......................*......*...*......*........+
..........*......*.........*......*...*...........+
.............*........*.........*......*..........+
..................*.........*......*..............+
.........*.....*.......*..........................+
.............*.....*..........*....*..............+
..................*..*.....*......................+
..........*....*..................................+
.......*...............*..........................+
..........*......*................................+
.....*...............*............................+
.........*....*.........*.........................+
..........*........*..............................+
.............*....................................+
E.................................................+
+.................................................+
+.................................................+
+++++++++++++++++++++++++++++++++++++++++++++++++++

In this instance we cannot know which city is the first city on the 
shortest path until we know the shortest path through all the cities.

Calculating the absolute shortest path in a convoluted case like 
Instance 2 might require a calculation time that is super-polynomial 
(more than what can be expressed as a polynomial function of the city 
count).

However, we can estimate the shortest path.

Such an approximation might identify a first city that is different from 
the first city on the absolute shortest path. If the "wrong" city is 
identified as the first-visited city, it is understandable that this 
occurs because there is not a clearly identifiable first-visit city in 
this instance.

This analogy can be extended to the Condorcet-Kemeny problem.

In normal election situations, the most important part of the solution 
is the first-ranked winner.  In fact, most voting methods are not 
_designed_ to identify more than the first-ranked winner.

In contrast, the Condorcet-Kemeny problem is designed to identify a full 
ranking.  Accordingly, the second-most important part (of solving the 
Condorcet-Kemeny problem) is to identify the top few highest-ranked choices.

Both of these important goals can be achieved without fully ranking all 
the choices.  This is analogous to solving Instance 1 of the 
non-returning Traveling Salesman Problem.

The importance of calculating the few top-ranked choices, and the 
reduced importance of calculating the lower-ranked choices, is further 
demonstrated when the Condorcet-Kemeny method is used to aggregate 
(merge/join/etc.) separate rankings from different search engines (to 
yield "meta-search" results, which is the intended goal specified by IBM 
employees who authored one of the cited articles about Condorcet-Kemeny 
calculations).  Specifically, a search-engine user is unlikely to look 
at the search results beyond the first few pages, which means that 
carefully calculating the full meta-search ranking for thousands of 
search results is pointless, and therefore the calculation time for a 
full ranking is irrelevant.

(As a further contrast, to clarify this point about a partial solution 
being useful, the subset-sum problem does not have a partial solution. 
All that matters is the existence of at least one solution, or the 
absence of any solution.)

Therefore, in some instances we can solve the NP-hard Condorcet-Kemeny 
problem "quickly" (in polynomial time) in the same way that we can 
"quickly" (in polynomial time) solve some instances -- such as Instance 
1 -- of the NP-hard non-returning Traveling Salesman Problem.

In instances where we use an approximate solution for the 
Condorcet-Kemeny problem, the approximate solution can be calculated in 
polynomial time.  Specifically, the algorithm used for VoteFair 
popularity ranking, which seeks to maximize the Condorcet-Kemeny 
sequence score, always can be solved in polynomial time (as evidenced by 
all the programming loops being bounded).

To further clarify these points, consider the following instance of the 
non-returning Traveling Salesman Problem.

Instance 3:
.................................................B.
..........................*.......................+
........................*....*....................+
................*.........*...*...................+
.............*.........*....*...*.*...............+
................*...*......*.....*...*............+
.......................*......*...*......*........+
.................*.........*......*...*...........+
.............*........*.........*......*..........+
..................*.........*......*..............+
.......................*..........................+
...................*..............................+
..................*..*............................+
..........*....*..................................+
.......*...............*..........................+
..........*......*................................+
.....*...............*............................+
.........*....*.........*.........................+
..........*........*..............................+
.............*....................................+
E.................................................+
+.................................................+
+.................................................+
+++++++++++++++++++++++++++++++++++++++++++++++++++

For this instance, we can calculate the absolute shortest path through 
the group of cities closest to the starting point "B" without also 
calculating the absolute shortest path through the group of cities 
closest to the ending point "E".

Similarly some instances of the Condorcet-Kemeny problem do not require 
calculating the exact order of lower-ranked choices (e.g. candidates) in 
order to exactly find the maximum-sequence-score ranking of the 
top-ranked choices.

Now that the word "instance" and the concept of a partial order are 
clear, I will offer proofs for Claims 1, 2, and 3.

Proof of Claim 1: If an instance has a Condorcet winner and each 
successively ranked choice is pairwise preferred over all the other 
remaining choices, this instance can be ranked in polynomial time.

Proof of Claim 2: If an instance has a Condorcet winner and the next few 
successively ranked choices are each pairwise preferred over all the 
remaining choices, the top-ranked choices for this instance can be 
ranked in polynomial time.

Proof of Claim 3: There are polynomial-time approximation methods that 
can efficiently find a sequence that has a Condorcet-Kemeny sequence 
score that is close to the largest sequence score.

(Clarification: I am not claiming that a ranking result based on 
approximation will have the same fairness characteristics that are 
attributed to the "exact" Condorcet-Kemeny method.)

Using lots of real-life data, plus data that has unusual 
calculation-related characteristics, I have tested the VoteFair ranking 
algorithm against the full approach that calculates all sequence scores 
for up to six choices.  In all these cases there are no differences in 
the top-ranked choice, nor are there any differences in the full ranking 
for the cases that have no ties.  (The cases that involve ties involve 
multiple sequences that have the same highest score, the resolution of 
which is not specified in the Condorcet-Kemeny method.)

Of course Claim 4 would be difficult to prove. (This claim says that if 
the two methods do not identify the same winner, the outcome of a runoff 
election would be difficult to predict.)  The point of Claim 4 is to 
clarify the concept of "controversial" and state that if the two methods 
identify different winners, neither winner is uncontroversial.

As a reminder (especially for anyone skimming), I am not saying that the 
Traveling Salesman Problem is mathematically related to the 
Condorcet-Kemeny problem (beyond both being categorized as NP-hard 
problems).  Instead I am using the well-studied traveling salesman 
problem as an analogy to clarify characteristics of the Condorcet-Kemeny 
problem that some election-method experts seem to misunderstand.

Perhaps the misunderstanding arises because the Condorcet-Kemeny method 
must fully rank all the choices in order to identify the top-ranked 
choice.  In contrast, other methods do the opposite, namely they 
identify the top-ranked choice and then, if a further ranking is needed, 
the process is repeated (although for instant-runoff voting and the 
Condorcet-Schulze method the process of calculating the winner yields 
information that can be used to determine some or all of a full ranking).

If anyone has questions about the calculations done by the open-source 
VoteFair popularity ranking software, and especially about its ability 
to efficiently identify the highest sequence score based on meaningful 
voter preferences, I invite them to look at the clearly commented code. 
  The code is on GitHub (in the CPSolver account) and on the Perl CPAN 
archive (which is mirrored on more than two hundred servers around the 
world).

In summary, although the Condorcet-Kemeny method is mathematically 
categorized as an NP-hard problem, the instances that are NP-hard to 
solve involve either the less-important lower-ranked choices (analogous 
to Instance 1 in the non-returning Traveling Salesman Problem), or 
involve convoluted top-ranked voter preferences that yield controversial 
results (analogous to Instances 2 and 3), or both.  For all other 
instances -- which include all meaningful election situations -- 
score-optimized top-ranking results can be calculated in polynomial time.

Clearly, in contrast to what Warren Smith and Markus Schulze and some 
other election-method experts claim, the calculation time required by 
the Condorcet-Kemeny method is quite practical for use in real-life 
elections.

I'll close with a quote from the article by (IBM researchers) Davenport 
and Kalananam that Warren cited: "NP-hardness is a only [sic] worst case 
complexity result which may not reflect the difficulty of solving 
problems which arise in practice."

Richard Fobes

About the citations below: I was not able to read the article by 
Bartholdi, Tovey, and Trick because it requires paying a $35 fee.  Alas, 
it is the article that other articles refer to for the proof of 
NP-hardness.  However, the other articles, plus related academic 
articles, plus Wikipedia articles, provided sufficient perspective.

Again, thank you Warren, for providing the citations.

On 12/24/2011 10:25 AM, Warren Smith wrote:
> Rank Aggregation Revisited
> Cynthia Dwork, Ravi Kumar, Moni Naor, D. Sivakumar
> http://www.eecs.harvard.edu/~michaelm/CS222/rank2.pdf
>
> J. J. Bartholdi, C. A. Tovey, and M. A. Trick: Voting schemes for
> which it can be difficult to tell who won the
> election, Social Choice and Welfare, 6(2):157–165, 1989.
>
> A Computational Study of the Kemeny Rule for Preference Aggregation
> Andrew Davenport and Jayant Kalagnanam
> http://www.aaai.org/Papers/AAAI/2004/AAAI04-110.pdf
>
> Cohen, W.; Schapire, R.; and Singer, Y. 1999. Learning to order
> things. Journal of Artificial Intelligence Research 10:213-270.
> http://www.jair.org/media/587/live-587-1788-jair.ps
>
> etc
> and it should be noted that it is NP-complete to find an ordering better than X,
> and also NP-hard merely to find the Kemeny winner...
>