[EM] Re: Compactness

Joe Weinstein jweins123 at hotmail.com
Tue Jan 20 17:31:03 PST 2004


Hi Ernie,

In my previous message I stated three points in favor of 
distance-compactness (DC) over boundary-compactness (BC).  The most 
important of these was Forest's point (1) that the very definition of 'the' 
boundary between two districts, or even between two neighboring 'atoms' of 
districts, is often inherently ambiguous.

However, after posting the message I started wondering whether I got carried 
away on my point (3) as to the supposed advantage of DC over BC in 
computation.  So I've since tried to subject raw intuition to a bit of 
precision and a reality check.

Having done so, at least to a point and perhaps not far enough, I must 
revise what I implied and go halfway to your position.

Very likely, in your context, BC is easier to handle than DC, or anyhow is 
no harder to handle.  It sounds like you have a good algorithm, and for good 
reasons.

On the other hand (details below) there seems to be just about equally good 
news from the DC side too, at least if DC uses squared distance.

Details now follow.

First, some terminology.  Let T be the number of census tracts or other 
'atoms' to be assigned to S seats or 'districts'.  Assume that the atoms 
have been numbered consecutively, 1 through T, and likewise the districts, 
1 through S.  Then a districting plan P is a function which assigns to each 
integer in the range 1 through T an integer in the range 1 through S; 
equivalently, it is a T-tuple of integer entries, each in the range 1 
through S.

For purposes of evaluating any plan P, we need various fixed data, in 
particular a T-tuple which gives each atom’s population.

Moreover, for purposes of DC we will need for each of the T atoms its 
centroidal X and Y coordinates: that’s 2T data.
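
To make the bookkeeping concrete, here is a tiny illustration in Python of 
the fixed data and of a plan P.  The numbers are invented, and I use 
zero-based indices 0 through T-1 and 0 through S-1 rather than 1 through T 
and 1 through S.

# Toy fixed data (hypothetical values), zero-based indexing.
T = 6   # number of atoms (census tracts)
S = 2   # number of districts (seats)

# T-tuple of atom populations
population = (120, 95, 180, 60, 140, 105)

# Centroidal X and Y coordinates of each atom: 2T numbers in all
centroid_x = (0.0, 1.0, 2.0, 0.0, 1.0, 2.0)
centroid_y = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)

# A districting plan P: a T-tuple giving each atom's district ID
plan = (0, 0, 1, 0, 1, 1)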

For purposes of BC we will need, for each atom-pair, its 'boundary length'.  
That's roughly T^2/2 numbers if we sloppily store this info as a full TxT 
matrix, even though that matrix is mostly zeros.  Of course, we could 
smartly condense the info into a list of triples, one triple for each 
nonzero boundary between a pair of adjoining atoms.  I think my intuition 
went wrong mainly in that I overlooked this very elementary fact and its 
potentialities.

Euler's polyhedral formula (v - e + f = 2) implies that the number of such 
nonzero boundaries is not too big.  Treat the atoms as the vertices of a 
planar graph, with an edge for each shared boundary; the formula then gives 
at most 3T - 6 edges (for T at least 3), so always fewer than 3T, no matter 
how regular or irregular the configuration of atoms may be.  For each of 
these boundaries, we need store just three numbers: the IDs of the two 
atoms plus the boundary value.
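
Continuing the toy Python illustration, the condensed boundary info might be 
held as a plain list of triples (the atom pairs and lengths here are 
invented):

# Hypothetical condensed boundary data: one (atom_a, atom_b, length) triple
# per pair of adjoining atoms.  By the Euler bound there are fewer than 3T
# such triples, however the atoms happen to be configured.
boundaries = [
    (0, 1, 2.5),   # atoms 0 and 1 share a boundary of length 2.5
    (0, 3, 1.0),
    (1, 2, 3.0),
    (1, 4, 1.5),
    (2, 5, 2.0),
    (3, 4, 2.5),
    (4, 5, 1.0),
]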

So, the two methods seem to require on the order of the same amount of 
info.  What about amount of processing?

Well, the answer there seems to depend on the extent to which it is 
economical in time to pre-calculate and pre-process the info into tables 
which, for evaluating plan after plan, are faster to deal with.

That answer in turn will depend partly on stuff with which I am not now 
(even if I ever was) sufficiently familiar: on just how fast (in various 
instances, for various table sizes and calculation routines and hardware and 
software) table lookup can be vs. calculations on the fly.  For instance, 
for all I know about the comparative timing, in some cases DC calculation 
may go faster if we pre-calculate distances into a TxT matrix, and then look 
up these values for each new plan.
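
As a rough, untested sketch of that pre-calculation idea (in Python with 
numpy, using the toy data above and ignoring population weights), one might 
do something like the following; whether the table lookup actually beats 
on-the-fly arithmetic would of course have to be measured:

import numpy as np

# Pre-calculate a TxT table of squared centroid-to-centroid distances once,
# then re-use it for plan after plan.
xs = np.array(centroid_x)
ys = np.array(centroid_y)
sq_dist = (xs[:, None] - xs[None, :])**2 + (ys[:, None] - ys[None, :])**2

def plan_dc_score(plan):
    """Sum of squared distances over same-district atom pairs, by table lookup."""
    p = np.array(plan)
    same_district = p[:, None] == p[None, :]
    return sq_dist[same_district].sum() / 2.0   # each unordered pair counted once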

The answer will also depend on whether we are generating and testing, one 
after another, lots of essentially different plans, or instead testing 
variants and small modifications of a given initial plan P.  This latter 
case, which apparently is your focus, is relevant if we have good 
heuristics to suggest a fairly superior but not obviously optimal initial 
plan P.

In this latter case especially, it would pay to pre-process the 
specification of plan P into convenient supplementary tables.  In 
particular, one could prepare a list L of 5-tuples, one for each of the at 
most 3T boundaries.  Each 5-tuple would list not only the three entries in 
the boundary's triple, but also the two atoms' current district IDs.  Then, 
a small change in plan P would require but small changes in L, and an easy 
pass through L would readily sum the boundary values for different-district 
atom pairs, or, even faster, would get the complementary sum of all values 
for same-district pairs.
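
Here is one way such a list L and the pass through it might look, again as 
an untested Python sketch built on the toy data above (the function names 
are my own, not anything from your code):

# The list L of 5-tuples: each boundary triple augmented with the two
# atoms' current district IDs under plan P.
L = [(a, b, length, plan[a], plan[b]) for (a, b, length) in boundaries]

def bc_cut_length(L):
    """Total boundary length between atoms currently in different districts."""
    return sum(length for (a, b, length, da, db) in L if da != db)

def move_atom(L, plan, atom, new_district):
    """Reassign one atom to a new district; only the few entries of L
    that touch that atom actually change."""
    plan = list(plan)
    plan[atom] = new_district
    for i, (a, b, length, da, db) in enumerate(L):
        if a == atom:
            L[i] = (a, b, length, new_district, db)
        elif b == atom:
            L[i] = (a, b, length, da, new_district)
    return tuple(plan)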

But there is good news from the DC side too, if the compactness criterion be 
to minimize the average SQUARED distance between same-district atom pairs.  
This criterion amounts to minimizing average within-district atom-centroid 
squared distance, alias average district (moment of) inertia.  It turns out 
that one can readily account for small changes in districts by storing and 
updating for each district the X- and Y- components both of its centroid and 
of its inertia.  (Thanks to Pythagoras, the X- and Y- inertias sum to the 
total inertia.)  For those who want a named theorem behind the technique, 
it’s the  Parallel Axis Theorem.  Years ago I saw a programmer’s writeup of 
the one-dimensional analog: for the sake of scratch-memory conservation, he 
calculated the mean and variance of a sequence of read-in numbers, by 
updating simultaneously the running mean and running variance.
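
Below is an untested Python sketch of that running-update idea, kept per 
district and per axis, with population weights; the class and method names 
are my own invention, and the sketch assumes a district is never emptied.

class DistrictStats:
    """Running centroid and inertia (second moment about the centroid)
    of one district, tracked separately for the X and Y axes."""
    def __init__(self):
        self.weight = 0.0
        self.cx = self.cy = 0.0     # centroid components
        self.ix = self.iy = 0.0     # inertia components about the centroid

    def add_atom(self, w, x, y):
        # Weighted running-mean/variance update (the one-dimensional
        # analog mentioned above, applied to each axis).
        self.weight += w
        dx, dy = x - self.cx, y - self.cy
        self.cx += (w / self.weight) * dx
        self.cy += (w / self.weight) * dy
        self.ix += w * dx * (x - self.cx)
        self.iy += w * dy * (y - self.cy)

    def remove_atom(self, w, x, y):
        # Inverse of add_atom (assumes the district keeps some weight).
        self.weight -= w
        dx, dy = x - self.cx, y - self.cy
        self.cx -= (w / self.weight) * dx
        self.cy -= (w / self.weight) * dy
        self.ix -= w * dx * (x - self.cx)
        self.iy -= w * dy * (y - self.cy)

    def inertia(self):
        # The X- and Y- inertias sum to the total inertia (Pythagoras).
        return self.ix + self.iy

# Build the per-district stats once from the toy data above.
stats = [DistrictStats() for _ in range(S)]
for atom in range(T):
    stats[plan[atom]].add_atom(population[atom], centroid_x[atom], centroid_y[atom])

Moving an atom from one district to another then costs just one remove_atom 
and one add_atom, and summing the districts' inertias (dividing by total 
population if an average is wanted) re-scores the plan without recomputing 
anything from scratch.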

Anyhow, Ernie, I do not doubt that you have an excellent algorithm for 
updating the BC effect of small changes in the districting plan, and I thank 
you for your corrective!

Joe Weinstein

