by Edsger W. Dijkstra and C.S. Scholten
Renewed interest in algorithms for sorting in situ raised the question of the average distance from node to root in binary trees other than balanced ones. (Here binary trees are to be understood as rooted trees in which nodes have zero, one, or two sons.)
In particular we consider the infinite sequence of trees Ti (i ≥ 0), in which for some fixed p and q (p > q ≥ 0)
With Hi = the number of nodes in Ti, we have
Hn+p = Hn+q + Hn + 1 ;
with Gi = the sum of the distances of te nodes of Ti from its root we have
Gn+p = Gn+q + Gn + Hn+q + Hn .
We are interested in the asymptotic behaviour of Gi / Hi for large i, this ratio being the average distance from the root in Ti.
Without loss of generality we can confine ourselves to the case
With Ai defined by Ai = Hi + 1 and Bi defined by Bi = Gi – 2, we derive
Equation (1) is not homogeneous in the B's, but by solving it for the A's and substituting them in (0), we get a homogeneous recurrence relation for the B's, the characteristic polynomial of which is the product of (0)'s characteristic polynomial and the characteristic polynomial corresponding to the homogeneous part of (1). We conclude that the characteristic equation for the Ai is
and that that for the Bi is
Under the constraints gcd(p,q) = 1 and p > q ≥ 0, (2) enjoys the property of having one positive root, r say, such that r > 1 and all other roots of (2) have a modulus smaller than r. (For a proof of this theorem, see later.)
From this and the theory of linear recurrence relations we conclude
1) | that the leading term of Ai is of the form k · ri for some constant k |
2) | that the leading term of Bi is of the form ( l + L·i ) · ri for some constants l and L. |
( l + L·(n+p) ) · rn+p =
( l + L·(n+q) ) · rn+q + ( l + L·n ) · rn + k · rn+p
Since r is a root of (2) this can be reduced to
We define the skewness of a binary tree as the ratio (≥ 1) of the numbers of nodes in its two subtrees. For the trees Ti it follows from the leading term of Ai that the asymptotic skewness s is given by s = rq, whence q = rlog s. Remembering that r satisfies (2), we find rp = s+1, whence p = rlog (s+1). Hence (4) can be rewritten as
s+1
(s+1) · rlog (s+1) – s·rlog s
Because r > 1 —so that the asymptotic value of Gi / Hi equals that of Bi / Ai—, we conclude that, expressed in r and s, the average distance from the root in Ti is for large i
(s+1) · i
(s+1) · rlog (s+1) – s · rlog s
Consequently, the average distance from the root in a tree from the sequence Ti with N nodes is
(s+1) · log N
(s+1) · log (s+1) – s · log s
i.e. an expression in s and N only.
* * *
We are left with the obligation to prove that f x = 0 with f x = xp – xq – 1 has one positive root r > 1 dominating the others. Since f 0 = –1 and f(+∞) = +∞, f x = 0 has an odd number of positive roots. Because f' x = xq–1 · ( p·xp–q – q ) we conclude that f' x = 0 has at most 1 positive root; hence f x = 0 has 1 positive root r and, because f 1 = –1, we conclude r > 1. In other words
In order to prove dominance of r, we consider a root m·ei·φ of (2), with m > 0. Consequently
mp · ei·p·φ = mq · ei·q·φ + 1 ,
from which we derive —by taking absolute values—
mp < mq + 1 ∨ (p·φ) mod 2π = (q·φ) mod 2π = 0 .
Since gcd(p,q) = 1, this can be rewritten as mp – mq – 1 < 0 ∨ φ = 0 or, in view of (6): m<r ∨ φ = 0. q.e.d.
* * *
Finally we remark that thanks to
p
q= log (s+1)
log s
any value of s can be approximated arbitrarily closely by suitable choice of p and q, a fact that enhances the significance of the expression in s and N for the average distance from the root.
28 August 1981
drs. C. S. Scholten Philips Research Laboratories 5600 MD EINDHOVEN The Netherlands |
prof. dr. E. W. Dijkstra
Burroughs Research Fellow Plataanstraat 5 5671 AL NUENEN The Netherlands |