National Center for Geographic Information and Analysis State University of New York at Buffalo
Department of Geography 105 Wilkeson Quadrangle Buffalo, NY 14261
Verbal street directions are widely used to convey navigational information. The reduction of message volumes is a relevant issue when transmitting such information electronically.
This paper introduces Huffman trees as a method to encode and decode street directions. Huffman trees are binary trees which are based on the probability distribution of a symbol set and the principle that symbols occurring more frequently will be represented by shorter codes than other, less probable ones. Due to this principle Huffman coding represents a powerful method of data compression, provided that the symbol probabilities are accurately estimated.
It is proposed that, for a given street network, the sum of all occurring street names forms a symbol set or alphabet. Each street name is a symbol that will appear with a certain probability in a directional phrase. The task of estimating symbol probabilities is approached by assuming that directional phrases will contain more names of major roadways than names of side streets, even though the overall number of side streets is far higher than the number of major roads. Since it can also be assumed that side streets will be made up of less topological sections than major roads, the number of sections per street can be used to estimate the symbol probability. If the average code length of the symbols in a directional phrase is below the entropy of the probability distribution, then a compression will occur.
In this paper the term verbal street directions is meant to capture the combination of street names, cardinal directions, and other route specifications within directional phrases. These phrases represent an important means of communicating information in support of street-based navigation. Examples are "Go south on Alpha Street until you reach Beta Street, then turn right” or "From the supermarket go south on Gamma Avenue for one mile, then turn left onto Delta Street”. Such navigational information is used in a variety of contexts, both in the private sphere, e.g. to direct guests to a party, and in the public sphere, e.g. to direct a mail courier to the recipient. They are so commonly used that some GIS packages, like ESRIís ARC/INFO, meanwhile offer the option to output routing information in verbal form. If the necessary geospatial data base is centrally processed, there arises the need to transmit such verbal route information to vehicles, outlying service stations, etc. This is where methods of message compression come into play.
Data compression has been shown to be of value both to data storage and data transmission. By reducing the size of the transmitted data one can preserve valuable bandwidth, thereby using the capacity of the communication channel more efficiently (Lelewer & Hirschberg 1987). Messages can be transferred faster, transmission costs can be lowered and more messages be fit into the existing bandwidth. In this paper one specific compression method, namely Huffman coding, and its application to the compression of verbal street directions is proposed.
Huffman coding, named after its inventor (Huffman 1952), is one of the most widely known compression schemes. For instance, the UNIX commands pack/unpack and compress/uncompress are based on it. As Lelewer & Hirschberg (1987) point out, its elegant and combinatorial nature has enabled its use in numerous areas beyond data compression. Despite the fact that other promising approaches to data compression have been proposed, Huffman coding always plays a major role when evaluating the feasibility of the new approaches (Lelewer & Hirschberg 1987, Jones 1991, Bell et al. 1989, Bell et al. 1990). Arithmetic coding, a related statistical compression scheme, has recently arisen as a challenger, but Huffman coding is far from being dead (Bookstein & Klein 1993).
Huffman's scheme is based on statistical coding which means that the probability of a symbol has a direct bearing on the length of its representation. The more probable the occurrence of a symbol is, the shorter will be its bit-size representation.
Morse code is perhaps the best known example for an implementation of that principle in the pre-computer age. In the Morse alphabet, the letter "E", the letter with the highest frequency in the English vocabulary, is represented by the shortest symbol, a single dot. Other, less frequently occurring letters, like "X", are represented by a combination of dots and dashes. One of the problems associated with this dual representation scheme is the indication of the beginning and end of a symbol within a message consisting of a number of symbols. In Morse code this is solved by inserting a pause between symbols. Huffmanís variable length storage scheme elegantly solves that problem by providing means to detect "spaces" between symbols during the decoding process. Thus, a message can be encoded in form of a continuous sequence of bits.
Huffman trees are a special form of binary trees. All that is needed to build such a tree is a list of symbols with associated frequencies, e.g. {(A,52) (B,7) (C,8) (D,8) (E,12) (F,2) (G,1) (H,1) (I,4)} or relative frequencies, e.g. {(A,0.547) (B,0.074) (C,0.084) (D,0.084) (E,0.126) (F,0.021) (G,0.011) (H,0.011) (I,0.042)}, which are used to estimate the respective probabilities. From the above list it can be seen that the symbol "A" has a relative high frequency or probability, while symbol "G" will only rarely appear within a message.
The binary tree is built from the bottom-up starting with the two least frequent symbols. Within the tree, leaf nodes hold single symbols, while branch nodes are composites holding the accumulated set of all the symbols that lie below, as well as the sum of all the respective frequencies. Each new branch node points to those two still unbound leaf or branch nodes with the smallest original or accumulated frequencies.
Figure 1 illustrates how the resulting tree would look like. Note the positions of "0" and "1” at the left or right side of a branch node. They will later form the very content of the message to be transmitted.
Starting at the treetop, we can reach the leaf containing a certain symbol by looking at the lists of symbols associated with the two nodes at the right and left side of the current branch node. If we find the symbol to be contained in one of the two nodes we take the respective turn. Working our way down the tree, we will eventually reach a leaf and can stop the search process for the current symbol. On the way down we record each turn taken in sequence, e.g. "1" for a left turn and "0" for a right turn. Thereby a sequence of bits is created to which each symbol of the message is adding to, until the end of the message is reached.
As an example, the message string "ABCDEFGHI” would be encoded, based on the tree in Figure 1, as "1 0110 0111 000 010 00110 001110 001111 0010”. Note that the spaces are here only added to improve readability, the actual bit code would be continuous.
Figure 1. Huffman tree built in accordance with given symbol probabilities.
Starting at the treetop, the decoder takes a bit-encoded message, like "0110 010", and traces the branches of the tree downward. If the bit is set to "1" it will make a right turn, while "0" will cause it to turn left. This downward movement continues until a leaf is reached. Then the actual symbol, e.g. "B" for "0110" or "E" for "010", can be extracted. Arriving at a leaf also signalizes that the bit sequence for a single symbol has ended. Thus there is no need to indicate the end of a symbol by a special character like the "space" in Morse code. The search for the next symbol will start with the next bit, again from the treetop.
The following is a typical example of a verbal directional phrase:
How can Huffman encoding help us to reduce the size of such a message during data transfer? The answer is that every street name is to be regarded as one symbol within an alphabet of street names. For experimental purposes, the author extracted all the street information from a TIGER data set covering Erie County, New York State, USA. The resulting alphabet consists of Nstreets = 6741 different symbols, i.e. street names.
Data compression consists of two stages, modeling and coding (Rissanen & Langdon 1981, Bell et al. 1989). The goal of modeling is to estimate the probabilities of symbols as accurately as possible. The modeling process has major influence on the efficiency of such statistical compression methods as Huffman coding. This can be illustrated by employing the notion of entropy (Bell et al. 1989, Press et al. 1992, Salton 1989).
If we compute entropy H as
(1)
then this is the average number of bits needed to encode a symbol. The
bound defined by H can be achieved if every symbol i could
be coded with a length . ![]()
How can one estimate the probability with which a street name will occur within a verbal directional phrase? In other words, which streets will we be more probable to be driven on? The answer to this question and consequently the quality of the probability estimates and the degree of compression very much depend on the assumptions of the employed routing procedure. The following discussion assumes a direct point-to-point routing approach, like in emergency response systems. These have a strong focus on speedy access through major highways, the names of which therefore have a higher probability of appearing in a directional phrase than the names of side streets.
One can also assume that major highways will consist of more topological sections than side streets. The higher the number of sections of a certain street, the higher the probability that we will turn onto it from any other street. Therefore, the number of topological sections of a street is an indicator for how probable the appearance of its name is in a directional phrase.
If these assumptions are met by an actual data set then there is a good chance that the average code length for symbols in a directional phrase will be below the entropy bound H and that a compression will therefore occur.
The accuracy of probabilities is not the only condition to be met. If all characters had equal probabilities there still would be no compression at all (Press et al. 1992). That is because the probability distribution of the model determines the degree of compression. In fact, every non-equiprobable symbol distribution would yield a certain amount of compression.
Table 1 shows a selection from the alphabet of street names and of the associated probabilities produced in the modeling process. The probabilities are derived as the relative frequencies of street names. For each symbol, i.e. street name, the symbol length is computed as . If we compare those individual symbol lengths with the entropy H = 11.559, it becomes apparent that many street names could indeed be transmitted in a considerably compressed form, if an appropriate coding technique is applied.
| Street Name | Sections | p | log2p | p*log2p |
| Abbington Ave | 2 | 0.00006 | -14.024 | -0.00084 |
| Abbott Grove Ave | 2 | 0.00006 | -14.024 | -0.00084 |
| ... | ... | ... | ... | ... |
| Bailey Ave | 61 | 0.00183 | -9.093 | -0.01665 |
| ... | ... | ... | ... | ... |
| Hewitt Ave | 7 | 0.00021 | -12.216 | -0.00257 |
| ... | ... | ... | ... | ... |
| Main St | 288 | 0.00865 | -6.854 | -0.05926 |
| ... | ... | ... | ... | ... |
| Millersport Hwy | 33 | 0.00099 | -9.979 | -0.00989 |
| ... | ... | ... | ... | ... |
| Zollars Ave | 3 | 0.00009 | -13.439 | -0.00121 |
| Zurbrick Road | 6 | 0.00018 | -12.439 | -0.00224 |
| Sum = 33312 | Sum = 1.000 | Sum = 11.559 |
Table 1. Computation of entropy and code lengths based on number of sections per street name.
Coding is the attempt to represent a symbol by -log2p bits. Huffman coding can only approximate this with an integer number of bits. For instance, Main Street (-log2p = 6.853 bits) would be represented by 7 bits. In case of high probabilities this can be very inappropriate (Bell et al. 1989) but in our case the probabilities are very small, ranging between 0.003% and 0.864%.
The following demonstrates the compression effects of Huffman coding when applied to the sequence of street names contained in the phrase of section 3.1. Since the entropy has been determined as H = 11.559, every street name in an uncompressed message would be represented through 12 bits.
Millersport Hwy -- Bailey Ave -- Main Street -- Hewitt Ave
12 bit + 12 bits + 12 bits + 12 bits = 48 bits
Using the symbol probabilities one could instead encode the same message through Huffmanís method as follows:
Millersport Hwy -- Bailey Ave -- Main Street -- Hewitt Ave
10 bits + 10 bits + 7 bits + 13 bits = 40 bits.
It would therefore be possible to compress the message from 48 bits to 40 bits, for savings of 17%.
In order to provide a feasible compression solution there is a need to encode a number of additional specifications. Examples are "north”, "south”, "west”, "east”, "left”, "right”. Since these will appear very often in actual messages, they should be inserted at the very top of the Huffman tree where they will not occupy more than a few bits. Based on the tree illustrated in Figure 2, the phrase "turn left on Main Street” would then be encoded as "100 0001001”. Note that there have been no probabilities assigned to those turn directions.
The solution proposed in this paper assumes that the dictionary, i.e. the Huffman tree, has been agreed upon by the sender and receiver and is stored at both sides of the message transfer. The transmission of the actual tree appears unfeasible considering the sheer size of a tree that holds thousands of symbols.
This is an important difference from other approaches that either send the message and the dictionary at the same time or create a dictionary on-the-fly, based on the incoming symbols and their actual frequencies.
Figure 2. Street names and directional information in an Huffman-encoded
tree.
Static Huffman encoding has certain encryptive characteristics. These could be further improved by storing more than one Huffman tree on the sending and receiving sides and randomly switching between those alternative trees by means of a prefix to each message. Due to the nature of the model, changes in the symbol probabilities can lead to changes in the resulting tree and therefore to different bit sequences. Small arbitrary changes to the probabilities would translate to arbitrarily refined Huffman trees, without too much cost in terms of transfer efficiency.
Lelewer & Hirschberg (1987) point out that there are cases when there is no single optimal way of building the Huffman tree because the choice of minimum probabilities is somewhat arbitrary. This is especially true in our example where there are a large number of streets with equal probability. This opens the possibility to modify the tree structure through arbitrary starting points in the algorithm (refer to section 2.2.). Such modifications would than again be stored simultaneously.
The ideas presented in this paper suggest that the use of Huffman coding is a worthwhile option to pursue in order to reduce the size of verbal directional phrases during message transfer.
One should not forget, however, that storage and bandwidth optimization have to be in balance with the computational requirements. Compression algorithms are worth little if they can only be afforded by enormous computational power. An investigation of that issue, following a physical implementation, remains to be done.
Among the issues to be addressed in future work are the practical investigation of some other compression techniques, like arithmetic coding (Witten 1987) or adaptive Huffman coding and a comparison of these techniques with the static Huffman coding method that was presented here.
One relevant issue is the consideration of text position as a contextual variable. Kondragunla & Lelewer (1992) investigated the position of words as a contextual element. They concluded that "while word position is of little value as a primary contextual element, it has merit as a secondary element of context information” (Lelewer, p. 443). The implementation of coding schemes combining word position with Huffman trees seems promising (Basu 1991). With regard to street navigation, "word position” refers to the order of street names in a directional phrase. Factors like that are taken into account by context modeling techniques (Bell et al. 1989). Consider that there is an overwhelming probability that the letter "q” will be followed by the letter "u” in any word of the English language. Similarly, if a vehicle is traveling on a certain street, than the following directional phrase has to contain a street with which it is topologically connected.
Thus, it would be advantageous if the actual geographical situation played a stronger role in the implementation. This means that the original section-based probabilities are weighted according to such factors as connectedness and proximity. For instance, those streets that provide a vehicle with the fastest access from its stationary position to a major traffic artery should have a much higher probability than indicated by the number of its sections. Examples for stationary positions are the parking lot of a rental agency or the car park of an emergency response center. Such geographical tuning of the modeling process would lead to more accurate probabilities and therefore to higher rates of compression.
Basu, D. (1991) Text compression using several Huffman trees. Proc. Data Compression Conf., Snowbird, Utah. p. 452.
Bell, T., Witten, I., Cleary, J. G. (1989) Modeling for text compression. ACM Computing Surveys. 21 (4) pp. 557-591.
Bell, T., Witten, I., Cleary, J. (1990) Text compression. Prentice Hall. Englewood Cliffs, NJ.
Bookstein, A., Klein S.T. (1993) Is Huffman coding dead? Proc. Data Compression Conf., Snowbird, Utah. p.464.
Huffman, D.A. (1952) A method for the construction of minimum redundancy codes. Proc. IRE. 40 (9) pp. 1098-1101.
Jones, S. (1991) Text and context - Document storage and processing. Springer-Verlag. New York.
Kondragunla, U., and Lelewer, D. (1992) Word position as context. Proc. Data Compression Conf, Snowbird, Utah. p. 443.
Lelewer, D., and Hirschberg, D. (1987) Data Compression. ACM Computing Surveys, vol. 19, no. 3, September, pp. 261-296.
Press, W. et.al. (1992) Numerical Recipes in C - The Art of Scientific Computing. Cambridge University Press.
Salton, G. (1989) Automatic text processing - the transformation, analysis and retrieval of information by computer. Addison-Wesley Publishing Company.
Witten, I. H., Neal, R., and Cleary, J.G. (1987) Arithmetic Coding for Data Compression. Commun. ACM 30 (6), pp. 520-540.
© 1996 Institut für Geographie der Universität Salzburg