Symbolic Data

So far our compound data has been made up of numbers.
Now, we will start working with arbitrary symbols.

Quotation#

Lists with symbols look just like Lisp code, except we quote them.
To quote in Lisp, we use the special form #!scheme (quote exp, or the short hand #!scheme 'exp.
For example, #!scheme 'x evaluates to the symbol “x” instead of the variable #!scheme x’s value.
This is just like natural language. If I say, “Say your name,” you will say your name. If I instead say, “Say ‘your name’”, you will literally say the words “your name.”

(define a 1)
(define b 2)

(list a b)
→ (1 2)

(list 'a 'b)
→ (a b)

(list 'a b)
→ (a 2)

Now that we have quotation, we will use #!scheme '() for the empty list instead of #!scheme nil.
We need another primitive now: #!scheme eq?. This checks if two symbols are the same.

(eq? 'a 'b)
→ #f

(eq? 'a 'a)
→ #t

Example: Symbolic Differentiation#

Lets design a procedure that computes the derivative of an algebraic expression.
This will demonstrate both symbol manipulation and data abstraction.

Symbolic differentiation is of special historical significance in Lisp. It was one of the motivating examples behind the development of a computer language for symbol manipulation.

The differentiation program with abstract data#

To start, we will only consider addition and multiplication.
We need three differentiation rules: constant, sum, and product.
Let’s assume we already have these procedures

Procedure	Result
`(variable? e)`	Is `e` a variable?
`(same-variable? v1 v2)`	Are `v1` and `v2` the same variable?
`(sum? e)`	Is `e` a sum?
`(addend e)`	Addend of sum `e`
`(augend e)`	Augend of sum `e`
`(make-sum a1 a2)`	Construct the sum of `a1` and `a2`
`(product? e)`	Is `e` a product?
`(multiplier e)`	Multiplier of the product `e`
`(multiplicand e)`	Multiplicand of the product `e`
`(make-product m1 m2)`	Construct the product of `m1` and `m2`

Then, we can express the differentiation rules like so:

(define (deriv exp var)
  (cond ((number? exp) 0)
        ((variable? exp)
         (if (same-variable? exp var) 1 0))
        ((sum? exp)
         (make-sum (deriv (addend exp) var)
                   (deriv (augend exp) var)))
        ((product? exp)
         (make-sum
          (make-product (multiplier exp)
                        (deriv (multiplicand exp) var))
          (make-product (deriv (multiplier exp) var)
                        (multiplicand exp))))
        (else
         (error "unknown expression type" exp))))

Representing algebraic expressions#

There are many ways we could represent algebraic expressions.
The most straightforward way is parenthesized Polish notation: Lisp syntax.
We can simplify answers in the constructor just like in the rational number example.
Simplifying the same way a human would is hard, partly because the most “simplified” form is sometimes subjective.

Exercises:

Example: Representing Sets#

There are a number of possible ways we could represent sets. A set is a collection of distinct objects. Our sets need to work with the following operations:

Procedure	Result
`element-of-set? x s`	Is x a member of the set s?
`adjoin-set x s`	Set containing x and the elements of set s
`union-set s t`	Set of elements that occur in either s or t
`intersection-set s t`	Set of elements that occur in both s and t

Sets as unordered lists#

One method is to just use lists. Each time we add a new element, we have to check to make sure it isn’t already in the set.
This isn’t very efficient. #!scheme element-of-set? and #!scheme adjoin-set are #\Theta(n)#, while #!scheme union-set and #!scheme intersection-set are $\Theta(n^2)$.
We can make #!scheme adjoin-set $\Theta(1)$ if we allow duplicates, but then the list can grow rapidly, which may cause an overall decrease in performance.

Exercises:

Sets as order lists#

A more efficient representation is a sorted list.
To keep things simple, we will only consider sets of numbers.
Now, scanning the entire list in #!scheme element-of-set? is a worst-case scenario. On average we should expect to scan half of the list.
This is still $\Theta(n)$, but now we can make #!scheme union-set and #!scheme intersection-set $\Theta(n)$ too.

Exercises:

Sets as binary trees#

An even more efficient representation is a binary tree where left branches contain smaller numbers and right branches contain larger numbers.
The same set cna be represented by such a tree in many ways.
If the tree is balanced, each subtree is about half the size of the original tree. This allows us to implement #!scheme element-of-set? in $\Theta(\log(n))$ time.
We’ll represent each node by #!scheme (list number left-subtree right-subtree).
Efficiency hinges on the tree being balanced. We can write a procedure to balance trees, or we could use a difference data structure (B-trees or red—black trees).

Exercises:

Sets and information retrieval#

The techniques discussed for sets show up again and again in information retrieval.
In a data management system, each record is identified by a unique key.
The simplest, least efficient method is to use an udneroder list with $\Theta(n)$ access.
For fast “random access,” trees are usually used.
Using data abstraction, we can start with unordered lists and then later change the constructors and selectors to use a tree representation.

Exercises:

Example: Huffman Encoding Trees#

A code is a system for converting information (such as text) into another form.
ASCII is a fixed-length code: each symbol uses the same number of bits.
Morse code is variable-length: since “e” is the most common letter, it uses a single dot.
The problem with variable-length codes is that you must have a way of knowing when you’ve reach the end of a symbol. Morse code uses a temporal pause.
Prefix codes, like Huffman encoding, ensure that no code for a symbol is a prefix of the code for another symbol this eliminates any ambiguity.
A Huffman code can be represented as a binary tree where each leaf contains a symbol and its weight (relative frequency), and each internal node stores the set of symbols and the sum of weights below it. Zero means “go left”, one means “go right.”

Generating Huffman trees#

Huffman gave an algorithm for constructing the best code for a given set of symbols and their relative frequencies:

Begin with all symbols and their weights as isolated leaves.
Form an internal node branching off to the two least frequent symbols.
Repeat until all nodes are connected.

Representing Huffman trees#

Leaves are represented by #!scheme (list 'leaf symbol weight).
Internal nodes are represented by #!scheme (list left right symbols weight), where #!scheme left and #!scheme right are subtrees, #!scheme symbols is a list of the symbols underneath the node, and #!scheme weight is the sum of the weights of all the leaves beneath the node.

The decoding procedure#

The following procedure decodes a list of bits using a Huffman tree:

(define (decode bits tree)
  (define (decode-1 bits current-branch)
    (if (null? bits)
        '()
        (let ((next-branch
               (choose-branch (car bits) current-branch)))
          (if (leaf? next-branch)
              (cons (symbol-leaf next-branch)
                    (decode-1 (cdr bits) tree))
              (decode-1 (cdr bits) next-branch)))))
  (decode-1 bits tree))

(define (choose-branch bit branch)
  (cond ((= bit 0) (left-branch branch))
        ((= bit 1) (right-branch branch))
        (else (error "bad bit" bit))))

Sets of weighted elements#

Since the tree-generating algorithm requires repeatedly finding the smallest item in a set, whe should represent a set of leaves and trees as a list ordered by weight.
The implementation of #!scheme adjoin-set is similar to [Exercise 2.61] except that we compare the items by weight, and the element being added is never already in the set:

(define (adjoin-set x set)
  (cond ((null? set) (list x))
        ((< (weight x) (weight (car set))) (cons x set))
        (else (cons (car set)
                    (adjoin-set x (cdr set))))))