Week 4  Tries & Substring Search
Tries
 Store characters in nodes (not keys).
 Each node has R children, one for each possible character.
int R = 256; // extended ASCII

Ternary search tries
 Store characters and values in nodes (not keys).
 Each node has 3 children: smaller (left), equal (middle), larger (right).

TST with R^2 branching at root
 Hybrid of Rway trie and TST.
 Do R^2 way branching at root.
 Each of R^2 root nodes points to a TST.

Characterbased Operations

Prefix match.
 Find all keys in a symbol table starting with a given prefix.


Longest prefix.
 Find longest key in symbol table that is a prefix of query string.


Wildcard match.
Substring Search
Bruteforce substring search
 Check for pattern starting at each text position.

 Bruteforce algorithm needs backup for every mismatch.
KnuthMorrisPratt (KMP)
Deterministic ﬁnite state automaton (DFA)
 DFA is abstract stringsearching machine.
 Finite number of states (including start and halt).
 Exactly one transition for each char in alphabet.
 Accept if sequence of transitions leads to halt state.

 State = number of characters in pattern that have been matched.

How to build DFA from pattern?
BoyerMoore
 Intuition.
 Scan characters in pattern from right to left.
 Can skip as many as M text chars when finding one not in the pattern.

 How much to skip?
 if the pattern doesn’t have repeat letter, we can only check the rightmost letter, and perform full check when we hit the rightmost match.
 But if the pattern has repeat letter, like the example: NEEDLE. We will need to move the pointer carefully:

 The ultimate solution:

 every time we occurred an
E
mismatch, backup 5 positions.
 Java implementation
RabinKarp
Basic idea = modular hashing
 Compute a hash of pattern characters
0
to M  1
.
 For each
i
, compute a hash of text characters i
to M + i  1
.
 If pattern hash = text substring hash, check for a match.

Modular hash function
 Using the notation
t_i
for txt.charAt(i)
, we wish to compute
 Horner’s method. Lineartime method to evaluate degreeM polynomial.
 Efficiently computing the hash function
 Can update hash function in constant time!

How to handle hash collision?
 Monte Carlo version. Return match if hash match.
 It is possible return the wrong answer.
 Las Vegas version. Check for substring match if hash match; continue search if false collision.
 Confidence about the answer, but also cost more time to verify.
Summary
 Cost of searching for an Mcharacter pattern in an Ncharacter text.
