# Week 4 - Tries & Substring Search

## 1. Tries

• Store characters in nodes (not keys).
• Each node has R children, one for each possible character.
• int R = 256; // extended ASCII

### 1.1. Ternary search tries

• Store characters and values in nodes (not keys).
• Each node has 3 children: smaller (left), equal (middle), larger (right).

#### TST with R^2 branching at root

• Hybrid of R-way trie and TST.
• Do R^2 -way branching at root.
• Each of R^2 root nodes points to a TST.

### 1.2. Character-based Operations

• Prefix match.

• Find all keys in a symbol table starting with a given prefix.
• Longest prefix.

• Find longest key in symbol table that is a prefix of query string.
• Wildcard match.

• // TODO
• Check for pattern starting at each text position.
• Brute-force algorithm needs backup for every mismatch.

### 2.2. Knuth-Morris-Pratt (KMP)

#### Deterministic ﬁnite state automaton (DFA)

• DFA is abstract string-searching machine.
• Finite number of states (including start and halt).
• Exactly one transition for each char in alphabet.
• Accept if sequence of transitions leads to halt state.
• State = number of characters in pattern that have been matched.

#### How to build DFA from pattern?

• Every mismatch, is to re-simulate the the pattern without first character, and end with the new mismatch character. So we can simulate the process, and copy the result to the new column.
• Java implementation
• copy mismatch:

### 2.3. Boyer-Moore

• Intuition.
• Scan characters in pattern from right to left.
• Can skip as many as M text chars when finding one not in the pattern.
• How much to skip?
• if the pattern doesn't have repeat letter, we can only check the rightmost letter, and perform full check when we hit the rightmost match.
• But if the pattern has repeat letter, like the example: NEEDLE. We will need to move the pointer carefully:
• The ultimate solution:
• every time we occurred an E mismatch, backup 5 positions.
• Java implementation

### 2.4. Rabin-Karp

#### Basic idea = modular hashing

• Compute a hash of pattern characters 0 to M - 1.
• For each i, compute a hash of text characters i to M + i - 1.
• If pattern hash = text substring hash, check for a match.

#### Modular hash function

• Using the notation t_i for txt.charAt(i), we wish to compute
• Horner's method. Linear-time method to evaluate degree-M polynomial.
• Efficiently computing the hash function
• Can update hash function in constant time!

#### How to handle hash collision?

1. Monte Carlo version. Return match if hash match.
• It is possible return the wrong answer.
2. Las Vegas version. Check for substring match if hash match; continue search if false collision.
• Confidence about the answer, but also cost more time to verify.

### 2.5. Summary

• Cost of searching for an M-character pattern in an N-character text.