Week 4 - Tries & Substring Search
- Store characters in nodes (not keys).
- Each node has R children, one for each possible character.
int R = 256; // extended ASCII
Ternary search tries
- Store characters and values in nodes (not keys).
- Each node has 3 children: smaller (left), equal (middle), larger (right).
TST with R^2 branching at root
- Hybrid of R-way trie and TST.
- Do R^2 -way branching at root.
- Each of R^2 root nodes points to a TST.
- Find all keys in a symbol table starting with a given prefix.
- Find longest key in symbol table that is a prefix of query string.
Brute-force substring search
- Check for pattern starting at each text position.
- Brute-force algorithm needs backup for every mismatch.
Deterministic ﬁnite state automaton (DFA)
- DFA is abstract string-searching machine.
- Finite number of states (including start and halt).
- Exactly one transition for each char in alphabet.
- Accept if sequence of transitions leads to halt state.
- State = number of characters in pattern that have been matched.
How to build DFA from pattern?
- Scan characters in pattern from right to left.
- Can skip as many as M text chars when finding one not in the pattern.
- How much to skip?
- if the pattern doesn’t have repeat letter, we can only check the rightmost letter, and perform full check when we hit the rightmost match.
- But if the pattern has repeat letter, like the example: NEEDLE. We will need to move the pointer carefully:
- The ultimate solution:
- every time we occurred an
E mismatch, backup 5 positions.
- Java implementation
Basic idea = modular hashing
- Compute a hash of pattern characters
M - 1.
- For each
i, compute a hash of text characters
M + i - 1.
- If pattern hash = text substring hash, check for a match.
Modular hash function
- Using the notation
txt.charAt(i), we wish to compute
- Horner’s method. Linear-time method to evaluate degree-M polynomial.
- Efficiently computing the hash function
- Can update hash function in constant time!
How to handle hash collision?
- Monte Carlo version. Return match if hash match.
- It is possible return the wrong answer.
- Las Vegas version. Check for substring match if hash match; continue search if false collision.
- Confidence about the answer, but also cost more time to verify.
- Cost of searching for an M-character pattern in an N-character text.