Week 4 - Tries & Substring Search

1. Tries

  • Store characters in nodes (not keys).
  • Each node has R children, one for each possible character.
    • int R = 256; // extended ASCII

1.1. Ternary search tries

  • Store characters and values in nodes (not keys).
  • Each node has 3 children: smaller (left), equal (middle), larger (right).

TST with R^2 branching at root

  • Hybrid of R-way trie and TST.
    • Do R^2 -way branching at root.
    • Each of R^2 root nodes points to a TST.

1.2. Character-based Operations

  • Prefix match.

    • Find all keys in a symbol table starting with a given prefix.
  • Longest prefix.

    • Find longest key in symbol table that is a prefix of query string.
  • Wildcard match.

    • // TODO
  • Check for pattern starting at each text position.
  • Brute-force algorithm needs backup for every mismatch.

2.2. Knuth-Morris-Pratt (KMP)

Deterministic finite state automaton (DFA)

  • DFA is abstract string-searching machine.
    • Finite number of states (including start and halt).
    • Exactly one transition for each char in alphabet.
    • Accept if sequence of transitions leads to halt state.
  • State = number of characters in pattern that have been matched.

How to build DFA from pattern?

    • Every mismatch, is to re-simulate the the pattern without first character, and end with the new mismatch character. So we can simulate the process, and copy the result to the new column.
  • Java implementation
      • copy mismatch:

2.3. Boyer-Moore

  • Intuition.
    • Scan characters in pattern from right to left.
    • Can skip as many as M text chars when finding one not in the pattern.
  • How much to skip?
    • if the pattern doesn't have repeat letter, we can only check the rightmost letter, and perform full check when we hit the rightmost match.
    • But if the pattern has repeat letter, like the example: NEEDLE. We will need to move the pointer carefully:
      • The ultimate solution:
        • every time we occurred an E mismatch, backup 5 positions.
  • Java implementation

2.4. Rabin-Karp

Basic idea = modular hashing

  • Compute a hash of pattern characters 0 to M - 1.
  • For each i, compute a hash of text characters i to M + i - 1.
  • If pattern hash = text substring hash, check for a match.

Modular hash function

  • Using the notation t_i for txt.charAt(i), we wish to compute
  • Horner's method. Linear-time method to evaluate degree-M polynomial.
  • Efficiently computing the hash function
    • Can update hash function in constant time!

How to handle hash collision?

  1. Monte Carlo version. Return match if hash match.
    • It is possible return the wrong answer.
  2. Las Vegas version. Check for substring match if hash match; continue search if false collision.
    • Confidence about the answer, but also cost more time to verify.

2.5. Summary

  • Cost of searching for an M-character pattern in an N-character text.

results matching ""

    No results matching ""