
Trie (Dictionary Tree) (1)


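  A trie, also called a dictionary tree, word lookup tree, or prefix tree, is a multiway tree structure designed for fast retrieval.
  Unlike a binary search tree, a trie does not store keys directly in its nodes; a key is determined by the node's position in the tree.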

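All descendants of a node share a common prefix, namely the string associated with that node, and the root corresponds to the empty string. In general, not every node has an associated value: only leaf nodes and some internal nodes correspond to keys that carry values.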
  A trie, pronounced “try”, is a tree that exploits some structure in the keys
  - e.g. if the keys are strings, a binary search tree would compare the entire strings, but a trie would look at their individual characters
  - A suffix trie is a data structure that stores all suffixes of a string so that many kinds of queries can be answered quickly.
  - Suffix trees, the space-efficient compressed form of suffix tries, are hugely important for searching large sequences.
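  A trie is a tree structure and a variant of the hash tree. It is typically used to count, sort, and store large numbers of strings (though it is not limited to strings), which is why search engine systems often use it for word-frequency statistics over text.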
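
  The counting and sorting use mentioned above can be sketched as follows. This is only an illustrative sketch, not the implementation used later in this article: the names FreqNode, addOccurrence, collectCounts, and wordFrequencies are hypothetical, and children are kept in a std::map (an assumption made here so that a depth-first walk visits them in character order).

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for this sketch. Children live in a std::map so a
// traversal visits them in character order.
struct FreqNode {
    int count = 0;  // number of times the word ending at this node was added
    std::map<char, std::unique_ptr<FreqNode>> children;
};

// Record one occurrence of `word`.
void addOccurrence(FreqNode& root, const std::string& word) {
    FreqNode* node = &root;
    for (char ch : word) {
        std::unique_ptr<FreqNode>& child = node->children[ch];
        if (!child) child = std::make_unique<FreqNode>();
        node = child.get();
    }
    ++node->count;
}

// Depth-first walk that appends (word, frequency) pairs in sorted order.
void collectCounts(const FreqNode& node, std::string& prefix,
                   std::vector<std::pair<std::string, int>>& out) {
    if (node.count > 0) out.emplace_back(prefix, node.count);
    for (const auto& entry : node.children) {
        assert(entry.second != nullptr);  // addOccurrence never stores a null child
        prefix.push_back(entry.first);
        collectCounts(*entry.second, prefix, out);
        prefix.pop_back();
    }
}

// Count every word and return the distinct words in sorted order.
std::vector<std::pair<std::string, int>> wordFrequencies(
        const std::vector<std::string>& words) {
    FreqNode root;
    for (const std::string& w : words) addOccurrence(root, w);
    std::vector<std::pair<std::string, int>> out;
    std::string prefix;
    collectCounts(root, prefix, out);
    return out;
}
```

  Calling wordFrequencies on the words tea, ten, tea, a, to yields the distinct words in sorted order together with their counts, so a single structure handles both the counting and the sorting.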


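  A typical application is the list of suggestions shown while you type a search query: typing “花千” brings up suggestions such as “花千骨電視劇” (the 花千骨 TV series) and “花千骨小說” (the 花千骨 novel).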
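
  A search-suggestion lookup of this kind can be sketched as: walk down the trie along the typed text, then list the stored phrases below that node. The sketch is a hedged illustration only; SuggestNode, addPhrase, gather, and suggest are hypothetical names, and the limit parameter is an assumption added here to cap the number of suggestions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node type for this sketch.
struct SuggestNode {
    bool isEnd = false;  // true if a stored phrase ends at this node
    std::map<char, std::unique_ptr<SuggestNode>> children;
};

// Store one phrase in the trie.
void addPhrase(SuggestNode& root, const std::string& phrase) {
    SuggestNode* node = &root;
    for (char ch : phrase) {
        std::unique_ptr<SuggestNode>& child = node->children[ch];
        if (!child) child = std::make_unique<SuggestNode>();
        node = child.get();
    }
    node->isEnd = true;
}

// Depth-first walk below `node`, collecting at most `limit` phrases.
void gather(const SuggestNode& node, std::string& text, std::size_t limit,
            std::vector<std::string>& out) {
    if (out.size() >= limit) return;
    if (node.isEnd) out.push_back(text);
    for (const auto& entry : node.children) {
        if (out.size() >= limit) return;
        assert(entry.second != nullptr);  // addPhrase never stores a null child
        text.push_back(entry.first);
        gather(*entry.second, text, limit, out);
        text.pop_back();
    }
}

// Return up to `limit` stored phrases that start with `typed`.
std::vector<std::string> suggest(const SuggestNode& root,
                                 const std::string& typed, std::size_t limit) {
    const SuggestNode* node = &root;
    for (char ch : typed) {
        auto it = node->children.find(ch);
        if (it == node->children.end()) return {};  // nothing starts with `typed`
        node = it->second.get();
    }
    std::vector<std::string> out;
    std::string text = typed;
    gather(*node, text, limit, out);
    return out;
}
```

  Because the sketch works byte by byte, it also accepts UTF-8 text such as the example above. A real suggestion service would additionally rank the candidates, which is outside the scope of this sketch.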


  Let word be a single string and let dictionary be a large set of words. If we have a dictionary and need to know whether a single word is in it, a trie is a data structure that can help. But you may be asking yourself, “Why use tries if sets and hash tables can do the same?” There are two main reasons:
  1) A trie can insert and find strings in O(L) time (where L is the length of a single word). This is much faster than a set, but only a bit faster than a hash table, which must also read all L characters to compute the hash.
  2) A set or a hash table can only find words in the dictionary that match the query word exactly; a trie also lets us find words that differ by a single character, share a common prefix, are missing a character, and so on.
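
  To illustrate the second point, here is a minimal sketch of a lookup that tolerates one differing character. It assumes lowercase keys only; FuzzyNode, addKey, matchFrom, and containsWithOneMismatch are hypothetical names introduced for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical node type for this sketch.
struct FuzzyNode {
    bool isEnd = false;                               // a key ends at this node
    std::array<std::unique_ptr<FuzzyNode>, 26> next;  // one slot per letter 'a'..'z'
};

// Store one lowercase key in the trie.
void addKey(FuzzyNode& root, const std::string& word) {
    FuzzyNode* node = &root;
    for (char ch : word) {
        assert(ch >= 'a' && ch <= 'z');  // this sketch handles lowercase letters only
        std::unique_ptr<FuzzyNode>& child = node->next[ch - 'a'];
        if (!child) child = std::make_unique<FuzzyNode>();
        node = child.get();
    }
    node->isEnd = true;
}

// Try every child at each step; following a child whose letter differs from
// word[pos] costs one unit of `budget`.
bool matchFrom(const FuzzyNode& node, const std::string& word,
               std::size_t pos, int budget) {
    if (pos == word.size()) return node.isEnd;
    for (int k = 0; k < 26; ++k) {
        if (!node.next[k]) continue;
        int cost = (word[pos] == static_cast<char>('a' + k)) ? 0 : 1;
        if (cost <= budget &&
            matchFrom(*node.next[k], word, pos + 1, budget - cost))
            return true;
    }
    return false;
}

// True if some stored key has the same length as `word` and differs from it
// in at most one position.
bool containsWithOneMismatch(const FuzzyNode& root, const std::string& word) {
    return matchFrom(root, word, 0, 1);
}
```

  With the keys cat, cart, and dog stored, the query cot matches (one character differs from cat), while cup does not. A hash table would need one probe per candidate spelling to answer the same question.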
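  The basic properties of a trie can be summarized as follows:
  1) The root contains no character; every other node contains exactly one character.
  2) Concatenating the characters along the path from the root to a node gives the string that corresponds to that node.
  3) The children of any one node all contain different characters.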

Basic implementation of a trie
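  Insert, Delete, and Find on a trie are all easy to implement with a single loop: the i-th iteration locates the subtree for the first i letters and then performs the corresponding operation. To implement this letter tree, the most common choice is to store the nodes in an array (static allocation); pointer-based nodes (dynamic allocation) work as well. As for how a node refers to its children, there are generally three approaches:
  1) Give every node an array the size of the alphabet. The index is the letter the child represents, and the entry is that child's position (its label) in the big node array.
  2) Attach a linked list to every node that records its children in some fixed order.
  3) Represent the tree with the left-child right-sibling representation.
  Each of the three approaches has its own characteristics.

The first is easy to implement but needs the most space. The second is fairly easy to implement and needs less space, but is slower. The third needs the least space, but is also slower and harder to write.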
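
  For comparison, the third approach (left-child right-sibling) might look like the sketch below. It is an illustration under assumptions, not code from this article: LcrsNode and LcrsTrie are hypothetical names, and nodes are kept in a std::deque so that pointers to them stay valid as the pool grows.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical node type: each node stores only two links, whatever the
// alphabet size is.
struct LcrsNode {
    char ch;                // character on the edge into this node
    bool isEnd;             // a word ends at this node
    LcrsNode* firstChild;   // leftmost child
    LcrsNode* nextSibling;  // next child of the same parent
};

class LcrsTrie {
public:
    LcrsTrie() { root_ = newNode('\0'); }
    LcrsTrie(const LcrsTrie&) = delete;             // nodes point into pool_,
    LcrsTrie& operator=(const LcrsTrie&) = delete;  // so copying is disabled

    void insert(const std::string& word) {
        LcrsNode* node = root_;
        for (char ch : word) {
            LcrsNode* child = findChild(node, ch);
            if (!child) {
                child = newNode(ch);
                child->nextSibling = node->firstChild;  // link at the front
                node->firstChild = child;
            }
            node = child;
        }
        node->isEnd = true;
    }

    bool contains(const std::string& word) const {
        const LcrsNode* node = root_;
        for (char ch : word) {
            node = findChild(node, ch);
            if (!node) return false;
        }
        return node->isEnd;
    }

private:
    // Linear scan of the sibling list: this is the extra time the text mentions.
    static LcrsNode* findChild(const LcrsNode* parent, char ch) {
        assert(parent != nullptr);
        for (LcrsNode* c = parent->firstChild; c; c = c->nextSibling)
            if (c->ch == ch) return c;
        return nullptr;
    }

    LcrsNode* newNode(char ch) {
        pool_.push_back(LcrsNode{ch, false, nullptr, nullptr});
        return &pool_.back();  // std::deque::push_back keeps earlier elements in place
    }

    std::deque<LcrsNode> pool_;
    LcrsNode* root_;
};
```

  Each node carries two pointers instead of 26, which is where the space saving comes from; the price is the sibling scan in findChild on every step.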


  The code here uses the first approach:

#include <stdio.h>
#include <iostream>
using namespace std;
#define  MAX    26

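typedef struct TrieNode
{
    bool isEnd;                 // true if an inserted word ends at this node
    int nCount;                 // number of inserted words that have this node's prefix
    struct TrieNode *next[MAX]; // child nodes, indexed by letter ('a' -> 0 ... 'z' -> 25)
} TrieNode;

TrieNode Memory[1000000]; // preallocate the nodes up front, since malloc is comparatively slow
int allocp = 0;           // index of the next unused node in Memory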

// Initialize a node: nCount starts at 1 and every next pointer is NULL
TrieNode * createTrieNode()
{
    TrieNode * tmp = &Memory[allocp++];
    tmp->isEnd = false;
    tmp->nCount = 1;
    for (int i = 0; i < MAX; i++)
        tmp->next[i] = NULL;
    return tmp;
}

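void insertTrie(TrieNode * root, const char * str)
{
    TrieNode * tmp = root;
    int i = 0, k;
    // insert the characters one at a time (str must contain only 'a'..'z')
    while (str[i])
    {
        k = str[i] - 'a'; // child slot for the current character
        if (tmp->next[k])
        {
            tmp->next[k]->nCount++; // one more word passes through this node
        }
        else
        {
            tmp->next[k] = createTrieNode(); // a new node starts with nCount = 1
        }

        tmp = tmp->next[k];
        i++; // move on to the next character
    }
    tmp->isEnd = true;
}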

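int searchTrie(TrieNode * root, const char * str)
{
    if (root == NULL)
        return 0;
    TrieNode * tmp = root;
    int i = 0, k;
    while (str[i])
    {
        k = str[i] - 'a';
        if (k < 0 || k >= MAX || tmp->next[k] == NULL)
            return 0; // no inserted word starts with str
        tmp = tmp->next[k];
        i++;
    }
    // nCount of the node for the last character:
    // the number of inserted words that start with str
    return tmp->nCount;
}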

/*  Delete removes one occurrence of a key. The possible cases are:
The key is not in the trie: the trie is left unmodified and false is returned.
The key is a prefix of a longer key: only its end mark is cleared.
The key shares a prefix with other keys: nodes are unlinked starting at the
first node that no other key passes through (its nCount drops to 0).
The key shares nothing with other keys: its whole branch is unlinked.  */
bool deleteTrie(TrieNode * root, const char * str)
{
    if (root == NULL || str == NULL || str[0] == '\0')
        return false;

    // Pass 1: make sure the key is really stored, so a miss changes nothing
    TrieNode * tmp = root;
    for (int i = 0; str[i]; i++)
    {
        int k = str[i] - 'a';
        if (k < 0 || k >= MAX || tmp->next[k] == NULL)
            return false;
        tmp = tmp->next[k];
    }
    if (!tmp->isEnd)
        return false;

    // Words ending exactly here = nCount minus the words continuing into children
    int below = 0;
    for (int k = 0; k < MAX; k++)
        if (tmp->next[k])
            below += tmp->next[k]->nCount;
    if (tmp->nCount - below == 1)
        tmp->isEnd = false; // this was the last copy of the key

    // Pass 2: one fewer word now passes through each node on the path
    tmp = root;
    for (int i = 0; str[i]; i++)
    {
        int k = str[i] - 'a';
        if (--tmp->next[k]->nCount == 0)
        {
            tmp->next[k] = NULL; // unused branch: unlink it (pool memory is not freed)
            return true;
        }
        tmp = tmp->next[k];
    }
    return true;
}

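int main(void)
{
    char s[32];
    TrieNode *Root = createTrieNode();
    // Phase 1: read the words to insert; a token starting with '0' ends this phase
    while (scanf("%31s", s) == 1 && s[0] != '0')
    {
        insertTrie(Root, s);
    }

    // Phase 2: for each query, print how many inserted words start with it
    while (scanf("%31s", s) == 1)
    {
        printf("%d\n", searchTrie(Root, s));
    }

    return 0;
}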

Application example 1:
  Longest prefix matching – A Trie based solution
Given a dictionary of words and an input string, find the longest prefix of the string that is also a word in the dictionary.

Examples:
  Let the dictionary contain the following words:
{are, area, base, cat, cater, children, basement}

Below are some input/output examples:
Input String    Output
caterer         cater
basemexy        base
child           < Empty >

Solution:
  We build a Trie of all dictionary words. Once the Trie is built, traverse through it using characters of input string. If prefix matches a dictionary word, store current length and look for a longer match. Finally, return the longest match.

// Returns the longest prefix of 'input' that is also a word in the Trie
public String getMatchingPrefix(String input)  {
    String result = ""; // Initialize resultant string
    int length = input.length();  // Find length of the input string       

    // Initialize reference to traverse through Trie
    TrieNode crawl = root;   

    // Iterate through all characters of the input string and traverse
    // down the Trie
    int level, prevMatch = 0; 
    for( level = 0 ; level < length; level++ )
    {    
        // Find current character of input
        char ch = input.charAt(level);    

        // HashMap of current Trie node to traverse down
        HashMap<Character,TrieNode> child = crawl.getChildren();                        

        // See if there is a Trie edge for the current character
        if( child.containsKey(ch) )
        {
           result += ch;          //Update result
           crawl = child.get(ch); //Update crawl to move down in Trie

           // If this is end of a word, then update prevMatch
           if( crawl.isEnd() ) 
                prevMatch = level + 1;
        }            
        else  break;
    }

    // If the last processed character did not match end of a word, 
    // return the previously matching prefix
    if( !crawl.isEnd() )
            return result.substring(0, prevMatch);        

    else return result;
}
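
  As a cross-check of the example table, here is a compact C++ counterpart of the same idea. It is a sketch under assumptions rather than a translation of the Java class: PrefixNode, addDictionaryWord, and longestPrefix are hypothetical names, and only lowercase words are supported.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical node type for this sketch.
struct PrefixNode {
    bool isEnd = false;                                // a dictionary word ends here
    std::array<std::unique_ptr<PrefixNode>, 26> next;  // one slot per letter 'a'..'z'
};

// Add one lowercase word to the dictionary trie.
void addDictionaryWord(PrefixNode& root, const std::string& word) {
    PrefixNode* node = &root;
    for (char ch : word) {
        assert(ch >= 'a' && ch <= 'z');  // lowercase letters only in this sketch
        std::unique_ptr<PrefixNode>& child = node->next[ch - 'a'];
        if (!child) child = std::make_unique<PrefixNode>();
        node = child.get();
    }
    node->isEnd = true;
}

// Walk down the trie along `input`, remembering the length of the last
// position at which a dictionary word ended.
std::string longestPrefix(const PrefixNode& root, const std::string& input) {
    const PrefixNode* node = &root;
    std::size_t best = 0;
    for (std::size_t i = 0; i < input.size(); ++i) {
        int k = input[i] - 'a';
        if (k < 0 || k >= 26 || !node->next[k]) break;  // no edge for this character
        node = node->next[k].get();
        if (node->isEnd) best = i + 1;
    }
    return input.substr(0, best);
}
```

  Always returning the prefix of length best removes the need for the final isEnd check, because best is only updated at nodes where a word ends.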

Application example 2:
  Print unique rows in a given boolean matrix
Given a binary matrix, print all unique rows of the given matrix.

Input:
{0, 1, 0, 0, 1}
{1, 0, 1, 1, 0}
{0, 1, 0, 0, 1}
{1, 1, 1, 0, 0}
Output:
0 1 0 0 1
1 0 1 1 0
1 1 1 0 0
Method 1 (Simple)
  A simple approach is to check each row with all processed rows. Print the first row. Now, starting from the second row, for each row, compare the row with already processed rows. If the row matches with any of the processed rows, don’t print it. If the current row doesn’t match with any row, print it.

  Time complexity: O( ROW^2 x COL )
  Auxiliary Space: O( 1 )
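
  Method 1 can be sketched as below. The function name uniqueRowsSimple is hypothetical, and the sketch returns the indices of the unique rows instead of printing them (an assumption made so the result can be checked), which costs O( ROW ) extra space for the result; printing each row directly keeps the auxiliary space at O( 1 ).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compare every row against all earlier rows: O(ROW^2 x COL) time.
// Returns the indices of rows that did not appear earlier in the matrix.
std::vector<std::size_t> uniqueRowsSimple(
        const std::vector<std::vector<int>>& m) {
    std::vector<std::size_t> unique;
    for (std::size_t i = 0; i < m.size(); ++i) {
        assert(m[i].size() == m[0].size());  // all rows must have the same width
        bool seen = false;
        for (std::size_t j = 0; j < i && !seen; ++j)
            seen = (m[j] == m[i]);  // element-wise comparison, O(COL)
        if (!seen) unique.push_back(i);
    }
    return unique;
}
```

  For the input matrix shown above, the third row repeats the first, so the rows at indices 0, 1, and 3 are reported.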

Method 2 (Use Binary Search Tree)
  Find the decimal equivalent of each row and insert it into BST. Each node of the BST will contain two fields, one field for the decimal value, other for row number. Do not insert a node if it is duplicated. Finally, traverse the BST and print the corresponding rows.

  Time complexity: O( ROW x COL + ROW x log( ROW ) )
  Auxiliary Space: O( ROW )

  This method will lead to Integer Overflow if number of columns is large.
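
  A minimal sketch of Method 2, with assumptions stated up front: std::set (typically implemented as a balanced binary search tree) stands in for the hand-written BST, the function name uniqueRowsBySet is hypothetical, and unique rows are reported in input order rather than by traversing the tree. The width limit in the code makes the overflow caveat concrete.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Encode each row as an integer and let an ordered set reject duplicates.
// Returns the indices of rows whose value was not seen before.
std::vector<std::size_t> uniqueRowsBySet(
        const std::vector<std::vector<int>>& m) {
    std::set<unsigned long long> seen;
    std::vector<std::size_t> unique;
    for (std::size_t i = 0; i < m.size(); ++i) {
        assert(m[i].size() == m[0].size());  // equal widths, so values are comparable
        assert(m[i].size() <= 64);           // more columns would overflow the integer
        unsigned long long value = 0;
        for (int bit : m[i])
            value = (value << 1) | (bit ? 1ULL : 0ULL);
        if (seen.insert(value).second)  // .second is true only for a new value
            unique.push_back(i);
    }
    return unique;
}
```

  For the example input the row values are 9, 22, 9, and 28, so the third row is rejected as a duplicate of the first.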

Method 3 (Use Trie data structure)
  Since the matrix is boolean, a variant of the Trie data structure can be used in which each node has two children, one for 0 and the other for 1. Insert each row into the Trie. If the row is already there, don’t print it. If the row is not yet in the Trie, insert it and print it.

  Below is a C implementation of Method 3.

// Given a ROW x COL binary matrix, print only its unique rows
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define ROW 4
#define COL 5

// A Trie node
typedef struct Node
{
    bool isEndOfCol;
    struct Node *child[2]; // Only two children needed for 0 and 1
} Node;


// A utility function to allocate memory for a new Trie node
Node* newNode()
{
    Node* temp = (Node *)malloc( sizeof( Node ) );
    temp->isEndOfCol = 0;
    temp->child[0] = temp->child[1] = NULL;
    return temp;
}

// Inserts a new matrix row into the Trie. If the row is already
// present, returns 0; otherwise inserts the row and
// returns 1
bool insert( Node** root, int (*M)[COL], int row, int col )
{
    // base case
    if ( *root == NULL )
        *root = newNode();

    // Recur if there are more entries in this row
    if ( col < COL )
        return insert ( &( (*root)->child[ M[row][col] ] ), M, row, col+1 );

    else // If all entries of this row are processed
    {
        // unique row found, return 1
        if ( !( (*root)->isEndOfCol ) )
            return (*root)->isEndOfCol = 1;

        // duplicate row found, return 0
        return 0;
    }
}

// A utility function to print a row
void printRow( int (*M)[COL], int row )
{
    int i;
    for( i = 0; i < COL; ++i )
        printf( "%d ", M[row][i] );
    printf("\n");
}

// The main function that prints all unique rows in a
// given matrix.
void findUniqueRows( int (*M)[COL] )
{
    Node* root = NULL; // create an empty Trie
    int i;

    // Iterate through all rows
    for ( i = 0; i < ROW; ++i )
        // insert row to TRIE
        if ( insert(&root, M, i, 0) )
            // unique row found, print it
            printRow( M, i );
}

// Driver program to test above functions
int main()
{
    int M[ROW][COL] = {{0, 1, 0, 0, 1},
        {1, 0, 1, 1, 0},
        {0, 1, 0, 0, 1},
        {1, 1, 1, 0, 0}
    };

    findUniqueRows( M );

    return 0;
}

  Time complexity: O( ROW x COL )
  Auxiliary Space: O( ROW x COL )

  This method has better time complexity. Also, the relative order of the rows is maintained while printing.
