Tokenizing English Text and Computing Each Word's Start Index from Its Word Position

阿新 • • Published: 2018-12-01

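#### Background
A recent project needed a read-along effect for articles, but the only data available was each word's position in the article (that is, which word it is, counting from 0). To drive the highlighting, that word position has to be converted into the word's character offset in the full text. Take "My name is Tom.": we only know that "My", "name", "is", and "Tom" are words 0, 1, 2, and 3, whereas the read-along effect needs the start offset of each word, which is 0, 3, 8, and 11 respectively. With those offsets the effect can be implemented with SpannableString (its usage is covered in the next post; this one only deals with tokenization).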

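#### Implementation
1. Tokenization

First, split the text into words. Collapse redundant spaces, newlines, tabs, and other whitespace so that exactly one space remains between adjacent words (with their punctuation), then split on that space. (The article resources in the project have to cooperate: every word, or word plus punctuation, must be separated from the next by whitespace.) Punctuation is kept on purpose and counts as part of the preceding word, because the read-along highlight has to cover the punctuation as well.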

public String[] divideToWords(String text) {
    // \s+ already covers spaces, tabs, \n and \r, so one pattern collapses every whitespace run
    String str = text.replaceAll("\\s+", " ");
    str = str.trim();  // strip leading and trailing spaces
    return str.split(" ");
}
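
As a quick check of the tokenization step, the sketch below runs the same normalization on a small sample. The class name `TokenizeDemo` and the sample sentence are illustrative assumptions, not part of the project code.

```java
import java.util.Arrays;

class TokenizeDemo {
    // Same idea as divideToWords above: collapse whitespace runs, trim, split on single spaces.
    static String[] divideToWords(String text) {
        return text.replaceAll("\\s+", " ").trim().split(" ");
    }

    public static void main(String[] args) {
        // Mixed spaces, a tab and a newline, plus leading and trailing whitespace.
        String text = "  My  name\tis\nTom.  I'm fine. ";
        // Punctuation stays attached to the preceding word.
        System.out.println(Arrays.toString(divideToWords(text)));
        // prints [My, name, is, Tom., I'm, fine.]
    }
}
```

Note that "Tom." and "Tom" would be two different tokens here, which matters later when occurrences are counted per token.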

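2. Computing each word's index in the text

Points to watch: a word may occur several times in the text, and some words are contractions that contain non-letter characters (I’m, don’t, and so on).
Steps: first count how many times each word occurs, storing the counts in a HashMap keyed by word, and use a second HashMap to store the information for the n-th word (all that matters is which occurrence of its token the n-th word is). Then iterate over the words and compute each one's start index in the text.
The matching rule: a word must be preceded by a non-word character and followed by whitespace such as a space or line break (if the text does not end with whitespace, append some first, otherwise the last word may fail to match). Before building the regular expression, escape special characters such as * and ?, because they have special meaning in regular expressions.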
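
The matching rule described above can be tried in isolation. This is a minimal sketch, not the project code: the class `TokenMatchDemo` and the helper `findToken` are hypothetical names, and the standard-library `Pattern.quote` is used here in place of escaping each special character by hand.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenMatchDemo {
    // Returns the start index of the first occurrence of token, searching from 'from',
    // that is preceded by a non-word character and followed by whitespace; -1 if none.
    static int findToken(String text, String token, int from) {
        // Pattern.quote makes the token literal, so '.', '?', '*' etc. lose their regex meaning.
        Pattern p = Pattern.compile("[^\\w]" + Pattern.quote(token) + "\\s");
        Matcher m = p.matcher(text);
        // m.start() points at the preceding non-word character, hence the +1.
        return m.find(from) ? m.start() + 1 : -1;
    }

    public static void main(String[] args) {
        String text = "A Tome is not Tom.";
        String padded = text + " ";  // trailing whitespace so the last token can match

        System.out.println(findToken(padded, "is", 0));    // 7
        System.out.println(findToken(padded, "Tom.", 0));  // 14, not the "Tome" at index 2
        System.out.println(findToken(text, "Tom.", 0));    // -1 without the padding
    }
}
```

Without quoting, the `.` in "Tom." would match any character, so the pattern would hit "Tome" followed by a space first.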

import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Method of a class that holds wordCountMap (word -> occurrences seen so far) and
// wordEntitiesMap (word position -> WordEntity) as fields.
// AppsLog, TAG and WordEntity are project code.
public HashMap<Integer, WordEntity> calculateTextStartIndex(String[] words, String text) {

    wordCountMap.clear();
    wordEntitiesMap.clear();

    // Pass 1: for each word, record which occurrence of that token it is (1st, 2nd, ...)
    for (int i = 0; i < words.length; i++) {
        AppsLog.i(TAG, "word " + i + " : " + words[i]);

        Integer seen = wordCountMap.get(words[i]);
        int repeatCount = (seen == null) ? 1 : seen + 1;
        wordCountMap.put(words[i], repeatCount);

        WordEntity entity = new WordEntity();
        entity.setUnit_index(i);
        entity.setWord(words[i]);
        entity.setRepeatCount(repeatCount);  // which occurrence this is
        wordEntitiesMap.put(i, entity);      // record the word's information
    }

    // Pass 2: locate each word by finding its repeatCount-th match in the text
    for (int i = 0; i < wordEntitiesMap.size(); i++) {
        WordEntity entity = wordEntitiesMap.get(i);
        int repeatCount = entity.getRepeatCount();
        String word = entity.getWord();

        // The word must be preceded by a non-word character and followed by whitespace.
        // Pattern.quote escapes every regex metacharacter in the word, including '\'.
        Matcher m = Pattern.compile("[^\\w]" + Pattern.quote(word) + "\\s").matcher(text);

        int searchFrom = 0;  // where the next search starts: just past the previous match
        int startPos = 0;    // start of the last match found; stays 0 if there is none
        for (int j = 0; j < repeatCount; j++) {
            int index = -1;
            if (i == 0) {
                // The first word has no preceding character, so use indexOf directly
                index = text.indexOf(word);
                if (index != -1) {
                    AppsLog.i(TAG, "Found1 : " + word + "  at  " + index);
                }
            } else if (m.find(searchFrom)) {
                index = m.start() + 1;  // skip the preceding non-word character
                AppsLog.i(TAG, "Found2 : " + word + "  at  " + index);
            }

            if (index == -1) {
                break;  // no further occurrence; keep the last match found
            }
            startPos = index;
            searchFrom = index + word.length();  // skip the text already matched
        }
        entity.setStartPos(startPos);
    }

    for (int i = 0; i < wordEntitiesMap.size(); i++) {
        AppsLog.i(TAG, "" + i + " : " + wordEntitiesMap.get(i));
    }

    return wordEntitiesMap;
}
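
Because the tokens come from splitting the same text on whitespace, their order in the array matches their order in the text. Under that assumption, a simpler alternative (not the approach used above) is to scan forward with `String.indexOf(String, int)`, starting each search where the previous word ended. The class and method names below are illustrative only.

```java
import java.util.Arrays;

class SequentialIndexDemo {
    // Returns the start index of each word, assuming 'words' lists the
    // whitespace-separated tokens of 'text' in their original order.
    static int[] startIndices(String[] words, String text) {
        int[] starts = new int[words.length];
        int cursor = 0;  // position just past the previously located word
        for (int i = 0; i < words.length; i++) {
            int index = text.indexOf(words[i], cursor);
            starts[i] = index;  // -1 if the token is missing from the text
            if (index != -1) {
                cursor = index + words[i].length();
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "My name is Tom.";
        String[] words = text.replaceAll("\\s+", " ").trim().split(" ");
        System.out.println(Arrays.toString(startIndices(words, text)));  // [0, 3, 8, 11]
    }
}
```

Only whitespace lies between the cursor and the next token, so the first hit from the cursor onward is the token itself. This makes occurrence counting, regex escaping, and trailing-whitespace padding unnecessary, at the cost of relying on the token order.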
