leetcode 187. Repeated DNA Sequences: counting repeated strings by encoding + sliding window

阿新 • Published: 2019-01-01

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT”,

Return:
[“AAAAACCCCC”, “CCCCCAAAAA”].

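This problem is about finding substrings that occur more than once, and I learned a lot from it.

The most direct approach is brute force, but that times out. A HashMap also works (and is accepted as-is). I also found an encoding-based approach online, which is a neat idea: each letter is mapped to a 2-bit code, so every 10-letter window fits in a 20-bit integer.

The main pitfall in this problem is operator precedence for the bitwise operations. Always parenthesize them: addition has higher precedence than the shift and bitwise operators, so hash<<2 + map.get(c) is parsed as hash << (2 + map.get(c)), which is wrong.

The code is as follows: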
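A minimal sketch of that precedence pitfall. The class name PrecedenceDemo and its two helper methods are hypothetical and not part of the original solution; they only contrast the parenthesized update with the unparenthesized one.

```java
class PrecedenceDemo {
    // Intended update: shift the hash left by 2 bits, then add the 2-bit code.
    static int withParens(int hash, int code) {
        return (hash << 2) + code;
    }

    // Without parentheses Java parses this as hash << (2 + code),
    // because + binds tighter than <<.
    static int withoutParens(int hash, int code) {
        return hash << 2 + code;
    }

    public static void main(String[] args) {
        int hash = 1; // pretend the window so far encodes to 1
        int code = 3; // 'T' in the A=0, C=1, G=2, T=3 mapping
        System.out.println(withParens(hash, code));    // prints 7
        System.out.println(withoutParens(hash, code)); // prints 32
    }
}
```

The two results differ (7 versus 32), which is why the shift has to be wrapped in parentheses before the addition.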

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

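/*
 * http://blog.csdn.net/xudli/article/details/43666725
 * 
 * There are only 4 letters, so we can build our own hash key,
 * using 2 bits for each incoming character.
 * 
 * Once more than 20 bits (10 characters) have been shifted in,
 * keep only the low 20 bits.
 * 
 * (hash<<2) + map.get(c): mind operator precedence, the << must be parenthesized.
 * 
 * */ 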

public class Solution 
{
    /*
     * Encoding-based approach found online; a nice idea worth studying,
     * though a bit more involved.
     * */
    public List<String> findRepeatedDnaSequencesByCode(String s) 
    {
        List<String> res=new ArrayList<>();
        if(s==null || s.length()<=10)
            return res;

        Map<Character, Integer> map=new HashMap<Character, Integer>();
        map.put('A', 0);
        map.put('C', 1);
        map.put('G', 2);
        map.put('T', 3);

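        /*
         * set holds the code of every 10-letter window seen so far;
         * uniqueSet holds the codes already reported as repeated,
         * so the same sequence is not added to the result twice.
         * */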
        Set<Integer> set=new HashSet<>();
        Set<Integer> uniqueSet=new HashSet<>();
        int hash=0;
        for(int i=0;i<s.length();i++)
        {
            Character one=s.charAt(i);
            if(i<9)
                hash = (hash<<2) + map.get(one);
            else
            {
                hash = (hash<<2) + map.get(one);
                hash &= (1<<20)-1;

                if(!set.contains(hash))
                    set.add(hash);
                else if(!uniqueSet.contains(hash))
                {
                    uniqueSet.add(hash);
                    res.add(s.substring(i-9,i+1));
                }
            }
        }
        return res;
    }

    /*
     * HashMap approach; this also works well and is accepted as-is.
     * 
     * */
    public List<String> findRepeatedDnaSequences(String s) 
    {
        List<String> res=new ArrayList<>();
        if(s==null || s.length()<=10)
            return res;

        Map<String, Integer> map=new HashMap<>();
        for(int i=10;i<=s.length();i++)
        {
            String key=s.substring(i-10, i);
            map.put(key, map.getOrDefault(key, 0)+1);
        }

        Set<String> set=map.keySet();
        for(String key : set)
        {
            if(map.get(key)>=2)
                res.add(key);
        }
        return res;
    }


    /*
     * Brute force; this one is sure to time out.
     * */
    public List<String> findRepeatedDnaSequencesByLoop(String s) 
    {
        List<String> res=new ArrayList<>();
        if(s==null || s.length()<=10)
            return res;

        for(int i=0;i<s.length();i++)
        {
            if(i+10<=s.length())
            {
                String one=s.substring(i,i+10);
                for(int j=i+1;j<s.length();j++)
                {
                    if(j+10<=s.length())
                    {
                        if(one.equals(s.substring(j, j+10)))
                        {
                            if(!res.contains(one))
                                res.add(one);
                            break;
                        }
                    }else
                        break;
                }
            }
            else
                break;
        }
        return res;
    }
}
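To make the encoding concrete, the sketch below packs one 10-letter window into a 20-bit integer and unpacks it again. The class DnaCodec and its methods encode and decode are hypothetical names used only for illustration; the letter-to-code mapping (A=0, C=1, G=2, T=3) is the one from the solution above, and the input is assumed to contain only those four letters.

```java
class DnaCodec {
    // Position in this string is the 2-bit code: A=0, C=1, G=2, T=3.
    private static final String LETTERS = "ACGT";

    // Pack a window into the low 20 bits of an int, 2 bits per letter.
    // Assumes the window contains only A, C, G and T.
    static int encode(String window) {
        int hash = 0;
        for (int i = 0; i < window.length(); i++) {
            hash = (hash << 2) + LETTERS.indexOf(window.charAt(i));
        }
        return hash & ((1 << 20) - 1);
    }

    // Unpack a 20-bit code back into its 10 letters, last letter first.
    static String decode(int hash) {
        char[] out = new char[10];
        for (int i = 9; i >= 0; i--) {
            out[i] = LETTERS.charAt(hash & 3);
            hash >>= 2;
        }
        return new String(out);
    }

    public static void main(String[] args) {
        int code = encode("AAAAACCCCC");
        System.out.println(code);         // prints 341 (binary 0101010101)
        System.out.println(decode(code)); // prints AAAAACCCCC
    }
}
```

Because distinct 10-letter windows always get distinct 20-bit codes, comparing the integers is equivalent to comparing the substrings, which is what lets the solution store Integer values in its sets instead of String values.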

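Below is the C++ version. The most direct approach to this problem is to look substrings up in a map, but that may time out. The encoding approach I later found online is a neat idea and worth studying.

The code is as follows: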

#include <iostream>
#include <algorithm>
#include <climits>
#include <vector>
#include <stack>
#include <queue>
#include <map>
#include <set>
#include <string>
#include <unordered_map>

using namespace std;

class Solution 
{
public:
    vector<string> findRepeatedDnaSequences(string s) 
    {
        vector<string> res;
        map<char, int> mmp;
        mmp['A'] = 0;
        mmp['C'] = 1;
        mmp['G'] = 2;
        mmp['T'] = 3;
        set<int> st;
        set<int> uniqueSt;
        int hash = 0;
        for (int i = 0; i < s.length(); i++)
        {
            if (i < 9)
                hash = (hash << 2) + mmp[s[i]];
            else
            {
                hash = (hash << 2) + mmp[s[i]];
                hash = hash & ((1 << 20) - 1);
                if (st.find(hash) == st.end())
                    st.insert(hash);
                else if (uniqueSt.find(hash) == uniqueSt.end())
                {
                    uniqueSt.insert(hash);
                    res.push_back(s.substr(i-9,10));
                }
            }
        }
        return res;
    }
};
