1. 程式人生 > >生物資訊資料格式:fastq格式

生物資訊資料格式:fastq格式

文章目錄

格式說明

FASTQ檔案中每個序列通常有四行:

1.第一行:必須以“@”開頭,後面跟著唯一的序列ID識別符號,然後跟著可選的序列描述內容,識別符號與描述內容用空格分開;

2.第二行:序列字元(核酸為[AGCTN]+,蛋白為氨基酸字元);

3.第三行:必須以“+”開頭,後面跟著可選的ID識別符號和可選的描述內容,如果“+”後面有內容,該內容必須與第一行“@”後的內容相同;

4.第四行:鹼基質量字元,每個字元對應第二行相應位置鹼基或氨基酸的質量,該字元可以按一定規則轉換為鹼基質量得分,鹼基質量得分可以反映該鹼基的錯誤率。這一行的字元數與第二行中的字元數必須相同。

檢視fatsq檔案

!cat ./data/test1.fq
@HWI-ST1223:80:D1FMTACXX:2:1101:1243:2213 1:N:0:AGTCAA
TCTGTGTAGCCNTGGCTGTCCTGGAACTCACTTTGTAGACCAGGCTGGCATGCACCACCACNNNCGGCTCATTTGTCTTTNNTTTTTGTTTTGTTCTGTA
+
BCCFFFFFFHH#4AFHIJJJJJJJJJJJJJJJJJIJIJJJJJGHIJJJJJJJJJJJJJIIJ###--5ABECFFDDEEEEE##,[email protected]?CDD<AD>C:@>
@HWI-ST1223:80:D1FMTACXX:2:1101:1375:2060 1:N:0:AGTCAA
NTGCTGAGCCACGACAAGGATCCCAGAGGGCCNAGCCCTGCATCTTGTATGGACCAGTTACNCATCAAAAGAGACTACTGTAGGCACCATCAATCAGATC
+
#1:DDDD;?CFFHDFEEIGIIIIIIG;DHFGG#)0?BFBDHBFF<FCFEFD;@[email protected]=7?E#,,,;=(>3;=;;C>[email protected]
@HWI-ST1223:80:D1FMTACXX:2:1101:1383:2091 1:N:0:AGTCAA
NGTTCGTGTGGAACCTGGCGCTAAACCATTCGTAGACGACCTGCTTCTGGGTCGGGGTTTCGTACGTAGCAGAGCAGCTCCCTCGCTGCGATCTATTGAA
+
#1=DDFDFHHHHHJGJJJJJJJJJJJJJJJIJIGDHIHIGIJJJJJJJIIIGHHFDD3>BDDBDDDDDDDDDDBDCCBDDDDDDDDDDDBBDDDDEEACD
@HWI-ST1223:80:D1FMTACXX:2:1101:1452:2138 1:N:0:AGTCAA
NTCTAGGAGGTCTAGAAAGCCCAGGCCACCGGTACAAACATCAAGGGTGTTACGGATGTGCCGCTCTGAACCTCCAGGACGACTTTGATTTCAACTACAA
+
#[email protected]DDDDDDDDDDCDDDDDDDDDDDCCCEDEDDDDDDD 

每個鹼基的質量值與錯誤率之間的關係:

Q = 10 l o g 10 P Q = -10 log_{10}P

其中P代表該鹼基被測序錯誤的概率,如果該鹼基測序出錯的概率為0.001,則Q應該為30,那麼30+33=63,那麼63對應的ASCii碼為“?”,則在第四行中該鹼基對應的質量代表值即為“?”

print(ord("?"))
63

一般地,鹼基質量從0-40,既ASCii碼為從 “!”(0+33)到“I”(40+33)。以上是sanger中心採用記錄read測序質量的方法

對於每個鹼基的質量編碼標示,不同的軟體採用不同的方案,目前有5種方案:

  • Sanger,Phred quality score,值的範圍從0到92,對應的ASCII碼從33到126,但是對於測序資料(raw read data)質量得分通常小於60,序列拼接或者mapping可能用到更大的分數。

  • Solexa/Illumina 1.0, Solexa/Illumina quality score,值的範圍從-5到63,對應的ASCII碼從59到126,對於測序資料,得分一般在-5到40之間;

  • Illumina 1.3+,Phred quality score,值的範圍從0到62對應的ASCII碼從64到126,低於測序資料,得分在0到40之間;

  • Illumina 1.5+,Phred quality score,但是0到2作為另外的標示,詳見http://solexaqa.sourceforge.net/questions.htm#illumina

  • Illumina 1.8+

avatar

ASCII值小於等於58(相應的質量得分小於等於25)對應的字元只有在Phred+33的編碼中被使用,所有Phred+64所使用的字元的ASCII值都大於等於59。在通常情況下,ASCII值大於等於74的字元只出現在Phred+64中。利用這些資訊即可在程式中進行判斷

例項演練

判斷fastq序列編碼是Phred33(Illumina1.8+) or Phred64(Illumina1.3+)

def fq_phred_check(inputfile): 
    """判斷fastq序列編碼"""
    fastq_file = open(inputfile,'r',encoding='utf-8') 
    count = 1 
    for line in fastq_file: 
        line_strip = line.strip() 
        if count % 4 == 0: 
            for each in line_strip: 
                ASCII_each = ord(each) 
                if ASCII_each > 75: 
                    phred_value = 64
                    break
                elif ASCII_each < 58:
                    phred_value = 33
                    break      
        count += 1 
    fastq_file.close()  
    return phred_value

inputfile = './data/test1.fq'
phred_value = fq_phred_check(inputfile)
print(phred_value)
33

fastq轉換fasta格式

def fastq_fasta(inputfile): 
    """將fastq轉換成fasta序列格式""" 
    fastq_file = open(inputfile,'r',encoding='utf-8') 
    out_file_name = inputfile.strip().split('/')[-1].split('.')[0] + '.fasta' 
    output_fasta = open(out_file_name,'w',encoding='utf-8') 
    i = 0 
    for line in fastq_file: 
        if i % 4 == 0: 
            line_new = line.strip().split('\n')[0].replace('@','>') 
            output_fasta.write(line_new + '\n') 
        elif (i - 1) % 4 == 0: 
            output_fasta.write(line)
        i = i + 1 
    print("fastq transform fasta success", out_file_name)
    fastq_file.close() 
    output_fasta.close() 
    
def main(): 
    inputfile = './data/test1.fq' 
    fastq_fasta(inputfile) 
    
if __name__ == '__main__': 
    main()

fastq transform fasta success test1.fasta

Linux 操作fastq

獲取資料

mkdir -p ~/biosoft
cd ~/biosoft
wget https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.3.4.3/bowtie2-2.3.4.3-linux-x86_64.zip

unzip bowtie2-2.3.4.3-linux-x86_64.zip
cd ~/biosoft/bowtie2-2.3.4.3-linux-x86_64/example/reads

統計reads_1.fq檔案中共有多少條序列資訊

[sunchengquan 08:07:20 ~/local/app/bowtie2-2.2.4/example/reads]
$ ll
總用量 8.4M
-rwxrwx--- 1 sunchengquan sunchengquan 4.0M 10月 23 2014 longreads.fq
-rwxrwx--- 1 sunchengquan sunchengquan 2.2M 10月 23 2014 reads_1.fq
-rwxrwx--- 1 sunchengquan sunchengquan 2.2M 10月 23 2014 reads_2.fq
-rwxrwx--- 1 sunchengquan sunchengquan 9.1K 10月 23 2014 simulate.pl
[sunchengquan 08:07:27 ~/local/app/bowtie2-2.2.4/example/reads]
$ wc reads_1.fq
  40000   40000 2285692 reads_1.fq
[sunchengquan 08:07:42 ~/local/app/bowtie2-2.2.4/example/reads]
$ wc -l reads_1.fq
40000 reads_1.fq

[sunchengquan 09:06:02 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $[`wc -l reads_1.fq |cut -d' ' -f1`/4]
10000
[sunchengquan 09:06:52 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $((`wc -l reads_1.fq |cut -d' ' -f1`/4))
10000
[sunchengquan 09:07:10 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $(expr `wc -l reads_1.fq |cut -d' ' -f1`/4)
40000/4
[sunchengquan 09:07:35 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $(expr `wc -l reads_1.fq |cut -d' ' -f1` / 4)
10000
[sunchengquan 09:12:17 ~/local/app/bowtie2-2.2.4/example/reads]
$ let c=`wc -l reads_1.fq |cut -d' ' -f1`/4
[sunchengquan 09:12:36 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo $c
10000
[sunchengquan 09:16:11 ~/local/app/bowtie2-2.2.4/example/reads]
$ a=`wc -l reads_1.fq |cut -d' ' -f1`
[sunchengquan 09:16:23 ~/local/app/bowtie2-2.2.4/example/reads]
$ b=4
[sunchengquan 09:16:27 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo `expr $a / $b`
10000
[sunchengquan 09:18:03 ~/local/app/bowtie2-2.2.4/example/reads]
$ echo `expr $a / 4`
10000

輸出reads_1.fq檔案中的識別符號(即以@開頭的那一行)

[sunchengquan 09:18:22 ~/local/app/bowtie2-2.2.4/example/reads]
$ grep '^@' reads_1.fq |head -3
@r1
@r2
@r3

[sunchengquan 09:21:59 ~/local/app/bowtie2-2.2.4/example/reads]
$ grep '^@' reads_1.fq |cut -f1 |cut -c1|head -3
@
@
@


[sunchengquan 09:22:38 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fq |paste - - - - |wc
  10000   40000 2285692
[sunchengquan 09:21:51 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fq |paste - - - - |cut -f1|cut -c1|head -3
@
@
@

[sunchengquan 09:27:19 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1) {print $0}}' reads_1.fq|head -3
@r1
@r2
@r3

[sunchengquan 09:28:47 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1) print $0}' reads_1.fq|head -3
@r1
@r2
@r3
[sunchengquan 09:31:38 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1){print}}' reads_1.fq|head -3
@r1
@r2
@r3
[sunchengquan 09:32:28 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{print NR}' reads_1.fq|head -3
1
2
3

輸出reads_1.fq檔案中所有序列的資訊(即每個序列的第二行)

[sunchengquan 09:34:51 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print $0 }}' reads_1.fq|head -1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG

輸出reads_1.fq檔案中‘+’及其後面的描述資訊(即每個序列的第三行)

[sunchengquan 09:35:06 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==3){print $0 }}' reads_1.fq|head -3
+
+
+

輸出reads_1.fq檔案中質量值資訊(即每個序列的第四行)

[sunchengquan 09:36:24 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0){print $0 }}' reads_1.fq|head -1
+"@6<:27(F&5)9)"B:%B+A-%5A?2$HCB0B+0=D<7E/<.03#[email protected]==?C"7>;))%;,3-$.A06+<-1/@@?,26">=?*@'0;$:;??G+:#+(A?9+10!8!?()?7C>

計算reads_1.fq檔案含有N鹼基的reads個數

[sunchengquan 09:39:13 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print $0 }}' reads_1.fq|grep N|wc
   6429    6429  782897

[sunchengquan 09:41:03 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print $0 }}' reads_1.fq|grep -c N
6429

計算reads_1.fq檔案裡面的序列鹼基總數

[sunchengquan 12:46:57 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print}}' reads_1.fq |grep -o '[ATGCN]' |wc
1088399 1088399 2176798

[sunchengquan 12:47:12 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print length}}' reads_1.fq |head -1
122
[sunchengquan 12:49:36 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print length($0)}}' reads_1.fq |head -1
122
[sunchengquan 12:58:27 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print length($0)}}' reads_1.fq |paste -s -d +|bc
1088399

計算reads_1.fq檔案中所有reads的N鹼基總數

[sunchengquan 12:58:50 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print}}' reads_1.fq |grep -o '[N]' |wc
  26001   26001   52002

計算reads_1.fq中測序鹼基質量值恰好為Q20的個數

[sunchengquan 17:14:00 ~/local/app/bowtie2-2.2.4/example/reads]
$ python -c 'print (chr(33+20))'
5
[sunchengquan 17:15:32 ~/local/app/bowtie2-2.2.4/example/reads]
$ python -c 'print (chr(33+30))'
?

[sunchengquan 17:15:43 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0){print}}' reads_1.fq |grep -o '[5]'|wc -l 
21369
[sunchengquan 17:17:18 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0){print}}' reads_1.fq |grep -o '5'|wc -l 
21369

計算reads_1.fq中測序鹼基質量值恰好為Q30的個數

[sunchengquan 17:17:27 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==0){print}}' reads_1.fq |grep -o '?'|wc -l 
21574

計算reads_1.fq中所以序列的第一位鹼基的ATCGNactg分佈情況

[sunchengquan 17:26:36 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==2){print}}' reads_1.fq |cut -c1|sort |uniq -c
   2184 A
   2203 C
   2219 G
   1141 N
   2253 T

將reads_1.fq轉為reads_1.fa檔案(即將fastq轉化為fasta)

[sunchengquan 17:33:13 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1){print ">"$0} else if(NR%4==2){print}}' reads_1.fq |head -2
>@r1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG
[sunchengquan 17:45:57 ~/local/app/bowtie2-2.2.4/example/reads]
$ awk '{if(NR%4==1){print ">"substr($0,2)} else if(NR%4==2){print}}' reads_1.fq |head -2
>r1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG


[sunchengquan 17:46:36 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fq |paste - - - - |cut -f1,2|tr '\t' '\n'|head -2
@r1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG
[sunchengquan 17:48:14 ~/local/app/bowtie2-2.2.4/example/reads]
$ cat reads_1.fq |paste - - - - |cut -f1,2|tr '\t' '\n'|tr '@' '>'|head -2
>r1
TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG

統計上述reads_1.fa檔案中共有多少條序列

[sunchengquan 17:50:29 ~/local/app/bowtie2-2.2.4/example/reads]
$