
24. SAM format explained https://davetang.org/wiki/tiki-index.php?page=SAM


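Edit distance:
The edit distance between two strings is the minimum number of operations (insertion I, deletion D, substitution) needed to turn string a into string b. This is also how the SAM documentation defines the NM tag. Edit distance is a measure of how similar two strings are (see the article: Edit Distance http://www.cnblogs.com/lihaozy/archive/2012/12/31/2840152.html).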

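An example: what is the edit distance between the strings "eeba" and "abca"?
By the definition, three steps turn "eeba" into "abca": 1. change e to a; 2. delete e; 3. insert c. No shorter sequence of edits exists, so the edit distance between "eeba" and "abca" is 3.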
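The definition above can be checked with the standard dynamic-programming (Levenshtein) recurrence. This is a minimal sketch; the function name `edit_distance` is only an illustration and is not part of any SAM tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("eeba", "abca"))  # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of `b` rather than with the product of both lengths.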


1. How to count indels and mismatches from a BAM file

92S59M8I17M1D6M1D67M

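(1) CIGAR format: the operation length followed by the operator. CIGAR is column 6.

Three operators are common: M for match or mismatch, I for insertion and D for deletion. There are also extended operators that describe clipping, padding and splicing. (Note: so far I have only seen = and X in blasr alignment output.)

M: alignment match (covers both match and mismatch); =: sequence match only; X: sequence mismatch only

I: insertion to the reference

D: deletion from the reference

N: skipped region from the reference

S: soft clip on the read (the clipped sequence is still present in <seq>)

H: hard clip on the read (the clipped sequence is not present in <seq>)

P: padding (silent deletion from the padded reference sequence)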

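(2) The CIGAR field can be used to count indels.

(3) The CIGAR field cannot be used to count mismatches, because M covers both matches and mismatches. This is where the NM tag comes in. For the CIGAR above, with NM = 25: mismatches = NM - I - D = 25 - 8 - 1 - 1 = 15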
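The arithmetic above can be reproduced with a few lines of Python. This is a sketch, not a replacement for a real BAM library: the helper names `cigar_totals` and `mismatches_from_nm` are made up for illustration, and NM = 25 is taken from the formula above rather than read from a file.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by a single operator character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_totals(cigar: str) -> Counter:
    """Sum the lengths of each operator in a CIGAR string."""
    totals = Counter()
    for length, op in CIGAR_OP.findall(cigar):
        totals[op] += int(length)
    return totals


def mismatches_from_nm(cigar: str, nm: int) -> int:
    """Mismatches = NM minus inserted bases minus deleted bases."""
    totals = cigar_totals(cigar)
    return nm - totals["I"] - totals["D"]


cigar = "92S59M8I17M1D6M1D67M"
totals = cigar_totals(cigar)
print(totals["I"], totals["D"])       # 8 2
print(mismatches_from_nm(cigar, 25))  # 15
```

Note that I and D are counted in bases, not in events: 8I contributes 8 to the edit distance, and the two 1D operations contribute 1 each.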

2. Optional fields, format <TAG>:<VTYPE>:<VALUE>, column 12 onward

VTYPE:

Type Description
A Printable character
i Signed 32-bit integer
f Single-precision float number
Z Printable string
H Hex string (high nybble first)
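As a rough illustration of the type table above, an optional field can be split on its first two colons and converted according to VTYPE. The function name `parse_optional_field` is hypothetical, and only the five types listed above are handled.

```python
def parse_optional_field(field: str):
    """Split TAG:VTYPE:VALUE and convert VALUE according to VTYPE."""
    # maxsplit=2 keeps any ':' inside a Z string value intact.
    tag, vtype, value = field.split(":", 2)
    if vtype == "i":
        return tag, int(value)
    if vtype == "f":
        return tag, float(value)
    if vtype == "H":
        return tag, bytes.fromhex(value)  # hex string, high nybble first
    if vtype in ("A", "Z"):
        return tag, value                 # single character or string
    raise ValueError(f"unsupported VTYPE {vtype!r} in {field!r}")


print(parse_optional_field("NM:i:25"))  # ('NM', 25)
print(parse_optional_field("YT:Z:CP"))  # ('YT', 'CP')
```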

TAG:


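AS:i:<N> Alignment score. Can be negative; in local mode it can be positive. Present only if the read aligned at least once.

XS:i:<N> Alignment score for the second-best alignment. Present only if the read aligned more than once.

YS:i:<N> Alignment score for the opposite mate in the paired-end alignment. Present only if the read is one mate of a paired-end alignment.

XN:i:<N> The number of ambiguous bases in the reference covering this alignment (reference positions such as N that are not A, C, G or T).
XM:i:<N> The number of mismatches in the alignment.
XO:i:<N> The number of gap opens (counting both insertions and deletions).
XG:i:<N> The number of gap extensions (counting both insertions and deletions).
NM:i:<N> The edit distance: the minimum number of single-nucleotide edits (insertions, deletions, substitutions) needed to transform the read string into the reference string.
YF:Z:<S> Why the read was filtered out. Possible values: LN (the read length is less than or equal to the number of seed mismatches), NS (the read contains too many ambiguous characters, N or .), SC (given its length and the match bonus, the read cannot reach the minimum score threshold), QC (the read was marked as failing quality control).
YT:Z:<S> UU: the read was not part of a pair (unpaired). CP: the read was part of a pair and the pair aligned concordantly. DP: the read was part of a pair and the pair aligned discordantly. UP: the read was part of a pair but the pair failed to align either concordantly or discordantly.
MD:Z:<S> A string representation of the mismatched reference bases in the alignment.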
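In an MD string, a number is a run of matching bases, a single letter is the reference base at a mismatch, and `^` followed by letters is a run of deleted reference bases. A small sketch of reading those three token types follows; the MD value `10A5^AC6` is a made-up example and `md_summary` is a hypothetical helper name.

```python
import re

# Three kinds of MD token: a match run, a deletion (^BASES), a mismatch (one base).
MD_TOKEN = re.compile(r"(\d+)|\^([A-Z]+)|([A-Z])")


def md_summary(md: str) -> dict:
    """Count matched, mismatched and deleted reference bases in an MD string."""
    matched = mismatched = deleted = 0
    for run, deletion, mismatch in MD_TOKEN.findall(md):
        if run:
            matched += int(run)
        elif deletion:
            deleted += len(deletion)
        else:
            mismatched += 1
    return {"matched": matched, "mismatched": mismatched, "deleted": deleted}


print(md_summary("10A5^AC6"))
# {'matched': 21, 'mismatched': 1, 'deleted': 2}
```

MD carries no information about insertions, so it complements the CIGAR string rather than replacing it.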

3. FLAG, column 2


1 (0x1) the read is paired in sequencing, no matter whether it is mapped in a pair
1 (0x2) the read is mapped in a proper pair
0 (0x4) the read is not unmapped
0 (0x8) the mate is not unmapped
0 (0x10) the read is on the forward strand
1 (0x20) the mate is on the reverse strand
0 (0x40) the read is not the first read in a pair
1 (0x80) the read is the second read in a pair
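Read from the lowest bit upward, the eight values listed above are 1, 1, 0, 0, 0, 1, 0, 1, which gives FLAG = 1 + 2 + 32 + 128 = 163. Decoding a FLAG is a series of bit tests, sketched here; the short property names are my own labels, not official identifiers.

```python
# (bit value, label) for the eight lowest FLAG bits, in order.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
]


def decode_flag(flag: int) -> list:
    """Return the labels of all bits that are set in flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]


print(decode_flag(163))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'second_in_pair']
```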

4. Types of alignments

SAM can store clipped alignments, spliced alignments, multi-part alignments, padded alignments and alignments in colour space.

Each alignment line has 11 mandatory fields followed by a variable number of optional fields; fields are separated by tabs.

   <QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]

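1. QNAME: name of the query template.
2. FLAG: bitwise flag, a numeric encoding of how the template mapped. Each bit stands for one property, and the value is the sum of the bits that apply.
3. RNAME: reference sequence name. If @SQ SN lines are defined in the header, RNAME must match one of them. For unmapped reads this is '*'.
4. POS: leftmost mapping position, counted from 1. For unmapped reads this is 0.

5. MAPQ: mapping quality.
6. CIGAR: the Compact Idiosyncratic Gapped Alignment Report, a compact description of the alignment relative to the reference, written as numbers followed by letters and read in order. For example, 3S6M1P1I4M means: the first 3 bases are soft-clipped, the next 6 are aligned, then comes 1 padding position, then 1 inserted base, and finally 4 aligned bases.
7. RNEXT (<MRNM> in the field list above): reference name of the next segment's alignment. '*' if there is no other segment or the information is unavailable, '=' if it is the same as RNAME.
8. PNEXT (<MPOS> above): mapping position of the next segment; 0 if unavailable.
9. TLEN (<ISIZE> above): template length. The leftmost segment has a positive value and the rightmost a negative one; the sign for segments in the middle is undefined. It is 0 for single-segment templates or when unavailable.

10. SEQ: the segment sequence; '*' if the sequence is not stored. Note that the lengths of the M/I/S/=/X operations in CIGAR must sum to the length of SEQ.

11. QUAL: base qualities, in the same format as in FASTQ.

12. Optional fields, in the form TAG:TYPE:VALUE. TAG consists of two characters; each TAG stands for one kind of information and may appear only once per line. TYPE gives the type of the value, which can be a string, integer, byte, array and so on.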
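To make the field layout concrete, here is a sketch that splits one alignment line on tabs and checks the SEQ length rule from item 10. The record itself is invented for illustration (`read1`, `chr1` and all numeric values are assumptions), as are the names `SamRecord`, `parse_sam_line` and `query_length`.

```python
import re
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    optional: List[str]


def parse_sam_line(line: str) -> SamRecord:
    """Split a tab-separated alignment line into its 11 mandatory fields plus the rest."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("an alignment line needs at least 11 fields")
    return SamRecord(cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]),
                     cols[5], cols[6], int(cols[7]), int(cols[8]),
                     cols[9], cols[10], cols[11:])


def query_length(cigar: str) -> int:
    """Sum of the M/I/S/=/X lengths, which must equal len(SEQ)."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MIS=X")


# Hypothetical record: names, positions and tag values are made up.
line = "read1\t163\tchr1\t11\t42\t3S8M1D6M4S\t=\t200\t250\tGGGGTGTAACCGACTAGGGGG\t*\tNM:i:5"
rec = parse_sam_line(line)
print(rec.cigar, query_length(rec.cigar), len(rec.seq))  # 3S8M1D6M4S 21 21
```

D, N, H and P are left out of `query_length` because none of them consumes a base that is stored in SEQ.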

(1)Clipped alignment

REF: AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG
READ:       gggGTGTAACC-GACTAGgggg


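In the read, uppercase letters are bases aligned to the reference and lowercase letters are the clipped part of the read. '-' marks a base that is missing from the read relative to the reference. The CIGAR for this example is 3S8M1D6M4S (3 soft, 8 match, 1 deletion, 6 match and 4 soft).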

(2)Spliced alignment


In cDNA-to-genome alignment, we may want to distinguish introns (...) from deletions in exons (---). The operator 'N' is introduced to represent a long skip on the reference.

REF: AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG
READ:          GTGTAACCC................................TCAGAATA

'...' denotes an intron. The CIGAR for this alignment is 9M32N8M.
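A spliced CIGAR can be turned back into exon coordinates by walking along the reference and starting a new block at every N. The sketch below assumes POS = 11, which is where GTGTAACCC starts in the REF line above when counting from 1; `exon_blocks` is a hypothetical helper name.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def exon_blocks(pos: int, cigar: str) -> list:
    """Return 1-based inclusive (start, end) reference blocks, split at N operations."""
    blocks = []
    start = end = pos  # end is exclusive while walking
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=XD":
            end += n               # consumes reference inside the current block
        elif op == "N":
            if end > start:
                blocks.append((start, end - 1))
            start = end = end + n  # skip the intron and open a new block
        # I, S, H and P do not consume the reference
    if end > start:
        blocks.append((start, end - 1))
    return blocks


print(exon_blocks(11, "9M32N8M"))  # [(11, 19), (52, 59)]
```

Here a deletion (D) stays inside its block, which matches the distinction drawn above between introns and deletions within exons.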

(3)Multi-part alignment

One query sequence may be aligned to multiple places on the reference genome, either with or without overlaps. In SAM, we keep multiple hits as multiple alignment records. To avoid presenting the full query sequence multiple times for non-overlapping hits, we introduce operation ‘H‘ to describe hard clipped alignment. Hard clipping (H) is similar to soft clipping (S). They are different in that hard clipped subsequence is not present in the alignment record. The example alignment in "clipped alignment" can also be represented with CIGAR: 3H8M1D6M4H, but in this case, the sequence stored in SAM is "GTGTAACCGACTAG", instead of "GGGGTGTAACCGACTAGGGGG" if soft clipping is in use.
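The difference between the two kinds of clipping comes down to what ends up in the SEQ column. The following sketch strips hard-clipped ends from a full read; `stored_sequence` is an illustrative name, and the function assumes H appears only at the ends of the CIGAR.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def stored_sequence(full_read: str, cigar: str) -> str:
    """Return the sequence kept in SEQ: hard-clipped ends are dropped, soft-clipped ends stay."""
    ops = CIGAR_OP.findall(cigar)
    left = int(ops[0][0]) if ops and ops[0][1] == "H" else 0
    right = int(ops[-1][0]) if len(ops) > 1 and ops[-1][1] == "H" else 0
    return full_read[left:len(full_read) - right]


full = "GGGGTGTAACCGACTAGGGGG"
print(stored_sequence(full, "3S8M1D6M4S"))  # GGGGTGTAACCGACTAGGGGG
print(stored_sequence(full, "3H8M1D6M4H"))  # GTGTAACCGACTAG
```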

(4)Padded alignment

Most sequence aligners only give the sequences inserted to the reference genome, but do not present how these inserted sequences are aligned against
each other. Alignment with inserted sequences fully aligned is called padded alignment. Padded alignment is always produced by de novo assemblers and
is important for an alignment viewer to display the alignment properly. To store padded alignment, we introduce operation ‘P‘ which can be considered
as a silent deletion from padded reference sequence. In the following example, GA on READ1 and A on READ2 are inserted to the reference. With unpadded
CIGAR, we would not be able to distinguish the following padded multi-alignments:

REF:  CACGATCA**GACCGATACGTCCGA
READ1:  CGATCAGAGACCGATA
READ2:    ATCA*AGACCGATAC
READ3:   GATCA**GACCG

The padded CIGAR are different:
READ1: 6M2I8M
READ2: 4M1P1I9M
READ3: 5M2P5M
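The three padded CIGARs can be derived column by column from the display above: a reference base over a read base is M, a reference pad (*) over a read base is I, a pad over a pad is P, and a reference base over a gap is D. The sketch below assumes the column offsets 2, 4 and 3 read off the display; `padded_cigar` is a hypothetical helper.

```python
from itertools import groupby

REF = "CACGATCA**GACCGATACGTCCGA"


def column_op(ref_char: str, read_char: str) -> str:
    """Classify one column of a padded alignment."""
    if ref_char == "*":
        return "P" if read_char == "*" else "I"
    return "D" if read_char == "-" else "M"


def padded_cigar(ref_row: str, read_row: str) -> str:
    """Run-length encode the per-column operations; zip stops at the end of the read."""
    ops = [column_op(r, q) for r, q in zip(ref_row, read_row)]
    return "".join(f"{len(list(group))}{op}" for op, group in groupby(ops))


print(padded_cigar(REF[2:], "CGATCAGAGACCGATA"))  # 6M2I8M
print(padded_cigar(REF[4:], "ATCA*AGACCGATAC"))   # 4M1P1I9M
print(padded_cigar(REF[3:], "GATCA**GACCG"))      # 5M2P5M
```

Without the P operation, READ2 would collapse to 4M1I9M and lose the information about which of the two inserted columns its A occupies.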

(5)Alignments in colour space

Colour alignments are stored as normal nucleotide alignments with additional tags describing the raw colour sequences, qualities and colour-specific
properties.




