1. 程式人生 > >轉錄組分析---Hisat2+StringTie+Ballgown使用

轉錄組分析---Hisat2+StringTie+Ballgown使用

轉錄組分析---Hisat2+StringTie+Ballgown使用

  (2016-10-10 08:14:45) 轉載
標籤: 

生物資訊學

 

轉錄組

 
1.Hisat2建立基因組索引:
First, using the python scripts included in the HISAT2 package, extract splice-site and exon information from the gene annotation file:   $ extract_splice_sites.py gemome.gtf >genome.ss#得到剪接位點資訊
$ extract_exons.py genome.gtf >genome.exon#得到外顯子資訊   Second, build a HISAT2 index:   $ hisat2-build --ss genome.ss --exon genome.exon genome.fa genome   備註extract_splice_sites.py 和 extract_exons.py 在hisat2軟體包中涵蓋了,這兩步不是必須的,只是為了發現剪下位點,也可以直接: $ hisat2-build  
genome.fa genome   2. 利用hisat2比對到基因組:   hisat2 -p 8 --dta -x genome -1 file1_1.fastq.gz -2 file1_2.fastq.gz -S file1.sam hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 file2_1.fastq.gz -2 file2_2.fastq.gz -S file2.sam   備註:--dta:輸出轉錄組型的報告檔案 -x:基因組索引 -S : 輸出sam檔案 -p: 執行緒數 其他引數: Input:   -q                 query input files are FASTQ .fq/.fastq (default)   --qseq             query input files are in Illumina's qseq format   -f                 query input files are (multi-)FASTA .fa/.mfa   -r                 query input files are raw one-sequence-per-line   -c                 , , are sequences themselves, not files   -s/--skip    skip the first reads/pairs in the input (none)   -u/--upto    stop after first reads/pairs (no limit)   -5/--trim5   trim bases from 5'/left end of reads (0)   -3/--trim3   trim bases from 3'/right end of reads (0)   --phred33          qualities are Phred+33 (default)   --phred64          qualities are Phred+64   --int-quals        qualities encoded as space-delimited integers    Alignment:   -N           max # mismatches in seed alignment; can be 0 or 1 (0)   -L           length of seed substrings; must be >3, <32 (22)   -i          interval between seed substrings w/r/t read len (S,1,1.15)   --n-ceil    func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)   --dpad       include extra ref chars on sides of DP table (15)   --gbar       disallow gaps within nucs of read extremes (4)   --ignore-quals     treat all quality values as 30 on Phred scale (off)   --nofw             do not align forward (original) version of read (off)   --norc             do not align reverse-complement version of read (off)    Spliced Alignment:   --pen-cansplice              penalty for a canonical splice site (0)   --pen-noncansplice           penalty for a non-canonical splice site (12)   --pen-canintronlen          penalty for long introns (G,-8,1) with canonical splice sites   --pen-noncanintronlen       penalty for long introns (G,-8,1) with noncanonical splice sites   --min-intronlen              minimum intron length (20)   --max-intronlen              maximum intron length (500000)   --known-splicesite-infile   provide a list of known splice sites   --novel-splicesite-outfile  report a list of splice sites   --novel-splicesite-infile   provide a list of novel splice sites   --no-temp-splicesite               disable the use of splice sites found   --no-spliced-alignment             disable spliced alignment   --rna-strandness          Specify strand-specific information (unstranded)   --tmo                              Reports only those alignments within known transcriptome   --dta                              Reports alignments tailored for transcript assemblers   --dta-cufflinks                    Reports alignments tailored specifically for cufflinks    Scoring:   --ma         match bonus (0 for --end-to-end, 2 for --local)   --mp ,   max and min penalties for mismatch; lower qual = lower penalty <2,6>   --sp ,   max and min penalties for soft-clipping; lower qual = lower penalty <1,2>   --np         penalty for non-A/C/G/Ts in read/ref (1)   --rdg ,  read gap open, extend penalties (5,3)   --rfg ,  reference gap open, extend penalties (5,3)   --score-min min acceptable alignment score w/r/t read length                      (L,0.0,-0.2)    Reporting:   (default)          look for multiple alignments, report best, with MAPQ    OR   -k           report up to alns per read; MAPQ not meaningful    OR   -a/--all           report all alignments; very slow, MAPQ not meaningful    Effort:   -D           give up extending after failed extends in a row (15)   -R           for reads w/ repetitive seeds, try sets of seeds (2)    Paired-end:   --fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)   --no-mixed         suppress unpaired alignments for paired reads   --no-discordant    suppress discordant alignments for paired reads    Output:   -t/--time          print wall-clock time taken by search phases   --un           write unpaired reads that didn't align to   --al           write unpaired reads that aligned at least once to   --un-conc      write pairs that didn't align concordantly to   --al-conc      write pairs that aligned concordantly at least once to   (Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.   --un-gz , to gzip compress output, or add '-bz2' to bzip2 compress output.)   --quiet            print nothing to stderr except serious errors   --met-file  send metrics to file at (off)   --met-stderr       send metrics to stderr (off)   --met        report internal counters & metrics every secs (1)   --no-head          supppress header lines, i.e. lines starting with @   --no-sq            supppress @SQ header lines   --rg-id     set read group id, reflected in @RG line and RG:Z: opt field   --rg        add ("lab:value") to @RG line of SAM header.                      Note: @RG line only printed when --rg-id is set.   --omit-sec-seq     put '*' in SEQ and QUAL fields for secondary alignments.    Performance:   -o/--offrate override offrate of index; must be >= index's offrate   -p/--threads number of alignment threads to launch (1)   --reorder          force SAM output order to match order of input reads   --mm               use memory-mapped I/O for index; many 'bowtie's can share    Other:   --qc-filter        filter out reads that are bad according to QSEQ filter   --seed       seed for random number generator (0)   --non-deterministic seed rand. gen. arbitrarily instead of using read attributes   --version          print version information and quit   -h/--help          print this usage message     3.  將sam檔案sort並轉化成bam:
  $ samtools sort [email protected] 8 -o file1.bam file1.sam $ samtools sort [email protected] 8 -o file2.bam file2.sam   4. 組裝轉錄本:   $ stringtie -p 8 -G genome.gtf -o file1.gtf –l file1 file1.bam $ stringtie -p 8 -G genome.gtf -o file2.gtf –l file2 file2.bam lncRNA (-f 0.01 -a 10 -j 1 -c 0.01)
其中:  -G reference annotation to use for guiding the assembly process (GTF/GFF3)  -l name prefix for output transcripts (default: STRG)  -f minimum isoform fraction (default: 0.1)  -m minimum assembled transcript length (default: 200)  -o output path/file name for the assembled transcripts GTF (default: stdout)  -a minimum anchor length for junctions (default: 10)  -j minimum junction coverage (default: 1)  -t disable trimming of predicted transcripts based on coverage     (default: coverage trimming is enabled)  -c minimum reads per bp coverage to consider for transcript assembly     (default: 2.5)  -v verbose (log bundle processing details)  -g gap between read mappings triggering a new bundle (default: 50)  -C output a file with reference transcripts that are covered by reads  -M fraction of bundle allowed to be covered by multi-hit reads (default:0.95)  -p number of threads (CPUs) to use (default: 1)  -A gene abundance estimation output file  -B enable output of Ballgown table files which will be created in the     same directory as the output GTF (requires -G, -o recommended)  -b enable output of Ballgown table files but these files will be     created under the directory path given as  -e only estimate the abundance of given reference transcripts (requires -G)  -x do not assemble any transcripts on the given reference sequence(s)  -h print this usage message and exit     5. 合併所有樣本的gtf檔案   $ stringtie --merge -p 8 -G genome.gtf -o stringtie_merged.gtf mergelist.txt   6. 新轉錄本的註釋(lncRNA必備,普通轉錄組忽略)   gffcompare –r genomegtf –G –o merged stringtie_merged.gtf   備註:gffcompare 是獨立軟體,下載地址http://ccb.jhu.edu/software/stringtie/gff.shtml,結果如下; = Predicted transcript has exactly the same introns as the reference transcript c Predicted transcript is contained within the reference transcript j Predicted transcript is a potential novel isoform that shares at least one splice junction with a reference transcript e Predicted single-exon transcript overlaps a reference exon plus at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment i Predicted transcript falls entirely within a reference intron o Exon of predicted transcript overlaps a reference transcript p Predicted transcript lies within 2 kb of a reference transcript (possible polymerase run-on fragment) r Predicted transcript has >50% of its bases overlapping a soft-masked (repetitive) reference sequence u Predicted transcript is intergenic in comparison with known reference transcripts x Exon of predicted transcript overlaps reference but lies on the opposite strand s Intron of predicted transcript overlaps a reference intron on the opposite strand   7. 轉錄本定量和下游ballgown軟體原始檔案構建:   $ stringtie –e –B -p 8 -G stringtie_merged.gtf -o ballgown/file1/file1.gtf file1.bam $ stringtie –e –B -p 8 -G stringtie_merged.gtf -o ballgown/file2/file2.gtf file2.bam   8. Ballgown差異表達分析:   >library(ballgown) >library(RSkittleBrewer) >library(genefilter) >library(dplyr) >library(devtools) >pheno_data = read.csv("geuvadis_phenodata.csv")#讀取表型資料 >bg_chrX = ballgown(dataDir = "ballgown", samplePattern = "file", pData=pheno_data)#讀取表達量 >bg_chrX_filt = subset(bg_chrX,"rowVars(texpr(bg_chrX)) >1",genomesubset=TRUE)#過濾低表達量基因 >results_transcripts = stattest(bg_chrX_filt,feature="transcript",covariate="sex",adjustvars =c("population"), getFC=TRUE, meas="FPKM")#差異表達分析,運用的是一般線性模型,比較組sex,影響因素:population >results_genes = stattest(bg_chrX_filt, feature="gene",covariate="sex", adjustvars = c("population"), getFC=TRUE,meas="FPKM")#基因差異表達 >results_transcripts=data.frame(geneNames=ballgown::geneNames(bg_chrX_filt),geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)#增加基因名字,id >results_transcripts = arrange(results_transcripts,pval)#按pval sort >results_genes = arrange(results_genes,pval) >write.csv(results_transcripts, "chrX_transcript_results.csv", row.names=FALSE) >write.csv(results_genes, "chrX_gene_results.csv", row.names=FALSE) >subset(results_transcripts,results_transcripts$qval<0.05) >subset(results_genes,results_genes$qval<0.05)   9. 結果視覺化:   >tropical= c('darkorange', 'dodgerblue', 'hotpink', 'limegreen', 'yellow') >palette(tropical) >fpkm = texpr(bg_chrX,meas="FPKM") >fpkm = log2(fpkm+1) >boxplot(fpkm,col=as.numeric(pheno_data$sex),las=2,ylab='log2(FPKM+1)') >ballgown::transcriptNames(bg_chrX)[12] ## 12 ## "NM_012227" >ballgown::geneNames(bg_chrX)[12] ## 12 ## "GTPBP6" >plot(fpkm[12,] ~ pheno_data$sex, border=c(1,2), main=paste(ballgown::geneNames(bg_chrX)[12],' : ', ballgown::transcriptNames(bg_chrX)[12]),pch=19, xlab="Sex", ylab='log2(FPKM+1)') >points(fpkm[12,] ~ jitter(as.numeric(pheno_data$sex)), col=as.numeric(pheno_data$sex)) >plotTranscripts(ballgown::geneIDs(bg_chrX)[1729], bg_chrX, main=c('Gene XIST in sample ERR188234'), sample=c('ERR188234')) >plotMeans('MSTRG.56', bg_chrX_filt,groupvar="sex",legend=FALSE)