Recommended Introductory R Statistics Course: Data Analysis for the Life Sciences

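Data Analysis for the Life Sciences is part of Harvard's PH525x series (Biomedical Data Science). The whole course uses R for both the statistical theory and the hands-on practice. The textbook is written in R Markdown, which keeps it easy to read while making every analysis reproducible, in line with current expectations for reproducible computation in science. Readers can systematically learn the statistics a biologist needs, implement each method in code themselves, and meet the reproducible-code requirements of top journals such as Cell, Nature, and Science (CNS).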

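Traditional statistics texts focus on mathematical theory. This book instead focuses on carrying out data analysis on a computer. It explains the mathematical principles through worked examples and provides the code so that readers can run each analysis themselves. The entire text is written in R Markdown, so readers can reproduce every analysis.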

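About the authors:

Rafael A. Irizarry is a professor of biostatistics and computational biology at the Harvard School of Public Health and the Dana-Farber Cancer Institute, with 17 years of experience analyzing genomic data.

Michael I. Love is an assistant professor in the Departments of Biostatistics and Genetics at the University of North Carolina at Chapel Hill. His research uses statistical models to uncover biological patterns in genomic data, and he develops open-source statistical software for Bioconductor.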

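Course source code: https://github.com/genomicsclass/labs (all course source code, test data, and results)

Web version of the book: https://genomicsclass.github.io/book/ (the rendered output of each Rmd file, plus per-section navigation and download links for the Rmd sources)

E-book: https://leanpub.com/dataanalysisforthelifesciences/ (downloadable in several formats, convenient for reading on mobile devices)

Interestingly, you can choose to study for free or pay the authors up to $80.

Course outline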

https://genomicsclass.github.io/book/

PH525x series - Biomedical Data Science

  • R markdown source files
  • ePub version on Leanpub
  • Links to the HarvardX class pages
  • External resources and books
  • Finding more help for data analysis

Chapter 0 - Introduction

  • Introduction [Rmd]
  • Getting started [Rmd]
  • Getting started exercises
  • dplyr introduction [Rmd]
  • dplyr introduction exercises
  • Mathematical notation [Rmd]

Chapter 1 - Inference

  • Random variables [Rmd]
  • Random variables exercises
  • Populations and samples [Rmd]
  • Populations and samples exercises
  • CLT and t-distribution [Rmd]
  • CLT and t-distribution exercises
  • CLT in practice [Rmd]
  • CLT in practice exercises
  • t-test in practice [Rmd]
  • Confidence intervals [Rmd]
  • Power calculations [Rmd]
  • Power calculations exercises
  • Monte Carlo [Rmd]
  • Monte Carlo exercises
  • Permutation tests [Rmd]
  • Permutation tests exercises
  • Association tests [Rmd]
  • Association tests exercises

Chapter 2 - Exploratory Data Analysis

  • Exploratory data analysis [Rmd]
  • Plots to avoid [Rmd]
  • Exploratory data analysis exercises

Chapter 3 - Robust Statistics

  • Robust summaries [Rmd]
  • Rank tests [Rmd]
  • Robust summaries exercises

Chapter 4 - Matrix Algebra

  • Introduction to using regression [Rmd]
  • Introduction to using regression exercises
  • Matrix notation [Rmd]
  • Matrix notation exercises
  • Matrix operations [Rmd]
  • Matrix operations exercises
  • Matrix algebra examples [Rmd]
  • Matrix algebra examples exercises

Chapter 5 - Linear Models

  • Linear models introduction [Rmd]
  • Linear models introduction exercises
  • Expressing design formula [Rmd]
  • Expressing design formula exercises
  • Linear models in practice [Rmd]
  • Linear models in practice exercises
  • Standard errors [Rmd]
  • Standard errors exercises
  • Interactions and contrasts [Rmd]
  • Interactions and contrasts exercises
  • Collinearity [Rmd]
  • Collinearity exercises
  • QR and regression [Rmd]
  • Linear models going further [Rmd]

Chapter 6 - Inference for High-Dimensional Data

  • Introduction to high-throughput data [Rmd]
  • Introduction to high-throughput data exercises
  • Inference for high-throughput data [Rmd]
  • Inference for high-throughput data exercises
  • Multiple testing [Rmd]
  • Multiple testing exercises
  • EDA for high-throughput data [Rmd]
  • EDA for high-throughput data exercises

Chapter 7 - Statistical Modeling

  • Modeling [Rmd]
  • Modeling exercises
  • Bayes theorem [Rmd]
  • Bayes theorem exercises
  • Hierarchical models [Rmd]
  • Hierarchical models exercises

Chapter 8 - Distance and Dimension Reduction

  • Distance [Rmd]
  • Distance exercises
  • PCA motivation [Rmd]
  • SVD [Rmd]
  • SVD exercises
  • Projections [Rmd]
  • Rotations [Rmd]
  • MDS [Rmd]
  • MDS exercises
  • PCA [Rmd]

Chapter 9 - Practical Machine Learning

  • Clustering and heatmaps [Rmd]
  • Clustering and heatmaps exercises
  • Conditional expectation [Rmd]
  • Conditional expectation exercises
  • Smoothing [Rmd]
  • Smoothing exercises
  • Machine learning [Rmd]
  • Crossvalidation [Rmd]
  • Crossvalidation exercises

Chapter 10 - Batch Effects

  • Introduction to batch effects [Rmd]
  • Confounding [Rmd]
  • Confounding exercises
  • EDA with PCA [Rmd]
  • EDA with PCA exercises
  • Adjusting with linear models [Rmd]
  • Adjusting with linear models exercises
  • Factor analysis [Rmd]
  • Factor analysis exercises
  • Adjusting with factor analysis [Rmd]
  • Adjusting with factor analysis exercises

Chapter 11 - Introduction to Bioconductor

  • Mike Love’s general reference card
  • Motivations and core values (optional)
  • Installing Bioconductor and finding help [Rmd]
  • Data structure and management for genome scale experiments [Rmd]
    • Coordinating multiple tables: ExpressionSet
    • Institutional archives: GEO, ArrayExpress
  • Interlude: Working with general genomic features using GenomicRanges
    • IRanges introduced
    • Intra-range operations
    • Inter-range operations
    • GRanges
    • Calculating overlaps
  • Range-oriented solutions for current experimental paradigms
    • SummarizedExperiment: for RNA-seq and 450k methylation
    • External storage for very large assays
    • GenomicFiles for families of BAM or BED
    • DNA Variants: VCF handling with VariantAnnotation and VariantTools
    • Handling multiomic archives like TCGA
    • Cloud-oriented solutions: e.g., Google BigQuery
  • Short read mapping/alignment software (optional) [Rmd]

Chapter 12 - Genomic Annotation with Bioconductor

  • More details on GRanges [Rmd]
    • Run-length encoding, views
    • Application to genomic landmarks
    • Application to 450k methylation array visualization
  • General overview of Bioconductor annotation [Rmd]
    • Levels: reference sequence, regions of interest, pathways
    • Discovering reference sequence
    • A build of the human genome
    • Gene/Transcript/Exon catalogs from UCSC and Ensembl
    • Importing and exporting regions and scores
    • AnnotationHub: brokering thousands of annotation resources
    • OrgDb: simple interface to annotation databases
    • Finding and managing gene sets
    • OrganismDb: unifying diverse annotation
  • Cheat sheet on Bioconductor annotation [Rmd]
  • Translating addresses between genome builds: liftOver [Rmd]

Chapter 13 - Genome-scale hypothesis testing with Bioconductor

  • Distinguishing biological and technical variability [Rmd]
    • An experiment with pooled and individual samples
    • Measuring technical variation
    • Measuring biological variation
    • Interpretation
  • Multiple comparisons with genewise t-tests [Rmd]
    • Gene-wise testing
    • Naive enumeration of genes
    • Demonstrating danger of multiple testing with a set of sham comparisons
    • Adjusting for multiplicity with qvalue
    • Adjusted counts in the sham case
  • Moderated t tests via limma [Rmd]
    • A spike-in dataset
    • Naive t-tests
    • Three steps with limma: lmFit, eBayes, topTable
    • Exposing the spiked-in genes
    • A view of the shrinkage of variance estimates
  • Introducing gene sets and gene set analysis [Rmd]
    • Data wrangling
      • A dataset for comparing expression by gender
      • Finding surrogate variables/batch effect correction
    • The Broad Institute MsigDb
      • Identifier remapping
      • Categorical testing
      • Statistical summaries for sets: Wilcoxon
      • Statistical summaries for sets: t statistics
    • Adjusting for within-set correlation
    • A permutation procedure

Chapter 14 - Visualization of genome scale data

  • A basic overview of visualization tasks and strategies [Rmd]
    • Gene models
    • Gene models plus data
    • Driving visualizations with functions
    • Using the browser to drive visualization functions via shiny
    • Queriable dynamic displays with plotly
  • Annotation-oriented visualizations
    • Sketching the binding landscape over chromosomes with ggbio’s karyogram layout [Rmd]
    • Plotting data in the context of genomic features with Gviz [Rmd]
  • Visualizing NGS data [Rmd]
  • Interactive visualization
    • Graphical user interfaces for multivariate data with shiny [Rmd]
    • Clustering gene expression data with shiny [Rmd]
  • Final remarks on visualization [Rmd]

Chapter 15 - Pursuing scalability in genomic analysis: parallelism and out-of-memory data

  • Parallel computing with R and Bioconductor [Rmd]
    • Demonstrating simple speedup in multicore environments
    • Implicit parallelism with BiocParallel and GenomicAlignments
  • External data: data interfaces that spare RAM [Rmd]
    • SQLite for annotation
    • Tabix-indexed BAM
    • HDF5
    • An illustration of NoSQL with S4: mongodb and RaggedMongoExpt [Rmd]
  • Benchmarking various out-of-memory solutions [Rmd]
  • Introduction to Bioconductor’s Amazon Machine Instance for cluster creation and use in EC2 [Rmd]
  • Sharded GRanges for scalable integrative analysis [Rmd]

Chapter 16 - Multi-omic data integration

  • Basic examples of multi-omic integration [Rmd]
    • Transcription factor (TF) binding and gene coexpression in yeast
    • TF binding and GWAS hits in humans
  • Using RTCGAToolbox outputs to integrate clinical, mutation, expression and methylation assays [Rmd]
    • Basic data acquisition
    • Working with clinical data
      • Defining a severity marker
      • Extracting survival times
    • Working with mutations
    • Curation tasks for discrepant identifier formats
    • Working with expression data
      • Associating tumor stage with expression patterns
      • Linking DNA methylation with expression patterns
  • Application to visualization: kataegis and rainfall plot [Rmd]

Chapter 17 - Fostering reproducible genome-scale analysis

  • Overview of unit on reproducibility [Rmd]
    • Basic definitions
    • Infrastructure requirements
    • Statistical aspects of reproducibility
    • Analysis of reproducibility probability (Boos and Stefanski 2011)
    • Costs of highly reproducible designs
  • Package structure, creation, installation, management [Rmd]
    • What is a package?
    • Using package.skeleton
    • Using makeOrganismPackage
    • Using devtools
      • create() to set up folders and DESCRIPTION
      • Composing documentation plus code
      • document(), install()
    • Conclusions, including a link to a recent Nature Toolbox article on Bioconductor

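How to study

We suggest reading the web version of the book online and practicing with the source code alongside it.

Work through https://genomicsclass.github.io/book/ section by section. There is a lot of material, so readers can pick just the chapters that suit them.

Every hands-on section comes with its Rmd source. Download it and open it in a local RStudio.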

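Downloading all resources in bulk

On Windows, download: https://github.com/genomicsclass/labs/archive/master.zip

On Linux, download with wget or git

# Method 1: download the zip archive; unzipping creates the labs-master directory
wget -c https://github.com/genomicsclass/labs/archive/master.zip
unzip master.zip

# Method 2: clone the repository into the labs directory
# (HTTPS URL, which needs no SSH key; git@github.com:genomicsclass/labs.git also works with one)
git clone https://github.com/genomicsclass/labs.git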
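
After either download method has finished, a quick sanity check is to count the Rmd sources that arrived. The snippet below is only a minimal sketch: the helper name `count_rmd` is hypothetical (it is not part of the course repository), and it assumes the Rmd files use the `.Rmd` extension.

```shell
#!/bin/sh
# count_rmd DIR: print how many .Rmd files exist under DIR, searched recursively.
# Hypothetical helper for illustration only; not part of the course materials.
count_rmd() {
    dir="${1:-labs}"   # defaults to "labs", the directory created by git clone
    # tr strips the padding that some wc implementations add around the number
    find "$dir" -type f -name '*.Rmd' | wc -l | tr -d ' '
}
```

Run `count_rmd labs-master` after Method 1 or `count_rmd labs` after Method 2. A result of 0 suggests the download or unzip step did not complete.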
