[bioinfo] Bioinformatics: Where Code Meets Biology

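It has been almost 8 years since I entered the field of bioinformatics. Bioinformatics brings together my three favorite subjects: biology, computer science, and mathematics. Yet if someone suddenly asked me what bioinformatics is, I still could not give an answer that satisfies me. Hence this blog post.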

Origins


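It is said that in 1970 the Dutch scientists Paulien Hogeweg and Ben Hesper first coined the term "bioinformatica" in Dutch; the English word "bioinformatics" was first used in 1978. The two scientists used the term to mean: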

The study of information processes in biotic systems.

This definition has two key terms: biotic systems and information processes. The phrase "information processes", however, is not easy to grasp.

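The name change of Bioinformatics, a leading journal in the field, offers another way to gauge how widely the word "bioinformatics" came to be accepted. The journal was founded in 1985 under the name Computer Applications in the Biosciences (CABIOS); it is also the official journal of the International Society for Computational Biology (ISCB). It took its current name in 1998.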

Definitions from different periods


Wikipedia

[Definition 1] First, here is how Wikipedia explains bioinformatics:

Bioinformatics /ˌbaɪ.oʊˌɪnfərˈmætɪks/ is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.

Bioinformatics and computational biology involve the analysis of biological data, particularly DNA, RNA, and protein sequences. The field of bioinformatics experienced explosive growth starting in the mid-1990s, driven largely by the Human Genome Project and by rapid advances in DNA sequencing technology.

The primary goal of bioinformatics is to increase the understanding of biological processes.

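This definition stresses the interdisciplinary nature of the field and the understanding of biological data, and treats DNA, RNA, and protein sequences as the most important biological data. It also states that the primary goal of bioinformatics is to increase the understanding of biological processes.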

2000

[Definition 2] Here is the definition given in 2000 by the NIH Biomedical Information Science and Technology Initiative:

Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

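This definition stresses computational tools and approaches (roughly, software and algorithms), together with acquiring, storing, organizing, archiving, analyzing, and visualizing data. In 2012 it was still being quoted in an introductory bioinformatics blog post by an organization affiliated with Cold Spring Harbor Laboratory.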

2001

[Definition 3] In 2001 the Human Genome Project had not yet been completed. Here is the explanation from a paper published that year, titled "What is bioinformatics? A proposed definition and overview of the field":

Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale.

Analyses in bioinformatics predominantly focus on three types of large datasets available in molecular biology: macromolecular structures, genome sequences, and the results of functional genomics experiments (eg expression data). Additional information includes the text of scientific papers and “relationship data” from metabolic pathways, taxonomy trees, and protein-protein interaction networks.

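This definition stresses biological macromolecules and the scale of the data. It holds that biological data consist mainly of macromolecular structures, genome sequences, and the results of functional genomics experiments (such as expression data), plus the text of scientific papers (which can be text-mined) and relationship data (interactions) from sources such as pathways.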

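The authors of the paper classified the main problems in bioinformatics along two dimensions: breadth (how the amount of data changes) and depth (the different macromolecules involved in different biological processes):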

Figure 1: The Bioinformatics Spectrum, from http://bioinfo.mbb.yale.edu/what-is-it/

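In terms of breadth (the informatics perspective), as the amount of data grows (from one sequence to many), the questions asked change, and so do the algorithms and tools required. In terms of depth (the physics perspective), different biological objects (DNA and protein sequences) carry out different functions in different biological processes (protein folding, interactions at the protein surface, and so on).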

The authors also defined the concept of "omics":

A key approach in genomic research is to divide the cellular contents into distinct sub-population, each given an -omic term. Broadly, these ‘omes can be divided into those that represent a population of molecules, and those that define their actions. For example, the proteome is the full complement of proteins encoded by the genome, and the secretome is the part of it secreted from the cell.

A table of the various -omes (OMES TABLE): http://bioinfo.mbb.yale.edu/what-is-it/omes/

[Definition 4] Here is the definition given by the website bioplanet in 2001:

Bioinformatics is the application of computer technology to the management of biological information. Computers are used to gather, store, analyze and integrate biological and genetic information which can then be applied to gene-based drug discovery and development.

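The "biological information" in this definition can be read as biological data; the emphasis is on gathering, storing, analyzing, and integrating data. The definition closes with an application of bioinformatics: gene-based drug discovery and development. Other websites were still quoting it as late as 2017.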

2005

[Definition 5] Here is the definition given by the website TechTarget:

Bioinformatics is the science of developing computer databases and algorithms for the purpose of speeding up and enhancing biological research.
New academic programs are training students in bioinformatics by providing them with backgrounds in molecular biology, engineering, ethics and computer science, including database design and analytical approaches to data mining.

This definition stresses databases and algorithms, and also mentions ethics.

[Definition 6] Here is the definition given in an article in THE SCIENCE CREATIVE QUARTERLY, published at the University of British Columbia:

Bioinformatics involves the integration of computers, software tools, and databases in an effort to address biological questions.

Many scientists today refer to the next wave in bioinformatics as systems biology, an approach to tackle new and complex biological questions. Systems biology involves the integration of genomics, proteomics, and bioinformatics information to create a whole system view of a biological entity.

The genes involved in the pathway, how they interact, and how modifications change the outcomes downstream, can all be modeled using systems biology. Any system where the information can be represented digitally offers a potential application for bioinformatics. Thus bioinformatics can be applied from single cells to whole ecosystems.

Genome sequence by itself has limited information. To interpret genomic information, comparative analysis of sequences needs to be done and an important reagent for these analyses are the publicly accessible sequence databases. Without the databases of sequences (such as GenBank), in which biologists have captured information about their sequence of interest, much of the rich information obtained from genome sequencing projects would not be available [note: the importance of public databases].

The same way developments in microscopy foreshadowed discoveries in cell biology, new discoveries in information technology and molecular biology are foreshadowing discoveries in bioinformatics.

In many ways, bioinformatics provides the tools for applying scientific method to large-scale data and should be seen as a scientific approach for asking many new and different types of biological questions.

Although technology enables bioinformatics, bioinformatics is still very much about biology. Biological questions drive all bioinformatics experiments. Important biological questions can be addressed by bioinformatics and include understanding the genotype-phenotype connection for human disease, understanding structure to function relationships for proteins, and understanding biological networks.

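This article's definition also stresses the importance of databases and gives the reason: a genome sequence on its own carries limited information, so its function has to be studied by comparing it with other, already annotated sequences (for example, using BLAST to annotate a new DNA sequence against the public database GenBank). At that time (2005), many scientists were already describing "systems biology" as the next stage of bioinformatics. The article also notes that bioinformatics may be applicable to any system, from a single cell to a whole ecosystem, whose information can be represented digitally. Bioinformatics is to molecular biology what the microscope is to cell biology.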
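The idea of annotating a new sequence by comparing it with already annotated ones can be illustrated with a toy example. This is only a sketch under loud assumptions: it is not BLAST and does not query GenBank; it ranks a tiny, made-up in-memory "database" by the Jaccard similarity of 3-letter substrings (k-mers). The names `kmers`, `jaccard`, `annotate`, and `TOY_DATABASE` are hypothetical and exist only for this illustration.

```python
def kmers(seq, k=3):
    """Return the set of all overlapping k-letter substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def annotate(query, database, k=3):
    """Return (name, score) of the database entry most similar to query."""
    query_kmers = kmers(query, k)
    scores = {
        name: jaccard(query_kmers, kmers(seq, k))
        for name, seq in database.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical, made-up "annotated" sequences; these are not real GenBank records.
TOY_DATABASE = {
    "toy_gene_A": "ATGGCGTACGTTAGCCTAGGA",
    "toy_gene_B": "TTGACCGGTTAACCGGTTAAC",
    "toy_gene_C": "ATGCCCGGGAAATTTCCCGGG",
}

# The query is a fragment of toy_gene_A, so that entry should rank first.
name, score = annotate("ATGGCGTACGTTAGC", TOY_DATABASE)
print(name, round(score, 2))
```

Real tools use alignment with statistical significance estimates rather than a plain set overlap, but the workflow is the same: the query only becomes informative once there is an annotated collection to compare it against.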

The article also makes several valuable points:

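  • Bioinformatics is not only a tool for solving problems; it should also be seen as a scientific approach for asking new and different types of biological questions;
  • Although bioinformatics depends on technology, every bioinformatics experiment is still driven by a biological question;
  • Important biological questions that bioinformatics can address include understanding the genotype-phenotype connection in human disease, understanding the relationship between protein structure and function, and understanding biological networks;
  • Progress in bioinformatics also depends on progress in the tools and technologies that produce the data (for example newer and cheaper sequencing technologies, high-throughput biochips, and more accurate mass spectrometry).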

2010

[Definition 7] The following two definitions are collected in a blog post by Dr. Robert Edwards, professor of computer science and biology at San Diego State University:

“Bioinformatics is the application of statistics and computer science to the field of molecular biology. It includes computational biology, algorithm development, statistics techniques, data modeling and visualization.” – Owen White (2010)

“Bioinformatics is a science where we integrate computer science, genetics and genomics.” – Atul Butte (2010)

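These definitions mention the application of statistics and computer science to molecular biology, as well as data modeling and visualization. Many of the early pioneers of bioinformatics came to the field from genetics.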

2011

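[Definition 8] Bioinformatics.org, said to be the largest professional website in the field, describes the scope of bioinformatics according to the different stages of its development: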

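The broadest definition of bioinformatics covers data such as DNA sequences or breast X-rays, and so can also include medical image processing. In everyday use, however, the term has a much narrower scope: it mainly means computational molecular biology.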

It is debatable whether bioinformatics and the discipline computational biology, literally "biology that involves computation," are the same or distinct. To some, both bioinformatics and computational biology are defined as any use of computers for processing any biologically-derived information, whether DNA sequences or breast X-rays. Therefore, there are other fields, e.g. medical imaging / image analysis, that might be considered part of bioinformatics. This would be the broadest definition of the term. But, in practice, the definition used by most people is even narrower; bioinformatics to them is a synonym for computational molecular biology: any use of computers to characterize the molecular components of living things.

From the informatics point of view, the emphasis is on the information contained in biological data (data, information, knowledge):

To others, bioinformatics is a grammatical contraction of "biological informatics" and is therefore related to the computer science disciplines of information science and/or information technology. This definition would thus emphasize the information contained within the biological data, also implying that large amounts of data would be managed and/or analyzed.

In the pre-genomic era, bioinformatics essentially meant sequence analysis:

Most biologists talk about "doing bioinformatics" when they use computers to store, retrieve, analyze or predict the composition or the structure of biomolecules. As computers become more powerful you could probably add simulate to this list of bioinformatics verbs. "Biomolecules" include your genetic material---nucleic acids---and the products of your genes: proteins. These are the concerns of pre-genomic or "classical" bioinformatics, which deal primarily with sequence analysis.
Fredj Tekaia at the Institut Pasteur offers this definition of bioinformatics:
"The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information."

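In the post-genomic era, bioinformatics has changed considerably: the research focus has shifted from genes themselves to gene products, and to the analysis of biomedical experimental data.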

The greatest achievement of bioinformatics methods, the Human Genome Project, is practically complete. Because of this the nature and priorities of bioinformatics research and applications have changed. People often talk portentously of our living in the "post-genomic" era. This affects bioinformatics in several ways:

Now that we possess multiple whole genomes, we can look for differences and similarities between all the genes of multiple species. From such studies we can draw particular conclusions about species and general ones about evolution. This kind of science is often referred to as comparative genomics.

There are now technologies designed to measure the relative number of copies of a genetic message (levels of gene expression) at different stages in development or disease or in different tissues. Such technologies, such as DNA microarrays will grow in importance [note: new measurement technologies].

Other, more direct, large-scale ways of identifying gene functions and associations (for example yeast two-hybrid methods) will grow in significance and with them the accompanying bioinformatics of functional genomics.

There will be a general shift in emphasis (of sequence analysis especially) from genes themselves to gene products.

This will lead to:

  • attempts to catalog the activities and characterize interactions between all gene products (in humans): proteomics;
  • attempts to crystallize and/or predict the structures of all proteins (in humans): structural genomics.

What some people refer to as research or medical informatics, the management of all biomedical experimental data associated with particular molecules or patients---from mass spectroscopy, to in vitro assays to clinical side-effects---will move from the concern of those working in drug company and hospital I.T. (information technology) into the mainstream of cell and molecular biology and migrate from the commercial and clinical to academic sectors.
It is worth noting that all of the above post-genomic areas of research depend upon established, pre-genomic sequence analysis techniques.

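The site also highlights the curious relationship between biology and computer science: biological macromolecules are usually polymers of structurally simple monomers (much as a piece of software with its own function is written from a few simple syntactic elements), and biology has in turn inspired computer science, for example genetic algorithms and the structure of (artificial) neural networks.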

It is a mathematically interesting property of most large biological molecules that they are polymers; ordered chains of simpler molecular modules called monomers. Think of the monomers as beads or building blocks which, despite having different colors and shapes, all have the same thickness and the same way of connecting to one another. Monomers that can combine in a chain are of the same general class, but each kind of monomer in that class has its own well-defined set of characteristics. And many monomer molecules can be joined together to form a single, far larger, macromolecule. Macromolecules can have exquisitely specific informational content and/or chemical properties. According to this scheme, the monomers in a given macromolecule of DNA or protein can be treated computationally as letters of an alphabet, put together in pre-programmed arrangements to carry messages or do work in a cell.

There are also whole other disciplines of biologically-inspired computation, e.g. genetic algorithms, AI, and neural networks. Often these areas interact in strange ways. Neural networks, inspired by crude models of the functioning of nerve cells in the brain, are used in a program called PHD to predict, surprisingly accurately, the secondary structures of proteins from their primary sequences.
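The idea quoted above, that the monomers of DNA or protein can be treated computationally as letters of an alphabet, can be made concrete with ordinary strings. The sketch below is illustrative only: the function names are hypothetical, the codon table is the standard genetic code written in the conventional T, C, A, G base order, and the input is assumed to be plain uppercase or lowercase DNA with no ambiguity codes handled specially.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def gc_content(dna):
    """Return the fraction of letters that are G or C (0.0 for an empty string)."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna) if dna else 0.0


def translate(dna):
    """Translate DNA into protein letters, stopping at the first stop codon.

    A trailing incomplete codon is ignored; unknown codons become "X".
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(reverse_complement("ATGC"))
print(gc_content("ATGC"))
print(translate("ATGGCCTAA"))
```

Once a molecule is a string over a small alphabet, a large part of classical sequence analysis reduces to string processing of this kind.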

2013

[Definition 9] The University of Arkansas at Little Rock (UALR) explains bioinformatics as follows in its BIOINFORMATICS PROGRAM:

  • As a discipline that builds upon the fields of computer and information science, bioinformatics relies heavily upon strategies to acquire, store, organize, archive, analyze, and visualize data.
  • As a discipline that builds upon computational biology, bioinformatics encompasses the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
  • As a discipline that builds upon the life, health, and medical sciences, bioinformatics supports medical informatics; gene mapping in pedigrees and population studies; functional-, structural-, and pharmaco-genomics; proteomics, and dozens of other evolving “-omics.”
  • As a discipline that builds upon the basic sciences, bioinformatics depends on a strong foundation of chemistry, biochemistry, biophysics, biology, genetics, and molecular biology which allows interpretation of biological data in a meaningful context.
  • As a discipline whose core is mathematics and statistics, bioinformatics applies these fields in ways that provide insight to make the vast, diverse, and complex life sciences data more understandable and useful, to uncover new biological insights, and to provide new perspectives to discern unifying principles.

In short, bioinformaticians (or bioinformaticists) bring a multidisciplinary perspective to many of the critical problems facing the health-science profession today.

This definition explains bioinformatics from five different angles:

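  • Built on computer and information science, bioinformatics focuses on acquiring, storing and retrieving, analyzing, and visualizing data;
  • Built on computational biology, bioinformatics focuses on developing data-analytical and theoretical methods, and on applying mathematical modeling and computer simulation to biological research;
  • Built on the life sciences and medicine, bioinformatics focuses on analyzing medical informatics data and the various kinds of omics data;
  • Built on the basic sciences, bioinformatics focuses on interpreting biological data at a more fundamental level (chemical structures, biochemical processes, and so on);
  • Built on mathematics and statistics, bioinformatics focuses on analyzing large, diverse, and complex data (for example high-dimensional or highly heterogeneous data).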

Seen this way, the definition makes the interdisciplinary nature of bioinformatics even more apparent.

2017

[Definition 10] The bioinformatician Dr. Maria Nattestad describes her work to non-scientists in these words:

I use computers to analyze biological data.

In a blog post she compared bioinformatics with data science and found them very similar:

Figure 2: Bioinformatics vs. data science

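According to Figure 2, bioinformatics is a special kind of data science. One reason Dr. Maria Nattestad finds bioinformatics so interesting is that the discipline brings together people from different fields, who come with different backgrounds and inclinations and think about biological problems in different ways. She divides bioinformatics into the following three parts: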

  1. Data analysis is the most natural starting point for biologists and involves the most domain expertise because it specifically involves interpreting the data. The ability to detect oddities or interesting patterns in the data can heavily depend on your knowledge of the biological system the data comes from.
  2. Bioinformatics software development is an approach to bioinformatics that I see computer scientists naturally take on. They may also do data analysis, but will have a hard time resisting building real software products. The software they develop can take many forms, from command-line tools to web applications.
  3. Modeling is very fashionable with physicists and mathematicians. You can tell their work apart by the fact that it’s full of equations and written in LaTeX.

2018

[Definition 11] 2018 marks the 20th anniversary of the founding of the Swiss Institute of Bioinformatics (SIB). Its official website defines bioinformatics as follows:

The application of computer technology to the understanding and effective use of biological and clinical data. It is the discipline that stores, analyses and interprets the ‘big data’ generated by life science experiments, or clinical data, using computer science.

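Compared with the other definitions, this one stresses the effective use of data and the processing of big data from the life sciences.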

SIB describes the scope of bioinformatics as follows:

Databases and knowledgebases for storing, retrieving and organizing biological information to maximize the value of biological data;

Software tools for modelling, visualizing, analysing, interpreting and comparing biological data;

Computing and storage infrastructure to process large amounts of data;

Analysis of complex biological datasets or systems in the context of particular research projects;

Research in a wide variety of biological fields using computer- and data science and leading to applications in diverse areas, from agriculture to precision medicine.

Bioinformatics is thus a multidisciplinary field bringing together biologists, computer scientists and mathematicians, as well as statisticians and physicists.

[Definition 12] Here is the definition given by István Albert, professor of bioinformatics at Pennsylvania State University, in his book The Biostar Handbook: A Beginner's Guide to Bioinformatics:

Bioinformatics is a data science that investigates how information is stored within and processed by living organisms.

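This definition is very concise: it treats bioinformatics as a data science that studies how information is stored and processed in living organisms.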

The book's introduction describes how bioinformatics has changed:

In its early days––perhaps until the beginning of the 2000s––bioinformatics was synonymous with sequence analysis. Scientists typically obtained just a few DNA sequences, then analyzed them for various properties. Today, sequence analysis is still central to the work of bioinformaticians, but it has also grown well beyond it.

In the mid-2000s, the so-called next-generation, high-throughput sequencing instruments (such as the Illumina HiSeq) made it possible to measure the full genomic content of a cell in a single experimental run. With that, the quantity of data shot up immensely as scientists were able to capture a snapshot of everything that is DNA-related.

These new technologies have transformed bioinformatics into an entirely new field of data science that builds on the "classical bioinformatics" to process, investigate, and summarize massive data sets of extraordinary complexity.

Around 2005, the arrival of next-generation sequencers took bioinformatics into the era of big data.

The author then presses the question further: what is bioinformatics, really?

But what is bioinformatics, really?

So now that you know what bioinformatics is all about, you're probably wondering what it's like to practice it day-in-day-out as a bioinformatician. The truth is, it's not easy. Just take a look at this "Biostar Quote of the Day" from Brent Pedersen in Very Bad Things:
I've been doing bioinformatics for about 10 years now. I used to joke with a friend of mine that most of our work was converting between file formats. We don't joke about that anymore.

Jokes aside, modern bioinformatics relies heavily on file and data processing. The data sets are large and contain complex interconnected information. A bioinformatician's job is to simplify massive datasets and search them for the information that is relevant for the given study. Essentially, bioinformatics is the art of finding the needle in the haystack.

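It is reassuring to see that someone else has worked in this field for about 10 years and still cannot pin down what bioinformatics is. The passage places particular emphasis on data volume and concludes that bioinformatics is the art of finding the needle in the haystack.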
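To give a feel for the "converting between file formats" work mentioned in the quote, here is a minimal sketch that converts FASTQ records to FASTA. It assumes the simple four-line FASTQ layout (header, sequence, separator, quality) with no line wrapping; the function name `fastq_to_fasta` is hypothetical, and real pipelines would more likely rely on an established library or tool.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert four-line FASTQ records to FASTA; return the number of records written."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        sequence = fastq_handle.readline().rstrip("\r\n")
        separator = fastq_handle.readline().rstrip("\r\n")
        quality = fastq_handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in: {header!r}")
        # FASTA keeps only the identifier line and the sequence; qualities are dropped.
        fasta_handle.write(f">{header[1:]}\n{sequence}\n")
        count += 1
    return count


# Usage with in-memory text; the two reads are made up for illustration.
fastq_text = "@read1 sample\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!!!\n"
output = io.StringIO()
written = fastq_to_fasta(io.StringIO(fastq_text), output)
print(written)
print(output.getvalue(), end="")
```

The conversion is lossy by design: the quality line has no place in FASTA, which is one reason format conversion needs care rather than being purely mechanical.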

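I recommend this author's book here; it works well as an introduction to bioinformatics. I am not the only one recommending it: the founder of the WeChat public account "生信媛" translated the book with permission, and links to all of the content can be found below: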

English edition: https://www.biostarhandbook.com/

Chinese edition, table of contents: http://blog.sciencenet.cn/blog-3334560-1078097.html

My definition


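The sections above presented 12 definitions, dating from 2000 to 2018, that have appeared since the word bioinformatics was coined. Broadly speaking, the earliest definitions put more weight on processes such as acquiring, storing, and retrieving data, and lean toward computer science. As measurement technologies and the basic platforms for analyzing biological data have developed and matured, current definitions put more weight on integrative analysis of the data as a whole and on the big-data challenges brought by high-throughput experiments, and lean toward systems biology.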

Here is my own definition of bioinformatics, based on my understanding:

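Bioinformatics revolves around biological data, so it is naturally and closely tied to data science. Biological data are produced when instruments (sequencers, mass spectrometers, electron microscopes, and so on) quantify different biological processes. The basic structural and functional units of biological processes are the various biological macromolecules (DNA, RNA, proteins, polysaccharides, and so on), small-molecule metabolites, and microorganisms that live in symbiosis with the human body, such as the gut microbiota. The processes consist mainly of the metabolism of these units (synthesis and breakdown, the interconversion of matter and energy) and their interactions (the exchange of information, that is, regulation). Bioinformatics is the science that uses methods from data science, such as statistics and machine learning, to analyze and interpret biological data, and thereby studies biological processes from both a static perspective (structure and function, subcellular localization, and so on) and a dynamic one (regulation, transport, and so on).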

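This task can be broken down roughly into three steps: data management (annotating, storing, retrieving, and exchanging existing data, and submitting new data); developing data analysis tools; and using the tools and interpreting the biological meaning of the results. I strongly agree with what Dr. Raunak Shrestha says in his blog: the ultimate goal of bioinformatics is to understand, at the molecular level, how a living cell works.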

If you ask which definition I like best, apart from my own it is one I came across in a video: Bioinformatics: Where code meets biology.

Reference


https://en.wikipedia.org/wiki/Bioinformatics

http://bioinfo.mbb.yale.edu/what-is-it/

https://searchoracle.techtarget.com/definition/bioinformatics

https://edwards.sdsu.edu/research/what-is-bioinformatics/

https://www.scq.ubc.ca/what-is-bioinformatics/

https://tse3.mm.bing.net/th?id=OIP.G1tK2zPG0f3T71ITT84G3wHaHo&pid=15.1

https://www.bioinformatics.org/wiki/Bioinformatics

http://omgenomics.com/what-is-bioinformatics/

https://www.sib.swiss/about-sib/what-is-bioinformatics

https://www.sib.swiss/about-sib/what-we-do

https://raunakms.wordpress.com/2010/06/05/what-is-bioinformatics-%E2%80%93-a-general-perspective/

https://www.youtube.com/watch?v=mWbuVlIX5jg
