Bioinformatics
JIMMY CHENG-HO LIN
School of Medicine, Johns Hopkins University,
Baltimore, MD, USA
jimmy.lin@jhmi.edu
Synonyms
Computational biology; Biocomputing
Definition
Due to the relatively young age of the field, many
definitions have been proposed. While bioinformatics
purists emphasize the analysis of large-scale genomic
and transcriptomic data, looser definitions take bioinformatics
to be any intersection of biology and computer
science, including analysis of scientific literature, epidemiological
statistics, etc. Perhaps an inclusive definition
can be proposed:
The application of computational, statistical, and mathematical
methods to biological information to complement,
aid, and expedite scientific discovery and
enhance biological research. The three main aims
include:
1) DATABASE: acquisition, gathering, storage, organization
and management of large-scale data
2) ALGORITHM/TOOLS: development of algorithms
and computational tools to analyze and classify the
data
3) CONCLUSIONS/PREDICTIONS: processing, abstraction,
and integration of the data to draw conclusions
and make predictions
The data include, but are not limited to, nucleotide, proteomic,
genomic, phylogenetic, chemical, structural,
phenotypical, functional, ontological, and transcriptomic
information.
Basic Characteristics
Since the development of protein sequencing by Sanger
in 1955 and the Atlas of protein sequences by Margaret
Dayhoff in 1965, there has been a revolution of
high-throughput technologies that generate biological
information on an increasingly large scale. In August
2005, the International Nucleotide Sequence Database
Collaboration announced that the public collections of
DNA and RNA sequences had exceeded 100 gigabases
(or 100,000,000,000 bases, or “letters” of the genetic
code), which represent both individual genes and partial
and complete genomes of over 165,000 organisms.
In response to this deluge of data, computer scientists
and biologists collaborated in creating a new field of
study named bioinformatics.
Bioinformatics of Genomes
Bioinformatics is driven by high-throughput technologies.
In 1977, Fred Sanger introduced nucleotide/DNA
sequencing technology (Sanger et al. 1977) and by
1980, the first complete genome sequence of an organism
(bacteriophage ΦX174) was completed. In 1995, the first
complete genome of a free-living organism, that of
Haemophilus influenzae, was completed. The
draft of the human genome was reported in 2001 (Lander
et al. 2001) and completed in 2003. As of 2006, there are
over 350 complete genomes with over 450 more in progress.
Bioinformatics is thus necessary to organize and analyze all
these data.
Currently, the major databases for genomic information
include GenBank at NCBI, Ensembl at the European
Bioinformatics Institute, DNA Data Bank of Japan
at the National Institute of Genetics, and the UCSC
Genome Browser at UC Santa Cruz.
Many computational tools and algorithms
enabled the genomic revolution. Most notably, Jim
Kent’s GigAssembler program (Kent, Haussler 2001) enabled the
consolidation of sequence information from over ten
labs to produce the draft human genome for the public
effort. A computational problem central to sequence
analysis is the alignment and comparison of sequences.
The problem was first solved by Needleman and Wunsch
(Needleman, Wunsch 1970), and current implementations
are based on the multiple sequence alignment (MSA)
algorithm suite named Clustal (Higgins, Sharp 1988).
Another important problem has been the identification
of similar sequences in whole-genome and database
searches. Current implementations that solve the problem
include BLAST (Altschul et al. 1990), PSI-BLAST
(Altschul et al. 1997), and BLAT (Kent 2002).
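The dynamic-programming idea behind global pairwise alignment can be illustrated with a short sketch. It is a minimal illustration in the spirit of Needleman and Wunsch, not a reproduction of their original scoring scheme: the function name and the match, mismatch, and linear gap scores are assumptions chosen for readability.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions (linear gap penalty).
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table row by row from the three possible predecessors.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


# Example with two short, made-up sequences.
print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

The table has (n+1) by (m+1) entries, so time and memory grow with the product of the sequence lengths. This cost is one reason heuristic search tools are used for whole-database comparisons.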
Bioinformatics of Transcriptomes
Besides large-scale sequencing, two other groups of
technologies have revolutionized bioinformatics, namely
transcriptomics (transcriptome) and proteomics.
In 1995, two independent technologies were developed
to measure gene expression on a large scale: serial
analysis of gene expression (SAGE) (Velculescu et al.
1995) and microarray (Schena et al. 1995). By 1997, it
was possible to measure the entire transcriptional profile
of a complete eukaryotic genome (Saccharomyces
cerevisiae) on a microarray chip (DeRisi et al. 1997).
Consolidated databases of gene expression include the
ArrayExpress repository at the EBI, Gene Expression
Omnibus at NCBI, the mouse Gene Expression Database at
Jackson Laboratory, SymAtlas at Novartis, and the
Stanford Microarray Database.
A large number of algorithms were developed to
analyze these expression data.
Initial algorithms were based on clustering genes with
similar expression together (clustering algorithms)
(Niehrs, Pollet 1999), while later programs incorporated
methods such as self-organizing maps (Tamayo et
al. 1999), Bayesian networks (Friedman et al. 2000),
and principal component analysis.
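The idea of grouping genes with similar expression can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited paper: it uses Pearson correlation as the similarity measure and average-linkage agglomerative clustering with a stopping threshold. The function names, the threshold, and the expression values are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_genes(profiles, threshold=0.9):
    """Average-linkage agglomerative clustering of expression profiles.

    profiles maps a gene name to a list of expression values.
    Merging stops once the most similar pair of clusters has a mean
    pairwise correlation below threshold (an illustrative assumption).
    """
    clusters = [[gene] for gene in profiles]

    def linkage(c1, c2):
        # Mean correlation over all gene pairs drawn from the two clusters.
        vals = [pearson(profiles[g], profiles[h]) for g in c1 for h in c2]
        return sum(vals) / len(vals)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        sim, i, j = max(
            ((linkage(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[0],
        )
        if sim < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up profiles: two genes rise across four conditions, two fall.
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.1, 6.0, 3.9, 2.0],
}
print(cluster_genes(toy_profiles))
```

With these toy values the rising genes and the falling genes end up in separate groups, because correlation compares the shape of a profile rather than its absolute level.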
Bioinformatics of Proteomes
Since the development of protein sequencing in 1955
by Fred Sanger, protein research has greatly advanced.
The study of proteomics relies on technologies such
as two-dimensional gel electrophoresis and mass spectrometry
to identify the entire constitution of proteins
in an organism. The first proteome was published in
1995 by Wasinger for the smallest known self-replicating
organism, Mycoplasma genitalium (Wasinger
et al. 1995). Yeast two-hybrid technology allowed
researchers to identify interactions between proteins on a large scale.
Furthermore, as more and more protein crystal structures
were solved, the Brookhaven Protein Data Bank
was created in 1971 to store the
data.
The main databases for protein information include
Pfam (Bateman et al. 2000), UCSC Proteome Browser
(Hsu et al. 2004), Swiss-Prot, and UniProt (Wu et al.
2006), and many databases exist for specific proteins or
post-translational modifications. The major structural
genomics databases and classification schemes include the
Protein Data Bank (PDB) at Brookhaven National Laboratory,
Structural Classification of Proteins (SCOP) (Murzin et
al. 1995), the CATH Protein Structure
Classification Database at UCL (Pearl et al. 2005), and the FSSP Database
(Holm, Sander 1996).
The major question in proteomic bioinformatics is the
in silico prediction of the structure of proteins, also known
as the protein folding problem. On all three levels
of primary, secondary, and tertiary structure, numerous
methods have been attempted, such as comparative
modeling, threading, energy minimization, and ab initio
sequence methods. Various algorithms have also been
developed to query structure databases for similar
structures, such as the DALI Server at EBI and the Vector Alignment
Search Tool (VAST) at NCBI.
Paradigm Shifts in Bioinformatics
In this post-genomic age, with the availability of
large amounts of information on all levels, biological
research is no longer confined to experimental methods
based on single genes. Now, investigators have a wealth
of information at their disposal. The new challenge is
to consolidate, integrate, evaluate, and obtain data from
established sources to generate hypotheses or produce
a set of targets that can then be validated and investigated
using experimental methods.
With more computational resources and more data available,
researchers can now start to think of genes and proteins
in relation to the vast network of interactions within
the genome and think more in terms of pathways
and systems. Just as biotechnological advances such
as PCR, Western blots, and microarrays have revolutionized
biology, future biological research will be intimately
involved with bioinformatics databases, tools,
and analyses.
Cross-References
Bayesian Network
Clustering Algorithms
Genome
Multiple Sequence Alignment
Principal Component Analysis
Protein Folding Problem
Proteome
Self-Organizing Maps
Serial Analysis of Gene Expression
Transcriptome
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990)
Basic local alignment search tool. J Mol Biol 215:403–410
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller
W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST:
a new generation of protein database search programs. Nucl
Acids Res 25:3389–3402
Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer
EL (2000) The Pfam protein families database. Nucl
Acids Res 28:263–266
DeRisi JL, Iyer VR, Brown PO (1997) Exploring the metabolic
and genetic control of gene expression on a genomic scale.
Science 278(5338):680–686
Friedman N, Linial M, Nachman I, Pe’er D (2000) Using
Bayesian networks to analyze expression data. In: Proc. 4th
Annu. Int. Conf. Comput. Mol. Biol. (RECOMB 2000).
Universal Academy Press, Tokyo, Japan, pp 127–135
Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing
multiple sequence alignment on a microcomputer.
Gene 73(1):237–244
Holm L, Sander C (1996) Mapping the protein universe. Science
273:595–603
Hsu F, Pringle TH, Kuhn RM, Karolchik D, Diekhans M, Haussler
D, Kent WJ (2004) The UCSC Proteome Browser. Nucl
Acids Res 33(suppl 1):D454–D458
Kent WJ, Haussler D (2001) Assembly of the working draft
of the human genome with GigAssembler. Genome Res
11(9):1541–1548
Kent WJ (2002) BLAT–the BLAST-like alignment tool. Genome
Res 12(4):656–664
Lander et al (2001) Initial sequencing and analysis of the human
genome. Nature 409(6822):860–921
Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP:
a structural classification of proteins database for the investigation
of sequences and structures. J Mol Biol 247:536–540
Needleman SB, Wunsch CD (1970) A general method applicable
to the search for similarities in the amino acid sequence of
two proteins. J Mol Biol 48:443–453
Niehrs C, Pollet N (1999) Synexpression groups in eukaryotes.
Nature 402:483–487
Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O, Lewis T, Bennett
C, Marsden R, Grant A, Lee D, Akpor A, Maibaum
M, Harrison A, Dallman T, Reeves G, Diboun I, Addou S,
Lise S, Johnston C, Sillero A, Thornton J, Orengo C (2005)
The CATH Domain Structure Database and related resources
Gene3D and DHS provide comprehensive domain family
information for genome analysis. Nucl Acids Res 33:D247–
D251
Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing
with chain-terminating inhibitors. Proc Natl Acad Sci USA
74(12):5463–5467
Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative
monitoring of gene expression patterns with a complementary
DNA microarray. Science 270(5235):467–470
Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky
E et al (1999) Interpreting patterns of gene expression
with self-organizing maps: methods and application
to hematopoietic differentiation. Proc Natl Acad Sci USA
96:2907–2912
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW (1995)
Serial analysis of gene expression. Science 270:484–487
Wasinger VC, Cordwell SJ, Cerpa-Poljak A, Yan JX, Gooley
AA, Wilkins MR, Duncan MW, Harris R, Williams
KL, Humphery-Smith I (1995) Progress with gene-product
mapping of the Mollicutes: Mycoplasma genitalium. Electrophoresis
16(7):1090–1094
Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann
B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane
M, Martin MJ, Mazumder R, O’donovan C, Redaschi N,