Bioinformatics MOT290
Lecture 2:
Basics of Bioinformatics I
Professor Sigrun Reumann
Centre for Organelle Research
Faculty of Science and Technology
University of Stavanger
N-4036 Stavanger
Norway
Tel: +47 51 83 18 97
Fax: +47 51 83 17 50
E-mail: sigrun.reumann@uis.no
Room KE C-278
August 26th, 2009
Content:
1. Repetition of Lecture 1
2. How to hypothesize the function of an unknown protein?
3. Databases
4. Sequence alignments and important parameters
5. BLAST search of an unknown protein
Basics of Bioinformatics I
The potential and the need for Bioinformatics:
A daily example
[Figure: annotated protein gel with panels A, B and C; marker lane M (118, 85, 47, 36, 26, 20); pH axis from ca. pH 5.5 to pH 9; labelled protein spots include RubisCO LSU/SSU, GGT1, GOX1/2, CAT2/3, pMDH1/2, HPR and SGT.]
Estimated purity: >95%
Reumann et al. (2007) Plant Cell 19, 3170-3193
About 50% of the proteins identified in proteome studies are still unknown.
Follow-up question:
Which functions can be postulated, e.g., for the unknown protein UP3?
How can we define the function
of unknown proteins? Which questions to ask?
Fundamental scientific questions:
Subcellular localization:
Where is the protein localized in the cell?
Is it indeed localized in peroxisomes?
Is it present only in peroxisomes, or is it dual/multiple targeted?
Expression:
In which tissue and under which conditions is its gene expressed?
Interaction partners:
Does the protein interact with other proteins?
Are the interactions weak or strong, transient or constitutive?
Physiological function:
Is the protein an enzyme?
What are the substrates, products and co-factors?
Is it a regulatory protein (protein kinase, phosphatase)?
How can we define the function
of unknown proteins? Which questions to ask?
Fundamental scientific questions:
Subcellular localization:
Where is the protein localized in the cell?
Is it indeed localized in peroxisomes?
Is it present only in peroxisomes, or is it dual/multiple targeted?
Expression:
In which tissue and under which conditions is its gene expressed?
Interaction partners:
Does the protein interact with other proteins?
Are the interactions weak or strong, transient or constitutive?
Physiological function:
Is the protein an enzyme?
What are the substrates, products and co-factors?
Is it a regulatory protein (protein kinase, phosphatase)?
Time required for experimental investigation (values in the order given on the slide):
ca. 2-3 months, ca. 2-3 months, ca. 3 months, ca. 6-12 months, ca. 3-12 months, ca. 6-36 months
How can we define the function
of unknown proteins? Which questions to ask?
Hypotheses are generally central to successful research.
Hypothesis-driven research:
Selection of a small number of specific, well-defined experiments to verify or falsify the hypotheses.
Quickly finding the needle in a haystack.
Unbiased research ("omics" research):
Comprehensive analyses of "-omes" or complex systems, thereby being able to discover something unexpected.
E.g. Bioinformatics analyses may indicate that UP3 is related
to the enzyme malate dehydrogenase (MDH)
How to postulate the function
of an unknown protein?
BIOINFORMATICS
Question: Is the unknown protein 3 of interest (UP3, At2g21670) similar (homologous) to proteins of known function?
MDH: Oxaloacetate + NADH → Malate + NAD+
Hypothesis: The unknown protein UP3 catalyzes an enzymatic redox reaction that is similar to that catalyzed by MDH:
UP3: Substrate X + NADH → Product Y + NAD+
(Substrate X: C6/C8 dicarboxylic acid?)
How to find similar proteins of known function by bioinformatics?
The need:
1. One (or several) databases that contain known proteins
2. Methods to access the database and retrieve specific sequences
3. Methods to compare sequences and determine their degree of sequence similarity
4. Methods and knowledge to deduce evolutionary relationships between proteins and to hypothesize structure-function relationships
Types of Databases (DB)
• Primary databases
  – Sequence data ("raw data")
    • DNA/RNA
    • Proteins
• Secondary databases
  – Special properties
    • Protein families
    • Motifs
    • Conserved domains
    • Structures
    • Etc.
The major sequence databases: NCBI
• NCBI
  – National Center for Biotechnology Information (NCBI)
  – Comprises many different databases, e.g., Genbank, Protein DB, Genome DB, Structure DB, etc.
  – Also offers a large number and variety of bioinformatics tools
  – Via ENTREZ, all databases at NCBI can be searched at once!
  – Most used: the "non-redundant database" of Genbank
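Entrez can also be queried from a program through NCBI's E-utilities web interface. The following is only a minimal sketch that builds a search URL and does not contact the server. The helper name `build_esearch_url`, the choice of the `protein` database and the default of 5 hits are assumptions made for illustration; the search term is the locus name of UP3 from this lecture.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (programmatic access to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, db="protein", retmax=5):
    """Return an ESearch URL for `term` in the Entrez database `db`.

    Hypothetical helper for illustration: it only assembles the URL and
    does not send any request to NCBI.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Search the protein database for the locus of the unknown protein UP3.
    print(build_esearch_url("At2g21670"))
```

Opening the printed URL in a browser should return an XML list of matching record identifiers, which can then be retrieved in a second step.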
Growth of Genbank
Genome sizes
Two other major sequence databases: EBI and DDBJ
• EBI (European Bioinformatics Institute), SIB & PIR
  – The universal protein resource
    • Uniprot/Swissprot (Proteins)
    • EMBL (Nucleotides)
    • Ensembl (Genomes)
    • Interpro (Protein families, functional sites and domains)
    • PDB (Structures)
• DDBJ
  – DNA Data Bank of Japan
    • DNA database
    • Analysis tools
International Nucleotide Sequence Database Collaboration (INSDC)
All 3 main databases are connected and frequently exchange their data.
A universal computer-readable sequence format:
The FASTA format
• FASTA format (FAST Alignment format)
>Q86548|Q86548_BHV4 Thymidine kinase - Bovine herpesvirus 4
MAGTSIPGKCEDCFTCESYSFNFESDSGFSCEDVAQYRDLSQYRGLPQLNTCHERHQQDI
YERDNVYEATFFTPPREKLPSITALSESFNNLSFTDGTTTSPDWPAKPAPAHLLYPRPPT
GYTDAYFIFFEGVMGVGKTTLLKTLARNDMNNVIIFEEAMNYWKTVFSDCHKLIYDVMKM
GKHGSFSVSSKLLACQMKFLTPLKSLGRVTAMFSGKEMQRSPGRSDVSHWVMFDRHALSA
CLVFPLVMLKSGMLSFEHFISLVSTFRANEGDIIVLIALDHGEAIRRIKTRKRPEEETIS
LTYVIRLHWCFVAVYNTWRLLQYFTAREATEVCLDIKTINGLCLSKDICYQKCLEFQTLW
NESLLAVLRDIIFPYKSDCTILELCLNVCTQIQKLKFVVSDASKYIGDVRGLWEDISTQI
LQSRVIKTRPVDWANLKSLAKEFCG
The FASTA sequence format is divided into a header and lines of sequence data.
[In Genbank searches at NCBI, sequences can just be copied and pasted.]
Header:
The sequence in FASTA format begins with a single-line description. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier.
Sequence:
The header is followed by the sequence in 1-letter code, for both protein and nucleotide sequences. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.
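The header and sequence rules above can be sketched as a small reader. This is a minimal illustration under the rules stated on this slide, not a replacement for an established library. The function name `parse_fasta` is hypothetical, and the second record (`demo_peptide`) is made up to show how wrapped sequence lines are joined.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record.

    A line starting with ">" opens a new record; the first word after ">"
    is the identifier and the rest of the line is the description. All
    following lines, up to the next ">", are joined into one sequence.
    """
    identifier = None
    description = ""
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# First record: header and first two sequence lines from the slide.
# Second record: a made-up entry, wrapped over two lines on purpose.
EXAMPLE = """>Q86548|Q86548_BHV4 Thymidine kinase - Bovine herpesvirus 4
MAGTSIPGKCEDCFTCESYSFNFESDSGFSCEDVAQYRDLSQYRGLPQLNTCHERHQQDI
YERDNVYEATFFTPPREKLPSITALSESFNNLSFTDGTTTSPDWPAKPAPAHLLYPRPPT
>demo_peptide hypothetical record for illustration
ALIIVAG
GTGSSAG
"""

if __name__ == "__main__":
    for ident, desc, seq in parse_fasta(EXAMPLE):
        print(ident, "|", desc, "|", len(seq), "residues")
```

For real work, an existing parser (for example the one in Biopython) is usually the better choice, since it also handles malformed files and other sequence formats.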
How to estimate whether two proteins are similar and share the same evolutionary origin?
An ancient peptide P1:
P1: ALIIVAGGTGSSAG
What can happen during evolution to a peptide/protein?
Amino acid exchanges,
a) moderate:
P1: ALIIVAGGTGSSAG
P2: ALIGVAGGTGSSAG
b) strong:
P1: ALIIVAGGTGSSAG
P3: ALPGVAGGTGSSAG
Amino acid insertions:
P1: ALIIVAGGT-GSSAG
P4: ALPGVAGGTRGSSAG
Amino acid deletions:
P1: ALIIVAGGT-GSSAG
P5: ALPGVAGGTRG--AG
Peptide duplications:
P1: ALIIVAGGT----------GSSAG
P6: ALPGVAGGTALPGVAGGTRG--AG
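One simple way to put a number on how similar two aligned peptides are is to count identical columns, exchanged columns and gap columns. The sketch below does this for the peptide pairs of this slide. The function names are hypothetical, and defining percent identity as identical columns divided by all alignment columns is an assumption; other definitions (e.g. ignoring gap columns) are also in use.

```python
def alignment_stats(seq_a, seq_b):
    """Count matches, mismatches and gap columns of two aligned sequences.

    Both sequences must already be aligned, i.e. have the same length,
    with "-" marking a gap.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for res_a, res_b in zip(seq_a, seq_b):
        if res_a == "-" or res_b == "-":
            gaps += 1
        elif res_a == res_b:
            matches += 1
        else:
            mismatches += 1
    return {"matches": matches, "mismatches": mismatches,
            "gaps": gaps, "columns": len(seq_a)}


def percent_identity(seq_a, seq_b):
    """Identical columns as a percentage of all alignment columns
    (assumed definition, see the note above)."""
    stats = alignment_stats(seq_a, seq_b)
    return 100.0 * stats["matches"] / stats["columns"]


# Aligned pairs (P1 first) taken from the slide.
SLIDE_ALIGNMENTS = {
    "P2": ("ALIIVAGGTGSSAG", "ALIGVAGGTGSSAG"),
    "P3": ("ALIIVAGGTGSSAG", "ALPGVAGGTGSSAG"),
    "P4": ("ALIIVAGGT-GSSAG", "ALPGVAGGTRGSSAG"),
    "P5": ("ALIIVAGGT-GSSAG", "ALPGVAGGTRG--AG"),
    "P6": ("ALIIVAGGT----------GSSAG", "ALPGVAGGTALPGVAGGTRG--AG"),
}

if __name__ == "__main__":
    for name, (p1, other) in SLIDE_ALIGNMENTS.items():
        print(name, alignment_stats(p1, other),
              f"{percent_identity(p1, other):.1f}% identity")
```

Note that this count treats every exchange alike; it does not distinguish the "moderate" from the "strong" exchanges shown above, which requires a scoring scheme for amino acid pairs.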