MOT290_WS09_L2_Basics_I_090826_Lecture_B

Bioinformatics MOT290

Lecture 2:

Basics of Bioinformatics I

Professor Sigrun Reumann

Centre for Organelle Research

Faculty of Science and Technology

University of Stavanger

N-4036 Stavanger

Norway

Tel: +47 51 83 18 97

Fax: +47 51 83 17 50

E-mail: sigrun.reumann@uis.no

Room KE C-278

August 26th, 2009

Content:

1. Repetition of Lecture 1

2. How to hypothesize the function of an unknown protein?

3. Databases

4. Sequence alignments and important parameters

5. BLAST search of an unknown protein

The potential and the need for Bioinformatics:

A daily example

[Figure: gel image (pH range ca. 5.5 to 9) with labeled protein spots, e.g., RubisCO, VDAC, GOX1/2, CAT2/3, pMDH1/2, HPR, SGT, GGT1, ACX4, LACS7]

Estimated purity: >95%

Reumann et al. (2007) Plant Cell 19, 3170-3193

About 50% of the proteins identified in proteome studies are as yet unknown.

Follow-up question: Which functions can be postulated, e.g., for the unknown protein UP3?

How can we define the function of unknown proteins? Which questions to ask?

Fundamental scientific questions:

Subcellular localization:

Where is the protein localized in the cell?

Is it indeed localized in peroxisomes?

Is it present only in peroxisomes, or is it dual/multiple targeted?

Expression:

In which tissue and under which conditions is its gene expressed?

Interaction partners:

Does the protein interact with other proteins?

Are the interactions weak or strong, transient or constitutive?

Physiological function:

Is the protein an enzyme?

What are the substrates, products and co-factors?

Is it a regulatory protein (protein kinase, phosphatase)?

Time required for experimental investigation of these questions: ca. 2-3 months, ca. 2-3 months, ca. 3 months, ca. 6-12 months, ca. 3-12 months, ca. 6-36 months

Hypothesis-driven research

Hypotheses are generally central to successful research.

Selection of a small number of specific, well-defined experiments to verify or falsify the hypotheses.

Finding the needle in a haystack quickly.

Unbiased research: "omics" research

Comprehensive analyses of "-omes" or complex systems, thereby being able to discover something unexpected.

How to postulate the function of an unknown protein?

BIOINFORMATICS

Question: Is the unknown protein 3 of interest (UP3, At2g21670) similar (homologous) to proteins of known function?

E.g., bioinformatics analyses may indicate that UP3 is related to the enzyme malate dehydrogenase (MDH):

Oxaloacetate + NADH → Malate + NAD+ (MDH)

Hypothesis: The unknown protein UP3 catalyzes an enzymatic redox reaction that is similar to that catalyzed by MDH:

Substrate X + NADH → Product Y + NAD+ (UP3; C6/C8 dicarboxylic acid?)

How to find similar proteins of known function by bioinformatics?

The need:

1. One (or several) databases that contain known proteins

2. Methods to access the database and retrieve specific sequences

3. Methods to compare sequences and determine their degree of sequence similarity

4. Methods and knowledge to deduce evolutionary relationships between proteins and to hypothesize structure-function relationships

Types of Databases (DB)

• Primary databases
  – Sequence data ("raw data")
    • DNA/RNA
    • Proteins
• Secondary databases
  – Special properties
    • Protein families
    • Motifs
    • Conserved domains
    • Structures
    • Etc.

The major sequence databases: NCBI

• NCBI
  – National Center for Biotechnology Information (NCBI)
  – Comprises many different databases, e.g., Genbank, Protein DB, Genome DB, Structure DB, etc.
  – Also offers a large number and variety of bioinformatics tools
  – Via ENTREZ all databases at NCBI can be searched at once!
  – Most used: the "non-redundant database" of Genbank

Growth of Genbank

Genome sizes

Two other major sequence databases: EBI and DDBJ

• EBI (European Bioinformatics Institute), SIB & PIR
  – The universal protein resource
    • Uniprot/Swissprot (Proteins)
    • EMBL (Nucleotides)
    • Ensembl (Genomes)
    • Interpro (Protein families, functional sites and domains)
    • PDB (Structures)
• DDBJ
  – DNA databank of Japan
    • DNA database
    • Analysis tools

International Nucleotide Sequence Database Collaboration (INSDC)

All 3 main databases are connected and frequently exchange their data.

FASTA format (named after the FASTA, i.e. "FAST-All", alignment program)

>Q86548|Q86548_BHV4 Thymidine kinase - Bovine herpesvirus 4

MAGTSIPGKCEDCFTCESYSFNFESDSGFSCEDVAQYRDLSQYRGLPQLNTCHERHQQDI

YERDNVYEATFFTPPREKLPSITALSESFNNLSFTDGTTTSPDWPAKPAPAHLLYPRPPT

GYTDAYFIFFEGVMGVGKTTLLKTLARNDMNNVIIFEEAMNYWKTVFSDCHKLIYDVMKM

GKHGSFSVSSKLLACQMKFLTPLKSLGRVTAMFSGKEMQRSPGRSDVSHWVMFDRHALSA

CLVFPLVMLKSGMLSFEHFISLVSTFRANEGDIIVLIALDHGEAIRRIKTRKRPEEETIS

LTYVIRLHWCFVAVYNTWRLLQYFTAREATEVCLDIKTINGLCLSKDICYQKCLEFQTLW

NESLLAVLRDIIFPYKSDCTILELCLNVCTQIQKLKFVVSDASKYIGDVRGLWEDISTQI

LQSRVIKTRPVDWANLKSLAKEFCG

A universal computer-readable sequence format: The FASTA format

The FASTA sequence format is divided into a header and lines of sequence data.

[In Genbank searches at NCBI, sequences can simply be copied and pasted.]

Header:

The sequence in FASTA format begins with a single-line description. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier.

Sequence:

The header is followed by the sequence in 1-letter code for both protein and nucleotide sequences. It is recommended that all lines of text be shorter than 80 characters. The sequence ends when another line starting with ">" appears; this indicates the start of another sequence.
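The header and sequence rules described here can be sketched as a small parser. This is a minimal illustration and not part of the lecture: the function name parse_fasta and the tuple layout of the result are assumptions made for this example, and the sample record contains only the first 60 residues of the thymidine kinase entry, split over two lines to show that sequence lines are joined.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into (identifier, description, sequence) tuples.

    Hypothetical helper for illustration; real projects would typically
    use an established library instead.
    """
    records = []
    identifier = None
    description = ""
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # tolerate blank lines between sequence lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if identifier is not None:
                records.append((identifier, description, "".join(chunks)))
            # First word after ">" is the identifier, the rest the description.
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        records.append((identifier, description, "".join(chunks)))
    return records


# Truncated sample: only the first 60 residues of the example record.
example = """>Q86548|Q86548_BHV4 Thymidine kinase - Bovine herpesvirus 4
MAGTSIPGKCEDCFTCESYSFNFESDSGFS
CEDVAQYRDLSQYRGLPQLNTCHERHQQDI
"""

for identifier, description, sequence in parse_fasta(example):
    print(identifier, "|", description, "|", len(sequence), "residues")
```

Both the identifier and the description may be empty, since the format treats them as optional.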

How to estimate whether two proteins are similar and share the same evolutionary origin?

An ancient peptide P1:

P1: ALIIVAGGTGSSAG

What can happen during evolution to a peptide/protein?

Amino acid exchanges,

a) moderate:

P1: ALIIVAGGTGSSAG
P2: ALIGVAGGTGSSAG

b) strong:

P1: ALIIVAGGTGSSAG
P3: ALPGVAGGTGSSAG

Amino acid insertions:

P1: ALIIVAGGT-GSSAG
P4: ALPGVAGGTRGSSAG

Amino acid deletions:

P1: ALIIVAGGT-GSSAG
P5: ALPGVAGGTRG--AG

Peptide duplications:

P1: ALIIVAGGT----------GSSAG
P6: ALPGVAGGTALPGVAGGTRG--AG
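One simple way to quantify how similar two aligned peptides are is to count identical positions, exchanges and gaps column by column. The sketch below is an illustration only: the function names are hypothetical, and expressing identity as a percentage of the full alignment length is just one of several conventions in use.

```python
def compare_aligned(seq_a, seq_b, gap="-"):
    """Count (matches, mismatches, gap columns) for two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            gaps += 1        # insertion or deletion in one of the sequences
        elif a == b:
            matches += 1     # identical residue
        else:
            mismatches += 1  # amino acid exchange
    return matches, mismatches, gaps


def percent_identity(seq_a, seq_b):
    """Identical columns as a percentage of the alignment length (assumed convention)."""
    matches, _, _ = compare_aligned(seq_a, seq_b)
    return 100.0 * matches / len(seq_a) if seq_a else 0.0


# Aligned peptide pairs taken from the examples above.
pairs = {
    "P1/P2": ("ALIIVAGGTGSSAG", "ALIGVAGGTGSSAG"),
    "P1/P3": ("ALIIVAGGTGSSAG", "ALPGVAGGTGSSAG"),
    "P1/P4": ("ALIIVAGGT-GSSAG", "ALPGVAGGTRGSSAG"),
    "P1/P5": ("ALIIVAGGT-GSSAG", "ALPGVAGGTRG--AG"),
}

for name, (a, b) in pairs.items():
    print(name, compare_aligned(a, b), round(percent_identity(a, b), 1))
```

Counting identities treats every exchange as equally severe; it does not capture the distinction between moderate and strong exchanges.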