当前位置:文档之家› Running title Protein Function Prediction by Domain Composition

Running title Protein Function Prediction by Domain Composition

Running title Protein Function Prediction by Domain Composition
Running title Protein Function Prediction by Domain Composition

Bioinformatics Advance Access published February 19, 2004

Yeast Function Prediction Cai & Doig1 Prediction of Saccharomyces Cerevisiae Protein Functional Class from

Functional Domain Composition

Yudong Cai & Andrew J. Doig*

Department of Biomolecular Sciences, UMIST, P.O. Box 88, Manchester M60 1QD, UK.

* To whom correspondence should be addressed: Phone: +44-161-2004224, Fax: +44-161-2360409, e-mail: Andrew.Doig@https://www.doczj.com/doc/2e9000999.html,

Running title: Protein Function Prediction by Domain Composition

Key Words: yeast, functional genomics, Support Vector Machines, InterPro, nearest neighbour; InterPro

Abbreviations:NNA, Nearest Neighbour Algorithm; SVM, Support Vector Machine; ORF, Open Reading Frame

Copyright (c) 2004 Oxford University Press

ABSTRACT

Motivation: A key goal of genomics is to assign function to genes, especially for orphan sequences.

Results:We compared the clustered functional domains in the SBASE database to each protein sequence using BLASTP. This representation for a protein is a vector, where each of the non-zero entries in the vector indicates a significant match between the sequence of interest and the SBASE domain. The machine learning methods Nearest Neighbour Algorithm and Support Vector Machines are used for predicting protein functional classes from this information. We find that the best results are found using the SBASE-A database and the Nearest Neighbour Algorithm, namely 72% accuracy for 79% coverage. We tested assigning function based on searching for InterPro sequence motifs and by taking the most significant BLAST match within the dataset. We applied the functional domain composition method to predict the functional class of 2018 currently unclassified yeast ORFs. Availability: A program for the prediction method, called FCPFD (Functional Class Prediction based on Functional Domains), is available by contacting Y.D. Cai at

y.cai@https://www.doczj.com/doc/2e9000999.html,.

INTRODUCTION

The most widely used methods for predicting the function of a new protein sequence involve sequence alignment, either global (over the entire sequence) or local (as in BLAST or FASTA) (Altschul, et al., 1997, Pearson, 1996). Advanced sequence comparison methods (e.g. PSI-Blast (Altschul, et al., 1997)) have pushed the level of sequence identity required to infer homology down to below 20%. The identification of sequence motifs or pattern matching tools is another powerful method to predict function. For example, the PRINTS (Attwood, et al., 2000), BLOCKS (Henikoff, et al., 1999), Pfam (Bateman, et al., 2000), ProDom (Corpet, et al., 1999), SMART (Ponting, et al., 1999), Domo (Gracy and Argos, 1998), Identify (Nevill-Manning, et al., 1998), PROF-PAT (Bachinsky, et al., 2000) and PROSITE (Hofmann, et al., 1999) databases can be used to search an unknown sequence for hundreds of known motifs. InterPro integrates numerous resources for protein families, domains and functional sites (Apweiler, et al., 2000). Correlations between short signals and functional annotations in a protein database can be used (Perez, et al., 2002). Alternatively, structure can first be predicted and then used to infer function (Skolnick and Fetrow, 2000), such as by searching for three-dimensional structural templates of sequence motifs (Kasuya and Thornton, 1999). Even a low level of structure prediction, such as an assignment into a broad fold class, is useful in function prediction, since fold classes show trends for particular functions (Hegyi and Gerstein, 1999).

More recent methods go beyond sequence matching (Eisenberg, et al., 2000, Marcotte, 2000, Pellegrini, 2001). A phylogenetic profile shows the pattern of presence or absence of a protein between genomes. Proteins that have the same phylogenetic profiles can be inferred to be functionally related (Pellegrini, et al., 1999). Protein-protein interactions can be predicted by searching for interacting pairs of proteins that are fused to a single protein chain in another organism (Marcotte, et al., 1999), as such pairs are often functionally related (the Rosetta Stone method). The gene neighbour method uses the observation that if the genes that encode two proteins are close on a chromosome, the proteins tend to be

functionally related (Dandekar, et al., 1999, Overbeek, et al., 1999). While such methods can be powerful (Marcotte, et al., 1999), they are limited to cases where their presence varies between genomes, gene fusion has occurred or a set of gene neighbours are observed, and so are not applicable to all cases. More general methods are therefore often needed. Predictions of human protein functional classes and enzyme categories were made with neural networks using sequence properties such as phosphorylation sequences, number of charged residues, predicted secondary structure and average hydrophobicity (Jensen, et al., 2002). Functional class can also be predicted using amino acid and amino acid pair composition (Morrison, et al., 2003). Data mining prediction methods devise rules in the form of IF… THEN statements that make predictions of function using sequence based attributes, predicted secondary structure and sequence similarity. These have been shown to give accurate predictions for a limited number of sequences in the E. coli and M. tuberculosis genomes, sometimes in the absence of homology (King, et al., 2000, King, et al., 2001).

The major difficulty in this field is finding any information for the large number of genes that are so distant from any known sequence that alignment methods cannot offer any prediction. Traditional sequence alignment and pattern matching approaches are generally not capable of detecting functional and structural homologues when sequence identity falls below about 20%. There is therefore a great need for methodologies capable of predicting protein function within this twilight zone and beyond. This situation is not uncommon: for example, there are no apparent orthologues for 1/3 of the yeast genome (≈2000 proteins)

(http://mips.gsf.de/proj/yeast/catalogues/) and ≈15,000 proteins of totally unknown function within the human genome (http://www.genome.ad.jp/dbget-bin/get_htext?H.sapiens.kegg). Here we use functional domain composition to predict protein function. These approaches are not intended to replace existing profile or homology-based search protocols, rather to use alternative data/patterns in the sequence data to go beyond traditional approaches.

The yeast Saccharomyces cerevisiae was the first eukaryotic organism to have its entire genome sequenced (Goffeau, et al., 1996, Mewes, et al., 1997, Zagulski, et al., 1998).

The challenge now is identification of the Open Reading Frames (ORFs) and complete functional assignment of these coding regions. Experimental assignment of these ORFs, such as the EUROFAN project (Dujon, 1998), usually takes the form of deletion studies. The ORF of interest is deleted or its expression suppressed. Its function is then inferred from the resulting phenotype under different environmental conditions, for example by withdrawing certain nutrients from the growth media. Alternatively, analysis of cellular mRNA levels can reveal co-expression of certain ORFs under certain environmental conditions. Co-expression of two or more ORFs implies that they function as part of the same cellular pathway. These methods are both time-consuming and expensive.

Here, we present a method for predicting the functions of yeast ORF functions from functional domain composition. We believe this will be a key tool for increasing the throughput of experimental studies. The results of the method will act as a guide to the possible functions an ORF can adopt, whilst virtually eliminating some functions altogether. Experimentalists may therefore approach functional assignment experiments already knowing the most likely and the least likely functions of all ORFs in the yeast genome.

METHODS

Dataset

The MIPS database contains 6294 ORFs of Saccharomyces cerevisiae. The aim of this project is to assign a protein into one of 13 functional classes (Metabolism; Energy; Cell Growth; Transcription; Protein Syntheses; Protein Destination; Transport Facilitation; Intracellular Transport; Cellular Biosynthesis; Signal Transduction; Cell Rescue; Cellular Homeostasis and Cellular Organisation), as defined using the MIPS classification system (Mewes, et al., 1999). MIPS assignments were taken from the May 2001 release. A non-homologous data set was generated using CLUSTAL-W (Thompson, et al., 1994) to leave no two proteins having greater than 20% sequence identity. 3484 sequences of known function

were reduced to 3010 sequences in the training set (listed in Supplementary Information, Table 1).

Functional Domain Composition

We use SBASE, a collection of around 300,000 annotated structural, functional, ligand-binding and topogenic segments of proteins collected from the literature, protein sequence databases and genomic databases (Vlahovicek, et al., 2002). The protein domains are defined by their sequence boundaries given by the publishing authors or in one of the primary sequence databases (Swiss-Prot, PIR, TREMBL etc.). Domain groups are included if they have well defined sequence boundaries, and if they can be distinguished from other sequences using a similarity search technique. The SBASE database uses a set theoretical approach for representing similarities. Sequences are considered similar if they are members of a similarity group in which all or most sequences are similar to each other and less similar to other members of the database. SBASE-A and SBASE-B Release 9.0 were used

(ftp://ftp.icgeb.trieste.it/pub/SBASE (Vlahovicek, et al., 2002)). There are 2425 domain types in SBASE-A (The consolidated domain collection. Release 9.0) and 739 domain types in SBASE-B (Miscellaneous experimental domain groups. Release 9.0).

We use a set of discrete numbers as a vector to define the native functional domains within a protein sequence. With each of the 2425 or 739 domains as a vector-base, a protein can be defined as a 2425-D or 739-D (dimensional) vector according to the following procedure:

(1)Use BLASTP to compare a protein sequence with each of the 300000 sequences grouped

in 2425 domain sequences in SBASE-A or 739 domain sequences in SBASE-B to find the high-scoring segment pairs (HSPs) and the smallest sum probability (P).

(2)If the HSP score >> 70 and P < 0.8 for SBASE-A, or the HSP score >> 30 and P < 1.0 for

SBASE-B in comparing the protein sequence with the i-th domain sequence, then the i-th

component of the protein in the 2425-D or 739-D space is assigned the HSP score;otherwise, it is assigned a value of zero.

(3) The vector for each protein sequence can thus be explicitly formulated as

X = (X1, X2 (2425)

where

X1 =HSP score, when HSP score >> 75 (30) and P < 0.8 (1.0)

0, otherwise

Nearest Neighbour Algorithm (NNA)

The Nearest Neighbour Algorithm (NNA) (Cover and Hart, 1967, Friedman , et al.,1975) can be used particularly in the situations when the distributions of the patterns and the categories of the patterns are unknown. NNA classifies the new patterns into their class membership by comparing the features of the unknown new patterns with the features of the patterns that have already been classified. The approach will weight heavily the evidence derived from the nearby patterns. It is attractive because it is simple to implement and has a low probability of error.

Consider a set of patterns x 1, x 2… x n which have been classified into categories 1,

2

… n , from which an unknown sample x can be classified into those categories using the

NNA. Firstly, the nearest neighbour of x can be defined as

nn(x) = x i

where d(x, x i ) = min k =1n

d (x , x k )

The nearest neighbour algorithm chooses to classify x to the category χj ∈ {1, 2…

m

} if its nearest neighbour also belongs to the category χj ∈ {1, 2… m }. It can be

expressed as:

If nn(x) = x

i and x

i

∈χj∈ {1, 2… m}

Then x ∈χ

j

The score is defined as:

d(x, x

i ) = 1 – (x x

i

)/(x x

i

).

We rank the prediction categories of each protein into {

1

,

2

m

} if and only if

min(d(x,x

i )(x

i

1

))≤min(d(x,x

i

)(x

i

2

))≤ … ≤min(d(x,x

i

)(x

i

m

))

Here we use “≤” because some proteins belong to more than one category (they are multifunctional) or some proteins whose functions are different share the same domains.

For each protein I, in the dataset of the N function-known proteins, we create a vector X(I, J) (I=1…N, J=1…M) by searching the domain database (M domain types). For a function-unknown protein N+1, we determine its vector X(N+1, J) (J=1…M) by the domain database search. We then calculate the similarity score between protein I and each function-known protein:

Score(N+1,I)=1?(X(I,J).X(N+1,J))/(X(I,J)X(N+1,J))

I=1…N, J=1…M

We rank the prediction classes of protein I into {Class(1), Class(2)… Class(m)} (m is the number of functional classes) if and only if

min(Score(N+1,I

i )I

i

∈Class(1)≤min(Score(N+1,I

i

)I

i

∈Class(2)≤...

≤Score(N+1,I

i )I

i

∈Class(m)))

Two examples of the results are found in Tables 1 and 2.

Support Vector Machines (SVMs)

Support Vector Machines (SVMs) are a class of learning machines based on statistical learning theory. The basic idea of applying SVM to pattern classification can be stated as follows: First, map the input vectors into one feature space (possible with a higher dimension), either linearly or non-linearly, which is relevant to the selection of the kernel function. Then, within the feature space from the first step, seek an optimized linear division.

i.e. construct a hyperplane which separates two classes. (This can be extended to more than two classes). SVM training always seeks a global optimized solution and avoids over-fitting,so it has the ability to deal with a large number of features. A complete description of the theory of SVMs for pattern recognition is in Vapnik's book (Vapnik, 1998). SVMs have been used in a range of problems including drug design (Burbridge , et al., 2000), image recognition and text classification (Joachims, 1998).

In this paper, we apply Vapnik's Support Vector Machine (Vapnik, 1995) for predicting the protein functional class. We downloaded the SVMlight program which is an implementation (in C Language) of SVM for the problem of pattern recognition. The optimization algorithm used in SVMlight has been described (Joachims, 1999, Joachims,1999). The code has been used in text classification and image recognition (Joachims, 1998).

Suppose we are given a set of samples, i.e., a series of input vectors X i ∈R d (i =1,...,N )

with corresponding labels y i ∈{+1,?1}(i =1,...,N ).

where –1 and +1 are used to stand respectively for the two classes. The goal here is to construct one binary classifier or derive one decision function which has a small probability of misclassifying a future sample (from the available samples). Both the basic linear separable case and the most useful linear non-separable case for most real life problems are considered here:

The linearly separable case

In this case, there exists a separating hyper plane whose function is 0=+?b X W v

r ,which implies:

y i (r W ?r

x i +b )≥1,i =1,...,N By minimizing

12r W 2

subject to this constraint, the SVM approach tries to find a

unique separating hyperplane. Here 2r

w is the Euclidean norm of w r , which maximizes the

distance between the hyper plane (Optimal Separating Hyperplane or OSH (Cortes and Vapnik, 1995) and the nearest data points of each class. The classifier is called the largest margin classifier. By introducing Lagrange multipliers i , using the Karush-Kuhn-Tucker conditions and the Wolfe dual theorem of optimization theory, the SVM training procedure amounts to solving the following convex QP problem:

Max :i ?12i =1n ∑i j j =1n

∑i =1n ∑?y i y j ?r X

i ?r X j subject to the following two conditions:i ≥0

i i y =0,i =1,...,N i =1

N

∑The solution is a unique globally optimized result having the following expansion: r W =i y i i =1

N ∑i ?r x Only if the corresponding i >0, are these x i r

called Support Vectors. When a SVM is trained, the decision function can be written as:

f (r x )=sgn(y i i i ?r x ?v x i =1N ∑+b )where ()sgn in the above formula is the given sign function.

The linearly non-separable case

Two important techniques needed for this case are given respectively as below:(i)

“soft margin” technique.

In order to allow for training errors, Cortes and Vapnik (Cortes and Vapnik, 1995)introduced slack variables:

N

i i

,...,1,0=>

The relaxed separation constraint is given as:

i y (w ?i r x +b )≥1?i ,),...,1(N i =The OSH can be found by minimizing 122w +C i i =1N ∑instead of

1

2

2w for the above two constraints.Where C is a regularization parameter used to decide a trade- off between the training error and the margin.

(ii) “kernel substitution” technique

SVM performs a nonlinear mapping of the input vector x from the input space d

R into a higher dimensional Hilbert space, where the mapping is determined by the kernel function. Then as in the linearly separable case, it finds the OSH in the space H corresponding to a non-linear boundary in the input space.

Two typical kernel functions are listed below:

K (i r x ,j r x )=d

(i r x ?j r x +1)

K (i r x ,j r x )=exp(?r 2i r x ?j r x )Where the first one is called the polynomial kernel function of degree d which will eventually revert to the linear function when d =1, the latter one is called the RBF (Radial Basic Function) kernel. Finally, for the selected kernel function, the learning task amounts to solving the following QP problem,

Max :i ?12i =1N ∑i j j =1N

∑i =1N ∑?y i y j ?K (r X

i ?r X j )subject to:0≤i a ≤C

i i y =0,i =1,...,N i =1

N

∑The form of the decision function is

f (r x )=sgn(y i i ?K r x ,r

x

i ()

i =1N ∑+b )

For a given data set, only the kernel function and the regularity parameter C must be selected to specify one SVM. In this work we used the radial basis function and set C to 1000.

InterPro

An alternative to SBASE would be to use InterPro (Apweiler , et al., 2000) to search the entire training set for hits to motifs in PROSITE, PRINTS, ProDom, Pfam, SMART,TIGRFAMs, SWISS-PROT and TrEMBL. We therefore submitted the entire set of yeast ORFs to InterPro. We found that 1662 ORFs in the unknown protein MIPS class had hits from InterPro, while there were 2018 hits in this class for SBASE-A. Hence SBASE was more useful in this case.

We used InterPro to predict functional class as follows: We used InterPro release 5.2(Mulder , et al., 2003), built from Pfam 7.3, PRINTS 33.0, PROSITE 17.5, ProDom 2001.3,SMART 3.1, TIGRFAMs 1.2 and the current SWISS-PROT and TrEMBL data. This release of InterPro contains 5875 entries, representing 1272 domains, 4491 families, 97 repeats and 15 post-translational modification sites. Each ORF is coded as a 5875 dimensional vector with an entry of 1 if there is a hit with a domain, otherwise, 0. This is the input vector for the NNA, implemented as above.

BLAST

We used BLAST to predict functional class as follows: A BLAST search for the 3010ORFs in the training set was performed by searching each sequence against the remaining

3009 ORFs. The functional class of the best match was taken as the prediction, regardless of any HSP score cut-off.

RESULTS

Jackknife test using Nearest Neighbour Algorithm

We examined the prediction quality by the jackknife test, a leave-one-out cross validation. During the process of the jackknife test, the training and testing datasets are open, and a protein will in turn move from each to the other. The function domain composition of each protein is input to the NNA. Examples of the results are shown in Tables 1 and 2. From Table 2, we can see this method can give some false positives (i.e. YAL020C scores equally highly for classes 3, 4, 8 and 13, though only 13 is correct). This is because some different functional classes share the domains.

Insert Tables 1 and 2

For the total number of 3010 ORFs, the results are shown in Tables 3, 4, 5 and 6. The complete set of results for SBASE-A are in Supplementary Information, Table 2. For a significant proportion of the ORFs (21% for SBASE-A and 8% for SBASE-B) no prediction can be made. These are 435 sequences that have no hits to any motif in SBASE-A and 194 orphan proteins that have no common domains with any other sequence, leaving 2381 sequences for which predictions can be made. If a prediction can be made it is usually correct. Table 4 subdivides the results by functional class, showing considerable variation, depending on the frequency of SBASE motifs in each class. Transport facilitation is most accurately predicted at 90% accuracy using SBASE-A, while Cellular Biogenesis much worse than any other functional class at 15% using SBASE-A. Our overall success rate of 72% accuracy for 79% coverage compares very well to a random guess (19%, given that the mean number of class assignments is 2.4), especially given that homologous sequences have been removed from the training set. According to the yeast genome database

(http://mips.gsf.de/proj/yeast/CYGD/db/index.html) of December 2002, 3289/6317 (52%) of

proteins are of known function, while 1046/6317 (17%) can be assigned a function based on a FASTA similarity to a known protein. This leaves 1982/6317 (32%) of unknown function. Our success rates therefore compare favourably to existing methods.

Insert Tables 3, 4, 5 and 6

Jackknife test using Support Vector Machines

For the SVMs, we chose a Gaussian Radial Basis Function as the kernel function. The width of the Gaussian RBFs (in this paper, we use the default value in SVMlight ) is selected as that which minimized an estimate of the VC-dimension. The parameter C that controls the error-margin tradeoff is set at 1000. The function domain composition of each protein is input to the SVMs. After being trained, the hyperplane output by the SVMs was obtained. This indicates that the trained model, i.e. hyperplane output which includes the important information, is able to identify the protein functional classes. The SVM method applies to two-class problems. In this paper, for the 13-class problems, we use a simple and effective method: “all-versus-all” method (Ding and Dubchak, 2001) to transfer it into a two-class problem. We used leave-one-out cross-validation (jackknife test). 435 sequences have no matches to any SBASE-A sequences so 435/3010 = 14.5% of sequences have no prediction. For the remaining 2575 the success rates are given in Table 7, namely 36% for 86% coverage.

Insert Table 7

The results in Tables 3, 5 and 7 show that the nearest neighbour algorithm is better than the SVMs, despite being computationally much simpler. This may be because the nature of the large and complicated training set is a particular problem for an SVM. The SBASE-A search is better than SBASE-B search. This is probably because the 2425 domains in SBASE-A are long enough to avoid false positive hits when using a BLAST search. As the 739 domains in SBASE-B are quite short, false positive hits from a BLAST search are likely. The false positives are noise for the prediction. Our previous work used amino acid composition and amino acid pair composition to assign MIPS functional class using the

SIMCA algorithm (Morrison, et al., 2003). The accuracy of that work was considerably lower (32%), though the coverage was higher (98%).

Functional Class Assignment using InterPro

We ran the 3010 protein sequences in the training set through InterPro, searching for sequence motifs. 2423 sequences had InterPro hits, leaving 587 that could not be assigned with this method. Table 8 shows the accuracy of the InterPro assignments. The overall success rate of 71% and the coverage (81%) are comparable to the SBASE-A NNA method (72% accuracy and 79% coverage). The pattern of success rate by class is similar with the two methods as class 9, cellular biogenesis, remained particularly problematic.

Functional Class Assignment using BLAST

We compared all of the 3010 protein sequences in the training set against each other, using no cut-off and taking the class of highest scoring sequence as the prediction, no matter how poor the alignment, giving the results shown in Table 9. Despite the training set containing only non-homologous sequences, the resulting poor alignments were still useful, giving an overall accuracy of 70%. This is still a little poorer than the SBASE-A NNA or the InterPro methods. However, the coverage is now 100%. The success rate by class is again similar to the previous methods.

Application to Unclassified Genes

The January 2002 release of the MIPS functional classification scheme lists 115 ORFs as Classification Not Yet Clear Cut and 2399 as Unclassified Proteins. We applied our functional domain composition method using the SBASE-A database and the NNA algorithm to predict the functions of these proteins. From these 2845, 827 cannot be predicted with this method as they have no matches to any functional domain. Predictions for the remaining 2018 ORFs are in the Table 3 in the Supplementary Information. While these sequences have no significant sequence identity to any ORFs of known function in the remainder of the

genome, this is also true for our training set. We therefore expect that these predictions will also have an accuracy of approximately 70%.

DISCUSSION

Our results show that the functional class of a protein in the s. cerevisiae genome is predictable with high accuracy and coverage. The development in statistical prediction of protein attributes generally consists of two aspects: one is to construct a training dataset and the other is to formulate a prediction algorithm. The latter can be further separated into two sub-sections: one is how to give a mathematical expression to effectively represent a protein sequence and the other is how to find an algorithm to accurately perform the prediction. Here, our training dataset contains 3010 non-homology ORFs which are refined from the whole yeast genome. The functional domain composition incorporates both the sequence-order information and the functional type information and should complement other methods for protein functional class prediction. Success rates and coverage for different methods are summarised in Table 10. The Nearest Neighbour algorithm gave a much better performance than Support Vector Machines and is much faster to run. The results using the SBASE-A domain database were slightly more accurate to InterPro and BLAST.

Insert Table 10

Our predictions for the previously unclassified proteins should be of great value to programs determining experimentally the function of yeast genes, notably the EUROFAN project (Dujon, 1998). The methodology we present should also be generally applicable to all genomes with functional classification schemes that can be used for training.

The program of the prediction method which use NNA, called FCPFD (Functional Class Prediction based on Functional Domains), is available by contacting Y.D. Cai at

y.cai@https://www.doczj.com/doc/2e9000999.html,.

Acknowledgements

This work was funded by the BBSRC (grant number 36/BIO14432).

REFERENCES

Altschul, S. F., Madden, T. L., Sch?ffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman,

D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database

search programs. Nuc. Acids Res., 25, 3389-402.

Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M. D. R., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A.,

Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N. J., Oinn, T. M., Pagni, M.,

Servant, F., Sigrist, C. J. A. and Zdobnov, E. M. (2000) InterPro—an integrated

documentation resource for protein families, domains and functional sites.

Bioinformatics, 16, 1145-1150.

Attwood, T. K., Croning, M. D. R., Flower, D. R., Lewis, A. P., Mabey, J. E., Scordis, P., Selley, J. and Wright, W. (2000) PRINT-S: the database formerly known as PRINTS.

Nuc. Acids Res., 28, 225-227.

Bachinsky, A. G., Frolov, A. S., Naumochkin, A. N., Nizolenko, L. P. and Yarigin, A. A.

(2000) PROF-PAT 1.3: updated database of patterns used to detect local similarities.

Bioinformatics, 16, 358-366.

Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L. and Sonnhammer, E. L. L.

(2000) The Pfam protein families database. Nuc. Acids Res., 28, 263-266. Burbridge, R., Trotter, M., Holden, S. and Buxton, B. (2000) Drug Design by Machine Learning: Support Vector Machine for Pharmaceutical Data Analysis. Proceedings of the AISB’00 Symposium on Artificial Intelligence in Bioinformatics, 1-4.

Corpet, F., Gouzy, J. and Kahn, D. (1999) Recent improvements of the ProDom database of protein domain families. Nuc. Acids Res., 27, 263-7.

Cortes, C. and Vapnik, V. (1995) Support vector networks. Machine Learning, 20, 273-293.

Cover, T. M. and Hart, P. E. (1967) Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13, 21-27.

Dandekar, T., Snel, B., Huynen, M. and Bork, P. (1999) Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci., 23, 324-328. Ding, C. H. Q. and Dubchak, I. (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17, 349-358.

Dujon, B. (1998) European Functional Analysis Network (EUROFAN) and the functional analysis of the Saccharomyces cerevisiae genome. Electrophoresis, 19, 617-624. Eisenberg, D., Marcotte, E. M., Xenarios, I. and Yeates, T. O. (2000) Protein function in the post-genomic era. Nature, 405, 823-826.

Friedman, J. H., Baskett, F. and Shustek, L. J. (1975) An Algorithm for Finding Nearest Neighbors. IEEE Transactions on Computers, C-24, 1000-1006.

Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J. D., Jacq, C., Johnston, M., Louis, E. J., Mewes, H. W., Murakami, Y.,

Philippsen, P., Tettelin, H. and Oliver, S. G. (1996) Life with 6000 genes. Science,

274, 546, 563-567.

Gracy, J. and Argos, P. (1998) Automated protein sequence database classification. I.

Integration of compositional similarity search, local similarity search, and multiple

sequence alignment. Bioinformatics, 14, 164-173.

Hegyi, H. and Gerstein, M. (1999) The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J. Mol. Biol., 288, 147-

64.

Henikoff, S., Henikoff, J. G. and Pietrokovski, S. (1999) Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics, 15, 471-9.

Hofmann, K., Bucher, P., Falquet, L. and Bairoch, A. (1999) The PROSITE database, its status in 1999. Nuc. Acids Res., 27, 215-9.

Jensen, L. J., Gupta, R., Blom, N., Devos, D., Tamames, J., Kesmir, C., Nielsen, H., Staerfeldt, H. H., Rapacki, K., Workman, C., Andersen, C. A. F., Knudsen, S., Krogh,

A., Valencia, A. and Brunak, S. (2002) Prediction of human protein function from

post-translational modifications and localization features. J. Mol. Biol., 319, 1257-

1265.

Joachims, T. (1999) Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning (B. Scholkopf, C. Burges and A. Smola, eds.),

169-184.

Joachims, T. (1998) Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine

Learning.

Joachims, T. (1999) Transductive Inference for Text Classification using Support Vector Machines. International Conference on Machine Learning (ICML).

Kasuya, A. and Thornton, J. M. (1999) Three-dimensional srtucture analysis of Prosite patterns. J. Mol. Biol., 286, 1673-1691.

King, R. D., Karwath, A., Clare, A. and Dehaspe, L. (2000) Accurate prediction of protein functional class in the M. tuberculosis and E. coli genomes using data mining. Yeast, 17, 283-293.

King, R. D., Karwath, A., Clare, A. and Dehaspe, L. (2001) The utility of different representations of protein sequences for predicting functional class. Bioinformatics,

17, 445-454.

Marcotte, E. M. (2000) Computational genetics: finding protein function by nonhomology methods. Curr. Op. Struct. Biol., 10, 359-365.

Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O. and Eisenberg, D.

(1999) Detecting protein function and protein-protein interactions from genome

sequences. Science, 285, 751-3.

Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O. and Eisenberg, D. (1999) A combined algorithm for genome-wide prediction of protein function. Nature, 402, 83-

86.

Mewes, H. W., Albermann, K., Bahr, M., Frishman, D., Gleissner, A., Hani, J., Heumann, K., Kleine, K., Maierl, A., Oliver, S. G., Pfeiffer, F. and Zollner, A. (1997) Overview of

the yeast genome. Nature, 387, Suppl 7-65.

Mewes, H. W., Jeumann, K., Kaps, A., Mayer, K., Pfeiffer, F., Stocker, S. and Frishman, D.

(1999) Mips: a database for genomes and protein sequences. Nuc. Acids Res., 27, 44-

48.

Morrison, R. G., Cai, Y., Doig, A. J. and Mortishire-Smith, R. J. , unpublished.

Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P., Bucher, P., Copley, R. R., Courcelle, E., Das, U.,

Durbin, R., Falquet, L., Fleischmann, W., Griffiths-Jones, S., Haft, D., Harte, N.,

Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lopez, R., Letunic, I., Lonsdale,

D., Silventoinen, V., Orchard, S.

E., Pagni, M., Peyruc, D., Ponting, C. P., Selengut, J.

D., Servant, F., Sigrist, C. J. A., Vaughan, R. and Zdobnov,

E. M. (2003) The InterPro

Database, 2003 brings increased coverage and new features. Nuc. Acids. Res., 31,

315-318.

Nevill-Manning, C. G., Wu, T. D. and Brutlag, D. L. (1998) Highly specific protein sequence motifs for genome analysis. Proc. Nat. Acad. Sci. U.S.A., 95, 5865-5871.

Oliver, S. G. (1997) EUROFAN's analysis of yeast gene function enters its second phase.

Genome Digest, 4.

Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G. D. and Maltsev, N. (1999) The use of gene clusters to infer functional coupling. Proc. Nat. Acad. Sci. U.S.A., 96, 2896-

2901.

Pearson, W. R. (1996) Effective protein sequence comparison. Meth. Enzymol., 266, 227-258.

蛋白质结构预测和序列分析软件

蛋白质结构预测和序列分析软件 2010-05-08 20:40 转载自布丁布果 最终编辑布丁布果 4月18日 蛋白质数据库及蛋白质序列分析 第一节、蛋白质数据库介绍 一、蛋白质一级数据库 1、 SWISS-PROT 数据库 SWISS-PROT和PIR是国际上二个主要的蛋白质序列数据库,目前这二个数据库在EMBL和GenBank数据库上均建立了镜像 (mirror) 站点。 SWISS-PROT数据库包括了从EMBL翻译而来的蛋白质序列,这些序列经过检验和注释。该数据库主要由日内瓦大学医学生物化学系和欧洲生物信息学研究所(EBI)合作维护。SWISS-PROT 的序列数量呈直线增长。2、TrEMBL数据库: SWISS-PROT的数据存在一个滞后问题,即把EMBL的DNA序列准确地翻译成蛋白质序列并进行注释需要时间。一大批含有开放阅读框(ORF) 的DNA序列尚未列入SWISS-PROT。为了解决这一问题,TrEMBL(Translated EMBL) 数据库被建立了起来。TrEMBL也是一个蛋白质数据库,它包括了所有EMBL库中的蛋白质编码区序列,提供了一个非常全面的蛋白质序列数据源,但这势必导致其注释质量的下降。 3、PIR数据库: PIR数据库的数据最初是由美国国家生物医学研究基金会(National Biomedical Research Foundation, NBRF)收集的蛋白质序列,主要翻译自GenBank的DNA序列。 1988年,美国的NBRF、日本的JIPID(the Japanese International Protein Sequence Database 日本国家蛋白质信息数据库)、德国的MIPS(Munich Information Centre for Protein Sequences摹尼黑蛋白质序列信息中心)合作,共同收集和维护PIR数据库。PIR根据注释程度(质量)分为4个等级。4、 ExPASy数据库: 目前,瑞士生物信息学研究所(Swiss Institute of Bioinformatics, SIB)创建了蛋白质分析专家系统(Expert protein analysis system, ExPASy )。涵盖了上述所有的数据库。网址:https://www.doczj.com/doc/2e9000999.html, 我国的北京大学生物信息中心(https://www.doczj.com/doc/2e9000999.html,) 设立了ExPASy的镜像(Mirror)。 主要蛋白质序列数据库的网址 SWISS-PROT https://www.doczj.com/doc/2e9000999.html,/sprot 或 https://www.doczj.com/doc/2e9000999.html,/expasy_urls.html TrEMBL https://www.doczj.com/doc/2e9000999.html,/sprot PIR https://www.doczj.com/doc/2e9000999.html,/pirwww MIPS——Munich Information Centre for Protein Sequences http://mips.gsf.de/ JIPID——the Japanese International Protein Sequence Database 已经和PIR合并 ExPASy https://www.doczj.com/doc/2e9000999.html, 二、蛋白质结构数据库 1、PDB数据库:

蛋白质结构预测在线软件

蛋白质预测在线分析常用软件推荐 蛋白质预测分析网址集锦 物理性质预测: Compute PI/MW http://expaxy.hcuge.ch/ch2d/pi-tool.html Peptidemasshttp://expaxy.hcuge.ch/sprot/peptide-mass.html TGREASE ftp://https://www.doczj.com/doc/2e9000999.html,/pub/fasta/ SAPS http://ulrec3.unil.ch/software/SAPS_form.html 基于组成的蛋白质识别预测 AACompIdent http://expaxy.hcuge.ch ... htmlAACompSim http://expaxy.hcuge.ch/ch2d/aacsim.html PROPSEARCH http://www.e mbl-heidelberg.de/prs.html 二级结构和折叠类预测 nnpredict https://www.doczj.com/doc/2e9000999.html,/~nomi/nnpredict Predictprotein http://www.embl-heidel ... protein/SOPMA http://www.ibcp.fr/predict.html SSPRED http://www.embl-heidel ... prd_info.html 特殊结构或结构预测 COILS http://ulrec3.unil.ch/ ... ILS_form.html MacStripe https://www.doczj.com/doc/2e9000999.html,/ ... acstripe.html 与核酸序列一样,蛋白质序列的检索往往是进行相关分析的第一步,由于数据库和网络技校术的发展,蛋白序列的检索是十分方便,将蛋白质序列数据库下载到本地检索和通过国际互联网进行检索均是可行的。 由NCBI检索蛋白质序列 可联网到:“http://www.ncbi.nlm.ni ... gi?db=protein”进行检索。 利用SRS系统从EMBL检索蛋白质序列 联网到:https://www.doczj.com/doc/2e9000999.html,/”,可利用EMBL的SRS系统进行蛋白质序列的检索。 通过EMAIL进行序列检索 当网络不是很畅通时或并不急于得到较多数量的蛋白质序列时,可采用EMAIL方式进行序列检索。 蛋白质基本性质分析 蛋白质序列的基本性质分析是蛋白质序列分析的基本方面,一般包括蛋白质的氨基酸组成,分子质量,等电点,亲水性,和疏水性、信号肽,跨膜区及结构功能域的分析等到。蛋白质的很多功能特征可直接由分析其序列而获得。例如,疏水性图谱可通知来预测跨膜螺旋。同时,也有很多短片段被细胞用来将目的蛋白质向特定细胞器进行转移的靶标(其中最典型的

蛋白质结构预测和序列分析软件

蛋白质结构预测和序列分析软件蛋白质数据库及蛋白质序列分析 第一节、蛋白质数据库介绍 一、蛋白质一级数据库 1、 SWISS-PROT 数据库 SWISS-PROT和PIR是国际上二个主要的蛋白质序列数据 库,目前这二个数据库在EMBL和GenBank数据库上均建 立了镜像 (mirror) 站点。 SWISS-PROT数据库包括了从EMBL翻译而来的蛋白质序 列,这些序列经过检验和注释。该数据库主要由日内瓦大 学医学生物化学系和欧洲生物信息学研究所(EBI)合作维 护。SWISS-PROT的序列数量呈直线增长。 2、TrEMBL数据库: SWISS-PROT的数据存在一个滞后问题,即 进行注释需要时间。一大批含有开放阅读 了解决这一问题,TrEMBL(Translated E 白质数据库,它包括了所有EMBL库中的 质序列数据源,但这势必导致其注释质量 3、PIR数据库: PIR数据库的数据最初是由美国国家生物医学研究基金 会(National Biomedical Research Foundation, NBRF) 收集的蛋白质序列,主要翻译自GenBank的DNA序列。 1988年,美国的NBRF、日本的JIPID(the Japanese International Protein Sequence Database日本国家蛋 白质信息数据库)、德国的MIPS(Munich Information Centre for Protein Sequences摹尼黑蛋白质序列信息 中心)合作,共同收集和维护PIR数据库。PIR根据注释 程度(质量)分为4个等级。 4、 ExPASy数据库: 目前,瑞士生物信息学研究所(Swiss I 质分析专家系统(Expert protein anal 据库。 网址:https://www.doczj.com/doc/2e9000999.html, 我国的北京大学生物信息中心(www.cbi.

蛋白质结构预测在线软件

蛋白质预测分析网址集锦? 物理性质预测:? Compute PI/MW?? ?? SAPS?? 基于组成的蛋白质识别预测? AACompIdent???PROPSEARCH?? 二级结构和折叠类预测? nnpredict?? Predictprotein??? SSPRED?? 特殊结构或结构预测? COILS?? MacStripe?? 与核酸序列一样,蛋白质序列的检索往往是进行相关分析的第一步,由于数据库和网络技校术的发展,蛋白序列的检索是十分方便,将蛋白质序列数据库下载到本地检索和通过国际互联网进行检索均是可行的。? 由NCBI检索蛋白质序列? 可联网到:“”进行检索。? 利用SRS系统从EMBL检索蛋白质序列? 联网到:”,可利用EMBL的SRS系统进行蛋白质序列的检索。? 通过EMAIL进行序列检索?

当网络不是很畅通时或并不急于得到较多数量的蛋白质序列时,可采用EMAIL方式进行序列检索。? 蛋白质基本性质分析? 蛋白质序列的基本性质分析是蛋白质序列分析的基本方面,一般包括蛋白质的氨基酸组成,分子质量,等电点,亲水性,和疏水性、信号肽,跨膜区及结构功能域的分析等到。蛋白质的很多功能特征可直接由分析其序列而获得。例如,疏水性图谱可通知来预测跨膜螺旋。同时,也有很多短片段被细胞用来将目的蛋白质向特定细胞器进行转移的靶标(其中最典型的例子是在羧基端含有KDEL序列特征的蛋白质将被引向内质网。WEB中有很多此类资源用于帮助预测蛋白质的功能。? 疏水性分析? 位于ExPASy的ProtScale程序(?)可被用来计算蛋白质的疏水性图谱。该网站充许用户计算蛋白质的50余种不同属性,并为每一种氨基酸输出相应的分值。输入的数据可为蛋白质序列或SWISSPROT数据库的序列接受号。需要调整的只是计算窗口的大小(n)该参数用于估计每种氨基酸残基的平均显示尺度。? 进行蛋白质的亲/疏水性分析时,也可用一些windows下的软件如,bioedit,dnamana等。? 跨膜区分析? 有多种预测跨膜螺旋的方法,最简单的是直接,观察以20个氨基酸为单位的疏水性氨基酸残基的分布区域,但同时还有多种更加复杂的、精确的算法能够预测跨膜螺旋的具体位置和它们的膜向性。这些技术主要是基于对已知

蛋白质结构预测方法综述

蛋白质结构预测方法综述 卜东波陈翔王志勇 《计算机不能做什么?》是一本好书,其中文版序言也堪称佳构。在这篇十余页的短文中,马希文教授总结了使用计算机解决实际问题的三步曲,即首先进行形式化,将领域相关的实际问题抽象转化成一个数学问题;然后分析问题的可计算性;最后进行算法设计,分析算法的时间和空间复杂度,寻找最优算法。 蛋白质空间结构预测是很有生物学意义的问题,迄今亦有很多的工作。有意思的是,其中一些典型工作恰恰是上述三步曲的绝好示例,本文即沿着这一路线作一总结,介绍于后。 1 背景知识 生物细胞种有许多蛋白质(由20余种氨基酸所形成的长链),这些大分子对于完成生物功能是至关重要的。蛋白质的空间结构往往决定了其功能,因此,如何揭示蛋白质的结构是非常重要的工作。 生物学界常常将蛋白质的结构分为4个层次:一级结构,也就是组成蛋白质的氨基酸序列;二级结构,即骨架原子间的相互作用形成的局部结构,比如alpha螺旋,beta片层和loop区等;三级结构,即二级结构在更大范围内的堆积形成的空间结构;四级结构主要描述不同亚基之间的相互作用。 经过多年努力,结构测定的实验方法得到了很好的发展,比较常用的有核磁共振和X光晶体衍射两种。然而由于实验测定比较耗时和昂贵,对于某些不易结晶的蛋白质来说不适用。相比之下,测定蛋白质氨基酸序列则比较容易。因此如果能够从一级序列推断出空间结构则是非常有意义的工作。这也就是下面的蛋白质折叠问题: 1蛋白质折叠问题(Protein Folding Problem) 输入: 蛋白质的氨基酸序列

输出: 蛋白质的空间结构 蛋白质结构预测的可行性是有坚实依据的。因为一般而言,蛋白质的空间结构是由其一级结构确定的。生化实验表明:如果在体外无任何其他物质存在的条件下,使得蛋白质去折叠,然后复性,蛋白质将立刻重新折叠回原来的空间结构,整个过程在不到1秒种内即可完成。因此有理由认为对于大部分蛋白质而言,其空间结构信息已经完全蕴涵于氨基酸序列中。从物理学的角度讲,系统的稳定状态通常是能量最小的状态,这也是蛋白质预测工作的理论基础。 2 蛋白质结构预测方法 蛋白质结构预测的方法可以分为三种: 同源性(Homology )方法:这类方法的理论依据是如果两个蛋白质的序列比较相似,则其结构也有很大可能比较相似。有工作表明,如果序列相似性高于75%,则可以使用这种方法进行粗略的预测。这类方法的优点是准确度高,缺点是只能处理和模板库中蛋白质序列相似性较高的情况。 从头计算(Ab initio ) 方法:这类方法的依据是热力学理论,即求蛋白质能量最小的状态。生物学家和物理学家等认为从原理上讲这是影响蛋白质结构的本质因素。然而由于巨大的计算量,这种方法并不实用,目前只能计算几个氨基酸形成的结构。IBM 开发的Blue Gene 超级计算机,就是要解决这个问题。 穿线法(Threading )方法:由于Ab Initio 方法目前只有理论上的意义,Homology 方法受限于待求蛋白质必需和已知模板库中某个蛋白质有较高的序列相似性,对于其他大部分蛋白质来说,有必要寻求新的方法。Threading 就此应运而生。 以上三种方法中,Ab Initio 方法不依赖于已知结构,其余两种则需要已知结构的协助。通常将蛋白质序列和其真实三级结构组织成模板库,待预测三级结构的蛋白质序列,则称之为查询序列(query sequence)。 3 蛋白质结构预测的Threading 方法 Threading 方法有三个代表性的工作:Eisenburg 基于环境串的工作、Xu Ying 的Prospetor 和Xu Jinbo 、Li Ming 的RAPTOR 。 Threading 的方法:首先取出一条模版和查询序列作序列比对(Alignment),并将模版蛋白质与查询序列匹配上的残基的空间坐标赋给查询序列上相应的残基。比对的过程是在我们设计的一个能量函数指导下进行的。根据比对结果和得到的查询序列的空间坐标,通过我们设计的能量函数,得到一个能量值。将这个操作应用到所有的模版上,取能量值最低的那条模版产生的查询序列的空间坐标为我们的预测结果。 需要指出的是,此处的能量函数却不再是热力学意义上的能量函数。它实质上是概率的负对数,即 ,我们用统计意义上的能量来代替真实的分子能量,这两者有大致相同的形式。 p E log ?=如果沿着马希文教授的观点看上述工作 ,则更有意思:Eisenburg 指出如果仅仅停留在简单地使用每个原子的空间坐标(x,y,z)来形式化表示蛋白质空间结构,则难以进一步深入研究。Eisenburg 创造性地使用环境串表示结构,从而将结构预测问题转化成序列串和环境串之间的比对问题;其后,Xu Ying 作了进一步发展,将蛋白质序列表示成一系列核(core )组成的序列,Core 和Core 之间存在相互作用。因此结构就表示成Core 的空间坐标,以及Core 之间的相互作用。在这种表示方法的基础上,Xu Ying 开发了一种求最优匹配的动态规划算法,得到了很好的结果。但是由于其较高的复杂度,在Prospetor2上不得不作了一些简化;Xu Jinbo 和Li Ming 很漂亮地解决了这个问题,将求最优匹配的过程表示成一个整数规划问题,并且证明了一些常用

蛋白质结构预测

实习 5 :蛋白质结构预测 学号20090***** 姓名****** 专业年级生命生技**** 实验时间2012.6.21 提交报告时间2012.6.21 实验目的: 1.学会使用GOR和HNN方法预测蛋白质二级结构 2.学会使用SWISS-MODEL进行蛋白质高级结构预测 实验内容: 1.分别用GOR和HNN方法预测蛋白质序列的二级结构,并对比异同性。 2.利用SWISS-MODEL进行蛋白质的三级结构预测,并对预测结果进行解释。 作业: 1. 搜索一条你感兴趣的蛋白质序列,分别用GOR和HNN进行二级结构预测,解释预测结果,分析两个方法结果有何异同。 答:所选用蛋白质序列为>>gi|390408302|gb|AFL70986.1| gag protein, partial [Human immunodeficiency virus] (1)GOR预测结果: 图1 图1是每个氨基酸在序列中所处的状态,可以看出序列的二级结构预测结果为: 1到9位个氨基酸为无规卷曲,10到33位氨基酸为α螺旋,34到37位为β折叠,38到45位为无规卷曲,46到49位为α螺旋,50到53位为无规卷曲,54到65为α螺旋,66到72位为无规卷曲,73到95位为α螺旋,96到101位为无规卷曲,102到108为β折叠,109到115位为无规卷曲,117位为β折叠。 图2 图2为各种结构在序列中所占的比例,其中Alpha helix占53.85%,Extended strand占11.11%,Random coil占35.04%,无他二级结构。

图3 图3为各个氨基酸在序列中的状态以及二级结构在全序列中二级结构分布情况。 (2)HNN预测: 图4 图4是每个氨基酸在序列中所处的状态,可以看出序列的二级结构预测结果为: 1到6位个氨基酸为无规卷曲,7到34位氨基酸为α螺旋,35到37位为β折叠,38位为α螺旋,39到44位为无规卷曲,45到49位为α螺旋,50到55位为无规卷曲,56到65为α螺旋,66到71位为无规卷曲,72到83位为α螺旋,84到86位为无规卷曲,87到95位为α螺旋,96到102为无规卷曲,103到108位为β折叠,108到117位为无规卷曲。 图5 图5为各种结构在序列中所占的比例,其中Alpha helix占55.56%,Extended strand占7.69%,Random coil占36.75%,无他二级结构。

蛋白质结构预测网址

蛋白质结构预测网址 物理性质预测: Compute PI/MW Peptidemass TGREASE SAPS 基于组成的蛋白质识别预测 AACompIdent PROPSEARCH 二级结构和折叠类预测 nnpredict Predictprotein SSPRED 特殊结构或结构预测 COILS MacStripe 与核酸序列一样,蛋白质序列的检索往往是进行相关分析的第一步,由于数据库和网络技校术的发展,蛋白序列的检索是十分方便,将蛋白质序列数据库下载到本地检索和通过国际互联网进行检索均是可行的。 由NCBI检索蛋白质序列 可联网到:“”进行检索。 利用SRS系统从EMBL检索蛋白质序列 联网到:”,可利用EMBL的SRS系统进行蛋白质序列的检索。 通过EMAIL进行序列检索 当网络不是很畅通时或并不急于得到较多数量的蛋白质序列时,可采用EMAIL方式进行序列检索。 蛋白质基本性质分析 蛋白质序列的基本性质分析是蛋白质序列分析的基本方面,一般包括蛋白质的氨基酸组成,分子质量,等电点,亲水性,和疏水性、信号肽,跨膜区及结构功能域的分析等到。蛋白质的很多功能特征可直接由分析其序列而获得。例如,疏水性图谱可通知来预测跨膜螺旋。同时,也有很多短片段被细胞用来将目的蛋白质向特定细胞器进行转移的靶标(其中最典型的例子是在羧基端含有KDEL序列特征的蛋白质将被引向内质网。WEB中有很多此类资源用于帮助预测蛋白质的功能。 疏水性分析 位于ExPASy的ProtScale程序()可被用来计算蛋白质的疏水性图谱。该网站充许用户计算蛋白质的50余种不同属性,并为每一种氨基酸输出相应的分值。输入的数据可为蛋白质序列或SWISSPROT数据库的序列接受号。需要调整的只是计算窗口的大小(n)该参数用于估计每种氨基酸残基的平均显示尺度。 进行蛋白质的亲/疏水性分析时,也可用一些windows下的软件如, bioedit,dnamana等。 跨膜区分析 有多种预测跨膜螺旋的方法,最简单的是直接,观察以20个氨基酸为单位的疏水性氨基酸残基的分布区域,但同时还有多种更加复杂的、精确的算法能够预测跨膜螺旋的具体位置和它们的膜向性。这些技术主要是基于对已知跨膜螺旋的研究而得到的。自然存在的跨膜螺旋Tmbase 数据库,可通过匿名FTP获得(),参见表一

蛋白质功能-结构-相互作用预测网站工具合集

蛋白质组学 蛋白质是生物体的重要组成部分,参与几乎所有生理和细胞代谢过程。此外,与基因组学和转录组学比较,对一个细胞或组织中表达的所有蛋白质,及其修饰和相互作用的大规模研究称为蛋白质组学。 蛋白质组学通常被认为是在基因组学和转录组学之后,生物系统研究的下一步。然而,蛋白质组的研究远比基因组学复杂,这是由于蛋白质内在的复杂特点,如蛋白质各种各样的翻译后修饰所决定的。并且,研究基因组学的技术要比研究蛋白质组学的技术强得多,虽然在蛋白质组学研究中,质谱技术的研究已取得了一些进展。 尽管存在方法上的挑战,蛋白质组学正在迅速发展,并且对癌症的临床诊断和疾病治疗做出了重要贡献。几项研究鉴定出了一些蛋白质在乳腺癌、卵巢癌、前列腺癌和食道癌中表达变化。例如,通过蛋白质组学技术,人们可以在患者血液中明确鉴定出肿瘤标志物。表1列出了更多的蛋白质组学技术用于研究癌症的例子。 另外,高尔基体功能复杂。最新研究表明,它除了参与蛋白加工外,还能参与细胞分化及细胞间信号传导的过程,并在凋亡中扮演重要角色,其功能障碍也许和肿瘤的发生、发展有某种联系。根据人类基因组研究,约1000多种人类高尔基体蛋白质中仅有500~600种得到了鉴定,建立一条关于高尔基体蛋白质组成的技术路线将有助于其功能的深入研究。 蛋白质组学是一种有效的研究方法,特别是随着亚细胞器蛋白质组学技术的迅猛发展,使高尔基体的全面研究变为可能。因此研究人员希望能以胃癌细胞中的高尔基体为研究对象,通过亚细胞器蛋白质组学方法,建立胃癌细胞中高尔基体的蛋白质组方法学。 研究人员采用蔗糖密度梯度的超速离心方法分离纯化高尔基体,双向凝胶电泳(2-DE)分离高尔基体蛋白质,用ImageMaster 2D软件分析所得图谱,基质辅助激光解吸离子化飞行时间质谱(MALDI-TOF MS)鉴定蛋白质点等一系列亚细胞器蛋白质组学方法建立了胃癌细胞内高尔基体的蛋白图谱。 最后,人们根据分离出的纯度较高的高尔基体建立了分辨率和重复性均较好的双向电泳图谱,运用质谱技术鉴定出12个蛋白质,包括蛋白合成相关蛋白、膜融合蛋白、调节蛋白、凋亡相关蛋白、运输蛋白和细胞增殖分化相关蛋白。通过亚细胞器分离纯化、双向电泳的蛋白分离及MALDI-TOF MS蛋白鉴定分析,研究人员首次成功建立了胃癌细胞SGC7901中高尔基体的蛋白质组学技术路线。 3.1 蛋白质功能预测工具 也许生物信息学方法在癌症研究中最常用的就是基因功能预测方法,但是这些数据库只存储了基因组的大约一半基因的功能。为了在微阵列资料基础上完成功能性的富集分析,基因簇的功能注解是非常重要的。近几年生物学家研发了一些基因功能预测的方法,这些方法旨在超越传统的BLAST搜索来预测基因的功能。基因功能预测可以以氨基酸序列、三级结构、与之相互作用的配体、相互作用过程或基因的表达方式为基础。其中最重要的是基于氨基酸序列的分析,因为这种方法适合于微阵列分析的全部基因。 在表3中,前三项列举了三种同源搜索方法。FASTA方法虽然应用还不太广泛,但它要优于BLAST,或者至少相当。FASTA程序是第一个使用的数据库相似性搜索程序。为了达到较高的敏感程度,程序引用取代矩阵实行局部比对以获得最佳搜索。美国弗吉尼亚大学可以提供这项程序的地方版本,当然数据库搜索结果依赖于要搜索的数据库序列。如果最近的序列数据库版本在弗吉尼亚大学不能获得,那么就最好试一下京都大学(Kyoto University)的KEGG站点。PSI-BLAST(位点特异性反复BLAST)是BLAST的转化版本,PSI-BLAST的特色是每次用profile 搜索数据库后再利用搜索的结果重新构建profile,然后用新的profile再次搜索数据库,如此反复直至没有新的结果产生为止。PSI-BLAST先用带空位的BLAST搜索数据库,将获得的序列通过多序列比对来构建第一个profile。PSI-BLAST自然地拓展了BLAST方法,能寻找蛋白质序列中的隐含模式,有研究表明这种方法可以有效地找到很多序列差异较大而结构功能相似的相关蛋白,所以它比BLAST和FASTA有更好的敏感性。PSI-BLAST服务可以

蛋白质结构与功能的生物信息学研究

实验名称:蛋白质结构与功能的生物信息学研究 实验目的:1.掌握运用BLAST工具对指定蛋白质的氨基酸序列同源性搜索的方法。 2.掌握用不同的工具分析蛋白质的氨基酸序列的基本性质 3掌握蛋白质的氨基酸序列进行三维结构的分析 4.熟悉对蛋白质的氨基酸序列所代表蛋白的修饰情况、所参与的 代谢途径、相互作用的蛋白,以及与疾病的相关性的分析。实验方法和流程: 一、同源性搜索 同源性从分子水平讲则是指两个核酸分子的核苷酸序列或两个蛋白质分子的氨基酸序列间的相似程度。BLAST工具能对生物不同蛋白质的氨基酸序列或不同的基因的DNA序列极性比对,并从相应数据库中找到相同或相似序列。对指定的蛋白质的氨基酸序列进行同源性搜索步骤如下: ↓ 登录网址https://www.doczj.com/doc/2e9000999.html,/blast/ ↓ 输入序列后,运行blast工具 ↓ 序列比对的图形结果显示

序列比对的图形结果:用相似性区段(Hit)覆盖输入序列的范围判断两个序列 的相似性。如果图形中包含低得分的颜色(主要是红色) 区段,表明两序列的并非完全匹配。 ↓ 匹配序列列表及得分

各序列得分 可选择不同的比对工具 备注: Clustal是一款用来对()的软件。可以用来发现特征序列,进行蛋白分类,证明序列间的同源性,帮助预测新序列二级结构与三级结构,确定PCR引物,以及 在分子进化分析方面均有很大帮助。Clustal包括Clustalx和Clustalw(前者是 图形化界面版本后者是命令界面),是生物信息学常用的多序列比对工具。 该序列的比对结果有100条,按得分降序排列,其中最大得分2373,最小得分 分为1195. ↓ 详细的比对序列的排列情况 第一个匹配 序列 第一个序列的匹配率为100% Score表示打分矩阵计算出来的值,由搜索算法决定的,值越大说明匹配程度

蛋白质结构与功能的关系

蛋白质结构与功能的关系 专业:植物学 摘要:蛋白质特定的功能都是由其特定的构象所决定的,各种蛋白质特定的构象又与其一级结构密切相关。天然蛋白质的构象一旦发生变化,必然会影响到它的生物活性。由于蛋白质的构象的变化引起蛋白质功能变化,可能导致蛋白质构象紊乱症,当然也能引起生物体对环境的适应性增强。而分子模拟技术为蛋白质的研究提供了一种崭新的手段。在理论上解决了结构预测和功能分析以及蛋白质工程实施方面所面临的难题。它在蛋白质的结构预测和模建工作中占有举足轻重的地位,实现了生物技术与计算机技术的完美结合。 关键词:蛋白质的结构、功能;折叠/功能关系;蛋白质构象紊乱症;分子模拟技术;同源建模 RNase是由124个氨基酸残基组成的单肽链,分子中 8 个Cys的-SH构成4对二硫键,形成具有一定空间构象的蛋白质分子。在蛋白质变性剂和一些还原剂存在下,酶分子中的二硫键全部被还原,酶的空间结构破坏,肽链完全伸展,酶的催化活性完全丧失。当用透析的方法除去变性剂和巯基乙醇后,发现酶大部分活性恢复,所有的二硫键准确无误地恢复原来状态。若用其他的方法改变分子中二硫键的配对方式,酶完全丧失活性。这个实验表明,蛋白质的一级结构决定它的空间结构,而特定的空间结构是蛋白质具有生物活性的保证。前体与活性蛋白质一级结构的关系,由108个氨基酸残基构成的前胰岛素原,在合成的时候完全没有活性,当切去N-端的24个氨基酸信号肽,形成84个氨基酸的胰岛素原,胰岛素原也没活性,在包装分泌时,A、B链之间的33个氨基酸残基被切除,才形成具有活性的胰岛素。 功能不同的蛋白质总是有着不同的序列;种属来源不同而功能相同的蛋白质的一级结构,可能有某些差异,但与功能相关的结构也总是相同。若一级结构变化,蛋白质的功能可能发生很大的变化。蛋白质特定的功能都是由其特定的构象所决定的,各种蛋白质特定的构象又与其一级结构密切相关。天然蛋白质的构象一旦发生变化,必然会影响到它的生物活性。由于蛋白质的构象的变化引起蛋白质功能变化,可能导致蛋白质构象紊乱症,当然也能引起生物体对环境的适应性增强。 虽然蛋白质结构与生物功能的关系比序列与功能的关系更加紧密,但结构与功能的这种关联亦若隐若现,并不能排除折叠差别悬殊的蛋白质执行相似的功能,折叠相似的蛋白质执行差别悬殊功能的现象的存在。无奈,该领域仍不得不将100多年前Fisher提出的“锁一钥

蛋白质结构与功能的关系

蛋白质结构与功能的关系 (The relationship between protein structure and function) 摘要蛋白质特定的功能都是由其特定的构象所决定的,各种蛋白质特定的构象又与其一级结构密切相关。天然蛋白质的构象一旦发生变化,必然会影响到它的生物活性。由于蛋白质的构象的变化引起蛋白质功能变化,可能导致蛋白质构象紊乱症,当然也能引起生物体对环境的适应性增强!现而今关于蛋白质功能研究还有待发展,一门新兴学科正在发展,血清蛋白组学,生物信息学等!本文仅就蛋白质结构与其功能关系进行粗略阐述。 关键词:蛋白质结构;折叠/功能关系;蛋白质构象紊乱症;分子伴侣 Keywords:protein structure;fold/function relationship;protein conformational disorder;molecular chaperons 虽然蛋白质结构与生物功能的关系比序列与功能的关系更加紧密,但结构与功能的这种关联亦若隐若现,并不能排除折叠差别悬殊的蛋白质执行相似的功能,折叠相似的蛋白质执行差别悬殊功能的现象的存在。无奈,该领域仍不得不将100多年前Fisher提出的“锁一钥匙”模型(“lock—key”model)和50多年前Koshand提出的诱导契合模型(induce fitmodel)作为蛋白质实现功能的理论基础。这2个略显粗糙的模型只是认为蛋白质执行功能的部位局限在结构中的一个或几个小区域内,此类区域通常是蛋白质表面上的凹洞或裂隙。这种凹洞或裂隙被称为“活性部位(active site)”或“别构部位(fallosteric site)”,凹陷部位与配体分子在空间形状和静电上互补。此外,在酶的活性部位中还存在着几个作为催化基团(catalyticgroup)的氨基酸残基。对蛋白质未来的研究应从实验基本数据的归纳和统计入手,从原始的水平上发现蛋白质的潜藏机制【1】。 蛋白质结构与功能关系的研究主要是以力求刻画蛋白质的3D结构的几何学为基础的。蛋白质结构既非规则的几何形,又非完全的无规线团(randomcoil),而是有序(α一螺旋和β一折叠)与无序(线团或环域loop)的混合体。理解蛋白质3D结构的技巧是将结构简化,只保留某种几何特征或拓扑模式,并将其数字化。探求数字中所蕴含的规律,且根据这一规律将蛋白质进行分类,再将分类的结构与蛋白质的功能进行比较,以检验蛋白质抽象结构的合理性。如果一种对蛋白质结构的简化、比较和分类能与蛋自质的功能有较好地对应关系,那么这就是一种对蛋白质结构的有价值的理解。蛋白质结构中,多种弱力(氢键、范德华力、静电相互作用、疏水相互作用、堆积力等)和可逆的二硫键使多肽链折叠成特定的构象。从某种意义上说,共价键维系了蛋白质的一级结构;主链上的氢键维系了蛋白质的二级结构;而氨基酸侧链的相互作用和二硫桥维系着蛋白质的三级结构。亚基(subunit)内部的侧链相互作用是构象稳定的基础,蛋白质链之间的侧链的相互作用是亚基组装(四级结构)的基础,而蛋白质中侧链与配体基团问的相互作用是蛋白质行使功能的基础。 牛胰核糖核酸酶(RNase)变性和复性的实验是蛋白质结构与功能关系的很好例证。蛋白质空间结构遭到破坏;,可导致蛋白质的理比性质和生物学性质的变化,这就是蛋白质变性。变性的蛋白质,只要其一级结构仍然完好,可在一定条件下恢复其空间结构,随之理化性质和生物学性质也可重现,这被称为复性。RNase是由124个氨基酸残基组成的一条肽链,分子中8个半胱氨酸的巯基构成4对二硫键,进而形成具有一定空间构象的活性蛋白质。天然RNase遇尿素和β巯基乙醇时发生变性,其分子中的氢键和4个二硫键解开,严密的空间结构遭破坏,丧失了生物学活性,但一级结构完整无损。若去除尿素和β巯基乙醇,RNase又可恢复其原有构象和生物学活性。RNase分子中的8个巯基若随机排列成二硫键可有105种方式。有活性的RNase只是其中的一种,复性时之所以选择了自

蛋白质结构与功能的生物信息学研究汇总

实验名称:蛋白质结构与功能的生物信息学研究实验目的:1.掌握运用BLAST工具对指定蛋白质的氨基酸序列同源性搜索 的方法。 2.掌握用不同的工具分析蛋白质的氨基酸序列的基本性质 3掌握蛋白质的氨基酸序列进行三维结构的分析 4.熟悉对蛋白质的氨基酸序列所代表蛋白的修饰情况、所参与的 代谢途径、相互作用的蛋白,以及与疾病的相关性的分析。 实验方法和流程: 一、同源性搜索 同源性从分子水平讲则是指两个核酸分子的核苷酸序列或两个蛋白质分子的氨基酸序列间的相似程度。BLAST工具能对生物不同蛋白质的氨基酸序列或不同的基因的DNA序列极性比对,并从相应数据库中找到相同或相似序列。对 指定的蛋白质的氨基酸序列进行同源性搜索步骤如下: ↓ 登录网址https://www.doczj.com/doc/2e9000999.html,/blast/ ↓ 输入序列后,运行blast工具 ↓ 序列比对的图形结果显示

序列比对的图形结果:用相似性区段(Hit)覆盖输入序列的范围判断两个序列 的相似性。如果图形中包含低得分的颜色(主要是红色) 区段,表明两序列的并非完全匹配。 ↓ 匹配序列列表及得分

各序列得分 可选择不同的比对工具 备注: Clustal是一款用来对()的软件。可以用来发现特征序列,进行蛋白分类,证明 序列间的同源性,帮助预测新序列二级结构与三级结构,确定PCR引物,以及 在分子进化分析方面均有很大帮助。Clustal包括Clustalx和Clustalw(前者是图 形化界面版本后者是命令界面),是生物信息学常用的多序列比对工具。 该序列的比对结果有100条,按得分降序排列,其中最大得分2373,最小得分 分为1195. ↓ 详细的比对序列的排列情况 第一个匹配 序列 第一个序列的匹配率为100% Score表示打分矩阵计算出来的值,由搜索算法决定的,值越大说明匹配程度

蛋白质结构预测在线软件

蛋白质预测分析网址集锦 物理性质预测: Compute PI/MW SAPS 基于组成的蛋白质识别预测 AACompIdent PROPSEARCH 二级结构和折叠类预测 nnpredict Predictprotein SSPRED 特殊结构或结构预测 COILS MacStripe 与核酸序列一样,蛋白质序列的检索往往是进行相关分析的第一步,由于数据库和网络技校术的发展,蛋白序列的检索是十分方便,将蛋白质序列数据库下载到本地检索和通过国际互联网进行检索均是可行的。 由NCBI检索蛋白质序列 可联网到:“”进行检索。 利用SRS系统从EMBL检索蛋白质序列 联网到:”,可利用EMBL的SRS系统进行蛋白质序列的检索。 通过EMAIL进行序列检索 当网络不是很畅通时或并不急于得到较多数量的蛋白质序列时,可采用EMAIL方式进行序列检索。 蛋白质基本性质分析 蛋白质序列的基本性质分析是蛋白质序列分析的基本方面,一般包括蛋白质的氨基酸组成,分子质量,等电点,亲水性,和疏水性、信号肽,跨膜区及结构功能域的分析等到。蛋白质的很多功能特征可直接由分析其序列而获得。例如,疏水性图谱可通知来预测跨膜螺旋。同时,也有很多短片段被细胞用来将目的蛋白质向特定细胞器进行转移的靶标(其中最典型的例子是在羧基端含有KDEL序列特征的蛋白质将被引向内质网。WEB中有很多此类资源用于帮助预测蛋白质的功能。 疏水性分析 位于ExPASy的ProtScale程序()可被用来计算蛋白质的疏水性图谱。该网站充许用户计算蛋白质的50余种不同属性,并为每一种氨基酸输出相应的分值。输入的数据可为蛋白质序列或SWISSPROT数据库的序列接受号。需要调整的只是计算窗口的大小(n)该参数用于估计每种氨基酸残基的平均显示尺度。 进行蛋白质的亲/疏水性分析时,也可用一些windows下的软件如,bioedit,dnamana等。 跨膜区分析 有多种预测跨膜螺旋的方法,最简单的是直接,观察以20个氨基酸为单位的疏水性氨基酸残基的分布区域,但同时还有多种更加复杂的、精确的算法能够预测跨膜螺旋的具体位置和它们的膜向性。这些技术主要是基于对已知跨膜螺旋的研究而得到的。自然存在的跨膜螺旋Tmbase 数据库,可通过匿名FTP获得(,参见表一

蛋白质结构及功能预测

物理性质预测 Compute PI/MW http://expaxy.hcuge.ch/ch2d/pi-tool.html Peptidemass http://expaxy.hcuge.ch/sprot/peptide-mass.html TGREASE ftp://https://www.doczj.com/doc/2e9000999.html,/pub/fasta/ SAPS http://ulrec3.unil.ch/software/SAPS_form.html 基于组成的蛋白质识别预测 http://expaxy.hcuge.ch/ch2d/aacompi.html AACompSim http://expaxy.hcuge.ch/ch2d/aacsim.html PROPSEARCH http://www.embl-heidelberg.de/prs.html 二级结构和折叠类预测 https://www.doczj.com/doc/2e9000999.html,/~nomi/nnpredictPredictprotein http://www.embl-heidelberg.de/predictprotein/SOPMA http://www.ibcp.fr/predict.htmlSSPRED http://www.embl-heidelberg.de/sspred/ssprd_info.html 特殊结构或结构预测 http://ulrec3.unil.ch/software/COILS_form.htmlMacStripe https://www.doczj.com/doc/2e9000999.html,/matsudaira/macstripe.html 检索 由NCBI检索蛋白质序列 https://www.doczj.com/doc/2e9000999.html,:80/entrz/query.fcgi?db=protein进行检索。 利用SRS系统从EMBL检索蛋白质序列 https://www.doczj.com/doc/2e9000999.html,/可利用EMBL的SRS系统进行蛋白质序列的检索。 通过EMAIL进行序列检索 当网络不是很畅通时或并不急于得到较多数量的蛋白质序列时,可采用EMAIL方式进行序列检索。 疏水性分析 位于ExPASy的ProtScale程序https://www.doczj.com/doc/2e9000999.html,/cgi-bin/protscale.pl可被用来计算蛋白质的疏水性图谱。该网站充许用户计算蛋白质的50余种不同属性,并为每一种氨基酸输出

蛋白质结构预测在线软件

蛋白质结构预测在线软 件 Company Document number:WUUT-WUUY-WBBGB-BWYTT-1982GT

蛋白质预测分析网址集锦? 物理性质预测:? Compute PI/MW? ? SAPS? 基于组成的蛋白质识别预测? AACompIdentPROPSEARCH? 二级结构和折叠类预测? nnpredict? Predictprotein? SSPRED? 特殊结构或结构预测? COILS? MacStripe? 与核酸序列一样,蛋白质序列的检索往往是进行相关分析的第一步,由于数据库和网络技校术的发展,蛋白序列的检索是十分方便,将蛋白质序列数据库下载到本地检索和通过国际互联网进行检索均是可行的。? 由NCBI检索蛋白质序列? 可联网到:“”进行检索。? 利用SRS系统从EMBL检索蛋白质序列? 联网到:”,可利用EMBL的SRS系统进行蛋白质序列的检索。? 通过EMAIL进行序列检索?

当网络不是很畅通时或并不急于得到较多数量的蛋白质序列时,可采用EMAIL方式进行序列检索。? 蛋白质基本性质分析? 蛋白质序列的基本性质分析是蛋白质序列分析的基本方面,一般包括蛋白质的氨基酸组成,分子质量,等电点,亲水性,和疏水性、信号肽,跨膜区及结构功能域的分析等到。蛋白质的很多功能特征可直接由分析其序列而获得。例如,疏水性图谱可通知来预测跨膜螺旋。同时,也有很多短片段被细胞用来将目的蛋白质向特定细胞器进行转移的靶标(其中最典型的例子是在羧基端含有KDEL序列特征的蛋白质将被引向内质网。WEB中有很多此类资源用于帮助预测蛋白质的功能。? 疏水性分析? 位于ExPASy的ProtScale程序()可被用来计算蛋白质的疏水性图谱。该网站充许用户计算蛋白质的50余种不同属性,并为每一种氨基酸输出相应的分值。输入的数据可为蛋白质序列或SWISSPROT数据库的序列接受号。需要调整的只是计算窗口的大小(n)该参数用于估计每种氨基酸残基的平均显示尺度。? 进行蛋白质的亲/疏水性分析时,也可用一些windows下的软件如, bioedit,dnamana等。? 跨膜区分析? 有多种预测跨膜螺旋的方法,最简单的是直接,观察以20个氨基酸为单位的疏水性氨基酸残基的分布区域,但同时还有多种更加复杂的、精确的算法能够预测跨膜螺旋的具体位置和它们的膜向性。这些技术主要是基于对已知跨膜螺旋的研究而得到的。自然存在的跨膜螺旋Tmbase 数据库,可通过匿名FTP获得(,参见表一? 资源名称网址说明?

蛋白质的结构和功能的关系

蛋白质结构与功能的关系 摘要:蛋白质特定的功能都是由其特定的构象所决定的,各种蛋白质特定的构象又与其一级结构密切相关。天然蛋白质的构象一旦发生变化,必然会影响到它的生物活性。由于蛋白质的构象的变化引起蛋白质功能变化,可能导致蛋白质构象紊乱症,当然也能引起生物体对环境的适应性增强!现而今关于蛋白质功能研究还有待发展,一门新兴学科正在发展,血清蛋白组学,生物信息学等!本文仅就蛋白质结构与其功能关系进行粗略阐述。 关键词:蛋白质分子一级结构、空间结构、折叠/功能关系、蛋白质构象紊乱症;分子伴侣 正文: 1、蛋白质分子一级结构和功能的关系 蛋白质分子中关键活性部位氨基酸残基的改变,会影响其生理功能,甚至造成分子病(molecular disease)。例如镰状细胞贫血,就是由于血红蛋白分子中两个β亚基第6位正常的谷氨酸变异成了缬氨酸,从酸性氨基酸换成了中性支链氨基酸,降低了血红蛋白在红细胞中的溶解度,使它在红细胞中随血流至氧分压低的外周毛细血管时,容易凝聚并沉淀析出,从而造成红细胞破裂溶血和运氧功能的低下。 另一方面,在蛋白质结构和功能关系中,一些非关键部位氨基酸残基的改变或缺失,则不会影响蛋白质的生物活性。例如人、猪、牛、羊等哺乳动物胰岛素分子A链中8、9、10位和B链30位的氨基酸残基各不相同,有种族差异,但这并不影响它们都具有降低生物体血糖

浓度的共同生理功能。 蛋白质一级结构与功能间的关系十分复杂。不同生物中具有相似生理功能的蛋白质或同一种生物体内具有相似功能的蛋白质,其一级结构往往相似,但也有时可相差很大。如催化DNA复制的DNA聚合酶,细菌的和小鼠的就相差很大,具有明显的种族差异,可见生命现象十分复杂多样。 2、蛋白质分子空间结构和功能的关系 蛋白质分子空间结构和其性质及生理功能的关系也十分密切。不同的蛋白质,正因为具有不同的空间结构,因此具有不同的理化性质和生理功能。如指甲和毛发中的角蛋白,分子中含有大量的α-螺旋二级结构,因此性质稳定坚韧又富有弹性,这是和角蛋白的保护功能分不开的;而胶原蛋白的三股π螺旋平行再几股拧成缆绳样胶原微纤维结构,使其性质稳定而具有强大的抗张力作用 又如细胞质膜上一些蛋白质是离子通道,就是因为在其多肽链中的一些α-螺旋或β-折叠二级结构中,一侧多由亲水性氨基酸组成,而另一侧却多由疏水性氨基酸组成,因此是具有“两亲性”(amphipathic)的特点,几段α-螺旋或β-折叠的亲水侧之间就构成了离子通道,而其疏水侧,即通过疏水键将离子通道蛋白质固定在细胞质膜上。载脂蛋白也具有两亲性,既能与血浆中脂类结合,又使之溶解在血液中进行脂类的运输。 3、折叠/功能关系 体内各种蛋白质都有特殊的生理功能,这与空间构象有着密切的

相关主题
文本预览
相关文档 最新文档