Principal English Terms in Bioinformatics, with Definitions
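Chinese names of common abbreviations
PAM: point accepted mutation matrix (可接受点突变矩阵); EST: expressed sequence tag (表达序列标签); CDS: coding sequence (编码区序列); EXON: exon (外显子); ORF: open reading frame (开放阅读框)

Term definitions
FASTA format: Also known as Pearson format. The title line of each sequence must begin with a greater-than sign ">", and the sequence itself starts on the following line. It is generally recommended that each line contain no more than 80 characters, so that alignment programs can process the file easily.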
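The FASTA layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference parser: the function names `parse_fasta` and `format_fasta` and the sample header are hypothetical, and the 80-character width is taken from the recommendation in the text.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {title: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    title = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            title = line[1:]  # title line: everything after ">"
            records[title] = ""
        elif title is not None:
            records[title] += line  # sequence may span several lines
    return records


def format_fasta(title: str, sequence: str, width: int = 80) -> str:
    """Write one record, wrapping the sequence at `width` characters."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + title] + chunks)


# Hypothetical record: a 200-base sequence wrapped at 80 characters per line.
text = format_fasta("seq1 hypothetical example", "ACGT" * 50)
records = parse_fasta(text.splitlines())
```

Writing and then re-reading the record gives back the original unwrapped sequence, which is why the line width is a convenience for programs and readers rather than part of the sequence data.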
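Medical Subject Headings (MeSH): A controlled vocabulary of standardized medical terms used to describe biomedical concepts. Following the MeSH vocabulary, NIH staff read the full text of biomedical journal articles and assign MeSH headings to each one. Headings that describe the central topic of an article are called major topic headings; terms that describe one particular aspect of a topic are called subheadings.
Orthologs (直系同源): Proteins from different species that evolved by vertical descent (speciation) and that typically retain the same function as the original protein.
Sequence motif (序列模体): Usually a group of conserved residues that lie adjacent or close to one another in a protein sequence and are related to the function of the protein and its family.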
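Scoring matrix (计分矩阵): A table of values describing the probability with which residues (amino acids or bases) occur in an alignment. Each value in the matrix is the logarithm of the ratio of two probabilities. One is the probability that the two residues would be aligned at random; this is simply the product of the independent probabilities of occurrence of each residue. The other is the probability that the pair of residues occurs meaningfully in a sequence alignment. These probabilities are derived from samples of real alignments that are known to be valid.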
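The log-odds calculation described above can be written out as a short Python sketch. The function name `log_odds_score` and all the probabilities below are made-up illustrative values, not figures from any published matrix; base 2 is assumed so that the score comes out in bit units.

```python
import math


def log_odds_score(p_pair: float, p_a: float, p_b: float, base: float = 2.0) -> float:
    """Log of (probability of the aligned pair / probability expected by chance)."""
    expected_by_chance = p_a * p_b  # product of the independent residue probabilities
    return math.log(p_pair / expected_by_chance, base)


# Hypothetical pair seen four times more often than chance predicts:
# 0.02 / (0.1 * 0.05) = 4, so the score is about 2 bits.
favoured = log_odds_score(p_pair=0.02, p_a=0.1, p_b=0.05)

# Hypothetical pair seen less often than chance predicts: negative score.
disfavoured = log_odds_score(p_pair=0.001, p_a=0.1, p_b=0.05)
```

A positive value means the pair appears in trusted alignments more often than chance alone would predict, and a negative value means it appears less often.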
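Short-answer questions
1. What is bioinformatics, and what does its research cover?
Bioinformatics deals with the acquisition, processing, storage, distribution, analysis, and interpretation of biological information related to genome research. This definition has two parts: one is the collection, organization, and servicing of massive amounts of data, that is, managing the data well; the other is discovering new patterns in the data, that is, using the data well. Its research covers:
(1) Collection, storage, management, and provision of genome-related information
(2) Discovery and identification of new genes
(3) Analysis of the information structure of non-coding regions
(4) Study of biological evolution
(5) Comparative study of complete genomes
(6) Methods for analyzing genome information
(7) Analysis of large-scale gene function expression profiles
(8) Prediction and simulation of protein three-dimensional structure, molecular design, and drug design
2. What is the Human Genome Project?
The Human Genome Project is a large-scale scientific program that aims to determine the sequence of the 3 billion nucleotides contained in the human chromosomes (the haploid set), thereby mapping the human genome, identifying and presenting all of the genes on it together with their sequences, and ultimately deciphering human genetic information.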
Abstract Syntax Notation (ASN.1) (the internal format used by many programs developed at NCBI, such as Cn3D for displaying protein three-dimensional structures): A language that is used to describe structured data types formally. Within bioinformatics, it has been used by the National Center for Biotechnology Information to encode sequences, maps, taxonomic information, molecular structures, and bibliographic information in such a way that it can be easily accessed and exchanged by computer software.
Accession number (记录号): A unique identifier that is assigned to a single database entry for a DNA or protein sequence.
Affine gap penalty (a strategy for setting gap penalties): A gap penalty score that is a linear function of gap length, consisting of a gap opening penalty and a gap extension penalty multiplied by the length of the gap. Using this penalty scheme greatly enhances the performance of dynamic programming methods for sequence alignment. See also Gap penalty.
Algorithm (算法): A systematic procedure for solving a problem in a finite number of steps, typically involving a repetition of operations. Once specified, an algorithm can be written in a computer language and run as a program.
Alignment (联配/比对): Refers to the procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Of the two types of alignment, local and global, a local alignment is generally the most useful. See also Local and Global alignments.
Alignment score (联配/比对值): An algorithmically computed score based on the number of matches, substitutions, insertions, and deletions (gaps) within an alignment. Scores for matches and substitutions are derived from a scoring matrix such as the BLOSUM and PAM matrices for proteins, and affine gap penalties suitable for the matrix are chosen. Alignment scores are in log odds units, often bit units (log to the base 2). Higher scores denote better alignments.
See also Similarity score, Distance in sequence analysis.
Alphabet (字母表): The total number of distinct symbols available in a sequence: 4 for DNA sequences and 20 for protein sequences.
Annotation (注释): The prediction of genes in a genome, including the location of protein-encoding genes, the sequence of the encoded proteins, any significant matches to other proteins of known function, and the location of RNA-encoding genes. Predictions are based on gene models; e.g., hidden Markov models of introns and exons in protein-encoding genes, and models of secondary structure in RNA.
Anonymous FTP (匿名FTP): When an FTP service allows anyone to log in, it is said to provide anonymous FTP service. A user can log in to an anonymous FTP server by typing anonymous as the user name and his or her e-mail address as the password. Most Web browsers now negotiate anonymous FTP logon without asking the user for a user name and password. See also FTP.
ASCII: The American Standard Code for Information Interchange (ASCII) encodes unaccented letters a-z, A-Z, the numbers 0-9, most punctuation marks, space, and a set of control characters such as carriage return and tab. ASCII specifies 128 characters that are mapped to the values 0-127. ASCII files are commonly called plain text, meaning that they only encode text without extra markup.
BAC clone (细菌人工染色体克隆): Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100-200 kb. Most of the large-insert clones sequenced in the project were BAC clones.
Back-propagation (反向传输): When training feed-forward neural networks, a back-propagation algorithm can be used to modify the network weights. After each training input pattern is fed through the network, the network's output is compared with the desired output and the amount of error is calculated. This error is back-propagated through the network by using an error function to correct the network weights.
See also Feed-forward neural network.
Baum-Welch algorithm (Baum-Welch算法): An expectation maximization algorithm that is used to train hidden Markov models.
Bayes' rule (贝叶斯法则): Forms the basis of conditional probability by calculating the likelihood of an event occurring based on the history of the event and relevant background information. In terms of two parameters A and B, the theorem is stated in an equation: the conditional probability of A, given B, P(A|B), is equal to the probability of A, P(A), times the conditional probability of B, given A, P(B|A), divided by the probability of B, P(B). P(A) is the historical or prior distribution value of A, P(B|A) is a new prediction for B for a particular value of A, and P(B) is the sum of the newly predicted values for B. P(A|B) is a posterior probability, representing a new prediction for A given the prior knowledge of A and the newly discovered relationships between A and B.
Bayesian analysis (贝叶斯分析): A statistical procedure used to estimate parameters of an underlying distribution based on an observed distribution. See also Bayes' rule.
Biochips (生物芯片): Miniaturized arrays of large numbers of molecular substrates, often oligonucleotides, in a defined pattern. They are also called DNA microarrays and microchips.
Bioinformatics (生物信息学): The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology. Also, the discipline of obtaining information about genomic or protein sequence data. This may involve similarity searches of databases, comparing your unidentified sequence to the sequences in a database, or making predictions about the sequence based on current knowledge of similar sequences. Databases are frequently made publicly available through the Internet, or locally at your institution.
Bit score (二进制值/Bit值): The value S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account.
Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
Bit units: From information theory, a bit denotes the amount of information required to distinguish between two equally likely possibilities. The number of bits of information, N, required to convey a message that has M possibilities is log2 M = N bits.
BLAST (基本局部联配搜索工具; a major database search program): Basic Local Alignment Search Tool. A set of programs used to perform fast similarity searches. Nucleotide sequences can be compared with nucleotide sequences in a database using BLASTN, for example. Complex statistics are applied to judge the significance of each match. Reported sequences may be homologous to, or related to, the query sequence. The BLASTP program is used to search a protein database for a match against a query protein sequence. There are several other flavours of BLAST. BLAST2 is a newer release of BLAST that allows for insertions or deletions in the sequences being aligned. Gapped alignments may be more biologically significant.
Block (a block of conserved regions in a protein family): Conserved ungapped patterns approximately 3-60 amino acids in length in a set of related proteins.
BLOSUM matrices (模块替换矩阵; a major family of substitution matrices): An alternative to PAM tables, BLOSUM tables were derived using local multiple alignments of more distantly related sequences than were used for the PAM matrix. These are used to assess the similarity of sequences when performing alignments.
Boltzmann distribution (Boltzmann分布): Describes the number of molecules that have energies above a certain level, based on the Boltzmann gas constant and the absolute temperature.
Boltzmann probability function (Boltzmann概率函数): See Boltzmann distribution.
Bootstrap analysis: A method for testing how well a particular data set fits a model. For example, the validity of the branch arrangement in a predicted phylogenetic tree can be tested by resampling columns in a multiple sequence alignment to create many new alignments.
The appearance of a particular branch in trees generated from these resampled sequences can then be measured. Alternatively, a sequence may be left out of an analysis to determine how much the sequence influences the results of an analysis.
Branch length (分支长度): In sequence analysis, the number of sequence changes along a particular branch of a phylogenetic tree.
CDS or cds (编码序列): Coding sequence.
Chebyshev's inequality: The probability that a random variable deviates from its mean by at least k standard deviations is less than or equal to 1/k^2, the square of 1 over the number of standard deviations from the mean.
Clone (克隆): Population of identical cells or molecules (e.g. DNA), derived from a single ancestor.
Cloning vector (克隆载体): A molecule that carries a foreign gene into a host, and allows or facilitates the multiplication of that gene in a host. When sequencing a gene that has been cloned using a cloning vector (rather than by PCR), care should be taken not to include the cloning vector sequence when performing similarity searches. Plasmids, cosmids, phagemids, YACs and PACs are example types of cloning vectors.
Cluster analysis (聚类分析): A method for grouping together a set of objects that are most similar from a larger group of related objects. The relationships are based on some criterion of similarity or difference. For sequences, a similarity or distance score or a statistical evaluation of those scores is used.
Cobbler: A single sequence that represents the most conserved regions in a multiple sequence alignment. The BLOCKS server uses the cobbler sequence to perform a database similarity search as a way to reach sequences that are more divergent than would be found using the single sequences in the alignment for searches.
Coding system (neural networks): Regarding neural networks, a coding system needs to be designed for representing input and output.
The level of success found when training the model will be partially dependent on the quality of the coding system chosen.
Codon usage: Analysis of the codons used in a particular gene or organism.
COG (直系同源簇): Clusters of orthologous groups: a set of groups of related sequences in microorganisms and yeast (S. cerevisiae). These groups are found by whole proteome comparisons and include orthologs and paralogs. See also Orthologs and Paralogs.
Comparative genomics (比较基因组学): A comparison of gene numbers, gene locations, and biological functions of genes in the genomes of diverse organisms, one objective being to identify groups of genes that play a unique biological role in a particular organism.
Complexity (of an algorithm) (算法的复杂性): Describes the number of steps required by the algorithm to solve a problem as a function of the amount of data; for example, the length of sequences to be aligned.
Conditional probability (条件概率): The probability of a particular result (or of a particular value of a variable) given one or more events or conditions (or values of other variables).
Conservation (保守): Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
Consensus (一致序列): A single sequence that represents, at each subsequent position, the variation found within corresponding columns of a multiple sequence alignment.
Context-free grammars: A recursive set of production rules for generating patterns of strings. These consist of a set of terminal characters that are used to create strings, a set of nonterminal symbols that correspond to rules and act as placeholders for patterns that can be generated using terminal characters, a set of rules for replacing nonterminal symbols with terminal characters, and a start symbol.
Contig (序列重叠群/拼接序列): A set of clones that can be assembled into a linear order. A DNA sequence that overlaps with another contig.
The full set of overlapping sequences (contigs) can be put together to obtain the sequence for a long region of DNA that cannot be sequenced in one run in a sequencing assay. Important in genetic mapping at the molecular level.
CORBA (a set of common standards, defined by the Object Management Group, that unifies OOP objects and network interfaces across computers, operating systems, programming languages, and networks): The Common Object Request Broker Architecture (CORBA) is an open industry standard for working with distributed objects, developed by the Object Management Group. CORBA allows the interconnection of objects and applications regardless of computer language, machine architecture, or geographic location of the computers.
Correlation coefficient (相关系数): A numerical measure, falling between -1 and 1, of the degree of the linear relationship between two variables. A positive value indicates a direct relationship, a negative value indicates an inverse relationship, and the distance of the value away from zero indicates the strength of the relationship. A value near zero indicates no relationship between the variables.
Covariation (in sequences) (共变): Coincident change at two or more sequence positions in related sequences that may influence the secondary structures of RNA or protein molecules.
Coverage (or depth) (覆盖率/厚度): The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a 'high-quality base' is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).
Database (数据库): A computerized storehouse of data that provides a standardized way for locating, adding, removing, and changing data. See also Object-oriented database, Relational database.
Dendrogram: A form of a tree that lists the compared objects (e.g., sequences or genes in a microarray analysis) in a vertical order and joins related ones by levels of branches extending to one side of the list.
Depth (厚度): See Coverage.
Dirichlet mixtures: Defined as the conjugate prior of a multinomial distribution.
One use is for predicting the expected pattern of amino acid variation found in the match state of a hidden Markov model (representing one column of a multiple sequence alignment of proteins), based on prior distributions found in conserved protein domains (blocks).
Distance in sequence analysis (序列距离): The number of observed changes in an optimal alignment of two sequences, usually not counting gaps.
DNA sequencing (DNA测序): The experimental process of determining the nucleotide sequence of a region of DNA. This is done by labelling each nucleotide (A, C, G or T) with either a radioactive or fluorescent marker which identifies it. There are several methods of applying this technology, each with its advantages and disadvantages. For more information, refer to a current textbook. High-throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes, the sequences may be generated more quickly than they can be characterised.
Domain (功能域): A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.
Dot matrix (点标矩阵图): Dot matrix diagrams provide a graphical method for comparing two sequences. One sequence is written horizontally across the top of the graph and the other along the left-hand side. Dots are placed within the graph at the intersection of the same letter appearing in both sequences. A series of diagonal lines in the graph indicate regions of alignment.
The matrix may be filtered to reveal the most-alike regions by scoring a minimal threshold number of matches within a sequence window.
Draft genome sequence (基因组序列草图): The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.
DUST (a program for filtering low-complexity regions): A program for filtering low complexity regions from nucleic acid sequences.
Dynamic programming (动态规划法): A dynamic programming algorithm solves a problem by combining solutions to sub-problems that are computed once and saved in a table or matrix. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found. This algorithm is used for producing sequence alignments, given a scoring system for sequence comparisons.
EMBL (欧洲分子生物学实验室; the EMBL database is one of the major public nucleic acid sequence databases): European Molecular Biology Laboratory. Maintains the EMBL database, one of the major public sequence databases.
EMBnet (欧洲分子生物学网络): European Molecular Biology Network. It was established in 1988 and provides services including local molecular databases and software for molecular biologists in Europe. There are several large outposts of EMBnet, including ExPASy.
Entropy (熵): From information theory, a measure of the unpredictable nature of a set of possible elements. The higher the level of variation within the set, the higher the entropy.
Erdős and Rényi law: In a toss of a "fair" coin, the number of heads in a row that can be expected is the logarithm of the number of tosses to the base 2. The law may be generalized for more than two possible outcomes by changing the base of the logarithm to the number of outcomes.
This law was used to analyze the number of matches and mismatches that can be expected between random sequences as a basis for scoring the statistical significance of a sequence alignment.
EST (表达序列标签): See Expressed Sequence Tag.
Expect value (E) (E值): E value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. In a database similarity search, the probability that an alignment score as good as the one found between a query sequence and a database sequence would be found in as many comparisons between random sequences as was done to find the matching sequence. In other types of sequence analysis, E has a similar meaning.
Expectation maximization (sequence analysis): An algorithm for locating similar sequence patterns in a set of sequences. A guessed alignment of the sequences is first used to generate an expected scoring matrix representing the distribution of sequence characters in each column of the alignment, this pattern is matched to each sequence, and the scoring matrix values are then updated to maximize the alignment of the matrix to the sequences. The procedure is repeated until there is no further improvement.
Exon (外显子): Coding region of DNA. See CDS.
Expressed Sequence Tag (EST) (表达序列标签): Randomly selected, partial cDNA sequence; represents its corresponding mRNA. dbEST is a large database of ESTs at GenBank, NCBI.
FASTA (a major database search program): The first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt".
The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable, which specifies the size of a "word". (Pearson and Lipman)
Extreme value distribution (极值分布): Some measurements are found to follow a distribution that has a long tail which decays at high values much more slowly than that found in a normal distribution. This slow-falling type is called the extreme value distribution. The alignment scores between unrelated or random sequences are an example. These scores can reach very high values, particularly when a large number of comparisons are made, as in a database similarity search. The probability of a particular score may be accurately predicted by the extreme value distribution, which follows a double negative exponential function after Gumbel.
False negative (假阴性): A negative data point collected in a data set that was incorrectly reported due to a failure of the test in avoiding negative results.
False positive (假阳性): A positive data point collected in a data set that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have been recorded as negative.
Feed-forward neural network (前馈神经网络): Organizes nodes into sequential layers in which the nodes in each layer are fully connected with the nodes in the next layer, except for the final output layer. Input is fed from the input layer through the layers in sequence in a "feed-forward" direction, resulting in output at the final layer. See also Neural network.
Filtering (window size): During pair-wise sequence alignment using the dot matrix method, random matches can be filtered out by using a sliding window to compare the two sequences. Rather than comparing a single sequence position at a time, a window of adjacent positions in the two sequences is compared and a dot, indicating a match, is generated only if a certain minimal number of matches occur.
Filtering (过滤): Also known as Masking.
The process of hiding regions of (nucleic acid or amino acid) sequence having characteristics that frequently lead to spurious high scores. See SEG and DUST.
Finished sequence (完成序列): Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.
Fourier analysis: Studies the approximations and decomposition of functions using trigonometric polynomials.
Format (file) (格式): Different programs require that information be specified to them in a formal manner, using particular keywords and ordering. This specification is a file format.
Forward-backward algorithm: Used to train a hidden Markov model by aligning the model with training sequences. The algorithm then refines the model to reduce the error when fitted to the given data using a gradient descent approach.
FTP (File Transfer Protocol) (文件传输协议): Allows a person to transfer files from one computer to another across a network using an FTP-capable client program. The FTP client program can only communicate with machines that run an FTP server. The server, in turn, will make a specific portion of its file system available for FTP access, provided that the client is able to supply a recognized user name and password to the server.
Full shotgun clone (鸟枪法克隆): A large-insert clone for which full shotgun sequence has been produced.
Functional genomics (功能基因组学): Assessment of the function of genes identified by between-genome comparisons. The function of a newly identified gene is tested by introducing mutations into the gene and then examining the resultant mutant organism for an altered phenotype.
Gap (空位/间隙/缺口): A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score.
Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.
Gap penalty (空位罚分): A numeric score used in sequence alignment programs to penalize the presence of gaps within an alignment. The value of a gap penalty affects how often gaps appear in alignments produced by the algorithm. Most alignment programs suggest gap penalties that are appropriate for particular scoring matrices.
Genetic algorithm (遗传算法): A kind of search algorithm that was inspired by the principles of evolution. A population of initial solutions is encoded and the algorithm searches through these by applying a pre-defined fitness measurement to each solution, selecting those with the highest fitness for reproduction. New solutions can be generated during this phase by crossover and mutation operations, defined in the encoded solutions.
Genetic map (遗传图谱): A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis. The unit of distance is centimorgans (cM), denoting a 1% chance of recombination.
Genome (基因组): The genetic material of an organism, contained in one haploid set of chromosomes.
Gibbs sampling method: An algorithm for finding conserved patterns within a set of related sequences. A guessed alignment of all but one sequence is made and used to generate a scoring matrix that represents the alignment. The matrix is then matched to the left-out sequence, and a probable location of the corresponding pattern is found. This prediction is then input into a new alignment and another scoring matrix is produced and tested on a new left-out sequence.
The process is repeated until there is no further improvement in the matrix.
Global alignment (整体联配): Attempts to match as many characters as possible, from end to end, in a set of two or more sequences.
Gopher: A document publishing system that allows text files to be retrieved and displayed.
Graph theory (图论): A branch of mathematics which deals with problems that involve a graph or network structure. A graph is defined by a set of nodes (or points) and a set of arcs (lines or edges) joining the nodes. In sequence and genome analysis, graph theory is used for sequence alignments and for clustering like genes.
GSS (基因组勘测序列): Genome survey sequence.
GUI (图形用户界面): Graphical user interface.
H (相对熵值): H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990). H can be thought of as a measure of the average information (in bits) available per position that distinguishes an alignment from chance. At high values of H, short alignments can be distinguished from chance, whereas at lower H values, a longer alignment may be necessary (Altschul, 1991).
Half-bits: Some scoring matrices are in half-bit units. These units are logarithms to the base 2 of odds scores times 2.
Heuristic (启发式方法): A procedure that progresses along empirical lines by using rules of thumb to reach a solution. The solution is not guaranteed to be optimal.
Hexadecimal system (16进制系统): The base 16 counting system that uses the digits 0-9 followed by the letters A-F.
HGMP (人类基因组图谱计划): Human Genome Mapping Project.
Hidden Markov Model (HMM) (隐马尔可夫模型): In sequence analysis, an HMM is usually a probabilistic model of a multiple sequence alignment, but can also be a model of periodic patterns in a single sequence, representing, for example, patterns found in the exons of a gene. In a model of multiple sequence alignments, each column of symbols in the alignment is represented by a frequency distribution of the symbols called a state, and insertions and deletions by other states.
One then moves through the model along a particular path from state to state trying to match a given sequence. The next matching symbol is chosen from each state, recording its probability (frequency) and also the probability of going to that particular state from a previous one (the transition probability). State and transition probabilities are then multiplied to obtain a probability of the given sequence. Generally speaking, an HMM is a statistical model for an ordered sequence of symbols, acting as a stochastic state machine that generates a symbol each time a transition is made from one state to the next. Transitions between states are specified by transition probabilities.
Hidden layer (隐藏层): An inner layer within a neural network that receives its input and sends its output to other layers within the network. One function of the hidden layer is to detect covariation within the input data, such as patterns of amino acid covariation that are associated with a particular type of secondary structure in proteins.
Hierarchical clustering (分级聚类): The clustering or grouping of objects based on some single criterion of similarity or difference. An example is the clustering of genes in a microarray experiment based on the correlation between their expression patterns. The distance method used in phylogenetic analysis is another example.
Hill climbing: A nonoptimal search algorithm that selects the singular best possible solution at a given state or step. The solution may result in a locally best solution that is not a globally best solution.
Homology (同源性): A similar component in two organisms (e.g., genes with strongly similar sequences) that can be attributed to a common ancestor of the two organisms during evolution.
Horizontal transfer (水平转移): The transfer of genetic material between two distinct species that do not ordinarily exchange genetic material.
The transferred DNA becomes established in the recipient genome and can be detected by a novel phylogenetic history and codon content compared to the rest of the genome.
HSP (高比值片段对): High-scoring segment pair. Local alignments with no gaps that achieve one of the top alignment scores in a given search.
HTGS/HTG (高通量基因组序列): High-throughput genome sequences.
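To make the Dynamic programming, Global alignment, and Gap penalty entries above concrete, here is a minimal Python sketch of a global alignment score computed by filling a table of sub-problem solutions. It is an illustration under stated assumptions, not the method of any particular program: the function name `global_alignment_score` is hypothetical, the scores (match +1, mismatch -1) stand in for a real scoring matrix, and a simple fixed penalty of -2 per gap position is used in place of an affine scheme.

```python
from typing import List


def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best end-to-end alignment score of a and b under the toy scoring above."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the best score for aligning a[:i] with b[:j].
    table: List[List[int]] = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        table[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
    return table[-1][-1]


# Identical sequences score one match per position.
same = global_alignment_score("GATTACA", "GATTACA")
# One deleted base: three matches and one gap position.
one_gap = global_alignment_score("ACGT", "AGT")
```

Each cell is computed once from three neighbouring cells and saved, which is the table-filling idea in the Dynamic programming entry. An affine scheme, as in the Affine gap penalty entry, would replace the single `gap` value with separate opening and extension penalties and needs extra tables to track whether a gap is already open.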
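Basic Vocabulary Commonly Used in Bioinformatics
Source: State Key Laboratory of Bioelectronics
Each entry gives the English term, the Chinese term, and the definition.

A
A (Adenine) (腺嘌呤): One of the two purines that serve as bases.
active site (活化位点): The region on the three-dimensional surface of a protein where catalysis takes place.
alignment (比对): The pairing carried out to determine the accumulated differences between two homologous nucleic acid or protein sequences.
alignment of alignments (比对的比对): An alignment whose objects are not simple sequences but alignments of sequences.
alleles (等位基因): Different versions of a gene.
alpha carbon (α碳): The central carbon atom of an amino acid, to which the side chain (R group) is attached.
alternative splicing (可变剪接): The process of generating two or more mRNA molecules from a single hnRNA.
amino terminus (N-terminal) (氨基端, N端): In a polypeptide, the end of the molecule that has a free amino group; it corresponds to the 5' end of the gene.
antiparallel (反向平行): Running in opposite directions. In double-stranded DNA, this means that if one strand runs 5' to 3', its complementary strand runs 3' to 5'.

B
base pair (碱基对): (1) In double-stranded DNA, the interaction between a purine and a pyrimidine (specifically between A and T, and between G and C); (2) the basic unit of length of a double-stranded DNA sequence.
beta turns (β转角): In an antiparallel β-sheet, the U-shaped structure formed inside the protein where a β-strand reverses direction.
Bioinformatics (生物信息学): The application of the theories, methods, and techniques of information science to manage, analyze, and use biomolecular data.
Biocomputing (生物计算): In this book, specifically the use of computer technology to analyze and process biomolecular data.
Basic Local Alignment Search Tool (基本的局部比对搜索工具): A commonly used tool for searching sequence databases.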
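Bioinformatics Terms
BLAST: Basic Local Alignment Search Tool. A technique for quickly finding sequences that share contiguous identical segments with a given sequence.
Entrez: The online resource retrieval system provided by the US National Center for Biotechnology Information. It links GenBank sequences to the original publications in which they were reported.
NCBI: The National Center for Biotechnology Information, established in 1988 at the National Library of Medicine (NLM), National Institutes of Health (NIH). It provides informatics services for the biomedical field, such as GenBank, one of the world's three major nucleic acid databases, and the PubMed medical literature database.
Conserved sequence (保守序列): A base sequence in DNA, or an amino acid sequence in a protein, that remains essentially unchanged during evolution.
Domain (功能域): A part of a protein that has a specific function; it is not necessarily contiguous in the sequence. All of the domains of a protein together determine the full function of that protein.
EBI: European Bioinformatics Institute.
EMBL: European Molecular Biology Laboratory.
GenBank: The nucleic acid sequence database provided by the US National Center for Biotechnology Information.
Gene (基因): The basic physical and functional unit of heredity. A gene is a nucleotide sequence located at a particular position on a particular chromosome that encodes a specific functional product (such as a protein or an RNA molecule).
DUST: A program for filtering low complexity regions from nucleic acid sequences.
Gene expression (基因表达).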
Botany (植物学)
cell theory 细胞学说; cell membrane 细胞膜; nucleus 细胞核; organelle 细胞器; cell wall 细胞壁; cytoplasm 细胞质; protoplast 原生质体; cell cycle 细胞周期; mitochondrion 线粒体; photosynthesis 光合作用; unit membrane 单位膜; chloroplast 叶绿体; chlorophyll 叶绿素; xanthophyll 叶黄素; carotene 胡萝卜素; golgiosome 高尔基体; ribosome 核糖体; lysosome 溶酶体; microfilament 微丝; nuclear fission 核分裂; reproduction 繁殖; primary wall 初生壁; secondary wall 次生壁; plasmodesma 胞间连丝; mitosis 有丝分裂; amitosis 无丝分裂; meiosis 减数分裂; cytokinesis 胞质分裂; interphase 间期; prophase 前期; metaphase 中期; anaphase 后期; telophase 末期; tissue 组织; pistil 雌蕊; stamen 雄蕊; ovary 子房; pollination 传粉; pollen tube 花粉管; porogamy 珠孔受精; chalazogamy 合点受精; mesogamy 中部受精; apomixis 无融合生殖; apogamy 无配子生殖; patrogenesis 孤雄生殖; parthenogenesis 孤雌生殖; apospory 无孢子生殖; pericarp 果皮; life history 生活史; root system 根系; main root 主根; lateral root 侧根; taproot system 直根系; fibrous root system 须根系; cortex 皮层; vascular cylinder 维管柱; pericycle 中柱鞘; xylem ray 木射线; vascular ray 维管射线; phloem ray 韧皮射线; root cap 根冠; Casparian strip 凯氏带; primary xylem 初生木质部; primary phloem 初生韧皮部; phelloderm 栓内层; embryo 胚; homologous organ 同源器官; analogous organ 同功器官; endosperm 胚乳; seed coat 种皮; radicle 胚根; plumule 胚芽; hypocotyl 下胚轴; cotyledon 子叶; dormancy 休眠; seed germination 种子萌发; eukaryote 真核生物; prokaryote 原核生物; algae 藻类; blue-green algae 蓝藻; trichogyne 受精丝; mucopolysaccharide 黏多糖; gelatinous sheath 胶质鞘; exospore 外生孢子; heterosexual cell 异性细胞; green algae 绿藻; isogamy 同配生殖; anisogamy 异配生殖; oogamy 卵式生殖; zygogamy 接合生殖; haploid 单倍体; diploid 二倍体; polyploid 多倍体; carposporophyte 果孢子体; brown algae 褐藻; sea-tangle 海带; agar 琼脂; fungi 菌类; parasitism 寄生; saprophytic 腐生的; lichen 地衣; archegonium 颈卵器; antheridium 精子器; antiphyte 孢子体; gametophyte 配子体; protonema 原丝体; Bryophyta 苔藓植物; Cruciferae 十字花科; vascular plants 维管植物; aquatic plant 水生植物; Salicaceae 杨柳科; angiosperm 被子植物; endoplasmic reticulum 内质网; vegetative reproduction 营养繁殖; intercellular layer 胞间层; phellogen / cork cambium 木栓形成层; asexual reproduction 无性繁殖; sexual propagation 有性繁殖; tetradynamous stamen 四强雄蕊; didynamous stamen 二强雄蕊; monadelphous stamen 单体雄蕊; diadelphous stamen 二体雄蕊; triadelphous stamen 三体雄蕊; polyadelphous stamens 多体雄蕊; synantherous stamen 聚药雄蕊; primary wall cells 初生壁细胞; vegetative cell 营养细胞; male sterility 雄性不育; filiform apparatus 丝状器; meristem zone 分生区; elongation zone 伸长区; maturation zone 成熟区; embryophyte 有胚植物; specific parasitism 专性寄生; specific saprophyte 专性腐生; facultative parasitism 兼性寄生; facultative saprophyte 兼性腐生; sexual generation 有性世代; asexual generation 无性世代

Zoology (动物学)
cell 细胞; prokaryotic cell 原核细胞; eukaryotic cell 真核细胞; protein 蛋白质; nucleic acid 核酸; carbohydrate 糖; lipid 脂质; protoplasm 原生质; inclusion 内含物; cell cycle 细胞周期; pulmonary alveolus 肺泡; flagellum 鞭毛; food vacuole 食物泡; pinocytosis 胞饮作用; fission 裂体生殖; microgamete 小配子; zygote 合子; microtubule 微管; contraction silk 收缩丝; merogenesis 卵裂; blastocoele 囊胚腔; complete cleavage 完全卵裂; layering 分层; synapse 突触; myoneme 肌丝; myocyte 肌细胞; mesoglea 中胶层; monoecism 雌雄同体; dioecism 雌雄异体; velum 缘膜; radial symmetry 辐射对称; nerve net 神经网; planula 浮浪幼虫; bilateral symmetry 两侧对称; mesoderm 中胚层; tubule cell 管细胞; osmoregulation 渗透调节; acetabulum 腹吸盘; oral sucker 口吸盘; metacercaria 囊蚴; pseudocoel 假体腔; cuticle 角质膜; cloacal pore 泄殖孔; renette 腺肾细胞; emunctory 排泄管; resting egg 休眠卵; metamere 体节; metamerism 分节现象; sense organ 感觉器; periostracum 壳皮层; prismatic layer 棱柱层; nacreous layer 珍珠层; veliger 面盘幼虫; glochidium 钩介幼虫; adductor 闭壳肌; segmentation 异律分节; linear animal 线形动物; pericardial cavity 围心腔; cervical vertebra 颈椎; sacral vertebra 荐椎; pulmonary vein 肺静脉; precaval vein 前腔静脉; bladder 气囊; middle ear 中耳; tympanum cavity 中耳腔; amnion 羊膜; neopallium 新皮层; lagena 瓶状囊; wishbone 叉骨; postcaval vein 后腔静脉; glandular stomach 腺胃; air sac 气囊; salt gland 盐腺; sclerotic ring 巩膜骨; viviparity 胎生; placenta 胎盘; allantois 尿囊; rumen 瘤胃; bursa of Fabricius 腔上囊; masticatory stomach 肌胃; reticulum 网胃; omasum 瓣胃; abomasum 皱胃; cochlea 耳蜗; earthworm 蚯蚓; internal naris 内鼻孔; amniota 羊膜动物; arthropod 节肢动物; coelenterate 腔肠动物; annelid 环节动物; cell membrane / plasma membrane 细胞膜; epithelial tissue 上皮组织; connective tissue 结缔组织; cartilage tissue 软骨组织; osseous tissue 骨组织; muscular tissue 肌肉组织; cardiac muscle 心肌; intercalated disc 闰盘; Nissl's body 尼氏小体; colony / group 群体; meroblastic cleavage 不完全卵裂; colonial theory 群体说; gastrovascular cavity 消化循环腔; muscle system 肌肉体系; excretory system 排泄系统; reproductive system 生殖系统; digestive system 消化系统; archinephric duct 原肾管; basal lamina / basal membrane 基膜; cross-fertilization 异体受精; self-fertilization 自体受精; final host 终寄主; first intermediate host 第一中间寄主; semicircular canal 半规管; second intermediate host 第二中间寄主

Genetics (遗传学)
heredity 遗传; variation 变异; gene 基因; Pisum sativum 豌豆; segregation 分离; gamete 生殖细胞; zygote 合子; allele 等位基因; genotype 基因型; phenotype 表现型; test cross 测交; Oryza sativa 水稻; diploid 二倍体; haploid 单倍体; centromere 着丝粒; satellite 随体; linker 连丝; mitosis 有丝分裂; mesoblast 中胚层; spindle 纺锤体; interphase 间期; spindle fiber 纺锤丝; Vicia faba 蚕豆; nucleoplasm 核质; spermatogonium 精原细胞; oogonium 卵原细胞; spermatid 精细胞; phenocopy 拟表型; epistasis 上位效应; mutant 突变型; gametic lethal 配子致死; zygotic lethal 合子致死; autosome 常染色体; dominant lethal 显性致死; carrier 携带者; homozygote 纯合体; heterozygote 杂合体; linkage group 连锁群; interference 干涉; coincidence 并发率; genetic map 遗传学图; wild type 野生型; mutation 突变; heterokaryon 异核体; auxotroph 营养缺陷型; strain 菌株; recipient 受体; donor 供体; fragment 片段; induction 诱导; prophage 原噬菌体; transduction 转导; Mendel's laws 孟德尔定律; law of segregation 分离定律; first filial generation 子一代; parental generation 亲代; dominant character 显性性状; recessive character 隐性性状; hereditary determinant 遗传因子; parental combination 亲组合; recombination 重组合; Punnett square 棋盘法; Mendelian character 孟德尔性状; primary constriction 初级缢痕; secondary constriction 次级缢痕; nucleolar organizer 核仁形成区; first polar body 第一极体; second polar body 第二极体; sister chromatids 姐妹染色单体; female gametic nucleus 卵核; multiple alleles 复等位基因; sex chromosome 性染色体; sex-linked inheritance 伴性遗传; complementary gene 互补基因; homologous chromosome 同源染色体; secondary oocyte 次级卵母细胞; three-point testcross 三点测交; primary spermatocyte 初级精母细胞; secondary spermatocyte 次级精母细胞; first division segregation 第一次分裂分离; second division segregation 第二次分裂分离; law of independent assortment 自由组合定律

Biochemistry (生物化学)
essential element 必需元素; trace elements 微量元素; proteoglycan 蛋白聚糖; amino acid 氨基酸; primary structure 一级结构; random coil 无规卷曲; structural domain 结构域; subunit 亚基; denaturation 变性; adenine 腺嘌呤; guanine 鸟嘌呤; cytosine 胞嘧啶; thymine 胸腺嘧啶; uracil 尿嘧啶; nucleoside 核苷; nucleotide 核苷酸; base pairing 碱基配对; base pair 碱基对数; base 碱基数; gyrase 旋转酶; nucleosome 核小体; complementary DNA 互补DNA; plasmid 质粒; transposons 转座子; repetitive sequence 重复序列; exon 外显子; intron 内含子; variable loop 可变环; ribonuclease 核糖核酸酶; renaturation 复性; hyperchromic effect 增色效应; base stacking force 碱基堆积力; annealing 退火; melting-out temperature 熔解温度; hypochromic 
effect减色效应maltose麦芽糖sucrose蔗糖lactose乳糖starch淀粉glycogen糖原cellulose纤维素cellulase纤维素酶selectivity选择性substrate底物holoenzyme全酶cofactor辅因子coenzyme辅酶oxidase氧化酶metabolism新陈代谢assimilation同化作用catabolism异化作用metabolite代谢产物biological oxidation 生物氧化cytochrome细胞色素rotenone鱼藤酮amytal阿密妥antimycin A抗霉素A cyanide氰化物glycolysis糖酵解ethanol乙醇citrate柠檬酸cis-aconitate 顺乌头酸succinic acid琥珀酸oxaloacetic acid草酰乙酸acetyl-coenzyme乙酰辅酶fumarate延胡索酸glyoxylate cycle 乙醛酸循环malate苹果酸fatty acid 脂肪酸carbon unit一碳单位replicon复制子core enzyme 核心酶primosome引发体Okazaki fragment冈崎片段leading chain 前导链lagging strand后随链terminator终止子telomere端粒telomerase端粒酶replication fork复制叉vector载体promoter启动子terminator终止子operon操纵子codon密码子degeneracy简并性hormone激素citric acid cycle 柠檬酸循环deamination脱氨基作用urea cycle尿素循环euchromatin常染色质messenger RNA信使RNAtransfer RNA转移RNA ribosome RNA核糖体RNA metabolic regulation代谢调节feedback regulation反馈调节structural gene结构基因promoter gene启动基因operator gene操纵基因regulator gene调节基因termination factor终止因子triplet code三联体密码initiator codon起始密码termination codon终止密码semiconservative replication半保留复制ornithine cycle鸟氨酸循环ketogenic amino acid生酮氨基酸glucogenic amino acid生糖氨基酸oxidative deamination氧化脱氨作用transamination转氨基作用reverse transcription逆转录decarboxylation脱羧作用semidiscontinuous replication半不连续复制reverse transcriptase 逆转录酶missense mutation错义突变synonymous mutation同义突变neutral mutation中性突变nonsense mutation无义突变phosphatidic acid 磷脂酸essential amino acids 必需氨基酸dihydrouracil loop二氢尿嘧啶环anticodon loop反密码子环double-strand circular DNA 双链环形DNA superhelical DNA 超螺旋DNA open circular DNA 开环DNA linear DNA 线形DNAbase stacking force 碱基堆积力secondary structure二级结构super-secondary structure超二级结构tertiary structure三级结构quaternary structure四级结构negative supercoil DNA负超螺旋DNA positive supercoil DNA正超螺旋DNAGlyceraldehyde-3-phosphate甘油醛-3-二磷酸glucogenic and ketogenic amino acid生糖兼生酮氨基酸restriction endonuclease限制性内切酶polymerase chain reaction聚合酶链反应Microbiology微生物学living creatures 生物culture medium 培养基lawn菌苔culture plate 培养平板bacteria 细菌archaea 古生菌eukaryote真核生物prokaryote 原核生物protozoan 
原生动物hypha 菌丝mycoplasma 支原体yeast 酵母菌plasmolysis 质壁分离Escherichia Coli大肠杆菌murein胞壁质peptidoglycan 肽聚糖mucopeptide黏肽outer membrane外膜chromosome染色体nucleolus 核仁nucleoid 拟核chromatin 染色质centromere 着丝粒telomere 端粒protoplast 原生质体mycoplasma 支原体glycoprotein 糖蛋白mesosome 间体cytoplasm细胞质megnetosome磁小体nucleoid拟核glycocalyx 糖被capsule 荚膜flagellum 鞭毛lysosome 溶酶体chloroplast 叶绿体thylakoid类囊体inorganic salt 无机盐peptone 蛋白胨sulfur bacteria 硫细菌beef extract牛肉膏vitamin 维生素inclusion body 内含物lithotroph 无机营养型medium 培养基agar 琼脂organotroph 有机营养型antiport 逆向运输active transport 主动运输pinocytosis 胞饮作用catabolism 分解代谢passive transport 被动运输uniport 单向运输anabolism 合成代谢fermentation发酵batch culture 分批培养log phase 对数生长期stationary phase 稳定生长期lag phase 迟缓期decline phase衰亡期aerobe 好氧菌antibiotic 抗生素antigenome 反基因组transformation 转化genome 基因组plasmid 质粒transforming factor 转化因子diploid 二倍体haploid 单倍体transposable element 转座因子conjugation接合作用transposon转座子phenotype 表型genotype基因型auxotroph营养缺陷型wild-type野生型transition 转换transversion 颠换spontaneous mutation 自发突变reverse mutation 回复突变sexduction 性导transduction 转导promoter 启动子operon 操纵子recombination repair 重组修复repressor 阻遏蛋白corepressor辅阻遏物clone 克隆denaturation 变性annealing 退火extension 延伸cloning vector 克隆载体replicon 复制子telomere 端粒cohesive end 黏性末端promoter 启动子terminator 终止子gene therapy 基因治疗phylogeny 系统发育ammonification 氨化作用nitrification 硝化作用denitrification 反硝化作用expression vector 表达载体aerobic respiration有氧呼吸anaerobic respiration无氧呼吸origin of replication 复制起始点incompatibility 不亲和性gene mutation 基因突变synonymous mutation 同义突变chromosomal aberration 染色体畸变missense mutation 错义突变frame-shift mutation 移码突变lactose operon 乳糖操纵子negative transcription control 负转录调控tryptophan operon 色氨酸操纵子cytoplasmic inheritance 细胞质遗传genetic engineering 基因工程recombinant DNA technology 重组DNA技术palindromic structure 回文结构spread plate method 涂布平板法pour plate method 倾注培养法streak plate method 平板划线法shake tube method 稀释摇管法continuous culture 连续培养。
Exon (外显子): Coding region of DNA. See CDS.
Expressed Sequence Tag (EST) (表达序列标签): Randomly selected, partial cDNA sequence; represents its corresponding mRNA. dbEST is a large database of ESTs at GenBank, NCBI.
FASTA (一种主要数据库搜索程序): The first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt". The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable, which specifies the size of a "word". (Pearson and Lipman)
Extreme value distribution (极值分布): Some measurements are found to follow a distribution that has a long tail which decays at high values much more slowly than that found in a normal distribution. This slow-falling type is called the extreme value distribution. The alignment scores between unrelated or random sequences are an example. These scores can reach very high values, particularly when a large number of comparisons are made, as in a database similarity search. The probability of a particular score may be accurately predicted by the extreme value distribution, which follows a double negative exponential function after Gumbel.
False negative (假阴性): A negative data point collected in a data set that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have been recorded as positive.
False positive (假阳性): A positive data point collected in a data set that was incorrectly reported due to a failure of the test.
If the test had correctly measured the data point, the data would have been recorded as negative.
Feed-forward neural network (前馈神经网络): Organizes nodes into sequential layers in which the nodes in each layer are fully connected with the nodes in the next layer, except for the final output layer. Input is fed from the input layer through the layers in sequence in a “feed-forward” direction, resulting in output at the final layer. See also Neural network.
Filtering (window size): During pair-wise sequence alignment using the dot matrix method, random matches can be filtered out by using a sliding window to compare the two sequences. Rather than comparing a single sequence position at a time, a window of adjacent positions in the two sequences is compared and a dot, indicating a match, is generated only if a certain minimal number of matches occur.
Filtering (过滤): Also known as Masking. The process of hiding regions of (nucleic acid or amino acid) sequence having characteristics that frequently lead to spurious high scores. See SEG and DUST.
Finished sequence (完成序列): Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.
Fourier analysis: Studies the approximation and decomposition of functions using trigonometric polynomials.
Format (file) (格式): Different programs require that information be specified to them in a formal manner, using particular keywords and ordering. This specification is a file format.
Forward-backward algorithm: Used to train a hidden Markov model by aligning the model with training sequences. The algorithm then refines the model to reduce the error when fitted to the given data using a gradient descent approach.
FTP (File Transfer Protocol) (文件传输协议): Allows a person to transfer files from one computer to another across a network using an FTP-capable client program. The FTP client program can only communicate with machines that run an FTP server.
The server, in turn, will make a specific portion of its file system available for FTP access, provided that the client is able to supply a recognized user name and password to the server.
Full shotgun clone (鸟枪法克隆): A large-insert clone for which full shotgun sequence has been produced.
Functional genomics (功能基因组学): Assessment of the function of genes identified by between-genome comparisons. The function of a newly identified gene is tested by introducing mutations into the gene and then examining the resultant mutant organism for an altered phenotype.
Gap (空位/间隙/缺口): A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.
Gap penalty (空位罚分): A numeric score used in sequence alignment programs to penalize the presence of gaps within an alignment. The value of a gap penalty affects how often gaps appear in alignments produced by the algorithm. Most alignment programs suggest gap penalties that are appropriate for particular scoring matrices.
Genetic algorithm (遗传算法): A kind of search algorithm that was inspired by the principles of evolution. A population of initial solutions is encoded and the algorithm searches through these by applying a pre-defined fitness measurement to each solution, selecting those with the highest fitness for reproduction. New solutions can be generated during this phase by crossover and mutation operations, defined in the encoded solutions.
Genetic map (遗传图谱): A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis.
The unit of distance is centimorgans (cM), denoting a 1% chance of recombination.
Genome (基因组): The genetic material of an organism, contained in one haploid set of chromosomes.
Gibbs sampling method: An algorithm for finding conserved patterns within a set of related sequences. A guessed alignment of all but one sequence is made and used to generate a scoring matrix that represents the alignment. The matrix is then matched to the left-out sequence, and a probable location of the corresponding pattern is found. This prediction is then input into a new alignment and another scoring matrix is produced and tested on a new left-out sequence. The process is repeated until there is no further improvement in the matrix.
Global alignment (整体联配): Attempts to match as many characters as possible, from end to end, in a set of two or more sequences.
Gopher (一个文档发布系统,允许检索和显示文本文件): A document publishing system that allows text files to be retrieved and displayed.
Graph theory (图论): A branch of mathematics which deals with problems that involve a graph or network structure. A graph is defined by a set of nodes (or points) and a set of arcs (lines or edges) joining the nodes. In sequence and genome analysis, graph theory is used for sequence alignments and for clustering similar genes.
GSS (基因组勘测序列): Genome survey sequence.
GUI (图形用户界面): Graphical user interface.
H (相对熵值): H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990). H can be thought of as a measure of the average information (in bits) available per position that distinguishes an alignment from chance. At high values of H, short alignments can be distinguished from chance, whereas at lower H values, a longer alignment may be necessary (Altschul, 1991).
Half-bits: Some scoring matrices are in half-bit units. These units are logarithms to the base 2 of odds scores times 2.
Heuristic (启发式方法): A procedure that progresses along empirical lines by using rules of thumb to reach a solution.
The solution is not guaranteed to be optimal.
Hexadecimal system (十六进制系统): The base 16 counting system that uses the digits 0-9 followed by the letters A-F.
HGMP (人类基因组图谱计划): Human Genome Mapping Project.
Hidden Markov Model (HMM) (隐马尔可夫模型): In sequence analysis, an HMM is usually a probabilistic model of a multiple sequence alignment, but can also be a model of periodic patterns in a single sequence, representing, for example, patterns found in the exons of a gene. In a model of multiple sequence alignments, each column of symbols in the alignment is represented by a frequency distribution of the symbols called a state, and insertions and deletions by other states. One then moves through the model along a particular path from state to state trying to match a given sequence. The next matching symbol is chosen from each state, recording its probability (frequency) and also the probability of going to that particular state from a previous one (the transition probability). State and transition probabilities are then multiplied to obtain a probability of the given sequence. Generally speaking, an HMM is a statistical model for an ordered sequence of symbols, acting as a stochastic state machine that generates a symbol each time a transition is made from one state to the next. Transitions between states are specified by transition probabilities.
Hidden layer (隐藏层): An inner layer within a neural network that receives its input and sends its output to other layers within the network. One function of the hidden layer is to detect covariation within the input data, such as patterns of amino acid covariation that are associated with a particular type of secondary structure in proteins.
Hierarchical clustering (分级聚类): The clustering or grouping of objects based on some single criterion of similarity or difference. An example is the clustering of genes in a microarray experiment based on the correlation between their expression patterns.
The distance method used in phylogenetic analysis is another example.
Hill climbing: A nonoptimal search algorithm that selects the singular best possible solution at a given state or step. The solution may result in a locally best solution that is not a globally best solution.
Homology (同源性): A similar component in two organisms (e.g., genes with strongly similar sequences) that can be attributed to a common ancestor of the two organisms during evolution.
Horizontal transfer (水平转移): The transfer of genetic material between two distinct species that do not ordinarily exchange genetic material. The transferred DNA becomes established in the recipient genome and can be detected by a novel phylogenetic history and codon content compared to the rest of the genome.
HSP (高比值片段对): High-scoring segment pair. Local alignments with no gaps that achieve one of the top alignment scores in a given search.
HTGS/HTG (高通量基因组序列): High-throughput genomic sequences.
HTML (超文本标识语言): The Hyper-Text Markup Language (HTML) provides a structural description of a document using a specified tag set. HTML currently serves as the Internet lingua franca for describing hypertext Web page documents.
Hyperplane: A generalization of the two-dimensional plane to N dimensions.
Hypercube: A generalization of the three-dimensional cube to N dimensions.
Identity (相同性/相同率): The extent to which two (nucleotide or amino acid) sequences are invariant.
Indel (插入或删除的缩略语): An insertion or deletion in a sequence alignment.
Information content (of a scoring matrix): A representation of the degree of sequence conservation in a column of a scoring matrix representing an alignment of related sequences. It is also the number of questions that must be asked to match the column to a position in a test sequence.
For bases, the maximum possible number is 2, and for proteins, 4.32 (logarithm to the base 2 of the number of possible sequence characters).
Information theory (信息理论): A branch of mathematics that measures information in terms of bits, the minimal amount of structural complexity needed to encode a given piece of information.
Input layer (输入层): The initial layer in a feed-forward neural net. This layer encodes input information that will be fed through the network model.
Interface definition language: Used to define an interface to an object model in a programming-language-neutral form, where an interface is an abstraction of a service defined only by the operations that can be performed on it.
Internet (因特网): The network infrastructure, consisting of cables interconnected by routers, that provides global connectivity for individual computers and private networks of computers. A second sense of the word internet is the collective computer resources available over this global network.
Interpolated Markov model: A type of Markov model of sequences that examines sequences for patterns of variable length in order to discriminate best between genes and non-gene sequences.
Intranet (内部网)
Intron (内含子): Non-coding region of DNA.
Iterative (反复的/迭代的): A sequence of operations in a procedure that is performed repeatedly.
Java (一种由SUN Microsystem开发的编程语言): A programming language developed by Sun Microsystems.
K (BLAST程序的一个统计参数): A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale for search space size. The value K is used in converting a raw score (S) to a bit score (S').
K-tuple (字/字长): Identical short stretches of sequences, also called words.
lambda (λ,BLAST程序的一个统计参数): A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale for the scoring system. The value lambda is used in converting a raw score (S) to a bit score (S').
LAN (局域网): Local area network.
Likelihood (似然性): The hypothetical probability that an event which has already occurred would yield a specific outcome.
Unlike probability, which refers to future events, likelihood refers to past events.
Linear discriminant analysis: An analysis in which a straight line is located on a graph between two sets of data points in a location that best separates the data points into two groups.
Local alignment (局部联配): Attempts to align regions of sequences with the highest density of matches. In doing so, one or more islands of subalignments are created in the aligned sequences.
Log odds score (概率对数值): The logarithm of an odds score. See also Odds score.
Low Complexity Region (LCR) (低复杂性区段): Regions of biased composition including homopolymeric runs, short-period repeats, and more subtle overrepresentation of one or a few residues. The SEG program is used to mask or filter LCRs in amino acid queries. The DUST program is used to mask or filter LCRs in nucleic acid queries.
Machine learning (机器学习): The training of a computational model of a process or classification scheme to distinguish between alternative possibilities.
Markov chain (马尔可夫链): Describes a process that can be in one of a number of states at any given time. The Markov chain is defined by probabilities for each transition occurring; that is, probabilities of the occurrence of state s_j given that the current state is s_i. Substitutions in nucleic acid and protein sequences are generally assumed to follow a Markov chain in that each site changes independently of the previous history of the site. With this model, the number and types of substitutions observed over a relatively short period of evolutionary time can be extrapolated to longer periods of time. In performing sequence alignments and calculating the statistical significance of alignment scores, sequences are assumed to be Markov chains in which the choice of one sequence position is not influenced by another.
Masking (过滤): Also known as Filtering.
The removal of repeated or low complexity regions from a sequence in order to improve the sensitivity of sequence similarity searches performed with that sequence.
Maximum likelihood (phylogeny, alignment) (最大似然法): The most likely outcome (tree or alignment), given a probabilistic model of evolutionary change in DNA sequences.
Maximum parsimony (最大简约法): The minimum number of evolutionary steps required to generate the observed variation in a set of sequences, as found by comparison of the number of steps in all possible phylogenetic trees.
Method of moments: The mean or expected value of a variable is the first moment of the values of the variable around the mean, defined as that number from which the sum of deviations to all values is zero. The standard deviation is the second moment of the values about the mean, and so on.
Minimum spanning tree: Given a set of related objects classified by some similarity or difference score, the minimum spanning tree joins the most-alike objects on adjacent outer branches of a tree and then sequentially joins less-alike objects by more inward branches. The tree branch lengths are calculated by the same neighbor-joining algorithm that is used to build phylogenetic trees of sequences from a distance matrix. The sum of the resulting branch lengths between each pair of objects will be approximately that found by the classification scheme.
MMDB (分子建模数据库): Molecular Modelling Database. A taxonomy-assigned database of PDB (see PDB) files, and related information.
Molecular clock hypothesis (分子钟假设): The hypothesis that sequences change at the same rate in the branches of an evolutionary tree.
Monte Carlo (蒙特卡罗法): A method that samples possible solutions to a complex problem as a way to estimate a more general solution.
Motif (模序): A short conserved region in a protein sequence.
Motifs are frequently highly conserved parts of domains.
Multiple Sequence Alignment (多序列联配): An alignment of three or more sequences with gaps inserted in the sequences such that residues with common structural positions and/or ancestral residues are aligned in the same column. Clustal W is one of the most widely used multiple sequence alignment programs.
Mutation data matrix (突变数据矩阵,即PAM矩阵): A scoring matrix compiled from the observation of point mutations between aligned sequences. Also refers to a Dayhoff PAM matrix in which the scores are given as log odds scores.
N50 length (N50长度,即覆盖50%所有核苷酸的最大序列重叠群长度): A measure of the contig length (or scaffold length) containing a 'typical' nucleotide. Specifically, it is the maximum length L such that 50% of all nucleotides lie in contigs (or scaffolds) of size at least L.
Nats (natural logarithm): A number expressed in units of the natural logarithm.
NCBI (美国国家生物技术信息中心): National Center for Biotechnology Information (USA). Created by the United States Congress in 1988 to develop information systems to support the biological research community.
Needleman-Wunsch algorithm (Needleman-Wunsch算法): Uses dynamic programming to find global alignments between sequences.
Neighbor-joining method (邻接法): Clusters together alike pairs within a group of related objects (e.g., genes with similar sequences) to create a tree whose branches reflect the degrees of difference among the objects.
Neural network (神经网络): From artificial intelligence algorithms, techniques that involve a set of many simple units that hold symbolic data, which are interconnected by a network of links associated with numeric weights. Units operate only on their symbolic data and on the inputs that they receive through their connections. Most neural networks use a training algorithm (see Back-propagation) to adjust connection weights, allowing the network to learn associations between various input and output patterns.
See also Feed-forward neural network.
NIH (美国国家卫生研究院): National Institutes of Health (USA).
Noise (噪音): In sequence analysis, a small amount of randomly generated variation in sequences that is added to a model of the sequences (e.g., a hidden Markov model or scoring matrix) in order to avoid the model overfitting the sequences. See also Overfitting.
Normal distribution (正态分布): The distribution found for many types of data such as body weight, size, and exam scores. The distribution is a bell-shaped curve that is described by a mean and a standard deviation. Local sequence alignment scores between unrelated or random sequences do not follow this distribution but instead the extreme value distribution, which has a much extended tail for higher scores. See also Extreme value distribution.
Object Management Group (OMG) (国际对象管理协作组): A not-for-profit corporation that was formed to promote component-based software by introducing standardized object software. The OMG establishes industry guidelines and detailed object management specifications in order to provide a common framework for application development. Within OMG is a Life Sciences Research group, a consortium representing pharmaceutical companies, academic institutions, software vendors, and hardware vendors who are working together to improve communication and inter-operability among computational resources in life sciences research. See CORBA.
Object-oriented database (面向对象数据库): Unlike relational databases (see entry), which use a tabular structure, object-oriented databases attempt to model the structure of a given data set as closely as possible. In doing so, object-oriented databases tend to reduce the appearance of duplicated data and the complexity of query structure often found in relational databases.
Odds score (概率/几率值): The ratio of the likelihoods of two events or outcomes.
In sequence alignments and scoring matrices, the odds score for matching two sequence characters is the ratio of the frequency with which the characters are aligned in related sequences divided by the frequency with which those same two characters align by chance alone, given the frequency of occurrence of each in the sequences. Odds scores for a set of individually aligned positions are obtained by multiplying the odds scores for each position. Odds scores are often converted to logarithms to create log odds scores that can be added to obtain the log odds score of a sequence alignment.
OMIM (一种人类遗传疾病数据库): Online Mendelian Inheritance in Man. Database of genetic diseases with references to molecular medicine, cell biology, biochemistry and clinical details of the diseases.
Optimal alignment (最佳联配): The highest-scoring alignment found by an algorithm capable of producing multiple solutions. This is the best possible alignment that can be found, given any parameters supplied by the user to the sequence alignment program.
ORF (开放阅读框): Open Reading Frame. A series of codons (base triplets) which can be translated into a protein. There are six potential reading frames of an unidentified sequence; BLASTX (see BLAST) translates a nucleotide query in all six reading frames and then attempts to align the translations to sequences in a protein database. The most likely reading frame can be identified using on-line software (e.g. ORF Finder).
Orthologous (直系同源): Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function. A pair of genes found in two species are orthologous when the encoded proteins are 60-80% identical in an alignment.
The proteins almost certainly have the same three-dimensional structure, domain structure, and biological function, and the encoding genes have originated from a common ancestor gene at an earlier evolutionary time. Two orthologs I and II in genomes A and B, respectively, may be identified when the complete genomes of two species are available: (1) in a database similarity search of the whole proteome of B using I as a query, II is the best hit found, and (2) I is the best hit when II is used as a query of the proteome of A. The best hit is the database sequence with the lowest expect value (E). Orthology is also predicted by a very close phylogenetic relationship between sequences or by a cluster analysis. Compare to Paralogs. See also Cluster analysis.
Output layer (输出层): The final layer of a neural network in which signals from lower levels in the network are input into output states where they are weighted and summed to give an output signal. For example, the output signal might be the prediction of one type of protein secondary structure for the central amino acid in a sequence window.
Overfitting: Can occur when using a learning algorithm to train a model such as a neural net or hidden Markov model. Overfitting refers to the model becoming too highly representative of the training data and thus no longer representative of the overall range of data that is supposed to be modeled.
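The two-way best-hit test described under Orthologous can be sketched in a few lines. This is only an illustration: it assumes the search results have already been parsed into dictionaries mapping each query to its hits and their E values, and the names `best_hit`, `reciprocal_best_hits`, `a_vs_b` and `b_vs_a` are hypothetical, not part of BLAST or any other tool.

```python
def best_hit(hits):
    """Return the subject with the lowest E value from a {subject: evalue} dict."""
    return min(hits, key=hits.get) if hits else None


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs in which each protein is the other's best hit.

    a_vs_b: {protein in genome A: {protein in genome B: E value}}
    b_vs_a: the same structure for the reverse search.
    """
    pairs = []
    for a, hits in a_vs_b.items():
        b = best_hit(hits)
        if b is not None and best_hit(b_vs_a.get(b, {})) == a:
            pairs.append((a, b))
    return pairs


# Made-up E values, for illustration only.
a_vs_b = {"A1": {"B1": 1e-50, "B2": 1e-5}, "A2": {"B2": 1e-20}}
b_vs_a = {"B1": {"A1": 1e-48}, "B2": {"A1": 1e-6, "A2": 1e-22}}
candidate_orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Pairs found this way are candidate orthologs only; as the entry notes, phylogenetic or cluster analysis gives further support.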
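Clone-by-clone method (逐个克隆法): The BAC clones ordered in a contig are subcloned, sequenced, and assembled one at a time (the approach of the public sequencing project).
Whole-genome shotgun method (全基因组鸟枪法): Starting from a limited amount of mapping information, this method skips the construction of large-insert contigs, breaks the genome directly into small fragments that are sequenced at random, and assembles the reads on supercomputers.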
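Assemblies produced by either strategy are commonly summarised with the N50 length defined in the English glossary above. A minimal sketch, assuming the contig lengths are already available as a plain list of integers; the function name `n50` is illustrative and not taken from any assembler:

```python
def n50(lengths):
    """Largest L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Hypothetical contig lengths (in kb) for illustration.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

Here the total is 290, and the two longest contigs (80 + 70 = 150) already cover at least half of it, so the N50 is the length of the second contig.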
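Single nucleotide polymorphism (SNP, 单核苷酸多态性): A DNA sequence polymorphism at the genome level caused by variation in a single nucleotide.
Genetic map (遗传图谱), also called a linkage map (连锁图谱): A genome map that uses genetically polymorphic markers as landmarks and genetic distance as the map distance. A locus is genetically polymorphic when it has more than one allele, each occurring in the population at a frequency above 1%. Genetic distance is the percentage of meioses in which crossing over and recombination occur between two loci; a recombination frequency of 1% is defined as 1 cM.
The construction of genetic maps lays the groundwork for identifying genes and locating them in the genome.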
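The definition of the centimorgan above amounts to a one-line calculation. The sketch below assumes a cross in which recombinant and total offspring have simply been counted; the function name and the counts are hypothetical. It applies the 1% = 1 cM rule directly, which is reasonable only for closely linked loci where double crossovers are rare.

```python
def map_distance_cm(recombinants, total_offspring):
    """Genetic distance in cM, taken as the percentage of recombinant offspring."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100.0 * recombinants / total_offspring


# Hypothetical cross: 12 recombinants among 200 offspring.
distance = map_distance_cm(12, 200)
```

With these counts the recombination frequency is 6%, so the two loci lie about 6 cM apart.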
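Physical map (物理图谱): Information on the order and spacing of all the genes that make up a genome, drawn by direct measurement of the genome's DNA molecules.
The purpose of a physical map is to lay out the genetic information of the genes, and their relative positions on each chromosome, in a linear and systematic way.
Transcription map (转录图谱): A map, built by identifying the protein-coding sequences in the genome, that combines information on gene sequence, position, and expression pattern.
Comparative genomics (比较基因组学): The study of whole-genome nucleotide sequences by global comparison.
Its distinguishing feature is that genome size and the number, position, and order of genes, as well as the loss of particular genes, are compared at the level of the entire genome.
Environmental genomics (环境基因组学): The science that studies the relationship between gene polymorphisms and the environment, builds a catalogue of polymorphisms in environmental response genes, and identifies the environmental factors that cause human disease.
Metagenome (宏基因组): The sum of the genetic material of all organisms in a particular environment; it determines the life phenomena of that community.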
转录组即一个活细胞所能转录出来的所有mRNA。
研究转录组的一个重要方法就是利用DNA芯片技术检测有机体基因组中基因的表达。
而研究生物细胞中转录组的发生和变化规律的科学就称为转录组学。
蛋白质组学:研究不同时相细胞内蛋白质的变化,揭示正常和疾病状态下,蛋白质表达的规律,从而研究疾病发生机理并发现新药。
蛋白组:基因组表达的全部蛋白质,是一个动态的概念,指的是某种细胞或组织中,基因组表达的所有蛋白质。
代谢组是指是指某个时间点上一个细胞所有代谢物的集合,尤其指在不同代谢过程中充当底物和产物的小分子物质,如脂质,糖,氨基酸等,可以揭示取样时该细胞的生理状态。
名词解释:Consensus sequence:共有序列,指多种原核基因启动序列特定区域内,通常在转录起始点上游-10及-35区域存在一些相似序列。
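1. FASTA sequence format: represents a DNA or protein sequence as a string of nucleotide or amino acid characters preceded by a header line. A greater-than sign (>) marks the start of a new sequence record; the rest of that line is the description, and the following lines hold the sequence. There are no other special requirements.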
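A minimal reader for the FASTA layout just described might look like the sketch below. It relies only on the rule that a line starting with ">" opens a new record; the function name `parse_fasta` and the example records are illustrative, not part of any particular library.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


example = ">seq1 demo record\nACGT\nACGT\n>seq2\nMKV\n"
records = list(parse_fasta(example))
print(records)
```

For production work an established parser is usually the better choice; this sketch only makes the record structure concrete.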
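2. Similarity (相似性): a directly measurable, continuous quantity; in sequence alignment it describes the proportion of positions at which the query sequence and the target sequence carry the same DNA base or amino acid residue. Strictly speaking, that proportion is the percent identity; similarity scores may also count conservative substitutions.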
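The proportion of identical positions can be computed from two already-aligned strings as in the sketch below. The convention used here (columns containing a gap are excluded from the denominator) is one assumption among several in common use, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the ungapped columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are not counted
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGT-A", "ACCTTA"))
```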
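3. GenBank sequence format: the basic unit of information in the GenBank database, and one of the most widely used sequence formats in bioinformatics.
The file is divided by field into four parts: the first contains information about the whole record (descriptors); the second contains annotation; the third is the reference section, which gives the scientific basis for the record; the fourth is the nucleotide sequence itself, terminated by "//".
4. Motif (模体): a short, conserved polypeptide segment, typically 10-20 residues; proteins that contain the same motif are not necessarily homologous.
5. Query sequence (查询序列): also called the search sequence; the sequence used to search a database and to be compared for similarity.
6. Scoring matrix (打分矩阵): a method for evaluating the quality of pairwise sequence alignments in similarity searches.
Two kinds exist: those based on theory (for example, the similarity among nucleotides or among amino acids) and those based on observed evolutionary distance (for example, PAM).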
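Entries in matrices such as PAM are log odds values: the logarithm of how often a pair of residues is seen aligned in related sequences relative to how often it would align by chance. The sketch below shows that calculation in bits. The frequencies are invented for illustration, and the chance expectation is taken simply as the product of the two residue frequencies, which is one common simplification.

```python
import math


def log_odds_bits(pair_freq, freq_a, freq_b):
    """log2 of the observed aligned-pair frequency over the chance expectation."""
    return math.log2(pair_freq / (freq_a * freq_b))


# Hypothetical frequencies: the pair is seen twice as often as chance predicts.
score = log_odds_bits(0.125, 0.25, 0.25)
print(score)

# Log odds scores are additive across aligned positions.
alignment_score = sum([score, log_odds_bits(0.0625, 0.25, 0.25)])
print(alignment_score)
```

Published matrices additionally scale and round these values to integers.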
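7. Gap (空位): when sequences of different lengths are aligned, one or more positions must be inserted to obtain the best alignment, which interrupts one of the sequences; these interrupted positions are called gaps.
8. PDB: the Protein Data Bank holds a large number of experimentally determined (X-ray crystallography, nuclear magnetic resonance (NMR)) three-dimensional structures of biological macromolecules, recording atomic coordinates, the chemical structures of ligands, and descriptions of the crystal structure.
A PDB accession code consists of one digit followed by three alphanumeric characters (for example, 4HHB); the database also supports keyword searches and sequence searches with the FASTA program.
9. Prosite: a database of protein families and domains, containing biologically meaningful sites, patterns, and statistical signatures that help identify protein families.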
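Biological Terms Explained (with English Names)
1. Cell (细胞): the basic unit of an organism and the site of life activities.
A cell has structures such as the cell membrane, the cytoplasm, and (in eukaryotes) the nucleus, which carry out the various functions of life.
2. DNA: the abbreviation for deoxyribonucleic acid, the molecule that stores genetic information in organisms.
DNA is built from nucleotides carrying four kinds of bases; base pairing between two strands forms the double helix, which transmits genetic information.
3. Protein (蛋白质): an important class of molecules in organisms with many functions, such as catalyzing reactions, transporting substances, and regulating cell activity.
Proteins are made of amino acids joined by peptide bonds into long chains.
4. Enzyme (酶): a biological catalyst, in most cases a protein, that catalyzes chemical reactions.
Enzymes lower the activation energy of chemical reactions and so increase the reaction rate, speeding up life processes.
5. Photosynthesis (光合作用): the process by which plants, algae, and certain bacteria use light energy to convert carbon dioxide and water into organic matter and oxygen.
Photosynthesis is the foundation of the biological energy cycle on Earth.
6. Respiration (呼吸作用): the process by which organisms break down organic matter and release energy.
Respiration takes two forms: aerobic respiration, which requires oxygen, and anaerobic respiration, which does not.
7. Mitosis (有丝分裂): a form of cell division in which one cell divides into two cells, ensuring that genetic information is passed on.
8. Meiosis (减数分裂): a form of cell division in which one cell divides into four cells, each containing only one set of chromosomes; it gives rise to the germ cells.
9. Gene (基因): the basic unit controlling hereditary traits in organisms, located on chromosomes.
By encoding proteins (or functional RNA), genes determine the hereditary traits of an organism.
10. Ecosystem (生态系统): an interacting whole made up of organisms and their non-living environment.
The organisms in an ecosystem are linked by food chains and food webs, and ecological balance is maintained through energy flow and the cycling of matter.
11. Biodiversity (生物多样性): the diversity of species, genetic variation, and ecosystems on Earth.
It comprises species diversity, genetic diversity, and ecosystem diversity.
12. Evolution (进化): the process by which populations of organisms gradually adapt to environmental change over long periods, through mechanisms such as natural selection, gene mutation, and genetic recombination.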
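Common Terms in Bioinformatics Analysis
Bioinformatics (生物信息学): an interdisciplinary field that applies the theory and methods of computer science, information technology, and mathematics to the study of biological information.
It covers the study, archiving, display, processing, and simulation of biological data; the handling of genetic and physical maps; the analysis of nucleotide and amino acid sequences; the discovery of new genes; and the prediction of protein structure.
Genome (基因组): the complete genetic material contained in one haploid set of chromosomes of a species, also called the chromosome set.
It contains all of the genes of that species.
Gene (基因): the physical and functional unit of genetic information, containing the entire nucleotide sequence needed to produce one polypeptide chain or one functional RNA.
Genomics (基因组学): the science of genome mapping (genetic, physical, and transcript maps), nucleic acid sequencing, gene localization, and gene function analysis applied to all genes.
Genomics includes structural genomics, functional genomics, and comparative genomics.
Metagenomics (宏基因组学): an emerging research direction within genomics.
Metagenomics (also called environmental genomics or ecogenomics) is the study of genomic material extracted directly from environmental samples.
Traditional microbiology depends on laboratory culture; metagenomics fills the gap left by microorganisms that cannot be cultured in a conventional laboratory.
Proteomics (蛋白质组学): the discipline that elucidates the expression patterns and functional patterns of all proteins expressed in cells from an organism's genome.
It includes identifying protein expression, forms of existence (modifications), structure, function, and interactions.
Genetic map (遗传图谱): a linear arrangement of genes obtained through genetic recombination.
Physical map (物理图谱): a map made by cutting chromosomes into fragments with restriction endonucleases, joining the fragments back into the chromosome according to their overlapping sequences, and determining the physical distance between genetic markers.
Transcript map (转录图谱): a molecular genetic map built using ESTs as markers.
Gene library (基因文库): using recombinant DNA technology, all fragments of the total DNA or chromosomal DNA of an organism's cells are ligated at random into a vector and transferred into suitable host cells, where cell proliferation produces a clone of each fragment. When enough clones have been prepared to contain all of the organism's genes, the whole set of clones is called the gene library of that organism.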
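Basic Vocabulary of Bioinformatics
A (Adenine) 腺嘌呤: one of the two purines that serve as bases.
active site (活化位点): the region on the three-dimensional surface of a protein where catalysis takes place.
alignment (比对): the pairing of two homologous nucleic acid or protein sequences in order to determine their accumulated differences.
alignment of alignments (比对的比对): an alignment whose objects are not single sequences but sequence alignments.
alleles (等位基因): different versions of a gene.
alpha carbon (α碳): the central carbon atom of an amino acid, to which the side chain (R group) is attached.
alternative splicing (可变剪接): the process of generating two or more mRNA molecules from a single hnRNA.
amino terminus (N-terminal) (氨基端, N端): the end of a polypeptide that has a free amino group, corresponding to the 5' end of the gene.
anti-parallel (反向平行): running in opposite directions; in double-stranded DNA this means that if one strand runs 5' to 3', its complementary strand runs 3' to 5'.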
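Because the two strands are anti-parallel, the sequence of the complementary strand, read in its own 5' to 3' direction, is the reverse complement of the first strand. A minimal sketch, assuming plain A/C/G/T input with no ambiguity codes:

```python
# Translation table for Watson-Crick complements (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the complementary strand written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```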
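B
base pair (碱基对): (1) the interaction between a purine and a pyrimidine in double-stranded DNA (specifically between A and T, and between G and C); (2) the basic unit of length of a double-stranded DNA sequence.
beta turns (β转角): U-shaped structures formed inside a protein where a β strand reverses direction in an anti-parallel β sheet.
Bioinformatics (生物信息学): the application of the theory, methods, and technology of information science to manage, analyze, and use biomolecular data.
Biocomputing (生物计算): in this book, refers specifically to the analysis and processing of biomolecular data with computer technology.
Basic Local Alignment Search Tool (BLAST) (基本的局部比对搜索工具): a widely used tool for searching sequence databases.
blotting and hybridization (印迹和杂交): the process of transferring molecules (usually nucleic acids) from a gel to a membrane and then probing the membrane with a labeled probe that binds the specific molecule of interest.
bootstrap test (自举检验): a test that quantifies the degree of confidence.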
Major English Terms in Bioinformatics and Their Definitions
Abstract Syntax Notation (ASN.1) (NCBI发展的许多程序,如显示蛋白质三维结构的Cn3D等所使用的内部格式): A language that is used to describe structured data types formally. Within bioinformatics, it has been used by the National Center for Biotechnology Information to encode sequences, maps, taxonomic information, molecular structures, and bibliographic information in such a way that it can be easily accessed and exchanged by computer software.
Accession number (记录号): A unique identifier that is assigned to a single database entry for a DNA or protein sequence.
Affine gap penalty (一种设置空位罚分策略): A gap penalty score that is a linear function of gap length, consisting of a gap opening penalty and a gap extension penalty multiplied by the length of the gap. Using this penalty scheme greatly enhances the performance of dynamic programming methods for sequence alignment. See also Gap penalty.
Algorithm (算法): A systematic procedure for solving a problem in a finite number of steps, typically involving a repetition of operations. Once specified, an algorithm can be written in a computer language and run as a program.
Alignment (联配/比对): Refers to the procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Of the two types of alignment, local and global, a local alignment is generally the most useful. See also Local and Global alignments.
Alignment score (联配/比对值): An algorithmically computed score based on the number of matches, substitutions, insertions, and deletions (gaps) within an alignment. Scores for matches and substitutions are derived from a scoring matrix such as the BLOSUM and PAM matrices for proteins, and affine gap penalties suitable for the matrix are chosen. Alignment scores are in log odds units, often bit units (log to the base 2). Higher scores denote better alignments.
See also Similarity score, Distance in sequence analysis.
Alphabet (字母表): The total number of symbols in a sequence: 4 for DNA sequences and 20 for protein sequences.
Annotation (注释): The prediction of genes in a genome, including the location of protein-encoding genes, the sequence of the encoded proteins, any significant matches to other proteins of known function, and the location of RNA-encoding genes. Predictions are based on gene models; e.g., hidden Markov models of introns and exons in protein-encoding genes, and models of secondary structure in RNA.
Anonymous FTP (匿名FTP): When an FTP service allows anyone to log in, it is said to provide anonymous FTP service. A user can log in to an anonymous FTP server by typing anonymous as the user name and an e-mail address as the password. Most Web browsers now negotiate anonymous FTP logon without asking the user for a user name and password. See also FTP.
ASCII: The American Standard Code for Information Interchange (ASCII) encodes unaccented letters a-z, A-Z, the numbers 0-9, most punctuation marks, space, and a set of control characters such as carriage return and tab. ASCII specifies 128 characters that are mapped to the values 0-127. ASCII files are commonly called plain text, meaning that they only encode text without extra markup.
BAC clone (细菌人工染色体克隆): Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100-200 kb. Most of the large-insert clones sequenced in the project were BAC clones.
Back-propagation (反向传输): When training feed-forward neural networks, a back-propagation algorithm can be used to modify the network weights. After each training input pattern is fed through the network, the network's output is compared with the desired output and the amount of error is calculated. This error is back-propagated through the network by using an error function to correct the network weights.
See also Feed-forward neural network.
Baum-Welch algorithm (Baum-Welch算法): An expectation maximization algorithm that is used to train hidden Markov models.
Bayes' rule (贝叶斯法则): Forms the basis of conditional probability by calculating the likelihood of an event occurring based on the history of the event and relevant background information. In terms of two parameters A and B, the theorem is stated in an equation: the conditional probability of A, given B, P(A|B), is equal to the probability of A, P(A), times the conditional probability of B, given A, P(B|A), divided by the probability of B, P(B); that is, P(A|B) = P(A) P(B|A) / P(B). P(A) is the historical or prior distribution value of A, P(B|A) is a new prediction for B for a particular value of A, and P(B) is the sum of the newly predicted values for B. P(A|B) is a posterior probability, representing a new prediction for A given the prior knowledge of A and the newly discovered relationships between A and B.
Bayesian analysis (贝叶斯分析): A statistical procedure used to estimate parameters of an underlying distribution based on an observed distribution. See also Bayes' rule.
Biochips (生物芯片): Miniaturized arrays of large numbers of molecular substrates, often oligonucleotides, in a defined pattern. They are also called DNA microarrays and microchips.
Bioinformatics (生物信息学): The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology. Also, the discipline of obtaining information about genomic or protein sequence data. This may involve similarity searches of databases, comparing your unidentified sequence to the sequences in a database, or making predictions about the sequence based on current knowledge of similar sequences. Databases are frequently made publicly available through the Internet, or locally at your institution.
Bit score (二进制值/Bit值): The value S' derived from the raw alignment score S by taking into account the statistical properties of the scoring system used.
Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
Bit units: From information theory, a bit denotes the amount of information required to distinguish between two equally likely possibilities. The number of bits of information, N, required to convey a message that has M possibilities is log2 M = N bits.
BLAST (基本局部联配搜索工具,一种主要数据库搜索程序): Basic Local Alignment Search Tool. A set of programs used to perform fast similarity searches. Nucleotide sequences can be compared with nucleotide sequences in a database using BLASTN, for example. Complex statistics are applied to judge the significance of each match. Reported sequences may be homologous to, or related to, the query sequence. The BLASTP program is used to search a protein database for a match against a query protein sequence. There are several other flavours of BLAST. BLAST2 is a newer release of BLAST that allows for insertions or deletions in the sequences being aligned. Gapped alignments may be more biologically significant.
Block (蛋白质家族中保守区域的组块): Conserved ungapped patterns approximately 3-60 amino acids in length in a set of related proteins.
BLOSUM matrices (模块替换矩阵,一种主要替换矩阵): An alternative to PAM tables, BLOSUM tables were derived using local multiple alignments of more distantly related sequences than were used for the PAM matrix. These are used to assess the similarity of sequences when performing alignments.
Boltzmann distribution (Boltzmann分布): Describes the number of molecules that have energies above a certain level, based on the Boltzmann gas constant and the absolute temperature.
Boltzmann probability function (Boltzmann概率函数): See Boltzmann distribution.
Bootstrap analysis: A method for testing how well a particular data set fits a model. For example, the validity of the branch arrangement in a predicted phylogenetic tree can be tested by resampling columns in a multiple sequence alignment to create many new alignments.
The appearance of a particular branch in trees generated from these resampled sequences can then be measured. Alternatively, a sequence may be left out of an analysis to determine how much the sequence influences the results of an analysis.
Branch length (分支长度): In sequence analysis, the number of sequence changes along a particular branch of a phylogenetic tree.
CDS or cds (编码序列): Coding sequence.
Chebyshev's inequality: The probability that a random variable deviates from its mean by at least a given number of standard deviations is less than or equal to the square of 1 over that number of standard deviations.
Clone (克隆): Population of identical cells or molecules (e.g. DNA), derived from a single ancestor.
Cloning Vector (克隆载体): A molecule that carries a foreign gene into a host, and allows or facilitates the multiplication of that gene in a host. When sequencing a gene that has been cloned using a cloning vector (rather than by PCR), care should be taken not to include the cloning vector sequence when performing similarity searches. Plasmids, cosmids, phagemids, YACs and PACs are example types of cloning vectors.
Cluster analysis (聚类分析): A method for grouping together a set of objects that are most similar from a larger group of related objects. The relationships are based on some criterion of similarity or difference. For sequences, a similarity or distance score or a statistical evaluation of those scores is used.
Cobbler: A single sequence that represents the most conserved regions in a multiple sequence alignment. The BLOCKS server uses the cobbler sequence to perform a database similarity search as a way to reach sequences that are more divergent than would be found using the single sequences in the alignment for searches.
Coding system (neural networks): Regarding neural networks, a coding system needs to be designed for representing input and output. The level of success found when training the model will be partially dependent on the quality of the coding system chosen.
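The column-resampling step mentioned under Bootstrap analysis can be sketched as follows. Only the resampling is shown; building a tree from each replicate and counting how often a branch recurs is left out. The function name and the toy alignment are illustrative assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; rows keep their order."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[c] for c in columns) for row in alignment]


alignment = ["ACGT", "ACGA", "TCGA"]  # toy multiple alignment
rng = random.Random(0)                 # seeded only to make runs repeatable
replicate = bootstrap_alignment(alignment, rng)
print(replicate)
```

Each replicate has the same dimensions as the original alignment, and every column in it is a copy of some original column.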
Codon usage: Analysis of the codons used in a particular gene or organism.
COG (直系同源簇): Clusters of orthologous groups, a set of groups of related sequences in microorganisms and yeast (S. cerevisiae). These groups are found by whole proteome comparisons and include orthologs and paralogs. See also Orthologs and Paralogs.
Comparative genomics (比较基因组学): A comparison of gene numbers, gene locations, and biological functions of genes in the genomes of diverse organisms, one objective being to identify groups of genes that play a unique biological role in a particular organism.
Complexity (of an algorithm) (算法的复杂性): Describes the number of steps required by the algorithm to solve a problem as a function of the amount of data; for example, the length of sequences to be aligned.
Conditional probability (条件概率): The probability of a particular result (or of a particular value of a variable) given one or more events or conditions (or values of other variables).
Conservation (保守): Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
Consensus (一致序列): A single sequence that represents, at each successive position, the variation found within corresponding columns of a multiple sequence alignment.
Context-free grammars: A recursive set of production rules for generating patterns of strings. These consist of a set of terminal characters that are used to create strings, a set of nonterminal symbols that correspond to rules and act as placeholders for patterns that can be generated using terminal characters, a set of rules for replacing nonterminal symbols with terminal characters, and a start symbol.
Contig (序列重叠群/拼接序列): A set of clones that can be assembled into a linear order. A DNA sequence that overlaps with another contig. The full set of overlapping sequences (contigs) can be put together to obtain the sequence for a long region of DNA that cannot be sequenced in one run in a sequencing assay.
Important in genetic mapping at the molecular level.
CORBA (国际对象管理协作组制定的使OOP对象与网络接口统一起来的一套跨计算机、操作系统、程序语言和网络的共同标准): The Common Object Request Broker Architecture (CORBA) is an open industry standard for working with distributed objects, developed by the Object Management Group. CORBA allows the interconnection of objects and applications regardless of computer language, machine architecture, or geographic location of the computers.
Correlation coefficient (相关系数): A numerical measure, falling between -1 and 1, of the degree of the linear relationship between two variables. A positive value indicates a direct relationship, a negative value indicates an inverse relationship, and the distance of the value away from zero indicates the strength of the relationship. A value near zero indicates no linear relationship between the variables.
Covariation (in sequences) (共变): Coincident change at two or more sequence positions in related sequences that may influence the secondary structures of RNA or protein molecules.
Coverage (or depth) (覆盖率/厚度): The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a 'high-quality base' is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).
Database (数据库): A computerized storehouse of data that provides a standardized way for locating, adding, removing, and changing data. See also Object-oriented database, Relational database.
Dendrogram: A form of a tree that lists the compared objects (e.g., sequences or genes in a microarray analysis) in a vertical order and joins related ones by levels of branches extending to one side of the list.
Depth (厚度): See Coverage.
Dirichlet mixtures: Defined as the conjugate prior of a multinomial distribution.
One use is for predicting the expected pattern of amino acid variation found in the match state of a hidden Markov model (representing one column of a multiple sequence alignment of proteins), based on prior distributions found in conserved protein domains (blocks).
Distance in sequence analysis (序列距离): The number of observed changes in an optimal alignment of two sequences, usually not counting gaps.
DNA Sequencing (DNA测序): The experimental process of determining the nucleotide sequence of a region of DNA. This is done by labelling each nucleotide (A, C, G or T) with either a radioactive or fluorescent marker which identifies it. There are several methods of applying this technology, each with its advantages and disadvantages. For more information, refer to a current textbook. High-throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes, the sequences may be generated more quickly than they can be characterised.
Domain (功能域): A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.
Dot matrix (点标矩阵图): Dot matrix diagrams provide a graphical method for comparing two sequences. One sequence is written horizontally across the top of the graph and the other along the left-hand side. Dots are placed within the graph at the intersection of the same letter appearing in both sequences. A series of diagonal lines in the graph indicates regions of alignment.
The matrix may be filtered to reveal the most-alike regions by scoring a minimal threshold number of matches within a sequence window.
Draft genome sequence (基因组序列草图): The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.
DUST (一种低复杂性区段过滤程序): A program for filtering low complexity regions from nucleic acid sequences.
Dynamic programming (动态规划法): A dynamic programming algorithm solves a problem by combining solutions to sub-problems that are computed once and saved in a table or matrix. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found. This algorithm is used for producing sequence alignments, given a scoring system for sequence comparisons.
EMBL (欧洲分子生物学实验室,EMBL数据库是主要公共核酸序列数据库之一): European Molecular Biology Laboratory. Maintains the EMBL database, one of the major public sequence databases.
EMBnet (欧洲分子生物学网络): European Molecular Biology Network. It was established in 1988, and provides services including local molecular databases and software for molecular biologists in Europe. There are several large outposts of EMBnet, including ExPASy.
Entropy (熵): From information theory, a measure of the unpredictable nature of a set of possible elements. The higher the level of variation within the set, the higher the entropy.
Erdos and Renyi law: In a toss of a "fair" coin, the number of heads in a row that can be expected is the logarithm of the number of tosses to the base 2. The law may be generalized for more than two possible outcomes by changing the base of the logarithm to the number of outcomes.
This law was used to analyze the number of matches and mismatches that can be expected between random sequences as a basis for scoring the statistical significance of a sequence alignment.
EST (表达序列标签的缩写): See Expressed Sequence Tag.
Expect value (E) (E值): E value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. In a database similarity search, it is the number of times that an alignment score as good as the one found between a query sequence and a database sequence would be expected in as many comparisons between random sequences as were done to find the matching sequence. In other types of sequence analysis, E has a similar meaning.
Expectation maximization (sequence analysis): An algorithm for locating similar sequence patterns in a set of sequences. A guessed alignment of the sequences is first used to generate an expected scoring matrix representing the distribution of sequence characters in each column of the alignment; this pattern is matched to each sequence, and the scoring matrix values are then updated to maximize the alignment of the matrix to the sequences. The procedure is repeated until there is no further improvement.
Exon (外显子)
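To make the relation between raw scores, bit scores, and E values concrete, the sketch below uses the standard Karlin-Altschul form: S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'), where m and n are the query and database lengths. The function names are illustrative, and the statistical parameters λ and K depend on the scoring system, so any numeric values passed in are assumptions rather than values taken from this glossary.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S to a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`."""
    return query_len * db_len * 2.0 ** (-bits)


# Each additional bit halves the expected number of chance hits.
print(expect_value(10, 1024, 1))
print(expect_value(11, 1024, 1))
```

Because the bit score already absorbs λ and K, the E value depends only on the bit score and the size of the search space, which is why bit scores from different searches can be compared.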
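GO BP: Explanation of the Bioinformatics Term
Bioinformatics is an interdisciplinary field that combines biology, computer science, and statistics, and uses computing technology and mathematical methods to solve problems in biology.
"GO" stands for Gene Ontology, a standardized system for describing the functions of genes and proteins.
"BP" stands for Biological Process, one of the main categories of the Gene Ontology, which describes the biological processes and biochemical reactions in which genes and proteins take part.
In bioinformatics, researchers use GO BP terms to classify and annotate the biological functions of genes and proteins, and so gain a better understanding of gene regulation and metabolic processes in biological systems.
By interpreting and applying GO BP terms, researchers can better understand what genes and proteins do within cells and organisms, which provides important information and guidance for disease research, drug development, genetic engineering, and related fields.
Interpreting GO BP terms touches on gene function, biological processes, cell signaling, and other areas, and has to be done in the context of the specific research question and method.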
生物信息学主要英文术语及释义附录:生物信息学主要英文术语及释义Abstract Syntax Notation (ASN.l)(NCBI发展的许多程序,如显示蛋白质三维结构的Cn3D等所使用的内部格式)A language that is used to describe structured data types formally, Within bioinformatits,it has been used by the National Center for Biotechnology Information to encode sequences, maps, taxonomic information, molecular structures, and biographical information in such a way that it can be easily accessed and exchanged by computer software.Accession number(记录号)A unique identifier that is assigned to a single database entry for a DNA or protein sequence.Affine gap penalty(一种设置空位罚分策略)A gap penalty score that is a linear function of gap length, consisting of a gap opening penalty and a gap extension penalty multiplied by the length of the gap. Using this penalty scheme greatly enhances the performance of dynamic programming methods for sequence alignment. See also Gap penalty. Algorithm(算法)A systematic procedure for solving a problem in a finite number of steps, typically involving a repetition of operations. Once specified, an algorithm can be written in a computer language and run as a program.Alignment(联配/比对/联配)Refers to the procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Of the two types of alignment, local and global, a local alignmentis generally the most useful. See also Local and Global alignments. Alignment score(联配/比对/联配值)An algorithmically computed score based on the number of matches, substitutions, insertions, and deletions (gaps) within an alignment. Scores for matches and substitutions Are derived from a scoring matrix such as the BLOSUM and PAM matrices for proteins, and aftine gap penalties suitable for the matrix are chosen. Alignment scores are in log odds units, often bit units (log to the base 2). Higher scores denote better alignments. 
See also Similarity score, Distance in sequence analysis.Alphabet(字母表)The total number of symbols in a sequence-4 for DNA sequences and 20 for protein sequences.Annotation(注释)The prediction of genes in a genome, including the location of protein-encoding genes, the sequence of the encoded proteins, any significantmatches to other Proteins of known function, and the location of RNA-encoding genes. Predictions are based on gene models; e.g., hidden Markov models of introns and exons in proteins encoding genes, and models of secondary structure in RNA.Anonymous FTP(匿名FTP)When a FTP service allows anyone to log in, it is said to provide anonymous FTP ser-vice. A user can log in to an anonymous FTP server by typing anonymous as the user name and his E-mail address as a password. Most Web browsers now negotiate anonymous FTP logon without asking the user for a user name and password. See also FTP.ASCIIThe American Standard Code for Information Interchange (ASCII) encodes unaccented letters a-z, A-Z, the numbers O-9, most punctuation marks, space, and a set of control characters such as carriage return and tab. ASCII specifies 128 characters that are mapped to the values O-127. ASCII tiles are commonly called plain text, meaning that they only encode text without extra markup.BAC clone(细菌人工染色体克隆)Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100–200 kb. Most of the large-insert clones sequenced in the project were BAC clones.Back-propagation(反向传输)When training feed-forward neural networks, a back-propagation algorithm can be used to modify the network weights. After each training input pattern is fed through the network, the network’s output is compared with the desired output and the amount of error is calculated. This error is back-propagated through the network by using an error function to correct the network weights. 
See also Feed-forward neural network.Baum-Welch algorithm(Baum-Welch算法)An expectation maximization algorithm that is used to train hidden Markov models.Baye’s rule(贝叶斯法则)Forms the basis of conditional probability by calculating the likelihood of an event occurring based on the history of the event and relevant background information. In terms of two parameters A and B, the theorem is stated in an equation: The condition-al probability of A, given B, P(AIB), is equal to the probability of A, P(A), times the conditional probability of B, given A, P(BIA),divided by the probability of B, P(B). P(A) is the historical or prior distribution value of A, P(BIA) is a new prediction for B for a particular value of A, and P(B) is the sum of the newly predicted values for B. P(AIB) is a posterior probability, representing a new prediction for A given the prior knowledge of A and the newly discovered relationships between A and B.Bayesian analysis(贝叶斯分析)A statistical procedure used to estimate parameters of an underlyingdistribution based on an observed distribution. See also Baye’s rule.Biochips(生物芯片)Miniaturized arrays of large numbers of molecular substrates, often oligonucleotides, in a defined pattern. They are also called DNA microarrays and microchips.Bioinformatics (生物信息学)The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology. /The discipline of obtaining information about genomic or protein sequence data. This may involve similarity searches of databases, comparing your unidentified sequence to the sequences in a database, or making predictions about the sequence based on current knowledge of similar sequences. Databases are frequently made publically available through the Internet, or locally at your institution.Bit score (二进制值/ Bit值)The value S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account. 
Because bit scores have been normalized with respect to the scoring system, they can be usedto compare alignment scores from different searches.Bit unitsFrom information theory, a bit denotes the amount of information required to distinguish between two equally likely possibilities. The number of bits of information, AJ, required to convey a message that has A4 possibilities is log2 M = N bits.BLAST (基本局部联配搜索工具,一种主要数据库搜索程序)Basic Local Alignment Search T ool. A set of programs, used to perform fast similarity searches. Nucleotide sequences can be compared with nucleotide sequences in a database using BLASTN, for example. Complex statistics are applied to judge the significance of each match. Reported sequences may be homologous to, or related to the query sequence. The BLASTP program is used to search a protein database for a match against a query protein sequence. There are several other flavours of BLAST. BLAST2 is a newer release of BLAST. Allows for insertions or deletions in the sequences being aligned. Gapped alignments may be more biologically significant.Block(蛋白质家族中保守区域的组块)Conserved ungapped patterns approximately 3-60 amino acids in length in a set of related proteins.BLOSUM matrices(模块替换矩阵,一种主要替换矩阵)An alternative to PAM tables, BLOSUM tables were derived using local multiple alignments of more distantly related sequences than were used for the PAM matrix. These are used to assess the similarity of sequences when performing alignments.Boltzmann distribution(Boltzmann 分布)Describes the number of molecules that have energies above a certain level, based on the Boltzmann gas constant and the absolute temperature.Boltzmann probability function(Boltzmann概率函数)See Boltzmann distribution.Bootstrap analysisA method for testing how well a particular data set fits a model. For example, the validity of the branch arrangement in a predicted phylogenetic tree can be tested by resampling columns in a multiple sequence alignment to create many new alignments. 
The appearance of a particular branch in trees generated from these resampled sequences can then be measured. Alternatively, a sequence may be left out of an analysis to deter-mine how much the sequence influences the results of an analysis.Branch length(分支长度)In sequence analysis, the number of sequence changes along a particular branch of a phylogenetic tree.CDS or cds (编码序列)Coding sequence.Chebyshe, d inequalityThe probability that a random variable exceeds its mean is less than or equal to the square of 1 over the number of standard deviations from the mean. Clone (克隆)Population of identical cells or molecules (e.g. DNA), derived from a single ancestor.Cloning Vector (克隆载体)A molecule that carries a foreign gene into a host, and allows/facilitates the multiplication of that gene in a host. When sequencing a gene that has been cloned using a cloning vector (rather than by PCR), care should be taken not to include the cloning vector sequence when performing similarity searches. Plasmids, cosmids, phagemids, YACs and PACs are example typesof cloning vectors.Cluster analysis(聚类分析)A method for grouping together a set of objects that are most similar from a larger group of related objects. The relationships are based on some criterion of similarity or difference. For sequences, a similarity or distance score or a statistical evaluation of those scores is used.CobblerA single sequence that represents the most conserved regions in a multiple sequence alignment. The BLOCKS server uses the cobbler sequence to perform a database similarity search as a way to reach sequences that are more divergent than would be found using the single sequences in the alignment for searches.Coding system (neural networks)Regarding neural networks, a coding system needs to be designed for representing input and output. The level of success found when training the model will be partially dependent on the quality of the coding system chosen. 
Codon usage: Analysis of the codons used in a particular gene or organism.
COG (直系同源簇): Clusters of orthologous groups, a set of groups of related sequences in microorganisms and yeast (S. cerevisiae). These groups are found by whole proteome comparisons and include orthologs and paralogs. See also Orthologs and Paralogs.
Comparative genomics (比较基因组学): A comparison of gene numbers, gene locations, and biological functions of genes in the genomes of diverse organisms, one objective being to identify groups of genes that play a unique biological role in a particular organism.
Complexity (of an algorithm) (算法的复杂性): Describes the number of steps required by the algorithm to solve a problem as a function of the amount of data; for example, the length of sequences to be aligned.
Conditional probability (条件概率): The probability of a particular result (or of a particular value of a variable) given one or more events or conditions (or values of other variables).
Conservation (保守): Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
Consensus (一致序列): A single sequence that represents, at each subsequent position, the variation found within corresponding columns of a multiple sequence alignment.
Context-free grammars: A recursive set of production rules for generating patterns of strings. These consist of a set of terminal characters that are used to create strings, a set of nonterminal symbols that correspond to rules and act as placeholders for patterns that can be generated using terminal characters, a set of rules for replacing nonterminal symbols with terminal characters, and a start symbol.
Contig (序列重叠群/拼接序列): A set of clones that can be assembled into a linear order. A DNA sequence that overlaps with another contig. The full set of overlapping sequences (contigs) can be put together to obtain the sequence for a long region of DNA that cannot be sequenced in one run in a sequencing assay.
Important in genetic mapping at the molecular level.
CORBA (国际对象管理协作组制定的使OOP对象与网络接口统一起来的一套跨计算机、操作系统、程序语言和网络的共同标准): The Common Object Request Broker Architecture (CORBA) is an open industry standard for working with distributed objects, developed by the Object Management Group. CORBA allows the interconnection of objects and applications regardless of computer language, machine architecture, or geographic location of the computers.
Correlation coefficient (相关系数): A numerical measure, falling between -1 and 1, of the degree of the linear relationship between two variables. A positive value indicates a direct relationship, a negative value indicates an inverse relationship, and the distance of the value away from zero indicates the strength of the relationship. A value near zero indicates no relationship between the variables.
Covariation (in sequences) (共变): Coincident change at two or more sequence positions in related sequences that may influence the secondary structures of RNA or protein molecules.
Coverage (or depth) (覆盖率/厚度): The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a 'high-quality base' is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).
Database (数据库): A computerized storehouse of data that provides a standardized way for locating, adding, removing, and changing data. See also Object-oriented database, Relational database.
Dendrogram: A form of a tree that lists the compared objects (e.g., sequences or genes in a microarray analysis) in a vertical order and joins related ones by levels of branches extending to one side of the list.
Depth (厚度): See Coverage.
Dirichlet mixtures: Defined as the conjugate prior of a multinomial distribution.
One use is for predicting the expected pattern of amino acid variation found in the match state of a hidden Markov model (representing one column of a multiple sequence alignment of proteins), based on prior distributions found in conserved protein domains (blocks).
Distance in sequence analysis (序列距离): The number of observed changes in an optimal alignment of two sequences, usually not counting gaps.
DNA Sequencing (DNA测序): The experimental process of determining the nucleotide sequence of a region of DNA. This is done by labelling each nucleotide (A, C, G or T) with either a radioactive or fluorescent marker which identifies it. There are several methods of applying this technology, each with its advantages and disadvantages. For more information, refer to a current textbook. High-throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes, the sequences may be generated more quickly than they can be characterised.
Domain (功能域): A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.
Dot matrix (点标矩阵图): Dot matrix diagrams provide a graphical method for comparing two sequences. One sequence is written horizontally across the top of the graph and the other along the left-hand side. Dots are placed within the graph at the intersection of the same letter appearing in both sequences. A series of diagonal lines in the graph indicates regions of alignment.
The matrix may be filtered to reveal the most similar regions by scoring a minimal threshold number of matches within a sequence window.
Draft genome sequence (基因组序列草图): The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.
DUST (一种低复杂性区段过滤程序): A program for filtering low complexity regions from nucleic acid sequences.
Dynamic programming (动态规划法): A dynamic programming algorithm solves a problem by combining solutions to sub-problems that are computed once and saved in a table or matrix. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found. This algorithm is used for producing sequence alignments, given a scoring system for sequence comparisons.
EMBL (欧洲分子生物学实验室,EMBL数据库是主要公共核酸序列数据库之一): European Molecular Biology Laboratory. Maintains the EMBL database, one of the major public sequence databases.
EMBnet (欧洲分子生物学网络): European Molecular Biology Network. It was established in 1988 and provides services including local molecular databases and software for molecular biologists in Europe. There are several large outposts of EMBnet, including EXPASY.
Entropy (熵): From information theory, a measure of the unpredictable nature of a set of possible elements. The higher the level of variation within the set, the higher the entropy.
Erdos and Renyi law: In a toss of a "fair" coin, the number of heads in a row that can be expected is the logarithm of the number of tosses to the base 2. The law may be generalized for more than two possible outcomes by changing the base of the logarithm to the number of outcomes.
This law was used to analyze the number of matches and mismatches that can be expected between random sequences as a basis for scoring the statistical significance of a sequence alignment.
EST (表达序列标签的缩写): See Expressed Sequence Tag.
Expect value (E) (E值): E value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. In a database similarity search, the probability that an alignment score as good as the one found between a query sequence and a database sequence would be found in as many comparisons between random sequences as was done to find the matching sequence. In other types of sequence analysis, E has a similar meaning.
Expectation maximization (sequence analysis): An algorithm for locating similar sequence patterns in a set of sequences. A guessed alignment of the sequences is first used to generate an expected scoring matrix representing the distribution of sequence characters in each column of the alignment. This pattern is matched to each sequence, and the scoring matrix values are then updated to maximize the alignment of the matrix to the sequences. The procedure is repeated until there is no further improvement.
Exon (外显子): Coding region of DNA. See CDS.
Expressed Sequence Tag (EST) (表达序列标签): Randomly selected, partial cDNA sequence; represents its corresponding mRNA. dbEST is a large database of ESTs at GenBank, NCBI.
FASTA (一种主要数据库搜索程序): The first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt".
The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable, which specifies the size of a "word". (Pearson and Lipman)
Extreme value distribution (极值分布): Some measurements are found to follow a distribution that has a long tail which decays at high values much more slowly than that found in a normal distribution. This slow-falling type is called the extreme value distribution. The alignment scores between unrelated or random sequences are an example. These scores can reach very high values, particularly when a large number of comparisons are made, as in a database similarity search. The probability of a particular score may be accurately predicted by the extreme value distribution, which follows a double negative exponential function after Gumbel.
False negative (假阴性): A negative data point collected in a data set that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have been recorded as positive.
False positive (假阳性): A positive data point collected in a data set that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have been recorded as negative.
Feed-forward neural network (前馈神经网络): Organizes nodes into sequential layers in which the nodes in each layer are fully connected with the nodes in the next layer, except for the final output layer. Input is fed from the input layer through the layers in sequence in a "feed-forward" direction, resulting in output at the final layer. See also Neural network.
Filtering (window size): During pair-wise sequence alignment using the dot matrix method, random matches can be filtered out by using a sliding window to compare the two sequences. Rather than comparing a single sequence position at a time, a window of adjacent positions in the two sequences is compared and a dot, indicating a match, is generated only if a certain minimal number of matches occur.
Filtering (过滤): Also known as Masking.
The process of hiding regions of (nucleic acid or amino acid) sequence having characteristics that frequently lead to spurious high scores. See SEG and DUST.
Finished sequence (完成序列): Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.
Fourier analysis: Studies the approximations and decomposition of functions using trigonometric polynomials.
Format (file) (格式): Different programs require that information be specified to them in a formal manner, using particular keywords and ordering. This specification is a file format.
Forward-backward algorithm: Used to train a hidden Markov model by aligning the model with training sequences. The algorithm then refines the model to reduce the error when fitted to the given data using a gradient descent approach.
FTP (File Transfer Protocol) (文件传输协议): Allows a person to transfer files from one computer to another across a network using an FTP-capable client program. The FTP client program can only communicate with machines that run an FTP server. The server, in turn, will make a specific portion of its file system available for FTP access, provided that the client is able to supply a recognized user name and password to the server.
Full shotgun clone (鸟枪法克隆): A large-insert clone for which full shotgun sequence has been produced.
Functional genomics (功能基因组学): Assessment of the function of genes identified by between-genome comparisons. The function of a newly identified gene is tested by introducing mutations into the gene and then examining the resultant mutant organism for an altered phenotype.
Gap (空位/间隙/缺口): A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score.
Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.
Gap penalty (空位罚分): A numeric score used in sequence alignment programs to penalize the presence of gaps within an alignment. The value of a gap penalty affects how often gaps appear in alignments produced by the algorithm. Most alignment programs suggest gap penalties that are appropriate for particular scoring matrices.
Genetic algorithm (遗传算法): A kind of search algorithm that was inspired by the principles of evolution. A population of initial solutions is encoded and the algorithm searches through these by applying a pre-defined fitness measurement to each solution, selecting those with the highest fitness for reproduction. New solutions can be generated during this phase by crossover and mutation operations, defined in the encoded solutions.
Genetic map (遗传图谱): A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis. The unit of distance is centimorgans (cM), denoting a 1% chance of recombination.
Genome (基因组): The genetic material of an organism, contained in one haploid set of chromosomes.
Gibbs sampling method: An algorithm for finding conserved patterns within a set of related sequences. A guessed alignment of all but one sequence is made and used to generate a scoring matrix that represents the alignment. The matrix is then matched to the left-out sequence, and a probable location of the corresponding pattern is found. This prediction is then input into a new alignment and another scoring matrix is produced and tested on a new left-out sequence.
The process is repeated until there is no further improvement in the matrix.
Global alignment (整体联配): Attempts to match as many characters as possible, from end to end, in a set of two or more sequences.
Gopher (一个文档发布系统,允许检索和显示文本文件)
Graph theory (图论): A branch of mathematics which deals with problems that involve a graph or network structure. A graph is defined by a set of nodes (or points) and a set of arcs (lines or edges) joining the nodes. In sequence and genome analysis, graph theory is used for sequence alignments and for clustering similar genes.
GSS (基因综述序列): Genome survey sequence.
GUI (图形用户界面): Graphical user interface.
H (相对熵值): H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990). H can be thought of as a measure of the average information (in bits) available per position that distinguishes an alignment from chance. At high values of H, short alignments can be distinguished from chance, whereas at lower H values, a longer alignment may be necessary (Altschul, 1991).
Half-bits: Some scoring matrices are in half-bit units. These units are logarithms to the base 2 of odds scores times 2.
Heuristic (启发式方法): A procedure that progresses along empirical lines by using rules of thumb to reach a solution. The solution is not guaranteed to be optimal.
Hexadecimal system (十六进制系统): The base 16 counting system that uses the digits 0-9 followed by the letters A-F.
HGMP (人类基因组图谱计划): Human Genome Mapping Project.
Hidden Markov Model (HMM) (隐马尔可夫模型): In sequence analysis, an HMM is usually a probabilistic model of a multiple sequence alignment, but can also be a model of periodic patterns in a single sequence, representing, for example, patterns found in the exons of a gene. In a model of multiple sequence alignments, each column of symbols in the alignment is represented by a frequency distribution of the symbols called a state, and insertions and deletions by other states.
One then moves through the model along a particular path from state to state trying to match a given sequence. The next matching symbol is chosen from each state, recording its probability (frequency) and also the probability of going to that particular state from a previous one (the transition probability). State and transition probabilities are then multiplied to obtain a probability of the given sequence. Generally speaking, an HMM is a statistical model for an ordered sequence of symbols, acting as a stochastic state machine that generates a symbol each time a transition is made from one state to the next. Transitions between states are specified by transition probabilities.
Hidden layer (隐藏层): An inner layer within a neural network that receives its input from, and sends its output to, other layers within the network. One function of the hidden layer is to detect covariation within the input data, such as patterns of amino acid covariation that are associated with a particular type of secondary structure in proteins.
Hierarchical clustering (分级聚类): The clustering or grouping of objects based on some single criterion of similarity or difference. An example is the clustering of genes in a microarray experiment based on the correlation between their expression patterns. The distance method used in phylogenetic analysis is another example.
Hill climbing: A nonoptimal search algorithm that selects the single best possible solution at a given state or step. The solution may result in a locally best solution that is not a globally best solution.
Homology (同源性): A similar component in two organisms (e.g., genes with strongly similar sequences) that can be attributed to a common ancestor of the two organisms during evolution.
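The Hidden Markov Model entry above describes multiplying state (emission) and transition probabilities along one path to obtain the probability of a sequence. A minimal sketch of that product follows. The two-state model, its state names (`A_rich`, `G_rich`) and all probabilities are invented purely for illustration, and `path_probability` is a hypothetical helper name.

```python
def path_probability(sequence, path, start, transitions, emissions):
    """Probability of emitting `sequence` while following `path` through the model."""
    if not sequence or len(sequence) != len(path):
        raise ValueError("sequence and path must be non-empty and of equal length")
    # First symbol: start probability of the first state times its emission probability.
    prob = start[path[0]] * emissions[path[0]][sequence[0]]
    # Every later symbol: transition into the state, then emission from it.
    for i in range(1, len(sequence)):
        previous, state = path[i - 1], path[i]
        prob *= transitions[previous][state] * emissions[state][sequence[i]]
    return prob


# Toy two-state model; every number here is an assumption made for the example.
START = {"A_rich": 0.5, "G_rich": 0.5}
TRANSITIONS = {
    "A_rich": {"A_rich": 0.9, "G_rich": 0.1},
    "G_rich": {"A_rich": 0.2, "G_rich": 0.8},
}
EMISSIONS = {
    "A_rich": {"A": 0.7, "G": 0.3},
    "G_rich": {"A": 0.2, "G": 0.8},
}

if __name__ == "__main__":
    p = path_probability("AAG", ["A_rich", "A_rich", "G_rich"],
                         START, TRANSITIONS, EMISSIONS)
    print(p)  # 0.5*0.7 * 0.9*0.7 * 0.1*0.8
```

For long sequences the product underflows quickly, so practical implementations usually sum logarithms instead of multiplying raw probabilities.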
Major English Bioinformatics Terms and Their Definitions (continued, concluded)
These substitutions may be found in an amino acid substitution matrix such as the Dayhoff PAM and Henikoff BLOSUM matrices. Columns in the alignment that include gaps are not scored in the calculation.
Perceptron (感知器,模拟人类视神经控制系统的图形识别机): A neural network in which input and output states are directly connected without intervening hidden layers.
PHRED (一种广泛应用的原始序列分析程序,可以对序列的各个碱基进行识别和质量评价): A widely used computer program that analyses raw sequence to produce a 'base call' with an associated 'quality score' for each position in the sequence. A PHRED quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a PHRED quality score of 30 corresponds to 99.9% accuracy for the base call in the raw read.
PHRAP (一种广泛应用的原始序列组装程序): A widely used computer program that assembles raw sequence into sequence contigs and assigns to each position in the sequence an associated 'quality score', on the basis of the PHRED scores of the raw sequence reads. A PHRAP quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a PHRAP quality score of 30 corresponds to 99.9% accuracy for a base in the assembled sequence.
Phylogenetic studies (系统发育研究)
PIR (主要蛋白质序列数据库之一,翻译自GenBank): A database of translated GenBank nucleotide sequences. PIR is a redundant (see Redundancy) protein sequence database. The database is divided into four categories: PIR1 - Classified and annotated. PIR2 - Annotated. PIR3 - Unverified. PIR4 - Unencoded or untranslated.
Poisson distribution (帕松分布): Used to predict the occurrence of infrequent events over a long period of time or when there are a large number of trials. In sequence analysis, it is used to calculate the chance that one pair of a large number of pairs of unrelated sequences may give a high local alignment score.
Position-specific scoring matrix (PSSM) (特定位点记分矩阵,PSI-BLAST等搜索程序使用): The PSSM gives the log-odds score for finding a particular matching amino acid in a target sequence.
Represents the variation found in the columns of an alignment of a set of related sequences. Each subsequent matrix column corresponds to the next column in the alignment and each row corresponds to a particular sequence character (one of four bases in DNA sequences or 20 amino acids in protein sequences). Matrix values are log odds scores obtained by taking the counts of the residue in the alignment, dividing by the expected number of counts based on sequence composition, and converting the ratio to a log score. The matrix is moved along sequences to find similar regions by adding the matching log odds scores and looking for high values. There is no allowance for gaps. Also called a weight matrix or scoring matrix.
Posterior (Bayesian analysis): A conditional probability based on prior knowledge and newly evaluated relationships among variables using Bayes rule. See also Bayes rule.
Prior (Bayesian analysis): The expected distribution of a variable based on previous data.
Profile (分布型): A matrix representation of a conserved region in a multiple sequence alignment that allows for gaps in the alignment. The rows include scores for matching sequential columns of the alignment to a test sequence. The columns include substitution scores for amino acids and gap penalties. See also PSSM.
Profile hidden Markov model (分布型隐马尔可夫模型): A hidden Markov model of a conserved region in a multiple sequence alignment that includes gaps and may be used to search new sequences for similarity to the aligned sequences.
Proteome (蛋白质组): The entire collection of proteins that are encoded by the genome of an organism. Initially the proteome is estimated by gene prediction and annotation methods but eventually will be revised as more information on the sequence of the expressed genes is obtained.
Proteomics (蛋白质组学): Systematic analysis of protein expression of normal and diseased tissues that involves the separation, identification and characterization of all of the proteins in an organism.
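The Position-specific scoring matrix entry above (counts divided by expected counts, converted to a log score, then slid along a sequence) can be sketched as follows. This is an illustration only: the function names, the uniform background composition, the base-2 logarithm and the small pseudocount used to avoid taking the log of zero are all assumptions made here.

```python
import math


def build_pssm(aligned, alphabet="ACGT", background=None, pseudocount=1.0):
    """Return {residue: [log2-odds score per column]} for an ungapped alignment."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    if background is None:
        # Assumed uniform composition; real use would take observed frequencies.
        background = {r: 1.0 / len(alphabet) for r in alphabet}
    n = len(aligned)
    pssm = {r: [] for r in alphabet}
    for col in range(length):
        column = [seq[col] for seq in aligned]
        for r in alphabet:
            # Observed frequency, smoothed so that unseen residues are not log(0).
            freq = (column.count(r) + pseudocount * background[r]) / (n + pseudocount)
            pssm[r].append(math.log2(freq / background[r]))
    return pssm


def best_hit(pssm, sequence):
    """Slide the matrix along `sequence` (no gaps); return (best score, start index)."""
    width = len(next(iter(pssm.values())))
    scores = [
        (sum(pssm[r][i] for i, r in enumerate(sequence[start:start + width])), start)
        for start in range(len(sequence) - width + 1)
    ]
    if not scores:
        raise ValueError("sequence is shorter than the matrix")
    return max(scores)


if __name__ == "__main__":
    matrix = build_pssm(["ACG", "ACG", "ACT", "ACG"])
    print(best_hit(matrix, "TTACGTT"))
```

As in the entry, there is no allowance for gaps: each window is scored column by column and the highest-scoring window is reported.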
Pseudocounts: Small number of counts that is added to the columns of a scoring matrix to increase the variability either to avoid zero counts or to add more variation than was found in the sequences used to produce the matrix.
PSI-BLAST (BLAST系列程序之一): Position-Specific Iterative BLAST. An iterative search using the BLAST algorithm. A profile is built after the initial search, which is then used in subsequent searches. The process may be repeated, if desired, with new sequences found in each cycle used to refine the profile. (Altschul et al.)
PSSM (特定位点记分矩阵): See Position-specific scoring matrix and Profile.
Public sequence databases (公共序列数据库,指GenBank、EMBL和DDBJ): The three coordinated international sequence databases: GenBank, the EMBL data library and DDBJ.
Q20 (Quality score 20): A quality score of 20 or higher indicates that there is no more than a 1 in 100 chance that the base call is incorrect. These are consequently high-quality bases. Specifically, the quality value q assigned to a base call is defined as q = -10 x log10(p), where p is the estimated error probability for that base call. Note that high quality values correspond to low error probabilities, and conversely.
Quality trimming: This is an algorithm which uses a sliding window of 50 bases and trims from the 5' end of the read followed by the 3' end. With each window, the number of low quality (10 or less) bases is determined. If more than 5 bases are below the threshold quality, the window is incremented by one base and the process is repeated. When the low quality test fails, the position where it stopped is recorded. The parameters for window length, low quality threshold and number of low quality bases tolerated are fixed. The positions of the 5' and 3' boundaries of the quality region are noted in the plot of quality values presented in the "Chromatogram Details" report.
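The relation q = -10 x log10(p) from the Q20 entry can be checked numerically with a short sketch. The function names below are hypothetical helpers written for this illustration, not part of PHRED or any other program.

```python
import math


def error_probability(q):
    """Estimated probability that a base call with quality value q is wrong."""
    return 10 ** (-q / 10)


def quality_score(p):
    """Quality value for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def count_high_quality(qualities, threshold=20):
    """Number of bases whose quality value meets the threshold (Q20 by default)."""
    return sum(1 for q in qualities if q >= threshold)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, error_probability(q))
    print(count_high_quality([8, 15, 20, 33, 41]))
```

A read's count of Q20 bases, as computed by `count_high_quality`, is the quantity that the Coverage entry's notion of a "high-quality base" relies on.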
Query (待查序列/搜索序列): The input sequence (or other type of search term) with which all of the entries in a database are to be compared.
Radiation hybrid (RH) map (辐射杂交图谱): A genome map in which STSs are positioned relative to one another on the basis of the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analysing a panel of human-hamster hybrid cell lines, each produced by lethally irradiating human cells and fusing them with recipient hamster cells such that each carries a collection of human chromosomal fragments. The unit of distance is centirays (cR), denoting a 1% chance of a break occurring between two loci.
Raw Score (初值,指最初得到的联配值S): The score of an alignment, S, calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (see PAM, BLOSUM). Gap scores are typically calculated as the sum of G, the gap opening penalty, and L, the gap extension penalty. For a gap of length n, the gap cost would be G + Ln. The choice of gap costs, G and L, is empirical, but it is customary to choose a high value for G (10-15) and a low value for L (1-2).
Raw sequence (原始序列/读胶序列): Individual unassembled sequence reads, produced by sequencing of clones containing DNA inserts.
Receiver operator characteristic: The receiver operator characteristic (ROC) curve describes the probability that a test will correctly declare the condition present against the probability that the test will declare the condition present when actually absent. This is shown through a graph of the test's sensitivity against one minus the test specificity for different possible threshold values.
Redundancy (冗余): The presence of more than one identical item represents redundancy. In bioinformatics, the term is used with reference to the sequences in a sequence database. If a database is described as being redundant, more than one identical (redundant) sequence may be found.
If the database is said to be non-redundant (nr), the database managers have attempted to reduce the redundancy. The term is ambiguous with reference to genetics, and as such, the degree of non-redundancy varies according to the database manager's interpretation of the term. One can argue whether or not two alleles of a locus define the limit of redundancy, or whether the same locus in different, closely related organisms constitutes redundancy. Non-redundant databases are, in some ways, superior, but are less complete. These factors should be taken into consideration when selecting a database to search.
Regular expressions: This computational tool provides a method for expressing the variations found in a set of related sequences including a range of choices at one position, insertions, repeats, and so on. For example, these expressions are used to characterize variations found in protein domains in the PROSITE catalog.
Regularization: A set of techniques for reducing data overfitting when training a model. See also Overfitting.
Relational database (关系数据库): Organizes information into tables where each column represents the fields of information that can be stored in a single record. Each row in the table corresponds to a single record. A single database can have many tables and a query language is used to access the data. See also Object-oriented database.
Scaffold (支架,由序列重叠群拼接而成): The result of connecting contigs by linking information from paired-end reads from plasmids, paired-end reads from BACs, known messenger RNAs or other sources. The contigs in a scaffold are ordered and oriented with respect to one another.
Scoring matrix (记分矩阵): See Position-specific scoring matrix.
SEG (一种蛋白质程序低复杂性区段过滤程序): A program for filtering low complexity regions in amino acid sequences. Residues that have been masked are represented as "X" in an alignment. SEG filtering is performed by default in the blastp subroutine of BLAST 2.0.
(Wootton and Federhen)
Selectivity (in database similarity searches) (数据库相似性搜索的选择准确性): The ability of a search method to locate members of a protein family without making a false-positive classification of members of other families.
Sensitivity (in database similarity searches) (数据库相似性搜索的灵敏性): The ability of a search method to locate as many members of a protein family as possible, including distant members of limited sequence similarity.
Sequence Tagged Site (序列标签位点): Short cDNA sequences of regions that have been physically mapped. STSs provide unique landmarks, or identifiers, throughout the genome. Useful as a framework for further sequencing.
Significance (显著水平): A significant result is one that has not simply occurred by chance, and therefore is probably true. Significance levels show how likely a result is due to chance, expressed as a probability. In sequence analysis, the significance of an alignment score may be calculated as the chance that such a score would be found between random or unrelated sequences. See Expect value.
Similarity score (sequence alignment) (相似性值): Similarity means the extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST, similarity refers to a positive matrix score. The similarity score is the sum of the number of identical matches and conservative (high scoring) substitutions in a sequence alignment divided by the total number of aligned sequence characters. Gaps are usually ignored.
Simulated annealing: A search algorithm that attempts to solve the problem of finding global extrema. The algorithm was inspired by the physical cooling process of metals and the freezing process in liquids where atoms slow down in movement and line up to form a crystal.
The algorithm traverses the energy levels of a function, always accepting energy levels that are smaller than previous ones, but sometimes accepting energy levels that are greater, according to the Boltzmann probability distribution.
Single-linkage cluster analysis: An analysis of a group of related objects, e.g., similar proteins in different genomes, to identify both close and more distant relationships, represented on a tree or dendrogram. The method joins the most closely related pairs by the neighbor-joining algorithm by representing these pairs as outer branches on the tree. More distant objects are then progressively added to lower tree branches. The method is also used to predict phylogenetic relationships by distance methods. See also Hierarchical clustering, Neighbor-joining method.
Smith-Waterman algorithm (Smith-Waterman算法): Uses dynamic programming to find local alignments between sequences. The key feature is that all negative scores calculated in the dynamic programming matrix are changed to zero in order to avoid extending poorly scoring alignments and to assist in identifying local alignments starting and stopping anywhere within the matrix.
SNP (单核苷酸多态性): Single nucleotide polymorphism, or a single nucleotide position in the genome sequence for which two or more alternative alleles are present at appreciable frequency (traditionally, at least 1%) in the human population.
Space or time complexity (时间或空间复杂性): An algorithm's complexity is the maximum amount of computer memory or time required for the number of algorithmic steps to solve a problem.
Specificity (in database similarity searches) (数据库相似性搜索的特异性): The ability of a search method to locate members of one protein family, including distantly related members.
SSR (简单序列重复): Simple sequence repeat, a sequence consisting largely of a tandem repeat of a specific k-mer (such as (CA)15). Many SSRs are polymorphic and have been widely used in genetic mapping.
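The Smith-Waterman entry above can be made concrete with a small sketch in which negative cell scores are reset to zero and the traceback starts at the highest-scoring cell. The match, mismatch and linear gap values below are arbitrary assumptions chosen for illustration; real searches would use a substitution matrix such as BLOSUM or PAM, usually with affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # dynamic programming matrix
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The key feature: a cell never drops below zero.
            H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair_score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair_score:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTACGTT", "GGACGGG"))
```

The table costs time and memory proportional to the product of the two sequence lengths, which is why heuristic programs such as FASTA and BLAST are used for whole-database searches.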
Stochastic context-free grammar: A formal representation of groups of symbols in different parts of a sequence; i.e., not in the same context. An example is complementary regions in RNA that will form secondary structures. The stochastic feature introduces variability into such regions.
Stringency: Refers to the minimum number of matches required within a window. See also Filtering.
STS (序列标签位点的缩写): See Sequence Tagged Site.
Substitution (替换): The presence of a non-identical amino acid at a given position in an alignment. If the aligned residues have similar physico-chemical properties the substitution is said to be "conservative".
Substitution Matrix (替换矩阵): A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Such matrices are constructed by assembling a large and diverse sample of verified pairwise alignments of amino acids. If the sample is large enough to be statistically significant, the resulting matrices should reflect the true probabilities of mutations occurring through a period of evolution.
Sum of pairs method: Sums the substitution scores of all possible pair-wise combinations of sequence characters in one column of a multiple sequence alignment.
SWISS-PROT (主要蛋白质序列数据库之一): A non-redundant (see Redundancy) protein sequence database. Thoroughly annotated and cross-referenced. A subdivision is TrEMBL.
Synteny: The presence of a set of homologous genes in the same order on two genomes.
Threading: In protein structure prediction, the aligning of the sequence of a protein of unknown structure with a known three-dimensional structure to determine whether the amino acid sequence is spatially and chemically compatible with that structure.
TrEMBL (蛋白质数据库之一,翻译自EMBL): A protein sequence database of Translated EMBL nucleotide sequences.
Uncertainty (不确定性): From information theory, a logarithmic measure of the average number of choices that must be made for identification purposes.
See also Information content.

Unified Modeling Language (UML)
A standard sanctioned by the Object Management Group that provides a formal notation for describing object-oriented design.

UniGene (人类基因数据库之一)
A database of unique human genes at NCBI. Entries are selected by near-identical presence in the GenBank and dbEST databases. Each resulting cluster of sequences is considered to represent a single gene.

Unitary Matrix (一元矩阵)
Also known as Identity Matrix. A scoring system in which only identical characters receive a positive score.

URL (统一资源定位符)
Uniform resource locator.

Viterbi algorithm
Calculates the optimal path of a sequence through a hidden Markov model of sequences using a dynamic programming algorithm.

Weight matrix
See Position-specific scoring matrix.
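The Viterbi entry above can be made concrete with a small sketch. The two-state model (AT-rich versus GC-rich regions) and every probability in it are hypothetical values chosen only for illustration; the code also assumes that no probability is zero, since it works with logarithms.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the most probable path."""
    # best[t][s]: log probability of the best path ending in state s at step t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Choose the predecessor state that maximizes the path score.
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path

# Hypothetical toy model, not taken from any real data set.
states = ("AT_rich", "GC_rich")
start_p = {"AT_rich": 0.5, "GC_rich": 0.5}
trans_p = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
           "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
emit_p = {"AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
          "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

log_p, path = viterbi("AATTGGCC", states, start_p, trans_p, emit_p)
```

Working in log space replaces the products of state and transition probabilities with sums, which avoids numerical underflow on long sequences.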

Appendix: Key English Bioinformatics Terms and Definitions

Abstract Syntax Notation (ASN.1) (NCBI发展的许多程序,如显示蛋白质三维结构的Cn3D等所使用的内部格式)
A language that is used to describe structured data types formally. Within bioinformatics, it has been used by the National Center for Biotechnology Information to encode sequences, maps, taxonomic information, molecular structures, and bibliographical information in such a way that it can be easily accessed and exchanged by computer software.

Accession number (记录号)
A unique identifier that is assigned to a single database entry for a DNA or protein sequence.

Affine gap penalty (一种设置空位罚分策略)
A gap penalty score that is a linear function of gap length, consisting of a gap opening penalty and a gap extension penalty multiplied by the length of the gap. Using this penalty scheme greatly enhances the performance of dynamic programming methods for sequence alignment. See also Gap penalty.

Algorithm (算法)
A systematic procedure for solving a problem in a finite number of steps, typically involving a repetition of operations. Once specified, an algorithm can be written in a computer language and run as a program.

Alignment (联配/比对)
Refers to the procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Of the two types of alignment, local and global, a local alignment is generally the most useful. See also Local and Global alignments.

Alignment score (联配/比对值)
An algorithmically computed score based on the number of matches, substitutions, insertions, and deletions (gaps) within an alignment. Scores for matches and substitutions are derived from a scoring matrix such as the BLOSUM and PAM matrices for proteins, and affine gap penalties suitable for the matrix are chosen. Alignment scores are in log odds units, often bit units (log to the base 2). Higher scores denote better alignments.
See also Similarity score, Distance in sequence analysis.

Alphabet (字母表)
The set of distinct symbols that can occur in a sequence; its size is 4 for DNA sequences and 20 for protein sequences.

Annotation (注释)
The prediction of genes in a genome, including the location of protein-encoding genes, the sequence of the encoded proteins, any significant matches to other proteins of known function, and the location of RNA-encoding genes. Predictions are based on gene models, e.g., hidden Markov models of introns and exons in protein-encoding genes, and models of secondary structure in RNA.

Anonymous FTP (匿名FTP)
When an FTP service allows anyone to log in, it is said to provide anonymous FTP service. A user can log in to an anonymous FTP server by typing anonymous as the user name and an e-mail address as the password. Most Web browsers now negotiate anonymous FTP logon without asking the user for a user name and password. See also FTP.

ASCII
The American Standard Code for Information Interchange (ASCII) encodes unaccented letters a-z and A-Z, the digits 0-9, most punctuation marks, space, and a set of control characters such as carriage return and tab. ASCII specifies 128 characters that are mapped to the values 0-127. ASCII files are commonly called plain text, meaning that they only encode text without extra markup.

BAC clone (细菌人工染色体克隆)
Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100–200 kb. Most of the large-insert clones sequenced in the Human Genome Project were BAC clones.

Back-propagation (反向传输)
When training feed-forward neural networks, a back-propagation algorithm can be used to modify the network weights. After each training input pattern is fed through the network, the network's output is compared with the desired output and the amount of error is calculated. This error is back-propagated through the network by using an error function to correct the network weights.
See also Feed-forward neural network.

Baum-Welch algorithm (Baum-Welch算法)
An expectation maximization algorithm that is used to train hidden Markov models.

Bayes' rule (贝叶斯法则)
Forms the basis of conditional probability by calculating the likelihood of an event occurring based on the history of the event and relevant background information. In terms of two parameters A and B, the theorem is stated as an equation, P(A|B) = P(A) P(B|A) / P(B): the conditional probability of A given B, P(A|B), is equal to the probability of A, P(A), times the conditional probability of B given A, P(B|A), divided by the probability of B, P(B). P(A) is the historical or prior distribution value of A, P(B|A) is a new prediction for B for a particular value of A, and P(B) is the sum of the newly predicted values for B. P(A|B) is a posterior probability, representing a new prediction for A given the prior knowledge of A and the newly discovered relationships between A and B.

Bayesian analysis (贝叶斯分析)
A statistical procedure used to estimate parameters of an underlying distribution based on an observed distribution. See also Bayes' rule.

Biochips (生物芯片)
Miniaturized arrays of large numbers of molecular substrates, often oligonucleotides, in a defined pattern. They are also called DNA microarrays and microchips.

Bioinformatics (生物信息学)
The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology. Alternatively, the discipline of obtaining information from genomic or protein sequence data. This may involve similarity searches of databases, comparing an unidentified sequence to the sequences in a database, or making predictions about the sequence based on current knowledge of similar sequences. Databases are frequently made publicly available through the Internet, or locally at an institution.

Bit score (二进制值/Bit值)
The bit score S' is derived from the raw alignment score S by taking the statistical properties of the scoring system into account.
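For orientation, the normalization is commonly written with the Karlin-Altschul parameters of the scoring system. The symbols λ, K, m (query length) and n (database length) are not defined in this glossary and are introduced here as assumptions:

```latex
S' = \frac{\lambda S - \ln K}{\ln 2}, \qquad E = m \, n \, 2^{-S'}
```

Under this form the expect value depends on the scoring system only through S', which is why bit scores from searches run with different matrices can be placed on one scale.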
Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.

Bit units
From information theory, a bit denotes the amount of information required to distinguish between two equally likely possibilities. The number of bits of information, N, required to convey a message that has M possibilities is log2 M = N bits.

BLAST (基本局部联配搜索工具,一种主要数据库搜索程序)
Basic Local Alignment Search Tool. A set of programs used to perform fast similarity searches. Nucleotide sequences can be compared with nucleotide sequences in a database using BLASTN, for example. Complex statistics are applied to judge the significance of each match. Reported sequences may be homologous to, or related to, the query sequence. The BLASTP program is used to search a protein database for a match against a query protein sequence. There are several other flavours of BLAST. BLAST2 is a newer release of BLAST that allows for insertions or deletions in the sequences being aligned; gapped alignments may be more biologically significant.

Block (蛋白质家族中保守区域的组块)
Conserved ungapped patterns approximately 3-60 amino acids in length in a set of related proteins.

BLOSUM matrices (模块替换矩阵,一种主要替换矩阵)
An alternative to PAM tables, BLOSUM tables were derived using local multiple alignments of more distantly related sequences than were used for the PAM matrix. These are used to assess the similarity of sequences when performing alignments.

Boltzmann distribution (Boltzmann分布)
Describes the number of molecules that have energies above a certain level, based on the Boltzmann constant and the absolute temperature.

Boltzmann probability function (Boltzmann概率函数)
See Boltzmann distribution.

Bootstrap analysis
A method for testing how well a particular data set fits a model. For example, the validity of the branch arrangement in a predicted phylogenetic tree can be tested by resampling columns in a multiple sequence alignment to create many new alignments.
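The column resampling step can be sketched as follows. This is a minimal illustration that assumes the alignment is a list of equal-length strings; the function name and the toy alignment are hypothetical, and tree building on each replicate is left out.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by resampling columns with replacement."""
    length = len(alignment[0])
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    # Apply the same column choice to every sequence so rows stay aligned.
    return ["".join(seq[c] for c in columns) for seq in alignment]

alignment = ["ACGTAC", "ACGTTC", "ATGTAC"]  # toy alignment for illustration
replicate = bootstrap_alignment(alignment, random.Random(0))
```

Each replicate has the same dimensions as the original alignment, but some columns appear several times and others not at all.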
The appearance of a particular branch in trees generated from these resampled sequences can then be measured. Alternatively, a sequence may be left out of an analysis to determine how much the sequence influences the results of an analysis.

Branch length (分支长度)
In sequence analysis, the number of sequence changes along a particular branch of a phylogenetic tree.

CDS or cds (编码序列)
Coding sequence.

Chebyshev's inequality
The probability that a random variable deviates from its mean by at least k standard deviations is less than or equal to 1/k^2, the square of 1 over the number of standard deviations from the mean.

Clone (克隆)
Population of identical cells or molecules (e.g. DNA), derived from a single ancestor.

Cloning Vector (克隆载体)
A molecule that carries a foreign gene into a host, and allows or facilitates the multiplication of that gene in a host. When sequencing a gene that has been cloned using a cloning vector (rather than by PCR), care should be taken not to include the cloning vector sequence when performing similarity searches. Plasmids, cosmids, phagemids, YACs and PACs are examples of cloning vectors.

Cluster analysis (聚类分析)
A method for grouping together a set of objects that are most similar from a larger group of related objects. The relationships are based on some criterion of similarity or difference. For sequences, a similarity or distance score or a statistical evaluation of those scores is used.

Cobbler
A single sequence that represents the most conserved regions in a multiple sequence alignment. The BLOCKS server uses the cobbler sequence to perform a database similarity search as a way to reach sequences that are more divergent than would be found using the single sequences in the alignment for searches.

Coding system (neural networks)
Regarding neural networks, a coding system needs to be designed for representing input and output. The level of success found when training the model will be partially dependent on the quality of the coding system chosen.
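One simple coding system for network input is one-hot encoding, sketched below. It is offered only as an example of what such a coding system can look like; the residue ordering and the function name are assumptions, and the glossary does not prescribe any particular scheme.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, in an assumed order

def one_hot(sequence):
    """Encode each residue as a 20-element vector containing a single 1."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        vector = [0] * len(AMINO_ACIDS)
        vector[index[residue]] = 1
        encoded.append(vector)
    return encoded
```

This encoding treats all residues as equally dissimilar; an encoding based on physico-chemical properties would instead place similar residues close together, which is the kind of design choice the entry refers to.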

Codon usage
Analysis of the codons used in a particular gene or organism.

COG (直系同源簇)
Clusters of orthologous groups: a set of groups of related sequences in microorganisms and yeast (S. cerevisiae). These groups are found by whole-proteome comparisons and include orthologs and paralogs. See also Orthologs and Paralogs.

Comparative genomics (比较基因组学)
A comparison of gene numbers, gene locations, and biological functions of genes in the genomes of diverse organisms, one objective being to identify groups of genes that play a unique biological role in a particular organism.

Complexity (of an algorithm) (算法的复杂性)
Describes the number of steps required by the algorithm to solve a problem as a function of the amount of data; for example, the length of sequences to be aligned.

Conditional probability (条件概率)
The probability of a particular result (or of a particular value of a variable) given one or more events or conditions (or values of other variables).

Conservation (保守)
Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.

Consensus (一致序列)
A single sequence that represents, at each successive position, the variation found within corresponding columns of a multiple sequence alignment.

Context-free grammars
A recursive set of production rules for generating patterns of strings. These consist of a set of terminal characters that are used to create strings, a set of nonterminal symbols that correspond to rules and act as placeholders for patterns that can be generated using terminal characters, a set of rules for replacing nonterminal symbols with terminal characters, and a start symbol.

Contig (序列重叠群/拼接序列)
A set of clones that can be assembled into a linear order. A DNA sequence that overlaps with another contig. The full set of overlapping sequences (contigs) can be put together to obtain the sequence for a long region of DNA that cannot be sequenced in one run in a sequencing assay.
Important in genetic mapping at the molecular level.

CORBA (国际对象管理协作组制定的使OOP对象与网络接口统一起来的一套跨计算机、操作系统、程序语言和网络的共同标准)
The Common Object Request Broker Architecture (CORBA) is an open industry standard for working with distributed objects, developed by the Object Management Group. CORBA allows the interconnection of objects and applications regardless of computer language, machine architecture, or geographic location of the computers.

Correlation coefficient (相关系数)
A numerical measure, falling between -1 and 1, of the degree of the linear relationship between two variables. A positive value indicates a direct relationship, a negative value indicates an inverse relationship, and the distance of the value away from zero indicates the strength of the relationship. A value near zero indicates no relationship between the variables.

Covariation (in sequences) (共变)
Coincident change at two or more sequence positions in related sequences that may influence the secondary structures of RNA or protein molecules.

Coverage (or depth) (覆盖率/厚度)
The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a "high-quality base" is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).

Database (数据库)
A computerized storehouse of data that provides a standardized way for locating, adding, removing, and changing data. See also Object-oriented database, Relational database.

Dendrogram
A form of a tree that lists the compared objects (e.g., sequences or genes in a microarray analysis) in a vertical order and joins related ones by levels of branches extending to one side of the list.

Depth (厚度)
See Coverage.

Dirichlet mixtures
Defined as the conjugate prior of a multinomial distribution.
One use is for predicting the expected pattern of amino acid variation found in the match state of a hidden Markov model (representing one column of a multiple sequence alignment of proteins), based on prior distributions found in conserved protein domains (blocks).

Distance in sequence analysis (序列距离)
The number of observed changes in an optimal alignment of two sequences, usually not counting gaps.

DNA Sequencing (DNA测序)
The experimental process of determining the nucleotide sequence of a region of DNA. This is done by labelling each nucleotide (A, C, G or T) with either a radioactive or fluorescent marker which identifies it. There are several methods of applying this technology, each with its advantages and disadvantages. For more information, refer to a current textbook. High-throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes, the sequences may be generated more quickly than they can be characterised.

Domain (功能域)
A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.

Dot matrix (点标矩阵图)
Dot matrix diagrams provide a graphical method for comparing two sequences. One sequence is written horizontally across the top of the graph and the other along the left-hand side. Dots are placed within the graph at the intersection of the same letter appearing in both sequences. A series of diagonal lines in the graph indicate regions of alignment.
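A minimal text rendering of such a diagram might look like the sketch below, which marks every identical pair of characters and applies no filtering. The function names and the example sequences are hypothetical.

```python
def dot_matrix(seq_top, seq_side):
    """Return rows of booleans: True where the two characters are identical."""
    return [[top == side for top in seq_top] for side in seq_side]

def render(seq_top, seq_side):
    """Draw the matrix as text, with '*' for a dot and '.' for an empty cell."""
    lines = ["  " + seq_top]
    for side, row in zip(seq_side, dot_matrix(seq_top, seq_side)):
        lines.append(side + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)

print(render("GATTACA", "GATCACA"))
```

With unfiltered single-character matching, short sequences over a four-letter alphabet already produce many off-diagonal dots, which is why real dot plots usually need some form of noise reduction.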
The matrix may be filtered to reveal the most-alike regions by scoring a minimal threshold number of matches within a sequence window.

Draft genome sequence (基因组序列草图)
The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.

DUST (一种低复杂性区段过滤程序)
A program for filtering low-complexity regions from nucleic acid sequences.

Dynamic programming (动态规划法)
A dynamic programming algorithm solves a problem by combining solutions to sub-problems that are computed once and saved in a table or matrix. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found. This algorithm is used for producing sequence alignments, given a scoring system for sequence comparisons.

EMBL (欧洲分子生物学实验室,EMBL数据库是主要公共核酸序列数据库之一)
European Molecular Biology Laboratory. Maintains the EMBL database, one of the major public sequence databases.

EMBnet (欧洲分子生物学网络)
European Molecular Biology Network. Established in 1988, it provides services including local molecular databases and software for molecular biologists in Europe. There are several large outposts of EMBnet, including ExPASy.

Entropy (熵)
From information theory, a measure of the unpredictable nature of a set of possible elements. The higher the level of variation within the set, the higher the entropy.

Erdos and Renyi law
In a toss of a "fair" coin, the number of heads in a row that can be expected is the logarithm of the number of tosses to the base 2. The law may be generalized for more than two possible outcomes by changing the base of the logarithm to the number of outcomes.
This law was used to analyze the number of matches and mismatches that can be expected between random sequences as a basis for scoring the statistical significance of a sequence alignment.

EST (表达序列标签的缩写)
See Expressed Sequence Tag.

Expect value (E) (E值)
E value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. In a database similarity search, E is the number of alignments scoring at least as well as the one found between a query sequence and a database sequence that would be expected in the same number of comparisons between random sequences. In other types of sequence analysis, E has a similar meaning.

Expectation maximization (sequence analysis)
An algorithm for locating similar sequence patterns in a set of sequences. A guessed alignment of the sequences is first used to generate an expected scoring matrix representing the distribution of sequence characters in each column of the alignment; this pattern is matched to each sequence, and the scoring matrix values are then updated to maximize the alignment of the matrix to the sequences. The procedure is repeated until there is no further improvement.

Exon (外显子)
Coding region of DNA. See CDS.

Expressed Sequence Tag (EST) (表达序列标签)
Randomly selected, partial cDNA sequence; represents its corresponding mRNA. dbEST is a large database of ESTs at GenBank, NCBI.

FASTA (一种主要数据库搜索程序)
The first widely used algorithm for database similarity searching. The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt".
The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable, which specifies the size of a "word" (Pearson and Lipman).

Extreme value distribution (极值分布)
Some measurements are found to follow a distribution that has a long tail which decays at high values much more slowly than that found in a normal distribution. This slow-falling type is called the extreme value distribution. The alignment scores between unrelated or random sequences are an example. These scores can reach very high values, particularly when a large number of comparisons are made, as in a database similarity search. The probability of a particular score may be accurately predicted by the extreme value distribution, which follows a double negative exponential function after Gumbel.

False negative (假阴性)
A data point recorded as negative that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have been recorded as positive.

False positive (假阳性)
A positive data point collected in a data set that was incorrectly reported due to a failure of the test. If the test had correctly measured the data point, the data would have been recorded as negative.

Feed-forward neural network (前馈神经网络)
Organizes nodes into sequential layers in which the nodes in each layer are fully connected with the nodes in the next layer, except for the final output layer. Input is fed from the input layer through the layers in sequence in a "feed-forward" direction, resulting in output at the final layer. See also Neural network.

Filtering (window size)
During pairwise sequence alignment using the dot matrix method, random matches can be filtered out by using a sliding window to compare the two sequences. Rather than comparing a single sequence position at a time, a window of adjacent positions in the two sequences is compared and a dot, indicating a match, is generated only if a certain minimal number of matches occur.

Filtering (过滤)
Also known as Masking.
The process of hiding regions of (nucleic acid or amino acid) sequence having characteristics that frequently lead to spurious high scores. See SEG and DUST.

Finished sequence (完成序列)
Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.

Fourier analysis
Studies the approximations and decomposition of functions using trigonometric polynomials.

Format (file) (格式)
Different programs require that information be specified to them in a formal manner, using particular keywords and ordering. This specification is a file format.

Forward-backward algorithm
Used to train a hidden Markov model by aligning the model with training sequences. The algorithm then refines the model to reduce the error when fitted to the given data using a gradient descent approach.

FTP (File Transfer Protocol) (文件传输协议)
Allows a person to transfer files from one computer to another across a network using an FTP-capable client program. The FTP client program can only communicate with machines that run an FTP server. The server, in turn, will make a specific portion of its file system available for FTP access, provided that the client is able to supply a recognized user name and password to the server.

Full shotgun clone (鸟枪法克隆)
A large-insert clone for which full shotgun sequence has been produced.

Functional genomics (功能基因组学)
Assessment of the function of genes identified by between-genome comparisons. The function of a newly identified gene is tested by introducing mutations into the gene and then examining the resultant mutant organism for an altered phenotype.

Gap (空位/间隙/缺口)
A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score.
Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.

Gap penalty (空位罚分)
A numeric score used in sequence alignment programs to penalize the presence of gaps within an alignment. The value of a gap penalty affects how often gaps appear in alignments produced by the algorithm. Most alignment programs suggest gap penalties that are appropriate for particular scoring matrices.

Genetic algorithm (遗传算法)
A kind of search algorithm that was inspired by the principles of evolution. A population of initial solutions is encoded and the algorithm searches through these by applying a pre-defined fitness measurement to each solution, selecting those with the highest fitness for reproduction. New solutions can be generated during this phase by crossover and mutation operations, defined in the encoded solutions.

Genetic map (遗传图谱)
A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis. The unit of distance is centimorgans (cM), denoting a 1% chance of recombination.

Genome (基因组)
The genetic material of an organism, contained in one haploid set of chromosomes.

Gibbs sampling method
An algorithm for finding conserved patterns within a set of related sequences. A guessed alignment of all but one sequence is made and used to generate a scoring matrix that represents the alignment. The matrix is then matched to the left-out sequence, and a probable location of the corresponding pattern is found. This prediction is then input into a new alignment and another scoring matrix is produced and tested on a new left-out sequence.
The process is repeated until there is no further improvement in the matrix.

Global alignment (整体联配)
Attempts to match as many characters as possible, from end to end, in a set of two or more sequences.

Gopher (一个文档发布系统,允许检索和显示文本文件)
A document publishing system that allows text files to be retrieved and displayed.

Graph theory (图论)
A branch of mathematics which deals with problems that involve a graph or network structure. A graph is defined by a set of nodes (or points) and a set of arcs (lines or edges) joining the nodes. In sequence and genome analysis, graph theory is used for sequence alignments and for clustering similar genes.

GSS (基因组勘测序列)
Genome survey sequence.

GUI (图形用户界面)
Graphical user interface.

H (相对熵值)
H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990). H can be thought of as a measure of the average information (in bits) available per position that distinguishes an alignment from chance. At high values of H, short alignments can be distinguished from chance, whereas at lower H values, a longer alignment may be necessary (Altschul, 1991).

Half-bits
Some scoring matrices are in half-bit units. These units are logarithms to the base 2 of odds scores times 2.

Heuristic (启发式方法)
A procedure that progresses along empirical lines by using rules of thumb to reach a solution. The solution is not guaranteed to be optimal.

Hexadecimal system (十六进制系统)
The base 16 counting system that uses the digits 0-9 followed by the letters A-F.

HGMP (人类基因组图谱计划)
Human Genome Mapping Project.

Hidden Markov Model (HMM) (隐马尔可夫模型)
In sequence analysis, an HMM is usually a probabilistic model of a multiple sequence alignment, but can also be a model of periodic patterns in a single sequence, representing, for example, patterns found in the exons of a gene. In a model of multiple sequence alignments, each column of symbols in the alignment is represented by a frequency distribution of the symbols called a state, and insertions and deletions by other states.
One then moves through the model along a particular path from state to state trying to match a given sequence. The next matching symbol is chosen from each state, recording its probability (frequency) and also the probability of going to that particular state from a previous one (the transition probability). State and transition probabilities are then multiplied to obtain a probability of the given sequence. Generally speaking, an HMM is a statistical model for an ordered sequence of symbols, acting as a stochastic state machine that generates a symbol each time a transition is made from one state to the next. Transitions between states are specified by transition probabilities.

Hidden layer (隐藏层)
An inner layer within a neural network that receives its input from, and sends its output to, other layers within the network. One function of the hidden layer is to detect covariation within the input data, such as patterns of amino acid covariation that are associated with a particular type of secondary structure in proteins.

Hierarchical clustering (分级聚类)
The clustering or grouping of objects based on some single criterion of similarity or difference. An example is the clustering of genes in a microarray experiment based on the correlation between their expression patterns. The distance method used in phylogenetic analysis is another example.

Hill climbing
A nonoptimal search algorithm that selects the single best possible solution at a given state or step. The solution may result in a locally best solution that is not a globally best solution.

Homology (同源性)
A similar component in two organisms (e.g., genes with strongly similar sequences) that can be attributed to a common ancestor of the two organisms during evolution.

Horizontal transfer (水平转移)
The transfer of genetic material between two distinct species that do not ordinarily exchange genetic material.
The transferred DNA becomes established in the recipient genome and can be detected by a novel phylogenetic history and codon content compared to the rest of the genome.

HSP (高比值片段对)
High-scoring segment pair. Local alignments with no gaps that achieve one of the top alignment scores in a given search.

HTGS/HTG (高通量基因组序列)
High-throughput genome sequences.

HTML (超文本标识语言)
The Hyper-Text Markup Language (HTML) provides a structural description of a document using a specified tag set. HTML currently serves as the Internet lingua franca for describing hypertext Web page documents.

Hyperplane
A generalization of the two-dimensional plane to N dimensions.

Hypercube
A generalization of the three-dimensional cube to N dimensions.

Identity (相同性/相同率)
The extent to which two (nucleotide or amino acid) sequences are invariant.

Indel (插入或删除的缩略语)
An insertion or deletion in a sequence alignment.

Information content (of a scoring matrix)
A representation of the degree of sequence conservation in a column of a