Rapid ab initio Prediction of RNA Pseudoknots via Graph Tree Decomposition
- 格式:pdf
- 大小:213.09 KB
- 文档页数:11
生物信息学中RNA序列分析与可变剪接预测方法研究RNA序列是生物信息学研究中的重要内容之一。
它是由DNA转录而来的一种核酸分子,在细胞中具有多种功能,如编码蛋白质、调控基因表达等。
随着高通量测序技术的发展,越来越多的RNA序列数据被产生出来,这使得RNA序列分析与可变剪接预测方法的研究变得尤为重要。
RNA序列分析是对RNA序列进行全面的研究和挖掘,以了解其功能和作用机制。
在RNA序列分析中,首先需要对RNA序列进行质量控制和清洗,以确保后续分析的准确性和可靠性。
接下来,常见的RNA序列分析方法包括差异表达分析、富集分析、转录因子结合位点预测等。
其中,差异表达分析可以帮助我们发现在不同条件下表达水平发生显著变化的基因,从而了解这些基因在生物过程中的功能和调控机制。
富集分析则可以帮助我们发现在特定的功能模块或通路中富集的基因,从而揭示这些功能模块或通路在生物过程中的作用。
转录因子结合位点预测可以帮助我们发现哪些转录因子与RNA序列相互作用,从而推断其可能的调控关系。
RNA的可变剪接是由于同一基因在转录过程中剪接方式的不同而产生的多种剪接异构体。
可变剪接对于生物体的功能多样性和复杂性起到重要作用。
因此,预测可变剪接的方式也成为RNA序列分析的一个重要研究方向。
常见的可变剪接预测方法包括基于比对的方法和基于组装的方法。
基于比对的方法首先对RNA序列进行比对,然后通过分析比对结果来预测可变剪接位点。
基于组装的方法则将RNA序列比对到基因组参考序列上,然后通过组装比对结果来预测可变剪接位点。
这些预测方法的准确性和可靠性对于研究人员能否准确地预测和理解可变剪接的功能至关重要。
除了RNA序列分析和可变剪接预测方法外,还有一些与RNA序列相关的研究方法和工具。
例如,RNA二级结构预测可以帮助我们了解RNA序列中的空间结构和相互作用方式,有助于理解其功能和调控机制。
RNA序列比对和聚类分析可以帮助我们了解不同物种或不同基因之间的相似性和差异性。
受损细胞释放的RNA可促进凝血
佚名
【期刊名称】《基础医学与临床》
【年(卷),期】2007(27)5
【摘要】德国Justus-Liebig大学的KlausT·Preissner博士及其同事于4月10日的《美国科学院院刊》(ProcNatlAcadsciusA,2007;104:6388-6393.)上报告一项新研究的结果显示,从受损细胞释放的RNA是血液凝固的一种关键性辅助因子。
【总页数】1页(P543-543)
【关键词】细胞释放;RNA;凝血;辅助因子;血液凝固;科学院;美国;博士
【正文语种】中文
【中图分类】R962
【相关文献】
1.水蛭提取液对凝血酶诱导血管内皮细胞释放凝血因子的影响 [J], 黄莺;张晓青;李龙;易小民;田雪飞
2.凝血酶受体激活肽和凝血酶对人成纤维细胞释放TGF-β和VEGF的影响 [J], 黄跃生;杨宗城
3.凝血酶受体激活肽和凝血酶对人成纤维细胞释放TGF-β和VEGF的影响 [J], 黄跃生;杨宗城
4.环状RNA circHIPK3调控微RNA124表达促进口腔鳞状细胞癌细胞增殖的研究[J], 王军;赵思语;欧阳少波;黄自坤;罗清;廖岚
5.促进细胞代谢的ATP能促进病毒粒子释放RNA [J], 周昀陵
因版权原因,仅展示原文概要,查看原文内容请购买。
垂体特异性转录因子研究进展
高岩
【期刊名称】《国外医学:内分泌学分册》
【年(卷),期】1995(015)003
【摘要】垂体特异性转录因子于垂体前叶表达,是垂体生长激素细胞,催乳素细胞以及促甲状腺素细胞基因转录的调节者,能特异地识别基因上的DNA序列,并与之结合,进而影响该基因转录。
本文阐述已发现的两种垂体特异性转录因子(Pit-1和Pit-2)对垂体激素分泌功能及垂体疾病发病机理的重要意义。
【总页数】5页(P116-120)
【作者】高岩
【作者单位】无
【正文语种】中文
【中图分类】R584.02
【相关文献】
1.小干扰垂体特异性转录因子-1对大鼠垂体生长激素肿瘤细胞侵袭行为的影响 [J], 范月超;张慧;李锦晓;李中林;纵振坤;陈晨;冯力;苗发安;谢满意
2.垂体特异性转录因子与家畜经济性状关系的研究进展 [J], 李佳蓉;贾超;姜怀志
3.牛垂体特异性转录因子POU1F1研究进展 [J], 赵静雯;孙少华;吴慧光
4.猪垂体特异性转录因子的研究进展 [J], 王政坤;王钦德
5.垂体特异性转录因子1基因遗传多态性与中国西门塔尔牛生长及育肥性状的遗传效应分析 [J], 陈翠;许尚忠;张立敏;陈晓杰;刘喜冬;李姣;张路培;高雪;高会江;李俊雅
因版权原因,仅展示原文概要,查看原文内容请购买。
环状RNA研究现状环状RNA(circRNA)是一类特殊的非编码RNA分子,也是RNA领域最新的研究热点。
与传统的线性RNA(linear RNA,含5’和3’末端)不同,circRNA分子呈封闭环状结构,不受RNA外切酶影响,表达更稳定,不易降解。
在功能上,近年的研究表明,circRNA分子富含microRNA(miRNA)结合位点,在细胞中起到miRNA海绵(miRNA sponge)的作用,进而解除miRNA对其靶基因的抑制作用,升高靶基因的表达水平,进而在转录水平上对靶基因进行调控;这一作用机制被称为竞争性内源RNA(ceRNA)机制。
通过与疾病关联的miRNA相互作用,circRNA在疾病中发挥着重要的调控作用。
一、环状RNA相关研究环状RNA(circular RNA, circRNA)是一类新近发现的内源性非编码RNA(non-coding RNA, ncRNA),是最近RNA领域的最新的研究热点。
Sanger等在20世纪70年代发现某些高等植物中存在可致病的单链环状类病毒,这是人类首次发现cirRNA。
随后人们陆续在酵母线粒体、丁型肝炎病毒中发现cirRNA,也发现在小鼠的睾丸内含有性别决定基因(sex-determining region Y, Sry)转录而来的cirRNA,还证实了cirRNA也存在于人体细胞。
随着高通量测序及生物信息学的快速发展,越来越多的cirRNA在生物体中被发现,它们广泛的,多样化的,保守的,稳定的存在于真核生物中,丰富了RNA家族的内容。
不同物种中鉴定的cirRNA数目是不一样的,总结目前已经发表的研究中cirRNA的鉴定情况如下:刘骏武研究植物水稻的cirRNA,利用总RNA-seq数据,分别利用find_circ、STAR、CIRI三种软件同时进行预测,只选取具有canonical位点的环状RNA,分别找出20、155、69 个,取三个软件结果的并集,总共有213 个环状RNA。
果蝇也能产生一种系统性RNAi反应
佚名
【期刊名称】《昆虫知识》
【年(卷),期】2009()3
【摘要】果蝇和其他昆虫已知能够发起涉及RNA干涉(RNAi)的局部抗病毒防卫。
以前人们认为,果蝇不能系统性地展开RNAi反应,依据是所表达的内生RNA发夹不会从一个细胞向另一个细胞展开。
【总页数】1页(P331-331)
【关键词】RNAi;果蝇;反应;系统;RNA干涉;抗病毒;细胞;昆虫
【正文语种】中文
【中图分类】Q522;Q969.462.2
【相关文献】
1.一种不产生核废物的加速器反应堆 [J], 张果
2.系统性红斑狼疮患者外周血淋巴细胞增殖反应与其产生甲硫氨酸脑啡肽的关系[J], 徐星铭;戴宏;魏伟
3.一种连续产生次氯酸钠的管状电化学反应杀菌器 [J], 曾新安;秦贯丰;李忠彦
4.学懂有关药物基本知识,引导安全用药(三)成的一种适应状态,用药者渴求定期使用某药物,以得到欣快感。
中断用药后会产生严重的戒断反应, [J], 陈孝治;
5.一种黄明龙还原反应副产物产生的机理分析 [J], 刘海涛;蔡亮;郭强;尹登山;王小伟
因版权原因,仅展示原文概要,查看原文内容请购买。
除酒精、病毒、药物等脂肪肝致病因素外,大泡性肝细胞脂肪变性超过5%诊断为非酒精性脂肪性肝病(NAFLD ),其疾病谱包括单纯性脂肪肝、非酒精性脂肪性肝炎(NASH )、肝纤维化、肝硬化和肝癌[1,2]。
2018年NAFLD 全球患病率高达25%[3]。
我国NAFLD 患病率已赶超欧美发达国家,严重危害国民健康和社会发展[1]。
脂肪肝是NAFLD 早期阶段,肝脏脂肪分解利用与新生脂肪生成的失调是重要的致病风险因素,因此,促进肝内脂肪酸β-氧化分解,降低肝脏脂肪堆积可有效延缓或阻止NAFLD 发展。
Lithocholic acid decreases mRNA stability of nuclear receptor PPAR αby upregulating miR-21expression in hepatoma HepG2cellsP ANG Biying,HUANG Nana,HUANG Xiaoxia,LI Xin,XIONG Wenting,KONG Bo,YAO Yan School of Life Science,Guangzhou University,Guangzhou 510000,China摘要:目的探讨石胆酸在转录后水平对PPAR αmRNA 稳定性的调控。
方法构建PPAR α3'UTR 荧光素酶报告基因载体并转染HepG2细胞,比较两种浓度石胆酸处理后荧光素酶活性的变化。
结合生物信息学预测,确定石胆酸诱导下差异表达的miRNAs 及其在3'UTR 上的潜在结合位点,对结合位点进行突变(Mut1、Mut2和Mut1+Mut2),比较石胆酸处理后Mut1、Mut2和Mut1+Mut2荧光素酶活性的变化。
利用Western blot 和RT-qPCR 检测两种浓度石胆酸处理下信号通路的激活及其下游转录因子基因表达水平,比较不转染对照组和瞬时转染过表达转录因子的PPAR α蛋白表达,确定参与石胆酸调控PPAR αmRNA 稳定性的信号通路和转录因子。
Rapid ab initio Prediction of RNA Pseudoknots via GraphTree Decomposition∗Jizhen Zhao†,Russell L.Malmberg‡and Liming Cai†AbstractThe prediction of RNA secondary structure including pseudoknots remains a challenge due to the in-tractable computation of the sequence conformation from nucleotide interactions under free energy models.Optimal algorithms often assume a restricted class for the predicted RNA structures and yet still require ahigh-degree polynomial time complexity,which is too expensive to use.Heuristic methods may yield time-efficient algorithms but they do not guarantee optimality of the predicted structure.This paper introducesa new and efficient algorithm for the prediction of RNA structure with pseudoknots for which the structureis not restricted.Novel prediction techniques are developed based on graph tree decomposition.In partic-ular,based on a simplified energy model,stem overlapping relationships are defined with a graph,in whicha specialized maximum independent set corresponds to the desired optimal structure.Such a graph is treedecomposable;dynamic programming over a tree decomposition of the graph leads to an efficient optimalalgorithm.Thefinal structure predictions are then based on re-ranking a list of suboptimal structures under amore comprehensive free energy model.The new algorithm is evaluated on a large number of RNA sequencesets taken from diverse resources.It demonstrates overall sensitivity and specificity that outperforms or iscomparable with those of previous optimal and heuristic algorithms yet it requires significantly less time thanthe compared optimal algorithms.Keywords:RNA secondary structure prediction,pseudoknot,thermodynamic energy,tree decomposition,tree width,maximum independent set1IntroductionThe secondary structure of an RNA molecule is formed due to short or long distance pairings between nucleotides in the sequence.Base pair regions either single,nested or parallel are called stem-loops;base pair regions crossing each other are called pseudoknots[30].Pseudoknots are important structures in RNA molecules and often play important functional roles such as catalysis,RNA splicing,transcription regulation[15,2,25].Knowing the secondary structures of RNA molecules is critical for determining their three dimensional structures and understanding their functions.Automated prediction of RNA secondary structure is thus in demand since it is expensive and time consuming to experimentally determine the structure.It is computationally challenging to predict RNA secondary structure including pseudoknots.In particular, the problem of predicting RNA pseudoknots with the minimum free energy is provably NP-hard for the nearest neighbor model[17].Practical approaches to cope with this computational challenge are either to restrict the class of pseudoknots under consideration or to employ heuristics in the algorithms.Optimal algorithms for restricted pseudoknot classes are usually thermodynamics-based extended from Zuker’s algorithm for the prediction of pseudoknot-free structures[33].In such algorithms,the predicted optimal structure of a single RNA sequence is the one with the global minimum free energy based on a set of thermodynamical parameters.For example,the recently developed PKNOTS[21]can handle the widest classes of pseudoknots.However,its time complexity O(n6)makes it infeasible to fold RNA sequences of a moderate length.The computational efficiency may be improved at the cost of further restricting the structure of pseudoknots[3,17,31],but still with the time ∗The preliminary version of this paper appeared in the proceedings of the6th Workshop on Algorithms for Bioinformatics(WABI 2006).†Department of Computer Science,University of Georgia,Athens,GA30602,USA.Email:{jizhen,cai}@,Fax: (706)542-2966.To whom correspondence should be addressed.‡Department of Plant Biology,University of Georgia,Athens,GA30602,USA.Email:russell@complexity O(n5)or O(n4).Most such algorithms produce only the optimal solution,while suboptimal ones that may reveal the true structure are often ignored.However,the partition function approach,based on the calculation of equilibrium base-pairing probabilities,captures the contributions of all suboptimal structures[9].On the other hand,computationally efficient heuristic methods have also been explored to allow unrestricted pseudoknot structures.Iterated loop matching(ILM)[23]is one such method.Itfinds the most stable stem, adds it to the candidate secondary structure and then masks offthe bases forming the stem and iterates on the left sequence segments until no other stable stem can be found.One structure is reported at the end.Another algorithm,HotKnots[20],does the prediction in a slightly different way.It keeps multiple candidate structures rather than only one and builds each of them in a similar but more elaborate way.These methods can usually be fast,yet they often do not provide an optimality guarantee for the predicted structure or a quality measure on the predicted structure with respect to the optimal structure.Other heuristic methods based on genetic algorithms usually do not address the optimality issue either[8,1].In this paper,we introduce a novel approach for the optimal prediction of RNA pseudoknots for which the structure is not restricted.Our method is based on a simplified free energy model without accounting for loop energies[19,23].In this method,stable stems are selected from an RNA sequence as vertices of a graph;vertices are connected with edges if corresponding stems conflict(i.e.,overlap)in their positions in the sequence.The optimal structure of an RNA sequence corresponds to a collection of non-conflicting stable stems,which can be found by seeking the maximum weighted independent set(WIS)from the graph.We observe that stable stems can be so selected that the resulting graph is of a moderately small tree width t.Based on a tree decomposition of the graph,a dynamic programming algorithm for WIS of the worst-case time complexity O(2t n)is obtained, where n is the number of vertices in the graph,at most quadratic in the length of the RNA sequence.This is an efficient prediction algorithm parameterized on the tree width t,which is usually small.In addition,the algorithm re-ranks a list of suboptimal structures based on the more comprehensive energy functions used in PKNOTS[21].We implemented our algorithm TdFOLD and evaluated its performance on various RNA sequence sets from different sources.The test results showed high efficiency and high accuracy for our algorithm.TdFOLD was tested against PKNOTS,ILM and HotKnots on a set of50tRNA’s,a set of50small RNA sequences containing pseudoknots with length ranging from23to113,and a set of11large RNA’s with length range from210to 412.The results showed that overall,in terms of the sensitivity and specificity of the prediction,TdFOLD outperforms the optimal algorithm PKNOTS and the heuristic algorithms ILM and HotKnots.In time efficiency, it outperforms PKNOTS and HotKnots,and is comparable with ILM.Graph theoretic methods have previously been explored for RNA structure prediction[30].Our method is different from the previous ones in two respects.Our graphs constructed from the RNA sequence contain vertices describing stems instead of nucleotides;making stem to be the smallest structural unit can greatly simplify the complexity of the problem.More importantly,our graph algorithm takes advantage of the tree decomposition technique on the formulated graphs.In fact,it has been demonstrated that the RNA secondary structure can be profiled with a conformational graph of small tree width[27].The underlying graph constructed for the ab initio structure prediction is essentially an augmentation of the conformational graph in which additional vertices and edges are added only for the overlapping stems,thus inheriting the tree decomposability which makes the algorithm efficient.We note that our algorithm,like other ab initio ones,is suitable for predicting the structure of single RNA sequence.When related structurally homologous sequences are available,the accuracy of RNA structure predic-tion can usually be improved through the use of comparative analysis.This uses the information of the covarying residues in a set of multiple sequences or additional phylogenetic relationship of these sequences and may produce the most reliable prediction for the consensus structure[11,24,16,14,29].Because such methods inevitably involve multiple sources of data or computational tools,they usually rely on human intervention.Nevertheless,a fully automated comparative analysis process was proposed[11,10]for RNA consensus structure prediction.Due to its computational complexity,the process has only been implemented for pseudoknot-free RNAs[11].With algorithm TdFOLD,and a tree decomposition based structure-sequence alignment algorithm we developed earlier [27],the efficient implementation of this fully automated process for pseudoknots becomes possible.The paper concludes with a presentation of how TdFOLD could be used in automatid comparative RNA analysis.Figure1:(a)Ten stable stems in Ec P k4,the fourth pseudoknot in E.coli tmRNA molecule,including their left and right regions,and thermodynamic energies;(b)stem graph for Ec P k4;and(c)a tree decomposition of the stem graph with tree width4.2Methods and AlgorithmGiven an RNA sequence,our algorithmfirst builds a pool of stable stems.A secondary structure is indeed a set of compatible stems(i.e.the stems do not share one or more bases between each other).Based on the observation that a set of compatible stems with minimum or near minimum total stem energies tend to be in the true structure,our algorithmfirstfinds a number of secondary structures with minimum or near minimum total stem energies by a tree decomposition based procedure for a graph formed by the stable stems.These secondary structures are then reordered by counting the stem stabilizing and loop destabilizing energies together, and reported as the predicted structures.2.1Problem formulationA(canonical)base pair is either a Watson-Crick pair(A-U or C-G)or wobble pair G-U.A stem is a set of stacked nucleotide base pairs on an RNA sequence s.In general a stem S can be associated with four positions (i l,j l,i r,j r),where i l<j l<i r<j r,on the sequence s such that(a)(s[i l],s[j r])and(s[j l],s[i r])are two canonical base pairs;and(b)for any two base pairs(s[x],s[y]),(s[z],s[w])in the stem S,either i l≤x<z≤j l and i r≤w<y≤j r,or i l≤z<x≤j l and i r≤y<w≤j r.Region s[i l..j l]is the left region of the stem and s[i r..j r]is the right region of the stem.Stem S is stable if the formation of its base pairs allows the thermodynamic energy∆(S)of the stem to be below a predefined threshold parameter E<0.Figure1(a)shows all the stable stems in Ec P k4with E=−5kcal/mol,the fourth pseudoknot in E.coli tmRNA[32],and their corresponding free energy values.A stem graph G s=(V,E)can be defined for the RNA sequence s,where each vertex in V uniquely represents a stable stem on s,and E contains an edge between two vertices if and only if the corresponding two stems (a,b,c,d)and(x,y,z,w)conflict in their positions,i.e.,one or both of the regions s[a..b]and s[c..d]overlap with at least one of the regions s[x..y]and s[z..w].Figure1(b)shows the stem graph for Ec P k4constructed according to the stable stems given in Figure1(a).The stem graph is a weighted graph,with a weight on every vertex. Usually,the weight of a vertex can simply be absolute value of the thermodynamic energy∆(S)of the stem S corresponding to the vertex.The weight may also be adjusted by scaling it(non-)linearly according to the length of the corresponding stem or the distance between the left and right regions of the stem.The problem of predicting the optimal structure of the RNA then corresponds tofinding a collection of non-conflicting stems from its stem graph which achieves the maximum total weight.This is exactly the same as the graph theoretic problem:finding the maximum weighted independent set(WIS)in the stem graph.Note the optimality of the secondary structures depends on the total stem energies only(this was previously adopted by both the primitive method[19]and the more elaborate one[23]too).In order to rectify the possible bias caused by not counting the loop energies,we output optimal as well as a number of sub-optimal struc-tures.These structures are then re-ordered according to the whole energies including stem stabilizing and loop destabilizing energies,and reported as the predicted structures.2.2Identifying stable stemsFor our purpose,stable stems are defined according to a set of parameters (in addition to the energy threshold parameter E ).In particular,a stem contains at least P base pairs;the loop length in between the left and right region of the stem is at least L ;free energy using only Turner’s base stacking energy parameters is at most E .Bulges within a stem are allowed,for which the stem essentially becomes a set of substems separated by the bulges.In addition,parameter T limits the minimum substem length,and parameter B limits the maximum bulge length.The thermodynamic energy ∆(S )of stem S is calculated by taking into account both the stacking energies and the destabilizing energies caused by bulges.The default values for parameters P,L,E,T,and B are set to 3,3,−5,3,0according to our previous experiments,but they may be adjusted according to different class of RNA’s.We set B to 0since it works well for a lot of RNAs.Besides 0,1and 2are good choices for B for some other RNAs.We leave the choice to the users.A procedure similar to the one used in [14]is employed to identify all the stable stems.These stable stems are called the initial stable stem pool.If two stems only share a few bases,we may resolve the conflict by considering some maximal substems of these two stems.For example,one stem A formed by s [a..a +9]and s [b −9..b ]conflicts with another stem B formed by s [b −1..b +8]and s [c..c +9],we could add two more stems formed by s [a +2..a +9]and s [b −9..b −2],s [b +11..b +8]and s [c..c +7],which are shortened from A and B respectively.This is controlled by a switch parameter S .If it is on,an extended stable stem pool will be built:all the stems in the initial stem pool will be imported into it;for each pair of conflict stems,get the maximal sub-stems that can resolve the conflict and meet the requirements defined by the parameters P,L,E,T,B and add them to the extended pool.If it is off,we just use the initial pool as all of the stable stems.According to our experiments,the use of the extended pool can improve the accuracy for some but not all of the RNAs.We also noticed that the longer a sequence is,the more the spurious stems will be produced.We incorporated some biological knowledge to reduce the spurious stems for long sequences,e.g.parameter adjusting according to the properties of some RNAs,not allowing G-U pairs at the ending of a stem.2.3Tree decomposition based algorithmDefinition [22]A tree decomposition of graph G =(V,E )is a pair (T,X )if it satisfies:1.T =(I,F )is a tree with node set I and edge set F ,2.X ={X i :i ∈I,X i ⊆V },i X i =V and ∀u ∈V ,∃i ∈I such that u ∈X i ,3.∀(u,v )∈E ,∃i ∈I such that u,v ∈X i ,4.∀i,j,k ∈I ,if k is on the path that connects i and j in tree T ,then X i ∩X j ⊆X kThe width of a tree decomposition (T,X )is defined as max i ∈I |X i |−1.The tree width of the graph G is the minimum tree width over all possible tree decomposition of G .If T is restricted to be a path,we refer to (T,X )as a path decomposition and the best width over all of the path decompositions as the path width of G .The tree decomposition is rooted in the deep graph minor theorems by Robertson and Seymour [22].It provides a topological view on a graph and the tree width measures how much the graph is “tree-like”.Figure 1(c)shows a tree decomposition for the stem graph given in Figure 1(b).Many computationally intractable graph problems can be easily solved on graphs of small tree width.In particular,a large number of such graph problems,while intractable on general graphs,can be solved in linear time,given a tree decomposition of tree width ≤t ,for a fixed t .Maximum weighted independent set (WIS)is one such problem [5];it has the time complexity O (2t n ).The factor 2t is due to the dynamic programming enumeration,in each node of the tree decomposition,of all partial independent sets formed by the t vertices.2.3.1Algorithm detailsNow we describe the tree decomposition based dynamic programming algorithm that finds the maximum weighted independent set from the stem graph G =(V,E ).It assumes a binary tree decomposition (T,X ),where X =∪X m i =1,for the stem graph,where m =O (|V |),|X i |=t ,for i =1,...,m .We only discuss the process for achieving the optimal solution.The technical details for getting suboptimal solutions are similar.The algorithm constructs one dynamic programming table m i for every tree node X i ={v 1,...,v t }.Table m i records all possible partial independent sets in the subgraph induced by the set of all the vertices in theFigure2:Dynamic programming table construction over tree decomposition.Table m i is computed also based on the computed tables m k and m j.Row f=(x0,...,x h,...,x m,...,x r,...,x t)in table m i is computed from row g in table m k and row h of table m j,Row g is the optimal for columns X k−X i given the value(x0,...,x m) for columns X k∩X i.Similarly,row h is the optimal for columns X j−X i given the value(x h,...,x r)for columns X j∩X i.subtree rooted at i of the tree decomposition.There are t columns in the table m i,one for each vertex in the corresponding tree node X i.Rows are the combinations of these vertices;a vertex is selected if and only if the corresponding column takes value1.There are additionally three columns V,S,Opt in the table.Column V records whether each row represents a valid independent set,column S is the weight of the valid independent set represented by each row.Column Opt,more sophisticated,is explained in the following.These tables are constructed in a bottom-up fashion,from leaves to the apex of the tree decomposition(see Figure2).Every table contains rows,with each being some combination of the vertices in the corresponding node.Column V is set to be1if the row represents a valid independent set.Column Opt is set1if and only if:•The row represents a valid independent set.•S in this row is optimal among all the rows with different choices in the columns corresponding to the vertices in X i−X p,given the chosen values same as this row in the columns corresponding to the vertices in X i∩X p,where node p is the parent of node i.Column S is set differently based on whether the current node is a leaf or an internal node in the tree decomposition.For a leaf node,S is0if the row is not a valid independent set;otherwise S is the corresponding weight of the set.For an internal node i that has two children j and k whose tables m j,m k have been computed, for each row in table m i,column S is computed as S=w1+w2+w3−w4,where•w1is the weight of the row in table m j with the same combination in the columns corresponding to the vertices in X j∩X i that has column Opt=1;•w2is the weight of the row in table m k with the same combination in the columns corresponding to the vertices in X k∩X i that has column Opt=1;•w3is the weight of the independent set formed by the choices in columns corresponding to the vertices in X i−X j−X k;and•w4is the weight of the independent set formed by the same combination in the columns corresponding to the vertices in X i∩X j∩X k.In implementation,the computation of table T i of node i does not enumerate all combinations of vertices in X i.Instead,in general,a greedy algorithm is used to partition set X i into a collection of cliques.Consider the sequence as a straight line and the left(right)region of a stem as an interval.Let all the left regions of the stable stems included in the tree node form an interval graph.Choose an interval(left region)with the right end at the leftmost position among all of the intervals,record all the intervals that overlap with this interval as a clique and remove them,recursively call on the interval graph left until it is empty.A linear time in t is enough forthis procedure.Once the collection of cliques is obtained for X i,combinations of vertices are only considered by taking at most one vertex from every clique.A similar technique can also be used to further improve the efficiency of the algorithm for some long sequences: we build the stem graph based on the left regions of the stems,then we use a similar procedure to the above one to build a path decomposition of the graph.Each node of the path decomposition is indeed a clique consisting of overlapping half stems.We only need to choose zero or one half stem from each node.During the dynamic programming,each combination is associated with C(a user defined number,default4000)suboptimal partial solutions rather than the optimal one only,and the conflicts caused by the right regions are resolved during the dynamic programming.2.3.2Tree decomposition of stem graphFinding the optimal tree decomposition is NP-hard[4],we use a simple,fast heuristic algorithm to produce a tree decomposition for the given stem graph.This algorithm is based on a heuristic method for greedyfill-in[13].This method will produce a tree decomposition with small tree width but not necessary the optimal one,i.e.the tree width of the tree decomposition might be greater than the tree width of the stem graph.Note this will not affect the optimal property of the tree decomposition based method described above since along any tree decomposition the optimal solution could be found.2.3.3Reordering suboptimal structuresThe list of candidate structures,including the optimal and the suboptimal ones,are reordered based on a more sophisticated energy model.In particular,we recalculate the free energy for each of the candidate structures obtained from the previous step using a procedure implemented in[20]according to the energy model in[26,18] together with the one in[9],which take the stem stabilizing energies,loop destabilizing,and pseudoknot energies into account.3Evaluation Results3.1Data setsIn order to test the effectiveness and the efficiency of our algorithm,we evaluated it on small and large,pseudo-knotted and pseudoknot-free RNAs.We used three sets of RNA sequences.Thefirst set consists of50tRNAs, shown in Table1,with lengths ranging from71to79(with the average75).The second set consists of50small RNA sequences or sequence segments with pseudoknot structures(see Table2)of lengths ranging from23to113 (with the average53).The third set consists of11large pseudoknot or pseudoknot free RNA sequences(see Table 3)of lengths ranging from210to412(with the average344).GA0001GA1262GA2492GA3755GA4966GC2866GD1723GD5199GE2095GE4739GF1407GF4687GG0841GG2136GG3917GH0128GH4536GI1748GI4502GK1078GK4537GM0313GM2284GM4471GM5945GN2837GP1341GP3879GP5312GQ2684GR0044GR0793GR1516GR2309GR3541GR4508GR4705GR4740GR5278GT0109GT1418GT4178GT5273GV0579GV1734GV4391GV5554GW1796GW5332GY4135Table1:Test set one:sequence IDs of50tRNAs,taken from[28].3.2Experiment detailsWe compared the performance of our algorithm TdFOLD and that of algorithms PKNOTS[21],ILM[23],and HotKnots[20].We chose the optimal algorithm PKNOTS since it can deal with the most comprehensive type of pseudoknot,while some newer algorithms can only deal with very restricted kinds of pseudoknots.We ran all these algorithms on the tRNAs and the set of small pseudoknot RNAs,and ran all but PKNOTS on the set of large RNAs.We evaluated both accuracy and efficiency of these algorithms.The accuracy is measured in both sensitivity and specificity.Let RP be the number of base pairs in the real structure,T P(true positive)be theSequence type Sequence IDsmRNA Bt-PrP,Ec alpha,Ec S15,Hs-PrP,T4gene32[32]tmRNA Lp PK1,Ec PK1,Ec PK4[32]ribozymes satRPV,Tt-LSU-P3P7,Bp PK2[32]viral tRNA like OYMV,APLV,CGMMV,SBWMV1,BSMVbeta,CGMMV PKbulge,ORSV-S1,AMV3[32]viral3’UTR BSBV3,TMV-L UPD-PK3,STMV UPD1-PK3,BVQ3UPD-PKb,BSBV1,UPD-PKc,PSLVbeta UPD-PK1,PSLVbeta UPD-PK3,SBWMV1UPD-PKb[32] viral ribosomal minimal IBV,MMTV,MMTV-vpk,pKA-A,BWYV,SRV-1,T2gene32[12];RNA shifting signals EIAV,PLRV-S[32]ribozymes HDV-It ag[32]telomerase RNA T.the telo[32]aptamers NGF-L6[32]rRNA Sc18S-PKE21-7[32]antizyme ribosomalframe shifting site Rr ODCanti[32]viral RNA PSIV IRES[32];TYMV,TMV.L,TMV.R[20]HIV-1-RT ligand RNA HIVRT32,HIVRT322,HIVRT33[20]hepatitis virus ribozyme HDV,HDV anti[20]Table2:Test set two:sequence IDs of50small pseudoknotted RNAs(with their reference citations).Sequence type Sequence IDsRNaseP RNA A.ferrooxidans,idlawii(pseudoknot free),A.tum,B.anthracis,B.halodurans,CPB147,D.desulfuricans,EM14b-9,E.thermomarinus,T.roseum[6]telomerase RNA telo.human[8]Table3:Test set three:large RNAs containing pseudoknots(with their reference citations).number of correctly predicted base pairs and F P(false positive)be the number of predicted base pairs that do not exist as real structures.We define SE(sensitivity)as T P/RP,and SP(specificity,also called as positive predictive value)as T P/(T P+F P).The perfect prediction should yield1for both sensitivity and specificity values.For tRNA,the pseudoknot option for PKNOTS was turned offsince it affects the predictions little for tRNA based on some previous tests.For TdFOLD,parameters were set to default values and the number of output solutions was set to40for tRNAs and small pseudoknotted RNAs.For HotKnots and TdFOLD,we only collect the top prediction result from the program output.We also determine the number of best predictions for each program on all data sets.Here,we say that a program has the best prediction for the secondary structure of an RNA if the sensitivity or specificity of the prediction is not worse than any of the predictions from other programs.The experiments were run on a PC with2.8GHz Intel(R)Pentium4processor and1-GB RAM,running RedHat Enterprise Linux version4AS.Running times were measured using the“time”function.The testing results are summarized in Tables4,5,and6.Due to space limitations,we omit the data for each tRNA and each small RNA with pseudoknots.3.3Prediction accuracyTables4to6summarize the testing results for different programs on the three RNA data sets.In Table4and5, we also record the minimum(maximum)value among the predictions for all of the sequences as min(max)since we did not show the data for each sequence.For example,among the50predictions for tRNAs from TdFOLD, in term of sensitivity,the worst one(min)is0.33and the best(max)one is1.Table4shows that TdFOLD has sensitivity0.81and specificity0.75on average for the tRNA prediction,which are slightly better than PKNOTS and significantly better than ILM and HotKnots.Table5shows that,for the small pseudoknotted RNAs,TdFOLD has average sensitivity0.76,which is less than PKNOTS but greater than ILM and HotKnots.On the other hand,TdFOLD has average specificity0.79,which outperforms all the others. TdFOLD is slightly better in overall accuracy than PKNOTS,which reports the optimal structure according its sophisticated energy model.Table6shows the prediction accuracy on the large RNA’s.TdFOLD maintains thesame sensitivity(0.54)as HotKnots,which is slightly better than ILM.TdFOLD has the highest specificity.As we mentioned,a program has the best prediction for an RNA sequence if the sensitivity or specificity of the prediction is not worse than any of the predictions by other programs.For example,for tRNA,in sensitivity, TdFOLD,PKNOTS,HotKnots,ILM have30,29,15,20best predictions respectively.In specificity,they have 23,27,19,4respectively.For small pseudoknotted RNAs,they have29,31,27,24for sensitivity and29,26,22, 20for specificity respectively.For large RNAs,TdFOLD,HotKnots and ILM have6,3,4for sensitivity and7, 1,4for specificity.Thus,TdFOLD performs better than,or as well as,the other programs tested.We also noticed that different initial stem pools could affect the prediction results.The predictions based on the initial pools according to the parameter values mentioned in this paper may not necessarily be the best ones. It could be straightforward to improve the accuracy of our algorithm:generate multiple initial stem pools for an RNA sequence according to different parameters values;run our current version of the algorithm to produce multiple sets of predictions;pick up a number of the best ones from the multiple sets of the predictions according to the full energy model.Given the efficiency of our algorithm,such an extension is reasonable.Since it could be prejudicial to choose any one particular set of values of the parameters,this extension could also rectify the bias caused by choosing only one parameter value set at some degree.TdFOLD HotKnots ILM PKNOTSSE SP T SE SP T SE SP T SE SP Tmin0.330.290.260.330.250.570.330.250.01000.11max 1.00 1.00 1.37 1.00 1.008.32 1.00 1.000.15 1.00 1.000.24average0.810.750.540.720.66 3.330.750.610.030.780.730.41Table4:Summary of testing results on50tRNAs,where SE:sensitivity,SP:specificity,T:time(in seconds).TdFOLD HotKnots ILM PKNOTSSE SP T SE SP T SE SP T SE SP Tmin000.04000.0500.250.001000.27max 1.00 1.000.57 1.00 1.0057.0 1.00 1.000.05 1.00 1.00>1hraverage0.760.790.360.690.72 5.840.730.690.030.780.731066Table5:Summary of testing results on small pseudoknotted RNAs,where SE:sensitivity,SP:specificity,T:time in seconds(if not otherwise noted).TdFOLD HotKnots ILML SE SP T SE SP T SE SP TA.ferrooxidans3440.420.3814.50.430.381690.380.360.71idlawii3160.450.35 3.130.680.526550.570.45 1.15A.tum.RNaseP4000.580.70.460.610.6314320.770.82 1.24B.anthracis4080.430.570.470.350.322220.430.42 1.49B.halodurans4120.560.60.480.530.44134830.560.530.99CPB1472980.180.177.620.590.521700.300.250.85D.desulfuricans3600.620.61 5.590.580.521590.480.450.63EM14b-93550.720.73 4.580.660.51297100.670.610.86E.thermomarinus3310.460.41 1.860.650.561480.540.470.83telo.human2100.860.69 3.170.280.181570.700.550.81T.roseum3500.620.63 1.830.240.1927140.510.45 1.10average3440.540.53 3.970.540.4944560.510.440.97Table6:Summary of testing results on large RNAs(pseudoknots),where L:length,SE:sensitivity,SP:specificity, and T:time in seconds.3.4Time EfficiencyEfficiency comparisons are also given in Tables4,5and6on each data set,respectively.For tRNA’s,the average running time of0.54seconds for TdFOLD is slower than the average0.03of ILM and the average0.41of PKNOTS。