Sequencing by Hybridization An Enhanced Crossover Operator for a Hybrid Genetic Algorithm
- 格式:pdf
- 大小:153.32 KB
- 文档页数:28
生物芯片技术——生物化学分析论文08应化2江小乔温雪燕袁伟豪张若琦2011-5-3一、摘要:生物芯片技术,被喻为21世纪生命科学的支撑技术,是便携式生化分析仪器的技术核心,是90年代中期以来影响最深远的重大科技进展之一,是融微电子学、生物学、物理学、化学、计算机科学为一体的高度交叉的新技术,具有重大的基础研究价值,又具有明显的产业化前景。
由于用该技术可以将极其大量的探针同时固定于支持物上,所以一次可以对大量的生物分子进行检测分析,从而解决了传统核酸印迹杂交(Southern Blotting 和Northern Blotting 等)技术复杂、自动化程度低、检测目的分子数量少、低通量(low through-put)等不足。
二、关键词生物芯片;检测;基因三、正文(一)、生物芯片的简介生物芯片技术是一种高通量检测技术,通过设计不同的探针阵列、使用特定的分析方法可使该技术具有多种不同的应用价值,如基因表达谱测定、突变检测、多态性分析、基因组文库作图及杂交测序(Sequencing by hybridization, SBH)等,为"后基因组计划"时期基因功能的研究及现代医学科学及医学诊断学的发展提供了强有力的工具,将会使新基因的发现、基因诊断、药物筛选、给药个性化等方面取得重大突破,为整个人类社会带来深刻广泛的变革。
该技术被评为1998年度世界十大科技进展之一。
(1)它包括基因芯片、蛋白芯片及芯片实验室三大领域。
基因芯片(Genechip)又称DNA芯片(DNAChip)。
它是在基因探针的基础上研制出的,所谓基因探针只是一段人工合成的碱基序列,在探针上连接一些可检测的物质,根据碱基互补的原理,利用基因探针到基因混合物中识别特定基因。
它将大量探针分子固定于支持物上,然后与标记的样品进行杂交,通过检测杂交信号的强度及分布来进行分析。
蛋白质芯片与基因芯片的基本原理相同,但它利用的不是碱基配对而是抗体与抗原结合的特异性即免疫反应来检测。
DNA测序技术发展摘要自从1977年双脱氧链终止法和化学裂解法问世以来,经过三十几年的努力,DNA测序技术已经取得了很大的发展,测序技术已经成为现代生物学和生物信息学研究中的重要手段之一。
1990年提出的人类基因组测序计划历时14年,耗费30亿美元,利用双脱氧终止法完成了对第一个人的基因组测序工作,而利用现在的第二代测序技术中的SOLiD的测序平台,则只需要一周左右的时间就能完成对人基因组的测序工作。
在第一代双脱氧终止法、化学裂解法和第二代测序技术的基础上,以单分子测序为特点的第三代测序技术已经诞生,其终极目标为实现1000美元的人类基因组测序计划。
相信在不久的将来,充分利用测序技术必将完成构建生命进化树等一系列研究任务,并最终搭建针对个体的个性化医学体系。
关键字双脱氧链终止法(Sanger法)化学裂解法(Maxam Gilber法)第二代测序技术(Next Generation Sequencing Technology)第三代测序技术(Third Generation Sequencing Technology)单分子实时测序(Single Molecule Real Time DNA Sequencing: SMRT)纳米孔测序简介快速和准确的获取生物体的遗传信息于生命科学研究一直具有十分重要的意义。
对于每个生物体来说,基因组包含了整个生物体的遗传信息。
测序技术能够真实地反映基因组DNA 上的真实序列,通过软件的分析得到序列所代表的遗传信息,进而通过基因组的比较来全面地揭示基因组的复杂性和多样性,因而在生命科学研究中扮演了十分重要的角色。
测序技术最早可以追溯到20世纪50年代,早在1954年就已经出现了关于早期测序技术的报导,即Whitfeld等用化学降解的方法测定多聚核糖核苷酸序列。
1977年Sanger等发明的双脱氧核苷酸末端终止法和Maxam、Gilbert等发明的化学降解法,标志着第一代测序技术的诞生。
生物芯片及应用简介一、简介生物芯片(biochip)是指采用逛到原位合成或微量点样等方法,将大量生物大分子比如核酸片段、多肽分子甚至组织切片、细胞等等生物样品有序地固化于支持物(比如玻璃、硅片、聚丙烯酰胺凝胶、尼龙膜等载体)的表面,组成密集二维分子排列,然后与标记的待检测生物样品中靶分子杂交,通过特定的仪器比如激光共聚焦扫描或电荷偶联摄像机(CCD)对杂交信号的强度进行快速、并行、高效地检测分心,从而判断样品中靶分子的数量。
由于常用玻片/硅片作为固相支持物,且在制备过程模拟计算机芯片的制备技术,所以称之为生物芯片技术。
根据芯片上的固定的探针不同,生物芯片包括基因芯片、蛋白质芯片、细胞芯片、组织芯片,另外根据原理还有原件型微阵列芯片、通道型微阵列芯片、生物传感芯片等新型生物芯片、如果芯片上固定的是肽或蛋白,则称为肽芯片或蛋白芯片;如果芯片上固定的分子是寡核苷酸探针或DNA,就是DNA芯片。
由于基因芯片(Genechip)这一专有名词已被业界的领头羊Affymetrix公司注册专利,因而其他厂家的同类产品通常称为DNA微阵列(DNA Microarray)。
这类产品是目前最重要的一种,有寡核苷酸芯片、cDNA芯片和Genomic芯片之分,包括二种模式:一是将靶DNA固定于支持物上,适合于大量不同靶DNA的分析,二是将大量的探针分子固定于支持物上,适合于对同一靶DNA进行不同探针序列的分析。
生物芯片技术是90年代中期以来影响最深远的重大科技进展之一,是融微电子学、生物学、物理学、化学、计算机科学为一体的高度交叉的新技术,具有重大的基础研究价值,又具有明显的产业化前景。
由于用该技术可以将及其大量的探针同时固定于支持物上,所以一次可以对大量的生物分子进行检测分析,从而解决了传统核酸印迹杂交(Southern Blotting 和Northern Blotting等)技术复杂、自动化程度低、检测目的分子数量少、低通量(low through-put)等不足。
生物样品分析英语作文标题,Analysis of Biological Samples。
In modern scientific research and medical diagnostics, the analysis of biological samples plays a crucial role in understanding diseases, developing new treatments, and monitoring health conditions. Through various analytical techniques, scientists can extract valuable informationfrom biological samples such as blood, urine, saliva, and tissue, contributing to advancements in biotechnology and healthcare.One of the most common biological samples used for analysis is blood. Blood contains a wealth of information about an individual's health, including levels of various biomarkers, such as glucose, cholesterol, and hormones. Analyzing blood samples can help diagnose diseases such as diabetes, cardiovascular disorders, and hormonal imbalances. Techniques such as enzyme-linked immunosorbent assay (ELISA), polymerase chain reaction (PCR), and massspectrometry are often employed to detect and quantify specific molecules in blood samples with high sensitivity and accuracy.Urine is another valuable biological sample for analysis. It provides insights into kidney function, hydration levels, and metabolic processes in the body. Urinalysis, which involves examining the physical, chemical, and microscopic properties of urine, can help diagnose urinary tract infections, kidney diseases, and metabolic disorders. Additionally, drug testing often relies on urine samples to detect the presence of illicit substances or medications.Saliva has emerged as a non-invasive alternative to blood and urine for certain types of analysis. Salivary biomarkers can reflect physiological changes associatedwith stress, hormonal fluctuations, and oral health. Researchers are exploring the use of saliva testing for diagnosing conditions such as periodontal disease, diabetes, and even certain types of cancer. Advances in saliva collection devices and analytical techniques have madesaliva analysis more accessible and reliable.Tissue samples, obtained through procedures such as biopsies, are indispensable for understanding the molecular mechanisms underlying diseases such as cancer. Histological analysis of tissue samples allows pathologists to examine cellular structures and identify abnormalities indicative of disease. In addition to traditional histopathology, molecular techniques such as fluorescence in situ hybridization (FISH), immunohistochemistry (IHC), and next-generation sequencing (NGS) enable deeper molecularprofiling of tissues, guiding personalized treatment strategies for patients.The analysis of biological samples is not without its challenges. Sample collection, storage, and processing must be carefully standardized to ensure reproducibility and reliability of results. Contamination, degradation, and variability inherent to biological samples can introduce errors and confound interpretations. Moreover, ethical considerations surrounding the use of human samples, including informed consent and patient privacy, must berigorously addressed in research and clinical settings.Despite these challenges, the analysis of biological samples continues to drive innovation in medicine and biotechnology. Emerging technologies such as microfluidics, biosensors, and artificial intelligence promise to revolutionize how biological samples are analyzed, making diagnostics faster, more accurate, and more accessible. Collaborative efforts among scientists, clinicians, and industry partners are essential to harnessing the full potential of biological sample analysis for improving human health.In conclusion, the analysis of biological samples is a cornerstone of modern biomedical research and clinical practice. By harnessing the wealth of information contained within blood, urine, saliva, and tissue samples, scientists and healthcare professionals can gain insights into disease mechanisms, develop targeted therapies, and monitor treatment responses. Continued advancements in analytical techniques and technology will further enhance our abilityto leverage biological sample analysis for the benefit of individuals and society as a whole.。
2012年第4期2012№4辽宁林业科技Journal of Liaoning Forestry Science &Technology收稿日期:2012-02-22基金项目:引进国际先进林业科学技术项目(2012-4-39);辽宁省科学技术计划重大项目(2011207002);林业转基因生物安全性监测项目(JJ-11-03)。
高通量测序技术及其在植物研究中的应用冯健1,赵雪崴2(1.辽宁省林业科学研究院,辽宁沈阳110032;2.辽宁省林业调查规划院,辽宁沈阳110122)摘要:随着“后基因组时代”的到来,国际学术界已经认识到发展快速低成本测序技术的重要性。
在这样的背景下,高通量测序技术应运而生,其代表性技术平台有Roche 的454测序技术、Illumina 的Solexa 测序技术和ABI 的Solid 测序技术。
笔者通过查阅相关文献,系统回顾了第1代测序技术发展过程,详细叙述了高通量测序技术原理和方法,探讨了高通量测序技术在植物分子生物学研究中的应用。
关键词:高通量测序技术;测序技术;植物中图分类号:Q503文献标识码:A文章编号:1001-1714(2012)04-0029-072003年4月14日,美国人类基因组研究项目首席科学家Collins F 博士在华盛顿隆重宣布:人类基因组序列图绘制成功,人类基因组计划(Human Genome Project ,HGP )的所有目标全部实现。
这标志“人类基因组计划”胜利完成和“后基因组时代”(Post Genome Era ,PGE )正式来临[1]。
进入后基因组时代,功能基因组学、系统基因组学等研究越来越受到重视,与之相应的是对测序技术的需求有增无减,传统的测序方法已经不能满足深度测序和重复测序等大规模基因组测序的需求。
国际学术界已经清醒地认识到发展快速低成本测序技术的重要性。
2006年10月美国加州Santa Monica 的X Prize 基金会承诺给第1个实现1000美元人类全基因组测序的个人或单位颁发1000万美元的奖金,成为有史以来生物医学领域奖金额度最高的科技奖,足见人类科学对测序技术的需求。
2. 基因(gene)是一段携带功能产物(多肽,蛋白质,tRNA和rRNA和某些小分子RNA信息的DNA片段,是控制某种性状的的遗传单位。
3. 密码子偏爱 ( codon bias ):指在不同物种的基因中经常为某种氨基酸编码的只是其中的一个密码子。
4. 基因的剪接位点 ( splice sites ): 一般有特定的序列特征,计算机程序利用这种序列特征可预测将近50%的外显子及20%的完整基因。
值佯谬( C value paradox ):生物体的进化程度与基因组大小之间不完全成比例的现象。
N 值佯谬( N value paradox ):基因组中基因数目与生物进化程度或复杂程度的不对称性6. 基因组(genome :是指一个细胞或生物体的一套完整的单倍体遗传物质.()7. 基因家族(genefamily): 指核苷酸序列或编码产物的结构具有一定同源性的一些基因。
(04)8. 基因超家族( gene superfamily ):结构上具有一定的相似性,但功能不一定相似,且进化上的亲缘关系较远。
如免疫球蛋白基因超家族、丝氨酸蛋白酶基因超家族等( 05)9. 基因组学(genomics):发展和应用基因作图、DNA测序、基因定位等新技术以及计算机程序,分析生命体(包括人类)全部基因组结构及功能10. 微卫星DNA(microsatellite DNA:或称简短串连重复,由2~6个核苷酸的重复顺序组成,如(CA)n、(GA)n、(TA)n,n 为15~30具有多态性,卫星长度常小于1OObp,大量分布每条染色体11. 小卫星DNA(minisatellite DNA): 由6~12个核酸的重复顺序组成,位于染色体端粒及其附近,长度数十~数千bp12. 大卫星DNA(macrosatellite DNA ):即经典的卫星DNA由数十个核苷酸的重复单位构成,主要存在于异染色区和着丝粒。
13. 蛋白质组(proteome) 是指细胞或组织表达的全部蛋白质14. 蛋白质组学(proteomics): 是从整体上采取高通量/ 大规模手段研究所有蛋白组成及其活动规律.15. 单核苷酸多态性( single nucleotide polymorphism, SNP ) : 指发生在基因组序列中单个碱基的改变引起的DNA序列的变化16. 限制性片段长度多态性( restriction fragment length polymorphism ,RFLP:DNA位点的多态性导致限制性内切酶切割位点的差异,即RFLP是第一代DNA遗传学标记()17. 顺式作用元件(cis-acting element) :真核生物中能够被基因调控蛋白特异性识别和结合,并对自身基因转录起始有调节作用的DNA序列18. 核心启动子(Core promoter):常由TATA盒、位于TATA盒上游的的上游启动子元件、以转录点为中心的起始子和下游启动子元件, 4 个元件组合而成。
基因芯片技术在生命科学研究中的应用进展及前景分析熊伟【摘要】目的:探讨生命科学研究领域基因芯片技术研究现状及未来的应用前景.方法:收集有关基因芯片技术在生命科学研究中的国内外研究资料并加以综合归纳.结果:基因芯片技术是一种高新生物技术,因具有高通量、并行性、微型化与自动化等特点,在生命科学中日益显示出其重要的理论与实际应用价值.结论:基因芯片技术在生命科学领域的深入研究具有重要的理论意义和应用价值,前景广阔.【期刊名称】《生命科学仪器》【年(卷),期】2010(008)002【总页数】5页(P32-36)【关键词】基因芯片技术;DNA阵列;应用;前景【作者】熊伟【作者单位】大理学院基础医学院生物化学与分子生物学教研室,大理,云南,671000;云南大学生命科学院生物化学与分子生物学实验室,昆明,云南,651000【正文语种】中文基因芯片(Gene chip)也被称之为DNA阵列(DNA array), DNA集微芯片(DNA microchip)或寡核苷酸阵列(Aligonucleotide array)。
1991年Stephen Fodor 博士[1]首次提出基因芯片的概念,决定将硅技术与生物学技术融合在一起, 借助半导体技术进行芯片研制, 解读生命有机体在长期进化中累计下来的浩瀚基因信息。
美国Affymetrix 生物公司于1996年制造出世界上第一块商业化的基因芯片。
由此掀起了基因芯片研究热潮,出现了多种类型的基因芯片制作技术。
如电压打印法[2],机械打点法[3];电定位技术[4]等。
近年, 随着各种相关技术的进步, 基因芯片技术的应用范围不断扩大, 尤其在基因表达分析(gene expression)及基因诊断(gene diagnoses)方面有非常显著的成果。
尤其是2003年人类基因组计划(human genome project, HGP)测序工作的完成,基因芯片技术已成为“后基因组时代”基因功能分析研究的最重要技术之一。
第七章 DNA序列分析DNA的一级结构决定了基因的功能,欲想解释基因的生物学含义,首先必须知道其DNA 顺序。
因此DNA序列分析(DNA sequencing)是分子遗传学中一项既重要又基本的课题。
1986年由美国学者提出的,目前正在实施的人类基因组计划(human genome project),则是要通过对人类基因组3×109bp全序列的序列分析和人类基因的染色体图谱制定达到了解其结构,认识其功能,即从分子遗传学水平来认识人类自身的结构和功能特征的目的。
核酸的核苷酸序列测定方法已经过近20年的发展,因而测序的具体方法五花八门、种类繁多。
但是究其所依据的基本原理,不外乎Sanger的核酸链合成终止法及Maxam和Gilbert的化学降解法两大类。
虽然原理不同,但这两种方法都同样生成互相独立的若干组带放射性标记的寡核苷酸,每组寡核苷酸都有固定的起点,但却随机终止于特定的一种或多种残基上。
由于DNA链上每一个碱基出现在可变终止端的机会均等,因而上述每一组产物都是一些寡核苷酸的混合物,这些寡核苷酸的长度由某一种特定碱基在原DNA片段上的位置所决定。
然后在可以区分长度仅相差一个核苷酸的不同DNA分子的条件下,对各组寡核苷酸进行电泳分析,只要把几组寡核苷酸加样于测序凝胶中若干个相邻的泳道之上,即可从凝胶的放射自显影片上直接读出DNA上的核苷酸顺序。
以下分别介绍。
1、Sanger的双脱氧链终止法这是1977年由英国剑桥大学分子生物学实验室的生物化学家Sanger(桑格)等人发明的,是一种简单快速的DNA序列分析法,利用DNA聚合酶和双脱氧链终止物测定DNA核苷酸序列。
它的基本原理是:利用DNA聚合酶的两种酶促反应的能力。
第一是,DNA聚合酶能够利用单链的DNA作模板,准确地催化合成出DNA互补链。
实际上这是DNA在体外进行的复制过程。
第二是,DNA聚合酶能够利用2′,3′-双脱氧核苷三磷酸作底物,使之掺入到寡核苷酸链(由几个核苷酸组成的核苷酸链叫做寡核苷酸链)的3′末端,从而终止DNA链的生长。
基因芯片技术及其在植物基因功能研究中的应用摘要:基因芯片技术即dna微列阵技术,作为一种高通量快速分析技术,已广泛地应用于植物基因组研究。
本文简要综述了基因芯片的制备及分类、实验设计和数据分析,以及基因芯片在植物胁迫应答基因功能研究中的应用。
关键词:基因工程;基因芯片;植物胁迫应答中图分类号:q789文献标识码:a基因芯片是伴随人类基因组计划而发展起来的一种高新生物技术,具有快速、高效、大规模、高容量、高度并行性等特点,已成为目前国际上生命科学研究的热点之一。
随着植物基因序列数据库迅速增长,基因芯片已成为植物基因组学的主要手段之一。
近几年,采用基因芯片技术进行转基因植物表达谱分析的研究越来越广泛,通过对差异基因生物信息学分析,筛选与植物胁迫应答相关基因,从而深入研究其在植物胁迫应答过程中调控机理。
1基因芯片的概念及分类基因芯片是利用核酸杂交测序(sequencing by hybridization,sbh)原理,在载体表面建立可寻址的高密度dna分子微阵列,通过与标记过的样品核酸序列互补匹配,进行测序与大规模平行检测生物未知基因分子的有关信息。
通过基因芯片技术可大规模、高通量地对成千上万条基因同时进行研究,从而大大加快了基因研究的效率。
基因芯片的种类较多,根据dna微阵列上的核酸序列长度,基因芯片可分为两类:一类是cdna 微阵列;另一类是寡聚核苷酸微阵列。
根据基因芯片所用的载体材料不同,可分为玻璃芯片、硅芯片、膜芯片、陶瓷芯片等;根据基因芯片制备方式不同,可分为原位合成芯片、直接点样芯片、电定位芯片和三维芯片等。
2基因芯片实验设计实验设计是基因芯片实验研究中重要的部分,是芯片数据可靠的前提。
由于基因芯片实验成本昂贵,在进行实验时需严格设计和认真操作。
实验设计中探针筛选、芯片选择、生物学重复次数对试验数据质量都有影响。
基因芯片中荧光实验是利用标记了红色荧光cy5和绿色荧光cy3的两个样品同时与基因芯片进行杂交,基因芯片上每一个点包括了这两种样品中相应mrna的荧光信息,通过比较两者的荧光信号强度计算相对表达量。
Sequencing by Hybridization:An Enhanced Crossover Operator fora Hybrid Genetic AlgorithmCarlos A.Brizuela1,Luis C.Gonz´a lez-Gurrola2,Andrei Tchernykh1,and Denis Trystram3(1)Computer Sciences Department,CICESE Research CenterKm107Carr.Tijuana-Ensenada,Ensenada,B.C.,M´e xicocbrizuel@cicese.mx,+52–646–1750500(2)Instituto Tecnol´o gico Superior de Santiago PapasquiaroKm.114Carretera J.Guadalupe Aguilera-Guanacev´ıSantiago Papasquiaro,Dgo.34600M´e xico.(3)ID-Institut IMAGAvenue Jean Kuntzmann38330Montbonnot Saint Martin38041Grenoble Cedex9FranceAbstractThis paper presents a genetic algorithm for an important computational biology problem.The prob-lem appears in the computational part of a new proposal for DNA sequencing denominated sequencing by hybridization.The general usage of this method for real sequencing purposes depends mainly on the development of good algorithmic procedures solving its computational phase.The proposed genetic al-gorithm is a modified version of a previously proposed hybrid genetic algorithm for the same problem.It is compared with two well suited meta-heuristic approaches reported in the literature:the hybrid genetic algorithm,which is the origin of our proposed variant,and a tabu-scatter search algorithm.Experimental results carried out on real DNA data show the advantages of using the proposed algorithm.Furthermore, statistical tests confirm the superiority of the proposed variant over the state-of-the-art heuristics.Keywords.Sequencing by Hybridization,Hybrid Genetic Algorithm,Greedy Crossover.IntroductionThe DNA sequencing is one of the most important problems in molecular biology.It refers to the iden-tification of an unknown short DNA sequence of100≤n≤1000base pairs.There are two classical approaches to sequencing:the chemical one proposed by Maxam and Gilbert(1977),and the one involv-ing gel electrophoresis by Sanger and Coulson(1978),widely used in sequencing laboratories.A relatively new approach denominated sequencing by hybridization(SBH)(Bains and Smith1988),which consists of biochemical and computational parts,offers an interesting alternative.The biochemical part of this method has been already widely used for Single Nucleotide Polymorphism(SNP)analysis(Hirschhom et al.2000) and its applicability to real sequencing problems depends mainly on the development of good algorithmic techniques solving the computational part(Blazewicz et al.2002a).For the biochemical part we are given an unknown DNA fragment,composed by a sequence of nucleotides from a set of four:A(adenine),C (cytosine),G(guanine),and T(thymine).In this set A and T are complements as well as C and G.From this sequence all oligonucleotides(short sequence of nucleotides)of length l,usually8to10nucleotides (Iduri and Waterman1995)are detected,and compose the unknown sequence,when properly ordered and overlapped.The detection is performed using a full DNA chip(Southern1988),or full DNA array,which contains all fragments(oligonucleotides)of length l,i.e.4l fragments.Copies of the unknown DNA sequence treated with afluorescent substance are deposited on the DNA chip.During the biochemical phase,com-plementary fragments,of length l come together,i.e.hybridize.Two DNA fragments are complementary if at each position the nucleotide of one fragment is a complement of the corresponding nucleotide in the other fragment.After the hybridization and by reading afluorescent image of the chip,we can obtain the spectrum,which is the set of oligonucleotides composing the unknown DNA sequence.Here,the second stage of SBH starts,i.e.given the spectrum and the length of the unknown DNA sequence,which can be estimated by using gel electrophoresis(Krebs and Dunaway1996),find the order of oligonucleotides such that consecutive elements always overlap in l−1nucleotides.When the hybridization is performed without errors the spectrum includes all length l oligonucleotides of the unknown DNA sequence.However,many factors do not allow the experiment to be run error-free.There are two types of errors in the spectrum:if the spectrum does not include one or more oligonucleotides of the original DNA sequence,then we have an experiment with negative errors.On the other hand,if the spectrum includes oligonucleotides that are not present in the original sequence then we have an experiment with positive errors.Every repetition of an oligonucleotide within the original sequence becomes a negative error,since the hybridization cannot detect the number of occurrences of oligonucleotides in the sequence.The presence of negative errors forces the overlap between some neighboring oligonucleotides in the sequence to be of length less than l−1.The pres-ence of positive errors in the spectrum forces some oligonucleotides to be rejected during the reconstruction process.When the spectrum contains only negative errors,the problem can be approximately modeled as the shortest common superstring problem(SCS)(Blazewicz and Kasprzak2003),which is NP-hard(Garey and Johnson1978).On the other hand,when the spectrum contains both types of errors,the problem can be represented as the selective traveling salesman problem(STSP)(Blazewicz et al.1999)which is also known to be NP-hard(Garey and Johnson1978).There have been many attempts in trying tofind an efficient method to solve the computational part of the SBH.The problem can be solved in polynomial time(Pevzner1989)when the biochemical phase is run error-free.The optimal solution can be obtained by reducing the problem tofinding an Eulerian path in a directed graph.However,when errors are present,the problem becomes strongly NP-hard(Blazewicz and Kasprzak2003),i.e.,there is no known polynomial time algorithm for the SBH problem with errors. Exact and heuristic methods have been proposed assuming errors in the spectrum.A Tabu Search based algorithm for the latter case is proposed by Blazewicz et al.(2000).Later on,the same authors proposed a sophisticated heuristic(Blazewicz et al.2002a)that improves their previous results.At the same time, a hybrid genetic algorithm(Blazewicz et al.2002b)with a greedy crossover is proposed to effectively and efficiently deal with positive and negative errors.This genetic algorithm gets better results than the Tabu Search algorithm(Blazewicz et al.2000).Blazewicz et al.(1999)present a branch and bound method to deal with positive and negative errors.This algorithm obtains its best performance on instances with only positive errors,however,it has problems handling negative errors.In a recent work by Zhang et al.(2003),a new exact algorithm,based on the information of connected oligonucleotides,predecessor-successor relation,is employed to effectively handle both types of errors.Indeed,even in the presence of repetitions,this algorithm has exceptionally good performance.In(Bui and Youseef2004),a genetic algorithm is proposed to deal with both type of errors,and compared with the one in(Blazewicz et al.2002b).In a more recent work (Blazewicz et al.2004),the Tabu search algorithm(Blazewicz et al.2000)is enhanced by scatter search.The Tabu-Scatter-Search algorithm is compared with the results presented in(Blazewicz et al.1999),(Blazewicz et al.2000)and(Blazewicz et al.2002b),and in all cases it obtains better results.In this paper,a genetic algorithm1to deal with the computational part of the SBH problem is introduced. This algorithm is a variant of the hybrid GA(HGA)proposed by Blazewicz et al.(2002b).The variant achieves better similarity results,with respect to the original sequences for computationally hard instances, and shorter computation time.The remainder of the paper is organized as follows.Section1describes the SBH problem.Section2 introduces the proposed algorithm variant.Section3presents the experimental setup and results.Finally,Section4summarizes the paper and points out ideas for future research.1Problem StatementThe input for the computational part of the SBH problem consists of a set S={s1,s2,···,s k}of equal length(l)strings s i over the alphabetΣ={A,C,G,T},and a number n representing the length of the unknown sequence.Each s i is always a fragment of the original sequence N,whenever the experiment is error-free(|S|=n−l+1).However,in general,s i may represent a fragment that is not in the original sequence(positive errors).Furthermore,there may be fragments in the original sequence N that do not appear as a string s i in S(negative errors).The problem is tofind a sequence L of length no greater than n such that the number of used strings s i is maximized,and therefore the differences between N and L is minimized.The justification for maximizing the number of used strings s i’s,is based on the assumption that most of the information from the hybridization experiment is correct.The SBH problem with positive and negative errors was proven to be strongly NP-hard(Blazewicz and Kasprzak2003).2The Proposed AlgorithmOur Sequencing Genetic Algorithm(SGA)is based on the idea of genetic algorithms(Holland1975;Goldberg 1989).The SGA is similar to the one proposed by Blazewicz et al.(2002b)in all genetic operators except for the crossover.The encoding method and the objective function are defined next.2.1EncodingEach individual i is represented by a permutation of indices of oligonucleotides in the spectrum.Specifically, the adjacency-based coding(Grefenstette et al.1985)is used:value i at locus j in the chromosome means that oligonucleotide i follows oligonucleotide j in the solution.Therefore,feasible solutions are represented by subcycle free permutations,except for a single cycle of length|S|.Figure1shows a feasible individual i=[42301]for a given spectrum S={CT G,ACT,GGA, GAC,T GA}.In this chromosome,the number4at locus0indicates that the oligonucleotide at position 0(CTG)in the spectrum,is followed by the oligonucleotide at position4(TGA)in the spectrum.In this way,following the indices in the spectrum,the sequence of oligonucleotides that individual i represents is: CTG,TGA,ACT,GGA,GAC(0,4,1,2,3,0).Notice that this solution defines a set of candidate sequences,not a single sequence,as it will become clear with the explanation of thefitness function computation.(Introduce here Figure1)Algorithm1.SGAInput:S and nOutput:A candidate sequence L of length|L|≤n1for i=1to P opIndividual(i)3Evaluate Objectiveindividual found6While Num Iter without change in the Objective individual)7do Crossover with probability P c8Apply Generational Replacement9for j=1to P opFunction(j)11do Select Individuals to update the population12Keep the bestsize is the SGA population size.The function GenerateFunction(i)computes thefitness for each individual i.Thefitness of each individual is the maximum number of oligonucleotides that,overlapped in the order given by the chromosome,generates a sequence of nucleotides not longer than n,and with the maximum total overlap. P c is the crossover probability.Figure2shows the way the objective function is computed.First,we start at position zero of the resulting cycle(0,4,1,2,3,0).Then we join oligonucleotides,once the sequence has3oligonucleotides(CTG, TGA and ACT)and an n ≤6≤n,the next oligonucleotide(GGA)can be added.In this case,the resulting sequence with4oligonucleotides is CTGACTGGA(0,4,1,2).However,the last oligonucleotide(GGA)has to be eliminated since it makes the sequence to be of length(n )greater than n.Hence,the number of oligonucleotides used when starting at position0is3.The process is repeated starting at each locus in the chromosome.This can be done given that the permutation of indices in the individual generates a cycle of length|S|,and the last oligonucleotide can be considered to be adjacent to thefirst one.After the selection of all positions,the sequence employing the maximum number of oligonucleotides without violating the lengthrestriction is chosen.For example,the sequence generated starting at position2(2,3,0,4)is GGACTGA. This sequence uses four oligonucleotides,being the maximum number this individual represents,and its length is exactly seven,that is n.Notice that for this particular case,the resulting sequence is identical to the original one,but this is not always the case.For each individual itsfitness value is normalized,based on the maximum number of oligonucleotides in any valid sequence,and then linearly scaled,f new=(fOfSize−23if it does not produce subcycle4do pick up the best overlapping oligonucleotide from the parents5else6do pick up the best overlapping oligonucleotide from the spec-trum such that a subcycle is not introduced7set the last oligonucleotideSpectrumthe offspring are set randomly.For this randomly selected oligo there are two possible successors(oligos): one in parent1and the other in parent2.First we calculate the overlap generated by each of this successors and we take the one with longest overlap,as long as it does not produce a subcycle.Otherwise,the best oligonucleotide from the spectrum is chosen.The best oligonucleotide is the one which produces the longest overlap with the previously selected one and that does not produce a subcycle.In all cases,if there is more than one best choice,thefirst found is chosen.Finally,the last oligonucleotide for completing a cycle of length|S|is set.Notice that the implementation of Step3implies that we search for the best overlapping oligo in the parents,and if it produces a subcycle then we go to the spectrum without verifying is the other parent produces a subcycle.Figure3shows how this operation is performed.Suppose again that the spectrum is given by S= {CT G,ACT,GGA,GAC,T GA}.Let us assume that thefirst oligonucleotide randomly selected is GAC,i.e.,oligonucleotide number3,and its randomly selected locus is2(Fig.3A).Notice that for completing the cycle,oligonucleotide2will be the last one to be selected in the chromosome.We search in Parent1and in Parent2for the successors of oligonucleotide3,which are4and0,respectively(Fig.3B).The best successor, that is the best overlapping oligonucleotide,is oligonucleotide0(Fig.3C).Next,we look at the parents for the best successor of oligonucleotide0,which is oligonucleotide2,but this oligonucleotide generates a subcycle (Fig.3D).In this case,we search in the spectrum for the best successor which is oligonucleotide4(Fig.3E). For oligonucleotide4,both successors in the parents generate a subcycle,so we search in the spectrum,given that oligonucleotide2will be the last one,we have only one option,oligonucleotide1(Fig.3F).Finally,for completing the cycle,we select oligonucleotide2(Fig.3G).(Introduce here Figure3)The previously proposed crossover(Blazewicz et al.2002b)used a80-20rule.This means that from the first to the last generations80%of the oligos come from the parents and20%from the spectrum.The main idea of what we propose here is to set this oligo’s selection rule according to the quality of parents.At the beginning the quality of parents is low and the oligos are searched mainly in the spectrum.When the quality of parents increases the number of oligos coming from them also increases.In this way the algorithm can tune itself to a broader set of instances than it can do with the80-20rule.Another very important aspect of this idea is that the worst case computing time for a successful search in the parents takes O(1)while in the case of searching in the spectrum it takes O(|S2|),where S is the spectrum cardinality.3Computational Experiments:Setup and ResultsIn the computational experiments,the proposed algorithm SGA has been compared with two other meta-heuristic approaches for this problem:the hybrid genetic algorithm(HGA)(Blazewicz et al.2002b),and the Tabu-Scatter search algorithm(TSS)(Blazewicz et al.2004).These algorithms produce the best up to date results for SBH.They have been applied to real DNA sequences.The data used in these experiments are exactly the same as those used by Blazewicz et al.(2002b,2004).These spectra have been derived from the DNA sequences coding human proteins(taken from GenBank,National Institute of Health,USA). Their accession numbers are given by D00726,D11428,D13510,X00351,X02160,X02874,X02994,X04772, X05299,X05451,X05908,X06537,X06985,X07820,X07982,X07994,X12654,X13440,X13452,X13561, X14322,X14618,X14758,X14894,X15005,X15610,X51408,X51535,X51841,X52104,X53799,X54867, X55762,X57548,X58794,Y00093,Y00264,Y00649,Y00651and Y00711,respectively.Each spectrum was modified with the introduction of40%of errors(20%negative and20%positive).Given that100≤|S|≤500, the spectrum instances contain from40to200errors(e.g.for|S|=500,100randomly chosen oligonucleotides are erased,and100randomly generated oligonucleotides are introduced in the spectrum).The spectra have been sorted alphabetically,thus no information about the original order of oligonucleotides from their original sequences is known(Blazewicz et al.2002b).The length of oligonucleotides(l=10)and the length of the sequences(109≤n≤509)have been chosen on the basis of real hybridization experiments(Caviani et al. 1994),where l≤n.For each spectrum size,40different instances are generated.The parameters for the instances are shown in Table1.(Introduce here Table I)The input instances are obtained in the following way:1.Obtain a DNA sequence N of length n from GenBank(use the accession codes previously given).2.Derive a set P of strings of equal length l from the sequence N shifting the characters one by one.3.Modify P randomly eliminating a given number of strings so that some of the resulting strings do notnecessarily overlaps in exactly l−1characters.4.Modify P randomly adding a given number of strings of length l.5.Ensure that P is a free subset(i.i.,there are not repeated oligos)which becomes the input of thealgorithm S.We show an example for this method:N=CACCGCATCGA,n=7,l=4,P={CACC,ACCG,CCGC,CGCA,GCAT,ATCG,TCGA}Error=20%,Number of strings to erase=0.20*7= 1.4 =1Index=rnd[1...|P|]=4P =CACC,ACCG,CCGC,*,GCAT,ATCG,TCGAS=ACCG,ATCG,CACC,CCGC,GCAT,TCGATo add20%of positive errors we just need to add one oligo of length four randomly generated.The sequences produced by the algorithms have been compared with the original sequences using a pairwise alignment algorithm(Stoye1998).The costs of alignment for two sequences are as follows:a mismatch(different nucleotides)brings a penalty of one point,a gap(an insertion,a nucleotide against a space)also brings a penalty of one point,and a match(the same nucleotides at a given position in both strings)brings no penalty(i.e.,0).Therefore,when two sequences have an alignment cost of zero they are identical(100%similar).The parameters of the HGA have been set as follows.The population size is set to50individuals,and the condition for termination is20iterations without improvement in the objective function.These values led to good quality solutions and short computation times(Blazewicz et al.2002b).The parameters of the TSS have been set to values resulting in similar computation times as used by Blazewicz et al.(2004)when both algorithms are compared.The parameters of the SGA are identical to those used by the HGA,P op Of(Introduce here Table2)We can see that two methods(HGA and TSS)produce solutions of very high quality.However,the TSS algorithm is shown to be better than the HGA.The solutions obtained by the HGA have average qualities that range from98.25%(393.0)to99.87%(79.9)of the optimum,and by the TSS in the range from99.15% (396.6)to99.87%(79.9)of the optimum.The optimum number returned by the TSS algorithm is greater than the one returned by the HGA in all cases.For thefirst three spectrum sizes,the TSS algorithm returns the optimal sequence more than75%of the time.However,the optimum returned by one algorithm does not necessarily correspond to the original sequence,as it can be seen in the row of original sequences found.This fact makes us assume that some instances have more than one optimal solution,in terms of the objective function.For spectra of cardinality500there is a notable difference in the similarity score between these algorithms.Results produced by the SGA are also shown in this table.Average qualities range from99.1% (396.7)to99.75%(79.8)of the optimum values.These outcomes are generally better than those obtained by the HGA,and the difference becomes notable as the spectrum size increases(note that both GA’s were executed with the same parameters).This can be seen in the optimum number and in the number of original sequences found.In contrast,the TSS algorithm obtains better results when compared to those generated by the SGA,but the difference decreases as the cardinality of the spectrum increases.In a very important criterion,the similarity score,the SGA obtains better results than the other two algorithms.Although the number of original sequences found by the TSS is greater than the ones obtained by the SGA,the latter provides better results in the similarity score.The difference in computation time between the SGA and the other algorithms,SGA is the fastest,allows us to improve the results obtained by the TSS,regarding the optimum number and the original sequences found.This outperformance is achieved by increasing,in the SGA,the population size and the number of iterations.Even with this increase in population size and number of iterations the SGA remains the fastest algorithm.Table3presents the results obtained by the SGA with the new set of parameters,P op ofthe other algorithms in much less computation time.All algorithms were ran on the same machine.The differences in computation time,between SGA and HGA,are due to the number of times the algorithms go directly to the spectrum to search for the best successor.The SGA goes to the spectrum only10%of the time.(Introduce here Figure4)To assess the statistical difference among the results of the HGA,TSS and the SGA,the non-parametric Kruskal-Wallis(Zar1999)and Tukey multiple comparison(Zar1999)tests were used.Given the importance of the application of these algorithms in real sequencing experiments,we considered to apply these tests on the similarity score criterion.For a confidence level ofα=0.01the tests proved statistical difference among the SGA and the other algorithms for spectra cardinality greater than200oligonucleotides.It means that with a high level of confidence(99.99%),the differences in similarity scores of the SGA and those of the HGA and TSS algorithms are due to real differences in their performances,and not to random events.For spectra cardinality of100oligonucleotides,the similarity score of the algorithms are alike,i.e.the differences are not statistically significant.Although for real hybridization experiments the original sequences have109≤n≤509(Caviani et al. 1994),another series of experiments were performed in order to observe the behavior of the algorithms for larger sequences(709and1009nucleotides).A total of20instances were derived for each sequence length(using l=10).Their accession numbers(GenBank)are:X05908,X02994,X51841,X06537,X07820, D11428,Y00093,X02160,X14894,X06985,Y00264,X04772,X57548,X15610,X07994,X51408,X52104, Y00649,X14758and X00351,respectively.Table4presents results produced by the algorithms over the mentioned data.All entries are average values over30runs,and at each run the algorithms were applied to20instances.The SGA was executed using the parameters P op ofunder the same conditions as those of Table2.Table5shows the results,and as expected the solutions are of very high quality.The clear winner is the TSS in terms of solution quality.On the other hand,SGA is the fastest algorithm.Another set of exploratory experiments were performed for n=709and1009.The results are shown in Table6.It is clear that for l=20,a length of1000is not yet a threshold for the algorithms’performance degradation.An interesting open question is to know the threslhold,in terms of n,for each algorithm when l=20.As we also expect,the relative performance,considering quality of solution,of SGA is improving as|S|increases.Again,when the issue is computation time then SGA is the clear winner.We can predict with much confidence that our SGA will outperform the other two,in terms of solution quality for bigger and more realistic|S|.3.1DiscussionThe reason for success of the proposed algorithm is conjectured to be the quasi-deterministic choices it makes when selecting successors in the crossover operator.The only random part in the whole process is in the selection of the initial locus and allele(Step1of Algorithm2).This operator exploits the problem structure by using a deterministic greedy procedure which allows the SGA to select the best successor for a given oligonucleotide.The use of this operator makes the selection of the successors faster,and selecting the best successor helps to improve the solution quality.In order to have a clearer idea of the performance of the greedy crossover and its influence on the construction of good solutions,the source from where the oligos come is analyzed.The criteria are measured considering a single run of a randomly selected instance.The larger the number of times the oligos are selected from the parents the faster the algorithm becomes.This is because searching the best oligo from the spectrum is computationally more expensive than searching from the parents.Figure5A shows the source (parents or spectrum)for selecting the next oligonucleotide in the crossover operator for constructing new individuals.It shows that the main source are the parents and that as the GA’s population evolves(through generations)this source is selected approximately90%of the time.We can see in thisfigure that the SGA can tune itself dynamically,i.e.the source for selecting oligos is notfixed as it was proposed by Blazewicz et al.(2002b).Given one oligonucleotide,the application of a greedy strategy(Step6of Algorithm2)guarantees the selection of the best successor in the spectrum.This does not necessarily happen if we select the next oligonucleotide from the parents.Therefore,it will be interesting to see what the differences are between oligonucleotides selected from the parents and oligonucleotides selected in a greedy strategy,considering thesame previous oligonucleotide.Figure5B shows the quality of oligonucleotides selected from the parents. After thefirst10generations,the oligonucleotides selected from the parents are almost(90%)the same as the ones selected if we would search in the spectrum for the best successors.This shows how high quality genetic information is passed from parents to children and this is a key point for determining the operator’s performance given the short computation time needed to transfer this information.(Introduce here Figure5)4ConclusionsWe have presented an efficient and effective genetic algorithm for the computational part of the sequencing by hybridization problem.An almost deterministic greedy crossover operator is introduced to improve the solution quality and to reduce the computational cost of a recently proposed hybrid genetic algorithm.Exper-imental results obtained on real DNA data show that the proposed algorithm can handle a large percentage of both positive and negative errors,yielding very high quality putational experiments were performed to compare the algorithm with two other meta-heuristic approaches:a previous hybrid genetic algorithm and a tabu-scatter search algorithm.The new results are shown to be better than the previous ones.The algorithm achieved high similarity score,in average,in all instances of more than90%.The main criterion for comparing these algorithms was the similarity score between the generated se-quences and the original ones.Statistical tests applied to the results of these algorithms proved the superiority of the new variant of the hybrid genetic algorithm.The proposed algorithm does not use any additional information about spectra or original sequences,which could be derived from biochemical experiments,and may help to obtain better similarity scores.Furthermore,some ideas explaining the behavior of the crossover operator are given in order to get a better understanding of the rationale for success of the algorithm.As a future research we will further explore the rationale for success of the algorithm as well as its ro-bustness to larger values of positive and specially of negative errors.AcknowledgmentsThis research was partially supported by the Laboratorio Franco-Mexicano de Inform´a tica.The authors would like to thank Jacek Blazewicz and Marta Kasprzak for their valuable help in providing the test data and the executables for the HGA and TSS algorithms.The authors would also like to thank the anonymous referees for their comments that have been very helpful in writing thisfinal version.。