Bioperl经典教程Bioperl_II.talk
- 格式:pdf
- 大小:347.90 KB
- 文档页数:45
Perl 入门介绍(1)perl是一门免费及开源的脚本语言,由Larry Wall所创造,这门语言以实用,快速开发为主要目标,与当前流行的面向对象,结构化编程有些格格不入。
但是这并不妨碍perl被广泛流传和使用,世界范围内围绕Perl建立起了非常活跃的社区,很多人在其中不断帮助完善文档,创建示例代码,提供一些第三库等等。
具体可以浏览以下两个网页: ,.perl最出名也最擅长的要数文本处理了,很多其它语言要花几十上百行代码才能完成的任务,perl可能只需要几行代码就能完成。
但这些优势是有代价的,Perl书写代码的风格有时常常被诟病,典型的面向过程式语言。
还有就是,Perl一直以来所坚持的哲学:只要不引起冲突误解,代码可以写成多种形式--There's More Than One Way To Do It。
这导致Perl在语法上具有非常松散随意的特点,同一件事情常常有多种不同的写法,有些看起来还很奇特,体现在语法上,Perl的符号特别多,让Perl在一定程度比较难学易忘。
本文主要简单介绍一下Perl的基本语法,目的是通过本文,你能基本了解Perl的写法,能够顺利的去读别人的代码。
天下的语言在一定程度上是相通的,如果有一门语言的基础,去学另一本语言,入门基本不会太难,但要用到熟,得心应手,第三库的使用等,这些就需要耐心,假以时日了。
本节主要介绍一下各种数据类型.(一)语句及注释:Perl语句以分号(;)结尾,用#作为一行的注释,没有其它语言中那种跨行的注释。
代码块用大括号围起来,这个和c类似。
但这个大括号在有些地方是强制要求,如在if ,for,do, while等语句中。
它不像其它语言一样会用缩进来判断块。
(二) 变量: 标量(scalar) & 列表(list)Perl把简单的数据类型,如字符串,数字等“单数”的东西统称为标量,与之相对的,就是“复数”的东西,如数组。
标量的声明都是$开始,如: $str = "abc"多个标量也可以放一处一起声明:($x, $y, $z) = (11, 22, "no", 4);而数组的声明则是以@开头,如:@arr = ("abc","edf")Perl的变量声明和很多其它脚本语言一样,不需要指明类型,直接声明赋值就可以使用。
Bioperl 操作指南camelbbs@Bioperl 为许多经典的生物信息学程序提供了软件模块,这些包括: 从本地或远程数据库获取数据; 转换数据库或文件记录的格式; 操作单个序列; 搜索相似序列;创建和进行序列比对;搜索基因组上的基因及其它结构; 发展机器可读的序列注释;下面的章节将描述bioperl 怎样执行这些任务;III.1从本地和远程数据库中获取数据bioperl 主要集中于序列操作,但是在用bioperl 操作序列之前,需要获取序列数据。
现在你可以直接将序列数据输入到bioperl 的Seq 对象,例如:$seq = Bio::Seq->new(-seq => 'actgtggcgtcaact',-desc => 'Sample Bio::Seq object', -display_id => 'something', -accession_number => 'accnum', -alphabet => 'dna' );然而,在大多数时候,从在线文档及数据库中获取序列更优越。
注意在生物信息学的传统叫法中有时候被称作“数据库”的很可能是一个“索引平台文件”。
Bioperl 支持远程数据获取,也可为访问本地数据库创建索引。
有两个普通的方法完成这个。
如果你知道序列储存在什么样的数据库中(例如文本文件、本地关系型数据库或一个internet 上可访问的远程数据库),你可以写一个脚本特定地从这些数据库中获得数据。
这种方法将在III.1.1 节和III.1.2节中描述,这两节分别讲如何从远程数据库和本地的索引平台文件中获取数据。
明确地从本地关系型数据库中获取序列数据需要安装和设置bioperl-db 库和BioSQL 计划中的模块,更多介绍可见IV .3节。
另一个方法是使用最近发展起来的OBDA(Open Bioinformatics Data Access)注册系统。
第七章Bioperl模块唐玉荣tangyurong@主要内容7.1 Bioperl模块简介7.2 Bioperl功能与用法1.Sequences类2.Alignments类3.Analysis类4.Databases类7.3 实例7.1 Bioperl模块简介1.什么是Bioperl–Bioperl是一组Perl模块,它的主要目的在于利用Perl解决生物学研究中的一些问题,主要是生物信息学中的各种实际问题,如获取分子数据,分析序列文件,序列比对,大批量BLAST等。
–Bioperl开发项目正式开始于1995年。
它是一个开放源代码项目,使用Perl语言开发,由OBF(Open BioinformaticsFoundation) 所支持。
Bioperl的官方主页是:2.Bioperl相关文档–Bioperl tutorial/Core/Latest/bptutorial.html –Mastering Perl for Bioinformatics: Introduction to bioperl /catalog/mperlbio/chapter/ch09.pdf –Bioperl documentation/Core/Latest/modules.html –Modules listing/bioperl-live/–Jason Stajich的Bioperl教程/Bioperl_Tutorials/–Beginner’s HOWTO by Brian Osborne & James Thompson /HOWTOs/Beginners/index.html–法国Pasteur Institute的Bioperl课程http://www.pasteur.fr/recherche/unites/sis/formation/bioperl/–WIBR Bioinformatics and Research Computing /bio/education/bioinfo-mini/unix-perl/ /~lehvasla/bioperl/7.2 Bioperl功能与用法1.Sequences类2.Alignments类3.Analysis类4.Databases类1.Sequences类–The Bio::SeqIO class–Sequence classes–Features and Location classes –Sequence analysis tools•The Bio::SeqIO class文档:/releases/bioperl-1.0.1/Bio/SeqIO.html 功能:格式转换例如:SwissProt格式转化为Fasta格式#! /usr/bin/perl–wuse strict;use Bio::SeqIO;my $in = Bio::SeqIO->newFh( -file => ’<seqs.sp’, -format => ’swiss’);my $out = Bio::SeqIO->newFh( -file => ’>seqs.fasta’, -format => ’fasta’);print $out $_ while <$in>;phred output phdQuality values (get a sequence of quality scores)qual GAME XML format gameACeDB sequence format aceRaw format (one sequence per line, no ID) rawGCG format GCGProtein Information Resource format PIRSCF tracefile format SCFSwissprot format swissGenBank format GenBankEMBL format EMBLFASTA format Fastadescription formatBio::SeqIO->newFh 中的format 说明:Swissprot格式ID 7LES_DROME STANDARD; PRT; 2554 AA.AC P13368; Q9TYI0; Q9U5V7; Q9VZ36;DT 01-JAN-1990 (Rel. 13, Created)DT 16-OCT-2001 (Rel. 40, Last sequence update)DT 15-JUN-2004 (Rel. 44, Last annotation update)DE Sevenless protein (EC 2.7.1.112).GN SEV OR HD-265 OR CG18085.OS Drosophila melanogaster(Fruit fly).OC Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;OC Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;OC Ephydroidea; Drosophilidae; Drosophila.OX NCBI_TaxID=7227;RN [1]RP SEQUENCE FROM N.A.RC STRAIN=Canton-S;RX MEDLINE=88282538; PubMed=2840202;RA Basler K., Hafen E.;RT "Control of photoreceptor cell fate by the sevenless protein requiresRT a functional tyrosine kinase domain.";RL Cell 54:299-311(1988).RN [2]RP SEQUENCE FROM N.A.RC STRAIN=Oregon-R;RX MEDLINE=88329706; PubMed=3138161;RA Bowtell D.L.L., Simon M.A., Rubin G.M.;RT "Nucleotide sequence and structure of the sevenless gene ofRT Drosophila melanogaster.";RL Genes Dev. 2:620-634(1988).CC -!-FUNCTION: Receptor for an extracellular signal required toCC instruct a cell to differentiate into an R7 photoreceptor. TheCC ligand for sev is the boss (bride of sevenless) protein on theCC surface of the neighboring R8 cell.CC -!-CATALYTIC ACTIVITY: ATP + a protein tyrosine = ADP + protein CC tyrosine phosphate.CC -!-SUBUNIT: May form a complex with drk and Sos.CC -!-DOMAIN: It is unclear whether the potential membrane spanning CC region near the N-terminus is present as a transmembrane domain inCC the native protein or serves as a cleaved signal sequence.CC -!-SIMILARITY: Belongs to the Tyr family of protein kinases. Insulin CC receptor subfamily.CC -!-SIMILARITY: Contains 7 fibronectin type III domains.CCDR EMBL; J03158; AAA28882.1; -.DR EMBL; X13666; CAA31960.1; ALT_INIT.DR EMBL; X13666; CAB55310.1; -.DR EMBL; AE003484; AAF47992.2; -.DR EMBL; AJ002917; CAA05752.1; -.DR PIR; A28912; TVFF7L.DR HSSP; P08069; 1JQH.DR FlyBase; FBgn0003366; sev.DR GO; GO:0005886; C:plasma membrane; IDA.DR GO; GO:0004713; F:protein-tyrosine kinase activity; IDA.DR GO; GO:0045467; P:R7 development; NAS.DR GO; GO:0008293; P:torso signaling pathway; NAS.DR InterPro; IPR003961; FN_III.DR InterPro; IPR008957; FN_III-like.DR InterPro; IPR000033; Ldl_receptor_rep.DR InterPro; IPR000719; Prot_kinase.DR InterPro; IPR002011; RecepttyrkinsII.DR InterPro; IPR001245; Tyr_pkinase.DR InterPro; IPR008266; Tyr_pkinase_AS.DR Pfam; PF00041; fn3; 6.DR Pfam; PF00069; Pkinase; 1.DR PRINTS; PR00109; TYRKINASE.DR ProDom; PD000001; Prot_kinase; 1.DR SMART; SM00060; FN3; 6.DR SMART; SM00135; LY; 2.DR SMART; SM00219; TyrKc; 1.DR PROSITE; PS50853; FN3; 7.DR PROSITE; PS00107; PROTEIN_KINASE_ATP; 1.DR PROSITE; PS50011; PROTEIN_KINASE_DOM; 1.DR PROSITE; PS00109; PROTEIN_KINASE_TYR; 1.DR PROSITE; PS00239; RECEPTOR_TYR_KIN_II; 1.KW Transferase; Tyrosine-protein kinase; Receptor; Vision; Transmembrane; KW Glycoprotein; ATP-binding; Phosphorylation; Repeat.FT DOMAIN 1 2123 Extracellular(Potential).FT TRANSMEM 2124 2147 Potential.FT DOMAIN 2148 2554 Cytoplasmic(Potential).FT DOMAIN 311 431 Fibronectin type-III 1.FT DOMAIN 436 528 Fibronectin type-III 2.FT DOMAIN 822 921 Fibronectin type-III 3.FT DOMAIN 1298 1392 Fibronectin type-III 4.FT DOMAIN 1680 1794 Fibronectin type-III 5.FT DOMAIN 1797 1897 Fibronectin type-III 6.FT DOMAIN 1898 1988 Fibronectin type-III 7.FT DOMAIN 2038 2046 Poly-Arg.FT DOMAIN 2209 2485 Protein kinase.FT CONFLICT 2271 2271C -> R (in Ref. 1).SQ SEQUENCE 2554 AA; 287022 MW; 09E238A0F27684F8 CRC64;MTMFWQQNVD HQSDEQDKQA KGAAPTKRLN ISFNVKIAVN VNTKMTTTHI NQQAPGTSSS SSNSQNASPS KIVVRQQSSS FDLRQQLARL GRQLASGQDG HGGISTILII NLLLLILLSI CCDVCRSHNY TVHQSPEPVS KDQMRLLRPK LDSDVVEKVA IWHKHAAAAP PSIVEGIAIS SRPQSTMAHH PDDRDRDRDP SEEQHGVDER MVLERVTRDC VQRCIVEEDL FLDEFGIQCE KADNGEKCYK TRCTKGCAQW YRALKELESC QEACLSLQFY PYDMPCIGAC EMAQRDYWHL QRLAISHLVE RTQPQLERAP RADGQSTPLT IRWAMHFPEH YLASRPFNIQ YQFVDHHGEE LDLEQEDQDA SGETGSSAWF NLADYDCDEY YVCEILEALI PYTQYRFRFE LPFGENRDEV LYSPATPAYQ TPPEGAPISA PVIEHLMGLD DSHLAVHWHP GRFTNGPIEG YRLRLSSSEG•Sequence类–创建一条序列use Bio::Seq;my $seq1 = Bio::Seq->new (-seq=> ’ATGAGTAGTAGTAAAGGTA’,-id => ’my seq’,-desc=> ’this is a new Seq’);–读取序列文件use Bio::SeqIO;my $seqin= Bio::SeqIO->new( -file => ’seq.fasta’,-format => ’fasta’);my $seq3 = $seqin->next_seq();use Bio::SeqIO;my $seqin2 = Bio::SeqIO->newFh( -file => ’seq.fasta’,-format => ’fasta’); my $seq4 = <$seqin2>;–从远程数据库中获取序列举例:•Create a sequence object for the MALK_ECOLI SwissProtentry•Print its Accession number and descriptionuse Bio::DB::SwissProt;my $database = new Bio::DB::SwissProt;my $seq= $database->get_Seq_by_id(’MALK_ECOLI’);print "Seq: ", $seq->accession_number(), " --", $seq->desc(), "\n\n"; Seq: P68187--Maltose/maltodextrin import ATP-binding protein malK(EC3.6.3.19).–根据SwissProt entry获取PDB结构entry$annotation = $seq->annotation();my @structures = ();foreach my $link ( $annotation->get_Annotations(’dblink’) ) { if ($link->database() eq’PDB’) {push (@structures, $link->primary_id());}}print "\nPDB Structures: ", join (" ", @structures), "\n";•Features and Location classes –Features class–Location class•Features Classes–Bio::SeqFeatureI–Bio::SeqFeature::Generic–Bio::SeqFeature::FeaturePair–Bio::SeqFeature::SimilarityPair–Bio::SeqFeature::Similarity–Bio::SeqFeature::Gene::–Bio::SeqFeature::Gene::ExonI–Bio::SeqFeature::Gene::Exon–Bio::SeqFeature::Gene::Transcript–Bio::SeqFeature::Gene::TranscriptI–Bio::SeqFeature::Gene::GeneStructureI –Bio::SeqFeature::Gene::GeneStructure–Adding a feature to a Genbank entryuse Bio::SeqFeature::Generic;use Bio::SeqIO;$in = Bio::SeqIO->newFh(-file => $ARGV[0]);$out = Bio::SeqIO->newFh();$seq= <$in>;$feat = new Bio::SeqFeature::Generic( -start => 10, -end => 100,-strand => -1,-primary => 'repeat',-source => 'repeatmasker',-score => 1000,-tag => {new => 1,author => 'someone',sillytag=> 'this is silly!' } ); $seq->add_SeqFeature($feat);print $out $seq;增加前的特征项:FEATURES Location/Qualifiers source 1..16705/organism="Pteropus dasymallus"/organelle="mitochondrion"/mol_type="genomic DNA"/db_xref="taxon:126282"gene 1..69/gene="tRNA-Phe“……………gene 15321..15390/gene="tRNA-Thr"tRNA15321..15390/gene="tRNA-Thr"gene complement(15390..15455)/gene="tRNA-Pro"tRNA complement(15390..15455)/gene="tRNA-Pro"D-loop 15456..16705/note="D-loop region"增加后的特征项:FEATURES Location/Qualifiers……….D-loop 15456..16705/note="D-loop region"repeat complement(10..100)/new="1"/author="someone"/sillytag="this is silly!"/note="score=1000"•Location Classes–Bio::LocationI–Bio::Location::Simple–Bio::Location::SplitLocationIet Bio::Location::Split–Bio::Location::FuzzyLocationIet Bio::Location::Fuzzy –Coordinate policies, for fuzzy location start and enddetermination:•interface: Bio::Location::CoordinatePolicyI,•default: Bio::Location::WidestCoordPolicy•Bio::Location::NarrowestCoordPolicy,Bio::Location::AvWithinCoordPolicy–specifies a location whose begin is before 30 and whose end is at 90my $fuzzylocation= new Bio::Location::Fuzzy(-start => ’<30’,-end => 90,-loc_type=> ’.’);符号说明:–’..’: EXACT–’^’: BETWEEN (a site between 2 bases), exemple: 123^124a site between bases 123 and 124. 145^177 a site between 2adjacent bases, somewhere between bases 145 and 177.–’.’: WITHIN (a base within a range), example : (102.110) is a site wihtin the 102 .. 110 range, inclusive.–’<’: BEFORE–’>’: AFTER•Sequence analysis tools–Bio::Tools::SeqStats/releases/bioperl-1.0.1/Bio/Tools/SeqStats.html –Bio::Tools::SeqWordshttp:/releases/bioperl-1.0.1/Bio/Tools/SeqWords.html/–Bio::Tools::SeqPattern/releases/bioperl-1.0.1/Bio/Tools/SeqPattern.html –Bio::SeqUtils/releases/bioperl-1.0.1/Bio/Tools/SeqUtils.html –Bio::Tools::RestrictionEnzyme/releases/bioperl-1.0.1/Bio/Tools/RestrictionEnzyme.html–Bio::Tools::Sigcleave/releases/bioperl-1.0.1/Bio/Tools/Sigcleave.html–Bio::Tools::SeqStats$seq_stats= Bio::Tools::SeqStats->new(-seq=>$seqobj); $seq_stats->count_monomers();$seq_stats->count_codons();$weight = $seq_stats->get_mol_wt($seqobj);–Bio::Tools::SeqWords$seq_word= Bio::Tools::SeqWords->new(-seq=> $seqobj); $seq_stats->count_words($word_length);–Bio::Tools::SeqPattern$pat = ’T[GA]AA...TAAT’;$pattern = new Bio::Tools::SeqPattern(-SEQ =>$pat, -TYPE=>’Dna’); $pattern->expand;$pattern->revcom;$pattern->alphabet_ok;–Bio::Tools::RestrictionEnzyme$re1 = new Bio::Tools::RestrictionEnzyme(-NAME =>’EcoRI’);@fragments = $re1->cut_seq($seqobj);$re2 = new Bio::Tools::RestrictionEnzyme( -NAME =>’EcoRV--GAT^ATC’, -MAKE =>’custom’);–Bio::Tools::Sigcleave$sigcleave_object= new Bio::Tools::Sigcleave(’-file’=>’sigtest.aa’,’-threshold’=>’3.5’,’-desc’=>’test sigcleave protein seq’,’-type’=>’AMINO ’); %raw_results= $sigcleave_object->signals;$formatted_output= $sigcleave_object->pretty_print;2.Alignments类–Bio::AlignIO/releases/bioperl-1.0/Bio/AlignIO.html –Bio::SimpleAlign/releases/bioperl-1.0/Bio/SimpleAlign.html –Bio::Tools::CodonTable/releases/bioperl-1.0/Bio/Tools/CodonTable.html–Bio::AlignIOFormat conversions with AlignIO:use Bio::AlignIO;$inputfilename= "testaln.fasta";$in = Bio::AlignIO->new(-file => $inputfilename, '-format' => 'fasta'); $out = Bio::AlignIO->new(-file => ">out.aln.pfam" , '-format' => 'pfam'); while ( my $aln= $in->next_aln() ) {$out->write_aln($aln);}Alignment result:use Bio::AlignIO;my $in = new Bio::AlignIO( -file =>, $ARGV[0], -format => 'clustalw' );my $aln= $in->next_aln();print "same length of all sequences: ", ($aln->is_flush()) ? "yes" : "no", "\n"; print "alignment length: ", $aln->length(), "\n";printf"identity: %.2f %%\n", $aln->percentage_identity();printf"identity of conserved columns: %.2f %%\n",$aln>overall_percentage_identity();infile.ALN文件内容:CLUSTAL W (1.82) multiple sequence alignmentBACR_HALAR ---------------MPEPGSEAIWLWLGTAGMFLGMLYFIARGWGETDSRRQKFYIATI BACR_HALHP -----------------------IWLWLGTAGMFLGMLYFIARGWGETDSRRQKFYIATI BACR_HALVA ---------------MPAPEGEAIWLWLGTAGMFLGMLYFIARGWGETDSRRQKFYIATI BACR_HALHS -------------------------LWLGTAGMFLGMLYFIARGWGETDGRRQKFYIATI BACR_HALSR --------MLQSGMSTYVPGGESIFLWVGTAGMFLGMLYFIARGWSVSDQRRQKFYIATI BACR_HALHA MLELLPTAVEGVSQAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITT BACR_HALHM --------------------------GIGTLLMLIGTFYFIARGWGVTDKKAREYYAITI. :** * :* :**:.:* . :* :::* *BACR_HALAR LITAIAFVNYLAMALGFGLTIVEFAGEEHP--IYWARYSDWLFTTPLLLYDLGLLAGADR BACR_HALHP LITAIAFVNYLAMALGFGLTIVEFAGEEHP--IYWARYSDWLFTTPLLLYDLGLLAGADR BACR_HALVA LITAIAFVNYLAMALGFGLTIVEIAGEQRP--IYWARYSDWLFTTPLLLYDLGLLAGADR BACR_HALHS LITAIAFVNYLAMALGFGLTFIEFGGEQHP--IYWARYTDWLFTTPLLLYDLGLLAGADR BACR_HALSR MIAAIAFVNYLSMALGFGVTTIELGGEERA--IYWARYTDWLFTTPLLLYDLALLAGADR BACR_HALHA LVPAIAFTMYLSMLLGYGLTMVPFGGEQNP--IYWARYADWLFTTPLLLLDLALLVDADQ BACR_HALHM LVPGIASAAYLSMFFGIGLTTVEVAGMAEPLEIYYARYADWLFTTPLLLLDLALLANADR ::..** . **:* :* *:* : ..* .. **:***:********** **.**..**:BACR_HALAR NTITSLVSLDVLMIGTGLVATLSPGSGVLSAGAERLVWWGISTAFLLVLLYFLFSSLSGR BACR_HALHP NTITSLVSLDVLMIGTGLVATLSAGSGVLSAGAERLVWWGISTAFLLVLLYFLFSSLSGR BACR_HALVA NTISSLVSLDVLMIGTGLVATLSAGSGVLSAGAERLVWWGISTAFLLVLLYFLFSSLSGR BACR_HALHS NTIYSLVSLDVLMIGTGVVATLSAGSGVLSAGAERLVWWGISTAFLLVLLYFLFSSLSGR BACR_HALSR NTIYSLVGLDVLMIGTGALATLSAGSGVLPAGAERLVWWGISTGFLLVLLYFLFSNLTDR BACR_HALHA GTILALVGADGIMIGTGLVGALT------KVYSYRFVWWAISTAAMLYILYVLFFGFTSK BACR_HALHM TTIGTLIGVDALMIVTGLIGALS------HTPLARYTWWLFSTIAFLFVLYYLLTVLRSA ** :*:. * :** ** :.:*:..:. . * .** :** :* :** *: : .BACR_HALAR VADLPSDTRSTFKTLRNLVTVVWLVYPVWWLIGTEGIGLVGIGIETAGFMVIDLTAKVGF BACR_HALHP VADLPSDTRSTFKTLRNLVTVVWLVYPVWWLIGTEGIGLVGIGIETAGFMVIDLTA----BACR_HALVA VADLPSDTRSTFKTLRNLVTVVWLVYPVWWLVGTEGIGLVGIGIETAGFMVIDLVAKVGF BACR_HALHS VANLPSDTRSTFKTLRNLVTVVWLVYPVWWLVGSEGLGLVGIGIETAGFMVIDLVA----BACR_HALSR ASELSGDLQSKFSTLRNLVLVLWLVYPVLWLVGTEGLGLVGLPIETAAFMVLDLTAKIGF BACR_HALHA AESMRPEVASTFKVLRNVTVVLWSAYPVVWLIGSEGAGIVPLNIETLLFMVLDVSAKVGF BACR_HALHM AAELSEDVQTTFNTLTALVAVLWTAYPILWIIGTEGAGVVGLGVETLAFMVLDVTA----. .: : :.*..* :. *:* .**: *::*:** *:* : :** ***:*: *. . BACR_HALAR GIILLRS---HGVLDGAAETTGTGATPADDBACR_HALHP ------------------------------BACR_HALVA GIILLRS---HGVLDGAAETTGAGATATADBACR_HALHS ------------------------------BACR_HALSR GIILLQS---HAVLD-EGQTASEGAAVAD-BACR_HALHA GLILLRSRAIFGEAEAPEPSAGDGAAATSDBACR_HALHM ------------------------------. : . .. ::. .:: : .屏幕输出:same length of all sequences: yesalignment length: 270identity: 69.99 %identity of conserved columns: 28.52 %3.Analysis类–Blast/bioperl-live/–Genscan/releases/bioperl-1.0/Bio/Tools/Genscan.html•Blast–Bio::Tools::Run::StandAloneBlastRunning local Blastuse Bio::SeqIO;use Bio::Tools::Run::StandAloneBlast;my $Seq_in= Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');my $query = $Seq_in->next_seq();my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp','database' => 'swissprot',);my $blast_report= $factory->blastall($query);my $result = $blast_report->next_result;while( my $hit = $result->next_hit()) {print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";}Running Blast on a Swissprot entryuse Bio::SeqIO;use Bio::Tools::Run::StandAloneBlast;use Bio::DB::SwissProt;my $database = new Bio::DB::SwissProt;my $query = $database->get_Seq_by_id(’TAUD_ECOLI’);my $factory = Bio::Tools::Run::StandAloneBlast->new(’program’=> ’blastp’,’database’=> ’swissprot’,); my $blast_report= $factory->blastall($query);my $result = $blast_report->next_result;while( my $hit = $result->next_hit()) {print "\thit name: ", $hit->name()," significance: ", $hit->significance(), "\n";}Parsing from a Blast run resultuse Bio::SeqIO;use Bio::Tools::Run::StandAloneBlast;my $Seq_in= Bio::SeqIO->new (-file => $ARGV[0], -format => 'fasta');my $query = $Seq_in->next_seq();my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp','database' => 'swissprot',); my $blast_report= $factory->blastall($query);my $result = $blast_report->next_result;while( my $hit = $result->next_hit()) {print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";while( my $hsp= $hit->next_hsp()) {print "E: ", $hsp->evalue(), "frac_identical: ", $hsp->frac_identical(), "\n";}}Parsing from a Blast fileuse Bio::SearchIO;my $blast_report= new Bio::SearchIO('-format' => 'blast', '-file' => $ARGV[0]); my $result = $blast_report->next_result;while( my $hit = $result->next_hit()) {print "\thit name: ", $hit->name(), "\n";while( my $hsp= $hit->next_hsp()) {print "E: ", $hsp->evalue(), "frac_identical: ", $hsp->frac_identical(), "\n";}}Running bl2sequse Bio::SeqIO;use Bio::Tools::Run::StandAloneBlast;my $query1_in = Bio::SeqIO->newFh( -file => $ARGV[0], -format => 'fasta' ); my $query1 = <$query1_in>;my $query2_in = Bio::SeqIO->newFh( -file => $ARGV[1], -format => 'fasta' ); my $query2 = <$query2_in>;$factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastp'); $report = $factory->bl2seq($query1, $query2);while(my$hsp= $report->next_feature) {print "homology seq:\n", $hsp->homologySeq, "\n";print "sbjctSeq:\n", $hsp->sbjctSeq, "\n";}•Genscan–Bio::Tools::GenscanGenscan parsinguse Bio::Tools::Genscan;my $genscan_file= $ARGV[0];$genscan= Bio::Tools::Genscan->new(-file => $genscan_file);while(my$gene = $genscan->next_prediction()) {my $prot= $gene->predicted_protein;print "protein :\n", $prot->seq, "\n\n";}Genscan.out文件内容:GENSCAN 1.0Date run: 20-Jan-98Time: 11:02:22Sequence HUMRASH : 6453 bp: 68.19% C+G : Isochore 4 (57.00 -100.00 C+G%) Parameter matrix: HumanIso.smatPredicted genes/exons:Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..------------------------------------------------------1.01 Init + 1664 1774 111 1 0 94 83 212 0.997 21.331.02 Intr+ 2042 2220 179 1 2 104 66 408 0.997 40.121.03 Intr+ 2374 2533 160 1 1 89 94 302 0.999 32.081.04 Term + 3231 3350 120 2 0 115 48 202 0.961 18.31Predicted peptide sequence(s):>HUMRASH|GENSCAN_predicted_peptide_1|189_aa MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG QEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDL AARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS屏幕显示内容:protein: MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLC VFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLV REIRQHKLRKLNPPDESGPGCMSCKCVLSGenscan parsing, with sub-sequencesuse Bio::Tools::Genscan;my $genscan_file= $ARGV[0];$genscan= Bio::Tools::Genscan->new(-file => $genscan_file); while(my$gene = $genscan->next_prediction()) {my $prot= $gene->predicted_protein;print "protein :\n", $prot->seq, "\n\n";# display genscan predicted cds(if -cds genscan option) my $predicted_cds= $gene->predicted_cds;print "predicted CDS: \n", $predicted_cds->seq, "\n\n";foreach my $exon($gene->exons()) {my $loc = $exon->location;print "exon-primary_tag: ",$exon->primary_tag, " coding? ",$exon->is_coding(), " start-end: ", $loc->start, "-", $loc->end, "\n";}foreach my $intron($gene->introns()) {my $loc = $intron->location;print "intron primary_tag: ",$intron->primary_tag, " start-end: ",$loc->start, "-", $loc->end, "\n";}print "---------------------------------\n";}输入文件:GENSCAN 1.0Date run: 30-Oct-101Time: 17:32:04Sequence HUMRASH : 6453 bp: 68.19% C+G : Isochore 4 (57 -100 C+G%)Parameter matrix: HumanIso.smatPredicted genes/exons:Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..------------------------------------------------------1.01 Init + 1664 1774 111 1 0 94 83 212 0.997 21.33 1.02 Intr+ 2042 2220 179 1 2 104 66 408 0.997 40.12 1.03 Intr+ 2374 2533 160 1 1 89 94 302 0.999 32.08 1.04 Term + 3231 3350 120 2 0 115 48 202 0.961 18.31Predicted peptide sequence(s):Predicted coding sequence(s):>HUMRASH|GENSCAN_predicted_peptide_1|189_aa MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG QEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDL AARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPG CMSCKCVLS>HUMRASH|GENSCAN_predicted_CDS_1|570_bp atgacggaatataagctggtggtggtgggcgccggcggtgtgggcaagagtgcgctgacc atccagctgatccagaaccattttgtggacgaatacgaccccactatagaggattcctac cggaagcaggtggtcattgatggggagacgtgcctgttggacatcctggataccgccggc caggaggagtacagcgccatgcgggaccagtacatgcgcaccggggagggcttcctgtgt gtgtttgccatcaacaacaccaagtcttttgaggacatccaccagtacagggagcagatc aaacgggtgaaggactcggatgacgtgcccatggtgctggtggggaacaagtgtgacctg gctgcacgcactgtggaatctcggcaggctcaggacctcgcccgaagctacggcatcccc tacatcgagacctcggccaagacccggcagggagtggaggatgccttctacacgttggtg cgtgagatccggcagcacaagctgcggaagctgaaccctcctgatgagagtggccccggc tgcatgagctgcaagtgtgtgctctcctga输出文件:protein: MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLC VFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLV REIRQHKLRKLNPPDESGPGCMSCKCVLSpredicted CDS: atgacggaatataagctggtggtggtgggcgccggcggtgtgggcaagagtgcgctgaccatccagctgatccagaacca ttttgtggacgaatacgaccccactatagaggattcctaccggaagcaggtggtcattgatggggagacgtgcctgttgg acatcctggataccgccggccaggaggagtacagcgccatgcgggaccagtacatgcgcaccggggagggcttcctgtgt gtgtttgccatcaacaacaccaagtcttttgaggacatccaccagtacagggagcagatcaaacgggtgaaggactcgga tgacgtgcccatggtgctggtggggaacaagtgtgacctggctgcacgcactgtggaatctcggcaggctcaggacctcg cccgaagctacggcatcccctacatcgagacctcggccaagacccggcagggagtggaggatgccttctacacgttggtg cgtgagatccggcagcacaagctgcggaagctgaaccctcctgatgagagtggccccggctgcatgagctgcaagtgtgt gctctcctgaexon-primary_tag: Initial coding? 1 start-end: 1664-1774exon-primary_tag: Internal coding? 1 start-end: 2042-2220exon-primary_tag: Internal coding? 1 start-end: 2374-2533exon-primary_tag: Terminal coding? 1 start-end: 3231-3350intron primary_tag: intron start-end: 1775-2041intron primary_tag: intron start-end: 2221-2373intron primary_tag: intron start-end: 2534-3230---------------------------------4.Databases类–Bio::Index::Swissprotuse Bio::Index::Swissprot;use Bio::SeqIO;my $out = Bio::SeqIO->newFh(-fh=>\*STDOUT, -format=>'fasta');my $Index_File_Name= shift;my $inx= Bio::Index::Swissprot->new(-filename=>$Index_File_Name); foreach my $id (@ARGV) {my $seq= $inx->fetch($id);print $out $seq;}7.3 实例1.获取序列统计信息–分子量、残基、密码子使用频度#!/usr/bin/perluse warnings;use Bio::Seq;use Bio::Tools::SeqStats ;$seqobj = Bio::Seq->new(-seq =>"aaaatgggggggggggccccgtt",-display_id => "#12345",-desc => "example 1" );$seq_stats = Bio::Tools::SeqStats->new($seqobj);$weight = $seq_stats->get_mol_wt();$monomer_ref = $seq_stats->count_monomers();$codon_ref = $seq_stats->count_codons();print "$weight\n$monomer_ref\n$codon_ref\n";ARRAY(0x1bf9024)HASH(0x1be0e98)HASH(0x1bf94d4)2.识别限制性酶切位点#!/usr/bin/perluse warnings;use Bio::Seq;use Bio::Tools::RestrictionEnzyme;$seqobj= Bio::Seq->new(-seq=>"aaaatgggggggggggccccgtt",-display_id=> "#12345",-desc=> "example 1" );$re = new Bio::Tools::RestrictionEnzyme(-NAME =>'EcoRI');@sixcutters= $re->available_list();$re1 = new Bio::Tools::RestrictionEnzyme(-NAME =>'EcoRI');@fragments = $re1->cut_seq($seqobj);print "@sixcutters\n@fragments";3.识别氨基酸剪切位点#!/usr/bin/perl use warnings;use Bio::Seq;use Bio::Tools::Sigcleave;$seqobj = Bio::Seq->new(-seq=>"AALLHHHHHHGGGGPPRTTTTTVVVVVVVVVVVVVVV");$sigcleave_object = new Bio::Tools::Sigcleave(-seq=>$seqobj, -threshod=>3.5,-desc=>'test sigcleave protein seq',-type=>'AMINO');$formatted_output = $sigcleave_object->pretty_print;print "$formatted_output\n";SIGCLEAVE of NONAME from: 1 to 37Report scores over 3.5Maximum score 3.6 at residue 3Sequence: AA--------------LLHHHHHHGGGGPPRTTTTTVVVVVVVVVVVVVVV | (signal) | (mature peptide)1 3。
BioPerl使用手册第一弹—Bio::SeqIO篇注:本手册假设你已经拥有了一定的Perl编程经验,对个中术语(terms)不再进行赘述。
1.让我们开始吧为了让第一次使用本手册的同志们在刚开始就有成功的喜悦,这里给出一个例子,大家准备好自己手中的fasta 文件吧!请在文本编辑器中写入如下程序,并在终端运行:#! /usr/bin/perl –wuse strict;use Bio::SeqIO;my $file = “*********”; # Please use your path to replace the startsmy $seqio_object = Bio::SeqIO -> new(-file = $file);my $seq_obj = $seqio_object -> next_seq;printf “$seq_obj\n”;如果你成功键入了以上程序并且没有报错发生,那么屏幕上面就会正常显示出你的fasta序列。
那么恭喜你,你已经成功调用了BioPerl的模块,并且完成了一个面对对象的程序。
下面我们就来看一下我们第一个认识的BioPrel 的模块Bio::SeqIO。
2.关于Bio::SeqIO的那些事儿在介绍Bio::SeqIO之前,先来说一下为什么会产生BioPerl这个东西。
在生物信息学起步之初Prel语言强悍的字符串处理能力以及执行效率,毫无疑问的被各位从计算机和数学行业转行过来的“生物学家”选为工具语言(在生物信息数据处理方面放眼望去毫无疑问是Perl语言的天下,近来对大规模数据的处理方面R语言亦有崛起之势)。
但是,对于这海量的数据,同样丰富多彩的数据格式以及花样繁多的数据分析;每次处理数据都要重新自己编写正则表达式未免效率过于低下。
于是,在Perl一次重大的更新之后(引入面对对象编程,后面都将使用OOP代替面对对象编程),几个不太勤快的学生物的程序员看到了通用编程的可能,于是就有了我们现在广泛应用的BioPerl。
BioPerlT utorial - a tutorial for bioperlAUTHORWritten by Peter Schattner <schattner@>DESCRIPTIONThis tutorial includes "snippets" of code and text from variousBioperl documents including module documentation, example scriptsand "t" test scripts. You may distribute this tutorial under thesame terms as perl itself.This document is written in Perl POD (plain old documentation)format. You can run this file through your favorite pod translator(pod2html, pod2man, pod2text, etc.) if you would like a moreconvenient formatting.zhihuazhang@ May 17, 2003I. Introduction (4)I. Introduction (4)I.4 Installation (4)I.5 Additional comments for non-unix users (4)I.6 Places to look for additional documentation (5)II. Brief introduction to bioperl's objects (5)II.1 Sequence objects (5)II.2 Location objects (6)II.4 Interface objects and implementation objects (7)III. Using bioperl (7)III.1 Accessing sequence data from local and remote databases (7)III.1.1 Accessing remote databases (Bio::DB::GenBank, etc) (8)III.1.2 Indexing and accessing local databases (Bio::Index::*, bpindex.pl, (8)III.2 Transforming formats of database/ file records (9)III.2.1 Transforming sequence files (SeqIO) (9)III.2.2 Transforming alignment files (AlignIO) (10)III.3 Manipulating sequences (11)III.3.1 Manipulating sequence data with Seq methods (11)III.3.2 Obtaining basic sequence statistics- molecular weights, residue & (13)III.3.3 Identifying restriction enzyme sites (RestrictionEnzyme) (13)III.3.4 Identifying amino acid cleavage sites (Sigcleave) (13)III.3.5 Miscellaneous sequence utilities: OddCodes, SeqPattern (14)III.3.6 Sequence manipulation without creating Bioperl ``objects'' (15)III.4 Searching for ``similar'' sequences (15)III.4.1 Running BLAST (using RemoteBlast.pm) (15)III.4.2 Parsing BLAST and FASTA reports with Search and SearchIO (16)III.4.3 Parsing BLAST reports with BPlite, BPpsilite, and BPbl2seq (17)III.4.4 Parsing HMM reports (HMMER::Results, SearchIO) (18)III.4.5 Running BLAST locally (StandAloneBlast) (18)III.6 Searching for genes and other structures on genomic DNA (Genscan, (20)III.7 Developing machine readable sequence annotations (21)III.7.1 Representing sequence annotations (21)III.7.2 Representing and large and/or changing sequences (22)III.7.3 Representing related sequences - mutations, polymorphisms etc (23)III.7.4 Incorporating quality data in sequence annotation (SeqWithQuality) (24)III.7.5 Sequence XML representations - generation and parsing (25)III.7.6 Representing Sequences using GFF (Bio:DB:GFF ) (25)III.8 Manipulating clusters of sequences (Cluster, ClusterIO) (26)III.9 Representing non-sequence data in Bioperl: structures, trees and maps (26)III.9.3 Map objects for manipulating genetic maps (Map::MapI, MapIO) (27)III.9.5 Graphics objects for representing sequence objects as images (27)III.10 Bioperl alphabets (28)III.10.1 Extended DNA / RNA alphabet (28)III.10.2 Amino Acid alphabet (28)IV. Auxilliary Bioperl Libraries (Bioperl-run, Bioperl-db, etc.) (29)IV.1 Using the Bioperl Auxilliary Libraries (29)IV.2 Running programs (Bioperl-run, Bioperl-ext) (30)IV.2.2 Aligning 2 sequences with Blast using bl2seq and AlignIO (31)IV.2.3 Aligning multiple sequences (Clustalw.pm, TCoffee.pm) (31)IV.3 bioperl-db (32)IV.4 Other Bioperl auxilliary libraries (32)V Appendix (33)V.1 Appendi x: Finding out which methods are used by which Bioperl Objects (33)V.2 Appendix: Tutorial demo scripts (33)I. IntroductionI.4 InstallationThe actual installation of the various system components is accomplished in the standard manner:Locate the package on the networkDownloadDecompress (with gunzip or a similiar utility)Remove the file archive (eg with tar -xvf)Create a ``makefile'' (with ``perl Makefile.PL'' for perl modules or a supplied ``install'' or ``configure'' program for non-perl program Run ``make'', ``make test'' and ``make install'' This procedure must be repeated for every CPAN module, bioperl-extension and external module to be installed. A helper module CPAN.pm is available from CPAN which automates the process for installing the perl modules. The CPAN module can also be used to install all of the modules listed above in a single step as a ``bundle'' of modules, Bundle::BioPerl, eg$>perl -MCPAN -e shellcpan>install Bundle::BioPerl<installation details....>cpan>install B/BI/BIRNEY/bioperl-1.0.1.tar.gz<installation details....>cpan>quitBe advised that version numbers change regularly, so the number used above may not apply. A disadvantage of the ``bundle'' approach is that if there's a problem installing any individual module it may be a bit more difficult to isolate. See the package's INSTALL file for more details. For the external programs (clustal, Tcoffee, ncbi-blast), there is an extra step:Set the relevant environmental variable (CLUSTALDIR, TCOFFEEDIR or BLASTDIR) to the directory holding the executable in your startup file - eg in .bashrc. (For running local blasts, it is also necessary that the name of local-blast database directory is known to bioperl. This will typically happen automatically, but in case of difficulty, refer to the documentation in the Bio::Tools::Run::StandAloneBlast manpage) The only likely complication (at least on unix systems) that may occur is if you are unable to obtain system level writing privileges. For instructions on modifying the installation in this case and for more details on the overall installation procedure, see the INSTALL file in the bioperl distribution as well as the README files in the external programs you want to use (eg bioperl-ext, clustalw, TCoffee, NCBI-blast).I.5 Additional comments for non-unix usersBioperl has mainly been developed and tested under various unix environments (including Linux and MacOSX). In addition, this tutorial has been written largely from a Unix perspective. Mac users may find Steve Cannon's installation notes and suggestions for Bioperl on OS X at/~cann0010/Bioperl_OSX_install.html helpful. Also Todd Richmond has written of his experiences with BioPerl on MacOS 9 at /Core/mac-bioperl.html The bioperl core has also been tested and should work under most versions of Microsoft Windows. For many windows users the perl and bioperl distributions from Active State, at / has been quite helpful. Other windows users have had success running bioperl under Cygwin (). See the package'sINSTALL.WIN file for more details.Many bioperl features require the use of CPAN modules, compiled extensions or external programs. These features probably will not work under some or all of these other operating systems. If a script attempts to access these features from a non-unix OS, bioperl is designed to simply report that the desired capability is not available. However, since the testing of bioperl in these environments has been limited, the script may well crash in a less ``graceful'' manner.I.6 Places to look for additional documentationThis tutorial does not intend to be a comprehensive description of all the objects and methods available in bioperl. For that the reader is directed to the documentation included with each of the modules. A very usefulinterface for finding one's way within all the module documentation can be found at /bioperl-live/ This interface lists all bioperl modules and descriptions of all of their methods. In addition, beginner questions can often be answered by looking at the FAQ, INSTALL and README files in the top-level directory of the bioperl distribution. One potential problem in locating the correct documentation is that multiple methods in different modules may all share the same name. Moreover, because of perl's complex method of ``inheritance'', it is not often clear which of the identically named methods is being called by a given object. One way to resolve this question is by using the software described in Appendix V.1.For those who prefer more visual descriptions, /Core/Latest/modules.html also offers links to PDF files which contain schematics that describe how many of the bioperl objects related to one another. In addition, a bioperl online course is available on the web at http://www.pasteur.fr/recherche/unites/sis/formation/bioperl/ The user is also referred to numerous bioperl scripts in the scripts/ and examples/ directories (see bioscripts.pod for a description of these scripts).II. Brief introduction to bioperl's objectsThe purpose of this tutorial is to get you using bioperl to solve real-life bioinformatics problems as quickly as possible. The aim is not to explain the structure of bioperl objects or perl object-oriented programming in general. Indeed, the relationships among the bioperl objects is not simple; however, understanding them in detail is fortunately not necessary for successfully using the package. Nevertheless, a little familiarity with the bioperl object ``bestiary'' can be very helpful even to the casual user of bioperl. For example there are (at least) eight different ``sequence objects'' - Seq, PrimarySeq, LocatableSeq, RelSegment, LiveSeq, LargeSeq, SeqI, and SeqWithQuality. Understanding the relationships among these objects - and why there are so many of them - will help you select the appropriate one to use in your script.II.1 Sequence objectsSeq is the central sequence object in bioperl. When in doubt this is probably the object that you want to use to describe a DNA, RNA or protein sequence in bioperl. Most common sequence manipulations can be performed with Seq. These capabilities are described in sections III.3.1 and III.7.1, or in the Bio::Seq manpage. Seq objects can be created explicitly (see section III.2.1 for an example). However usually Seq objects will be created for you automatically when you read i n a file containing sequence data using the SeqIO object. This procedure is described in section III.2.1. In addition to storing its identification labels and the sequence itself, a Seq object canstore multiple annotations and associated ``sequence features''. This capability can be very useful - especially in development of automated genome annotation systems, see section III.7.1. RichSeq objects store additional annotations beyond those used by standard Seq objects. If you are using sources with very ric h sequence annotation, you may want to consider using these objects which are described in section III.7.1. SeqWithQuality objects are used to manipulate sequences with quality data, like those produced by phred. These objects are described in section III.7.4, the Bio::Seq::RichSeqI manpage, and in the Bio::Seq::SeqWithQuality manpage. On the other hand, if you need a script capable of simultaneously handling many (hundreds or thousands) sequences at a time, then the overhead of adding annotations to each sequence can be significant. For such applications, you will want to use the PrimarySeq object. PrimarySeq is basically a ``stripped down'' version of Seq. It contains just the sequence data itself and a few identifying labels (id, accession number, molecule type = dna, rna, or protein). For applications with hundreds or thousands or sequences, using PrimarySeq objects can significantly speed up program execution and decrease the amount of RAM the program requires. See the Bio::PrimarySeq manpage for more details. What is (for historical reasons) called a LocatableSeq object might be more appropriately called an ``AlignedSeq'' object. It is a Seq object which is part of a multiple sequence alignment. It has ``start'' and ``end'' positions indicating from where in a larger sequence it may have been extracted. It also may have ``gap'' symbols corresponding to the alignment to which it belongs. It is used by the alignment object SimpleAlign and other modules that use SimpleAlign objects (eg AlignIO.pm, pSW.pm). I n general you don't have to worry about creating LocatableSeq objects because they will be made for you automatically when you create an alignment (using pSW, Clustalw, Tcoffee or bl2seq) or when you input an alignment data file using AlignIO. However if you need to input a sequence alignment by hand (eg to build a SimpleAlign object), you will need to input the sequences as LocatableSeqs. Other sources of information include the Bio::LocatableSeq manpage, the Bio::SimpleAlign manpage, the Bio::AlignIO manpage, and the Bio::Tools::pSW manpage. The RelSegment object is also a type of bioperl Seq object. RelSegment objects are useful when you want to be able to manipulate the origin of the genomic coordinate system. This situation may occur when looking at a sub-sequence (e.g. an exon) which is located on a longer underlying underlying sequence such as a chromosome or a contig. Such manipulations may be important, for example when designing a graphical genome ``browaer''. If your code may need such a capability, look at the documentation the Bio::DB::GFF::RelSegment manpage which describes this feature in detail.A LargeSeq object is a special type of Seq object used for handling very long ( eg > 100 MB) sequences. If you need to manipulate such long sequences see section III.7.2 which describes LargeSeq objects, or the Bio::Seq::LargeSeq manpage. A LiveSeq object is another specialized object for storing sequence data. LiveSeq addresses the problem of features whose location on a sequence changes over time. This can happen, for example, when sequence feature objects are used to store gene locations on newly sequenced genomes -locations which can change as higher quality sequencing data becomes available. Although a LiveSeq object is not implemented in the same way as a Seq object, LiveSeq does implement the SeqI interface (see below). Consequently, most methods available for Seq objects will work fine with LiveSeq objects. Section III.7.2 and the Bio::LiveSeq manpage contain further discussion of LiveSeq objects. SeqI objects are Seq ``interface objects'' (see section II.4 and the Bio::SeqI manpage). They are used to ensure bioperl's compatibility with other software packages. SeqI and other interface objects are not likely to be relevant to the casual bioperl user. *** Having described these other types of sequence objects, the ``bottom line'' still is that if you store your sequence data in Seq objects (which is where they'll be if you read them in with SeqIO), you will usually do just fine. ***II.2 Location objectsA Location object is designed to be associated with a Sequence Feature object to indicate where on a largerstructure (eg a chromosome or contig) the feature can be found. The reason why this simple concept has evolved into a collection of rather complicated objects is that: 1) Some objects have multiple locations or sub-locations (e.g. a gene's exons may have multiple start and stop locations) 2) In unfinished genomes, the precise locations of features is not known with certainty. Bioperl's various Location objects address these complications. In addition there are ``CoordinatePolicy'' objects that allow the user to specify how to measure the ``length'' of a feature if its precise start and end coordinates are not know. In most cases, you will not need to worry about these complications if you are using bioperl to handle simple features with well-defined start and stop locations. However, if you are using bioperl to annotate partially or unfinished genomes or to read annotations of such genomes with bioperl, understanding the various Location objects will be important. See the documentation of the various modules in the Bio::Locations directory or the Bio::Location::CoordinatePolicyI manpage for more information.II.4 Interface objects and implementation objectsOne goal of the design of Bioperl is to separate interface and implementation objects. An interface is solely the definition of what methods one can call on an object, without any knowledge of how it is implemented. An implementation is an actual, working implementation of an object. In languages like Java, interface definition is part of the language. In Perl, you have to roll your own. In bioperl, the interface objects usually have names like Bio::MyObjectI, with the trailing I indicating it is an interface object. The interface objects mainly provide documentation on what the interface is, and how to use it, without any implementations (though there are some exceptions). Although interface objects are not of much direct utility to the casual bioperl user, being aware of their existence is useful since they are the basis to understanding how bioperl programs can communicate with other bioinformatics projects and computer languages such as Ensembl and biopython and biojava. For more discussion of design and development issues please see the biodesign.pod file.III. Using bioperlBioperl provides software modules for many of the typical tasks of bioinformatics programming. These include: Accessing sequence data from local and remote databasesTransforming formats of database/ file recordsManipulating individual sequencesSearching for ``similar'' sequencesCreating and manipulating sequence alignmentsSearching for genes and other structures on genomic DNADeveloping machine readable sequence annotationsThe following sections describe how bioperl can help perform all of thesetasks.III.1 Accessing sequence data from local and remote databasesMuch of bioperl is focused on sequence manipulation. However, before bioperl can manipulate sequences, it needs to have access to sequence data. Now one can directly enter data sequence data into a bioperl Seq object, eg:$seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact','-desc'=>'Sample Bio::Seq object','-display_id' => 'something', '-accession_number' => 'accnum','-alphabet' => 'dna' );However, in most cases, it is preferable to access sequence data from some online data file or database (Note that in common with conventional bioinformatics usage we will sometimes call a ``database'' what might be more appropriately referred to as an ``indexed flat file''.) Bioperl supports accessing remote databases as well as developing indices for setting up local databases.III.1.1 Accessing remote databases (Bio::DB::GenBank, etc)Accessing sequence data from the principal molecular biology databases is straightforward in bioperl. Data can be accessed by means of the sequence's accession number or id. Batch mode access is also supported to facilitate the efficient retrieval of multiple sequences. For retrieving data from genbank, for example, the code could be as follows:$gb = new Bio::DB::GenBank();# this returns a Seq object :$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');# this returns a Seq object :$seq2 = $gb->get_Seq_by_acc('AF303112'))# this returns a SeqIO object :$seqio = $gb->get_Stream_by_id("J00522","AF303112","2981014");Bioperl currently supports sequence data retrieval from the genbank, genpept, RefSeq, swissprot, and EMBL databases. See the Bio::DB::GenBank manpage, the Bio::DB::GenPept manpage, the Bio::DB::SwissProt manpage, the Bio::DB::RefSeq manpage and the Bio::DB::EMBL manpage for more information. A user can also specify a different database mirror for a database - this is especially relevent for the SwissProt resource where there a re many ExPaSy mirrors. There are also configuration options for specifying local proxy servers for those behind firewalls. The retrieval of NCBI RefSeqs sequences is supported through a special module called Bio::DB::RefSeq which actually queries an EBI server. Please see the Bio::DB::RefSeq manpage before using it as there are some caveats with RefSeq retrieval. RefSeq ids in Genbank begin with ``NT_'', ``NC_'',``NG_'', ``NM_'', ``NP_'', ``XM_'' ``XR_'', or ``XP_'' (for more information see /LocusLink/refseq.html). Bio::DB::GenBank can be used to retrieve entries corresponding to these ids but bear in mind that these are not Genbank entries, strictly speaking. See the Bio::DB::GenBank manpage for special details on retrieving e ntries beginning with ``NT_'', these are specially formatted ``CONTIG'' entries. Bioperl also supports retrieval from a remote Ace database. This capability requires the presence of the external AcePerl module. You need to download and install the aceperl module from /AcePerl/ An additional module is available for accessing remote databases, BioFetch, which queries the dbfetch script at EBI. The available databases are EMBL, GenBank, or SWALL, and the entries can be retrieved in different formats as objects or streams (SeqIO objects), or as ``tempfiles''. See the Bio::DB::BioFetch manpage for the details.III.1.2 Indexing and accessing local databases (Bio::Index::*, bpindex.pl,bpfetch.pl, Bio::DB::*)Alternately, bioperl permits indexing local sequence data files by means of the Bio::Index or Bio::DB::Fasta objects. The following sequence data formats are supported by Bio::Index: genbank, swissprot, pfam, embl and fasta. Once the set o f sequences have been indexed using Bio::Index, individual sequences can be accessed using syntax very similar to that described above for accessing remote databases. For example, if one wants to set up an indexed flat-file database of fasta files, and later wants then to retrieve one file, one could write scripts like:# script 1: create the indexuse Bio::Index::Fasta; # using fasta file formatuse strict; # some users have reported that this is necessarymy $Index_File_Name = shift;my $inx = Bio::Index::Fasta->new(-filename => $Index_File_Name,-write_flag => 1);$inx->make_index(@ARGV);# script 2: retrieve some filesuse Bio::Index::Fasta;use strict; # some users have reported that this is necessarymy $Index_File_Name = shift;my $inx = Bio::Index::Fasta->new($Index_File_Name);foreach my $id (@ARGV) {my $seq = $inx->fetch($id); # Returns Bio::Seq object# do something with the sequence}To facilitate the creation and use of more complex or flexible indexing systems, the bioperl distribution includes two sample scripts in the scripts/ directory, bpindex.pl and bpfetch.pl. These scripts can be used as templates to develop customized local data-file indexing systems. Bioperl also supplies Bio::DB::Fasta as a means to index and query Fasta format files. It's similar in spirit to Bio::Index::Fasta but offers more methods, eguse Bio::DB::Fasta;use strict;my $db = Bio::DB::Fasta->new($file); # one file or many filesmy $seqstring = $db->seq($id); # get a sequence as stringmy $seqobj = $db->get_Seq_by_id($id); # get a PrimarySeq objmy $desc = $db->header($id); # get the header, or description, lineThis module also offers the user the ability to designate a specific string within the fasta header as the desired id, such as the gi number within the string ``gi|4556644|gb|X45555'' (use the -makeid option for this capability). See the Bio::DB::Fasta manpage for more information on this fully-featured module. The core bioperl installation does not support accessing sequences stored in relational databases. However, such capabilility is available with the auxilliary bioperl-db library. See section IV.3 for more information.III.2 Transforming formats of database/ file recordsIII.2.1 Transforming sequence files (SeqIO)A common - and tedious - bioinformatics task is that of converting sequence data among the many widely used data formats. Bioperl's SeqIO object, however, makes this chore a breeze. SeqIO can read a stream of sequences - located in a single or in multiple files - in a number of formats: Fasta, EMBL, GenBank, Swissprot, PIR, GCG, SCF, phd/phred, Ace, fastq, or raw (plain sequence). Once the sequence data has been read in with SeqIO, it is available to bioperl in the form of Seq objects. Moreover, the Seq objects can then be written to another file (againusing SeqIO) in any of the supported data formats making data converters simple to implement, for example: use Bio::SeqIO;$in = Bio::SeqIO->new('-file' => "inputfilename",'-format' => 'Fasta');$out = Bio::SeqIO->new('-file' => ">outputfilename",'-format' => 'EMBL');while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }In addition, perl ``tied filehandle'' syntax is available to SeqIO, allowing you to use the standard <> and print operations to read and write sequence objects, eg:$in = Bio::SeqIO->newFh('-file' => "inputfilename" ,'-format' => 'Fasta');$out = Bio::SeqIO->newFh('-format' => 'EMBL');print $out $_ while <$in>;If the ``-format'' argument isn't used then Bioperl will guess the format based on the file's suffix in a case-insensitive manner. Here are the current interpretations:Format Suffixesfasta fasta|fast|seq|fa|fsa|nt|aagenbank gb|gbank|genbankscf scfpir pirembl embl|ebl|emb|datraw txtgcg gcgace acebsml bsm|bsmlswiss swiss|spphd phd|phredfastq fastqFor more information see the Bio::SeqIO manpage.III.2.2 Transforming alignment files (AlignIO)Data files storing multiple sequence alignments also appear in varied formats. AlignIO is the bioperl object for data conversion of alignment files. AlignIO is patterned on the SeqIO object and shares most of SeqIO's features. AlignIO currently supports input in the following formats:fastamase (Seaview)stockholmprodomselex (HMMER))bl2seqclustalw (.aln)msf (GCG)water*phylip (interleaved)stockholmnexusmegamemepsi (PSI-BLAST)*used by EMBOSS, see IV.2.1AlignIO supports output in these formats: fasta, mase, selex, clustalw, msf/gcg, and phylip (interleaved). One significant difference between AlignIO and SeqIO is that AlignIO handles IO for only a single alignment at a time but SeqIO.pm handles IO for multiple sequences in a single stream. Syntax for AlignIO is almost identical to that of SeqIO:use Bio::AlignIO;$in = Bio::AlignIO->new('-file' => "inputfilename" ,'-format' => 'fasta');$out = Bio::AlignIO->new('-file' => ">outputfilename",'-format' => 'pfam');while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }The only difference is that here, the returned object reference, $aln, is to a SimpleAlign object rather than a Seq object. AlignIO also supports the tied filehandle syntax described above for SeqIO. See the Bio::AlignIO manpage and section III.5 for more information.III.3 Manipulating sequencesBioperl contains many modules with functions for sequence analysis. And if you cannot find the function you want in bioperl you may be able to find it in EMBOSS or PISE , which are accessible through the bioperl-run auxilliary library (see IV.2.1).III.3.1 Manipulating sequence data with Seq methodsOK, so we know how to retrieve sequences and access them as Seq objects. Let's see how we can use the Seq objects to manipulate our sequence data and retrieve information. Seq provides multiple methods for performing many common (and some not-so-common) tasks of sequence manipulation and data retrieval. Here are some of the most useful: The following methods return strings$seqobj->display_id(); # the human read-able id of the sequence$seqobj->seq(); # string of sequence$seqobj->subseq(5,10); # part of the sequence as a string$seqobj->accession_number(); # when there, the accession number$seqobj->alphabet(); # one of 'dna','rna','protein'$seqobj->primary_id(); # a unique id for this sequence irregardless# of its display_id or accession number$seqobj->desc(); # a description of the sequenceIt is worth mentioning that some of these values correspond to specific fields of given formats. For example, the display_id method returns the LOCUS name of a Genbank entry, the (\S+) following the > character in a Fasta file, the ID from a SwissProt file, and so on. The desc() method will return the DEFINITION line of a Genbank file, the line following the display_id in a Fasta file, and the DE field in a SwissProt file. The following methods return an array of Bio::SeqFeature objects$seqobj->get_SeqFeatures; # The 'top level' sequence features$seqobj->get_all_SeqFeatures; # All sequence features, including sub# seq featuresFor a comment annotation, you can use:use Bio::Annotation::Comment;。
Hans Journal of Computational Biology 计算生物学, 2014, 4, 13-19Published Online June 2014 in Hans. /journal/hjcb/10.12677/hjcb.2014.42002Based on BioPerl Realize AccuratelyDownload LEA Gene Sequences fromthe NCBIXiaojing Zhang*, Xingqin Cao, Weimin Pan#School of Life Sciences, Xinjiang Normal University, UrumchiEmail: 313741033@, #379483304@Received: Apr. 11th, 2014; revised: Apr. 18th, 2014; accepted: Apr. 22nd, 2014Copyright © 2014 by authors and Hans Publishers Inc.This work is licensed under the Creative Commons Attribution International License (CC BY)./licenses/by/4.0/AbstractRecently, researchers have paid more and more attention to the resistance gene research; in Xin-jiang, especially the research of drought resistance gene has attached great importance. Based on these factors, according to the conservative domain structure of LEA gene (late embriogenesis ab-undant gene, LEA) and the corresponding keywords, this paper designed a program that LEA gene sequences was downloaded accurately from NCBI based on the BioPerl. This procedure not only solves the precise acquisition of LEA gene, but also provides a better solution to download differ-ent types of sequence exactly.KeywordsBioPerl, Conservative Structure Domain, Feature List, Key Words, LEA Gene基于BioPerl实现从NCBI中精确下载LEA基因序列张晓婧*,曹兴芹,潘伟民#新疆师范大学生命科学学院,乌鲁木齐Email: 313741033@, #379483304@#通讯作者。