Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

John Lafferty LAFFERTY@
Andrew McCallum MCCALLUM@
Fernando Pereira FPEREIRA@

WhizBang! Labs–Research, 4616 Henry Street, Pittsburgh, PA 15213 USA
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 USA

Abstract

We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.

1. Introduction

The need to segment and label sequences arises in many different problems in several scientific fields. Hidden Markov models (HMMs) and stochastic grammars are well understood and widely used probabilistic models for such problems. In computational biology, HMMs and stochastic grammars have been successfully used to align biological sequences, find sequences homologous to a known evolutionary family, and analyze RNA secondary structure (Durbin et al., 1998). In computational linguistics and computer science, HMMs and stochastic grammars have been applied to a wide variety of problems in text and speech processing, including topic segmentation, part-of-speech (POS) tagging, information extraction, and syntactic disambiguation (Manning & Schütze, 1999).

HMMs and stochastic grammars are generative models, assigning a joint probability to paired observation and label
sequences; the parameters are typically trained to maximize the joint likelihood of training examples. To define a joint probability over observation and label sequences, a generative model needs to enumerate all possible observation sequences, typically requiring a representation in which observations are task-appropriate atomic entities, such as words or nucleotides. In particular, it is not practical to represent multiple interacting features or long-range dependencies of the observations, since the inference problem for such models is intractable.

This difficulty is one of the main motivations for looking at conditional models as an alternative. A conditional model specifies the probabilities of possible label sequences given an observation sequence. Therefore, it does not expend modeling effort on the observations, which at test time are fixed anyway. Furthermore, the conditional probability of the label sequence can depend on arbitrary, non-independent features of the observation sequence without forcing the model to account for the distribution of those dependencies. The chosen features may represent attributes at different levels of granularity of the same observations (for example, words and characters in English text), or aggregate properties of the observation sequence (for instance, text layout). The probability of a transition between labels may depend not only on the current observation, but also on past and future observations, if available. In contrast, generative models must make very strict independence assumptions on the observations, for instance conditional independence given the labels, to achieve tractability.
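The contrast above can be made concrete with a toy sketch. The code below is not from the paper: the labels, feature functions, weights, and example sentence are all invented for illustration. It scores an entire label sequence with a single log-linear model whose features inspect arbitrary, overlapping parts of the observation sequence, including a future position, precisely the kind of features a generative model could not use without also modeling their distribution.

```python
import math
import itertools

LABELS = ["A", "B"]  # hypothetical label alphabet, invented for this sketch

def features(y_prev, y, x, t):
    """Features may use arbitrary, overlapping views of the whole
    observation sequence x, including the *next* observation."""
    return {
        ("word", x[t], y): 1.0,
        ("next_is_cap", t + 1 < len(x) and x[t + 1][0].isupper(), y): 1.0,
        ("trans", y_prev, y): 1.0,
    }

def score(y_seq, x, w):
    """Unnormalized log-score of an entire label sequence."""
    s, y_prev = 0.0, "<s>"
    for t, y in enumerate(y_seq):
        for k, v in features(y_prev, y, x, t).items():
            s += w.get(k, 0.0) * v
        y_prev = y
    return s

def conditional_prob(y_seq, x, w):
    """Normalize over all label sequences at once: one exponential
    model per sequence, not per state."""
    z = sum(math.exp(score(ys, x, w))
            for ys in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y_seq, x, w)) / z

x = ["the", "Fed", "raises"]                      # toy observation sequence
w = {("word", "Fed", "B"): 2.0, ("trans", "A", "B"): 0.5}
probs = {ys: conditional_prob(ys, x, w)
         for ys in itertools.product(LABELS, repeat=len(x))}
```

Because normalization happens over whole label sequences, adding a new, possibly redundant feature never forces the model to explain how the observations themselves are distributed; the brute-force sum over all label sequences here would of course be replaced by dynamic programming in any practical implementation.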
Maximum entropy Markov models (MEMMs) are conditional probabilistic sequence models that attain all of the above advantages (McCallum et al., 2000). In MEMMs, each source state¹ has an exponential model that takes the observation features as input and outputs a distribution over possible next states. These exponential models are trained by an appropriate iterative scaling method in the maximum entropy framework. Previously published experimental results show MEMMs increasing recall and doubling precision relative to HMMs in a FAQ segmentation task.

¹ Output labels are associated with states; it is possible for several states to have the same label, but for simplicity in the rest of this paper we assume a one-to-one correspondence.

MEMMs and other non-generative finite-state models based on next-state classifiers, such as discriminative Markov models (Bottou, 1991), share a weakness we call here the label bias problem: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. In probabilistic terms, transition scores are the conditional probabilities of possible next states given the current state and the observation sequence. This per-state normalization of transition scores implies a "conservation of score mass" (Bottou, 1991) whereby all the mass that arrives at a state must be distributed among the possible successor states. An observation can affect which destination states get the mass, but not how much total mass to pass on. This causes a bias toward states with fewer outgoing transitions. In the extreme case, a state with a single outgoing transition effectively ignores the observation. In those cases, unlike in HMMs, Viterbi decoding cannot downgrade a branch based on observations after the branch point, and models with state-transition structures that have sparsely connected chains of states are not properly handled. The Markovian assumptions in MEMMs and similar state-conditional models insulate decisions at one state from future
decisions in a way that does not match the actual dependencies between consecutive states.

This paper introduces conditional random fields (CRFs), a sequence modeling framework that has all the advantages of MEMMs but also solves the label bias problem in a principled way. The critical difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Therefore, the weights of different features at different states can be traded off against each other.

We can also think of a CRF as a finite state model with unnormalized transition probabilities. However, unlike some other weighted finite-state approaches (LeCun et al., 1998), CRFs assign a well-defined probability distribution over possible labelings, trained by maximum likelihood or MAP estimation. Furthermore, the loss function is convex,² guaranteeing convergence to the global optimum. CRFs also generalize easily to analogues of stochastic context-free grammars that would be useful in such problems as RNA secondary structure prediction and natural language processing.

² In the case of fully observable states, as we are discussing here; if several states have the same label, the usual local maxima of Baum-Welch arise.

Figure 1. Label bias example, after (Bottou, 1991). For conciseness, we place observation-label pairs on transitions rather than states; the symbol '' represents the null output label.

We present the model, describe two training procedures and sketch a proof of convergence. We also give experimental results on synthetic data showing that CRFs solve the classical version of the label bias problem, and, more significantly, that CRFs perform better than HMMs and MEMMs when the true data distribution has higher-order dependencies than the model, as is often the case in practice. Finally, we confirm these results as well as the claimed advantages of
conditional models by evaluating HMMs, MEMMs and CRFs with identical state structure on a part-of-speech tagging task.

2. The Label Bias Problem

Classical probabilistic automata (Paz, 1971), discriminative Markov models (Bottou, 1991), maximum entropy taggers (Ratnaparkhi, 1996), and MEMMs, as well as non-probabilistic sequence tagging and segmentation models with independently trained next-state classifiers (Punyakanok & Roth, 2001), are all potential victims of the label bias problem.

For example, Figure 1 represents a simple finite-state model designed to distinguish between the two words rib and rob. Suppose that the observation sequence is r i b. In the first time step, r matches both transitions from the start state, so the probability mass gets distributed roughly equally among those two transitions. Next we observe i. Both states 1 and 4 have only one outgoing transition. State 1 has seen this observation often in training; state 4 has almost never seen this observation; but like state 1, state 4 has no choice but to pass all its mass to its single outgoing transition, since it is not generating the observation, only conditioning on it. Thus, states with a single outgoing transition effectively ignore their observations. More generally, states with low-entropy next-state distributions will take little notice of observations. Returning to the example, the top path and the bottom path will be about equally likely, independently of the observation sequence. If one of the two words is slightly more common in the training set, the transitions out of the start state will slightly prefer its corresponding transition, and that word's state sequence will always win. This behavior is demonstrated experimentally in Section 5.

Léon Bottou (1991) discussed two solutions for the label bias problem. One is to change the state-transition structure of the model. In the above example we could collapse states 1 and 4, and delay the branching until we get a discriminating observation. This operation is a special case of
determinization (Mohri, 1997), but determinization of weighted finite-state machines is not always possible, and even when possible, it may lead to combinatorial explosion. The other solution mentioned is to start with a fully-connected model and let the training procedure figure out a good structure. But that would preclude the use of prior structural knowledge that has proven so valuable in information extraction tasks (Freitag & McCallum, 2000).

Proper solutions require models that account for whole state sequences at once by letting some transitions "vote" more strongly than others depending on the corresponding observations. This implies that score mass will not be conserved; instead, individual transitions can "amplify" or "dampen" the mass they receive. In the above example, the transitions from the start state would have a very weak effect on path score, while the transitions from states 1 and 4 would have much stronger effects, amplifying or damping depending on the actual observation, and a proportionally higher contribution to the selection of the Viterbi path.3 In the related work section we discuss other heuristic model classes that account for state sequences globally rather than locally. To the best of our knowledge, CRFs are the only model class that does this in a purely probabilistic setting, with guaranteed global maximum likelihood convergence.

3. Conditional Random Fields

In what follows, X is a random variable over data sequences to be labeled, and Y is a random variable over corresponding label sequences. All components Y_i of Y are assumed to range over a finite label alphabet 𝒴. For example, X might range over natural language sentences and Y range over part-of-speech taggings of those sentences, with 𝒴 the set of possible part-of-speech tags. The random variables X and Y are jointly distributed, but in a discriminative framework we construct a conditional model p(Y|X) from paired observation and label sequences, and do not explicitly model the marginal p(X). Thus, a CRF is a random field globally
conditioned on the observation X. Throughout the paper we tacitly assume that the graph G is fixed. In the simplest and most important example for modeling sequences, G is a simple chain or line: G = (V = {1, 2, ..., m}, E = {(i, i+1)}).

3 Weighted determinization and minimization techniques shift transition weights while preserving overall path weight (Mohri, 2000); their connection to this discussion deserves further study.

X may also have a natural graph structure; yet in general it is not necessary to assume that X and Y have the same graphical structure, or even that X has any graphical structure at all. However, in this paper we will be most concerned with sequences X = (X_1, X_2, ..., X_n) and Y = (Y_1, Y_2, ..., Y_n).

If the graph G = (V, E) of Y is a tree (of which a chain is the simplest example), its cliques are the edges and vertices. Therefore, by the fundamental theorem of random fields (Hammersley & Clifford, 1971), the joint distribution over the label sequence Y given X has the form

p_θ(y | x) ∝ exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) ),   (1)

where x is a data sequence, y a label sequence, and y|_S is the set of components of y associated with the vertices in subgraph S. We assume that the features f_k and g_k are given and fixed. For example, a Boolean vertex feature g_k might be true if the word X_i is upper case and the tag Y_i is "proper noun."

The parameter estimation problem is to determine the parameters θ = (λ_1, λ_2, ...; μ_1, μ_2, ...) from training data D = {(x^(i), y^(i))} with empirical distribution p̃(x, y). In Section 4 we describe an iterative scaling algorithm that maximizes the log-likelihood objective function O(θ) = Σ_i log p_θ(y^(i) | x^(i)).

As a particular case, we can construct an HMM-like CRF by defining one feature for each state pair (y′, y) and one feature for each state-observation pair (y, x): f_{y′,y}(⟨u,v⟩, y|_{⟨u,v⟩}, x) = δ(y_u, y′) δ(y_v, y) and g_{y,x}(v, y|_v, x) = δ(y_v, y) δ(x_v, x).
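As a concrete, minimal sketch of equation (1) for a chain, the following toy model uses two hypothetical features — an edge feature f(y_prev, y) = [y_prev = y] with weight lam and a vertex feature g(y, x_i) = [y = x_i] with weight mu — and normalizes by brute-force enumeration. The features, weights and binary alphabet are illustrative, not from the paper.

```python
import itertools
import math

lam, mu = 0.5, 1.0   # hypothetical feature weights
LABELS = (0, 1)      # toy binary label alphabet

def score(y, x):
    """Unnormalized log score: weighted sum of edge and vertex features."""
    s = sum(lam * (a == b) for a, b in zip(y, y[1:]))   # edge features
    s += sum(mu * (yi == xi) for yi, xi in zip(y, x))   # vertex features
    return s

def prob(y, x):
    """p(y|x) by explicit normalization over all label sequences."""
    z = sum(math.exp(score(yp, x))
            for yp in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / z

x = (0, 1, 1)
probs = {y: prob(y, x) for y in itertools.product(LABELS, repeat=3)}
```

With these weights, the most probable labeling for x = (0, 1, 1) is the one that copies the observations, since both feature types reward agreement; the enumeration is exponential in the sequence length and is replaced by the matrix/forward-backward computations described below for real use.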
The corresponding parameters λ_{y′,y} and μ_{y,x} play a similar role to the (logarithms of the) usual HMM parameters p(y′|y) and p(x|y). Boltzmann chain models (Saul & Jordan, 1996; MacKay, 1996) have a similar form but use a single normalization constant to yield a joint distribution, whereas CRFs use the observation-dependent normalization Z(x) for conditional distributions.

Although it encompasses HMM-like models, the class of conditional random fields is much more expressive, because it allows arbitrary dependencies on the observation sequence. In addition, the features do not need to specify completely a state or observation, so one might expect that the model can be estimated from less training data. Another attractive property is the convexity of the loss function; indeed, CRFs share all of the convexity properties of general maximum entropy models.

Figure 2. Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model.

For the remainder of the paper we assume that the dependencies of Y, conditioned on X, form a chain. To simplify some expressions, we add special start and stop states Y_0 = start and Y_{n+1} = stop. Thus, we will be using the graphical structure shown in Figure 2. For a chain structure, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference algorithms in Section 4. Suppose that p_θ(Y|X) is a CRF given by (1). For each position i in the observation sequence x, we define the |𝒴| × |𝒴| matrix random variable M_i(x) = [M_i(y′, y | x)] by

M_i(y′, y | x) = exp( Σ_k λ_k f_k(e_i, y|_{e_i} = (y′, y), x) + Σ_k μ_k g_k(v_i, y|_{v_i} = y, x) ),

where e_i is the edge with labels (Y_{i−1}, Y_i) and v_i is the vertex with label Y_i. In contrast to generative models, conditional models like CRFs do not need to enumerate over all possible observation sequences x, and therefore these matrices can be computed directly as needed from a given training or test observation sequence x and the parameter vector θ. Then the normalization (partition function) Z_θ(x) is the (start, stop) entry of the product of these matrices:

Z_θ(x) = ( M_1(x) M_2(x) ⋯ M_{n+1}(x) )_{start, stop}.
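The matrix-product form of Z(x) can be sketched on the same kind of toy chain model as before (edge feature [y_prev = y] with weight lam, vertex feature [y = x_i] with weight mu; both hypothetical). For simplicity this sketch omits explicit start/stop states, so the first position carries only a vertex potential; the point is that the left-to-right vector-matrix recursion reproduces the brute-force sum over all label sequences.

```python
import itertools
import math

lam, mu = 0.5, 1.0   # hypothetical weights
LABELS = (0, 1)

def z_bruteforce(x):
    """Partition function by summing exp(score) over all labelings."""
    total = 0.0
    for y in itertools.product(LABELS, repeat=len(x)):
        s = sum(lam * (a == b) for a, b in zip(y, y[1:]))
        s += sum(mu * (yi == xi) for yi, xi in zip(y, x))
        total += math.exp(s)
    return total

def z_matrix(x):
    """Partition function as a product of per-position matrices."""
    # alpha holds the vertex potentials of the first position
    alpha = [math.exp(mu * (y == x[0])) for y in LABELS]
    for xi in x[1:]:
        # one step of the matrix product: alpha <- alpha * M_i(x)
        alpha = [sum(alpha[j] * math.exp(lam * (LABELS[j] == y) + mu * (y == xi))
                     for j in range(len(LABELS)))
                 for y in LABELS]
    return sum(alpha)

x = (0, 1, 1, 0)
```

The two computations agree by distributivity, but the matrix form is linear rather than exponential in the sequence length.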
Using this notation, the conditional probability of a label sequence y is written as

p_θ(y | x) = ( ∏_{i=1}^{n+1} M_i(y_{i−1}, y_i | x) ) / Z_θ(x),

where y_0 = start and y_{n+1} = stop.

4. Parameter Estimation for CRFs

We now describe two iterative scaling algorithms to find the parameter vector θ that maximizes the log-likelihood of the training data. Both algorithms are based on the improved iterative scaling (IIS) algorithm of Della Pietra et al. (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs.

Iterative scaling algorithms update the weights as λ_k ← λ_k + δλ_k and μ_k ← μ_k + δμ_k for appropriately chosen δλ_k and δμ_k. In particular, the IIS update δλ_k for an edge feature f_k is the solution of

Ẽ[f_k] def= Σ_{x,y} p̃(x, y) Σ_i f_k(e_i, y|_{e_i}, x) = Σ_{x,y} p̃(x) p_θ(y | x) Σ_i f_k(e_i, y|_{e_i}, x) e^{δλ_k T(x, y)},

where T(x, y) def= Σ_{i,k} f_k(e_i, y|_{e_i}, x) + Σ_{i,k} g_k(v_i, y|_{v_i}, x) is the total feature count. The equations for vertex feature updates δμ_k have similar form.

However, efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because T(x, y) is a global property of (x, y), and dynamic programming will sum over sequences with potentially varying T. To deal with this, the first algorithm, Algorithm S, uses a "slack feature." The second, Algorithm T, keeps track of partial T totals.

For Algorithm S, we define the slack feature by

s(x, y) def= S − Σ_i Σ_k f_k(e_i, y|_{e_i}, x) − Σ_i Σ_k g_k(v_i, y|_{v_i}, x),

where S is a constant chosen so that s(x^(i), y) ≥ 0 for all y and all observation vectors x^(i) in the training set, thus making T(x, y) = S. Feature s is "global," that is, it does not correspond to any particular edge or vertex.

For each index i = 0, ..., n+1 we now define the forward vectors α_i(x), with base case

α_0(y | x) = 1 if y = start, 0 otherwise,

and recurrence α_i(x) = α_{i−1}(x) M_i(x). Similarly, the backward vectors β_i(x) are defined by

β_{n+1}(y | x) = 1 if y = stop, 0 otherwise,

and β_i(x)ᵀ = M_{i+1}(x) β_{i+1}(x)ᵀ. With these definitions, the update equations are

δλ_k = (1/S) log( Ẽ[f_k] / E[f_k] ),   δμ_k = (1/S) log( Ẽ[g_k] / E[g_k] ),

where

E[f_k] = Σ_x p̃(x) Σ_{i=1}^{n+1} Σ_{y′,y} f_k(e_i, (y′, y), x) α_{i−1}(y′ | x) M_i(y′, y | x) β_i(y | x) / Z_θ(x),
E[g_k] = Σ_x p̃(x) Σ_{i=1}^{n} Σ_{y} g_k(v_i, y, x) α_i(y | x) β_i(y | x) / Z_θ(x).

The factors involving the forward and backward vectors in the above equations have the same meaning as for standard hidden Markov models. For example, α_i(y | x) β_i(y | x) / Z_θ(x) is the marginal probability of label Y_i = y given that the observation sequence is x. This algorithm is closely related to the algorithm of Darroch and Ratcliff (1972), and to MART algorithms used in image reconstruction.

The constant S in Algorithm S can be quite large, since in practice it is proportional to the length of the longest training observation sequence. As a result, the algorithm may converge slowly, taking very small
steps toward the maximum in each iteration. If the length of the observations x^(i) and the number of active features varies greatly, a faster-converging algorithm can be obtained by keeping track of feature totals for each observation sequence separately.

Let T(x) def= max_y T(x, y). Algorithm T accumulates feature expectations into counters indexed by T(x). More specifically, we use the forward-backward recurrences just introduced to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k given that T(x) = t. Then our parameter updates are δλ_k = log β_k and δμ_k = log γ_k, where β_k and γ_k are the unique positive roots of the following polynomial equations

Σ_t a_{k,t} β_k^t = Ẽ[f_k],   Σ_t b_{k,t} γ_k^t = Ẽ[g_k],   (2)

which can be easily computed by Newton's method.

A single iteration of Algorithm S and Algorithm T has roughly the same time and space complexity as the well-known Baum-Welch algorithm for HMMs. To prove convergence of our algorithms, we can derive an auxiliary function to bound the change in likelihood from below; this method is developed in detail by Della Pietra et al. (1997). The full proof is somewhat detailed; however, here we give an idea of how to derive the auxiliary function. To simplify notation, we assume only edge features f_k with parameters λ_k. Given two parameter settings λ and λ + δλ, we bound from below the change in the objective function with an auxiliary function A(λ + δλ, λ), where the inequalities in the bound follow from the convexity of −log and exp. Differentiating A with respect to δλ_k and setting the result to zero yields equation (2).

5. Experiments

We first discuss two sets of experiments with synthetic data that highlight the differences between CRFs and MEMMs.
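The forward-backward marginals used in the Algorithm S and T updates can be sketched on the same toy chain model as before (edge feature [y_prev = y] with weight lam, vertex feature [y = x_i] with weight mu, both hypothetical, and no explicit start/stop states): α and β are computed by the recurrences above and combined to give per-position label marginals.

```python
import math

lam, mu = 0.5, 1.0   # hypothetical weights
LABELS = (0, 1)

def phi(yprev, y, xi):
    """Edge-plus-vertex potential at one position; yprev is None at the
    first position, which carries no incoming edge feature."""
    e = 0.0 if yprev is None else lam * (yprev == y)
    return math.exp(e + mu * (y == xi))

def marginals(x):
    """P(Y_i = y | x) for every position i, via forward-backward."""
    n, L = len(x), len(LABELS)
    # forward vectors alpha_i
    alpha = [[phi(None, y, x[0]) for y in LABELS]]
    for i in range(1, n):
        alpha.append([sum(alpha[-1][j] * phi(LABELS[j], y, x[i])
                          for j in range(L)) for y in LABELS])
    # backward vectors beta_i
    beta = [[1.0] * L for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for j, y in enumerate(LABELS):
            beta[i][j] = sum(phi(y, LABELS[k], x[i + 1]) * beta[i + 1][k]
                             for k in range(L))
    z = sum(alpha[-1])                      # partition function Z(x)
    return [[alpha[i][j] * beta[i][j] / z for j in range(L)]
            for i in range(n)]

m = marginals((0, 1, 1))
```

Each row of `m` is a proper distribution over labels at that position, which is exactly the quantity α_i(y|x) β_i(y|x) / Z(x) appearing in the expectation formulas of Section 4.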
The first experiments are a direct verification of the label bias problem discussed in Section 2. In the second set of experiments, we generate synthetic data using randomly chosen hidden Markov models, each of which is a mixture of a first-order and a second-order model. Competing models are then trained and compared on test data. As the data becomes more second-order, the test error rates of the trained models increase. This experiment corresponds to the common modeling practice of approximating complex local and long-range dependencies, as occur in natural data, by small-order Markov models. Our results clearly indicate that even when the models are parameterized in exactly the same way, CRFs are more robust to inaccurate modeling assumptions than MEMMs or HMMs, and resolve the label bias problem, which affects the performance of MEMMs. To avoid confusion of different effects, the MEMMs and CRFs in these experiments do not use overlapping features of the observations. Finally, in a set of POS tagging experiments, we confirm the advantage of CRFs over MEMMs. We also show that the addition of overlapping features to CRFs and MEMMs allows them to perform much better than HMMs, as already shown for MEMMs by McCallum et al. (2000).

Figure 3. Plots of error rates for HMMs, CRFs, and MEMMs on randomly generated synthetic data sets, as described in Section 5.2. As the data becomes "more second order," the error rates of the test models increase. As shown in the left plot, the CRF typically significantly outperforms the MEMM. The center plot shows that the HMM outperforms the MEMM. In the right plot, each open square represents a data set with α < 1/2, and a solid circle indicates a data set with α ≥ 1/2; when the data is mostly second order (α ≥ 1/2), the discriminatively trained CRF typically outperforms the HMM. These experiments are not designed to demonstrate the advantages of the additional representational power of CRFs and MEMMs relative to HMMs.

5.1 Modeling label bias

We generate data from a simple HMM which encodes a
noisy version of the finite-state network in Figure 1. Each state emits its designated symbol with probability 29/32 and any of the other symbols with probability 1/32. We train both an MEMM and a CRF with the same topologies on the data generated by the HMM. The observation features are simply the identity of the observation symbols. In a typical run using 2,000 training and 500 test samples, trained to convergence of the iterative scaling algorithm, the CRF error is 4.6% while the MEMM error is 42%, showing that the MEMM fails to discriminate between the two branches.

5.2 Modeling mixed-order sources

For these results, we use five labels, a–e (|𝒴| = 5), and 26 observation values, A–Z (|𝒳| = 26); however, the results were qualitatively the same over a range of sizes for |𝒴| and |𝒳|. We generate data from a mixed-order HMM with state transition probabilities given by p_α(y_i | y_{i−1}, y_{i−2}) = α p_2(y_i | y_{i−1}, y_{i−2}) + (1 − α) p_1(y_i | y_{i−1}) and, similarly, emission probabilities given by p_α(x_i | y_i, x_{i−1}) = α p_2(x_i | y_i, x_{i−1}) + (1 − α) p_1(x_i | y_i). Thus, for α = 0 we have a standard first-order HMM. In order to limit the size of the Bayes error rate for the resulting models, the conditional probability tables are constrained to be sparse.
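The Section 5.1 data-generation process can be sketched as follows; the two-branch network, symbol set and noise level are as described above, but the exact generator used in the paper is not specified, so this is an assumed reconstruction.

```python
import random

SYMBOLS = "riob"
# word and the state path it follows in the two-branch network (Figure 1)
PATHS = [("rib", (1, 2, 3)), ("rob", (4, 5, 6))]

def emit(designated, rng):
    """Emit the state's designated symbol with probability 29/32,
    otherwise one of the three other symbols uniformly (1/32 each)."""
    if rng.random() < 29 / 32:
        return designated
    return rng.choice([s for s in SYMBOLS if s != designated])

def sample(rng):
    """Pick a branch uniformly, then emit a noisy observation sequence."""
    word, _states = PATHS[rng.randrange(2)]
    obs = "".join(emit(c, rng) for c in word)
    return obs, word

rng = random.Random(0)
data = [sample(rng) for _ in range(1000)]
```

The label (the underlying word) is kept alongside each noisy observation string so that the MEMM and CRF can be trained on identical (observation, path) pairs.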
In particular, p_2(y_i | y_{i−1}, y_{i−2}) can have at most two nonzero entries for each (y_{i−1}, y_{i−2}), and p_2(x_i | y_i, x_{i−1}) can have at most three nonzero entries for each (y_i, x_{i−1}). For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing.

On each randomly generated training set, a CRF is trained using Algorithm S. (Note that since the length of the sequences and the number of active features is constant, Algorithms S and T are identical.) The algorithm is fairly slow to converge, typically taking approximately 500 iterations for the model to stabilize. On the 500 MHz Pentium PC used in our experiments, each iteration takes approximately 0.2 seconds. On the same data an MEMM is trained using iterative scaling, which does not require forward-backward calculations and is thus more efficient. The MEMM training converges more quickly, stabilizing after approximately 100 iterations. For each model, the Viterbi algorithm is used to label a test set; the experimental results do not significantly change when using forward-backward decoding to minimize the per-symbol error rate.

The results of several runs are presented in Figure 3. Each plot compares two classes of models, with each point indicating the error rate for a single test set. As α increases, the error rates generally increase, as the first-order models fail to fit the second-order data. The figure compares the three model classes under one parameterization; results for an alternative parameterization are qualitatively the same. As shown in the first graph, the CRF generally outperforms the MEMM, often by a wide margin of 10%–20% relative error. (The points for very small error rate, with very small α, where the MEMM does better than the CRF, are suspected to be the result of an insufficient number of training iterations for the CRF.)

model    error    oov error
HMM      5.69%    45.99%
MEMM     6.37%    54.61%
CRF      5.55%    48.05%
MEMM+    4.81%    26.99%
CRF+     4.27%    23.76%
(+ using spelling features)

Figure 4. Per-word error rates for POS tagging on the Penn treebank, using first-order models trained on 50% of the 1.1 million word corpus. The oov rate is 5.45%.

5.3 POS tagging
experiments

To confirm our synthetic data results, we also compared HMMs, MEMMs and CRFs on Penn treebank POS tagging, where each word in a given input sentence must be labeled with one of 45 syntactic tags.

We carried out two sets of experiments with this natural language data. First, we trained first-order HMM, MEMM, and CRF models as in the synthetic data experiments, introducing parameters μ_{y,x} for each tag-word pair and λ_{y′,y} for each tag-tag pair in the training set. The results are consistent with what is observed on synthetic data: the HMM outperforms the MEMM, as a consequence of the label bias problem, while the CRF outperforms the HMM. The error rates for training runs using a 50%-50% train-test split are shown in Figure 4; the results are qualitatively similar for other splits of the data. The error rates on out-of-vocabulary (oov) words, which are not observed in the training set, are reported separately.

In the second set of experiments, we take advantage of the power of conditional models by adding a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it ends in one of a small set of suffixes. Here we find, as expected, that both the MEMM and the CRF benefit significantly from the use of these features, with the overall error rate reduced by around 25% and the out-of-vocabulary error rate reduced by around 50%.

One usually starts training from the all-zero parameter vector, corresponding to the uniform distribution. However, for these datasets, CRF training with that initialization is much slower than MEMM training. Fortunately, we can use the optimal MEMM parameter vector as a starting point for training the corresponding CRF. In Figure 4, MEMM+ was trained to convergence in around 100 iterations. Its parameters were then used to initialize the training of CRF+, which converged in 1,000 iterations. In contrast, training of the same CRF from the uniform distribution had not converged even
after 2,000 iterations.

6. Further Aspects of CRFs

Many further aspects of CRFs are attractive for applications and deserve further study. In this section we briefly mention just two.

Conditional random fields can be trained using the exponential loss objective function used by the AdaBoost algorithm (Freund & Schapire, 1997). Typically, boosting is applied to classification problems with a small, fixed number of classes; applications of boosting to sequence labeling have treated each label as a separate classification problem (Abney et al., 1999). However, it is possible to apply the parallel update algorithm of Collins et al. (2000) to optimize the per-sequence exponential loss. This requires a forward-backward algorithm to compute efficiently certain feature expectations, along the lines of Algorithm T, except that each feature requires a separate set of forward and backward accumulators.

Another attractive aspect of CRFs is that one can implement efficient feature selection and feature induction algorithms for them. That is, rather than specifying in advance which features to use, we could start from feature-generating rules and evaluate the benefit of generated features automatically on data. In particular, the feature induction algorithms presented in Della Pietra et al. (1997) can be adapted to fit the dynamic programming techniques of conditional random fields.

7. Related Work and Conclusions

As far as we know, the present work is the first to combine the benefits of conditional models with the global normalization of random field models. Other applications of exponential models in sequence modeling have either attempted to build generative models (Rosenfeld, 1997), which involve a hard normalization problem, or adopted local conditional models (Berger et al., 1996; Ratnaparkhi, 1996; McCallum et al., 2000) that may suffer from label bias.
Non-probabilistic local decision models have also been widely used in segmentation and tagging (Brill, 1995; Roth, 1998; Abney et al., 1999). Because of the computational complexity of global training, these models are only trained to minimize the error of individual label decisions, assuming that neighboring labels are correctly assigned. Label bias would be expected to be a problem here too.

An alternative approach to discriminative modeling of sequence labeling is to use a permissive generative model, which can only model local dependencies, to produce a list of candidates, and then use a more global discriminative model to rerank those candidates. This approach is standard in large-vocabulary speech recognition (Schwartz & Austin, 1993), and has also been proposed for parsing (Collins, 2000). However, these methods fail when the correct output is pruned away in the first pass.
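The per-state normalization at the heart of the label bias problem (Section 2) can be illustrated with a toy next-state classifier over a two-branch network like that of Figure 1. The transition weights here are hypothetical, not trained; the point is that states with a single outgoing arc pass on all their mass regardless of the observation.

```python
import math

def softmax(scores):
    """Per-state normalization of transition scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# trans[state] -> list of (next_state, weight_fn); weight_fn(obs) is the
# unnormalized score of taking that arc given the observation.
trans = {
    0: [(1, lambda o: 2.0 if o == "r" else -2.0),    # branch toward "rib"
        (4, lambda o: 2.0 if o == "r" else -2.0)],   # branch toward "rob"
    1: [(2, lambda o: 5.0 if o == "i" else -5.0)],   # single outgoing arc
    4: [(5, lambda o: 5.0 if o == "o" else -5.0)],   # single outgoing arc
    2: [(3, lambda o: 5.0 if o == "b" else -5.0)],
    5: [(6, lambda o: 5.0 if o == "b" else -5.0)],
}

def path_prob(states, obs):
    """Probability of a state path under per-state normalization."""
    p, cur = 1.0, states[0]
    for nxt, o in zip(states[1:], obs):
        arcs = trans[cur]
        probs = softmax([w(o) for _, w in arcs])
        p *= probs[[s for s, _ in arcs].index(nxt)]
        cur = nxt
    return p

# Even when the observation is "rob", the top ("rib") path keeps half the
# mass: states 1 and 2 have one outgoing arc each and ignore observations.
p_top = path_prob([0, 1, 2, 3], "rob")
p_bottom = path_prob([0, 4, 5, 6], "rob")
```

Both paths end up with probability 0.5, decided entirely at the start state, which is exactly the behavior the globally normalized CRF avoids.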
One of the many doors that opens when a species' genome is first sequenced leads to the world of population genomics and to the unparalleled study of the evolutionary divergence of closely related populations and species. Ever since Darwin developed his model of how one species splits into two (his principle of divergence)1, the field of evolutionary biology has been divided on the question of whether Darwin's model is correct. With growing access to very large amounts of genome data from multiple individuals of a species, it becomes increasingly likely that we will resolve this long-standing question.

Recent next-generation sequencing (NGS) technologies and assembly tools, such as restriction-site-associated DNA (RAD-tag)2 sequencing and genotyping by sequencing (GBS)3, now make it possible to obtain genome-scale data affordably from multiple individuals (reviewed in REFS 4,5). When individuals are sampled from multiple populations of a species, as has been done for humans6–8, stickleback fish2, fruitflies (Drosophila Population Genomics Project), Arabidopsis thaliana (1001 Genomes Project)9, dogs10 and different species of great apes11–13, among others14,15, we gain an exceptional view not only of the variation within populations and species but also of the variation that lies between them. Data sets such as these can include millions of variable single-nucleotide polymorphisms (SNPs) and other kinds of polymorphisms, and they hold the promise of finally revealing the genetic side of how species and populations diverge.

However, a flood of new data may not lead directly to a commensurate gain in knowledge, and today, as new population genomic data sets are emerging, our skills of analysis and interpretation are partly overwhelmed. With the rise of large NGS data sets that reveal complex patterns of variation across species' genomes, we find that our best models and tools for explaining patterns of variation were designed for a simpler time and smaller data sets.
In the first place, NGS data sets present unique challenges, apart from their size, that result from the way in which they are generated. For example, it is common to use a reference genome to aid the assembly of additional NGS data, and yet this introduces a form of ascertainment bias, or reference bias, that can affect one's findings16,17 (BOX 1). Second, most models and methods available to analyse NGS data have limitations that prevent using all of the information in the data that bears on the processes of interest.

In this Review, we survey the state of the art of population divergence models and inference methods with regard to population genomics data sets. We do not examine in detail the technical challenges related to NGS, such as correcting sequencing, assembly, SNP- and genotype-calling errors, as these have recently been reviewed elsewhere4,17,18. Rather, we focus on models of population divergence and on methods to detect and to quantify gene flow, as well as methods to distinguish alternative modes of speciation. We discuss the limitations of these methods and provide examples of their application to recently available genome-wide data sets.

Department of Genetics, Rutgers, the State University of New Jersey, Piscataway, New Jersey 08854, USA. Correspondence to J.H. E-mail: hey@
doi:10.1038/nrg3446
Published online 9 May 2013

Single-nucleotide polymorphisms (SNPs). Sites in the DNA at which there is variation across the genomes in a population, usually comprising two alleles that correspond to two different nucleotides.

Ascertainment bias. Systematic bias introduced by the sampling design (for example, criteria used to select individuals and/or genetic markers) that induces a nonrandom sample of observations.

Understanding the origin of species with genome-scale data: modelling gene flow

Vitor Sousa and Jody Hey

Abstract | As it becomes easier to sequence multiple genomes from closely related species, evolutionary biologists working on speciation are struggling to get the most out of very large population genomic data sets. Such data hold the potential to resolve long-standing questions in evolutionary biology about the role of gene exchange in species formation. In principle, the new population genomic data can be used to disentangle the conflicting roles of natural selection and gene flow during the divergence process. However, there are great challenges in taking full advantage of such data, especially with regard to including recombination in genetic models of the divergence process. Current data, models, methods and the potential pitfalls in using them will be considered here.

Models of species formation

Speciation in the absence of gene flow. When two populations become allopatric (that is, completely geographically separated), they can diverge without mixing genes and eventually become reproductively isolated19. Compared with divergence in the presence of gene exchange, this is a comparatively simple process, and this simplicity, together with clear biogeographic evidence (such as the finding that in island archipelagos it is common to find different species on different islands), convinced many that this was how nearly all new species formed19–24.

Speciation in the presence of gene flow.
Darwin, however, envisioned that natural selection can act in disparate ways over a species range to pull a species in different directions and eventually to split it, first into different varieties and finally into separate species. This model, which has come to be called 'sympatric speciation', was considered by many to be unlikely, as gene exchange across a species range was considered to be a strong homogenizing force that counteracts divergence by natural selection. However, more recently, genetic data (for example, mitochondrial DNA sequences and microsatellites) together with biogeographic circumstances have provided compelling evidence that sympatric speciation has occurred in numerous contexts25–27.

At the genetic level, Darwin's model raises complexities, as it predicts that divergence in the presence of gene flow can cause different genes to experience very different histories. Diversifying selection favours different alleles in different parts of a species range (and at one or more loci). However, the movement of all genes across the range of the species, as the normal result of organisms reproducing and dispersing, will regularly move alleles that are affected by the diversifying selection into the 'wrong' part of the species range. It will also cause the species to appear to be homogeneous when examined for patterns of variation at neutral genes.

An additional major player in the divergence process, particularly when gene flow is occurring, is recombination: the breaking and rejoining of chromosomes that happens during meiosis every generation and that allows different parts of the genome to have different histories. Because of recombination, an allele that is favoured by selection and that is increasing in frequency will carry with it in its trajectory towards fixation only those flanking regions to which it is most tightly linked28,29.
Recombination also makes it possible for alleles at neutral loci to move by gene flow across the species range and to co-occur in the same population of genomes in which there are loci diverging by the action of diversifying selection30,31. Recombination thus allows a species to have a population of genomes with a split personality: to resemble two diverging gene pools at loci affected by diversifying selection and to resemble a single gene pool at loci that are not under selection in this way. Evidence of this kind of genomic schism has come from a diversity of systems in recent years, on the basis of DNA Sanger sequence and microsatellite data32 and more recently from NGS data in stickleback fish2, Heliconius butterflies14 and flycatchers15.

Sympatric speciation. The process of divergence between populations or species occupying the same geographical area, in the presence of gene flow.

Diversifying selection. Natural selection acting towards different alleles (or phenotypes) being favoured in different regions within a single population or among multiple connected populations.

Neutral genes. Genes for which genetic patterns are mostly affected by mutation and demographic factors, such as genetic drift and migration.

Allopatric divergence. The process of divergence between populations or species that are geographically separated, in the absence of gene flow.

Linkage disequilibrium (LD). The nonrandom association of alleles at different sites or loci.

Islands of differentiation. Genomic regions of elevated differentiation owing to the action of natural selection.

F_ST. The proportion of the total genetic variability occurring among populations, typically used as a measure of the level of population genetic differentiation.

Island model. A model introduced by Sewall Wright to study population structure, comprising multiple populations connected to each other through migration.

Metapopulation model. In the context of F_ST-based statistics, this is an idealized model in which several populations diverge without migration from a common ancestral gene pool (or metapopulation).

Modelling population divergence. A widely used theoretical framework for studying speciation using genetic data is the 'isolation with migration' model, so named because it includes both the separation of two populations (a process called isolation) following a splitting event from their common ancestral population as well as migration between the populations33–35. At one extreme, we can consider a simple isolation model in which the migration rate is zero in both directions; this corresponds to an allopatric divergence scenario (FIG. 1a). Other models include isolation with migration (FIG. 1b), isolation after migration (FIG. 1c) and secondary contact (FIG. 1d). It has been shown that patterns of genetic variation in samples from two closely related populations or species can be used to distinguish a pure isolation model (FIG. 1a) from a model with migration33,34 (FIG. 1b–d). Furthermore, the growing evidence of persistent gene exchange between closely related species means that divergence often arises in the midst of conflicting evolutionary processes32.

Inferring the history of divergence

NGS data from multiple individuals offer the promise of disentangling the complex interplay between selection, gene flow and recombination that occurs during speciation with gene flow. First, by having information for essentially all parts of the genome, we can gain a more detailed and accurate picture of the demography of populations36,37. Second, it becomes possible to ask whether some parts of the genome have been exchanging genes more than others.
Substantial variation in gene flow levels across the genome constitutes clear, albeit indirect, evidence that selection is acting against gene flow to a greater degree in some genome regions than in others38,39. Third, NGS data allow us to get better estimates of recombination rates and linkage disequilibrium (LD) patterns along the genome40,41, and this can in principle be used to infer the timing and magnitude of gene flow. Finally, polymorphism and LD along the genome also bear information about selective sweeps and genes that are the targets of diversifying selection (reviewed in REFS 42,43). However, all of these inferences depend on having a theoretical framework that connects patterns of variation to an explicit model.

Genome scans using indicators of divergence. Depending on an investigator's question, it can sometimes be useful to take a fairly simple approach that does not use models with many parameters to study the levels of divergence between populations. This can be achieved by tailoring analyses to a specific component of the divergence process and scanning across the genome while calculating statistics that are expected to be sensitive to that feature. For example, there has been considerable interest in detecting 'islands of differentiation' by looking at the distribution of summary statistics that measure genetic differentiation, such as F_ST44. In the first study using RAD-tag sequencing, the differentiation of 45,789 SNPs along the genome between oceanic and freshwater populations of threespine sticklebacks (Gasterosteus aculeatus) showed, overall, reduced levels of differentiation (F_ST values close to zero)2. However, when a sliding window was run along the aligned genomes of freshwater and oceanic populations, the authors found evidence for genomic regions characterized by very high F_ST values (>0.35), potentially harbouring genes under divergent selection.
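A sliding-window F_ST scan of the kind described above can be sketched as follows. This uses the common variance-based form F_ST = (H_T − H_S) / H_T computed from per-population allele frequencies; the specific estimator and window size used in the stickleback study may differ, so treat this as an illustrative assumption.

```python
def fst(p1, p2):
    """Per-SNP F_ST from alternate-allele frequencies in two populations."""
    h_s = (p1 * (1 - p1) + p2 * (1 - p2)) / 2  # mean within-population heterozygosity
    p_bar = (p1 + p2) / 2
    h_t = p_bar * (1 - p_bar)                  # total (pooled) heterozygosity
    if h_t == 0:                               # monomorphic site overall
        return 0.0
    return (h_t - h_s) / h_t

def sliding_fst(freqs1, freqs2, window=5):
    """Mean F_ST in overlapping windows of consecutive SNPs."""
    vals = [fst(a, b) for a, b in zip(freqs1, freqs2)]
    return [sum(vals[i:i + window]) / window
            for i in range(len(vals) - window + 1)]
```

A fixed difference (frequencies 1 and 0) gives F_ST = 1, identical frequencies give F_ST = 0, and windows of elevated values flag candidate islands of differentiation for closer inspection.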
Interestingly, the same genomic regions were highlighted in comparisons involving different freshwater populations, suggesting parallel adaptation to the freshwater environment. These results are in agreement with a larger study of seven pairs of closely related marine and freshwater populations, comprising 5,897,368 SNPs 45.

Divergence summaries can also be used within demographic models of divergence, such as an island model or a metapopulation model (reviewed in REFS 46,47). One approach is to scan the genome using a hierarchical F_ST model that assumes a nested island model underlying the divergence process 48. A related approach explicitly accounts for variation in read depth in NGS data within a Bayesian framework 49 and hence should be preferred for analysing such data.

Figure 1 | Alternative modes of divergence. All models assume that an ancestral population of size N_A splits into two populations at the time of split (t_s). The two present-day populations have effective sizes N_1 and N_2, respectively. Panel a shows the model in which the migration rate is zero in both directions, which corresponds to an allopatric divergence scenario. Panels b–d represent alternative models in which populations have been exchanging migrants. In panel b, gene flow occurs at a constant rate since the split from the ancestral population; migration rates are assumed to be constant through time, but gene flow can be asymmetric, that is, with one migration rate for each direction. Panel c shows a scenario in which populations begin diverging in the presence of gene flow but experience a cessation of gene flow after the time since isolation (t_i); if the lack of current gene flow in this model is due to reproductive isolation, then this represents a history in which divergence occurred to the point of speciation in the presence of gene flow. In panel d, we consider the alternative migration history in which populations were isolated and diverged for a period of time in the absence of gene flow, followed by secondary contact at the time of secondary contact (t_sc) and the introgression of alleles from the other population by gene flow.

Nested island model. A hierarchical island model with groups of populations in which migration among populations within the same group is higher than among populations in different groups.

Gene trees. Bifurcating trees that represent the ancestral relationships of homologous haplotypes sampled from a single population or multiple populations. A gene tree includes coalescent events and, in models with gene flow, migration events. A gene tree is characterized by a topology, branch lengths, coalescence times and migration times.

Another type of genome scan, targeted at identifying recent admixture, relies on comparing the population tree (assumed to be known) with the gene trees inferred at a specific site. Incongruences between the population tree and the gene tree can be due to incomplete lineage sorting (shared ancestral polymorphism) or to gene flow. One statistic, called 'D', was specifically designed to detect introgression from one population to another 50 (FIG. 2). Computing D requires a genome from each of two sister populations, a genome from a third population (a potential source of introgressed genes) and a fourth outgroup genome to identify the ancestral state (identified as the A allele). Focusing on SNPs in which the candidate source population has the derived allele (B) and in which the two sister genomes have different alleles, there are two possible configurations: either ABBA or BABA. Under the hypothesis of shared ancestral polymorphism, the numbers of tree topologies of ABBA and BABA are expected to be equal, and the expected D will be zero. Deviations from that expectation are interpreted as evidence of introgression.
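The ABBA/BABA counting behind D can be sketched in a few lines. This is a minimal illustration assuming a single genome per population and sites already polarized against the outgroup; the function name and tuple encoding of site patterns are invented here, and real analyses (frequency-based estimators, block-jackknife significance tests) are more involved.

```python
# Sketch of the D statistic from ABBA/BABA counts. Each site is a tuple of
# alleles for (population 1, population 2, candidate source 3, outgroup 4);
# 'A' is the ancestral state (defined by the outgroup) and 'B' the derived
# state. Only the two informative configurations contribute.

def d_statistic(sites):
    """D = (nABBA - nBABA) / (nABBA + nBABA) over informative sites."""
    abba = sum(1 for s in sites if s == ('A', 'B', 'B', 'A'))
    baba = sum(1 for s in sites if s == ('B', 'A', 'B', 'A'))
    if abba + baba == 0:
        return 0.0
    return (abba - baba) / (abba + baba)

# Under incomplete lineage sorting alone, ABBA and BABA are equally likely,
# so D ~ 0; an excess of ABBA suggests gene flow from population 3 into 2.
balanced = [('A', 'B', 'B', 'A')] * 50 + [('B', 'A', 'B', 'A')] * 50
excess   = [('A', 'B', 'B', 'A')] * 80 + [('B', 'A', 'B', 'A')] * 20
print(d_statistic(balanced))  # 0.0
print(d_statistic(excess))    # 0.6
```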
As with F_ST genome scans, investigators can look at the distribution of D along the genome, but when using D, the aim is to find genomic regions that specifically experienced introgression, whereas in the case of F_ST, the goal is to identify regions of high differentiation, regardless of the cause.

Genome scans using D were used, for instance, to detect admixture between archaic and modern humans 51,52 and to study the patterns of introgression in Heliconius butterflies 14. In the case of modern and archaic humans, unidirectional introgression from Neanderthals to non-African humans was estimated to have occurred for 1–4% of the genome 51. Similarly, data from 642,690 SNPs point to 4–6% of present-day Melanesian genomes being derived from admixture with Denisovans 52. For Heliconius butterflies, RAD-tag sequencing of 4% of the genome (~12 Mb) indicated introgression from Heliconius timareta to Heliconius melpomene amaryllis (2–5% admixture), which are sympatric species that exhibit the same wing colour patterns. Interestingly, only a few regions exhibited significant D values, including genes known to contain the mimicry loci B/D and N/Yb. Despite the lack of an explicit test of positive selection, the fact that these regions harbour genes involved in mimicry is in agreement with an active role of selection promoting introgression at these regions. In these species, the patterns of differentiation along the genome suggest a case in which most of the genome is differentiated — consistent with a model of allopatric divergence (FIG. 1a) or divergence with limited gene flow (FIG. 1b) — whereas a few regions show evidence of secondary contact and uni- or bidirectional introgression of genes from one population (species) to the other (FIG. 1d).
In both the cases of humans and Heliconius spp., there was evidence of regions exchanged between populations that were already differentiated, pointing to the importance of secondary contact.

Although genome scan approaches are flexible and applicable to large genomic data sets, the focus on amenable summary statistics typically entails setting aside much of the information in a data set. A related limitation is that the same numerical value of a particular statistic can result from very distinct scenarios. For instance, a low F_ST can be due to shared ancestral polymorphism or due to gene flow 44. Similarly, the D statistic can be significantly different from zero owing to events other than admixture. The evidence of admixture between modern non-African human populations and Neanderthals has been questioned by a simulation study showing that spatial expansions and population substructure without admixture could lead to D values that are similar to the observed ones 53.

Figure 2 | Disentangling ancestral polymorphism from gene flow (ABBA and BABA test). The diagram shows the divergence of two sister populations (1 and 2), a third population (a potential source of introgressed genes; 3) and an outgroup population (4) over time. The black line represents the gene tree of a given site, and the star represents a mutation from the ancestral state (allele A) to the derived state (allele B). The pattern ABBA can occur owing to ancestral polymorphism (a), that is, coalescence of the lineage from population 2 with the lineage from population 3 in the ancestral population (the population ancestral to populations 1, 2 and 3), or owing to gene flow from population 3 to population 2 (b). a | Ancestral polymorphism. b | Introgression (gene flow).
Under a model with no gene flow, we expect the pattern ABBA to be as frequent as BABA, because there is a 50% chance that either the lineage from population 1 or that from population 2 coalesces with the lineage from population 3 in the population ancestral to populations 1, 2 and 3.

Bayesian statistics. A statistical framework in which the parameters of the models are treated as random variables, allowing expression of the probability of parameters given the data; this is called the posterior. The posterior probability is obtained by Bayes' rule, and it is proportional to the likelihood times the prior.

Allele frequency spectrum (AFS). A distribution of the counts of single-nucleotide polymorphisms with a given observed frequency in a single population or multiple populations.

Genetic drift. Stochastic changes in gene frequency owing to the finite size of populations, resulting from the random sampling of gametes from the parents at each generation.

Likelihood and model-based methods. As useful as genome scans with indicator variables can be to identify components of the divergence process, they fall short of providing a full portrait of divergence unless they are combined with other analyses. In this light, the goal for many investigators is to be able to calculate the likelihood under a rich divergence model. For some model of divergence M, with a parameter set Θ, the likelihood is the probability (P) of the data given the parameters: that is, P_M(Data | Θ). Having a likelihood function at hand allows estimating the most likely parameters of a given model with either frequentist or Bayesian statistics 54. Also, comparing the likelihoods of alternative models opens the door to model choice approaches to infer the most probable divergence model.
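The likelihood logic above can be made concrete with a model small enough to write down. In this toy sketch (all numbers invented), the data are the number of derived alleles observed at a site among n sampled chromosomes, modelled as Binomial(n, q) with unknown frequency q, and P_M(Data | q) is maximized over a grid of q values, the simplest frequentist route to a point estimate.

```python
# Toy maximum-likelihood estimation: evaluate the binomial log-likelihood
# of the observed derived-allele count on a grid of candidate frequencies
# and keep the best one. Real divergence models replace Binomial(n, q)
# with a far richer P_M(Data | Theta), but the logic is identical.
import math

def binom_loglik(k, n, q):
    """log P(k derived alleles out of n | allele frequency q)."""
    return (math.log(math.comb(n, k)) +
            k * math.log(q) + (n - k) * math.log(1.0 - q))

k, n = 7, 10                        # observed: 7 of 10 chromosomes carry B
grid = [i / 100 for i in range(1, 100)]
q_hat = max(grid, key=lambda q: binom_loglik(k, n, q))
print(q_hat)  # 0.7, the maximum-likelihood estimate k / n
```

Model choice works the same way at one level up: fit each candidate divergence model, then compare their maximized likelihoods (penalized for the number of parameters, or via Bayesian posterior model probabilities).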
Currently, there are two main families of likelihood-based approaches for studying divergence: one based on the allele frequency spectrum (AFS) and a second based on sampling genealogies for short portions of the genome (BOX 2).

Likelihoods using the allele frequency spectrum. For a single SNP sampled in each of two populations, considered together with the base that is present in an outgroup genome, the data can be summarized as the number of copies of the derived allele in each of the two populations. For a large number of SNPs, these counts fill a discrete distribution — the allele frequency spectrum (AFS) — in two dimensions (one for each sampled population), which can be represented in graphical form (FIG. 3). This approach has seen renewed interest as large SNP data sets have become more common 55–57. FIGURE 3 shows how the AFS can vary considerably for the different isolation with migration models shown in FIG. 1, particularly how simple isolation differs from models with gene flow. In the absence of gene flow (FIG. 3a), the frequencies of SNPs found in only one population differ from those in the other population because genetic drift drives different alleles to fixation in each population. By contrast, in models with gene flow, the cells along the diagonal exhibit a higher density (FIG. 3b) because there are many SNPs with similar frequencies in the two populations. However, as exemplified in these AFSs, it can be difficult to separate alternative scenarios with gene flow, as these tend to be similar (FIG. 3b–d).

Although the expected AFS can be generated by simulations 55,58,59, it is also the focus of a population genetic theory in which differential equations describe the diffusion of allele frequencies in populations 60,61. In recent years, the diffusion equation approach has been reawakened for the study of the AFS under isolation with migration models, such as the ones shown in FIG. 1 (REFS 57,62,63).
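The two-dimensional AFS described above is simply a tabulation, over SNPs, of the pair of derived-allele counts in the two population samples; given a model's expected spectrum, it can then be scored with the multinomial (composite) likelihood discussed in the text. A minimal sketch, with invented function names and toy data:

```python
# Sketch: build a joint AFS from per-SNP derived-allele counts, then score
# it against a model's expected spectrum with a multinomial log-likelihood
# (the parameter-independent multinomial coefficient is dropped). The toy
# counts and the two "expected" spectra below are invented for illustration,
# not the output of any real model fit.
import math

def joint_afs(derived_counts, n1, n2):
    """(n1+1) x (n2+1) matrix of SNP counts; entry [i][j] is the number of
    SNPs with i derived copies in population 1 and j in population 2."""
    afs = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i, j in derived_counts:
        afs[i][j] += 1
    return afs

def composite_loglik(observed, expected):
    """Multinomial log-likelihood of an observed AFS given expected cell
    probabilities (same matrix shape, probabilities summing to 1)."""
    ll = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for n, p in zip(obs_row, exp_row):
            if n:
                if p == 0.0:
                    return float('-inf')  # observed cell impossible under model
                ll += n * math.log(p)
    return ll

# Four SNPs typed in samples of 2 chromosomes from each population.
counts = [(1, 1), (1, 1), (2, 0), (0, 1)]
obs = joint_afs(counts, n1=2, n2=2)

# Invented expected spectra: "migration" puts mass on the diagonal
# (shared frequencies), "isolation" on private variants.
migration = [[0.00, 0.10, 0.05], [0.10, 0.40, 0.10], [0.05, 0.10, 0.10]]
isolation = [[0.00, 0.30, 0.10], [0.30, 0.05, 0.05], [0.10, 0.10, 0.00]]
print(composite_loglik(obs, migration) > composite_loglik(obs, isolation))
```

Because nearby SNPs are linked, this product over sites is a composite rather than a true likelihood, exactly the caveat raised in the text: parameter estimates remain usable, but likelihood-based confidence intervals do not apply directly.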
Coalescent theory. A theory that describes the distribution of gene trees (and ancestral recombination graphs) under a given demographic model and that can be used to compute the probability of a given gene tree.

If it is assumed that the SNPs segregate independently, then, given both an observed and an expected AFS for a model of interest, the likelihood can be directly calculated using a multinomial distribution. One difficulty is that in reality most data sets include many SNPs that are sufficiently close to one another that the assumption of independence does not apply. Still, the same likelihood calculation can be applied (now identified as a 'composite likelihood' 57) without introducing bias into the parameter estimates, albeit with limited access to confidence intervals and other analyses for which a likelihood is often used 64. By reducing the data to counts of SNP frequencies, AFS methods are also guilty of discarding all linkage information in the data. This means that these methods are not expected to be very sensitive to processes that can affect local LD patterns, such as gene flow or admixture. AFS-based analyses of population genomic data sets have so far mostly been conducted on human data 57,63,65, but the same approach can be used to study the divergence of closely related species. A nice example is the study of the divergence of Sumatran orangutans (Pongo abelii) and Bornean orangutans (Pongo pygmaeus) 13. Low-coverage (8×) Illumina sequencing of 5 individuals from each species yielded a total of 12.74 million SNPs, and an AFS analysis led to an estimated speciation time of 400,000 years with a low level of gene exchange between the species 13.

The AFS approach has also been applied to more complex models with more than two populations or species. One example comes from the analysis of human data from the 1000 Genomes Project under a three-population isolation with migration model with gene flow and population expansions 65.
By considering only SNPs at synonymous sites and by explicitly modelling genotype-calling errors, these authors estimated a time for the expansion out of Africa of around 51,000 years ago, a split between Europeans and East Asians around 23,000 years ago, recent population expansions in both Europeans and East Asians, and statistically significant but reduced gene flow among all populations.

However, AFS-based methods become computationally challenging and expensive for models with more than three populations. There is thus considerable interest in finding suitable approximations to the diffusion process that do not rely on a full multidimensional AFS 66–68. Recently, some new methods have appeared that implement simplified diffusion processes that do not include mutation models but that do account for divergence from common ancestry by genetic drift 66–68. The lack of a mutational component means that these methods are intended for cases of recent divergence among populations. By modelling the branch lengths of the population and species tree as proportional to drift and by treating drift in different branches as independent (with no gene flow), it is possible to write down a likelihood function, opening the door to inferring population and species trees. It is also possible to include admixture within this framework by allowing for one population to have ancestry in multiple populations 66. For instance, 60,000 SNPs from 82 dog breeds and wild canids (obtained with SNP arrays) supported a population tree with admixture events (as in FIG. 1c) rather than a pure isolation model 66 (as in FIG. 1a).

Likelihoods by sampling genealogies. If the recombination rate is low, such that recombination is unlikely to have occurred in the time since the common ancestor of a sample of sequences from one or various populations, as can be the case over a short region of the genome, the history of a sample of sequences can be described by a gene tree or genealogy (BOX 3).
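As a minimal illustration of what sampling genealogies means, the sketch below draws random gene trees under the standard single-population coalescent, in which, with k lineages present, the waiting time to the next coalescence is exponential with rate k(k-1)/2 (time in units of 2N generations). This is only the building block: the genealogy-sampling methods in the text use structured coalescent models with migration and integrate over many sampled trees. The function names here are invented.

```python
# Sample a genealogy under the standard (single-population) coalescent.
# The tree is stored as nested tuples; times are in units of 2N generations.
import random

def sample_genealogy(n, rng):
    """Coalesce n lineages; return (tree, time to most recent common ancestor)."""
    lineages = list(range(n))       # leaf labels 0..n-1
    t = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # waiting time to the next coalescence: Exp(rate = k*(k-1)/2)
        t += rng.expovariate(k * (k - 1) / 2.0)
        a, b = rng.sample(range(k), 2)          # pick a random pair to merge
        merged = (lineages[a], lineages[b])
        lineages = [x for i, x in enumerate(lineages) if i not in (a, b)]
        lineages.append(merged)
    return lineages[0], t

rng = random.Random(1)
# Average TMRCA over replicates; theory predicts E[TMRCA] = 2 * (1 - 1/n).
n = 10
mean_tmrca = sum(sample_genealogy(n, rng)[1] for _ in range(20000)) / 20000
print(round(mean_tmrca, 2))  # close to 2 * (1 - 1/10) = 1.8
```

In a likelihood method, each sampled genealogy would be scored against the observed sequence data (mutations dropped on branches in proportion to their lengths), and the likelihood is the average of these scores over trees, which is what Markov chain Monte Carlo samplers of genealogies approximate.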
The depth and structure of such genealogies have been described by coalescent theory for a diversity of models, including the models shown in FIG. 1 (REFS 69–71), and this coalescent modelling has made it possible to calculate the likelihood for data

Figure 3 | Allele frequency spectrum under alternative divergence models. Each entry of the matrix (x, y) corresponds to the probability of observing a single-nucleotide polymorphism (SNP) with a derived-allele frequency of x in population 1 and y in population 2. The colours represent the log of the expected probability for each cell of the allele frequency spectrum (AFS); the white colour corresponds to –Inf, that is, to cells with an expected probability of zero. These AFSs are conditional on polymorphic SNPs; hence, the cells (0,0) and (10,10) have zero probability. The likelihood for an observed AFS can be computed by comparing it with these expected AFSs. a | Isolation model. b | Isolation with migration. c | Isolation after migration. d | Secondary contact. The joint allele frequency spectra for the different scenarios were obtained with coalescent simulations carried out with ms 125. All scenarios were simulated assuming that all populations share the same effective size (N = 10,000) and a time of split t_s = 20,000 generations ago (t_s / 4N = 0.5); a symmetrical migration rate (2N_1 m_12 = 5, 2N_2 m_21 = 5) for scenarios b, c and d; for scenario c, a time of isolation t_i = 2,000 generations ago (t_i / 4N = 0.05); and, for scenario d, a time of secondary contact t_sc = 6,000 generations ago (t_sc / 4N = 0.15).
Name: LIU Chuanyu

High-throughput Sequencing and Platforms

Abstract

High-throughput sequencing, also called "next-generation" sequencing (NGS), can sequence hundreds of thousands to millions of DNA molecules at one time. Its reads are shorter than those of earlier sequencing techniques, a trade-off for this massive parallelism. Because high-throughput sequencing enables a detailed, panoramic analysis of a species' transcriptome and genome, it is also called deep sequencing.

Keywords

high-throughput sequencing; deep sequencing; next-generation sequencing; application; platforms; third-generation sequencing technology

Introduction

Judged by their history of development and their underlying sequencing principles, the techniques fall into the following kinds: Massively Parallel Signature Sequencing (MPSS), Polony sequencing [1], 454 pyrosequencing [2], Illumina (Solexa) sequencing [3], ABI SOLiD sequencing, Ion semiconductor sequencing, DNA nanoball sequencing, and so on. The scientific community has widely used second-generation sequencing to solve biological problems, for example sequencing a species that has no reference sequence at the genome level (de novo sequencing) [4] in order to obtain a reference sequence for that species. This work lays a foundation for subsequent research and molecular breeding.
For species that already have a reference sequence, whole-genome resequencing scans and detects mutations across the whole genome to find the molecular basis of individual differences. At the transcriptome level, whole-transcriptome sequencing is used to study alternative splicing and single-nucleotide polymorphisms in coding sequences (cSNPs), while small-RNA sequencing [5], in which RNA of a specific size is separated and sequenced, is used to discover new microRNAs. At the genome level, high-throughput sequencing combined with chromatin immunoprecipitation (ChIP) and methylated-DNA immunoprecipitation (MeDIP) can detect the DNA regions bound by specific transcription factors and the methylated sites on the genome.

Targeted resequencing is a notable application that combines high-throughput sequencing with microarray technology. First, oligonucleotide probes synthesized on a microarray hybridize to chosen parts of the genome, enriching those specific segments, which are then sequenced with second-generation sequencing. Currently, the manufacturers providing sequence capture are Agilent and NimbleGen, and exome sequencing is the most frequently used form. Many scientists now consider exome sequencing superior to whole-genome sequencing, not only because of its lower cost, but also because it produces a smaller amount of data to analyze and relates more directly to biological phenotypes. At present, high-throughput sequencing is widely used to search for candidate disease genes.

Summary of the Platforms

First-generation sequencing can read long fragments with high accuracy, but it costs too much and cannot practically sequence microscale DNA samples. 454 was the first commercial NGS platform; it was acquired by Roche but is still known by the name 454. This sequencer uses pyrosequencing.
Instead of using dideoxynucleotides to terminate chain extension, pyrosequencing relies on the detection of the pyrophosphate released during nucleotide incorporation. 454 uses beads, each starting with a single template molecule that is amplified via emulsion PCR (emPCR). Millions of beads are loaded onto a picotitre plate designed so that each well can hold only a single bead; all beads are then sequenced in parallel by flowing pyrosequencing reagents across the plate.

SOLiD, the third commercial NGS platform, was purchased by Applied Biosystems in 2006. The sequencer adopts two-base encoding based on sequencing by ligation. On a SOLiD flow cell, libraries are sequenced by ligation of 8-base probes. The fluorescent signal is recorded while a probe is hybridized to the template strand and is then removed by cleaving the probe's last three bases; the sequence of the fragment can be deduced after five rounds of sequencing using offset ("ladder") primer sets.

Solexa developed the second commercial NGS platform; the company was subsequently acquired by Illumina, and the platform is now known by that name. Illumina uses a solid glass surface [6] (similar to a microscope slide) to capture individual molecules, and bridge PCR to amplify the DNA into small clusters of identical molecules. These clusters are then sequenced with a strategy similar to Sanger sequencing, except that only dye-labelled terminators are added: the base at that position is determined for all clusters, then the dye is cleaved, and another round of dye-labelled terminators is added.

Helicos [7] developed the HeliScope, which was the first commercial single-molecule sequencer. Unfortunately, the high cost of the instrument and its short read lengths limited adoption of this platform.
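The SOLiD two-base encoding mentioned above can be made concrete with a short sketch. In the standard SOLiD colour code, each adjacent pair of bases maps to one of four colours (0–3), and a read is decoded from its known first (primer) base plus the colour calls; with bases indexed A=0, C=1, G=2, T=3, each colour is simply the XOR of the two base indices. The helper functions below are an illustration of that scheme, not vendor software.

```python
# Sketch of SOLiD-style two-base (colour-space) encoding and decoding.
# Colour of a dibase = XOR of the base indices (A=0, C=1, G=2, T=3),
# which reproduces the standard colour code, e.g. AA/CC/GG/TT -> 0.

BASES = "ACGT"
IDX = {b: i for i, b in enumerate(BASES)}

def encode_colorspace(seq):
    """Return (first_base, colour_list) for a DNA string."""
    colours = [IDX[a] ^ IDX[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colours

def decode_colorspace(first_base, colours):
    """Rebuild the base sequence from the known first base plus colour calls."""
    bases = [first_base]
    for c in colours:
        bases.append(BASES[IDX[bases[-1]] ^ c])
    return "".join(bases)

if __name__ == "__main__":
    primer_base, colours = encode_colorspace("ATGGCA")
    print(primer_base, colours)                      # A [3, 1, 0, 3, 1]
    print(decode_colorspace(primer_base, colours))   # ATGGCA
```

Note the design consequence: under this naive decoding, a single miscalled colour corrupts every downstream base, which is why SOLiD data are normally aligned in colour space (where an isolated colour error stays local) rather than decoded read by read.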
Helicos no longer sells instruments, but conducts sequencing via a service-centre model.

Conclusion

Second-generation sequencing has made an indelible contribution to biology, although some technical defects remain. Third-generation sequencing technology has now emerged, such as Single Molecule Real-Time (SMRT™) DNA sequencing [8], but second-generation sequencing is still used for large-scale projects; until third-generation technology fully matures, second-generation sequencing will not exit from sequencing projects.

This third generation, which can determine the base composition of single DNA molecules, is based on single-molecule and nanopore sequencing [9], and many companies are pursuing it, such as Pacific Biosciences [10], Oxford Nanopore, Genia, NABsys, NobleGen, and so on. Among these, gene-chip technology is also being used. Its advantages include longer reads, faster runs and higher accuracy, and it does not need elaborate sample preparation.

References

1. Porreca, G. J., Shendure, J. & Church, G. M. Polony DNA sequencing. Current Protocols in Molecular Biology, Chapter 7, Unit 7.8, doi:10.1002/0471142727.mb0708s76 (2006).
2. Serkebaeva, Y. M., Kim, Y., Liesack, W. & Dedysh, S. N. Pyrosequencing-based assessment of the bacteria diversity in surface and subsurface peat layers of a northern wetland, with focus on poorly studied phyla and candidate divisions. PLoS ONE 8, e63994, doi:10.1371/journal.pone.0063994 (2013).
3. Castoe, T. A. et al. Rapid microsatellite identification from Illumina paired-end genomic sequencing in two birds and a snake. PLoS ONE 7, e30953, doi:10.1371/journal.pone.0030953 (2012).
4. Sanders, S. J. et al. De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485, 237-241, doi:10.1038/nature10945 (2012).
5. Gausson, V. & Saleh, M. C. Viral small RNA cloning and sequencing. Methods Mol Biol 721, 107-122, doi:10.1007/978-1-61779-037-9_6 (2011).
6. Hofs, B., Brzozowska, A., de Keizer, A., Norde, W. & Cohen Stuart, M. A. Reduction of protein adsorption to a solid surface by a coating composed of polymeric micelles with a glass-like core. Journal of Colloid and Interface Science 325, 309-315, doi:10.1016/j.jcis.2008.06.006 (2008).
7. Kapranov, P., Ozsolak, F. & Milos, P. M. Profiling of short RNAs using Helicos single-molecule sequencing. Methods Mol Biol 822, 219-232, doi:10.1007/978-1-61779-427-8_15 (2012).
8. Liu, Y. & Wu, B. Q. [Third-generation DNA sequencing: single molecule real-time DNA sequencing]. Zhonghua Bing Li Xue Za Zhi (Chinese Journal of Pathology) 40, 718-720 (2011).
9. Astier, Y., Braha, O. & Bayley, H. Toward single molecule DNA sequencing: direct identification of ribonucleoside and deoxyribonucleoside 5'-monophosphates by using an engineered protein nanopore equipped with a molecular adapter. Journal of the American Chemical Society 128, 1705-1710, doi:10.1021/ja057123+ (2006).
10. Quail, M. A. et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13, 341, doi:10.1186/1471-2164-13-341 (2012).