Research on Copy Number Variation Detection Methods Based on Genome Sequencing Data

Abstract

With the rapid development of genome sequencing technology, personal genome sequencing has gradually become one of the main approaches for diagnosing diseases, developing treatments, managing health, and exploring the mysteries of life. It has greatly promoted the development of genetics, genomics, medical science, and related fields. Meanwhile, a growing body of research has shown that copy number variation (CNV), an important class of structural variation, is closely related to evolution, biodiversity, and a variety of complex and rare diseases. Studying CNVs is therefore very important for exploring the natural laws of organisms, understanding disease mechanisms, and finding the pathogenic targets of diseases. However, owing to the high complexity of the human genome, the large volume of sequencing data, and the technical limitations of current sequencing technology, identifying and analysing copy number variants (CNVs) quickly and effectively remains a great challenge.

This thesis focuses on the detection of CNVs from genome sequencing data. Its goals are to evaluate existing whole-exome sequencing (WES) CNV calling methods and to develop new methods that achieve higher sensitivity and specificity than current algorithms. The thesis also provides a new entropy-based method for detecting and analysing duplicated sequences in the human genome. The main research contents are as follows:

First, it is unclear how well current WES CNV calling algorithms perform on real sequencing data; in particular, there is currently no systematic evaluation criterion. This thesis proposes a series of WES CNV evaluation methods and uses them to evaluate four current WES CNV calling methods. The evaluation can provide a theoretical basis for the different experiments of scientists in different fields, and it lays a foundation for developing new WES CNV calling methods in the future.

Second, because the results of existing WES CNV calling methods are not ideal, a pooled-sample based WES CNV calling method is proposed. The method first uses principal component analysis (PCA) to denoise the WES data. It then integrates read depth (RD) and single nucleotide variant (SNV) information as the paired input signals of a hidden Markov model (HMM).
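As a rough illustration of how two signals can feed one HMM, the sketch below runs Viterbi decoding over three hidden states (deletion, normal, duplication) and scores each exon by the sum of an RD log-likelihood and an SNV log-likelihood. Everything specific here is an assumption made for illustration and is not taken from the thesis: the Gaussian emission on log2 RD ratios, the per-state probability of seeing a heterozygous SNV, the conditional independence of the two signals, the `stay` transition probability, and the names `rd_loglik`, `snv_loglik` and `viterbi_paired`. The PCA denoising step is not shown.

```python
import math

STATES = ("del", "normal", "dup")
# Assumed mean log2 read-depth ratio per state (1, 2 and 3 copies).
RD_MEAN = {"del": -1.0, "normal": 0.0, "dup": 0.585}
RD_SD = 0.3  # assumed noise level of the log2 ratio
# Assumed probability that an exon shows a heterozygous SNV in each state.
HET_PROB = {"del": 0.01, "normal": 0.3, "dup": 0.3}


def rd_loglik(x, state):
    """Gaussian log-density of a log2 RD ratio under a state."""
    z = (x - RD_MEAN[state]) / RD_SD
    return -0.5 * z * z - math.log(RD_SD * math.sqrt(2 * math.pi))


def snv_loglik(has_het, state):
    """Bernoulli log-probability of observing (or not) a heterozygous SNV."""
    p = HET_PROB[state]
    return math.log(p if has_het else 1.0 - p)


def viterbi_paired(rd, het, stay=0.98):
    """Most likely state path given paired (RD, SNV) observations."""
    switch = (1.0 - stay) / (len(STATES) - 1)

    def trans(a, b):
        return math.log(stay if a == b else switch)

    def emit(t, s):
        # Paired emission: the two signals are treated as independent.
        return rd_loglik(rd[t], s) + snv_loglik(het[t], s)

    score = {s: math.log(1.0 / len(STATES)) + emit(0, s) for s in STATES}
    back = []
    for t in range(1, len(rd)):
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + trans(p, s))
            new_score[s] = score[prev] + trans(prev, s) + emit(t, s)
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these assumed parameters, a run of exons whose log2 ratio sits near -1 and that carry no heterozygous SNVs is decoded as a deletion, because both signals agree and together outweigh the cost of switching state twice.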

Third, to further improve the efficiency of identifying CNVs from WES data, a hybrid approach to CNV detection from WES data is proposed. First, a single-sample based WES CNV calling method is proposed, which aims to avoid the problem of excessive noise reduction in the pooled-sample based model. The single-sample based model uses a median method to normalize biases from known sources and fits the normalized RD signal with a negative binomial distribution. Then, a paired HMM is proposed to identify CNVs from RD and SNV information. Finally, a merging algorithm is proposed to integrate the results of the pooled-sample based and single-sample based methods into the final CNV result.
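The normalization and fitting steps of a single-sample model can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis's procedure: it assumes median normalization means rescaling each exon's depth by the ratio of the global median to the median of its bias bin (here a hypothetical GC-content bin), and it estimates the negative binomial parameters by the method of moments, which may differ from the estimator used in Chapter 4. The function names `median_normalize` and `fit_negative_binomial` are illustrative.

```python
from collections import defaultdict
from statistics import mean, median, pvariance


def median_normalize(depths, gc_bins):
    """Scale each depth by global_median / median of its GC bin.

    Assumes every bin has a non-zero median depth.
    """
    overall = median(depths)
    by_bin = defaultdict(list)
    for d, b in zip(depths, gc_bins):
        by_bin[b].append(d)
    bin_median = {b: median(v) for b, v in by_bin.items()}
    return [d * overall / bin_median[b] for d, b in zip(depths, gc_bins)]


def fit_negative_binomial(values):
    """Method-of-moments estimate of (r, p).

    Uses the parameterization with mean r(1-p)/p and variance r(1-p)/p**2,
    so p = mean/variance and r = mean**2 / (variance - mean).
    """
    m = mean(values)
    v = pvariance(values)
    if v <= m:
        raise ValueError("no overdispersion: variance must exceed the mean")
    return m * m / (v - m), m / v
```

After normalization every bin has the same median depth, so a bin-specific bias no longer shifts the signal. The negative binomial is a natural choice when the variance of the depths exceeds their mean, which a Poisson model cannot represent.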

Fourth, a generalized topological entropy is proposed for analysing duplicated genome sequences. The relationship between generalized topological entropy and topological entropy is proved mathematically. The generalized topological entropy is applied to genomic elements, segmental duplications in the human reference genome, and short tandem repeats in personal genomes. This offers a new perspective on duplicated genome sequences and suggests a new approach to precisely identifying copy number duplications in the future.
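For orientation, the classical (not generalized) topological entropy of a finite sequence can be computed as in the sketch below. It follows one common finite-sequence formulation: choose the largest n with 4^n + n - 1 no greater than the sequence length, count the distinct n-mers in the first 4^n + n - 1 letters, and take the base-4 logarithm divided by n. Whether this matches the exact definition in Section 5.2.1 of the thesis is an assumption, and the generalized variant proposed in the thesis is not reproduced here. The name `topological_entropy` is illustrative.

```python
import math


def topological_entropy(seq, alphabet_size=4):
    """Return log_k(number of distinct n-mers) / n on a fixed-length prefix.

    n is the largest integer with k**n + n - 1 <= len(seq); the prefix of
    that length contains exactly k**n windows of length n.
    """
    k = alphabet_size
    n = 0
    while k ** (n + 1) + (n + 1) - 1 <= len(seq):
        n += 1
    if n == 0:
        raise ValueError("sequence is too short for any window length")
    prefix = seq[: k ** n + n - 1]
    distinct = {prefix[i : i + n] for i in range(len(prefix) - n + 1)}
    return math.log(len(distinct), k) / n
```

The value lies between 0 and 1. A highly repetitive sequence such as a long run of a single base contains only one distinct n-mer and scores 0, while a sequence in which every possible n-mer occurs scores 1, which is why low entropy can flag duplicated or repetitive regions.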

In conclusion, this thesis provides a series of comprehensive and objective criteria for evaluating CNV results identified from WES data. It proposes a new pooled-sample based method and a new hybrid approach for WES CNV calling, both of which integrate RD and SNV information into a paired HMM. These two methods achieve better sensitivity and specificity, and have high practical significance and application value. A duplicated genome sequence detection method based on generalized topological entropy is also proposed and applied to genomic elements, segmental duplications, and short tandem repeats, which has theoretical and practical significance.

Keywords: genome sequencing, whole-exome sequencing, copy number variation, hidden Markov model, generalized topological entropy

Contents

Abstract (In Chinese)

Abstract (In English)

Chapter 1 Introduction

1.1 Background, objective and significance of the subject

1.1.1 Background of the subject

1.1.2 Objective and significance of the subject

1.2 Related background

1.2.1 Genome variation

1.2.2 Genome sequencing

1.2.3 The basic format of genomic data

1.2.4 The composition of genome sequence

1.3 Review of related research

1.3.1 Denoising methods for sequencing data

1.3.2 Copy number variation detection methods

1.3.3 Evaluation metrics for copy number variation calling results

1.3.4 Genome sequence research based on entropy

1.4 Main research contents of this subject

Chapter 2 Evaluation methods of copy number variation detection from whole-exome sequencing data

2.1 Introduction

2.2 Evaluation methods of WES CNV calling results

2.2.1 Evaluation method of CNV result concordance with WGS data

2.2.2 Evaluation method of the concordance between tagSNPs and common CNVs

2.2.3 Evaluation method of CNV Mendelian error rates

2.2.4 Evaluation method of heterozygosity check for deletions

2.2.5 Other routine evaluation methods

2.3 Experimental data and CNV detection methods

2.4 Experimental results and analyses of WES CNV calling methods

2.4.1 Evaluation results and analysis of the concordance of CNVs identified by WES and WGS

2.4.2 Evaluation results and analysis of the concordance between tagSNPs and common CNVs

2.4.3 Evaluation results and analysis of CNV Mendelian error rates

2.4.4 Evaluation results and analysis of heterozygosity check for deletions

2.4.5 Evaluation results and analysis of CNV size and distribution

2.4.6 Evaluation results and analysis of the concordance across four CNV-calling methods

2.4.7 Causal analysis of low sensitivity and specificity

2.4.8 Analysis of the advantages and disadvantages of four WES CNV calling methods

2.5 Brief summary

Chapter 3 Method of copy number variation detection from pooled-sample based whole-exome sequencing data

3.1 Introduction

3.2 Pooled-sample based sequencing data normalization and denoising method

3.2.1 Collection and normalization of genomic sequencing data

3.2.2 Denoising method of pooled-sample based WES data

3.3 Segmentation algorithm for CNV regions of pooled-sample based WES data

3.3.1 Hidden state set and transition probabilities

3.3.2 Emission probabilities of read depth signal

3.3.3 Emission probabilities of SNV signal

3.3.4 Overall emission probabilities of paired hidden Markov model

3.4 Experimental results and analysis of pooled-sample based WES CNV calling methods

3.4.1 Sensitivities of experimental results of pooled-sample based WES CNV calling methods

3.4.2 Specificities of experimental results of pooled-sample based WES CNV calling methods

3.5 Brief summary

Chapter 4 A hybrid approach for copy number variation detection from whole-exome sequencing data

4.1 Introduction

4.2 Analysis of influencing factors on read depth in WES data

4.2.1 Analysis of the influence of GC content on the read depth

4.2.2 Analysis of the influence of mappability on the read depth

4.2.3 Analysis of the influence of exon length on the read depth

4.3 Denoising method of single-sample based WES data

4.4 Research on the probabilistic model of single-sample based WES data

4.4.1 The probabilistic model of the read depth on single-sample based WES data

4.4.2 Parameter estimation of negative binomial distribution

4.5 Merging method of CNV results identified by hybrid approach

4.6 Experimental results and analysis of WES CNV calling by hybrid approach

4.6.1 Sensitivities of experimental results of WES CNV calling by hybrid approach

4.6.2 Specificities of experimental results of WES CNV calling by hybrid approach

4.7 Brief summary

Chapter 5 Analysis method of duplicated sequences based on generalized topological entropy

5.1 Introduction

5.2 Genomic sequence analysis based on generalized topological entropy

5.2.1 The concept, computational method and limitations of topological entropy

5.2.2 The concept and computational method of generalized topological entropy

5.2.3 Approximation algorithm of generalized topological entropy on finite sequence

5.3 Experimental results and analysis of genomic elements and duplicated sequences

5.3.1 Experimental results and analysis of genomic elements

5.3.2 Experimental results and analysis of segmental duplication

5.3.3 Experimental results and analysis of short tandem repeats

5.4 Brief summary

Conclusions

References

Papers published during the Ph.D. program

Statement of copyright and letter of authorization

Acknowledgements

Resume
