Submodular approximation sampling-based algorithms and lower bounds

格式：pdf
大小：232.21 KB
文档页数：26

下载文档原格式

采样算法(Sampling Method,HMC)

Random Walk
• 构造Markov Chain
状态空间为：{-5, -4, -3, -2, -1, 0, +1, +2, +3, +4, +5}
起始状态为：0
0.5 x' x ' ' T ( x , x ) 0.25 x x 1 状态转移概率为： 0 otherwise
Metropolis-Hasting Sampling
• 算法：
自定义proposal distribution
g ( j | i) 1. 从proposal distribution生成候选采样点：
2. 计算接受概率：
候选状态当前状态
j g (i | j ) ( j | i) min{1, } i g ( j | i)
即该Markov Chain从不同的初始概率开始转移经过相当长的步骤之后
都收敛到稳定分布。
pij (n) 0 判定一个Markov Chain为ergodic，只需证明
p 易知一个Markov Chain的稳定分布
再进行判定即可，则可求出其稳定分布
Markov Chain Monte Carlo (MCMC)
Metropolis-Hasting Sampling
• 证明：
P 只需证明该Markov Chain为reversible，并且转移概率为： ij g ( j | i)
不失一般性，假设 j g (i | j) i g ( j | i)
则可得转移概率 Pij g ( j | i ) ( j | i)
i i i (n) ( 1( n ) , 2 ,

Demographic_History综述

One of the many doors that open when a species’ genome is first sequenced is to the world of popula-tion genomics and to the unparalleled study of the evolutionary divergence of closely related populations and species. Ever since Darwin developed his model of how one species splits into two (his principle of diver-gence)1, the field of evolutionary biology has been divided on the question of whether Darwin’s model is correct. With growing access to very large amounts of genome data from multiple individuals of a species, it becomes increasingly likely that we will resolve this long-standing question.Recent next-generation sequencing (NGS) technolo-gies and assembly tools, such as restriction-site-associated DNA (RAD-tag)2 sequencing and genotyping by sequencing (GBS)3, now make it possible to obtain genome-scale data affordably from multiple individuals (reviewed in REFS 4,5). When individuals are sampled from multiple populations of a species, as has been done for humans 6–8, stickleback fish 2, fruitflies (Drosophila Population Genomics Project ), Arabidopsis thaliana (1001 Genome Project )9, dogs 10 and different species of great apes 11–13, among others 14,15, we gain an excep-tional view not only on the variation within populations and species but also on the variation that lies between them. Data sets such as these can include millions of variable single-nucleotide polymorphisms (SNPs) and other kinds of polymorphisms and hold the promise of finally revealing the genetic side of how species and populations diverge.However, a flood of new data may not lead directly to a commensurate gain in knowledge, and today, as new population genomic data sets are emerging, our skills of analysis and interpretation are partly overwhelmed. With the rise of large NGS data sets that reveal com-plex patterns of variation across species’ genomes, we find that our best models and tools for explaining pat-terns of variation were designed for a simpler time and smaller data sets. In the first place, NGS data sets present unique challenges, apart from their size, that result from the way in which they are generated. For example, it is common to use a reference genome to aid the assembly of additional NGS data, and yet this introduces a form of ascertainment bias , or reference bias, that can affect one’s findings 16,17 (BOX 1). Second, most models and methods available to analyse NGS data have limitations that pre-vent using all of the information in the data that bears on the processes of interest.In this Review, we survey the state of the art of population divergence models and inference methods, with regard to population genomics data sets. We do not examine in detail the technical challenges that are related to NGS for correcting sequencing, assembly, SNP and genotype-calling errors, as these have recently been reviewed elsewhere 4,17,18. Rather, we focus on models of population divergence and on methods to detect and to quantify gene flow, as well as methods to distinguish alternative modes of speciation. We discuss the limita-tions of these methods and provide examples of their application to recently available genome-wide data sets.Department of Genetics, Rutgers, the State University of New Jersey, Piscataway, New Jersey 08854, USA.Correspondence to J.H.e-mail: hey@doi:10.1038/nrg3446Published online 9 May 2013Single-nucleotide polymorphisms(SNPs). Sites in the DNA in which there is variation across the genomes in a population, usually comprising two alleles that correspond to two different nucleotides.Ascertainment biasSystematic bias introduced by the sampling design (for example, criteria used to select individuals and/or genetic markers) that induces a nonrandom sample of observations.Understanding the origin of species with genome-scale data: modelling gene flowVitor Sousa and Jody HeyAbstract | As it becomes easier to sequence multiple genomes from closely related species, evolutionary biologists working on speciation are struggling to get the most out of very large population genomic data sets. Such data hold the potential to resolve long-standing questions in evolutionary biology about the role of gene exchange in species formation. In principle, the new population genomic data can be used to disentangle the conflicting roles of natural selection and gene flow during the divergence process. However, there are great challenges in taking full advantage of such data, especially with regard to including recombination in genetic models of the divergence process. Current data, models, methods and the potential pitfalls in using them will be considered here.REVIEWSModels of species formationSpeciation in the absence of gene flow. When two popula-tions become allopatric (that is, completely geographically separated), they can diverge without mixing genes and eventually become reproductively isolated 19. Compared with divergence in the presence of gene exchange, thisis a comparatively simple process, and this simplicity, together with clear biogeographic evidence (such as the finding that in island archipelagos it is common to find different species on different islands), convinced many that this was how nearly all new species formed 19–24.Speciation in the presence of gene flow. Darwin, how-ever, envisioned that natural selection can act in dis-parate ways over a species range to pull a species in different directions and eventually to split it, first into different varieties and finally into separate species. This model, which has come to be called ‘sympatric speciation ’, was considered by many to be unlikely as gene exchange across a species range was considered to be a strong homogenizing force that counteracts divergence by natu-ral selection. However, more recently, genetic data (for example, mitochondrial DNA sequences and microsat-ellites) together with biogeographic circumstances have provided compelling evidence that sympatric speciation has occurred in numerous contexts 25–27.At the genetic level, Darwin’s model raises complexities, as it predicts that divergence in the presence of gene flow can cause different genes to experience very dif-ferent histories. Diversifying selection favours different alleles in different parts of a species range (and at one or more loci). However, the movement of all genes across the range of the species as the normal result of organ-isms reproducing and dispersing will regularly move alleles that are affected by the diversifying selection into the ‘wrong’ part of the species range. It will also cause the species to appear to be homogeneous when examined for patterns of variation at neutral genes .An additional major player in the divergence process, particularly when gene flow is occurring, is recombina-tion, which is the breaking and rejoining of chromo-somes that happens during meiosis every generation and that allows different parts of the genome to have different histories. Because of recombination, an allele that is favoured by selection and that is increasing in frequency will carry with it in its trajectory towards fixation only those flanking regions to which it is most tightly linked 28,29. Recombination also makes it possible for alleles at neutral loci to move by gene flow across the species range and to co-occur in the same population of genomes in which there are loci diverging by the action of diversifying selection 30,31. Recombination thus allows a species to have a population of genomes with a split personality: to resemble two diverging gene pools at loci affected by diversifying selection and to resemble a sin-gle gene pool at loci that are not under selection in this way. Evidence of this kind of genomic schism has come from a diversity of systems in recent years, on the basis of DNA Sanger sequence and microsatellite data 32 and more recently from NGS data in stickleback fish 2, Heliconius butterflies 14 and flycatchers 15.Modelling population divergence. A widely used theo-retical framework for studying speciation using genetic data is the ‘isolation with migration’ model, so named because it includes both the separation of two popula-tions (a process called isolation) following a splitting| Geneticsdiﬀerent from referenceREVIEWSSympatric speciationThe process of divergence between populations or species occupying the same geographical area and in presence of gene flow.Diversifying selectionNatural selection acting towards different alleles (or phenotypes) being favoured in different regions within a single population or among multiple connected populations.Neutral genesGenes for which geneticpatterns are mostly affected by mutation and demographic factors, such as genetic drift and migration.Allopatric divergenceThe process of divergence between populations orspecies that are geographically separated, in the absence of gene flow.Linkage disequilibrium(LD). The nonrandom association of alleles at different sites or loci.Islands of differentiationGenomic regions of elevated differentiation owing to the action of natural selection.F STThe proportion of the total genetic variability occurring among populations, typically used as a measure of the level of population genetic differentiation.Island modelA model introduced by Sewall Wright to study population structure comprising multiple populations connected to each other through migration.Metapopulation modelIn the context of F ST -based statistics, this is an idealized model in which severalpopulations diverge without migration from a common ancestral gene pool (or metapopulation).event from their common ancestral population as well as migration between populations 33–35. At one extreme, we can consider a simple isolation model in which the migration rate is zero in both directions; this corresponds to an allopatric divergence scenario (FIG. 1a). Other models include isolation with migration (FIG. 1b), isolation after migration (FIG. 1c) and secondary contact (FIG. 1d). It has been shown that patterns of genetic variation in samples from two closely related populations or species can be used to distinguish a pure isolation model (FIG. 1a) from a model with migration 33,34 (FIG. 1b–d). Furthermore, the growing evidence of persistent gene exchange between closely related species means that divergence often arises in the midst of conflicting evolutionary processes 32.Inferring the history of divergenceNGS data from multiple individuals offer the promise of disentangling the complex interplay between selec-tion, gene flow and recombination that occurs during speciation with gene flow. First, by having information for essentially all parts of the genome, we can gain a more detailed and accurate picture of the demography of popu-lations 36,37. Second, it becomes possible to ask whether some parts of the genome have been exchanging genes more than others. Substantial variation in gene flow lev-els across the genome constitutes clear, albeit indirect, evidence that selection is acting against gene flow to a greater degree in some genome regions than in oth-ers 38,39. Third, NGS data allow us to get better estimates of recombination rates and linkage disequilibrium (LD) patterns along the genome 40,41, and this can in princi-ple be used to infer the timing and magnitude of gene flow. Finally, polymorphism and LD along the genome also bear information about selective sweeps and genes that are the targets of diversifying selection (reviewed in REFS 42,43). However, all of these inferences depend on having a theoretical framework that connects patterns of variation to an explicit model.Genome scans using indicators of divergence. Depending on an investigator’s question, it can sometimes be useful to take a fairly simple approach that does not use models with lots of parameters to study the levels of divergence between populations. This can be achieved by tailoring analyses to a specific component of the divergence pro-cess and scanning across the genome while calculating statistics that are expected to be sensitive to that feature. For example, there has been lots of interest in detecting ‘islands of differentiation ’ by looking at the distribution of summary statistics that measure genetic differen-tiation, such as F ST 44. In the first study using RAD-tag sequencing, the differentiation of 45,789 SNPs along the genome between oceanic and freshwater populations of threespine sticklebacks (Gasterosteus aculeatus ) showed, overall, reduced levels of differentiation (F ST values close to zero)2. However, when a sliding window was run along the aligned genomes of freshwater and oceanic populations, the authors found evidence for genomic regions characterized by very high F ST values (>0.35), potentially harbouring genes under divergent selection. Interestingly, the same genomic regions were high-lighted in contrast to different freshwater populations, suggesting parallel adaptation to the freshwater environ-ment. These results are in agreement with a larger study comprising seven pairs of closely related marine andfreshwater populations comprising 5,897,368 SNPs 45.Divergence summaries can also be used within demographic models of divergence, such as an island model or a metapopulation model (reviewed in REFS 46,47). One approach is to scan the genome using a hierarchical F ST model that assumes a nested island model underlying the divergence process 48. A related approach explicitly accounts for variation in read depth in NGS data within a Bayesian framework 49 and hence should be preferred for analysing such data.Another type of genome scan that is targeted to identify recent admixture relies on comparing the| Genetics Figure 1 | Alternative modes of divergence. All models assume that an ancestral population of size N A splits into twopopulations at time of split (t s ). The two present-day populations have effective sizes N 1 and N 2, respectively. Panel ashows the model in which migration rate is zero in both directions, which corresponds to an allopatric divergence scenario. Panels b –d represent alternative models in which populations have been exchanging migrants. Gene flow occurs at constant rates since the split from the ancestral population (b ). Migration rates are assumed to be constant through time, but gene flow can be asymmetric: that is, one migration rate for each direction. Panel c shows a scenario in which populations begin diverging in the presence of gene flow but experience a cessation of gene flow after time since isolation (t i ). If the lack of current gene flow in this model is due to reproductive isolation then this represents a history in which divergence occurred to the point of speciation in the presence of gene flow. In panel d , we consider the alternative migration history in which populations were isolated and diverged for a period of time in the absence of gene flow, followed by secondary contact at time of secondary contact (t sc ) and the introgression of alleles from the other population by gene flow.Nested island modelA hierarchical island model with groups of populations in which migration among populations within the same group is higher than among populations in different groups. Gene treesBifurcating trees that represent the ancestral relationships of homologous haplotypes sampled from a single or multiple populations. A gene tree includes coalescent events and, in models with gene flow, migration events. A gene tree is characterized by a topology, branch lengths, coalescence times and migration times.population tree (assumed to be known) with the genetrees inferred at a specific site. Incongruences betweenthe population tree and the gene tree can be due toincomplete lineage sorting (shared ancestral poly-morphism) or to gene flow. One statistic, called ‘D’,was specifically designed to detect introgression fromone population to another50(FIG. 2). Computing Drequires a genome from each of two sister populations,a genome from a third population (a potential sourceof introgressed genes) and a fourth outgroup genometo identify the ancestral state (identified as the A allele).Focusing on SNPs in which the candidate source popu-lation has the derived allele (B) and in which the twosister genomes have different alleles, there are two pos-sible configurations: either ABBA or BABA. Under thehypothesis of shared ancestral polymorphism, the num-ber of tree topologies of ABBA and BABA are expectedto be equal, and the expected D will be zero. Deviationsfrom that expectation are interpreted as evidence ofintrogression. As with FSTgenome scans, investigatorscan look at the distribution of D along the genome,but when using D, the aim is to find genomic regionsthat specifically experienced introgression, whereas inthe case of FST, the goal is to identify regions of highdifferentiation, regardless of the cause.Genome scans using D were used, for instance,to detect admixture between archaic and modernhumans51,52 and to study the patterns of introgressionin Heliconius butterflies14. In the case of modern andarchaic humans, unidirectional introgression fromNeanderthals to non-African humans was estimatedto have occurred for 1–4% of the genome51. Similarly,data from 642,690 SNPs point to 4–6% of the presentday Melanesian genomes being derived from admix-ture with Denisovans52. For Heliconius butterflies,RAD-tag sequencing of 4% of the genome (~12 Mb)indicated introgression from Heliconius timareta toHeliconius melpomene amaryllis (2–5% admixture),which are sympatric species that exhibit the samewing colour patterns. Interestingly, only a few regionsexhibited significant D values, including genes knownto contain the mimicry loci B/D and N/Yb. Despite thelack of an explicit test of positive selection, the fact thatthese regions harbour genes involved in mimicry is inagreement with an active role of selection promotingintrogression at these regions. In these species, the pat-terns of differentiation along the genome suggest a casein which most of the genome is differentiated — con-sistent with a model of allopatric divergence (FIG. 1a) ordivergence with limited gene flow (FIG. 1b) — whereasa few regions show evidence of secondary contact anduni- or bidirectional introgression of genes from onepopulation (species) to the other (FIG. 1c). In both cases ofhumans and Heliconius spp., there was evidenceof regions exchanged between populations that werealready differentiated, pointing to the importance ofsecondary contact.Although genome scan approaches are flexible andapplicable to large genomic data sets, the focus onamenable summary statistics typically entails settingaside much of the information in a data set. A relatedlimitation is that the same numerical value of a par-ticular statistic can result from very distinct scenarios.For instance, a low FSTcan be due to shared ancestralpolymorphism or due to gene flow44. Similarly, the Dstatistic can be significantly different from zero owingto other events rather than admixture. The evidenceof admixture between modern human non-Africanpopulations and Neanderthals has been questionedby a simulation study showing that spatial expansionsand population substructure without admixture couldlead to D values that are similar to the observed ones53.|Genetics→BAncestral polymorphismIntrogression (gene ﬂow)Figure 2 | Disentangling ancestral polymorphism from gene flow (ABBA and BABA test). The diagram shows thedivergence of two sister populations (1 and 2), a third population (potential source of introgressed genes; 3) and anoutgroup population (4) over time. The black line represents the gene tree of a given site, and the star represents amutation from the ancestral state (allele A) to the derived state (allele B). The pattern ABBA can occur owing toan ancestral polymorphism (a): that is, coalescent of lineage from population 2 with lineage from population 3 in theancestral population (population ancestral to populations 1, 2 and 3), or gene flow from population 3 to population 2(b). Under a model with no gene flow, we expect that the pattern ABBA is as frequent as BABA owing to the fact thatthere is 50% chance that either the lineage from population 1 or from population 2 coalesces with lineage frompopulation 3 in the population ancestral to populations 1, 2 and 3.REVIEWSBayesian statistics Statistical framework in which the parameters of the models are treated as random variables, allowing expression of the probability of parameters, given the data; this is called the posterior. The posterior probability is obtained by Bayes’ rule, and it is proportional to the likelihood times the prior.Allele frequency spectrum (AFS). A distribution of the counts of single-nucleotide polymorphisms with a given observed frequency in a single or multiple populations. Genetic driftStochastic changes in gene frequency owing to finite size of populations, resulting from the random sampling of gametes from the parents at each generation.Likelihood and model-based methods. As useful asgenome scans with indicator variables can be to identifycomponents of the divergence process, they fall short ofproviding a full portrait of divergence unless they arecombined with other analyses. In this light, the goal formany investigators is to be able to calculate the likeli-hood under a rich divergence model. For some model ofdivergence M, with a parameter set Θ, the likelihood isthe probability (P) of the data given the parameters: thatis, PM(Data | Θ). Having a likelihood function at handallows estimating the most likely parameters of a givenmodel either with frequentist or Bayesian statistics54. Also,comparing the likelihood of alternative models opensthe door to model choice approaches to infer the mostprobable divergence model. Currently, there are twomain families of likelihood-based approaches for study-ing divergence: one based on the allele frequency spectrum(AFS) and a second based on sampling genealogies forshort portions of the genome (BOX 2).Likelihoods using the allele frequency spectrum. For asingle SNP sampled in each of two populations, consid-ered together with the base that is present in an outgroupgenome, the data can be summarized as the number ofcopies of the derived allele in each of the two popula-tions. For a large number of SNPs, these counts fill adiscrete distribution — the allele frequency spectrum(AFS) — in two dimensions (one for each sampled pop-ulation), which can be represented in graphical form(FIG. 3). This approach has seen renewed interest as largeSNP data sets have become more common55–57. FIGURE 3shows how the AFS can vary considerably for the dif-ferent isolation with migration models shown in FIG. 1,particularly how simple isolation differs from modelswith gene flow. In the absence of gene flow (FIG. 3a), thefrequencies of SNPs found in only one population aredifferent from the SNPs in the other populations becausegenetic drift drives different alleles to fixation in eachpopulation. By contrast, in models with gene flow, thecells along the diagonal exhibit a higher density (FIG. 3b)because there are many SNPs with similar frequenciesin the two populations. However, as exemplified in theseAFSs, it can be difficult to separate alternative scenarioswith gene flow, as these tend to be similar (FIG. 3b–d).Although the expected AFS can be generated bysimulations55,58,59, it is also the focus of a populationgenetic theory in which differential equations describethe diffusion of allele frequencies in populations60,61. Inrecent years, the diffusion equation approach has beenreawakened for the study of the AFS under isolationwith migration models, such as the ones shown in FIG. 1(REFS 57,62,63). If it is assumed that the SNPs segregateCoalescent theoryA theory that describes thedistribution of gene trees (and ancestral recombination graphs) under a givendemographic model that canbe used to compute theprobability of a given gene tree.independently, then given both an observed and an expected AFS for a model of interest, the likelihood canbe directly calculated using a multinomial distribution. One difficulty is that in reality most data sets include many SNPs that are sufficiently close to one another that the assumption of independence does not apply. Still, the same likelihood calculation can be applied (now identi-fied as a ‘composite likelihood’57) without introducing bias to the parameter estimates, albeit with limited access to confidence intervals and other analyses for which a likelihood is often used 64. By reducing the data to counts of SNP frequencies, AFS methods are also guilty of discarding all linkage information in the data. This means that these methods are not expected to be verysensitive to processes that can affect local LD patterns,such as gene flow or admixture. AFS-based analyses onpopulation genomic data sets have so far mostly beenconducted on human data 57,63,65, but the same approachcan be used to study the divergence of closely related species. A nice example is the study of the divergence ofSumatran orangutans (Pongo abelii ) and Bornean oran-gutans (Pongo pygmaeus )13. Low-coverage (8×) Illumina sequencing of 5 individuals from each species yielded a total of 12.74 million SNPs, and an AFS analysis led to an estimated speciation time of 400,000 years with a low level of gene exchange between the species 13.The AFS approach has also been applied to morecomplex models with more than two populations or spe-cies. One example comes from the analysis of humandata from the 1000 Genomes Project under a three-population isolation with migration model with gene flow and population expansions 65. By considering onlySNPs at synonymous sites and by explicitly model-ling genotype calling errors, these authors estimated a time for expansion out of Africa around 51,000 years ago, a split between Europeans and East Asians around 23,000 years ago, recent population expansion in both Europeans and East Asians and statistically significant but reduced gene flow among all populations.However, AFS-based methods become computa-tionally challenging and expensive for models with more than three populations. There is thus considerable interest in finding suitable approximations to the diffu-sion process that do not rely on a full multidimensionalAFS 66–68. Recently, some new methods have appeared that implement simplified diffusion processes that do not include mutation models but that do account for divergence from common ancestry by genetic drift 66–68.The lack of a mutational component means that these methods are intended for cases of recent divergence among populations. By modelling the branch lengths ofthe population and species tree as proportional to drift and by treating drift in different branches as independ-ent (with no gene flow), it is possible to write down a likelihood function, opening the door to infer the pop-ulation and species trees. It is also possible to include admixture within this framework by allowing for one population to have ancestry in multiple populations 66.For instance, 60,000 SNPs from 82 dog breeds and wild canids (obtained with SNP arrays) supported a popula-tion tree with admixture events (as in FIG. 1c ) rather thana pure isolation model 66 (as in FIG. 1a ).Likelihoods by sampling genealogies. If the recombina-tion rate is low, such that it is unlikely to have occurred in the time since the common ancestor of a sample of sequences from one or various populations, as can be the case over a short region of the genome, the history of a sample of sequences can be described by a gene tree or genealogy (BOX 3). The depth and structure of such genealogies have been described by coalescent theory for a diversity of models, including the models shown in FIG. 1 (REFS 69–71), and this coalescent modelling has made it possible to calculate the likelihood for data| Geneticsa clog(prob)Population 1Population 1Population 1Population 1P o p u l a t i o n 2P o p u l a t i o n 2P o p u l a t i o n 2bd P o p u l a t i o n2Figure 3 | Allele frequency spectrum under alternative divergence models. Each entry in the matrix (x ,y ) corresponds to the probability of observing a single-nucleotide polymorphism (SNP) with frequency of derived allele x in population 1 and y in population 2. The colours represent the log of the expected probability for each cell ofthe allele frequency spectrum (AFS). The white colour corresponds to –Inf: that is, tocells with an expected probability of zero. These AFSs are conditional on polymorphicSNPs, hence the cells (0,0) and (10,10) have zero probability. The likelihood for anobserved AFS can be computed by comparing it with these expected AFSs. a | Isolation model. b | Isolation with migration. c | Isolation after migration. d | Secondary contact.The joint allele frequency spectrums for the different scenarios were obtained withcoalescent simulations carried out with ms 125. All scenarios were simulated, assuming all populations share the same effective sizes (N = 10,000), a time of split t s = 20,000 generations ago (t / 4N = 0.5), symmetrical migration rate (2N 1m 12 = 5, 2N 2m 21 = 5, for scenarios b , c and d , for scenario c , a time of isolation of t i = 2,000 generations ago (t i / 4N = 0.05) and, for scenario d , a time of secondary contact of t sc = 6,000 generationsago (t sc / 4N = 0.15).REVIEWS。

21种nr的iqa处理算法及通用函数

21种nr的iqa处理算法及通用函数在自然语言处理中，IQA（Information Quality Assessment）是指评估文本中信息的质量和可靠性。

NR（Noise Reduction）是一种常用的IQA处理算法，旨在减少文本中的噪声（即无用信息）并提高文本的准确性和可信度。

下面是介绍21种NR的IQA处理算法及其通用函数。

1. 去除停用词（Stopword Removal）：通过删除常见且没有实际含义的词语来减少文本中的噪声。

常用的停用词包括“是”、“的”、“在”等。

```pythondef remove_stopwords(text):stopwords = set(['是', '的', '在']) # 常见停用词列表words = text.splitwords = [word for word in words if word not in stopwords]return ' '.join(words)```2. 词干提取（Stemming）：通过移除单词的后缀来减少词汇的变异，使文本更加统一、例如，“running”、“ran”和“runs”可以统一为“run”。

```pythonfrom nltk.stem import PorterStemmerdef stemming(text):stemmer = PorterStemmerwords = text.splitstemmed_words = [stemmer.stem(word) for word in words]return ' '.join(stemmed_words)```3. 删除重复词（Duplicate Word Removal）：将文本中重复出现的词语合并为一个，减少信息的冗余度。

```pythondef remove_duplicates(text):words = text.splitunique_words = set(words)return ' '.join(unique_words)``````pythonimport redef remove_html_tags(text):clean_text = re.sub('<.*?>', '', text)return clean_text```5. 删除特殊字符（Special Character Removal）：通过删除文本中的特殊字符和标点符号来降低噪声水平。

基于遗传算子采样的自适应代理优化算法

基于遗传算子采样的自适应代理优化算法宋保维;王新晶;王鹏【摘要】提出一种应用于黑盒问题（ black⁃box problem ）的优化算法，称为遗传算子采样（ genetic operator sampling，GOS）自适应代理优化算法。

通过对当前样本进行两两交叉，对当前最优样本进行高斯变异2种算子获得候选样本集。

对候选样本进行适应性评估，评估标准为候选样本处的交叉验证误差和该样本与父代样本之间最小距离的乘积，将乘积最大的样本加入已有样本集。

GOS优化算法在一维问题中详细阐述，与有效全局优化算法（ efficient global optimization，EGO）和最大化模型误差算法（maximum square error，MSE）在3个典型数学算例中进行对比，验证该算法的有效性。

%This paper proposes an optimization algorithm that is applied to Black⁃box Problem, called Genetic Op⁃erator Sampling ( GOS ) adaptive surrogate⁃based optimization algorithm. Genetic operators produce candidate sample set. Cross⁃over operator is executed between any two of samples and mutation operator is executed only on present best sample. Then an assessment criterion, which is the product of the cross validation error of the candidate sample and the minimum distance between it and existing samples, is used to judge the adaptation of each sample. The candidate sample with largest product will be added to existing samples. GOS is illustrated on 1⁃D function in detail and is compared to EGO and MSE algorithm on three typical functions, the results validated the effectiveness of GOS algorithm.【期刊名称】《西北工业大学学报》【年(卷),期】2016(034)004【总页数】7页(P613-619)【关键词】代理模型;遗传算子;采样准则;代理优化【作者】宋保维;王新晶;王鹏【作者单位】西北工业大学航海学院，陕西西安 710072;西北工业大学航海学院，陕西西安 710072;西北工业大学航海学院，陕西西安 710072【正文语种】中文【中图分类】TP18设计优化问题通常包含计算量大的高保真度模型(计算流体力学、有限元分析、基于试验的工程计算等)。

英汉对照计量经济学术语

计量经济学术语A校正R2（Adjusted R-Squared）：多元回归分析中拟合优度的量度，在估计误差的方差时对添加的解释变量用一个自由度来调整。

对立假设（Alternative Hypothesis）：检验虚拟假设时的相对假设。

AR（1）序列相关（AR(1) Serial Correlation）：时间序列回归模型中的误差遵循AR（1）模型。

渐近置信区间（Asymptotic Confidence Interval）：大样本容量下近似成立的置信区间。

渐近正态性（Asymptotic Normality）：适当正态化后样本分布收敛到标准正态分布的估计量。

渐近性质（Asymptotic Properties）：当样本容量无限增长时适用的估计量和检验统计量性质。

渐近标准误（Asymptotic Standard Error）：大样本下生效的标准误。

渐近t 统计量（Asymptotic t Statistic）：大样本下近似服从标准正态分布的t 统计量。

渐近方差（Asymptotic Variance）：为了获得渐近标准正态分布，我们必须用以除估计量的平方值。

渐近有效（Asymptotically Efficient）：对于服从渐近正态分布的一致性估计量，有最小渐近方差的估计量。

渐近不相关（Asymptotically Uncorrelated）：时间序列过程中，随着两个时点上的随机变量的时间间隔增加，它们之间的相关趋于零。

衰减偏误（Attenuation Bias）：总是朝向零的估计量偏误，因而有衰减偏误的估计量的期望值小于参数的绝对值。

自回归条件异方差性（Autoregressive Conditional Heteroskedasticity, ARCH）：动态异方差性模型，即给定过去信息，误差项的方差线性依赖于过去的误差的平方。

一阶自回归过程[AR（1）]（Autoregressive Process of Order One [AR(1)]）：一个时间序列模型，其当前值线性依赖于最近的值加上一个无法预测的扰动。

sampling-随机模拟的基本思想和常用采样方法

sampling-随机模拟的基本思想和常用采样方法xianlingmao2013-06-26 23:11:19[简要]通常，我们会遇到很多问题无法用分析的方法来求得精确解，例如由于式子特别，真的解不出来；一般遇到这种情况，人们经常会采用一些方法去得到近似解（越逼近精确解越好，当然如果一个近似算法与精确解的接近程度能够通过一个式子来衡量或者有上下界，那么这种近似算法比较…本文要谈的随机模拟就是一类近似求解的方法，这种方法非常的牛逼哦，它的诞生虽然最早可以追溯到18xx年法国数学家蒲松的投针问题（用模拟的方法来求解\pi的问题），但是真正的大规模应用还是被用来解决二战时候美国佬生产原zi弹所碰到的各种难以解决的问题而提出的蒙特卡洛方法（Monte Carlo)，从此一发不可收拾。

本文将分为两个大类来分别叙述，首先我们先谈谈随机模拟的基本思想和基本思路，然后我们考察随机模拟的核心：对一个分布进行抽样。

我们将介绍常用的抽样方法，1. 接受-拒绝抽样；2）重要性抽样；3）MCMC（马尔科夫链蒙特卡洛方法）方法，主要介绍MCMC的两个非常著名的采样算法（metropolis- hasting算法和它的特例Gibbs采样算法）。

一. 随机模拟的基本思想我们先看一个例子：现在假设我们有一个矩形的区域R（大小已知），在这个区域中有一个不规则的区域M（即不能通过公式直接计算出来），现在要求取M 的面积？怎么求？近似的方法很多，例如：把这个不规则的区域M划分为很多很多个小的规则区域，用这些规则区域的面积求和来近似M，另外一个近似的方法就是采样的方法，我们抓一把黄豆，把它们均匀地铺在矩形区域，如果我们知道黄豆的总个数S，那么只要我们数数位于不规则区域M中的黄豆个数S1，那么我们就可以求出M 的面积：M=S1*R/S。

另外一个例子，在机器学习或统计计算领域，我们常常遇到这样一类问题：即如何求取一个定积分：\inf _a ^b f(x) dx，如归一化因子等。

alternative sampling methods

alternative sampling methods
替代性采样方法是指在采样过程中使用不同于传统随机抽样方
法的技术。

这些方法可以提供更高效、更精确的结果，并在特定情况下更适用。

以下是一些常见的替代性采样方法：
1. 分层抽样：将总体按照某些特征（如年龄、性别、地区等）
分成若干层，然后从每一层中随机抽取样本。

这种方法可以确保样本代表性更好，减小估计误差。

2. 簇抽样：将总体分为若干个簇，然后从每个簇中抽取全部个
体或随机选取部分个体。

这种方法适用于总体分布不均匀的情况下，可以提高效率和精度。

3. 比例抽样：根据总体中不同类别的比例，在每个类别中按比
例随机抽取样本。

这种方法适用于样本量不足或样本分布不均匀的情况下，可以提高估计精度。

4. 系统抽样：从总体中随机选取一个起始点，然后每隔一定间
距选择一个样本。

这种方法适用于总体无特殊分布的情况下，可以提高效率和精度。

5. 聚类抽样：首先将总体分为若干个聚类（如地理区域、行业等），然后从每个聚类中随机抽取一个样本。

这种方法适用于总体分
布不均匀或某些聚类比其他聚类更重要的情况下，可以提高精度和效率。

替代性采样方法可以根据具体情况进行选择，以提高效率和精度。

但是，采样方法的选择应该基于总体特征和研究目的，同时需要注意
样本的代表性和可比性。

Graph Regularized Nonnegative Matrix

Ç
1 INTRODUCTION
HE
techniques for matrix factorization have become popular in recent years for data representation. In many problems in information retrieval, computer vision, and pattern recognition, the input data matrix is of very high dimension. This makes learning from example infeasible [15]. One then hopes to find two or more lower dimensional matrices whose product provides a good approximation to the original one. The canonical matrix factorization techniques include LU decomposition, QR decomposition, vector quantization, and Singular Value Decomposition (SVD). SVD is one of the most frequently used matrix factorization techniques. A singular value decomposition of an M Â N matrix X has the following form: X ¼ UÆVT ; where U is an M Â M orthogonal matrix, V is an N Â N orthogonal matrix, and Æ is an M Â N diagonal matrix with Æij ¼ 0 if i 6¼ j and Æii ! 0. The quantities Æii are called the singular values of X, and the columns of U and V are called

计算机视觉读后感

Computer vision姓名：学号；专业：1 introductionObject segmentation in the presence of clutter and occlusions is a challenging task for computer vision and cross-media. Without utilizing any high-level prior information about expected objects, purely low-level information such as intensity, color and texture does not provide the desired segmentations. In numerous studies , prior knowledge about the shapes of the objects to be segmented can significantly improve the final reliability and accuracy of the segmentation result. However, given a training set of arbitrary prior shapes, there remains an open problem of how to define an appropriate prior shape model to guide object segmentation.Early work on this problem is the Active Shape Model (ASM), which was developed by T. Cootes et al.. The shape of an object is represented as a set of points. These points can represent the boundary or significant internal locations of the object. The evolutional shape is constrained by the point distribution model which is inferred from a training set of shapes. However, these methods suffer from a parameterized representation and the manual positioning of the landmarks. Later, level set based approaches have gained significant attention toward the integration of shape prior into variational segmentation . Almost all these works optimize a linear combination of a data-driven term and a shape constraint term. Data-driven term aims at driving the segmenting curve to the object boundaries, and shape constraint term restricts possible shapes embodied bythe contourThere are many ways to define the shape constraint term. Simple uniform distribution , Gaussian densities ,non-parametric estimator , manifold learning ,and sparse representation were considered to model shape variation within a training set. However, most methods are recognition-based segmentation. They are suitable for segmenting objects of a known class in the image according to their possible similar shapes. If the given training set of shapes is large and associated with multiple different object classes, the statistical shape models and manifold learning do not effectively represent the shape distributions due to large variability of shapes. In addition, global transformations like translation, rotation and scaling and local transformations like bending and stretching are expensive to shape model in image2 Deep Learning Shape PriorsRecently, deep learning models，are attractive for their well performance in modeling high-dimensional richly structured data. A deep learning model is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text. The deep Boltzmann machine (DBM) has been an important developmentin the quest for powerful deep learning models . Applications that use deep learning models are numerous in computer vision and information retrieval, including classification , dimensionality reduction , visual recognition tasks , acoustic modeling , etc. Very recently, a strong probabilistic model called Shape Boltzmann Machine (SBM) is proposed for the task of modeling binary object shapes. This shape generative model has the appealing property that it can both generate realistic samples and generalize to generate samples that differ from shapes in the training set.A Restricted Boltzmann Machine (RBM) is a particular type of Markov Random Field (MRF) that has a two-layer architecture, in which the visible units are connected to hidden units. A Deep Boltzmann Machine (DBM) is an extension of the RBM that has multiple layers of hidden units arranged in layers . In general, the shape prior can be simply described as two levels of representation: low-level local features (like edges or corners) and high-level global features (like object parts or object). Low-level local features with good invariant properties can be re-used in different object samples. On the other hand, high-level global features describe the image content, and they are more appropriate to cope with occlusion, noise, and changes on the object pose. In order to learn a model that accurately captures the global and local properties of binary shapes. We use three-layered DBM to automatically extract the hierarchical structure of shape data.In DBM, the learned weights and biases implicitly define a probability distribution over all possible binary shapes via the energy . Moreover, this three-layered learning can effectively capture hierarchical structures of shape priors. Lower layers detect simple local features of shape and feed into higher layers, which in turn capture more complex global features of shape. Once binary states have been chosen for the hidden units, a shape generative model can be inferred by conditional probability. Since such generative shape is defined by probability, we adopt Cremers’s shape relaxed method [8] to replace the 2D visible vector _ with a shape _ of probabilistic representation, and define a shape constraint term in the following energetic form.In this paper we introduce a new shape-driven approach for object segmentation. Given a training set of shapes, we first use deep Boltzmann machine to learn the hierarchical architecture of shape priors. This learned hierarchical architecture is then used to model shape variations of global and local structures in an energetic form. Finally, it is applied to data-driven variational methods to perform object extraction of corrupted data based on shape probabilistic representation.3 Trust Region FrameworkTrust region is a general iterative optimization framework that in some sense is dual to the gradient descent, While gradient descent fixes the direction of the step and then chooses the step size, trust region fixes the step size and then computes the optimal descent direction, as described below.At each iteration, an approximate model of the energy is constructed near the current solution. The model is only “trusted” within some small ball around the current solution, a.k.a. “trust region”. The global minimum of the approximate model within the trust region gives a new solution. This procedure is called trust region sub-problem. The size of the trust region is adjusted for the next iterationbased on the quality of the current approximation. Variants of trust region approach differ in the kind of approximate model used, optimizer for the trust region subproblem, and a merit function to decide on the acceptance of the candidate solution and adjustment of the next trust region size. For a detailed review of trust region methods this， paper propose a Fast Trust Region (FTR) approach for optimization of segmentation energies with nonlinear regional terms, which are known to be challenging for existing algorithms. These energies include, but are not limited to, KL divergence and Bhattacharyya distance between the observed and the target appearance distributions, volume constraint on segment size, and shape prior constraint in a form of distance from target shape moments. Our method is 1-2 orders of magnitude faster than the existing state-of-the-art methods while converging to comparable or better solutions.4 The Short Boundary Bias and Coupling EdgesWe first introduce the standard pairwise Markov random field (MRF) model for interactive segmentation and the extension Although a large number of neighbouring pixels in such images may take different labels, the majority of these pixels has a consistent appearance. most pixel pairs along the object boundary have a consistent (brown-white) transition. Therefore, proposed a priorthat penalizes not the length but the diversity of the object boundary, i.e., the number of types of transitions. This new potential does not suffer from the short-boundary bias.MAP Inference. Inferring the Maximum a Posteriori (MAP) solution of the models above corresponds to minimizing their respective energy functions. It is well known that the energy function (1) is submodular if all _(Ii; Ij) _0 and can then be minimized in polynomial time by solving an (s;t)-mincut problem [4]. In contrast, the higher order potential (5) makes MAP inference in general NP hard, and therefore [11] proposed an iterative bound minimization algorithm for approximate inference. We show next that higher order potentials of the form (5) can be converted into a pairwise model by the addition of some binary auxiliary variables.One of the key challenges posed by higher-order models is efficient MAP inference. Since inference in pair wise models is very well studied, one popular technique is to transform the higher-order energy function into that of a pair wise random field. In fact, any higher-order pseudo boolean function can be converted to a pair wise one, by introducing additional auxiliary random variables Unfortunately, the number of auxiliary variables grows exponentially with the arity of the function, and in practice this approach is only feasible for higher-order functions with few variables. If however the higher-order function contains inherent “structure”, then MAP inference can be practically feasible even with terms that act on thousands of variables [14, 12, 25, 29]. This is the case for the edge coupling potentials4，Co-segmentation.In this paper, we propose a novel correspondence-based object discovery and co-segmentation algorithm that performs well even in the presence of many noise images. Our algorithm automatically discovers the common object among the majority of images and computes a binary object/ background label mask for each image. Images that do not contain the common object are naturally handled by returning an empty labeling Our algorithm is designed based on the assumption that pixels (features) belonging to the common object should be: (a) salient, i.e. dissimilar to other pixels within their image, and (b) sparseCo-segmentation was first introduced by Rother et al. , who used histogram matching to simultaneously segment the same object in two different images. Co-segmentation was also explored in weakly supervised setups with multiple object categories . While image annotations may facilitate object discovery and segmentation, image tags are often noisy, and bounding boxes or class labels are usually unavailable. In this work we show that it is plausible to automatically discover visual objects from the Internet using image search alone. Let I = {I1, . . . , IN} be the image dataset consisting of N images. Our goal is to compute the binary masks B = {b1, . . . , bN}, where for each image Ii and pixel x = (x, y),bi(x) = 1 indicates foreground (the common object), and bi(x) = 0 indicates background (not the object) at location x.The saliency of a pixel or a region in an image can be defined in numerous ways and extensive research in computer and human vision has been devoted to this topic. In our experiments, we used an off the-shelf saliency measure—Cheng et al.’s Contrast-based Saliency —that produced sufficiently good saliency estimates for our purposes, but our formulation is not limited to a particular saliency measure and others can be usedthe saliency of a pixel based on its color contrast to other pixels in the image (how different it is from the other pixels). Since high contrast to surrounding regions is usually a stronger evidence for saliency of a region than high contrast to far away regions, they weigh the contrast by the spatial distances in the image.Formally, let wij denote the flow field from image Ii to image Ij . Given the binary masks bi, bj , the SIFT flow objective function becomesE (wij ; bi, bj) =_x∈Λibi(x)_bj(x + wij(x)) _Si(x) − Sj(x + wij(x))_1+(1 − bj(x + wij(x))C0 +_y∈Nixα _w(x) − w(y)_2_where Si are the dense SIFT descriptors of image Ii, x _→ _x_p is the Lp distance for p = 1 and 2, Λi is image Ii’s lattice, Nix is the neighborhood of x, α weighs the smoothness term, and C0 is a large constant. We then denote byWthe set of all pixel correspondences in the dataset: W= ∪Ni=1∪Ij∈Ni wij .The key insight to our algorithm is that common object patterns should be salient within each image, while being sparse with respect to smooth transformations across images. We propose to use dense correspondences between images to capture the sparsity and visual variability of the common object over the entire database, which enables us to ignore noise objects that may be salient within their own images but do not commonly occur in others. We performed extensive numerical evaluation on established co-segmentation datasets, as well as several new datasets generated using Internet search. Our approach is able to effectively segment out the common object for diverse object categories, while naturally identifying images where the common object is not present6 Technical ApproachThe overall approach to image segmentation embodied in this work is inspired by the work on gPb [3, 2] in that it starts with a set of edges derived from local pixel differences and then invokes a globalization procedure which strengthens or weakens those edges based on an analysis of the eigenvectors produced by a normalized cuts procedure.The procedure consists of three main processing stages：The first step in the procedure is an edge extraction stage which produces a set of edgels. This system employs a variant 哦f the method proposed by Meer and Georgescu which can be thought of as computing the normalized cross correlation between each pixel window and a set of oriented edge templates. In this work we modify this edge detection scheme by introducing an additional factor in the denominator which is based on the average response to the template. This modification serves to reintroduce some contrast information into the edge response so that larger steps have a greater response than smaller ones and slight variations in low contrast regions are not unduly amplified. second，In the original formulation of the Normalized Cuts segmentation procedure the principal goal is to minimize the Rayleigh quotient，Our aim then is to construct a matrix,L, whose column span captures the variation we expect in the eigenvectors of the orginal system. Once the L matrix has been constructed we can turn our attention to solving the reduced order eigensystem，We first note that this system is much smaller than the original system since m is on the order of 5000 where n was on the order of 115000 for the images in our test set. The approach advocated in this paper leverages the observation that in this image segmentation task the edge signal provides useful information about the final structure of the eigen problem. By constructing a basis tailored to the content of the image we are able to identify a subspace that captures the nuances of the edges and the details found in the full system7 summaryS egmentation, the problem of breaking a given image into salient regions, is one of the most fundamental issues in Computer Vision and a number of approaches have been advanced to accomplish this task. Among these schemes, three methods have proven to be quite popular in practice owing to their performance and/or their running time. The Mean Shift method of Comaniciu and Meer , the Normalized Cuts method developed by Shi and Malik [14] and the graph based method of Felzenswalb and Huttenlocher. More Recently Arbelaez, Maire, Fowlkes and Malik have proposed an impressive segmentation algorithm that achieves state of the art results on commonly available data sets. Their gPb method starts with a local edge extraction procedure which has been optimized using learning techniques. The results of this edge extraction step are then used as input to a spectral partitioning procedure which globalizes the results using Normalized Cuts. This globalization stage helps to focus attention on the most salient edges in the scene.Refers[A} Fei Chen， Hui min Yu， Roland Hu ,Xun xun Zeng ，Deep Learning Shape Priors for Object Segmentation, in CVPR, 2013.[B] Lena Gorelick, Frank R. Schmidt, Yuri Boykov ，Fast Trust Region for Segmentation，in CVPR, 2013.[C] Pushmeet Kohli，Anton Osokin，Stefanie Jegelka，A Principled Deep Random Field Model for Image Segmentation，in CVPR, 2013.[D] Michael Rubinstein，Armand Joulin，Johannes Kopf，Ce Liu ，Unsupervised Joint Object Discovery and Segmentation in Internet Images，in CVPR, 2013.[E] Camillo Jose Taylor ，GRASP Laboratory，Towards Fast and Accurate Segmentation，in CVPR, 2013.。

Sampling Methods and the Central Limit Theorem

d Random Sampling

When a population can be clearly divided into groups based on some characteristic, we may use stratiﬁed random sampling. It guarantees each group is represented in the sample. The groups are also called strata. Once the strata are deﬁned, we can apply simple random sampling within each group or stratum to collect the sample.
8-20
Example- Stratiﬁed Random Sampling
8-21
Example- Stratiﬁed Random Sampling

stratiﬁed random sampling will guarantee that at least one ﬁrm in strata 1 and 5 are represented in the sample. In this case, the number of ﬁrms sampled from each stratum is proportional to the stratum’s relative frequency in the population.
8-14
Example- Systematic Random Sampling

First, k is calculated as the population size divided by the sample size. For Computer Graphic, Inc., we would select every 20th invoice from the ﬁle drawers; in so doing the numbering process is avoided. If k is not a whole number, then round down.

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

Submodular Approximation: Sampling-based Algorithms and Lower Bounds
Zoya Svitkina Lisa Fleischer
arXiv:0805.1071v1 [cs.DS] 7 May 2008
Department of Computer Science Dartmouth College {zoya, lkf} @ May 7, 2008
1 Introduction
A function f deﬁned on subsets of a ground set V is called submodular if for all subsets A, B ⊆ V , f (A) + f (B ) ≥ f (A ∩ B ) + f (A ∪ B ). Submodularity is a discrete analog of convexity. It also shares some nice properties with concave functions, as it captures decreasing marginal returns. Submodular functions generalize cut functions of graphs and rank functions of matrices and matroids, and arise in a variety of applications including facility location, assignment, scheduling, and network design. In this paper, we introduce and study several generalizations of classical computer science problems. These new problems involve a general submodular function in their objectives, in place of much simpler functions in the objectives of their classical counterparts. The problems include submodular knapsack, which generalizes knapsack, and submodular load balancing, which generalizes load balancing or minimum-makespan scheduling. In these two problems, the size of a collection of items, instead of being just a sum of their individual sizes, is now a submodular function. Another two new problems are submodular sparsest cut and submodular balanced cut, which generalize their respective graph cut problems. Here, a general submodular function replaces the graph cut function, which itself is a well-known special case of a submodular function. The last problem that we study is learning a submodular function, which was recently introduced by Goemans, Harvey, Kleinberg, and Mirrokni [7]. All of these problems are deﬁned on a set V of n elements with a nonnegative submodular function f : 2V → R given by a value-oracle: a black box that, given a subset S , returns the value f (S ). The following are formal deﬁnitions of the problems. ˆ (not necessarily submodular) that for all S ⊆ V Learning a Submodular Function Produce a function f ˆ satisﬁes f (S )/g(n) ≤ f (S ) ≤ f (S ), for g(n) ≥ 1 as small as possible.
A version intermediate between the uniform SSC and the general SSC is the weighted SSC problem, in which each element v ∈ V has a non-negative weight w(v ), and the demand between any pair of elements (u, v ) is equal to the productd Balancing (SLB) Uniform version: For a monotone function f , given a number m, ﬁnd a partition of V into m sets, V1 , . . . , Vm (some possibly empty), to minimize maxi f (Vi ). Non-uniform version: For m monotone submodular functions f1 , . . . , fm on V , ﬁnd a partition of V into m sets V1 , . . . , Vm to minimize maxi fi (Vi ). Submodular Sparsest Cut (SSC) Given k unordered pairs {{ui , vi } | 1 ≤ i ≤ k, ui , vi ∈ V }, each with a demand di > 0, ﬁnd a subset S ⊆ V minimizing f (S )/ i:|S ∩{ui ,vi }|=1 di . The denominator is the ¯).1 amount of demand separated by the “cut” (S, S We also consider uniform SSC, in which all pairs of nodes have demand equal to one. In this case the (S ) objective function is to ﬁnd S ⊆ V minimizing |f ¯| . We focus on a closely related formulation of the S ||S
Abstract We introduce several natural optimization problems using submodular functions and give asymptotically tight or close upper and lower bounds on approximation guarantees achievable using polynomial number of queries to a function-value oracle. The optimization problems we consider are submodular load balancing, submodular sparsest cut, submodular balanced cut, submodular knapsack. n For the ﬁrst two problems, our algorithms asymptotically match our lower bounds, yielding Θ( log n)
We also consider a special case of monotone two-partition functions, which we deﬁne as follows. A submodular function f on ground set V is a two-partition (2P) function if there is a R ⊆ V such that for all S ⊆ V the value of f (S ) depends only on |S ∩ R| and |S ∩ R|. A function f is monotone if f (S ) ≤ f (T ) whenever S ⊆ T .
1 approximation guarantees. The approximation guarantee for the last problem is bicriteria ( 2 , O( n log n )),
and we show a lower bound on the ratio of the bicriteria guarantees that matches this. We also give a new lower bound for approximately learning a submodular function. Our lower bounds, and the previous lower bounds for learning a submodular function, use submodular functions with very special structure. We show how to approximately learn submodular functions with this structure with a guarantee only slightly larger than the lower bound, demonstrating that a much tighter lower bound would need to use a function with more complex structure.