Data Mining: A subset of about 1700 labeled email messages
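Clustering Methods in Data Mining
Data mining is the process of discovering useful information and patterns in large volumes of data.
Clustering is an important data mining method that partitions data objects into groups of similar objects, called clusters.
Clustering supports data analysis and decision making: it helps reveal relationships among data points and uncovers hidden patterns and structure in the data.
Many clustering methods are available in data mining; several common ones are briefly introduced below.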
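1. K-means clustering: K-means is one of the most widely used clustering algorithms. It partitions the data into K clusters, where K is a user-defined parameter.
The algorithm computes the distance between each data point and every cluster center and assigns the point to the cluster with the nearest center.
It then recomputes each cluster center as the mean of its assigned points and repeats both steps until a stopping criterion is met, for example when the cluster centers no longer change or a maximum number of iterations is reached.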
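The assignment and update steps described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the random initialization from data points, and the choice of squared Euclidean distance are assumptions made for the example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means sketch on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: initialise the centers with k randomly chosen data points.
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to the nearest center
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[j])
        # Stopping criterion: the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Calling `kmeans(points, k=2)` returns the final centers and one cluster index per point. The result depends on the initialization, so in practice the algorithm is often run several times with different seeds.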
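2. Hierarchical clustering: hierarchical clustering builds clusters either bottom-up (agglomerative) or top-down (divisive).
Bottom-up hierarchical clustering starts with every data point as its own cluster and repeatedly merges the most similar clusters, eventually producing a complete cluster tree.
Top-down hierarchical clustering starts with all data points in a single cluster and recursively splits it into smaller clusters, likewise producing a complete cluster tree.
Hierarchical clustering can be run with different criteria for the similarity between clusters, such as single linkage and complete linkage.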
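The bottom-up variant and the two linkage criteria mentioned above can be sketched as follows. This is a naive illustration under stated assumptions: the function name `agglomerative`, the Euclidean distance, and stopping at a target number of clusters (rather than building the full tree) are choices made for the example.

```python
def agglomerative(points, num_clusters, linkage="single"):
    """Naive bottom-up clustering; returns clusters as lists of point indices."""

    def dist(i, j):
        # Euclidean distance between points i and j.
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    # Single linkage uses the closest pair of members between two clusters,
    # complete linkage the farthest pair.
    agg = min if linkage == "single" else max

    def cluster_dist(c1, c2):
        return agg(dist(i, j) for i in c1 for j in c2)

    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((x, y) for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda pair: cluster_dist(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster b into cluster a.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Recording each merge instead of stopping early would yield the full cluster tree. The pairwise search shown here is far slower than what library implementations use, so it is only suitable for small examples.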
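3. Density-based clustering: density-based clustering partitions the data into clusters according to the density of the data points.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a common density-based clustering algorithm.
DBSCAN identifies core points as those with at least a minimum number of data points within a given radius, and forms clusters through the connectivity between core points.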
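A compact sketch of this procedure is shown below. It is an illustration under assumptions rather than a reference implementation: the names `dbscan`, `eps`, and `min_pts` are chosen for the example, distances are Euclidean, a point counts as a member of its own neighborhood, and points that end up in no cluster are labeled -1 (noise).

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""

    def neighbors(i):
        # All points within radius eps of point i (including i itself).
        return [
            j for j in range(len(points))
            if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2
        ]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not a core point; may later become a border point
            continue
        # Point i is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                # j is itself a core point, so the cluster expands through it.
                queue.extend(nbrs)
    return labels
```

Unlike K-means, the number of clusters is not given in advance; it follows from `eps` and `min_pts`.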
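4. Model-based clustering: model-based clustering assumes that the data are generated from a specific probability distribution and identifies clusters by fitting a model to the data.
The Gaussian mixture model (GMM) is one example of a model-based clustering method.
A GMM assumes that the data are generated by a mixture of several Gaussian distributions. Its parameters are fitted by maximum likelihood estimation, typically with the EM algorithm, and each data point is then assigned to the Gaussian component under which it is most probable.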
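As a rough illustration of the fitting procedure, the sketch below runs the EM algorithm for a one-dimensional Gaussian mixture. The function name `gmm_em_1d`, the caller-supplied initial parameters, the fixed iteration count, and the small floor on the standard deviation are all assumptions made for this example.

```python
import math


def gmm_em_1d(xs, mus, sigmas, weights, n_iter=50):
    """EM sketch for a 1-D Gaussian mixture; needs n_iter >= 1."""
    k = len(mus)
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                for w, m, s in zip(weights, mus, sigmas)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate mean, standard deviation, and weight per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
            weights[j] = nj / len(xs)
    # Hard assignment: the component with the largest responsibility
    # (taken from the last E-step).
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return mus, sigmas, weights, labels
```

The responsibilities in `resp` are soft assignments; the hard labels are only a convenience for comparing with other clustering methods.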
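When choosing a clustering method, consider the characteristics of the data, the goal of the problem, and the strengths and weaknesses of each algorithm.
Different clustering methods suit different types of data and problems.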
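The Density-Based DBSCAN Clustering Algorithm
Data mining is the process of discovering implicit, previously unknown rules of potential value for decision making from large volumes of incomplete, noisy, fuzzy, and random data.
Since the 1970s, research in data mining has produced substantial results and many systems that are applied in real business activities. These systems have brought great benefits to enterprises and government organizations.
Clustering is one of the most active research branches in data mining, with wide applications in statistics, pattern recognition, image processing, machine learning, biology, marketing, and many other fields.
Clustering is the process of grouping a set of physical or abstract objects into multiple classes, or clusters, of similar objects. Each resulting cluster is a collection of data objects in which objects in the same cluster are as similar as possible and objects in different clusters are as dissimilar as possible.
Through clustering, one can identify dense and sparse regions and discover global distribution patterns and interesting relationships among data attributes. In business, for example, clustering helps market analysts distinguish different consumer groups in a customer database and summarize the purchasing patterns, or habits, of each group.
In data mining, cluster analysis can serve as a standalone tool for examining the distribution of the data, observing the characteristics of each cluster, and focusing further analysis on particular clusters.
Cluster analysis can also serve as a preprocessing step for other algorithms (such as characterization and classification), which then operate on the generated clusters.
Cluster analysis has become a very active research topic in data mining.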
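1 Representative density-based clustering algorithms
1.1 The DBSCAN algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm built on connected regions of high density.
It groups the records in sufficiently dense regions into clusters. Its basic idea relies on the following definitions.
Definition 1. For a given object, the region within radius r of that object is called the r-neighborhood of the object.
Definition 2. If the r-neighborhood of an object contains at least a minimum number minpts of objects, the object is called a core object.
Definition 3. Given a set of objects D, if an object p lies in the r-neighborhood of q and q is a core object, then p is said to be directly density-reachable from q.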
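The three definitions translate almost directly into code. The sketch below is an illustration under assumptions: the function names are invented for the example, distance is Euclidean, an object is counted as part of its own r-neighborhood, and "at least minpts" is used as the core-object threshold.

```python
import math


def r_neighborhood(points, i, r):
    """Definition 1: indices of all objects within radius r of object i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= r]


def is_core(points, i, r, minpts):
    """Definition 2: object i is a core object if its r-neighborhood is dense."""
    return len(r_neighborhood(points, i, r)) >= minpts


def directly_density_reachable(points, p, q, r, minpts):
    """Definition 3: p is directly density-reachable from q."""
    return p in r_neighborhood(points, q, r) and is_core(points, q, r, minpts)


# The relation is not symmetric: a border object can be reached from a
# core object, but not the other way round.
points = [(0, 0), (0, 1), (1, 0), (2, 0)]
print(directly_density_reachable(points, 3, 2, r=1.0, minpts=3))
print(directly_density_reachable(points, 2, 3, r=1.0, minpts=3))
```

In this example the object at index 2 is a core object while the object at index 3 is not, which is why the two calls give different answers.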
High-Fidelity Image Generation With Fewer Labels
Mario Lucic *1, Michael Tschannen *2, Marvin Ritter *1, Xiaohua Zhai 1, Olivier Bachem 1, Sylvain Gelly 1
* Equal contribution. 1 Google Brain, Zurich, Switzerland. 2 ETH Zurich, Zurich, Switzerland. Correspondence to: Mario Lucic <****************>, Michael Tschannen <**********************>, Marvin Ritter <marvinrit-**************>.
arXiv:1903.02271v1 [cs.LG] 6 Mar 2019

Abstract
Deep generative models are becoming a cornerstone of modern machine learning. Recent work on conditional generative adversarial networks has shown that learning complex, high-dimensional distributions over natural images is within reach. While the latest models are able to generate high-fidelity, diverse natural images at high resolution, they rely on a vast quantity of labeled data. In this work we demonstrate how one can benefit from recent work on self- and semi-supervised learning to outperform the state of the art (SOTA) on both unsupervised ImageNet synthesis, as well as in the conditional setting. In particular, the proposed approach is able to match the sample quality (as measured by FID) of the current state-of-the-art conditional model BigGAN on ImageNet using only 10% of the labels and outperform it using 20% of the labels.

1. Introduction
Deep generative models have received a great deal of attention due to their power to learn complex high-dimensional distributions, such as distributions over natural images (Zhang et al., 2018; Brock et al., 2019), videos (Kalchbrenner et al., 2017), and audio (Van Den Oord et al., 2016). Recent progress was driven by scalable training of large-scale models (Brock et al., 2019; Menick & Kalchbrenner, 2019), architectural modifications (Zhang et al., 2018; Chen et al., 2019a; Karras et al., 2018), and normalization techniques (Miyato et al., 2018).
High-fidelity natural image generation (typically trained on ImageNet) hinges upon having access to vast quantities of labeled data. This is unsurprising as labels induce rich side information into the training process, effectively dividing the extremely challenging image generation task into semantically meaningful sub-tasks.
However, this dependence on vast quantities of labeled data is at odds with the fact that most data is unlabeled, and labeling itself is often costly and error-prone. Despite the recent progress on unsupervised image generation, the gap between conditional and unsupervised models in terms of sample quality is significant.
Figure 1. FID of the baselines and the proposed method. The vertical line indicates the baseline (BigGAN) which uses all the labeled data. The proposed method (S3GAN) is able to match the state of the art while using only 10% of the labeled data and outperform it with 20%.
In this work, we take a significant step towards closing the gap between conditional and unsupervised generation of high-fidelity images using generative adversarial networks (GANs). We leverage two simple yet powerful concepts:
(i) Self-supervised learning: A semantic feature extractor for the training data can be learned via self-supervision, and the resulting feature representation can then be employed to guide the GAN training process.
(ii) Semi-supervised learning: Labels for the entire training set can be inferred from a small subset of labeled training images and the inferred labels can be used as conditional information for GAN training.
Our contributions. In this work, we
1. propose and study various approaches to reduce or fully omit ground-truth label information for natural image generation tasks,
2. achieve a new SOTA in unsupervised generation on ImageNet, match the SOTA on 128×128 ImageNet using only 10% of the labels, and set a new SOTA using only 20% of the labels (measured by FID), and
3. open-source all the code used for the experiments at /google/compare_gan.

2. Background and related work
High-fidelity GANs on ImageNet. Besides BigGAN (Brock et al., 2019) only a few prior methods have managed to scale GANs to ImageNet, most of them relying on class-conditional generation using labels. Among the earliest attempts are GANs with an auxiliary classifier (AC-GANs) (Odena et al., 2017), which feed one-hot encoded label information with the latent code to the generator and equip the discriminator with an auxiliary head predicting the image class in addition to whether the input is real or fake. More recent approaches rely on a label projection layer in the discriminator, essentially resulting in per-class real/fake classification (Miyato & Koyama, 2018), and self-attention in the generator (Zhang et al., 2018). Both methods use modulated batch normalization (De Vries et al., 2017) to provide label information to the generator. On the unsupervised side, Chen et al. (2019b) showed that an auxiliary rotation loss added to the discriminator has a stabilizing effect on the training. Finally, appropriate gradient regularization enables scaling MMD-GANs to ImageNet without using labels (Arbel et al., 2018).
Semi-supervised GANs. Several recent works leveraged GANs for semi-supervised learning of classifiers. Both Salimans et al. (2016) and Odena (2016) train a discriminator that classifies its input into K+1 classes: K image classes for real images, and one class for generated images. Similarly, Springenberg (2016) extends the standard GAN objective to K classes. This approach was also considered by Li et al. (2017) where separate discriminator and classifier models are applied. Other approaches incorporate inference models to predict missing labels (Deng et al., 2017) or harness the joint distribution (of labels and data) matching for semi-supervised learning (Gan et al., 2017). We emphasize that this line of work focuses on training a classifier from a few labels, rather than using few labels to improve the quality of the generated model. To our knowledge, improvements in sample quality through partial label information are reported in Li et al. (2017); Deng et al. (2017); Sricharan et al. (2017), all of which consider only low-resolution data sets from a restricted domain.
Self-supervised learning. Self-supervised learning methods employ a label-free auxiliary task to learn a semantic feature representation of the data. This approach was successfully applied to different data modalities, such as images (Doersch et al., 2015; Caron et al., 2018), video (Agrawal et al., 2015; Lee et al., 2017), and robotics (Jang et al., 2018; Pinto & Gupta, 2016). The current state-of-the-art method on ImageNet is due to Gidaris et al. (2018) who proposed predicting the rotation angle of rotated training images as an auxiliary task. This simple self-supervision approach yields representations which are useful for downstream image classification tasks. Other forms of self-supervision include predicting relative locations of disjoint image patches of a given image (Doersch et al., 2015; Mundhenk et al., 2018) or estimating the permutation of randomly swapped image patches on a regular grid (Noroozi & Favaro, 2016). A study on self-supervised learning with modern neural architectures is provided in Kolesnikov et al. (2019).
Figure 2. Top row: 128×128 samples from the fully supervised current state-of-the-art model BigGAN. Bottom row: Samples from the proposed S3GAN which matches BigGAN in terms of FID and IS using only 10% of the ground-truth labels.

3. Reducing the appetite for labeled data
In a nutshell, instead of providing hand-annotated ground truth labels for real images to the discriminator, we will provide inferred ones. To obtain these labels we will make use of recent advancements in self- and semi-supervised learning. Before introducing these methods in detail, we first discuss how label information is used in state-of-the-art GANs. The following exposition assumes familiarity with the basics of the GAN framework (Goodfellow et al., 2014).
Incorporating the labels. To provide the label information to the discriminator we employ a linear projection layer as proposed by Miyato & Koyama (2018). To make the exposition self-contained, we will briefly recall the main ideas. In a "vanilla" (unconditional) GAN, the discriminator $D$ learns to predict whether the image at its input $x$ is real or generated by the generator $G$. We decompose the discriminator into a learned discriminator representation, $\tilde{D}$, which is fed into a linear classifier, $c_{r/f}$, i.e., the discriminator is given by $c_{r/f}(\tilde{D}(x))$. In the projection discriminator, one learns an embedding for each class of the same dimension as the representation $\tilde{D}(x)$. Then, for a given image, label input $x, y$ the decision on whether the sample is real or generated is based on two components: (a) on whether the representation $\tilde{D}(x)$ itself is consistent with the real data, and (b) on whether the representation $\tilde{D}(x)$ is consistent with the real data from class $y$. More formally, the discriminator takes the form $D(x, y) = c_{r/f}(\tilde{D}(x)) + P(\tilde{D}(x), y)$, where $P(\tilde{x}, y) = \tilde{x}^\top W y$ is a linear projection layer applied to a feature vector $\tilde{x}$ and the one-hot encoded label $y$ as an [...]
[...] GAN using the hinge loss, alternatively minimizing the discriminator loss $L_D$ and generator loss $L_G$, namely
$$L_D = -\mathbb{E}_{x \sim p_{\text{data}}(x)}[\min(0, -1 + D(x, c_{\text{CL}}(F(x))))] - \mathbb{E}_{(z,y) \sim \hat{p}(z,y)}[\min(0, -1 - D(G(z,y), y))]$$
$$L_G = -\mathbb{E}_{(z,y) \sim \hat{p}(z,y)}[D(G(z,y), y)],$$
where $\hat{p}(z,y) = p(z)\hat{p}(y)$ is the prior distribution with $p(z) = \mathcal{N}(0, I)$ and $\hat{p}(y)$ the empirical distribution of [...]
[...] the latter expectation is replaced by the empirical average over the subset of labeled training examples, whereas the former is set to the empirical average over the entire training set (this convention is followed throughout the paper). After we obtain $F$ and $c_{\text{S2L}}$ we proceed with GAN training where we label the real images as $\hat{y}_{\text{S2L}} = c_{\text{S2L}}(F(x))$. In particular, we alternatively minimize the same generator and discriminator losses as for CLUSTERING except that we use $c_{\text{S2L}}$ and $F$ obtained by minimizing (2):
$$L_D = -\mathbb{E}_{x \sim p_{\text{data}}(x)}[\min(0, -1 + D(x, c_{\text{S2L}}(F(x))))] - \mathbb{E}_{(z,y) \sim p(z,y)}[\min(0, -1 - D(G(z,y), y))]$$
$$L_G = -\mathbb{E}_{(z,y) \sim p(z,y)}[D(G(z,y), y)],$$
where $p(z,y) = p(z)p(y)$ with $p(z) = \mathcal{N}(0, I)$ and $p(y)$ uniform categorical. We use the abbreviation S2GAN for this method.
Footnote 1: Note that an even simpler approach would be to first learn the representation via self-supervision and subsequently the linear classifier, but we observed that learning the representation and classifier simultaneously leads to better results.
Figure 5. S2GAN-CO: During GAN training we learn an auxiliary classifier $c_{\text{CT}}$ on the discriminator representation $\tilde{D}$, based on the labeled real examples, to predict labels for the unlabeled ones. This avoids training a feature extractor $F$ and classifier $c_{\text{S2L}}$ prior to GAN training as in S2GAN.

3.2. Co-training approach
The main drawback of the transfer-based methods is that one needs to train a feature extractor $F$ via self-supervision and learn an inference mechanism for the labels (linear classifier or clustering). In what follows we detail co-training approaches that avoid this two-step procedure and learn to infer label information during GAN training.
Unsupervised method. We consider two approaches. In the first one, we completely remove the labels by simply labeling all real and generated examples with the same label (footnote 2) and removing the projection layer from the discriminator, i.e., we set $D(x) = c_{r/f}(\tilde{D}(x))$. We use the abbreviation SINGLE LABEL for this method. For the second approach we assign random labels to (unlabeled) real images. While the labels for the real images do not provide any useful signal to the discriminator, the sampled labels could potentially help the generator by providing additional randomness with different statistics than $z$, as well as additional trainable parameters due to the embedding matrices in class-conditional BatchNorm. Furthermore, the labels for the fake data could facilitate the discrimination as they provide side information about the fake images to the discriminator. We term this method RANDOM LABEL.
Footnote 2: Note that this is not necessarily equivalent to replacing class-conditional BatchNorm with standard (unconditional) BatchNorm as the variant of conditional BatchNorm used in this paper also uses chunks of the latent code as input, besides the label information.
Figure 6. Self-supervision by rotation-prediction during GAN training. Additionally to predicting whether the images at its input are real or generated, the discriminator is trained to predict rotations of both rotated real and fake images via an auxiliary linear classifier $c_R$. This approach was successfully applied by Chen et al. (2019b) to stabilize GAN training. Here we combine it with our pre-trained and co-training approaches, replacing the ground truth labels $y_r$ with predicted ones.
Semi-supervised method. When labels are available for a subset of the real data, we train an auxiliary linear classifier $c_{\text{CT}}$ directly on the feature representation $\tilde{D}$ of the discriminator, during GAN training, and use it to predict labels for the unlabeled real images. In this case the discriminator loss takes the form
$$L_D = -\mathbb{E}_{(x,y) \sim p_{\text{data}}(x,y)}[\min(0, -1 + D(x, y))] - \lambda\, \mathbb{E}_{(x,y) \sim p_{\text{data}}(x,y)}[\log p(c_{\text{CT}}(\tilde{D}(x)) = y)] - \mathbb{E}_{x \sim p_{\text{data}}(x)}[\min(0, -1 + D(x, c_{\text{CT}}(\tilde{D}(x))))] - \mathbb{E}_{(z,y) \sim p(z,y)}[\min(0, -1 - D(G(z,y), y))], \quad (3)$$
where the first term corresponds to standard conditional training on ($k\%$) labeled real images, the second term is the cross-entropy loss (with weight $\lambda > 0$) for the auxiliary classifier $c_{\text{CT}}$ on the labeled real images, the third term is an unsupervised discriminator loss where the labels for the unlabeled real images are predicted by $c_{\text{CT}}$, and the last term is the standard conditional discriminator loss on the generated data. We use the abbreviation S2GAN-CO for this method. See Figure 5 for an illustration.

3.3. Self-supervision during GAN training
So far we leveraged self-supervision to either craft good feature representations, or to learn a semi-supervised model (cf. Section 3.1). However, given that the discriminator itself is just a classifier, one may benefit from augmenting this classifier with an auxiliary task, namely self-supervision through rotation prediction. This approach was already explored in (Chen et al., 2019b), where it was observed to stabilize GAN training. Here we want to assess its impact when combined with the methods introduced in Sections 3.1 and 3.2. To this end, similarly to the training of $F$ in (1) and (2), we train an additional linear classifier $c_R$ on the discriminator feature representation $\tilde{D}$ to predict rotations $r \in R$ of the rotated real images $x^r$ and rotated fake images $G(z,y)^r$. The corresponding loss terms added to the discriminator and generator losses are
$$-\frac{\beta}{|R|} \sum_{r \in R} \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log p(c_R(\tilde{D}(x^r)) = r)] \quad (4)$$
and
$$-\frac{\alpha}{|R|} \sum_{r \in R} \mathbb{E}_{(z,y) \sim p(z,y)}[\log p(c_R(\tilde{D}(G(z,y)^r)) = r)], \quad (5)$$
respectively, where $\alpha, \beta > 0$ are weights to balance the loss terms. This approach is illustrated in Figure 6.

4. Experimental setup
Architecture and hyperparameters. GANs are notoriously unstable to train and their performance strongly depends on the capacity of the neural architecture, optimization hyperparameters, and appropriate regularization (Lucic et al., 2018; Kurach et al., 2018). We implemented the conditional BigGAN architecture (Brock et al., 2019) which achieves state-of-the-art results on ImageNet (footnote 3). We use exactly the same optimization hyper-parameters as Brock et al. (2019). Specifically, we employ the Adam optimizer with the learning rates $5 \cdot 10^{-5}$ for the generator and $2 \cdot 10^{-4}$ for the discriminator ($\beta_1 = 0$, $\beta_2 = 0.999$). We train for 250k generator steps with 2 discriminator iterations before each generator step. The batch size was fixed to 2048, and we use a latent code $z$ with 120 dimensions. We employ spectral normalization in both generator and discriminator.
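The projection discriminator $D(x, y) = c_{r/f}(\tilde{D}(x)) + \tilde{x}^\top W y$ and the hinge losses from Section 3 can be illustrated numerically with a small sketch. It is not the authors' implementation: the function names, the plain-list data layout, the bias-free linear classifier, and the replacement of expectations by batch averages are assumptions made for this example.

```python
def projection_discriminator(features, label, w_rf, class_embeddings):
    """D(x, y) = c_r/f(D~(x)) + P(D~(x), y) for one feature vector.

    features: the representation D~(x) as a list of floats.
    label: integer class index (stands in for the one-hot vector y).
    w_rf: weights of the linear real/fake classifier c_r/f (no bias assumed).
    class_embeddings: one embedding per class; with a one-hot y, the product
        W y simply selects the embedding of class `label`.
    """
    real_fake = sum(w * f for w, f in zip(w_rf, features))
    projection = sum(e * f for e, f in zip(class_embeddings[label], features))
    return real_fake + projection


def hinge_losses(d_real, d_fake):
    """Batch-average hinge losses L_D and L_G from discriminator outputs."""
    loss_d = (
        -sum(min(0.0, -1.0 + d) for d in d_real) / len(d_real)
        - sum(min(0.0, -1.0 - d) for d in d_fake) / len(d_fake)
    )
    loss_g = -sum(d_fake) / len(d_fake)
    return loss_d, loss_g
```

For instance, `hinge_losses([2.0, 0.5], [-2.0, 0.0])` penalizes only the real sample scored below 1 and the fake sample scored above -1; samples already on the correct side of the margin contribute nothing to $L_D$.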
In contrast to BigGAN, we do not apply orthogonal regularization as this was observed to only marginally improve sample quality (cf. Table 1 in Brock et al. (2019)) and we do not use the truncation trick.
Footnote 3: We dissected the model checkpoints released by Brock et al. (2019) to obtain exact counts of trainable parameters and their dimensions, and match them to byte level (cf. Tables 10 and 11). We want to emphasize that at this point this methodology is bleeding-edge and successful state-of-the-art methods require careful architecture-level tuning. To foster reproducibility we meticulously detail this architecture at tensor-level detail in Appendix B and open-source our code at https:///google/compare_gan.
Datasets. We focus primarily on ImageNet, the largest and most diverse image data set commonly used to evaluate GANs. ImageNet contains 1.3M training images and 50k test images, each corresponding to one of 1k object classes. We resize the images to 128×128×3 as done in Miyato & Koyama (2018) and Zhang et al. (2018). Partially labeled data sets for the semi-supervised approaches are obtained by randomly selecting k% of the samples from each class.
Evaluation metrics. We use the Fréchet Inception Distance (FID) (Heusel et al., 2017) and Inception Score (Salimans et al., 2016) to evaluate the quality of the generated samples. To compute the FID, the real data and generated samples are first embedded in a specific layer of a pre-trained Inception network. Then, a multivariate Gaussian is fit to the data and the distance computed as $\mathrm{FID}(x, g) = \|\mu_x - \mu_g\|_2^2 + \mathrm{Tr}(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2})$, where $\mu$ and $\Sigma$ denote the empirical mean and covariance, and subscripts $x$ and $g$ denote the real and generated data respectively. FID was shown to be sensitive to both the addition of spurious modes and to mode dropping (Sajjadi et al., 2018; Lucic et al., 2018). Inception Score posits that conditional label distribution of samples containing meaningful objects should have low entropy, and the variability of the samples should be high leading to the following formulation: $\mathrm{IS} = \exp(\mathbb{E}_{x \sim Q}[d_{KL}(p(y \mid x), p(y))])$. Although it has some flaws (Barratt & Sharma, 2018), we report it to enable comparison with existing methods. Following (Brock et al., 2019), the FID is computed using the 50k ImageNet testing images and 50k randomly sampled fake images, and the IS is computed from 50k randomly sampled fake images. All metrics are computed for 5 different randomly sampled sets of fake images and we report the mean.
Methods. We conduct an extensive comparison of methods detailed in Table 1, namely: Unmodified BigGAN, the unsupervised methods SINGLE LABEL, RANDOM LABEL, CLUSTERING, and the semi-supervised methods S2GAN and S2GAN-CO. In all S2GAN-CO experiments we use soft labels, i.e., the soft-max output of $c_{\text{CT}}$ instead of one-hot encoded hard estimates, as we observed in preliminary experiments that this stabilizes training. For S2GAN we use hard labels by default, but investigate the effect of soft labels in separate experiments. For all semi-supervised methods we have access only to k% of the ground truth labels where k ∈ {5, 10, 20}. As an additional baseline, we retain k% labeled real images and discard all unlabeled real images, then using the remaining labeled images to train BigGAN (the resulting model is designated by BigGAN-k%). Finally, we explore the effect of self-supervision during GAN training on the unsupervised and semi-supervised methods.
We train every model three times with a different random seed and report the median FID and the median IS. With the exception of the SINGLE LABEL and BigGAN-k%, the standard deviation of the mean across three runs is very low. We therefore defer tables with the mean FID and IS values and standard deviations to Appendix D. All models are trained on 128 cores of a Google TPU v3 Pod with BatchNorm statistics synchronized across cores.
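The two evaluation metrics defined above can be illustrated with a small numerical sketch. It makes simplifying assumptions that the paper does not: the Inception network is left out entirely (class probabilities and feature statistics are passed in directly), and the FID function handles only diagonal covariance matrices, for which the trace term reduces to a per-dimension sum. The function names are invented for this example.

```python
import math


def inception_score(cond_probs):
    """IS = exp(E_x[KL(p(y|x) || p(y))]) from per-sample class probabilities.

    cond_probs[i][c] is p(y = c | x_i) for generated sample x_i; the marginal
    p(y) is estimated as the average over all samples.
    """
    n = len(cond_probs)
    num_classes = len(cond_probs[0])
    marginal = [sum(p[c] for p in cond_probs) / n for c in range(num_classes)]
    mean_kl = sum(
        sum(pc * math.log(pc / mc) for pc, mc in zip(p, marginal) if pc > 0)
        for p in cond_probs
    ) / n
    return math.exp(mean_kl)


def fid_diagonal(mu_x, var_x, mu_g, var_g):
    """FID for Gaussians with diagonal covariances (a simplifying assumption).

    With diagonal covariances, Tr(S_x + S_g - 2 (S_x S_g)^(1/2)) becomes the
    sum over dimensions of var_x + var_g - 2 * sqrt(var_x * var_g).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_x, mu_g))
    trace_term = sum(
        vx + vg - 2.0 * math.sqrt(vx * vg) for vx, vg in zip(var_x, var_g)
    )
    return mean_term + trace_term
```

With confident and evenly spread predictions the Inception Score approaches the number of classes, while identical predictions for every sample give a score of 1. The full FID requires a matrix square root of $\Sigma_x \Sigma_g$, which this diagonal sketch deliberately avoids.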
Unsupervised approaches. For CLUSTERING we simply used the best available self-supervised rotation model from (Kolesnikov et al., 2019). The number of clusters for CLUSTERING is selected from the set {50, 100, 200, 500, 1000}. The other unsupervised approaches do not have hyper-parameters.

Table 1. A short summary of the analyzed methods. The detailed descriptions of pre-training and co-trained approaches can be found in Sections 3.1 and 3.2, respectively. Self-supervision during GAN training is described in Section 3.3.
METHOD          DESCRIPTION
BigGAN          Conditional (Brock et al., 2019)
SINGLE LABEL    Co-training: Single label
RANDOM LABEL    Co-training: Random labels
CLUSTERING      Pre-trained: Clustering
BigGAN-k%       Drop all but k% labeled data
S2GAN-CO        Co-training: Semi-supervised
S2GAN           Pre-trained: Semi-supervised
S3GAN           S2GAN with self-supervision
S3GAN-CO        S2GAN-CO with self-supervision

Pre-trained and co-training approaches. We employ the wide ResNet-50 v2 architecture with widening factor 16 (Zagoruyko & Komodakis, 2016) for the feature extractor $F$ in the pre-trained approaches described in Section 3.1. We optimize the loss in (2) using SGD for 65 epochs. The batch size is set to 2048, composed of $B$ unlabeled examples and $2048 - B$ labeled examples. Following the recommendations from Goyal et al. (2017) for training with large batch size, we (i) set the learning rate to $0.1 \cdot \frac{B}{256}$, and (ii) use linear learning rate warm-up during the initial 5 epochs. The learning rate is decayed twice with a factor of 10 at epoch 45 and epoch 55. The parameter $\gamma$ in (2) is set to 0.5 and the number of unlabeled examples per batch $B$ is 1536. The parameters $\gamma$ and $B$ are tuned on 0.1% labeled examples held out from the training set; the search space is {0.1, 0.5, 1.0} × {1024, 1536, 1792}. The accuracy of the so-obtained classifier $c_{\text{S2L}}(F(x))$ on the ImageNet validation set is reported in Table 3. The parameter $\lambda$ in the loss used for S2GAN-CO in (3) is selected from the set {0.1, 0.2, 0.4}.
Self-supervision during GAN training. For all approaches we use the recommended parameter $\alpha = 0.2$ from (Chen et al., 2019b) in (5) and do a small sweep for $\beta$ in (4). For the values tried ({0.25, 0.5, 1.0, 2}) we do not see a huge effect and use $\beta = 0.5$ for S3GAN. For S3GAN-CO we did not repeat the sweep and used $\beta = 1.0$.

5. Results and discussion
Recall that the main goal of this work is to match (or outperform) the fully supervised BigGAN in an unsupervised fashion, or with a small subset of labeled data. In the following, we discuss the advantages and drawbacks of the analyzed approaches with respect to this goal.
As a baseline, our reimplementation of BigGAN obtains an FID of 8.4 and IS of 75.0, and hence reproduces the result reported by Brock et al. (2019) in terms of FID. We observed some differences in training dynamics, which we discuss in detail in Section 5.4.

5.1. Unsupervised approaches
The results for unsupervised approaches are summarized in Figure 7 and Table 2. The fully unsupervised RANDOM LABEL and SINGLE LABEL models both achieve a similar FID of ∼25 and IS of ∼20. This is a quite considerable gap compared to BigGAN and indicates that additional supervision is necessary. We note that one of the three SINGLE LABEL models collapsed whereas all three RANDOM LABEL models trained stably for 250k generator iterations.
Pre-training a semantic representation using self-supervision and clustering the training data on this representation as done by CLUSTERING reduces the FID by about 10% and increases IS by about 10%. These results were obtained for 50 clusters; all other options led to worse results. While this performance is still considerably worse than that of BigGAN, this result is the current state of the art in unsupervised image generation (Chen et al. (2019b) report an FID of 33 for unsupervised generation).
Example images from the clustering are shown in Figures 14, 15, and 16 in the supplementary material. The clustering is clearly meaningful and groups similar objects within the same cluster. Furthermore, the objects generated by CLUSTERING conditionally on a given cluster index reflect the distribution of the training data belonging to the corresponding cluster. On the other hand, we can clearly observe multiple classes being present in the same cluster. This is to be expected when under-clustering to 50 clusters. Interestingly, clustering to many more clusters (say 500) yields results similar to SINGLE LABEL.

Table 2. Median FID and IS for the unsupervised approaches (see Table 14 in the appendix for mean and standard deviation).
METHOD               FID     IS
RANDOM LABEL         26.5    20.2
SINGLE LABEL         25.3    20.4
SINGLE LABEL (SS)    23.7    22.2
CLUSTERING           23.2    22.7
CLUSTERING (SS)      22.0    23.5

Figure 7. Median FID obtained by our unsupervised approaches. The vertical line indicates the median FID of our BigGAN implementation which uses labels for all training images. While the gap between unsupervised and fully supervised approaches remains significant, using a pre-trained self-supervised representation (CLUSTERING) improves the sample quality compared to SINGLE LABEL and RANDOM LABEL, leading to a new state of the art in unsupervised generation on ImageNet.

5.2. Semi-supervised approaches
Pre-trained. The S2GAN model where we use the classifier pre-trained with both a self-supervised and semi-supervised loss (cf. Section 3.1) suffers a very minor increase in FID for 10% and 5% labeled real training data, and matches BigGAN both in terms of FID and IS when 20% of the labels are used (cf. Table 3). We stress that this is despite the fact that the classifier used to infer the labels has a top-1 accuracy of only 50%, 63%, and 71% for 5%, 10%, and 20% labeled data, respectively (cf. Table 3), compared to 100% of the original labels. The results are shown in Table 4 and Figure 8, and random samples as well as interpolations can be found in Figures 9–17 in the supplementary material.
Co-trained. The results for our co-trained model S2GAN-CO which trains a linear classifier in semi-supervised fashion on top of the discriminator representation during GAN training (cf. Section 3.2) are shown in Table 4. It can be seen that S2GAN-CO outperforms all fully unsupervised approaches for all considered label percentages. While the gap between S2GAN-CO with 5% labels and CLUSTERING in terms of FID is small, S2GAN-CO has a considerably larger IS. When using 20% labeled training examples S2GAN-CO obtains an FID of 13.9 and an IS of 49.2, which is remarkably close to BigGAN and S2GAN given the simplicity of the S2GAN-CO approach. As the percentage of labels decreases, the gap between S2GAN and S2GAN-CO increases.
Interestingly, S2GAN-CO does not seem to train less stably than S2GAN approaches even though it is forced to learn the classifier during GAN training. This is particularly remarkable as the BigGAN-k% approaches, where we only retain the labeled data for training and discard all unlabeled data, are very unstable and collapse after 60k to 120k iterations, for all three random seeds and for both 10% and 20% labeled data.

Table 3. Top-1 and top-5 error rate (%) on the ImageNet validation set of $c_{\text{S2L}}(F(x))$ using both self- and semi-supervised losses as described in Section 3.1. While the models are clearly not state of the art compared to the fully supervised ImageNet classification task, the quality of labels is sufficient to match and in some cases improve the state-of-the-art GAN natural image synthesis.
                LABELS
METRIC          5%       10%      20%
TOP-1 ERROR     50.08    36.74    29.21
TOP-5 ERROR     26.94    16.04    10.33

Table 4.
                FID                     IS
METHOD          5%      10%     20%     5%      10%     20%
S2GAN           10.8    8.9     8.4     57.6    73.4    77.4
S2GAN-CO        21.8    17.7    13.9    30.0    37.2    49.2
S3GAN           10.4    8.0     7.7     59.6    78.7    83.1
S3GAN-CO        20.2    16.6    12.7    31.0    38.5    53.1

5.3. Self-supervision during GAN training
So far we have seen that the pre-trained semi-supervised approach, namely S2GAN, is able to achieve state-of-the-art performance for 20% labeled data. Here we investigate whether self-supervision during GAN training as described in Section 3.3 can lead to further improvements. Table 4 and Figure 8 show the experimental results for S3GAN, namely S2GAN coupled with self-supervision in the discriminator.
Self-supervision leads to a reduction in FID and increase in IS across all considered settings. In particular we can match the state-of-the-art BigGAN with only 10% of the labels and outperform it using 20% labels, both in terms of FID and IS.
For S3GAN the improvements due to self-supervision during GAN training in FID are considerable, around 10% in most of the cases. Tuning the parameter $\beta$ of the discriminator self-supervision loss in (4) did not dramatically increase [...]
Data Mining, Data Analysis, and Data Visualization Quiz

1. Data Mining is also referred to as ……………………..
- data analysis
- data discovery (correct answer)
- data recovery
- Data visualization

2. Data Mining is a method and technique inclusive of ………………………….
- data analysis (correct answer)
- data discovery
- Data visualization
- data recovery

3. Which step of Data Science consumes almost 80% of the work period of the procedure?
- Accumulating the data
- Analyzing the data
- Wrangling the data (correct answer)
- Recapitulation of the Data

4. Which step of Data Science allows the model to consistently improve and provide punctual performance and deliver approximate results?
- Wrangling the data
- Accumulating the data
- Recapitulation of the Data (correct answer)
- Analyzing the data

5. Which tool of Data Science is a robust machine learning library that allows the implementation of deep learning algorithms?
- Tableau
- D3.js
- Apache Spark
- TensorFlow (correct answer)

6. What is the main aim of Data Mining?
- to obtain data from a less number of sources and to transform it into a more useful version of itself.
- to obtain data from a less number of sources and to transform it into a less useful version of itself.
- to obtain data from a great number of sources and to transform it into a less useful version of itself.
- to obtain data from a great number of sources and to transform it into a more useful version of itself. (correct answer)

7. In which step of data mining are the irrelevant patterns eliminated to avoid cluttering?
- Cleaning the data (correct answer)
- Evaluating the data
- Conversion of the data
- Integration of data

8. Data Science is mainly used for ………………. purposes. Data mining is mainly used for ……………………. purposes.
- scientific, business (correct answer)
- business, scientific
- scientific, scientific
- None

9. Pandas ………………... is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).
- Series (correct answer)
- Frame
- Panel
- None

10. How many principal components does a Pandas DataFrame consist of?
- 4
- 2
- 1
- 3 (correct answer)

11. Important data structure of pandas is/are ___________
- Series
- Data Frame
- Both (correct answer)
- None of the above

12. Which of the following commands is used to install pandas?
- pip install pandas (correct answer)
- install pandas
- pip pandas
- None of the above

13. Which of the following functions/methods helps to create a Series?
- series()
- Series() (correct answer)
- createSeries()
- None of the above

14. NumPy stands for?
- Numbering Python
- Number In Python
- Numerical Python (correct answer)
- None of the above

15. Which of the following is not a correct sub-package of SciPy?
- scipy.integrate
- scipy.source (correct answer)
- scipy.interpolate
- scipy.signal

16. How to import the Constants package in SciPy?
- import scipy.constants
- from scipy.constants (correct answer)
- import scipy.constants.package
- from scipy.constants.package

17. ………………….. involves looking at and describing the data set from different angles and then summarizing it?
- Data Frame
- Data Visualization
- EDA (correct answer)
- All of the above

18. What involves the preparation of data sets for analysis by removing irregularities in the data so that these irregularities do not affect further steps in the process of data analysis and machine learning model building?
- Data Analysis
- EDA (correct answer)
- Data Frame
- None of the above

19. What is not a utility of EDA?
- Maximize the insight in the data set
- Detect outliers and anomalies
- Visualization of data
- Test underlying assumptions (correct answer)

20. What can hamper the further steps in the machine learning model building process if not performed properly?
- Recapitulation of the Data
- Accumulating the data
- EDA (correct answer)
- None of the above

21. Which plot is used in EDA to check the dependency between two variables?
- Histograms
- Scatter plots (correct answer)
- Maps
- Time series plots

22. What function will tell you the top records in the data set?
- shape
- head (correct answer)
- show
- all of the above

23. What type of data is useful for internal policymaking and business strategy building for an organization?
- public data
- private data (correct answer)
- both
- None of the above

24. The ………… function can "fill in" NA values with non-null data?
- head
- fillna (correct answer)
- shape
- all of the above

25. If you want to simply exclude the missing values, what function along with the axis argument will be used?
- fillna
- replace
- dropna (correct answer)
- isnull

26. Which of the following attributes of DataFrame is used to display the data type of each column in a DataFrame?
- Dtypes
- DTypes
- dtypes (correct answer)
- datatypes

27. Which of the following functions is used to load the data from a CSV file into a DataFrame?
- read.csv()
- readcsv()
- read_csv() (correct answer)
- Read_csv()

28. How to display the first row of dataframe 'DF'?
- print(DF.head(1))
- print(DF[0 : 1])
- print(DF.iloc[0 : 1])
- All of the above (correct answer)

29. The spread function is known as ................ in spreadsheets?
- pivot
- unpivot (correct answer)
- cast
- order

30. ................. extracts a subset of rows from a data frame based on logical conditions?
- rename
- filter (correct answer)
- set
- subset

31. We can shift the DataFrame's index by a certain number of periods using the …………. method?
- melt()
- merge()
- tail()
- shift() (correct answer)

32. We can join melted DataFrames into one Analytical Base Table using the ……….. function.
- join()
- append()
- merge() (correct answer)
- truncate()

33. What method is used to concatenate datasets along an axis?
- concatenate()
- concat() (correct answer)
- add()
- merge()

34. Rows can be …………….. if the number of missing values is insignificant, as this would not impact the overall analysis results.
- deleted (correct answer)
- updated
- added
- all

35. There is a specific reason behind the missing value. What stands for "missing not at random"?
- MCAR
- MAR
- MNAR (correct answer)
- None of the above

36. While plotting data, some values of one variable may not lie beyond the expected range, but when you plot the data with some other variable, these values may lie far from the expected value. Identify the type of outliers?
- Univariate outliers
- Multivariate outliers (correct answer)
- ManyVariate outliers
- None of the above

37. If numeric values are stored as strings, then it would not be possible to calculate metrics such as mean, median, etc. What type of data cleaning exercise will you perform?
- Convert incorrect data types (correct answer)
- Correct the values that lie beyond the range
- Correct the values not belonging in the list
- Fix incorrect structure

38. Rows that are not required in the analysis, e.g. if only observations before or after a particular date are required for analysis. What step will we perform when doing data filtering?
- Deduplicate data: remove duplicated data
- Filter rows to keep only the relevant data (correct answer)
- Filter columns: pick columns relevant to analysis
- Bring the data together: group by required keys, aggregate the rest

39. You need to …………... the data in order to get what you need for your analysis.
- search
- length
- order
- filter (correct answer)

40. Write the output of the following?
    >>> import pandas as pd
    >>> series1 = pd.Series([10,20,30])
    >>> print(series1)
- 0 10
  1 20
  2 30
  dtype: int64 (correct answer)
- 10 20 30 dtype: int64
- 0 1 2 dtype: int64
- None of the above

41. What will be the output of the following code?
    import numpy as np
    a = np.array([1, 2, 3], dtype = complex)
    print a
- [[ 1.+0.j, 2.+0.j, 3.+0.j]]
- [ 1.+0.j]
- Error
- [ 1.+0.j, 2.+0.j, 3.+0.j] (correct answer)

42. What will be the output of the following code?
    import numpy as np
    a = np.array([1,2,3])
    print a
- [[1, 2, 3]]
- [1]
- [1, 2, 3] (correct answer)
- Error

43. What will be the output of the following code?
    import numpy as np
    dt = np.dtype('i4')
    print dt
- int32 (correct answer)
- int64
- int128
- int16

44. What will be the output of the following code?
    import numpy as np
    dt = np.dtype([('age',np.int8)])
    a = np.array([(10,),(20,),(30,)], dtype = dt)
    print a['age']
- [[10 20 30]]
- [10 20 30] (correct answer)
- [10]
- Error

45. We can add a new row to a DataFrame using the _____________ method
- rloc[ ]
- iloc[ ]
- loc[ ] (correct answer)
- None of the above

46. Function _____ can be used to drop missing values.
- fillna()
- isnull()
- dropna() (correct answer)
- delna()

47. The function to perform pivoting with dataframes having duplicate values is _____ ?
pivot(unique = True)pivot()pivot_table(unique = True)pivot_table()(正确答案)48. A technique, which when performed on a dataframe, rearranges the data from rows and columns in a report form, is called _____ ?summarisingreportinggroupingpivoting(正确答案)49. Normal Distribution is symmetric is about ___________ ?VarianceMean(正确答案)Standard deviationCovariance50. Write a statement to display “Amount” as x-axis label. (consider plt as an alias name of matplotlib.pyplot)bel(“Amount”)plt.xlabel(“Amount”)(正确答案)plt.xlabel(Amount)None of the above51. Fill in the blank in the given code, if we want to plot a line chart for values of list ‘a’ vs values of list ‘b’.a = [1, 2, 3, 4, 5]b = [10, 20, 30, 40, 50]import matplotlib.pyplot as pltplt.plot __________(a, b)(正确答案)(b, a)[a, b]None of the above52. #Loading the datasetimport seaborn as snstips =sns.load_dataset("tips")tips.head()In this code what is tips ?plotdataset name(正确答案)paletteNone of the above53. Visualization can make sense of information by helping to find relationships in the data and support (or disproving) ideas about the dataAnalyzeRelationShip(正确答案)AccessiblePrecise54. In which option provides A detailed data analysis tool that has an easy-to-use tool interface and graphical designoptions for visuals.Jupyter NotebookSisenseTableau DesktopMATLAB(正确答案)55. Consider a bank having thousands of ATMs across China. In every transaction, Many variables are recorded.Which among the following are not fact variables.Transaction charge amountWithdrawal amountAccount balance after withdrawalATM ID(正确答案)56. Which module of matplotlib library is required for plotting of graph?plotmatplotpyplot(正确答案)None of the above57. Write a statement to display “Amount” as x-axis label. (consider plt as an alias name of matplotlib.pyplot)bel(“Amount”)plt.xlabel(“Amount”)(正确答案)plt.xlabel(Amount)None of the above58. 
What will happen when you pass ‘h’ as as a value to orient parameter of the barplot function?It will make the orientation vertical.It will make the orientation horizontal.(正确答案)It will make line graphNone of the above59. what is the name of the function to display Parameters available are viewed .set_style()axes_style()(正确答案)despine()show_style()60. In stacked barplot, subgroups are displayed as bars on top of each other. How many parameters barplot() functionhave to draw stacked bars?OneTwoNone(正确答案)three61. In Line Chart or Line Plot which parameter is an object determining how to draw the markers for differentlevels of the style variable.?x.yhuemarkers(正确答案)legend62. …………………..similar to Box Plot but with a rotated plot on each side, giving more information about the density estimate on the y axis.Pie ChartLine ChartViolin Chart(正确答案)None63. By default plot() function plots a ________________HistogramBar graphLine chart(正确答案)Pie chart64. ____________ are column-charts, where each column represents a range of values, and the height of a column corresponds to how many values are in that range.Bar graphHistograms(正确答案)Line chartpie chart65. The ________ project builds on top of pandas and matplotlib to provide easy plotting of data.yhatSeaborn(正确答案)VincentPychart66. A palette means a ________.. surface on which a painter arranges and mixed paints. circlerectangularflat(正确答案)all67. The default theme of the plotwill be ________?Darkgrid(正确答案)WhitegridDarkTicks68. Outliers should be treated after investigating data and drawing insights from a dataset.在调查数据并从数据集中得出见解后,应对异常值进行处理。
An English Essay on Deep Learning and Shallow Learning
In the field of machine learning, deep learning and shallow learning are two different approaches to training a model to make predictions or decisions. Shallow learning, also known as traditional machine learning, involves the use of algorithms that rely on manually engineered features to make predictions. These algorithms are typically used for tasks such as classification and regression, and they require extensive feature engineering to extract relevant information from the input data. On the other hand, deep learning is a subset of machine learning that uses neural networks to automatically learn hierarchical representations of data. These neural networks are composed of multiple layers of interconnected nodes, and they are capable of learning intricate patterns and features directly from the raw input data, eliminating the need for manual feature engineering. Deep learning has shown remarkable success in tasks such as image and speech recognition, natural language processing, and reinforcement learning, outperforming traditional shallow learning methods in many cases. However, deep learning models often require large amounts of labeled data and computational resources to train effectively, which can be a limitation in some applications.
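An Introduction to Common Data Mining Analysis Methods
In data analysis, data mining is an important technique that helps extract valuable information and knowledge from large volumes of data.
Many data mining analysis methods are in common use in practice; this article introduces some of them.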
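1. Cluster analysis
Cluster analysis partitions a dataset into groups so that data objects within the same group are highly similar, while objects in different groups are dissimilar.
K-means is a commonly used clustering algorithm.
It first selects K initial cluster centers, then iteratively assigns each data object to its nearest center and recomputes the position of each center, until a convergence condition is met.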
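The assign-then-update loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the choice of the first K points as initial centers, and the toy data are all assumptions made for the example.

```python
def kmeans(points, k, max_iter=100):
    """Minimal K-means on a list of equal-length numeric tuples."""
    # Assumption for this sketch: the first k points serve as initial centers.
    centers = list(points[:k])
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its members
        # (a center with no members stays where it is).
        new_centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else c
            for c, members in zip(centers, clusters)
        ]
        if new_centers == centers:  # convergence: centers stopped moving
            break
        centers = new_centers
    return centers, clusters


# Toy data: two well-separated groups (made up for illustration).
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
centers, clusters = kmeans(points, k=2)
```

In practice the initial centers are usually chosen at random or with a smarter seeding scheme, and the result can depend on that choice.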
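2. Classification analysis
Classification analysis learns from existing labeled data in order to predict the class of new data.
The decision tree is a commonly used classification method.
A decision tree is a tree structure in which each internal node tests an attribute, each branch corresponds to a value of that attribute, and each leaf holds a class label; the path from the root to a leaf expresses one classification rule.
New data is classified by traversing the tree from the root to a leaf.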
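To make the traversal concrete, here is a small sketch with a hand-written tree. It shows only the classification step, not how a tree is learned from data; the attribute names (`outlook`, `windy`), the class labels, and the `classify` helper are hypothetical.

```python
# A hand-built decision tree: internal nodes test one attribute, branches are
# attribute values, leaves are class labels. All names here are hypothetical.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "windy", "branches": {True: "stay in", False: "play"}},
        "overcast": "play",
        "rainy": "stay in",
    },
}


def classify(node, record):
    """Follow one root-to-leaf path; the path taken is the classification rule."""
    path = []
    while isinstance(node, dict):          # still at an internal node
        value = record[node["attribute"]]  # look up the tested attribute
        path.append((node["attribute"], value))
        node = node["branches"][value]     # descend along the matching branch
    return node, path                      # node is now a leaf (class label)


label, rule = classify(tree, {"outlook": "sunny", "windy": False})
```

The returned `rule` lists the attribute tests along the path, which is exactly the "if outlook is sunny and not windy then play" style of rule the text describes.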
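3. Association rule mining
Association rule mining looks for relationships between itemsets in a dataset.
Apriori is a commonly used association rule mining algorithm.
It rests on one key principle: if an itemset is frequent, then all of its subsets are also frequent.
Apriori iteratively generates candidate itemsets and computes their support to find the frequent itemsets, then computes confidence to generate association rules.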
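The candidate generation, support counting, and confidence steps can be sketched as follows. This is an unoptimized illustration; the function names `apriori` and `rules`, the thresholds, and the shopping-basket data are assumptions for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for every frequent itemset."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain the whole itemset.
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori principle).
        k += 1
        current = {a | b for a in level for b in level if len(a | b) == k}
        current = {c for c in current
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) triples."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[ante]  # support(X and Y) / support(X)
                if conf >= min_confidence:
                    found.append((set(ante), set(itemset - ante), conf))
    return found


# Made-up baskets for illustration.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
frequent = apriori(baskets, min_support=0.5)
found = rules(frequent, min_confidence=0.6)
```

The pruning line is where the principle from the text pays off: a three-item candidate is discarded without counting as soon as one of its two-item subsets is known to be infrequent.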
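4. Regression analysis
Regression analysis learns from data in order to predict a numeric output.
Linear regression is a commonly used regression method.
It represents the relationship between inputs and output by fitting a straight line or a hyperplane.
The model parameters are found by minimizing the difference between the actual and predicted outputs, typically the sum of squared differences.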
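For a single input variable, minimizing the sum of squared differences has a closed-form solution, sketched below. The helper name `fit_line` and the sample data are assumptions for illustration; with several inputs one would normally use a linear algebra library instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
predicted = slope * 6 + intercept  # prediction for a new input x = 6
```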
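5. Anomaly detection
Anomaly detection finds data objects that do not conform to the normal pattern.
Density-based outlier detection is a commonly used approach.
It compares the density around a data object with the density of its neighborhood to decide whether the object is an outlier.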
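One simple way to express this idea is to estimate each point's density from the distance to its k nearest neighbours and compare it with the density of those neighbours. The sketch below is a simplified variant written for illustration, not a specific published algorithm; the function names, the threshold of 2.0, and the data are assumptions, and it assumes there are no duplicate points (which would give a zero distance).

```python
import math


def knn_density(points, k):
    """Density of each point = k / (sum of distances to its k nearest neighbours)."""
    densities, neighbours = [], []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        neighbours.append([j for _, j in nearest])
        densities.append(k / sum(d for d, _ in nearest))
    return densities, neighbours


def outlier_scores(points, k=2):
    """Ratio of the neighbours' mean density to the point's own density.

    A score near 1 means the point is about as dense as its neighbourhood;
    a much larger score means it sits in a comparatively sparse region.
    """
    densities, neighbours = knn_density(points, k)
    return [
        (sum(densities[j] for j in nbrs) / k) / densities[i]
        for i, nbrs in enumerate(neighbours)
    ]


# Four points on a unit square plus one far-away point (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = outlier_scores(points, k=2)
flagged = [p for p, s in zip(points, scores) if s > 2.0]
```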
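6. Time series analysis
Time series analysis models and forecasts time series data.
ARIMA is a commonly used time series model.
ARIMA first transforms the series into a stationary one, typically by differencing, and then models and forecasts it with a combination of autoregressive and moving-average terms.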
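The difference-then-model idea can be shown with a deliberately reduced sketch: one round of differencing followed by a first-order autoregressive fit, with no moving-average term and no intercept. It is therefore only the ARIMA(1,1,0) special case, written by hand for illustration; the function names and the series are assumptions, and a real analysis would use a statistics library.

```python
def difference(series):
    """First-order differencing: the 'I' (integrated) part of ARIMA."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(values):
    """Least-squares estimate of phi in x_t = phi * x_{t-1} + noise."""
    num = sum(prev * cur for prev, cur in zip(values, values[1:]))
    den = sum(prev * prev for prev in values[:-1])
    return num / den


def forecast_next(series):
    diffs = difference(series)      # make the series (closer to) stationary
    phi = fit_ar1(diffs)            # autoregressive fit on the differences
    next_diff = phi * diffs[-1]     # one-step forecast of the next difference
    return series[-1] + next_diff   # undo the differencing


# Made-up series whose increments halve each step: 8, 4, 2, 1.
series = [0.0, 8.0, 12.0, 14.0, 15.0]
phi = fit_ar1(difference(series))
prediction = forecast_next(series)
```

The moving-average part, which models dependence on past forecast errors, is omitted here because estimating it requires iterative fitting.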
Training Set, Test Set, and Validation Set (English)
The Importance of Training, Testing, and Validation Sets in Machine Learning

In the field of machine learning, the division of data into training, testing, and validation sets is crucial for ensuring the effective development and evaluation of models. Each set serves a distinct purpose in the machine learning workflow, and their integration is essential for achieving accurate and reliable results.

Training Set

The training set is used to teach the machine learning model how to perform a specific task. It contains a subset of labeled data, which the model uses to learn the underlying patterns and relationships between inputs and outputs. The model's parameters are adjusted based on the training data to minimize a predefined loss function, which measures the difference between the model's predictions and the actual labels.

During the training phase, the model's goal is to fit the training data as well as possible. However, it's crucial to avoid overfitting, where the model performs poorly on new, unseen data because it has learned the noise or irrelevant details in the training set. To mitigate this issue, techniques such as regularization and dropout are often employed.

Validation Set

The validation set serves as a middle ground between the training set and the testing set. It's used to evaluate the model's performance during the training process, allowing for adjustments to be made without corrupting the test set's integrity. The validation set helps in hyperparameter tuning, model selection, and early stopping to prevent overfitting.

By monitoring the model's performance on the validation set, practitioners can assess how well it generalizes to unseen data. If the model's performance on the validation set stops improving, it's a signal to stop training to prevent overfitting. The validation set also allows for comparisons between different models or algorithms, enabling practitioners to choose the best-performing one.

Testing Set

The testing set is used to assess the final performance of the trained model on unseen data. It's crucial to evaluate the model's performance on data it hasn't encountered during training or validation to ensure its generalization capabilities. The testing set should be completely separate from the training and validation sets and only used once at the end of the machine learning workflow.

By comparing the model's predictions on the testing set to the actual labels, practitioners can calculate evaluation metrics such as accuracy, precision, recall, and F1 score. These metrics provide a quantitative measure of the model's performance and allow for comparisons with other models or benchmarks.

Conclusion

The division of data into training, testing, and validation sets is fundamental to the success of machine learning projects. The training set teaches the model, the validation set helps in tuning and evaluating the model during training, and the testing set provides an unbiased assessment of the model's final performance. By leveraging these sets effectively, practitioners can develop accurate, robust, and reliable machine learning models that generalize well to new, unseen data.
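As a small illustration of the three-way division, the sketch below shuffles a dataset once and cuts it into disjoint training, validation, and testing parts. The function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for the example, not recommendations from the essay.

```python
import random


def split_dataset(examples, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = int(len(items) * test_fraction)
    n_val = int(len(items) * val_fraction)
    test = items[:n_test]                 # held out until the very end
    val = items[n_test:n_test + n_val]    # used for tuning and early stopping
    train = items[n_test + n_val:]        # used to fit the model parameters
    return train, val, test


# Stand-in dataset of 100 example ids.
train, val, test = split_dataset(range(100))
```

Because the three lists never overlap, a score measured on `test` is not influenced by any choice made while looking at `train` or `val`.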
A subset of about 1700 labeled email messages
Data summary:
These were chosen in a semi-motivated fashion (focusing on business-related emails and the California Energy Crisis and on emails that occurred later in the collection, trying to avoid very personal messages, jokes, and so on). Students in the ANLP course annotated the selected messages with the category labels. Each message was labeled by two people, but no claims of consistency, comprehensiveness, nor generality are made about these labelings.
Keywords:
email messages, business emails, collection, personal messages, jokes
Data format:
TEXT
Data uses:
Social Network Analysis
Information Processing
Classification
Detailed description:
A subset of about 1700 labeled email messages
These were chosen in a semi-motivated fashion (focusing on business-related emails and the California Energy Crisis and on emails that occurred later in the collection, trying to avoid very personal messages, jokes, and so on). Students in the ANLP course annotated the selected messages with the category labels. Each message was labeled by two people, but no claims of consistency, comprehensiveness, nor generality are made about these labelings.
Format of each line in .cats file:
n1,n2,n3
n1 = top-level category
n2 = second-level category
n3 = frequency with which this category was assigned to this message
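A reader for this three-field line format might look like the sketch below. The `CategoryVote` type, the `parse_cats` function, and the sample lines are hypothetical and written only for illustration; the sample values are made up, not taken from the dataset.

```python
from collections import namedtuple

# One parsed line of a .cats file (the type name is hypothetical).
CategoryVote = namedtuple("CategoryVote", "top_level second_level frequency")


def parse_cats(text):
    """Parse 'n1,n2,n3' lines, skipping blank lines and tolerating spaces."""
    votes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        n1, n2, n3 = (int(part) for part in line.split(","))
        votes.append(CategoryVote(n1, n2, n3))
    return votes


# Made-up sample content; the second line has a stray space after a comma.
sample = "1,1,2\n3,6, 1\n4,12,1\n"
votes = parse_cats(sample)
# Combine the two levels into the dotted codes used in the category list.
labels = [f"{v.top_level}.{v.second_level}" for v in votes]
```

The dotted labels produced this way (for example `3.6`) can be looked up directly in the category listing that follows.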
Here are the categories:
1 Coarse genre
1.1 Company Business, Strategy, etc. (elaborate in Section 3 [Topics])
1.2 Purely Personal
1.3 Personal but in professional context (e.g., it was good working with you)
1.4 Logistic Arrangements (meeting scheduling, technical support, etc)
1.5 Employment arrangements (job seeking, hiring, recommendations, etc)
1.6 Document editing/checking (collaboration)
1.7 Empty message (due to missing attachment)
1.8 Empty message
2 Included/forwarded information
2.1 Includes new text in addition to forwarded material
2.2 Forwarded email(s) including replies
2.3 Business letter(s) / document(s)
2.4 News article(s)
2.5 Government / academic report(s)
2.6 Government action(s) (such as results of a hearing, etc)
2.7 Press release(s)
2.8 Legal documents (complaints, lawsuits, advice)
2.9 Pointers to url(s)
2.10 Newsletters
2.11 Jokes, humor (related to business)
2.12 Jokes, humor (unrelated to business)
2.13 Attachment(s) (assumed missing)
3 Primary topics (if coarse genre 1.1 is selected)
3.1 regulations and regulators (includes price caps)
3.2 internal projects -- progress and strategy
3.3 company image -- current
3.4 company image -- changing / influencing
3.5 political influence / contributions / contacts
3.6 california energy crisis / california politics
3.7 internal company policy
3.8 internal company operations
3.9 alliances / partnerships
3.10 legal advice
3.11 talking points
3.12 meeting minutes
3.13 trip reports
4 Emotional tone (if not neutral)
4.1 jubilation
4.2 hope / anticipation
4.3 humor
4.4 camaraderie
4.5 admiration
4.6 gratitude
4.7 friendship / affection
4.8 sympathy / support
4.9 sarcasm
4.10 secrecy / confidentiality
4.11 worry / anxiety
4.12 concern
4.13 competitiveness / aggressiveness
4.14 triumph / gloating
4.15 pride
4.16 anger / agitation
4.17 sadness / despair
4.18 shame
4.19 dislike / scorn