IMPROVING LATENT SEMANTIC INDEXING BASED CLASSIFIER WITH INFORMATION GAIN
名词解释中英文对比<using_information_sources> social networks 社会网络abductive reasoning 溯因推理action recognition(行为识别)active learning(主动学习)adaptive systems 自适应系统adverse drugs reactions(药物不良反应)algorithm design and analysis(算法设计与分析) algorithm(算法)artificial intelligence 人工智能association rule(关联规则)attribute value taxonomy 属性分类规范automomous agent 自动代理automomous systems 自动系统background knowledge 背景知识bayes methods(贝叶斯方法)bayesian inference(贝叶斯推断)bayesian methods(bayes 方法)belief propagation(置信传播)better understanding 内涵理解big data 大数据big data(大数据)biological network(生物网络)biological sciences(生物科学)biomedical domain 生物医学领域biomedical research(生物医学研究)biomedical text(生物医学文本)boltzmann machine(玻尔兹曼机)bootstrapping method 拔靴法case based reasoning 实例推理causual models 因果模型citation matching (引文匹配)classification (分类)classification algorithms(分类算法)clistering algorithms 聚类算法cloud computing(云计算)cluster-based retrieval (聚类检索)clustering (聚类)clustering algorithms(聚类算法)clustering 聚类cognitive science 认知科学collaborative filtering (协同过滤)collaborative filtering(协同过滤)collabrative ontology development 联合本体开发collabrative ontology engineering 联合本体工程commonsense knowledge 常识communication networks(通讯网络)community detection(社区发现)complex data(复杂数据)complex dynamical networks(复杂动态网络)complex network(复杂网络)complex network(复杂网络)computational biology 计算生物学computational biology(计算生物学)computational complexity(计算复杂性) computational intelligence 智能计算computational modeling(计算模型)computer animation(计算机动画)computer networks(计算机网络)computer science 计算机科学concept clustering 概念聚类concept formation 概念形成concept learning 概念学习concept map 概念图concept model 概念模型concept modelling 概念模型conceptual model 概念模型conditional random field(条件随机场模型) conjunctive quries 合取查询constrained least squares (约束最小二乘) convex programming(凸规划)convolutional neural networks(卷积神经网络) customer relationship management(客户关系管理) data analysis(数据分析)data analysis(数据分析)data center(数据中心)data clustering (数据聚类)data compression(数据压缩)data envelopment analysis (数据包络分析)data fusion 数据融合data 
generation(数据生成)data handling(数据处理)data hierarchy (数据层次)data integration(数据整合)data integrity 数据完整性data intensive computing(数据密集型计算)data management 数据管理data management(数据管理)data management(数据管理)data miningdata mining 数据挖掘data model 数据模型data models(数据模型)data partitioning 数据划分data point(数据点)data privacy(数据隐私)data security(数据安全)data stream(数据流)data streams(数据流)data structure( 数据结构)data structure(数据结构)data visualisation(数据可视化)data visualization 数据可视化data visualization(数据可视化)data warehouse(数据仓库)data warehouses(数据仓库)data warehousing(数据仓库)database management systems(数据库管理系统)database management(数据库管理)date interlinking 日期互联date linking 日期链接Decision analysis(决策分析)decision maker 决策者decision making (决策)decision models 决策模型decision models 决策模型decision rule 决策规则decision support system 决策支持系统decision support systems (决策支持系统) decision tree(决策树)decission tree 决策树deep belief network(深度信念网络)deep learning(深度学习)defult reasoning 默认推理density estimation(密度估计)design methodology 设计方法论dimension reduction(降维) dimensionality reduction(降维)directed graph(有向图)disaster management 灾害管理disastrous event(灾难性事件)discovery(知识发现)dissimilarity (相异性)distributed databases 分布式数据库distributed databases(分布式数据库) distributed query 分布式查询document clustering (文档聚类)domain experts 领域专家domain knowledge 领域知识domain specific language 领域专用语言dynamic databases(动态数据库)dynamic logic 动态逻辑dynamic network(动态网络)dynamic system(动态系统)earth mover's distance(EMD 距离) education 教育efficient algorithm(有效算法)electric commerce 电子商务electronic health records(电子健康档案) entity disambiguation 实体消歧entity recognition 实体识别entity recognition(实体识别)entity resolution 实体解析event detection 事件检测event detection(事件检测)event extraction 事件抽取event identificaton 事件识别exhaustive indexing 完整索引expert system 专家系统expert systems(专家系统)explanation based learning 解释学习factor graph(因子图)feature extraction 特征提取feature extraction(特征提取)feature extraction(特征提取)feature selection (特征选择)feature selection 特征选择feature selection(特征选择)feature space 特征空间first order logic 一阶逻辑formal logic 
形式逻辑formal meaning prepresentation 形式意义表示formal semantics 形式语义formal specification 形式描述frame based system 框为本的系统frequent itemsets(频繁项目集)frequent pattern(频繁模式)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy data mining(模糊数据挖掘)fuzzy logic 模糊逻辑fuzzy set theory(模糊集合论)fuzzy set(模糊集)fuzzy sets 模糊集合fuzzy systems 模糊系统gaussian processes(高斯过程)gene expression data 基因表达数据gene expression(基因表达)generative model(生成模型)generative model(生成模型)genetic algorithm 遗传算法genome wide association study(全基因组关联分析) graph classification(图分类)graph classification(图分类)graph clustering(图聚类)graph data(图数据)graph data(图形数据)graph database 图数据库graph database(图数据库)graph mining(图挖掘)graph mining(图挖掘)graph partitioning 图划分graph query 图查询graph structure(图结构)graph theory(图论)graph theory(图论)graph theory(图论)graph theroy 图论graph visualization(图形可视化)graphical user interface 图形用户界面graphical user interfaces(图形用户界面)health care 卫生保健health care(卫生保健)heterogeneous data source 异构数据源heterogeneous data(异构数据)heterogeneous database 异构数据库heterogeneous information network(异构信息网络) heterogeneous network(异构网络)heterogenous ontology 异构本体heuristic rule 启发式规则hidden markov model(隐马尔可夫模型)hidden markov model(隐马尔可夫模型)hidden markov models(隐马尔可夫模型) hierarchical clustering (层次聚类) homogeneous network(同构网络)human centered computing 人机交互技术human computer interaction 人机交互human interaction 人机交互human robot interaction 人机交互image classification(图像分类)image clustering (图像聚类)image mining( 图像挖掘)image reconstruction(图像重建)image retrieval (图像检索)image segmentation(图像分割)inconsistent ontology 本体不一致incremental learning(增量学习)inductive learning (归纳学习)inference mechanisms 推理机制inference mechanisms(推理机制)inference rule 推理规则information cascades(信息追随)information diffusion(信息扩散)information extraction 信息提取information filtering(信息过滤)information filtering(信息过滤)information integration(信息集成)information network analysis(信息网络分析) information network mining(信息网络挖掘) information network(信息网络)information processing 信息处理information processing 信息处理information 
resource management (信息资源管理) information retrieval models(信息检索模型) information retrieval 信息检索information retrieval(信息检索)information retrieval(信息检索)information science 情报科学information sources 信息源information system( 信息系统)information system(信息系统)information technology(信息技术)information visualization(信息可视化)instance matching 实例匹配intelligent assistant 智能辅助intelligent systems 智能系统interaction network(交互网络)interactive visualization(交互式可视化)kernel function(核函数)kernel operator (核算子)keyword search(关键字检索)knowledege reuse 知识再利用knowledgeknowledgeknowledge acquisitionknowledge base 知识库knowledge based system 知识系统knowledge building 知识建构knowledge capture 知识获取knowledge construction 知识建构knowledge discovery(知识发现)knowledge extraction 知识提取knowledge fusion 知识融合knowledge integrationknowledge management systems 知识管理系统knowledge management 知识管理knowledge management(知识管理)knowledge model 知识模型knowledge reasoningknowledge representationknowledge representation(知识表达) knowledge sharing 知识共享knowledge storageknowledge technology 知识技术knowledge verification 知识验证language model(语言模型)language modeling approach(语言模型方法) large graph(大图)large graph(大图)learning(无监督学习)life science 生命科学linear programming(线性规划)link analysis (链接分析)link prediction(链接预测)link prediction(链接预测)link prediction(链接预测)linked data(关联数据)location based service(基于位置的服务) loclation based services(基于位置的服务) logic programming 逻辑编程logical implication 逻辑蕴涵logistic regression(logistic 回归)machine learning 机器学习machine translation(机器翻译)management system(管理系统)management( 知识管理)manifold learning(流形学习)markov chains 马尔可夫链markov processes(马尔可夫过程)matching function 匹配函数matrix decomposition(矩阵分解)matrix decomposition(矩阵分解)maximum likelihood estimation(最大似然估计)medical research(医学研究)mixture of gaussians(混合高斯模型)mobile computing(移动计算)multi agnet systems 多智能体系统multiagent systems 多智能体系统multimedia 多媒体natural language processing 自然语言处理natural language processing(自然语言处理) nearest neighbor (近邻)network analysis( 网络分析)network analysis(网络分析)network analysis(网络分析)network 
formation(组网)network structure(网络结构)network theory(网络理论)network topology(网络拓扑)network visualization(网络可视化)neural network(神经网络)neural networks (神经网络)neural networks(神经网络)nonlinear dynamics(非线性动力学)nonmonotonic reasoning 非单调推理nonnegative matrix factorization (非负矩阵分解) nonnegative matrix factorization(非负矩阵分解) object detection(目标检测)object oriented 面向对象object recognition(目标识别)object recognition(目标识别)online community(网络社区)online social network(在线社交网络)online social networks(在线社交网络)ontology alignment 本体映射ontology development 本体开发ontology engineering 本体工程ontology evolution 本体演化ontology extraction 本体抽取ontology interoperablity 互用性本体ontology language 本体语言ontology mapping 本体映射ontology matching 本体匹配ontology versioning 本体版本ontology 本体论open government data 政府公开数据opinion analysis(舆情分析)opinion mining(意见挖掘)opinion mining(意见挖掘)outlier detection(孤立点检测)parallel processing(并行处理)patient care(病人医疗护理)pattern classification(模式分类)pattern matching(模式匹配)pattern mining(模式挖掘)pattern recognition 模式识别pattern recognition(模式识别)pattern recognition(模式识别)personal data(个人数据)prediction algorithms(预测算法)predictive model 预测模型predictive models(预测模型)privacy preservation(隐私保护)probabilistic logic(概率逻辑)probabilistic logic(概率逻辑)probabilistic model(概率模型)probabilistic model(概率模型)probability distribution(概率分布)probability distribution(概率分布)project management(项目管理)pruning technique(修剪技术)quality management 质量管理query expansion(查询扩展)query language 查询语言query language(查询语言)query processing(查询处理)query rewrite 查询重写question answering system 问答系统random forest(随机森林)random graph(随机图)random processes(随机过程)random walk(随机游走)range query(范围查询)RDF database 资源描述框架数据库RDF query 资源描述框架查询RDF repository 资源描述框架存储库RDF storge 资源描述框架存储real time(实时)recommender system(推荐系统)recommender system(推荐系统)recommender systems 推荐系统recommender systems(推荐系统)record linkage 记录链接recurrent neural network(递归神经网络) regression(回归)reinforcement learning 强化学习reinforcement learning(强化学习)relation extraction 关系抽取relational database 关系数据库relational learning 关系学习relevance 
feedback (相关反馈)resource description framework 资源描述框架restricted boltzmann machines(受限玻尔兹曼机) retrieval models(检索模型)rough set theroy 粗糙集理论rough set 粗糙集rule based system 基于规则系统rule based 基于规则rule induction (规则归纳)rule learning (规则学习)rule learning 规则学习schema mapping 模式映射schema matching 模式匹配scientific domain 科学域search problems(搜索问题)semantic (web) technology 语义技术semantic analysis 语义分析semantic annotation 语义标注semantic computing 语义计算semantic integration 语义集成semantic interpretation 语义解释semantic model 语义模型semantic network 语义网络semantic relatedness 语义相关性semantic relation learning 语义关系学习semantic search 语义检索semantic similarity 语义相似度semantic similarity(语义相似度)semantic web rule language 语义网规则语言semantic web 语义网semantic web(语义网)semantic workflow 语义工作流semi supervised learning(半监督学习)sensor data(传感器数据)sensor networks(传感器网络)sentiment analysis(情感分析)sentiment analysis(情感分析)sequential pattern(序列模式)service oriented architecture 面向服务的体系结构shortest path(最短路径)similar kernel function(相似核函数)similarity measure(相似性度量)similarity relationship (相似关系)similarity search(相似搜索)similarity(相似性)situation aware 情境感知social behavior(社交行为)social influence(社会影响)social interaction(社交互动)social interaction(社交互动)social learning(社会学习)social life networks(社交生活网络)social machine 社交机器social media(社交媒体)social media(社交媒体)social media(社交媒体)social network analysis 社会网络分析social network analysis(社交网络分析)social network(社交网络)social network(社交网络)social science(社会科学)social tagging system(社交标签系统)social tagging(社交标签)social web(社交网页)sparse coding(稀疏编码)sparse matrices(稀疏矩阵)sparse representation(稀疏表示)spatial database(空间数据库)spatial reasoning 空间推理statistical analysis(统计分析)statistical model 统计模型string matching(串匹配)structural risk minimization (结构风险最小化) structured data 结构化数据subgraph matching 子图匹配subspace clustering(子空间聚类)supervised learning( 有support vector machine 支持向量机support vector machines(支持向量机)system dynamics(系统动力学)tag recommendation(标签推荐)taxonmy induction 感应规范temporal logic 时态逻辑temporal reasoning 时序推理text analysis(文本分析)text anaylsis 
文本分析text classification (文本分类)text data(文本数据)text mining technique(文本挖掘技术)text mining 文本挖掘text mining(文本挖掘)text summarization(文本摘要)thesaurus alignment 同义对齐time frequency analysis(时频分析)time series analysis( 时time series data(时间序列数据)time series data(时间序列数据)time series(时间序列)topic model(主题模型)topic modeling(主题模型)transfer learning 迁移学习triple store 三元组存储uncertainty reasoning 不精确推理undirected graph(无向图)unified modeling language 统一建模语言unsupervisedupper bound(上界)user behavior(用户行为)user generated content(用户生成内容)utility mining(效用挖掘)visual analytics(可视化分析)visual content(视觉内容)visual representation(视觉表征)visualisation(可视化)visualization technique(可视化技术) visualization tool(可视化工具)web 2.0(网络2.0)web forum(web 论坛)web mining(网络挖掘)web of data 数据网web ontology lanuage 网络本体语言web pages(web 页面)web resource 网络资源web science 万维科学web search (网络检索)web usage mining(web 使用挖掘)wireless networks 无线网络world knowledge 世界知识world wide web 万维网world wide web(万维网)xml database 可扩展标志语言数据库附录 2 Data Mining 知识图谱(共包含二级节点15 个,三级节点93 个)间序列分析)监督学习)领域 二级分类 三级分类。
Medical Text Pretraining with a BERT-base Model: Overview and Explanation

1. Introduction

1.1 Overview

Medical text pretraining is a research direction that has emerged in recent years. It uses deep learning models, in particular neural-network-based natural language processing models, to pretrain on large-scale medical text corpora and thereby extract knowledge and patterns specific to the medical domain.
Within this area, the BERT-base model is a widely used pretrained model. Its strong text-representation and generalization abilities make it effective for processing and analyzing medical text. This article introduces the principles and applications of medical text pretraining and of the BERT-base model, discusses their potential uses and development prospects in medicine, and aims to serve as a comprehensive reference for readers who want to understand and apply these techniques.
1.2 Structure of the Article

The introduction reviews the background and importance of medical text pretraining. We then present the basic principles and characteristics of the BERT-base model, and discuss how medical text pretraining and the BERT-base model are applied in the medical domain. The conclusion summarizes the key characteristics and practical value of both, looks ahead to their future development in medicine, and closes by underlining their importance and potential contributions.
1.3 Purpose

This article examines how medical text pretraining is used in natural language processing, focusing on pretraining methods built on the BERT-base model. Pretraining on medical-domain text improves performance on downstream tasks such as medical text understanding and medical knowledge-graph construction, supports the application and mining of medical big data, and helps advance medical artificial intelligence. The discussion is intended to convey the importance and prospects of the technique and to offer theoretical support and practical guidance for further work in medical natural language processing.
2. Main Body

2.1 Medical Text Pretraining

Medical text pretraining means pretraining a model specifically on medical text corpora. In general-domain natural language processing, pretrained models such as BERT perform very well on generic text, but their out-of-the-box performance in the medical domain is often unsatisfactory.
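A common remedy is continued, domain-adaptive pretraining: the model keeps BERT's masked-language-model objective but runs it over medical corpora. The sketch below implements only the standard 80/10/10 masking rule on toy token IDs; the constants and function name are illustrative assumptions, not the API of any particular library.

```python
import random

MASK_ID = 103        # hypothetical id for the [MASK] token
VOCAB_SIZE = 30522   # BERT-base WordPiece vocabulary size
IGNORE = -100        # label ignored by the MLM loss at unselected positions

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """BERT-style masking: select ~15% of positions for prediction;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                                   # position not selected
        labels[i] = tok                                # model must predict the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID                        # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)      # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels
```

In a real pipeline the (inputs, labels) pairs would be fed to a BERT-style encoder with a cross-entropy loss that skips positions labeled IGNORE.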
Draft: Deep Learning in Neural Networks: An Overview
Technical Report IDSIA-03-14 / arXiv:1404.7828 (v1.5) [cs.NE]
Jürgen Schmidhuber
The Swiss AI Lab IDSIA
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
University of Lugano & SUPSI
Galleria 2, 6928 Manno-Lugano, Switzerland
15 May 2014

Abstract

In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

PDF of earlier draft (v1): http://www.idsia.ch/~juergen/DeepLearning30April2014.pdf
LaTeX source: http://www.idsia.ch/~juergen/DeepLearning30April2014.tex
Complete BibTeX file: http://www.idsia.ch/~juergen/bib.bib

Preface

This is the draft of an invited Deep Learning (DL) overview. One of its goals is to assign credit to those who contributed to the present state of the art. I acknowledge the limitations of attempting to achieve this goal. The DL research community itself may be viewed as a continually evolving, deep network of scientists who have influenced each other in complex ways. Starting from recent DL results, I tried to trace back the origins of relevant ideas through the past half century and beyond, sometimes using "local search" to follow citations of citations backwards in time. Since not all DL publications properly acknowledge earlier relevant work, additional global search strategies were employed, aided by consulting numerous neural network experts. As a result, the present draft mostly consists of references (about 800 entries so far). Nevertheless, through an expert
selection bias I may have missed important work. A related bias was surely introduced by my special familiarity with the work of my own DL research group in the past quarter-century. For these reasons, the present draft should be viewed as merely a snapshot of an ongoing credit assignment process. To help improve it, please do not hesitate to send corrections and suggestions to juergen@idsia.ch.

Contents

1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
2 Event-Oriented Notation for Activation Spreading in FNNs/RNNs
3 Depth of Credit Assignment Paths (CAPs) and of Problems
4 Recurring Themes of Deep Learning
  4.1 Dynamic Programming (DP) for DL
  4.2 Unsupervised Learning (UL) Facilitating Supervised Learning (SL) and RL
  4.3 Occam's Razor: Compression and Minimum Description Length (MDL)
  4.4 Learning Hierarchical Representations Through Deep SL, UL, RL
  4.5 Fast Graphics Processing Units (GPUs) for DL in NNs
5 Supervised NNs, Some Helped by Unsupervised NNs
  5.1 1940s and Earlier
  5.2 Around 1960: More Neurobiological Inspiration for DL
  5.3 1965: Deep Networks Based on the Group Method of Data Handling (GMDH)
  5.4 1979: Convolution + Weight Replication + Winner-Take-All (WTA)
  5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs
    5.5.1 BP for Weight-Sharing Feedforward NNs (FNNs) and Recurrent NNs (RNNs)
  5.6 Late 1980s-2000: Numerous Improvements of NNs
    5.6.1 Ideas for Dealing with Long Time Lags and Deep CAPs
    5.6.2 Better BP Through Advanced Gradient Descent
    5.6.3 Discovering Low-Complexity, Problem-Solving NNs
    5.6.4 Potential Benefits of UL for SL
  5.7 1987: UL Through Autoencoder (AE) Hierarchies
  5.8 1989: BP for Convolutional NNs (CNNs)
  5.9 1991: Fundamental Deep Learning Problem of Gradient Descent
  5.10 1991: UL-Based History Compression Through a Deep Hierarchy of RNNs
  5.11 1992: Max-Pooling (MP): Towards MPCNNs
  5.12 1994: Contest-Winning Not So Deep NNs
  5.13 1995: Supervised Recurrent Very Deep Learner (LSTM RNN)
  5.14 2003: More Contest-Winning/Record-Setting, Often Not So Deep NNs
  5.15 2006/7: Deep Belief Networks (DBNs) & AE Stacks Fine-Tuned by BP
  5.16 2006/7: Improved CNNs/GPU-CNNs/BP-Trained MPCNNs
  5.17 2009: First Official Competitions Won by RNNs, and with MPCNNs
  5.18 2010: Plain Backprop (+ Distortions) on GPU Yields Excellent Results
  5.19 2011: MPCNNs on GPU Achieve Superhuman Vision Performance
  5.20 2011: Hessian-Free Optimization for RNNs
  5.21 2012: First Contests Won on ImageNet & Object Detection & Segmentation
  5.22 2013-: More Contests and Benchmark Records
    5.22.1 Currently Successful Supervised Techniques: LSTM RNNs/GPU-MPCNNs
  5.23 Recent Tricks for Improving SL Deep NNs (Compare Sec. 5.6.2, 5.6.3)
  5.24 Consequences for Neuroscience
  5.25 DL with Spiking Neurons?
6 DL in FNNs and RNNs for Reinforcement Learning (RL)
  6.1 RL Through NN World Models Yields RNNs With Deep CAPs
  6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
  6.3 Deep RL RNNs for Partially Observable MDPs (POMDPs)
  6.4 RL Facilitated by Deep UL in FNNs and RNNs
  6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
  6.6 Deep RL by Direct NN Search / Policy Gradients / Evolution
  6.7 Deep RL by Indirect Policy Search / Compressed NN Search
  6.8 Universal RL
7 Conclusion

1 Introduction to Deep Learning (DL) in Neural Networks (NNs)

Which modifiable components of a learning system are responsible for its success or failure? What changes to them improve performance? This has been called the fundamental credit assignment problem (Minsky, 1963). There are general credit assignment methods for universal problem solvers that are time-optimal in various theoretical senses (Sec. 6.8). The present survey, however, will focus on the narrower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks (NNs). We are interested in accurate credit assignment across possibly many, often nonlinear, computational stages of NNs. Shallow NN-like models have been
around for many decades if not centuries (Sec. 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s (Sec. 5.3) and 1970s (Sec. 5.5). An efficient gradient descent method for teacher-based Supervised Learning (SL) in discrete, differentiable networks of arbitrary depth called backpropagation (BP) was developed in the 1960s and 1970s, and applied to NNs in 1981 (Sec. 5.5). BP-based training of deep NNs with many layers, however, had been found to be difficult in practice by the late 1980s (Sec. 5.6), and had become an explicit research subject by the early 1990s (Sec. 5.9). DL became practically feasible to some extent through the help of Unsupervised Learning (UL) (e.g., Sec. 5.10, 5.15). The 1990s and 2000s also saw many improvements of purely supervised DL (Sec. 5). In the new millennium, deep NNs have finally attracted widespread attention, mainly by outperforming alternative machine learning methods such as kernel machines (Vapnik, 1995; Schölkopf et al., 1998) in numerous important applications. In fact, supervised deep NNs have won numerous official international pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first superhuman visual pattern recognition results in limited domains (Sec. 5.19). Deep NNs also have become relevant for the more general field of Reinforcement Learning (RL) where there is no supervising teacher (Sec. 6).

Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests (Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Sec. 3) — they are general computers more powerful than FNNs, and can in principle create and process memories of arbitrary sequences of input patterns (e.g., Siegelmann and Sontag, 1991; Schmidhuber, 1990a). Unlike traditional methods for automatic sequential program synthesis (e.g., Waldinger and Lee, 1969; Balzer, 1985; Soloway, 1986; Deville and Lau, 1994), RNNs can learn programs that mix sequential and parallel information processing in a natural and efficient way, exploiting the massive parallelism viewed as crucial for sustaining the rapid decline of computation cost observed over the past 75 years.

The rest of this paper is structured as follows. Sec. 2 introduces a compact, event-oriented notation that is simple yet general enough to accommodate both FNNs and RNNs. Sec. 3 introduces the concept of Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is of the deep or shallow type. Sec. 4 lists recurring themes of DL in SL, UL, and RL. Sec. 5 focuses on SL and UL, and on how UL can facilitate SL, although pure SL has become dominant in recent competitions (Sec. 5.17-5.22). Sec. 5 is arranged in a historical timeline format with subsections on important inspirations and technical contributions. Sec. 6 on deep RL discusses traditional Dynamic Programming (DP)-based RL combined with gradient-based search techniques for SL or UL in deep NNs, as well as general methods for direct and indirect search in the weight space of deep FNNs and RNNs, including successful policy gradient and evolutionary methods.

2 Event-Oriented Notation for Activation Spreading in FNNs/RNNs

Throughout this paper, let i, j, k, t, p, q, r denote positive integer variables assuming ranges implicit in the given contexts. Let n, m, T denote positive integer constants.

An NN's topology may change over time (e.g., Fahlman, 1991; Ring, 1991; Weng et al., 1992; Fritzke, 1994). At any given moment, it can be described as a finite subset of units (or nodes or neurons) N = {u_1, u_2, ...} and a finite set H ⊆ N × N of directed edges or connections between nodes. FNNs are acyclic graphs, RNNs cyclic. The first (input) layer is the set of input units, a subset of N. In FNNs, the k-th layer (k > 1) is the set of all nodes u ∈ N such that there is an edge path of length k-1 (but no longer path) between some input unit and u. There may be shortcut connections between distant layers. The NN's behavior or program is determined by a set of real-valued, possibly modifiable, parameters or weights w_i (i = 1, ..., n). We now focus
on a single finite episode or epoch of information processing and activation spreading, without learning through weight changes. The following slightly unconventional notation is designed to compactly describe what is happening during the runtime of the system.

During an episode, there is a partially causal sequence x_t (t = 1, ..., T) of real values that I call events. Each x_t is either an input set by the environment, or the activation of a unit that may directly depend on other x_k (k < t) through a current NN topology-dependent set in_t of indices k representing incoming causal connections or links. Let the function v encode topology information and map such event index pairs (k, t) to weight indices. For example, in the non-input case we may have x_t = f_t(net_t) with real-valued net_t = Σ_{k ∈ in_t} x_k w_{v(k,t)} (additive case) or net_t = Π_{k ∈ in_t} x_k w_{v(k,t)} (multiplicative case), where f_t is a typically nonlinear real-valued activation function such as tanh. In many recent competition-winning NNs (Sec. 5.19, 5.21, 5.22) there also are events of the type x_t = max_{k ∈ in_t}(x_k); some network types may also use complex polynomial activation functions (Sec. 5.3). x_t may directly affect certain x_k (k > t) through outgoing connections or links represented through a current set out_t of indices k with t ∈ in_k. Some non-input events are called output events.

Note that many of the x_t may refer to different, time-varying activations of the same unit in sequence-processing RNNs (e.g., Williams, 1989, "unfolding in time"), or also in FNNs sequentially exposed to time-varying input patterns of a large training set encoded as input events. During an episode, the same weight may get reused over and over again in topology-dependent ways, e.g., in RNNs, or in convolutional NNs (Sec. 5.4, 5.8). I call this weight sharing across space and/or time. Weight sharing may greatly reduce the NN's descriptive complexity, which is the number of bits of information required to describe the NN (Sec. 4.3).

In Supervised Learning (SL), certain NN output events x_t may be associated with teacher-given, real-valued labels or targets d_t yielding errors e_t, e.g., e_t = 1/2 (x_t - d_t)^2. A typical goal of supervised NN training is to find weights that yield episodes with small total error E, the sum of all such e_t. The hope is that the NN will generalize well in later episodes, causing only small errors on previously unseen sequences of input events. Many alternative error functions for SL and UL are possible.

SL assumes that input events are independent of earlier output events (which may affect the environment through actions causing subsequent perceptions). This assumption does not hold in the broader fields of Sequential Decision Making and Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton and Barto, 1998; Hutter, 2005) (Sec. 6). In RL, some of the input events may encode real-valued reward signals given by the environment, and a typical goal is to find weights that yield episodes with a high sum of reward signals, through sequences of appropriate output actions.

Sec. 5.5 will use the notation above to compactly describe a central algorithm of DL, namely, backpropagation (BP) for supervised weight-sharing FNNs and RNNs. (FNNs may be viewed as RNNs with certain fixed zero weights.) Sec. 6 will address the more general RL case.

3 Depth of Credit Assignment Paths (CAPs) and of Problems

To measure whether credit assignment in a given NN application is of the deep or shallow type, I introduce the concept of Credit Assignment Paths or CAPs, which are chains of possibly causal links between events.

Let us first focus on SL. Consider two events x_p and x_q (1 ≤ p < q ≤ T). Depending on the application, they may have a Potential Direct Causal Connection (PDCC) expressed by the Boolean predicate pdcc(p, q), which is true if and only if p ∈ in_q. Then the 2-element list (p, q) is defined to be a CAP from p to q (a minimal one). A learning algorithm may be allowed to change w_{v(p,q)} to improve performance in future episodes. More general, possibly indirect, Potential Causal Connections (PCC) are
expressed by the recursively defined Boolean predicate pcc(p, q), which in the SL case is true only if pdcc(p, q), or if pcc(p, k) for some k and pdcc(k, q). In the latter case, appending q to any CAP from p to k yields a CAP from p to q (this is a recursive definition, too). The set of such CAPs may be large but is finite. Note that the same weight may affect many different PDCCs between successive events listed by a given CAP, e.g., in the case of RNNs, or weight-sharing FNNs.

Suppose a CAP has the form (..., k, t, ..., q), where k and t (possibly t = q) are the first successive elements with modifiable w_{v(k,t)}. Then the length of the suffix list (t, ..., q) is called the CAP's depth (which is 0 if there are no modifiable links at all). This depth limits how far backwards credit assignment can move down the causal chain to find a modifiable weight.[1]

Suppose an episode and its event sequence x_1, ..., x_T satisfy a computable criterion used to decide whether a given problem has been solved (e.g., total error E below some threshold). Then the set of used weights is called a solution to the problem, and the depth of the deepest CAP within the sequence is called the solution's depth. There may be other solutions (yielding different event sequences) with different depths. Given some fixed NN topology, the smallest depth of any solution is called the problem's depth.

Sometimes we also speak of the depth of an architecture: SL FNNs with fixed topology imply a problem-independent maximal problem depth bounded by the number of non-input layers. Certain SL RNNs with fixed weights for all connections except those to output units (Jaeger, 2001; Maass et al., 2002; Jaeger, 2004; Schrauwen et al., 2007) have a maximal problem depth of 1, because only the final links in the corresponding CAPs are modifiable. In general, however, RNNs may learn to solve problems of potentially unlimited depth.

Note that the definitions above are solely based on the depths of causal chains, and agnostic of the temporal distance between events. For example, shallow FNNs perceiving large "time windows" of input events may correctly classify long input sequences through appropriate output events, and thus solve shallow problems involving long time lags between relevant events.

At which problem depth does Shallow Learning end, and Deep Learning begin? Discussions with DL experts have not yet yielded a conclusive response to this question. Instead of committing myself to a precise answer, let me just define for the purposes of this overview: problems of depth > 10 require Very Deep Learning.

The difficulty of a problem may have little to do with its depth. Some NNs can quickly learn to solve certain deep problems, e.g., through random weight guessing (Sec. 5.9) or other types of direct search (Sec. 6.6) or indirect search (Sec. 6.7) in weight space, or through training an NN first on shallow problems whose solutions may then generalize to deep problems, or through collapsing sequences of (non)linear operations into a single (non)linear operation — but see an analysis of non-trivial aspects of deep linear networks (Baldi and Hornik, 1994, Section B). In general, however, finding an NN that precisely models a given training set is an NP-complete problem (Judd, 1990; Blum and Rivest, 1992), also in the case of deep NNs (Síma, 1994; de Souto et al., 1999; Windisch, 2005); compare a survey of negative results (Síma, 2002, Section 1).

Above we have focused on SL. In the more general case of RL in unknown environments, pcc(p, q) is also true if x_p is an output event and x_q any later input event — any action may affect the environment and thus any later perception. (In the real world, the environment may even influence non-input events computed on a physical hardware entangled with the entire universe, but this is ignored here.) It is possible to model and replace such unmodifiable environmental PCCs through a part of the NN that has already learned to predict (through some of its units) input events (including reward signals) from former input events and actions (Sec. 6.1). Its weights are
frozen,but can help to assign credit to other,still modifiable weights used to compute actions(Sec.6.1).This approach may lead to very deep CAPs though.Some DL research is about automatically rephrasing problems such that their depth is reduced(Sec.4). In particular,sometimes UL is used to make SL problems less deep,e.g.,Sec.5.10.Often Dynamic Programming(Sec.4.1)is used to facilitate certain traditional RL problems,e.g.,Sec.6.2.Sec.5focuses on CAPs for SL,Sec.6on the more complex case of RL.4Recurring Themes of Deep Learning4.1Dynamic Programming(DP)for DLOne recurring theme of DL is Dynamic Programming(DP)(Bellman,1957),which can help to facili-tate credit assignment under certain assumptions.For example,in SL NNs,backpropagation itself can 1An alternative would be to count only modifiable links when measuring depth.In many typical NN applications this would not make a difference,but in some it would,e.g.,Sec.6.1.be viewed as a DP-derived method(Sec.5.5).In traditional RL based on strong Markovian assumptions, DP-derived methods can help to greatly reduce problem depth(Sec.6.2).DP algorithms are also essen-tial for systems that combine concepts of NNs and graphical models,such as Hidden Markov Models (HMMs)(Stratonovich,1960;Baum and Petrie,1966)and Expectation Maximization(EM)(Dempster et al.,1977),e.g.,(Bottou,1991;Bengio,1991;Bourlard and Morgan,1994;Baldi and Chauvin,1996; Jordan and Sejnowski,2001;Bishop,2006;Poon and Domingos,2011;Dahl et al.,2012;Hinton et al., 2012a).4.2Unsupervised Learning(UL)Facilitating Supervised Learning(SL)and RL Another recurring theme is how UL can facilitate both SL(Sec.5)and RL(Sec.6).UL(Sec.5.6.4) is normally used to encode raw incoming data such as video or speech streams in a form that is more convenient for subsequent goal-directed learning.In particular,codes that describe the original data in a less redundant or more compact way can be fed into SL(Sec.5.10,5.15)or RL machines(Sec.6.4),whose search spaces may thus become 
smaller (and whose CAPs shallower) than those necessary for dealing with the raw data. UL is closely connected to the topics of regularization and compression (Sec. 4.3, 5.6.3).

4.3 Occam's Razor: Compression and Minimum Description Length (MDL)

Occam's razor favors simple solutions over complex ones. Given some programming language, the principle of Minimum Description Length (MDL) can be used to measure the complexity of a solution candidate by the length of the shortest program that computes it (e.g., Solomonoff, 1964; Kolmogorov, 1965b; Chaitin, 1966; Wallace and Boulton, 1968; Levin, 1973a; Rissanen, 1986; Blumer et al., 1987; Li and Vitányi, 1997; Grünwald et al., 2005). Some methods explicitly take into account program runtime (Allender, 1992; Watanabe, 1992; Schmidhuber, 2002, 1995); many consider only programs with constant runtime, written in non-universal programming languages (e.g., Rissanen, 1986; Hinton and van Camp, 1993). In the NN case, the MDL principle suggests that low NN weight complexity corresponds to high NN probability in the Bayesian view (e.g., MacKay, 1992; Buntine and Weigend, 1991; De Freitas, 2003), and to high generalization performance (e.g., Baum and Haussler, 1989), without overfitting the training data. Many methods have been proposed for regularizing NNs, that is, searching for solution-computing, low-complexity SL NNs (Sec. 5.6.3) and RL NNs (Sec. 6.7). This is closely related to certain UL methods (Sec. 4.2, 5.6.4).

4.4 Learning Hierarchical Representations Through Deep SL, UL, RL

Many methods of Good Old-Fashioned Artificial Intelligence (GOFAI) (Nilsson, 1980) as well as more recent approaches to AI (Russell et al., 1995) and Machine Learning (Mitchell, 1997) learn hierarchies of more and more abstract data representations. For example, certain methods of syntactic pattern recognition (Fu, 1977) such as grammar induction discover hierarchies of formal rules to model observations.
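The MDL idea of Sec. 4.3 above, measuring a candidate's complexity by the length of its shortest description, can be made concrete with a toy sketch. The following is an illustration only, not a method from the text: an off-the-shelf compressor serves as a crude upper bound on description length, since the true shortest program is uncomputable in general. A highly regular sequence then admits a far shorter description than an incompressible one.

```python
import random
import zlib

def description_length(data: bytes) -> int:
    """Crude MDL proxy: size in bytes of the zlib-compressed data.
    Real MDL uses the shortest program that computes the data, which is
    uncomputable in general; compression gives a practical upper bound."""
    return len(zlib.compress(data, 9))

regular = b"ab" * 500  # generated by a trivial rule, so highly compressible
random.seed(0)
noisy = bytes(random.randrange(256) for _ in range(1000))  # no structure

# Occam's razor, under this proxy: the regular sequence is the simpler one.
assert description_length(regular) < description_length(noisy)
```

Under this proxy, a regularizer that prefers low-complexity networks plays the same role: among candidates that fit the data, prefer the one with the shorter description.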
The partially (un)supervised Automated Mathematician / EURISKO (Lenat, 1983; Lenat and Brown, 1984) continually learns concepts by combining previously learnt concepts. Such hierarchical representation learning (Ring, 1994; Bengio et al., 2013; Deng and Yu, 2014) is also a recurring theme of DL NNs for SL (Sec. 5), UL-aided SL (Sec. 5.7, 5.10, 5.15), and hierarchical RL (Sec. 6.5). Often, abstract hierarchical representations are natural by-products of data compression (Sec. 4.3), e.g., Sec. 5.10.

4.5 Fast Graphics Processing Units (GPUs) for DL in NNs

While the previous millennium saw several attempts at creating fast NN-specific hardware (e.g., Jackel et al., 1990; Faggin, 1992; Ramacher et al., 1993; Widrow et al., 1994; Heemskerk, 1995; Korkin et al., 1997; Urlbe, 1999), and at exploiting standard hardware (e.g., Anguita et al., 1994; Muller et al., 1995; Anguita and Gomes, 1996), the new millennium brought a DL breakthrough in the form of cheap, multi-processor graphics cards or GPUs. GPUs are widely used for video games, a huge and competitive market that has driven down hardware prices. GPUs excel at the fast matrix and vector multiplications required not only for convincing virtual realities but also for NN training, where they can speed up learning by a factor of 50 and more. Some of the GPU-based FNN implementations (Sec. 5.16-5.19) have greatly contributed to recent successes in contests for pattern recognition (Sec. 5.19-5.22), image segmentation (Sec. 5.21), and object detection (Sec. 5.21-5.22).

5 Supervised NNs, Some Helped by Unsupervised NNs

The main focus of current practical applications is on Supervised Learning (SL), which has dominated recent pattern recognition contests (Sec. 5.17-5.22). Several methods, however, use additional Unsupervised Learning (UL) to facilitate SL (Sec. 5.7, 5.10, 5.15). It does make sense to treat SL and UL in the same section: often gradient-based methods, such as BP (Sec. 5.5.1), are used to optimize objective functions of both UL and SL, and the boundary between SL and UL may blur, for example, when it comes to time series
prediction and sequence classification, e.g., Sec. 5.10, 5.12.

A historical timeline format will help to arrange subsections on important inspirations and technical contributions (although such a subsection may span a time interval of many years). Sec. 5.1 briefly mentions early, shallow NN models since the 1940s; Sec. 5.2 additional early neurobiological inspiration relevant for modern Deep Learning (DL). Sec. 5.3 is about GMDH networks (since 1965), perhaps the first (feedforward) DL systems. Sec. 5.4 is about the relatively deep Neocognitron NN (1979), which is similar to certain modern deep FNN architectures, as it combines convolutional NNs (CNNs), weight pattern replication, and winner-take-all (WTA) mechanisms. Sec. 5.5 uses the notation of Sec. 2 to compactly describe a central algorithm of DL, namely, backpropagation (BP) for supervised weight-sharing FNNs and RNNs. It also summarizes the history of BP 1960-1981 and beyond. Sec. 5.6 describes problems encountered in the late 1980s with BP for deep NNs, and mentions several ideas from the previous millennium to overcome them. Sec. 5.7 discusses a first hierarchical stack of coupled UL-based Autoencoders (AEs); this concept resurfaced in the new millennium (Sec. 5.15). Sec. 5.8 is about applying BP to CNNs, which is important for today's DL applications. Sec. 5.9 explains BP's Fundamental DL Problem (of vanishing/exploding gradients) discovered in 1991. Sec. 5.10 explains how a deep RNN stack of 1991 (the History Compressor) pre-trained by UL helped to solve previously unlearnable DL benchmarks requiring Credit Assignment Paths (CAPs, Sec. 3) of depth 1000 and more. Sec. 5.11 discusses a particular WTA method called Max-Pooling (MP), important in today's DL FNNs. Sec. 5.12 mentions a first important contest won by SL NNs in 1994. Sec. 5.13 describes a purely supervised DL RNN (Long Short-Term Memory, LSTM) for problems of depth 1000 and more. Sec. 5.14 mentions an early contest of 2003 won by an ensemble of shallow NNs, as well as good pattern recognition results with CNNs and LSTM RNNs (2003). Sec. 5.15 is
mostly about Deep Belief Networks (DBNs, 2006) and related stacks of Autoencoders (AEs, Sec. 5.7) pre-trained by UL to facilitate BP-based SL. Sec. 5.16 mentions the first BP-trained MPCNNs (2007) and GPU-CNNs (2006). Sec. 5.17-5.22 focus on official competitions with secret test sets won by (mostly purely supervised) DL NNs since 2009, in sequence recognition, image classification, image segmentation, and object detection. Many RNN results depended on LSTM (Sec. 5.13); many FNN results depended on GPU-based FNN code developed since 2004 (Sec. 5.16, 5.17, 5.18, 5.19), in particular, GPU-MPCNNs (Sec. 5.19).

5.1 1940s and Earlier

NN research started in the 1940s (e.g., McCulloch and Pitts, 1943; Hebb, 1949); compare also later work on learning NNs (Rosenblatt, 1958, 1962; Widrow and Hoff, 1962; Grossberg, 1969; Kohonen, 1972; von der Malsburg, 1973; Narendra and Thathatchar, 1974; Willshaw and von der Malsburg, 1976; Palm, 1980; Hopfield, 1982). In a sense NNs have been around even longer, since early supervised NNs were essentially variants of linear regression methods going back at least to the early 1800s (e.g., Legendre, 1805; Gauss, 1809, 1821). Early NNs had a maximal CAP depth of 1 (Sec. 3).

5.2 Around 1960: More Neurobiological Inspiration for DL

Simple cells and complex cells were found in the cat's visual cortex (e.g., Hubel and Wiesel, 1962; Wiesel and Hubel, 1959). These cells fire in response to certain properties of visual sensory inputs, such as the orientation of edges. Complex cells exhibit more spatial invariance than simple cells. This inspired later deep NN architectures (Sec. 5.4) used in certain modern award-winning Deep Learners (Sec. 5.19-5.22).

5.3 1965: Deep Networks Based on the Group Method of Data Handling (GMDH)

Networks trained by the Group Method of Data Handling (GMDH) (Ivakhnenko and Lapa, 1965; Ivakhnenko et al., 1967; Ivakhnenko, 1968, 1971) were perhaps the first DL systems of the Feedforward Multilayer Perceptron type. The units of GMDH nets may have polynomial activation functions implementing Kolmogorov-Gabor polynomials (more general than
traditional NN activation functions). Given a training set, layers are incrementally grown and trained by regression analysis, then pruned with the help of a separate validation set (using today's terminology), where Decision Regularisation is used to weed out superfluous units. The numbers of layers and units per layer can be learned in problem-dependent fashion. This is a good example of hierarchical representation learning (Sec. 4.4). There have been numerous applications of GMDH-style networks, e.g. (Ikeda et al., 1976; Farlow, 1984; Madala and Ivakhnenko, 1994; Ivakhnenko, 1995; Kondo, 1998; Kordík et al., 2003; Witczak et al., 2006; Kondo and Ueno, 2008).

5.4 1979: Convolution + Weight Replication + Winner-Take-All (WTA)

Apart from deep GMDH networks (Sec. 5.3), the Neocognitron (Fukushima, 1979, 1980, 2013a) was perhaps the first artificial NN that deserved the attribute deep, and the first to incorporate the neurophysiological insights of Sec. 5.2. It introduced convolutional NNs (today often called CNNs or convnets), where the (typically rectangular) receptive field of a convolutional unit with given weight vector is shifted step by step across a 2-dimensional array of input values, such as the pixels of an image. The resulting 2D array of subsequent activation events of this unit can then provide inputs to higher-level units, and so on. Due to massive weight replication (Sec. 2), relatively few parameters may be necessary to describe the behavior of such a convolutional layer. Competition layers have WTA subsets whose maximally active units are the only ones to adopt non-zero activation values. They essentially "down-sample" the competition layer's input. This helps to create units whose responses are insensitive to small image shifts (compare Sec. 5.2). The Neocognitron is very similar to the architecture of modern, contest-winning, purely supervised, feedforward, gradient-based Deep Learners with alternating convolutional and competition layers (e.g., Sec. 5.19-5.22). Fukushima, however, did not set the weights by supervised
backpropagation (Sec. 5.5, 5.8), but by local unsupervised learning rules (e.g., Fukushima, 2013b), or by pre-wiring. In that sense he did not care for the DL problem (Sec. 5.9), although his architecture was comparatively deep indeed. He also used Spatial Averaging (Fukushima, 1980, 2011) instead of Max-Pooling (MP, Sec. 5.11), currently a particularly convenient and popular WTA mechanism. Today's CNN-based DL machines profit a lot from later CNN work (e.g., LeCun et al., 1989; Ranzato et al., 2007) (Sec. 5.8, 5.16, 5.19).

5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs

The minimisation of errors through gradient descent (Hadamard, 1908) in the parameter space of complex, nonlinear, differentiable, multi-stage, NN-related systems has been discussed at least since the early 1960s (e.g., Kelley, 1960; Bryson, 1961; Bryson and Denham, 1961; Pontryagin et al., 1961; Dreyfus, 1962; Wilkinson, 1965; Amari, 1967; Bryson and Ho, 1969; Director and Rohrer, 1969; Griewank, 2012), initially within the framework of Euler-Lagrange equations in the Calculus of Variations (e.g., Euler, 1744). Steepest descent in such systems can be performed (Bryson, 1961; Kelley, 1960; Bryson and Ho, 1969) by iterating the ancient chain rule (Leibniz, 1676; L'Hôpital, 1696) in Dynamic Programming (DP) style (Bellman, 1957). A simplified derivation of the method uses the chain rule only (Dreyfus, 1962). The methods of the 1960s were already efficient in the DP sense. However, they backpropagated derivative information through standard Jacobian matrix calculations from one "layer" to the previous one, explicitly addressing neither direct links across several layers nor potential additional efficiency gains due to network sparsity (but perhaps such enhancements seemed obvious to the authors).
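The chain-rule iteration behind backpropagation can be sketched in a few lines. The following is a minimal illustration of the generic idea, not any of the specific historical formulations cited above: a two-weight network whose reverse-mode gradient is checked against a finite-difference estimate.

```python
import math

def forward_backward(x, target, w1, w2):
    """Tiny net: h = tanh(w1*x), y = w2*h, loss L = (y - target)^2.
    The backward pass iterates the chain rule from the loss to each weight."""
    h = math.tanh(w1 * x)
    y = w2 * h
    loss = (y - target) ** 2
    dL_dy = 2.0 * (y - target)           # dL/dy
    dL_dw2 = dL_dy * h                   # dy/dw2 = h
    dL_dh = dL_dy * w2                   # dy/dh = w2
    dL_dw1 = dL_dh * (1.0 - h * h) * x   # tanh'(z) = 1 - tanh(z)^2; dh/dw1 = tanh'(w1*x)*x
    return loss, dL_dw1, dL_dw2

# Verify the backpropagated gradient against a central finite difference.
x, t, w1, w2, eps = 0.5, 1.0, 0.3, -0.7, 1e-6
loss, g1, g2 = forward_backward(x, t, w1, w2)
num_g1 = (forward_backward(x, t, w1 + eps, w2)[0] -
          forward_backward(x, t, w1 - eps, w2)[0]) / (2 * eps)
assert abs(g1 - num_g1) < 1e-6
```

Deeper networks repeat exactly this pattern layer by layer, passing each layer's error signal backward through the local Jacobian, which is what makes the method efficient in the DP sense described above.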
The CiteSpace Manual
Version 1.01
Chaomei Chen
College of Computing and Informatics
Drexel University

How to cite: Chen, Chaomei (2014) The CiteSpace Manual. /~cchen/citespace/CiteSpaceManual.pdf

Contents
1 How can I find the latest version of the CiteSpace Manual?
2 What can I use CiteSpace for?
  2.1 What if I have Questions
  2.2 How should I cite CiteSpace?
  2.3 Where are the Users of CiteSpace?
3 Requirements to Run CiteSpace
  3.1 Java Runtime (JRE)
  3.2 How do I check whether Java is on my computer?
  3.3 Do I have a 32-bit or 64-bit Computer?
4 How to Install and Configure CiteSpace
  4.1 Where Can I download CiteSpace from the Web?
  4.2 What is the maximum number of records that I can handle with CiteSpace?
  4.3 How to configure the memory allocation for CiteSpace?
  4.4 How to uninstall CiteSpace
  4.5 On Mac or Unix-based Systems
5 Get Started with CiteSpace
  5.1 Try it with a demonstrative dataset
    5.1.1 The Demo Project
    5.1.2 Clustering
    5.1.3 Generate Cluster Labels
    5.1.4 Where are the major areas of research based on the input dataset?
    5.1.5 How are these major areas connected?
    5.1.6 Where are the most active areas?
    5.1.7 What is each major area about? Which/where are the key papers for a given area?
    5.1.8 Timeline View
  5.2 Try it with a dataset of your own
    5.2.1 Collecting Data
    5.2.2 Working with a CiteSpace Project
    5.2.3 Data Sources in Chinese
    5.2.4 How to handle search results containing irrelevant topics
6 Configure a CiteSpace Run
  6.1 Time Slicing
  6.3 Configure the Networks
    6.3.1 Bibliographic Coupling
  6.4 Node Selection Criteria
    6.4.1 Do I have the right network?
  6.5 Pruning, or Link Reduction
  6.6 Visualization
7 Interacting with CiteSpace
  7.1 How to Show or Hide Link Strengths
  7.2 Adding a Persistent Label to a Node
  7.3 Using Aliases to Merge Nodes
  7.4 How to Exclude a Node from the Network
  7.5 How to Use the Fisheye View Slider
  7.6 How to Configure When to Calculate Centrality Scores Automatically
  7.7 How to Save the Visualization as a PNG File
  7.8 Filters: Match Records with Pubmed
8 Additional Functions
  8.1 Menu: Data
    8.1.1 CiteSpace Built-in Database
    8.1.2 Utility Functions for the Web of Science Format
    8.1.3 Scopus
    8.1.4 PubMed
  8.2 Menu: Network
    8.2.1 Batch Export to Pajek .net Files
  8.3 Menu: Geographical
    8.3.1 Generate Google Earth Maps
  8.4 Menu: Overlay Maps
    8.4.1 Add an Overlay
    8.4.2 Further Reading and Terms of Use
  8.5 Menu: Text
    8.5.1 Concept Trees and Predicate Trees
    8.5.2 List Terms by Clumping Properties
    8.5.3 Latent Semantic Analysis
9 Selected Examples
  10.1 Information Theoretic
    10.1.1 Information Entropy
  10.2 Structural
    10.2.1 Betweenness Centrality
    10.2.2 Modularity
    10.2.3 Silhouette
  10.3 Temporal
    10.3.1 Burstness
  10.4 Combined
    10.4.1 Sigma
  10.5 Cluster Labeling
    10.5.1 Term Frequency by Inversed Document Frequency
    10.5.2 Log-Likelihood Ratio
    10.5.3 Mutual Information
11 References

1 How can I find the latest version of the CiteSpace Manual?

The latest version of the CiteSpace Manual is always at the following location:
/~cchen/citespace/CiteSpaceManual.pdf

You can also access the manual from CiteSpace: Help ► View the CiteSpace Manual (PDF). It will open up the PDF file in a new browser window.

Figure 1.
The latest version of the CiteSpace Manual is accessible from CiteSpace itself.

2 What can I use CiteSpace for?

CiteSpace is designed to answer questions about a knowledge domain, which is a broadly defined concept that covers a scientific field, a research area, or a scientific discipline. A knowledge domain is typically represented by a set of bibliographic records of relevant publications. It is your responsibility to prepare the most appropriate and representative dataset that contains adequate information to answer your questions.

CiteSpace is designed to make it easy for you to answer questions about the structure and dynamics of a knowledge domain. Here are some typical questions:
• What are the major areas of research based on the input dataset?
• How are these major areas connected, i.e. through which specific articles?
• Where are the most active areas?
• What is each major area about? Which/where are the key papers for a given area?
• Are there critical transitions in the history of the development of the field? Where are the 'turning points'?

The design of CiteSpace is inspired by Thomas Kuhn's structure of scientific revolutions. The central idea is that centers of research focus change over time, sometimes incrementally and other times drastically. The development of science can be traced by studying the footprints revealed by scholarly publications.

Members of the contemporary scientific community make their contributions. Their contributions form a dynamic and self-organizing system of knowledge. The system contains consensus, disputes, uncertainties, hypotheses, mysteries, unsolved problems, and unanswered questions. It is not enough to study a single school of thought. In fact, a better understanding of a specific topic often relies on an understanding of how it is related to other topics.

The foundation of CiteSpace is network analysis and visualization.
Through network modeling and visualization, you can explore the intellectual landscape of a knowledge domain and discern what questions researchers have been trying to answer and what methods and tools they have developed to reach their goals.

This is not a simple task. Rather, it is often conceptually demanding and complex. If you are about to write a novel, a word processor or a text editor can make the task easier, but it cannot help you create the plot or enrich the character of your hero. Similarly, and probably to a greater extent, CiteSpace can generate X-ray photos of a knowledge domain, but to interpret what these X-ray photos mean, you need to have some knowledge of the various elements involved. The role of CiteSpace is to shift some of the traditionally laborious burdens to computer algorithms and interactive visualizations so that you can concentrate on what human users are best at in problem solving and truth finding. However, it is probably easier to generate some mysterious-looking visualizations with CiteSpace than to fully understand what these visualizations tell you and who may benefit from such findings.

Figure 2. Hierarchically organized functions of CiteSpace, for example, GUI ► Pruning ► Pathfinder: true.

2.1 What if I have Questions

If you have a question regarding the use of CiteSpace, you should first check whether your question is answered in the manual. You can do a simple search through the PDF file to find out.

If the manual does not get you anywhere, you can ask your questions on the Facebook page of CiteSpace:
https:///pages/CiteSpace/276625072366558

You can also post questions to my blog on sciencenet:
/home.php?mod=space&uid=496649

Please refrain from sending me emails because you will have a much better chance of getting my response from either the Facebook page or the sciencenet blog.

Generally speaking, thoughtful questions get answered quickly.
Questions whose answers you could figure out for yourself with a little more thought have a lower priority in the answering queue; it is quite possible that some of them never get answered.

2.2 How should I cite CiteSpace?

The following three publications represent the core ideas of CiteSpace. The 2004 PNAS paper is the initial publication on CiteSpace (Chen 2004). In hindsight, it could have been named CiteSpace I. The 19-page 2006 JASIST paper gives the most thorough and in-depth description of CiteSpace II's key functions (C. M. Chen, 2006), plus a follow-up study of domain experts identified in the visualizations. The 2010 JASIST paper is even longer, with 24 pages (C. Chen, Ibekwe-SanJuan, & Hou, 2010); it is the third of the trilogy. It describes technical details on how cluster labels are selected and how each of the three selection algorithms performs in comparison with labels chosen by domain experts.

Citations (Google Scholar) | Reference
800 | Chen, C. (2006). "CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature." Journal of the American Society for Information Science and Technology 57(3): 359-377.
394 | Chen, C. (2004). "Searching for intellectual turning points: Progressive Knowledge Domain Visualization." Proc. Natl. Acad. Sci. USA 101(Suppl.): 5303-5310.
157 | Chen, C., et al. (2010). "The structure and dynamics of co-citation clusters: A multiple-perspective co-citation analysis." Journal of the American Society for Information Science and Technology 61(7): 1386-1409.

The most recent case study of a topic outside the realm of information science and scientometrics is a scientometric study of regenerative medicine (C. Chen, Hu, Liu, & Tseng, 2012).

Chen, C., et al. (2012).
"Emerging trends in regenerative medicine: A scientometric analysis in CiteSpace."Expert Opinions on Biological Therapy 12(5): 593-608.2.3Where are the Users of CiteSpace?In terms of the cities where CiteSpace were used, China, the United States, and Europe are prominent. Brazil, Turkey, and Spain also have many cities on the chart.Figure 3. Cities with users of CiteSpace between August 2013 and March 2014 are shown on the map. The colors of markers depict the level of user intensity: green (1-10), yellow (10-100), red (100-1000), and the large red water dropshaped marker (1000+).Figure 4. The use of CiteSpace in China (August 2013 – March 2014).Figure 5. The use of CiteSpace in the United States (August 2013 – March 2014).Figure 6. The use of CiteSpace in Europe (August 2013 – March 2014).3Requirements to Run CiteSpace3.1Java Runtime (JRE)CiteSpace is written in Java. It is a Java application. You should be able to run it on a computer that supports Java, including Windows or Mac.CiteSpace is currently optimized for Windows 64-bit Java 7 (i.e. Java 1.7).To run a Java application on your computer, you need to have Java Runtime (JRE) installed on your computer.3.2How do I check whether Java is on my computer?Figure 7. Select Control Panel.Figure 8. Click into the Programs category to find the Java control panel.Figure 9. Locate the Java control panel.Figure 10. Java Control Panel. Choose the Java tab and press the View button to see more detail.Figure 11. Java Runtime 1.7 is installed.3.3Do I have a 32-bit or 64-bit Computer?You need to find out whether your computer has a 32-bit or a 64-bit operating system.Go to Control Panel ►System and Security ►System. You will see various details about your computer. 
Under the System type, you will see whether you have a 32-bit or a 64-bit operating system.

Follow the link below for further instructions on how to install Java:
/en/download/help/index_installing.xml

Once you have the Java Runtime set up on your computer, you can proceed to install CiteSpace.

4 How to Install and Configure CiteSpace

CiteSpace is provided as a zip file for 64-bit and 32-bit computers. Mac users need to download the 64-bit version.

4.1 Where Can I download CiteSpace from the Web?

You can download the latest version of CiteSpace from the following website:
/~cchen/citespace/download.html

Figure 12. The download page of CiteSpace.

After you download the zip file to your computer, unpack it to a folder of your choice.

Figure 13. CiteSpace is unpacked to the D drive on a computer.

Now you can start CiteSpace by double-clicking on the StartCiteSpace file. If you need to modify the amount of memory allocated for CiteSpace (more precisely, for the Java Virtual Machine on which CiteSpace runs), you can edit StartCiteSpace as a plain text file with any text editor.

4.2 What is the maximum number of records that I can handle with CiteSpace?

This question needs to be answered at two levels: the number of records processed by CiteSpace, and the number of nodes visualized, i.e. the nodes you can see and interact with in CiteSpace. The first number is the total number of records in your downloaded dataset. CiteSpace reads through each record in your download files. The second number is determined by the selection criteria you specify and by the amount of memory, i.e. RAM, available on your computer. The more RAM you can make available for CiteSpace, the larger the network you can visualize, with a faster response rate. The speed of processing is also affected by a few computationally expensive algorithms such as Pathfinder network scaling and cluster labeling. Empirically, the best options for Pathfinder network scaling would be 50~500 nodes per slice.
With faster computers, or if you can wait a bit longer, you can raise the number accordingly. The completion time of cluster labeling is related to the size of your dataset. If the entire timespan of your dataset is 100 years but you only need to consider the most recent 10 years, it is a good idea to carve out a much smaller dataset, as long as it covers the 10 years of interest. It will reduce the processing time considerably.

4.3 How to configure the memory allocation for CiteSpace?

The performance of CiteSpace is influenced by the amount of memory accessible to the Java Virtual Machine (JVM) on which CiteSpace is running. To analyze a large number of records, you should consider allocating as much memory as possible for CiteSpace to use. You can modify the StartCiteSpace.cmd file to optimize the setting. More specifically, modify line 14 in the file. For example, -Xmx2g means that CiteSpace may get a maximum of 2 GB of RAM to work with. Save the file after making any changes, and restart CiteSpace.

Figure 14. Configure the memory for Java in line 14.

4.4 How to uninstall CiteSpace

You can use the following steps to remove cached copies of CiteSpace from your computer.

Figure 15. In a Command Prompt window, type javaws –viewer.

When you see a list of cached copies of CiteSpace in the Java Cache Viewer, select the items that you want to remove and then click on the button with a red cross.

Figure 16. Select a cached copy of CiteSpace and remove the item.

4.5 On Mac or Unix-based Systems

The following example shows you the basic steps to get started with CiteSpace on a Mac. First, go to the CiteSpace homepage in a browser such as Chrome and download the latest 64-bit version.

Figure 17. On a Mac, go to the CiteSpace home page in a browser such as Chrome and download the latest 64-bit version. Once the download is completed, follow the option "Show in Finder." It will take you to a list of files downloaded to your Mac.
The most recent file should be the zip file for CiteSpace.

Figure 18. Choose "Show in Finder."
Figure 19. The downloaded zip file is shown in your Finder.

Double-click on the zip file to unzip it to a folder in the current folder.

Figure 20. The zip file is unzipped to a new folder on the list.
Figure 21. The new folder contains CiteSpaceII.jar and a lib folder.

The simplest way to get started with CiteSpace is to open CiteSpaceII.jar by clicking on it while holding the "Control" key on the Mac. Select Open from the pop-up menu.

Figure 22. Click on the CiteSpaceII.jar while holding the "Control" key and select "Open."

Due to the Java security settings, you will see a dialog box with two options, Open or Cancel. Choose Open to proceed. It will not harm your computer.

Figure 23. Choose "Open" from the dialog box to proceed.

After you choose Open, CiteSpace starts on the Mac. You will see its opening page as follows. Choose "Agree" to continue.

Figure 24. CiteSpace is now started on Mac.
Figure 25. Screenshots of running the Demo project of CiteSpace on Mac.

It is a good idea to get familiar with the basic functions of CiteSpace by going through the Demo project on terrorism, which is included in the zip file.

If you want to configure various Java Virtual Machine parameters in more detail than what is shown in the above example, you may generate a bash file for your Mac as follows. The Mac equivalent of the StartCiteSpace.cmd would be a bash file, which should have a file extension of .sh and should be executable.
Let's name the file StartCiteSpace.sh to be consistent.

1. The content of the StartCiteSpace.sh file should have the following two lines:

#!/bin/bash
java -Xms1g -Xmx4g -Xss5m -jar CiteSpaceIII.jar

2. The following instruction turns the StartCiteSpace.sh file into an executable file:

chmod +x StartCiteSpace.sh

3. To invoke the executable file, simply type its name or double-click on it:

./StartCiteSpace.sh

5 Get Started with CiteSpace

5.1 Try it with a demonstrative dataset

When you install CiteSpace for the first time, a demonstrative dataset on terrorism research is set up for you to play with and get familiar with the major analytic functions in CiteSpace. If you have never used CiteSpace before, I strongly recommend starting with this demo dataset.

To launch CiteSpace, double-click on the StartCiteSpace.cmd file. You will see a command prompt window first. This window will also display various information on the status and any errors.

Figure 26. The command prompt window.

You will see another window, "About CiteSpace"; it displays system information about your computer, including the Java version. To proceed, you need to click on the Agree button. CiteSpace may collect user-driven events for research purposes.

Figure 27. The "About CiteSpace" window. To proceed, click on the Agree button.

Next, you will see the main user interface of CiteSpace. The user interface is divided into left and right halves. The left-hand side contains controls of projects (i.e. input datasets) and progress report windows. The right-hand side contains several panels for configuring the process with various parameters. In a nutshell, the process in CiteSpace takes an input dataset specified in the current project, constructs network models of bibliographic entities, and visualizes the networks for interactive exploration of trends and patterns identified from the dataset.

The demo project contains a dataset on publications about terrorism research.
These bibliographic records were retrieved from the Web of Science. See later sections for tips on how to construct your own dataset.

5.1.1 The Demo Project

We will start the process and explain how CiteSpace is designed to help you answer some of the key questions about a knowledge domain, i.e. a field of study, a research area, or a set of publications defined by the user. Press the green GO! button to start the process.

Figure 28. The main user interface of CiteSpace.

CiteSpace will read the data files in the current project (Demo) and report its progress in the two windows on the left-hand side of the user interface. When the modeling process is completed, you have three options to choose from: Visualize, Save As GraphML, or Cancel.

Visualize: This option will take you to the visualization window for further interactive exploration.
Save As GraphML: This option will save the constructed network in a file in a common graph format. No visualization.
Cancel: This option will not generate any interactive visualization nor save any files. It allows you to reconfigure and re-run the process.

Figure 29. CiteSpace is ready to visualize the constructed network.

If you click on the Visualize button, a new window will pop up. This is the Visualization Window. Initially you will see some movements on your screen with a black background. Once the movements are settled, the background color turns to white. Let's focus on what the initial visualization tells us and then explore what else we can find by using additional functions.

First, CiteSpace visualizes a merged network based on several networks corresponding to snapshots of consecutive years. In the Demo project example, the overall time span is from 1996 through 2003. The merged network characterizes the development of the field over time, showing the most important footprints of the related research activities. Each dot represents a node in the network. In the Demo case, the nodes are cited references.
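The Save As GraphML option mentioned above writes the network in GraphML, a simple XML-based graph format. As a sketch of what such a file contains, the Python standard library can produce a minimal GraphML document; the two node ids below are hypothetical and this is not CiteSpace's exact output:

```python
import xml.etree.ElementTree as ET

# Build a minimal GraphML document: two cited references joined by one
# co-citation link. The node ids ("paper1", "paper2") are made up for the sketch.
graphml = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
graph = ET.SubElement(graphml, "graph", id="G", edgedefault="undirected")
ET.SubElement(graph, "node", id="paper1")
ET.SubElement(graph, "node", id="paper2")
ET.SubElement(graph, "edge", source="paper1", target="paper2")

xml_text = ET.tostring(graphml, encoding="unicode")
```

Because GraphML is plain XML like this, a network saved from CiteSpace can be opened in other graph tools that read the format.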
CiteSpace can generate networks of other types of entities. Here, let's focus on cited references for now. Lines that connect nodes are co-citation links; again, CiteSpace can generate networks of other types of links. The colors of these lines are designed to show when a connection was made for the first time. Note that this is influenced by the scope and the depth of the given dataset. The color encoding makes it easy to tell which part of the network is old and which is new. If some references are shown with labels, then these references are highly cited, suggesting that they are probably landmark papers in the field. A list on the left side of the window shows all the nodes that appear in the visualization. The list can be sorted by citation frequency, by betweenness centrality, by year, or by references as text. You can also choose to show or hide a node on the list.

Figure 30. The Visualization window.

A control panel is shown on the right-hand side of the Visualization Window. You can change how node labels are displayed through a combination of a few threshold values set with sliders. You can also change the size of a node with the node size slider. To answer the typical questions we asked before, let's use several functions in CiteSpace to obtain more specific information through clustering, labeling, and exploring.

5.1.2 Clustering

Although we can probably eyeball the visualized network and identify some prominent groupings, CiteSpace provides more precise ways to identify groupings, or clusters, using the clustering function. To start the clustering function, simply click on the clustering icon.

How do I know whether the clustering process is completed? You will see #clusters in the upper right corner of the canvas. In the Demo example, a total of 37 clusters of co-cited references are identified.
Each cluster corresponds to an underlying theme, a topic, or a line of research. The signature of the network is shown in the upper left corner of the display. In particular, the modularity Q and the mean silhouette score are two important metrics that describe the overall structural properties of the network. For example, the modularity Q of 0.7141 is relatively high, which means that the network is reasonably divided into loosely coupled clusters. The mean silhouette score of 0.5904 suggests that the average homogeneity of these clusters is neither very high nor very low.

Figure 32. The clustering process is completed. 37 clusters are identified (#clusters shown in the upper right corner). Modularity and silhouette scores are shown in the signature of the network on the left.

Figure 33. Members of different clusters are shown in different colors.

You can inspect various measures of each cluster in a summary table of all the clusters using: Clusters ► 4. Summarization of Clusters. The Silhouette column shows the homogeneity of a cluster: the higher the silhouette score, the more consistent the cluster members are, provided the clusters in comparison have similar sizes. If the cluster size is small, a high homogeneity does not mean much. For example, cluster #9 has 7 members and a silhouette of 1.00, most likely because all 7 references are cited by the same underlying paper. In other words, cluster #9 may reflect the citing behavior or preferences of a single paper, and is thus less representative. The average year of publication of a cluster indicates whether it is formed by generally recent or older papers. This is a simple and useful indicator.

Figure 34.
A summary table of clusters.

5.1.3 Generate Cluster Labels

To characterize the nature of an identified cluster, CiteSpace can extract noun phrases from the titles (T in the following icon), keyword lists (K), or abstracts (A) of the articles that cited the particular cluster. Let's ask CiteSpace to choose noun phrases from titles (i.e. select the T icon). This process may take a while, as CiteSpace needs to compute several selection metrics. Once the process is finished, the chosen labels are displayed. By default, labels based on one of the three selection algorithms are shown, namely tf*idf. Our study has found that LLR usually gives the best result in terms of uniqueness and coverage.

Figure 35. Icons for performing the Clustering and Labeling functions.

Cluster labels are displayed once the process is completed. The clusters are numbered in descending order of cluster size, starting from the largest cluster #0, the second largest #1, and so on.

Figure 36. Cluster labels are generated and displayed.

To make it easier to see which clusters are the largest, you can change the font size of the labels from uniform to proportional:

Display ► Label Font Size ► Cluster: Uniformed/Proportional

This is a toggle function: your selection switches back and forth between the two states, i.e. either a uniform font size or one proportional to cluster size.

Figure 37. Set the cluster labels' font size proportional to their size.

Figure 38. Cluster labels' font sizes are proportional to the size of a cluster. The largest cluster is #0 on biological terrorism.

5.1.4 Where are the major areas of research based on the input dataset?

This is one of the primary questions that CiteSpace is designed to answer. To answer it, we will focus on the big picture of the collection of publications represented by your dataset.
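Before moving on, a note on the LLR option mentioned in Section 5.1.3: it refers to Dunning's log-likelihood ratio test, where a candidate label scores highly when its frequency among a cluster's citing articles deviates strongly from its frequency everywhere else. A minimal sketch over simple term counts (the noun-phrase extraction CiteSpace performs is omitted, and `label_cluster` is a hypothetical helper name):

```python
from math import log
from collections import Counter

def _xlogx(x):
    return x * log(x) if x > 0 else 0.0

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table:
    [term in cluster, term elsewhere; other terms in cluster, elsewhere]."""
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    n = row1 + row2
    return 2.0 * (_xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
                  - _xlogx(row1) - _xlogx(row2)
                  - _xlogx(col1) - _xlogx(col2) + _xlogx(n))

def label_cluster(cluster_terms, other_terms, top=3):
    """Rank a cluster's candidate label terms against the rest of the corpus."""
    c, o = Counter(cluster_terms), Counter(other_terms)
    nc, no = sum(c.values()), sum(o.values())
    score = {t: llr(c[t], o[t], nc - c[t], no - o[t]) for t in c}
    return sorted(score, key=score.get, reverse=True)[:top]
```

A term concentrated in one cluster (say, "ptsd" appearing almost exclusively in the citers of cluster #1) outranks a term spread across the whole dataset, which is why LLR labels tend to be more distinctive than raw-frequency ones.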
Let's make a few adjustments with the sliders in the control panel on the right so that the information of interest is shown clearly and information that is less relevant to this question is temporarily hidden from view.

1. Node Size. At this level, we don't really need to see the size of a node, although it provides rich information about the history of a node. Use the slider under Article Labeling ► Node Size ► [Slide to 0] (marked by pointer #1 in the following figure).

2. Cluster Label Size. The font size of the cluster labels is controlled by a slider with two controls: one controls the threshold for showing or hiding a label based on the size of the cluster (i.e. to make sure large-enough clusters are always labeled), and the other controls the font size of the cluster labels (marked by pointer #2 in the screenshot).

3. Transparency of Links. Detailed links will be useful later, but we can ignore them for now by using the transparency slider to set all links' transparency to the lowest level, i.e. invisible. In hindsight, a more accurate term would be completely transparent.

After making these minor adjustments, it is straightforward to answer the question: where are the major areas of research? Evidently, the largest area (cluster #0, with the largest number of member references) is biological terrorism. The second largest is posttraumatic stress (cluster #1), i.e. PTSD. The third is ocular injury (cluster #2). The fourth is blast (cluster #3). And there are a few smaller clusters. So now we have a general idea of what constituted terrorism research during the period from 1996 to 2003. You can repeat the process on a current dataset to get an up-to-date big picture.
Transfer Sparse Coding for Robust Image Representation*

Mingsheng Long†‡, Guiguang Ding†, Jianmin Wang†, Jiaguang Sun†, Yuchen Guo†, and Philip S. Yu§
† TNLIST; MOE Lab of Information System Security; School of Software
‡ Department of Computer Science and Technology, Tsinghua University, Beijing, China
§ Department of Computer Science, University of Illinois at Chicago, IL, USA
{longmingsheng,guoyc09}@ {dinggg,jimwang,sunjg}@ psyu@

Abstract

Sparse coding learns a set of basis functions such that each input signal can be well approximated by a linear combination of just a few of the bases. It has attracted increasing interest due to its state-of-the-art performance in BoW based image representation. However, when labeled and unlabeled images are sampled from different distributions, they may be quantized into different visual words of the codebook and encoded with different representations, which may severely degrade classification performance. In this paper, we propose a Transfer Sparse Coding (TSC) approach to construct robust sparse representations for classifying cross-distribution images accurately. Specifically, we aim to minimize the distribution divergence between the labeled and unlabeled images, and incorporate this criterion into the objective function of sparse coding to make the new representations robust to the distribution difference. Experiments show that TSC can significantly outperform state-of-the-art methods on three types of computer vision datasets.

1. Introduction

In computer vision, image representation is a crucial procedure for image processing and understanding. As a powerful tool for finding succinct representations of stimuli and capturing high-level semantics in visual data, sparse coding can represent images using only a few active coefficients.
This makes the sparse representations easy to interpret and manipulate, and facilitates efficient content-based image indexing and retrieval. Sparse coding is receiving increasing interest in machine learning, pattern recognition, and signal processing [9, 11, 8], and has been successfully applied to image classification [22, 12, 24] and face recognition [21, 6].

* Corresponding author: Jianmin Wang. This work is supported in part by National HGJ Key Project (2010ZX01042-002-002), National High-Tech Development Program (2012AA040911), National Basic Research Program (2009CB320700), and National Natural Science Foundation of China (61073005 and 61271394); Philip S. Yu is partially supported by US NSF through grants IIS-0905215, CNS-1115234, IIS-0914934, DBI-0960443, and OISE-1129076, US Department of Army through grant W911NF-12-1-0066, Google Mobile 2014 Program, and a KAU grant.

One major computational problem of sparse coding is to improve the quality of the sparse representation while maximally preserving the signal fidelity. To achieve this goal, many works have been proposed to modify the sparsity constraint. Liu et al. [10] added a nonnegative constraint to the sparse coefficients. Gao et al. [6] introduced a Laplacian term of coefficients in sparse coding, which was extended to an efficient algorithm in Cai et al. [24]. Wang et al. [20] adopted the weighted ℓ2-norm for the sparsity constraint.
Another line of work focuses on improving the signal fidelity, e.g., the robust sparse coding proposed by Yang et al. [23]. However, when labeled and unlabeled images are sampled from different distributions, they may be quantized into different visual words of the codebook and encoded with different representations. In this case, the dictionary learned from the labeled images cannot effectively encode the unlabeled images with high fidelity, and the unlabeled images may also reside far away from the labeled images under the new representation. This distribution difference greatly challenges the robustness of existing sparse coding algorithms for cross-distribution image classification problems.

Recently, the literature has witnessed an increasing focus on transfer learning [15] problems, where the labeled training data and unlabeled test data are sampled from different probability distributions. This is a very common scenario in real applications, since training and test data are usually collected in different time periods, or under different conditions. In this case, standard classifiers such as SVM and logistic regression trained on the labeled data may fail to make correct predictions on the unlabeled data [13, 14, 16, 17]. To improve the generalization performance of supervised classifiers across different distributions, Pan et al. [13, 14] proposed to extract a "good" feature representation through which the probability distributions of labeled and unlabeled data are drawn close. It achieves much better classification performance by explicitly reducing distribution divergence.

2013 IEEE Conference on Computer Vision and Pattern Recognition

Inspired by recent progress in sparse coding and transfer learning, we propose a novel Transfer Sparse Coding (TSC) algorithm to construct robust sparse representations for classifying cross-distribution images accurately. We aim to minimize the distribution divergence between labeled and unlabeled images using a nonparametric distance measure.
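The nonparametric measure referred to here is the empirical Maximum Mean Discrepancy (MMD), which the paper later rewrites as tr(SMS^T) for a block-constant matrix M (Equations (3)-(4)). A small numpy sketch of that identity, with arbitrary illustrative sizes:

```python
import numpy as np

def mmd_matrix(n_l, n_u):
    """MMD matrix M of Eq. (4): block-constant over the labeled
    (first n_l) and unlabeled (last n_u) index sets."""
    n = n_l + n_u
    M = np.full((n, n), -1.0 / (n_l * n_u))
    M[:n_l, :n_l] = 1.0 / n_l**2
    M[n_l:, n_l:] = 1.0 / n_u**2
    return M

# tr(S M S^T) equals the squared distance between the two sample means
rng = np.random.default_rng(0)
n_l, n_u, k = 6, 4, 3
S = rng.standard_normal((k, n_l + n_u))    # codes, one column per image
gap = S[:, :n_l].mean(axis=1) - S[:, n_l:].mean(axis=1)
assert np.isclose(np.trace(S @ mmd_matrix(n_l, n_u) @ S.T), gap @ gap)
```

Because the quadratic form collapses to a squared mean difference, driving tr(SMS^T) toward zero pulls the average labeled code and the average unlabeled code together, which is exactly the regularization effect TSC exploits.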
Specifically, we incorporate this criterion into the objective function of sparse coding to make the new representations of the labeled and unlabeled images close to each other. In this way, the induced representations are made robust for cross-distribution image classification problems. Moreover, to enrich the new representations with more discriminating power, we also incorporate the graph Laplacian term of coefficients [24] in our objective function. Extensive experimental results verify the effectiveness of the TSC approach.

2. Related Work

In this section, we discuss prior works that are most related to ours, including sparse coding and transfer learning. Recently, sparse coding has been a hot research focus in computer vision. To solve the ℓ1-regularized least squares problem more efficiently, Lee et al. [9] proposed a feature-sign search method to reduce the nondifferentiable problem to an unconstrained quadratic program (QP), which accelerates the optimization process. Our work adapts Lee's method to solve the proposed TSC optimization problem. For adapting the dictionary to achieve sparse representation, Aharon et al. [1] proposed the K-SVD method to learn the dictionary using orthogonal matching pursuit or basis pursuit. Our work aims to discover a shared dictionary which can encode both labeled and unlabeled data sampled from different probability distributions. To improve the quality of sparse representations, researchers have modified the sparsity constraint by adding a nonnegative constraint [10], graph regularization [6, 24], a weighted ℓ2-norm constraint [20], etc.
Our approach aims to construct robust sparse representations for cross-distribution image classification problems, which is a different learning goal from the previous works.

In the machine learning literature, transfer learning [15], which aims to transfer knowledge between labeled and unlabeled data sampled from different distributions, has also attracted extensive research interest. To achieve this goal, Pan et al. proposed a Transfer Component Analysis (TCA) method to reduce the Maximum Mean Discrepancy (MMD) [7] between the labeled and unlabeled data, and simultaneously minimize the reconstruction error of the input data using PCA. Different from their method, our work focuses on learning robust image representations by building an adaptive model based on sparse coding. Lastly, Quanz et al. [16, 17] have explored sparse coding to extract features for knowledge transfer. However, their method adopts a kernel density estimation (KDE) technique to estimate the PDFs of the distributions and then minimizes the Jensen-Shannon divergence between them. This is a more restricted procedure than TSC and is prone to overfitting. Moreover, our work additionally incorporates the graph Laplacian term of coefficients [24] in the objective function, which can discover more discriminating representations for classification tasks.

3. Preliminaries

3.1. Sparse Coding

Given a data matrix X = [x_1, ..., x_n] ∈ ℝ^{m×n}, with n data points sampled in the m-dimensional feature space, let B = [b_1, ..., b_k] ∈ ℝ^{m×k} be the dictionary matrix, where each column b_i represents a basis vector in the dictionary, and let S = [s_1, ..., s_n] ∈ ℝ^{k×n} be the coding matrix, where each column s_i is a sparse representation of a data point x_i. The goal of sparse coding is to learn a dictionary (over-complete if k > m) and corresponding sparse codes such that the input data can be well approximated [16].
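The alternating optimization this formulation leads to — an ℓ1-regularized code update with B fixed, then an ℓ2-constrained dictionary update with S fixed — can be sketched in numpy. The sketch below substitutes a plain ISTA (soft-thresholding) loop for the feature-sign search and a rescaled least-squares step for the constrained dictionary update, so it illustrates the structure of the problem rather than the paper's actual solvers:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_coding(X, k, lam=0.2, c=1.0, iters=20, inner=50, seed=0):
    """Alternating sketch of: min ||X - BS||_F^2 + lam * sum_i |s_i|
    subject to ||b_i||^2 <= c."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    B = rng.standard_normal((m, k))
    B /= np.linalg.norm(B, axis=0)          # feasible start: unit columns
    S = np.zeros((k, n))

    def ista(S, B):                         # l1 step with B fixed
        L = 2.0 * np.linalg.norm(B.T @ B, 2) + 1e-12   # gradient Lipschitz const.
        for _ in range(inner):
            S = soft(S - (2.0 / L) * (B.T @ (B @ S - X)), lam / L)
        return S

    for _ in range(iters):
        S = ista(S, B)
        # l2 step with S fixed: ridge-regularized least squares,
        # then rescale any column that violates ||b_i||^2 <= c
        B = X @ S.T @ np.linalg.inv(S @ S.T + 1e-8 * np.eye(k))
        B /= np.maximum(np.linalg.norm(B, axis=0) / np.sqrt(c), 1.0)
    return B, ista(S, B)                    # refresh codes for the final B
```

With the transfer terms of Section 4 set to zero this is ordinary sparse coding; TSC's code step differs only in adding (μM_ii + γL_ii) to the quadratic part of each column's subproblem.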
Assuming the reconstruction error for a data point follows a zero-mean Gaussian distribution with isotropic covariance, while taking a Laplace prior for the coding coefficients and a uniform prior for the basis vectors, the maximum a posteriori (MAP) estimate of B and S given X reduces to

\min_{B,S} \|X - BS\|_F^2 + \lambda \sum_{i=1}^{n} |s_i| \quad \text{s.t.} \ \|b_i\|^2 \le c, \ \forall i = 1, \dots, k \quad (1)

where λ is a tunable regularization parameter that trades off the sparsity of the coding against the approximation of the input data. The constraints on the basis vectors control the model complexity. Although the objective function in Equation (1) is not convex in both variables jointly, it is convex in either B or S. Therefore, it can be solved by alternately optimizing one variable while fixing the other. It reduces to an ℓ1-regularized least squares problem and an ℓ2-constrained least squares problem, both of which can be solved efficiently by existing optimization software [9, 11].

3.2. Graph Regularized Sparse Coding

To make the basis vectors respect the intrinsic geometric structure underlying the input data, Cai et al. [24] proposed the Graph Regularized Sparse Coding (GraphSC) method, which further explores the manifold assumption [2]. GraphSC assumes that if two data points x_i and x_j are close in the intrinsic geometry of the data distribution, then their codings s_i and s_j are also close. Given a set of m-dimensional data points {x_1, ..., x_n}, GraphSC constructs a p-nearest neighbor graph G with n vertices, each representing a data point. Let W be the weight matrix of G: if x_i is among the p nearest neighbors of x_j or vice versa, W_ij = 1; otherwise, W_ij = 0. Define d_i = \sum_{j=1}^{n} W_{ij}, D = diag(d_1, \dots, d_n), and the graph Laplacian L = D - W. A reasonable criterion for preserving the geometric structure in graph G is to minimize

\frac{1}{2} \sum_{i,j=1}^{n} \|s_i - s_j\|^2 W_{ij} = \mathrm{tr}(S L S^T).

Integrating this criterion into Equation (1) leads to the GraphSC objective [6, 24]:

\min_{B,S} \|X - BS\|_F^2 + \gamma \, \mathrm{tr}(S L S^T) + \lambda \sum_{i=1}^{n} |s_i| \quad \text{s.t.} \ \|b_i\|^2 \le c, \ i = 1, \dots, k \quad (2)

where γ is a graph regularization parameter to trade off the
weight between sparse coding and geometric preservation.

4. Transfer Sparse Coding

In this section, we present the Transfer Sparse Coding (TSC) algorithm for robust image representation, which extends GraphSC by taking into account the minimization of the distribution divergence between labeled and unlabeled data.

4.1. Problem Definition

Given labeled data D_l = {(x_1, y_1), ..., (x_{n_l}, y_{n_l})} with n_l examples and unlabeled data D_u = {x_{n_l+1}, ..., x_{n_l+n_u}} with n_u examples, denote by X = [x_1, ..., x_n] ∈ ℝ^{m×n}, n = n_l + n_u, the input data matrix. Assume that the labeled and unlabeled data are sampled from different probability distributions in an m-dimensional feature space. Frequently used notations and descriptions are summarized in Table 1.

Problem 1 (Transfer Sparse Coding). Given labeled data D_l and unlabeled data D_u under different distributions, our goal is to learn a dictionary B and a sparse coding S which performs robustly across the labeled and unlabeled data.

With Transfer Sparse Coding (TSC), we aim to construct a robust representation for images sampled from different distributions. In this way, a supervised classifier trained on the labeled data can generalize better on the unlabeled data.

4.2. Objective Function

To make sparse coding robust to different probability distributions, one may expect the basis vectors to capture the commonality underlying both the labeled and unlabeled data, rather than only the individual properties of the labeled data. However, even in the extracted k-dimensional sparse representation, the distribution difference between labeled and unlabeled data can still be significantly large. Thus one major computational problem is to reduce the distribution difference by explicitly minimizing some predefined distance measure. To realize this idea, a natural strategy is to make the probability distributions of labeled and unlabeled data close to each other in the sparse representation. That is, by representing all data points X with the learned coding matrix S
, the probability distributions of the sparse codes for the labeled and unlabeled data should be close enough. In this paper, we follow [7, 13, 14] and adopt the empirical Maximum Mean Discrepancy (MMD) as the nonparametric distance measure to compare different distributions, which computes the distance between the sample means of the labeled and unlabeled data in the k-dimensional coefficient space:

\left\| \frac{1}{n_l} \sum_{i=1}^{n_l} s_i - \frac{1}{n_u} \sum_{j=n_l+1}^{n_l+n_u} s_j \right\|^2 = \sum_{i,j=1}^{n} s_i^T s_j M_{ij} = \mathrm{tr}(S M S^T) \quad (3)

where M is the MMD matrix, computed as follows:

M_{ij} = \begin{cases} 1/n_l^2, & x_i, x_j \in D_l \\ 1/n_u^2, & x_i, x_j \in D_u \\ -1/(n_l n_u), & \text{otherwise} \end{cases} \quad (4)

Table 1. Notations and descriptions used in this paper.

Notation | Description
D_l, D_u | labeled/unlabeled data
n_l, n_u | #labeled/unlabeled examples
m | #shared features
k, p | #basis vectors / nearest neighbors
μ, γ, λ | MMD/graph/sparsity reg. param.
X | input data matrix
B | dictionary matrix
S | coding matrix
M | MMD matrix
L | graph Laplacian matrix

By regularizing Equation (2) with Equation (3), the dictionary matrix B is refined and the probability distributions of labeled and unlabeled data are drawn close under the new representation S. We obtain the objective function for TSC:

\min_{B,S} \|X - BS\|_F^2 + \mathrm{tr}(S(\mu M + \gamma L)S^T) + \lambda \sum_{i=1}^{n} |s_i| \quad \text{s.t.} \ \|b_i\|^2 \le c, \ \forall i = 1, \dots, k \quad (5)

where μ > 0 is the MMD regularization parameter trading off the weight between GraphSC and distribution matching. To compare the effectiveness of MMD regularization and graph regularization (GraphSC), we refer to the special case of TSC with γ = 0 as TSC_MMD and test it empirically.

The MMD regularization in Equation (5) is important for making TSC robust to different probability distributions. According to Gretton et al. [7], the MMD will asymptotically approach zero if and only if the two distributions are the same. By minimizing the MMD, TSC can match the distributions of labeled and unlabeled data based on sparse coding. Following [9, 11, 24], we divide the optimization of TSC into two iterative steps: 1) learning transfer sparse codes S with
dictionary B fixed, i.e., an ℓ1-regularized least squares problem; and 2) learning dictionary B with transfer sparse codes S fixed, i.e., an ℓ2-constrained least squares problem.

4.3. Learning Transfer Sparse Codes

We solve optimization problem (5) for the transfer sparse codes S. By fixing dictionary B, problem (5) becomes

\min_{S} \|X - BS\|_F^2 + \mathrm{tr}(S(\mu M + \gamma L)S^T) + \lambda \sum_{i=1}^{n} |s_i| \quad (7)

Unfortunately, problem (7) is nondifferentiable when s_i takes values of 0, which makes standard unconstrained optimization techniques infeasible. Several recent approaches have been proposed to solve the ℓ1-regularized least squares problem [1, 9, 11, 24], where a coordinate descent strategy is often adopted to update each vector s_i individually with the other vectors {s_j}_{j≠i} fixed. To facilitate vector-wise manipulation, we rewrite problem (7) as

\min_{\{s_i\}} \sum_{i=1}^{n} \|x_i - B s_i\|^2 + \sum_{i,j=1}^{n} (\mu M_{ij} + \gamma L_{ij}) s_i^T s_j + \lambda \sum_{i=1}^{n} |s_i| \quad (8)

The optimization problem involving only s_i reduces to

\min_{s_i} f(s_i) = \|x_i - B s_i\|^2 + \lambda \sum_{j=1}^{k} |s_i^{(j)}| + (\mu M_{ii} + \gamma L_{ii}) s_i^T s_i + s_i^T h_i \quad (9)

where h_i = 2 \sum_{j \ne i} (\mu M_{ij} + \gamma L_{ij}) s_j and s_i^{(j)} is the j-th element of s_i. We adapt the feature-sign search algorithm [9] to solve optimization problem (9). In nonsmooth optimization methods for nondifferentiable problems, a necessary condition for a parameter vector to be a local minimum is that the zero vector is an element of the subdifferential, the set containing all subgradients at the parameter vector [5]. Define g(s_i) = \|x_i - B s_i\|^2 + (\mu M_{ii} + \gamma L_{ii}) s_i^T s_i + s_i^T h_i, so that f(s_i) = g(s_i) + \lambda \sum_{j=1}^{k} |s_i^{(j)}|. Let ∇_i^{(j)}|s_i| be the subdifferential value of the j-th coefficient of s_i: if |s_i^{(j)}| > 0, then ∇_i^{(j)}|s_i| = sign(s_i^{(j)}); if s_i^{(j)} = 0, then ∇_i^{(j)}|s_i| is nondifferentiable and can take values in [−1, 1]. The optimality conditions for attaining the minimum of f(s_i) are

\begin{cases} \nabla_i^{(j)} g(s_i) + \lambda \, \mathrm{sign}(s_i^{(j)}) = 0, & \text{if } |s_i^{(j)}| \ne 0 \\ |\nabla_i^{(j)} g(s_i)| \le \lambda, & \text{otherwise} \end{cases} \quad (10)

We consider how to select optimal subgradients ∇_i^{(j)} f(s_i) when the optimality conditions (10) are violated, that is, |∇_i^{(j)} g(s_i)| > λ for s_i^{(j)} = 0. Suppose that ∇_i^{(j)} g(s_i) > λ, which implies ∇_i^{(j)} f(s_i) > 0 regardless of sign(s_i^{(j)}). In this case, to decrease f(s_i), we need to decrease s_i^{(j)}. Since s_i^{(j)} starts at zero, any infinitesimal adjustment takes it negative; thus we directly let sign(s_i^{(j)}) = −1. Similarly, if ∇_i^{(j)} g(s_i) < −λ, we directly let sign(s_i^{(j)}) = 1. Notice that if we knew the signs of the s_i^{(j)}'s at the optimum, we could replace each term |s_i^{(j)}| with either s_i^{(j)} (if s_i^{(j)} > 0), −s_i^{(j)} (if s_i^{(j)} < 0), or 0 (if s_i^{(j)} = 0). Thus, by considering only nonzero coefficients, problem (9) reduces to an unconstrained quadratic optimization problem (QP), which can be solved analytically and efficiently. The sketch of learning the transfer sparse codes {s_i : i ∈ [1, n]} is:

∙ for each s_i, search for the signs of {s_i^{(j)} : j ∈ [1, k]};
∙ solve the equivalent QP problem to get the optimal s*_i that minimizes the vector-wise objective function (9);
∙ return the optimal coding matrix S* = [s*_1, ..., s*_n].

The algorithm maintains an active set A ≜ {j | s_i^{(j)} = 0, |∇_i^{(j)} g(s_i)| > λ} of potentially nonzero coefficients and their corresponding signs θ = [θ_1, ..., θ_k] while updating each s_i, and systematically searches for the optimal active set and coefficient signs that minimize objective function (9). In each activate step, the algorithm selects the zero-valued coefficient whose violation of the optimality condition |∇_i^{(j)} g(s_i)| > λ is the largest. In each feature-sign step: 1) given the current active set and signs, it computes the analytical solution ŝ_i^{new} to the resulting unconstrained QP; 2) it updates the solution, the active set, and the signs using an efficient discrete line search between the current solution and ŝ_i^{new}. The complete learning procedure is summarized in Algorithm 1.

Algorithm 1: Learning Transfer Sparse Codes
Input: Data matrix X, dictionary B, MMD matrix M, graph Laplacian matrix L, MMD/graph/sparsity regularization parameters μ, γ, λ.
Output: Current optimal coding matrix S* = [s*_1, ..., s*_n].
1  begin Learning Transfer Sparse Codes
2    foreach s_i, i ∈ [1, n] do
3      step Initialize:
4        s_i := 0, θ := 0, and active set A := ∅, where θ_j ∈ {−1, 0, 1} denotes sign(s_i^{(j)}).
5      step Activate:
6        From the zero coefficients of s_i, select j := arg max_j |∇_i^{(j)} g(s_i)|. Activate s_i^{(j)} (add j to A) only if it locally improves (9), namely:
7        if ∇_i^{(j)} g(s_i) > λ, then set θ_j := −1, A := {j} ∪ A; else if ∇_i^{(j)} g(s_i) < −λ, then set θ_j := 1, A := {j} ∪ A.
8      step Feature-sign Search:
9        Let B̂ be the submatrix of B that contains only the columns in A; let ŝ_i, ĥ_i, and θ̂ be the subvectors of s_i, h_i, and θ on A, respectively.
10       Compute the solution to the resulting unconstrained QP:
           \min \tilde{f}(\hat{s}_i) = \|x_i - \hat{B}\hat{s}_i\|^2 + (\mu M_{ii} + \gamma L_{ii}) \hat{s}_i^T \hat{s}_i + \hat{s}_i^T \hat{h}_i + \lambda \hat{\theta}^T \hat{s}_i
11       Setting ∂f̃(ŝ_i)/∂ŝ_i := 0 gives the optimal value of ŝ_i under the current A:
           \hat{s}_i^{new} := (\hat{B}^T \hat{B} + (\mu M_{ii} + \gamma L_{ii}) I)^{-1} (\hat{B}^T x_i - (\lambda \hat{\theta} + \hat{h}_i)/2) \quad (6)
12       Perform a discrete line search on the closed line segment from ŝ_i to ŝ_i^{new}:
13         check the objective value at ŝ_i^{new} and at all points where any coefficient changes sign;
14         update ŝ_i (and the corresponding entries in s_i) to the point with the lowest objective value.
15       Remove zero coefficients of ŝ_i from A and update θ := sign(s_i).
16     step Check Optimality Conditions:
17       (a) Optimality condition for nonzero coefficients: ∇_i^{(j)} g(s_i) + λ sign(s_i^{(j)}) = 0, ∀ s_i^{(j)} ≠ 0.
18         If condition (a) is not satisfied, go to step Feature-sign Search (without any new activation); else check condition (b).
19       (b) Optimality condition for zero coefficients: |∇_i^{(j)} g(s_i)| ≤ λ, ∀ s_i^{(j)} = 0.
20         If condition (b) is not satisfied, go to step Activate; otherwise return s_i as the optimal solution, re-denoted s*_i.

4.4. Learning Dictionary

Learning the dictionary B with the coding S fixed reduces to the following ℓ2-constrained optimization problem:

\min_{B} \|X - BS\|_F^2 \quad \text{s.t.} \ \|b_i\|^2 \le c, \ \forall i = 1, \dots, k \quad (11)

This problem has been well studied by prior works
[9, 11, 24]. For space limitations, we omit the technical details here.

5. Experiments

In this section, we conduct extensive experiments on image classification problems to evaluate the TSC approach.

5.1. Data Preparation

USPS, MNIST, PIE, MSRC, and VOC2007 (see Figure 1 and Table 2) are five benchmark datasets widely adopted to evaluate computer vision and pattern recognition algorithms. The USPS(1) dataset consists of 7,291 training images and 2,007 test images of size 16×16. The MNIST(2) dataset has a training set of 60,000 examples and a test set of 10,000 examples of size 28×28. From Figure 1, we see that USPS and MNIST follow very different distributions. They share 10 semantic classes, each corresponding to one digit. To speed up the experiments, we construct one dataset, USPS vs MNIST, by randomly sampling 1,800 images from USPS to form the training data and randomly sampling 2,000 images from MNIST to form the test data. We uniformly rescale all images to size 16×16 and represent each image by a 256-dimensional vector encoding the gray-scale values of all pixels. In this way, the training and test data share the same label set and feature space.

PIE(3), which stands for "Pose, Illumination, Expression", is a benchmark face database. The database has 68 individuals with 41,368 face images of size 32×32.

(1) rmatik.rwth-aachen.de/~keysers/usps.html
(2) /exdb/mnist
(3) /idb/html/face

Figure 1. Examples of PIE, USPS, MNIST, MSRC, and VOC2007.

Table 2. Statistics of the six benchmark image datasets.

Dataset | Type | #Examples | #Features | #Classes
USPS | Digit | 1,800 | 256 | 10
MNIST | Digit | 2,000 | 256 | 10
PIE1 | Face | 2,856 | 1,024 | 68
PIE2 | Face | 3,329 | 1,024 | 68
MSRC | Photo | 1,269 | 240 | 6
VOC2007 | Photo | 1,530 | 240 | 6

The face images were captured by 13 synchronized cameras and 21 flashes, under varying poses, illuminations, and expressions. In the experiments, we adopt two preprocessed versions of PIE(4), i.e., PIE1 [4] and PIE2 [3], which are generated by randomly sampling the face images from the near-frontal poses (C27) under different lighting and illumination conditions. We construct one dataset, PIE1 vs PIE2, by
selecting all 2,856 images in PIE1 to form the training data and all 3,329 images in PIE2 to form the test data. Due to the variations in lighting and illumination, the training and test data can follow different distributions in the same feature space.

The MSRC(5) dataset is provided by Microsoft Research Cambridge and contains 4,323 images labeled with 18 classes. The VOC2007(6) dataset (the training/validation subset) contains 5,011 images annotated with 20 concepts. From Figure 1, we see that MSRC and VOC2007 follow very different distributions, since MSRC consists of standard images for evaluations, while VOC2007 consists of digital photos from Flickr(7). They share the following 6 semantic classes: "aeroplane", "bicycle", "bird", "car", "cow", "sheep". We construct one dataset, MSRC vs VOC, by selecting all 1,269 images in MSRC to form the training data and all 1,530 images in VOC2007 to form the test data. Then, following [19], we uniformly rescale all images to be 256 pixels in length and extract 128-dimensional dense SIFT (DSIFT) features using the VLFeat open package [18]. A 240-dimensional codebook is created, where K-means clustering is used to obtain the codewords. In this way, the training and test data are constructed to share the same label set and feature space.

(4) /home/dengcai/Data/FaceData.html
(5) /en-us/projects/objectclassrecognition
(6) /challenges/VOC/voc2007

5.2. Experimental Setup

5.2.1 Baseline Methods

We compare the TSC approach with five state-of-the-art baseline methods for image classification, as shown below:

∙ Logistic Regression (LR)
∙ Principal Component Analysis (PCA) + LR
∙ Sparse Coding (SC) [9] + LR
∙ Graph Regularized SC (GraphSC) [24] + LR
∙ Our proposed MMD Regularized SC (TSC_MMD) + LR

All of SC, GraphSC, TSC_MMD, and TSC can learn sparse representations for the input data points. In particular, SC is a special case of TSC with μ = γ = 0, GraphSC is a special case of TSC with μ = 0, and TSC_MMD is a special case of TSC with γ = 0. Note that our proposed TSC_MMD is essentially different from the method introduced in Quanz et al. [16, 17], which
adopts a kernel density estimation (KDE) technique to estimate the PDFs of the distributions and then minimizes the Jensen-Shannon divergence between them. This is a stricter regularization than MMD and may be prone to overfitting.

5.2.2 Implementation Details

Following [24, 14], SC, GraphSC, TSC_MMD, and TSC are performed on both labeled and unlabeled data as an unsupervised dimensionality reduction procedure; then a supervised LR classifier is trained on the labeled data to classify the unlabeled data. We apply PCA to reduce the data dimensionality by keeping 98% of the information in the largest eigenvectors, and then perform all of the above algorithms in the PCA subspace.

Under our experimental setup, it is impossible to automatically tune the optimal parameters for the target classifier using cross validation, since the labeled and unlabeled data are sampled from different distributions. Therefore, we evaluate the five baseline methods on our datasets by empirically searching the parameter space for the optimal parameter settings, and report the best results of each method.
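The evaluation protocol just described — unsupervised dimensionality reduction on all data, then a supervised classifier on the labeled part — can be sketched in numpy. The classifier below is a nearest-centroid stand-in for the logistic regression actually used, and `pca_fit` mirrors the 98%-variance rule:

```python
import numpy as np

def pca_fit(X, var_keep=0.98):
    """PCA keeping enough components to explain `var_keep` of the
    variance, mirroring the 98% rule in the text."""
    mu = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    d = int(np.searchsorted(ratio, var_keep)) + 1
    return mu, Vt[:d]                        # mean and projection rows

def nearest_centroid(Ztr, ytr, Zte):
    """Stand-in classifier; the paper trains logistic regression here."""
    classes = np.unique(ytr)
    cents = np.stack([Ztr[ytr == c].mean(axis=0) for c in classes])
    dist = ((Zte[:, None, :] - cents[None]) ** 2).sum(axis=-1)
    return classes[dist.argmin(axis=1)]

def accuracy(y_true, y_pred):
    """Fraction of test points whose predicted label matches the truth."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

The same `accuracy` function corresponds to the evaluation metric defined in the next section; in the cross-distribution setting the train and test splits would come from the two different datasets rather than from one random split.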
For LR(8), we set the trade-off parameter C by searching C ∈ {0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100}. For SC(9)-based methods [9], we set the number of basis vectors to k = 128. For GraphSC(10) [24], we set the trade-off parameter γ by searching γ ∈ {0.01, 0.1, 1, 10, 100}. For the dimensionality reduction methods, we use ℓ2-norm normalized feature vectors.

The TSC approach has three model parameters: the MMD regularization parameter μ, the graph regularization parameter γ, and the sparsity regularization parameter λ. In the coming sections, we provide an empirical analysis of parameter sensitivity, which verifies that TSC can achieve stable performance under a wide range of parameter values. When comparing with the baseline methods, we use the following parameter settings: k = 128, p = 5, μ = 10^5, γ = 1, λ = 0.1, and #iterations T = 100. We run TSC 10 repeated times to remove any randomness caused by random initialization.

(8) .tw/~cjlin/liblinear
(9) /~hllee/softwares/nips06-sparsecoding.htm
(10) /home/dengcai/Data/SparseCoding.html

We use classification Accuracy on the test data as the evaluation metric, which is widely used in the literature [17, 24, 16]:

Accuracy = |{x : x ∈ D_ts ∧ ŷ(x) = y(x)}| / |{x : x ∈ D_ts}|

where D_ts is the set of test data, y(x) is the ground-truth label of x, and ŷ(x) is the label predicted by the classification algorithm.

Table 3. Classification accuracy (%) on the cross-distribution datasets.

Dataset | USPS vs MNIST | PIE1 vs PIE2 | MSRC vs VOC
LR | 31.70±0.00 | 29.53±0.00 | 34.38±0.00
PCA | 32.15±0.00 | 28.93±0.00 | 32.75±0.00
SC [9] | 36.90±0.65 | 17.74±0.85 | 30.28±0.93
GraphSC [24] | 41.18±0.15 | 19.72±1.55 | 30.61±0.34
TSC_MMD | 47.30±2.13 | 36.71±1.76 | 34.27±0.45
TSC | 57.77±1.69 | 37.30±1.68 | 36.47±0.40

5.3. Experimental Results

The classification accuracy of TSC and the five baseline methods on the three cross-distribution image datasets USPS vs MNIST, PIE1 vs PIE2, and MSRC vs VOC is reported in Table 3. From the results we observe that TSC achieves much better performance than the first four baseline methods. The average classification accuracies of TSC on the three datasets are 57.77%, 37.30%, and 36.47%, respectively. The performance improvements
are16.59%,7.77%,and 2.09%compared to the best baseline methods GraphSC and LR,respectively.Furthermore,from the results averaged by 10repeated runs in Table3,we see that the deviations are small compared to the accuracy improvements,which vali-dates that TSC performs stably to the random initialization. This verifies that TSC can construct robust sparse represen-tations for classifying cross-distribution images accurately.We have noticed that our TSC MMD approach,which is a special case of TSC withγ=0,also outperforms all the first four baseline methods.This validates that minimizing the distribution divergence is very important to make the induced representations robust for cross-distribution image classification.In particular,TSC MMD has significantly out-performed GraphSC,which indicates that minimizing the distribution divergence is more important than preserving the geometric structure when labeled and unlabeled images are sampled from different distributions.It is expected that TSC can also achieve better performance than TSC MMD.By incorporating the graph Laplacian term of coefficients into TSC,we aim to enrich the sparse representations with more discriminating power to benefit the classification problems.。
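The accuracy metric above, and the distribution divergence that the TSC_MMD variant minimizes, can both be sketched numerically. The snippet below is our own illustration, not the paper's code; the function names and the linear-kernel empirical MMD estimate (squared distance between the domain sample means) are our assumptions:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of test points whose predicted label matches the truth:
    |{x in D_ts : yhat(x) = y(x)}| / |D_ts|.  (Illustration only.)"""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def mmd_linear(Xs, Xt):
    """Empirical Maximum Mean Discrepancy with a linear kernel:
    squared Euclidean distance between the means of the source and
    target samples.  Zero iff the two sample means coincide."""
    diff = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(diff @ diff)
```

Driving `mmd_linear` toward zero is what aligns the two domains: identical sample sets give an MMD of exactly 0, while shifting one domain increases it.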
· Mental Health Assessment ·

Validity and reliability of the Chinese version of the Entrapment Scale in medical students*

GONG Ruijie1, LIU Jingyi1, WANG Yichen2, CAI Yong3, WANG Suping3
1 Shanghai Xuhui Center for Disease Control and Prevention, Shanghai 200237, China
2 Department of Hospital Infection Control, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
3 School of Medicine, Shanghai Jiao Tong University, Shanghai 200025, China
Corresponding author: WANG Suping, wangsuping@

【Abstract】 Objective: To introduce the Entrapment Scale (ES) and evaluate its validity and reliability among medical students. Methods: A total of 1768 students at a medical school were selected and randomly split in half, with one half (n = 855) used for exploratory factor analysis and the other half (n = 913) for confirmatory factor analysis; the 9-item Patient Health Questionnaire (PHQ-9) was used to test criterion validity. One month later, 53 students drawn from the total sample were retested. Results: Exploratory factor analysis retained all 16 items loading on 1 common factor, which explained 64.66% of the total variance; item factor loadings ranged from 0.23 to 0.77. Confirmatory factor analysis showed that the two-factor model fitted slightly better than the one-factor model (χ²/df = 7.00, RMSEA = 0.08, GFI = 0.91, CFI = 0.95), with factor loadings ranging from 0.48 to 0.89. ES scores were positively correlated with PHQ-9 scores (ICC = 0.44). Cronbach's α was 0.96 for the total scale and 0.94 and 0.93 for the two dimensions; test-retest reliability was 0.83 for the total scale and 0.80 and 0.83 for the two dimensions. Conclusion: The Chinese version of the Entrapment Scale has good validity and reliability among medical students and can be used to assess entrapment in this population.

【Key words】 Entrapment Scale; medical students; validity; reliability

CLC number: B841.7; Document code: A; Article ID: 1000-6729(2019)005-0393-05; doi: 10.3969/j.issn.1000-6729.2019.05.015
(Chin Ment Health J, 2019, 33(5): 393-397.)

* Funding: National Natural Science Foundation of China — research on intervention strategies for HIV high-risk behavior among transgender men who have sex with men, based on the IMB model

Entrapment, in psychophysiology, refers to the feeling or psychological state of wanting to escape from a threat or stressor but remaining trapped because one lacks the ability to do so [1].
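The internal-consistency statistic reported above follows directly from the item-score matrix. A minimal sketch of the computation (our illustration; the function name and example data are assumptions, not the study's data):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_subjects, n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score).
    Illustration only, using sample variances (ddof=1)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)
```

As a sanity check, perfectly correlated items yield α = 1, and weaker inter-item correlation (as in real scale data, e.g. the 0.93-0.96 values above) pulls α below 1.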
Fuzhou University — Specialized English
Literature Review
Topic: Research on Image Dehazing and Enhancement Algorithms
Name:        Student ID:        Major:

I. Introduction
Because air pollution has worsened in recent years, hazy weather has appeared more and more frequently in China. For example, from the end of 2012 to the beginning of 2013, several spells of haze lasting seven consecutive days or more shrouded more than half of the country, severely affecting sea, land, and air transportation as well as people's daily life and personal safety. Therefore, in addition to reducing air pollution, improving the clarity of hazy images and videos is an urgent and important problem.
Image dehazing is, in essence, a practical application of image enhancement. In general, the transmission and conversion steps of various imaging systems (display, copying, imaging, scanning, transmission, and so on) degrade image quality to some extent. For example, haze blurs an image at capture time; during transmission, noise contaminates the image so that it is unpleasant to view; or the amount of information a computer can extract from it is reduced, causing errors. Degraded images must therefore be improved, mainly so that they better suit human visual characteristics or computer recognition systems. From the viewpoint of image quality assessment, the main goal of image enhancement is to improve image discriminability: by selectively emphasizing certain information of interest for human or machine analysis and suppressing useless information, the usability of an image is increased; that is, image enhancement only strengthens the ability to distinguish certain information [1].

II. Background and Significance
Air quality has degraded severely in recent years, haze and other adverse weather occur frequently, and PM2.5 [2] readings have drawn increasing public attention.
In images captured in hazy weather, the turbid medium in the atmosphere strongly absorbs and scatters light, so the transmitted light intensity is attenuated and the light received by the optical sensor changes. This directly lowers image contrast, shrinks the dynamic range, and leaves the image blurred and unclear, with indistinct detail; many features are covered or smeared, and the discriminability of the information drops sharply. At the same time, color fidelity declines, with severe color shift and distortion, so a satisfactory visual effect cannot be achieved [3-6].
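The degradation process described above is commonly formalized in the dehazing literature by the atmospheric scattering model I(x) = J(x)·t(x) + A·(1 − t(x)), where J is the haze-free scene radiance, A the global atmospheric light, and t(x) = exp(−β·d(x)) the transmission along scene depth d. This review excerpt does not state the model explicitly, so the sketch below is a standard-model illustration with assumed names and parameter values:

```python
import numpy as np

def hazy_image(J, depth, A=0.9, beta=1.0):
    """Simulate haze degradation: I = J * t + A * (1 - t), where the
    transmission t = exp(-beta * depth) decays with scene depth.
    J holds haze-free intensities in [0, 1].  (Illustration only;
    A and beta are assumed example values.)"""
    t = np.exp(-beta * depth)
    return J * t + A * (1.0 - t)
```

At depth 0 the transmission is 1 and I equals J (no haze); as depth grows, t tends to 0 and every pixel washes out toward the atmospheric light A, which is exactly the contrast loss and color shift described above.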
Such visually degraded images lose part of their information, which makes target identification difficult and directly limits the effectiveness of outdoor object recognition and tracking, intelligent navigation, highway visual surveillance, satellite remote sensing, military aerial reconnaissance, and other systems, greatly affecting production and daily life [7-9]. Taking highway surveillance as an example: in dense fog, road visibility drops sharply and the road information a driver obtains visually is often inaccurate, further impairing judgment of the environment and easily leading to traffic accidents; highways are then closed or traffic is restricted, causing great inconvenience to travel [10].