Long-term Recurrent Convolutional Networks for

格式：pdf
大小：4.39 MB
文档页数：13

下载文档原格式

Introduction to Artificial Intelli智慧树知到课后章节答案2023年

Introduction to Artificial Intelligence智慧树知到课后章节答案2023年下哈尔滨工程大学哈尔滨工程大学第一章测试1.All life has intelligence The following statements about intelligence arewrong（）A:All life has intelligence B:Bacteria do not have intelligence C:At present,human intelligence is the highest level of nature D:From the perspective of life, intelligence is the basic ability of life to adapt to the natural world答案:Bacteria do not have intelligence2.Which of the following techniques is unsupervised learning in artificialintelligence?（）A:Neural network B:Support vector machine C:Decision tree D:Clustering答案:Clustering3.To which period can the history of the development of artificial intelligencebe traced back?（）A:1970s B:Late 19th century C:Early 21st century D:1950s答案:Late 19th century4.Which of the following fields does not belong to the scope of artificialintelligence application?（）A:Aviation B:Medical C:Agriculture D:Finance答案:Aviation5.The first artificial neuron model in human history was the MP model,proposed by Hebb.（）A:对 B:错答案:错6.Big data will bring considerable value in government public services, medicalservices, retail, manufacturing, and personal location services. （）A:错 B:对答案:对第二章测试1.Which of the following options is not human reason:（）A:Value rationality B:Intellectual rationality C:Methodological rationalityD:Cognitive rationality答案:Intellectual rationality2.When did life begin? （）A:Between 10 billion and 4.5 billion years B:Between 13.8 billion years and10 billion years C:Between 4.5 billion and 3.5 billion years D:Before 13.8billion years答案:Between 4.5 billion and 3.5 billion years3.Which of the following statements is true regarding the philosophicalthinking about artificial intelligence?（）A:Philosophical thinking has hindered the progress of artificial intelligence.B:Philosophical thinking has contributed to the development of artificialintelligence. C:Philosophical thinking is only concerned with the ethicalimplications of artificial intelligence. D:Philosophical thinking has no impact on the development of artificial intelligence.答案:Philosophical thinking has contributed to the development ofartificial intelligence.4.What is the rational nature of artificial intelligence?（）A:The ability to communicate effectively with humans. B:The ability to feel emotions and express creativity. C:The ability to reason and make logicaldeductions. D:The ability to learn from experience and adapt to newsituations.答案:The ability to reason and make logical deductions.5.Which of the following statements is true regarding the rational nature ofartificial intelligence?（）A:The rational nature of artificial intelligence includes emotional intelligence.B:The rational nature of artificial intelligence is limited to logical reasoning.C:The rational nature of artificial intelligence is not important for itsdevelopment. D:The rational nature of artificial intelligence is only concerned with mathematical calculations.答案:The rational nature of artificial intelligence is limited to logicalreasoning.6.Connectionism believes that the basic element of human thinking is symbol,not neuron; Human's cognitive process is a self-organization process ofsymbol operation rather than weight. （）A:对 B:错答案:错第三章测试1.The brain of all organisms can be divided into three primitive parts:forebrain, midbrain and hindbrain. Specifically, the human brain is composed of brainstem, cerebellum and brain (forebrain). （）A:错 B:对答案:对2.The neural connections in the brain are chaotic. （）A:对 B:错答案:错3.The following statement about the left and right half of the brain and itsfunction is wrong （）.A:When dictating questions, the left brain is responsible for logical thinking,and the right brain is responsible for language description. B:The left brain is like a scientist, good at abstract thinking and complex calculation, but lacking rich emotion. C:The right brain is like an artist, creative in music, art andother artistic activities, and rich in emotion D:The left and right hemispheres of the brain have the same shape, but their functions are quite different. They are generally called the left brain and the right brain respectively.答案:When dictating questions, the left brain is responsible for logicalthinking, and the right brain is responsible for language description.4.What is the basic unit of the nervous system?（）A:Neuron B:Gene C:Atom D:Molecule答案:Neuron5.What is the role of the prefrontal cortex in cognitive functions?（）A:It is responsible for sensory processing. B:It is involved in emotionalprocessing. C:It is responsible for higher-level cognitive functions. D:It isinvolved in motor control.答案:It is responsible for higher-level cognitive functions.6.What is the definition of intelligence?（）A:The ability to communicate effectively. B:The ability to perform physicaltasks. C:The ability to acquire and apply knowledge and skills. D:The abilityto regulate emotions.答案:The ability to acquire and apply knowledge and skills.第四章测试1.The forward propagation neural network is based on the mathematicalmodel of neurons and is composed of neurons connected together by specific connection methods. Different artificial neural networks generally havedifferent structures, but the basis is still the mathematical model of neurons.（）A:对 B:错答案:对2.In the perceptron, the weights are adjusted by learning so that the networkcan get the desired output for any input. （）A:对 B:错答案:对3.Convolution neural network is a feedforward neural network, which hasmany advantages and has excellent performance for large image processing.Among the following options, the advantage of convolution neural network is（）.A:Implicit learning avoids explicit feature extraction B:Weight sharingC:Translation invariance D:Strong robustness答案:Implicit learning avoids explicit feature extraction;Weightsharing;Strong robustness4.In a feedforward neural network, information travels in which direction?（）A:Forward B:Both A and B C:None of the above D:Backward答案:Forward5.What is the main feature of a convolutional neural network?（）A:They are used for speech recognition. B:They are used for natural languageprocessing. C:They are used for reinforcement learning. D:They are used forimage recognition.答案:They are used for image recognition.6.Which of the following is a characteristic of deep neural networks?（）A:They require less training data than shallow neural networks. B:They havefewer hidden layers than shallow neural networks. C:They have loweraccuracy than shallow neural networks. D:They are more computationallyexpensive than shallow neural networks.答案:They are more computationally expensive than shallow neuralnetworks.第五章测试1.Machine learning refers to how the computer simulates or realizes humanlearning behavior to obtain new knowledge or skills, and reorganizes the existing knowledge structure to continuously improve its own performance.（）A:对 B:错答案:对2.The best decision sequence of Markov decision process is solved by Bellmanequation, and the value of each state is determined not only by the current state but also by the later state.（）A:对 B:错答案:对3.Alex Net's contributions to this work include: （）.A:Use GPUNVIDIAGTX580 to reduce the training time B:Use the modified linear unit (Re LU) as the nonlinear activation function C:Cover the larger pool to avoid the average effect of average pool D:Use the Dropouttechnology to selectively ignore the single neuron during training to avoid over-fitting the model答案:Use GPUNVIDIAGTX580 to reduce the training time;Use themodified linear unit (Re LU) as the nonlinear activation function;Cover the larger pool to avoid the average effect of average pool;Use theDropout technology to selectively ignore the single neuron duringtraining to avoid over-fitting the model4.In supervised learning, what is the role of the labeled data?（）A:To evaluate the model B:To train the model C:None of the above D:To test the model答案:To train the model5.In reinforcement learning, what is the goal of the agent?（）A:To identify patterns in input data B:To minimize the error between thepredicted and actual output C:To maximize the reward obtained from theenvironment D:To classify input data into different categories答案:To maximize the reward obtained from the environment6.Which of the following is a characteristic of transfer learning?（）A:It can only be used for supervised learning tasks B:It requires a largeamount of labeled data C:It involves transferring knowledge from onedomain to another D:It is only applicable to small-scale problems答案:It involves transferring knowledge from one domain to another第六章测试1.Image segmentation is the technology and process of dividing an image intoseveral specific regions with unique properties and proposing objects ofinterest. In the following statement about image segmentation algorithm, the error is （）.A:Region growth method is to complete the segmentation by calculating the mean vector of the offset. B:Watershed algorithm, MeanShift segmentation,region growth and Ostu threshold segmentation can complete imagesegmentation. C:Watershed algorithm is often used to segment the objectsconnected in the image. D:Otsu threshold segmentation, also known as themaximum between-class difference method, realizes the automatic selection of global threshold T by counting the histogram characteristics of the entire image答案:Region growth method is to complete the segmentation bycalculating the mean vector of the offset.2.Camera calibration is a key step when using machine vision to measureobjects. Its calibration accuracy will directly affect the measurementaccuracy. Among them, camera calibration generally involves the mutualconversion of object point coordinates in several coordinate systems. So,what coordinate systems do you mean by "several coordinate systems" here?（）A:Image coordinate system B:Image plane coordinate system C:Cameracoordinate system D:World coordinate system答案:Image coordinate system;Image plane coordinate system;Camera coordinate system;World coordinate systemmonly used digital image filtering methods:（）.A:bilateral filtering B:median filter C:mean filtering D:Gaussian filter答案:bilateral filtering;median filter;mean filtering;Gaussian filter4.Application areas of digital image processing include:（）A:Industrial inspection B:Biomedical Science C:Scenario simulation D:remote sensing答案:Industrial inspection;Biomedical Science5.Image segmentation is the technology and process of dividing an image intoseveral specific regions with unique properties and proposing objects ofinterest. In the following statement about image segmentation algorithm, the error is ( ).A:Otsu threshold segmentation, also known as the maximum between-class difference method, realizes the automatic selection of global threshold T by counting the histogram characteristics of the entire imageB: Watershed algorithm is often used to segment the objects connected in the image. C:Region growth method is to complete the segmentation bycalculating the mean vector of the offset. D:Watershed algorithm, MeanShift segmentation, region growth and Ostu threshold segmentation can complete image segmentation.答案:Region growth method is to complete the segmentation bycalculating the mean vector of the offset.第七章测试1.Blind search can be applied to many different search problems, but it has notbeen widely used due to its low efficiency.（）A:错 B:对答案:对2.Which of the following search methods uses a FIFO queue （）.A:width-first search B:random search C:depth-first search D:generation-test method答案:width-first search3.What causes the complexity of the semantic network （）.A:There is no recognized formal representation system B:The quantifiernetwork is inadequate C:The means of knowledge representation are diverse D:The relationship between nodes can be linear, nonlinear, or even recursive 答案:The means of knowledge representation are diverse;Therelationship between nodes can be linear, nonlinear, or even recursive4.In the knowledge graph taking Leonardo da Vinci as an example, the entity ofthe character represents a node, and the relationship between the artist and the character represents an edge. Search is the process of finding the actionsequence of an intelligent system.（）A:对 B:错答案:对5.Which of the following statements about common methods of path search iswrong（）A:When using the artificial potential field method, when there are someobstacles in any distance around the target point, it is easy to cause the path to be unreachable B:The A* algorithm occupies too much memory during the search, the search efficiency is reduced, and the optimal result cannot beguaranteed C:The artificial potential field method can quickly search for acollision-free path with strong flexibility D:A* algorithm can solve theshortest path of state space search答案:When using the artificial potential field method, when there aresome obstacles in any distance around the target point, it is easy tocause the path to be unreachable第八章测试1.The language, spoken language, written language, sign language and Pythonlanguage of human communication are all natural languages.（）A:对 B:错答案:错2.The following statement about machine translation is wrong （）.A:The analysis stage of machine translation is mainly lexical analysis andpragmatic analysis B:The essence of machine translation is the discovery and application of bilingual translation laws. C:The four stages of machinetranslation are retrieval, analysis, conversion and generation. D:At present,natural language machine translation generally takes sentences as thetranslation unit.答案:The analysis stage of machine translation is mainly lexical analysis and pragmatic analysis3.Which of the following fields does machine translation belong to? （）A:Expert system B:Machine learning C:Human sensory simulation D:Natural language system答案:Natural language system4.The following statements about language are wrong: （）。

卷积神经网络研究综述

卷积神经网络研究综述一、引言卷积神经网络（Convolutional Neural Network，简称CNN）是深度学习领域中的一类重要算法，它在计算机视觉、自然语言处理等多个领域中都取得了显著的成果。

CNN的设计灵感来源于生物视觉神经系统的结构，尤其是视觉皮层的组织方式，它通过模拟视觉皮层的层级结构来实现对输入数据的层次化特征提取。

在引言部分，我们首先要介绍CNN的研究背景。

随着信息技术的飞速发展，大数据和人工智能逐渐成为研究的热点。

在这个过程中，如何有效地处理和分析海量的图像、视频等数据成为了一个亟待解决的问题。

传统的机器学习方法在处理这类数据时往往面临着特征提取困难、模型复杂度高等问题。

而CNN的出现，为解决这些问题提供了新的思路。

接着，我们要阐述CNN的研究意义。

CNN通过其独特的卷积操作和层次化结构，能够自动学习并提取输入数据中的特征，从而避免了繁琐的特征工程。

同时，CNN还具有良好的泛化能力和鲁棒性，能够处理各种复杂的数据类型和场景。

因此，CNN在计算机视觉、自然语言处理等领域中都得到了广泛的应用，并取得了显著的成果。

最后，我们要介绍本文的研究目的和结构安排。

本文旨在对CNN 的基本原理、发展历程和改进优化方法进行系统的综述，以便读者能够全面了解CNN的相关知识和技术。

为了达到这个目的，我们将按照CNN的基本原理、发展历程和改进优化方法的顺序进行论述，并在最后对全文进行总结和展望。

二、卷积神经网络基本原理卷积神经网络的基本原理主要包括卷积操作、池化操作和全连接操作。

这些操作共同构成了CNN的基本框架，并使其具有强大的特征学习和分类能力。

首先，卷积操作是CNN的核心操作之一。

它通过一个可学习的卷积核在输入数据上进行滑动窗口式的计算，从而提取出输入数据中的局部特征。

卷积操作具有两个重要的特点：局部连接和权值共享。

局部连接意味着每个神经元只与输入数据的一个局部区域相连，这大大降低了模型的复杂度；权值共享则意味着同一卷积层内的所有神经元共享同一组权值参数，这进一步减少了模型的参数数量并提高了计算效率。

语义分析的一些方法

语义分析的一些方法(上篇)人工智能林 17小时前70℃0评论作者：火光摇曳念。

wikipedia上的解释：In machine learning, semantic analysis of a corpus is the task of building structures that approximate concepts from a large set of documents(or images)。

工作这几年，陆陆续续实践过一些项目，有搜索广告，社交广告，微博广告，品牌广告，内容广告等。

要使我们广告平台效益最大化，首先需要理解用户，Context(将展示广告的上下文)和广告，才能将最合适的广告展示给用户。

而这其中，就离不开对用户，对上下文，对广告的语义分析，由此催生了一些子项目，例如文本语义分析，图片语义理解，语义索引，短串语义关联，用户广告语义匹配等。

接下来我将写一写我所认识的语义分析的一些方法，虽说我们在做的时候，效果导向居多，方法理论理解也许并不深入，不过权当个人知识点总结，有任何不当之处请指正，谢谢。

本文主要由以下四部分组成：文本基本处理，文本语义分析，图片语义分析，语义分析小结。

先讲述文本处理的基本方法，这构成了语义分析的基础。

接着分文本和图片两节讲述各自语义分析的一些方法，值得注意的是，虽说分为两节，但文本和图片在语义分析方法上有很多共通与关联。

最后我们简单介绍下语义分析在广点通“用户广告匹配”上的应用，并展望一下未来的语义分析方法。

1 文本基本处理在讲文本语义分析之前，我们先说下文本基本处理，因为它构成了语义分析的基础。

而文本处理有很多方面，考虑到本文主题，这里只介绍中文分词以及Term Weighting。

1.1 中文分词拿到一段文本后，通常情况下，首先要做分词。

分词的方法一般有如下几种：基于字符串匹配的分词方法。

此方法按照不同的扫描方式，逐个查找词库进行分词。

根据扫描方式可细分为：正向最大匹配，反向最大匹配，双向最大匹配，最小切分(即最短路径)；总之就是各种不同的启发规则。

基于深度级联孪生网络的小样本调制识别算法

第47卷第4期Vol.47No.4计算机工程Computer Engineering2021年4月April 2021基于深度级联孪生网络的小样本调制识别算法冯磊，蒋磊，许华，苟泽中（空军工程大学信息与导航学院，西安710077）摘要：为解决传统基于深度学习的调制识别算法在小样本条件下识别准确率较低的问题，提出一种基于深度级联孪生网络的通信信号小样本调制识别算法。

根据通信信号时序图的时空特性，设计由卷积神经网络和长短时记忆网络级联的特征提取模块将原始信号特征映射至特征空间，同时在孪生网络架构下对提取的特征进行距离度量并以相似性约束训练网络，避免特征提取模块在训练过程中出现过拟合现象，最终通过最近邻分类器识别待测样本的调制类别。

在DeepSig 公开调制数据集上的实验结果表明，与传统基于深度学习的调制识别算法相比，该算法能有效降低训练过程中所需的样本量，且在小样本条件下的识别准确率更高。

关键词：卷积神经网络；长短时记忆网络；孪生网络；小样本；调制识别开放科学（资源服务）标志码（OSID ）：中文引用格式：冯磊，蒋磊，许华，等.基于深度级联孪生网络的小样本调制识别算法［J ］.计算机工程，2021，47（4）：108-114.英文引用格式：FENG Lei ，JIANG Lei ，XU Hua ，et al.Small sample modulation recognition algorithm based on depth cascade siamese network ［J ］.Computer Engineering ，2021，47（4）：108-114.Small Sample Modulation Recognition Algorithm Based onDepth Cascade Siamese NetworkFENG Lei ，JIANG Lei ，XU Hua ，GOU Zezhong（Information and Navigation College ，Air Force Engineering University ，Xi ’an 710077，China ）【Abstract 】The recognition accuracy of traditional modulation recognition algorithms based on deep learning is reducedin the case of small sample size.To solve the problem ，this paper proposes a small sample modulation recognition algorithm for communication signal based on deep cascade Siamese network.Firstly ，according to the spatial and temporal features of the sequence diagram of communication signals ，the feature extraction module cascaded by Convolutional Neural Network （CNN ）and Long Short Term Memory （LSTM ）network is designed to map the original signal features to the feature space.At the same time ，the extracted features are measured under the Siamese network architecture ，and train the network with similarity constraints to avoid the problem of over fitting in the training.Finally ，the modulation category of the to be tested samples is identified through the nearest neighbor classifier.Results of experiments on the public modulation dataset DeepSig show that compared with the traditional modulation recognition algorithms based on deep learning ，the proposed algorithm can significantly reduce the number of samples required in training ，and the recognition accuracy is higher under small sample conditions.【Key words 】Convolutional Neural Network （CNN ）；Long Short Term Memory （LSTM ）network ；Siamese network ；small sample ；modulation recognition DOI ：10.19678/j.issn.1000-3428.00584720概述自从2012年Hinton 团队的深度卷积神经网络AlexNet 在ImageNet 挑战赛上获得冠军后［1］，深度学习技术得到快速发展，各种针对语音识别和图像分类任务的新型深度网络被不断提出并取得了很好的处理效果，而基于深度学习技术的通信信号调制识别方法也成为当前国内外的研究热点。

常用的深度学习模型

常用的深度学习模型深度学习是一种涉及人工神经网络的机器学习方法，主要用于处理大型数据集，使模型能够更准确地预测和分类数据。

它已成为人工智能领域的一个热点，在计算机视觉、语音识别、自然语言处理等众多领域有广泛的应用。

本文将介绍常用的深度学习模型。

一、前馈神经网络（Feedforward Neural Network）前馈神经网络是最简单和最基本的深度学习模型，也是其他深度学习模型的基础。

它由输入层、隐藏层和输出层组成。

每层都由若干个神经元节点组成，节点与上一层或下一层的所有节点相连，并带有权重值。

前馈神经网络使用反向传播算法来训练模型，使它能够预测未来的数据。

二、卷积神经网络（Convolutional Neural Network）卷积神经网络是一种用于图像处理的深度学习模型，它能够对图像进行分类、分割、定位等任务。

它的核心是卷积层和池化层。

卷积层通过滤波器来识别图像中的特征，池化层则用于下采样，以减少计算量，同时保留重要特征。

卷积神经网络具有良好的特征提取能力和空间不变性。

三、递归神经网络（Recurrent Neural Network）递归神经网络是一种用于序列数据处理的深度学习模型，它能够处理可变长度的数据，如语音识别、自然语言处理等任务。

它的核心是循环层，每个循环层都可以接受来自上一次迭代的输出，并将其传递到下一次迭代。

递归神经网络具有记忆能力，能够学习序列数据的上下文信息。

四、长短时记忆网络（Long Short-Term Memory）长短时记忆网络是一种改进的递归神经网络，它能够处理长序列数据，并避免传统递归神经网络的梯度消失问题。

它的核心是LSTM单元，每个LSTM单元由输入门、遗忘门和输出门组成，能够掌握序列数据的长期依赖关系。

五、生成对抗网络（Generative Adversarial Networks）生成对抗网络是一种概率模型，由生成器和判别器两部分组成。

生成器用于生成假数据，判别器则用于将假数据与真实数据进行区分。

赵丽丽-美图短视频AI赋能

自动剪辑滤镜美化 AR特效风格转换背景分割
存储
flv mp4 avi …
信息提取
物体识别场景检测人物分析行为识别主题提取事件检测 …
检索
视频内容检索相似视频检索
背景介绍
AI在美图短视频业务中的应用
工具
内容
背景介绍
美图短视频内容分析和检索应用
场
景
行为
物体
标签化
违规
竞品
画
质
优质视频
美图算法标签定义方法
业务标签视频数据文本数据
提取标题提取评论词干提取视频特征提取音频特征
聚类
算法标签
多多多维层模度级态
美图短视频内容分析技术实践
美图算法标签定义方法
算法标签体系
视觉、文本标签
内容
场景
服饰
音频标签
音乐
讲话
其他
自宠舞
拍物…蹈
城自
市然
餐
建风…厅
筑景
汉服
民族服
…
校服
Donahue et al. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR2015. 伯克利大学
Simonyan et al. Two-stream convolutional networks for action recognition in videos. NIPS2014. 牛津大学
美图短视频内容分析技术实践
视频分类技术-美图成果
美图云视觉技术部&中科院：基于交互感知的时空金字塔注意力机制神经网络的行为分类算法 Interaction-aware Spatio-temporal Pyramid Attention Networks for Action Classification. ECCV2018.

基于深度学习的雷达辐射源识别算法

·技术前沿·航天电子对抗2021年第1期基于深度学习的雷达辐射源识别算法殷雪凤，武斌（西安电子科技大学电子工程学院，陕西西安710126）摘要：为解决雷达辐射源识别中特征提取困难、低信噪比条件下识别效率低的问题，提出了一种基于一维卷积神经网络和长短期记忆网络的深度学习智能识别算法，构建了一个CNN‑LSTM网络，能实现对不同脉内调制方式雷达辐射源的端到端识别。

该网络首先利用卷积层学习信号局部特征，然后将卷积层输出的结果输入长短期记忆网络，学习信号的全局特征，最终构造逻辑回归分类完成分类识别任务。

仿真结果表明，该算法较单一卷积神经网络模型具有更好的识别效果，抗噪声效果更强，在-6dB信噪比的条件下，识别的准确率仍能够达到90%以上。

关键词：卷积神经网络；长短期记忆网络；雷达辐射源识别；深度学习中图分类号:TN971+.5；TN974文献标志码:ARadar emitter identification algorithm based on deep learningYin Xuefeng，Wu Bin(School of Electronic Engineering，Xidian University，Xi’an710126，Shanxi，China)Abstract：In order to solve problems of feature extraction and low recognition efficiency under low signal-to-noise ratio in radar emitter identification，a deep learning intelligent recognition algorithm is proposed based onone-dimensional convolutional neural network and long short-term memory（CNN-LSTM）network.A CNN-LSTM network is constructed，which can realize end-to-end identification of different pulse modulation radarsources.The convolutional layers are used to learn the local characteristics of the signal by this network，and thelong short-term memory layer is used to learn the global characteristics.Finally a logistic regression classificationis constructed to complete the recognition task.Simulation results show that the algorithm has better recognitioneffect and stronger anti-noise effect compared with the single convolutional neural network model.Under the con‑dition of-6dB signal-to-noise ratio，the recognition accuracy can still reach more than90%.Key words：convolutional neural network；long short-term memory network；radar emitter identification；deep learning0引言雷达辐射源识别（REI/RER）技术是电子战中至关重要的一部分，是电子支援措施（ESM）的核心和雷达对抗系统中的关键技术[1]。

卷积神经网络在字符识别方面的应用

卷积神经网络在字符识别方面的应用一、概述二、背景三、人脑视觉机理四、关于特征4.1、特征表示的粒度4.2、初级(浅层)特征表示4.3、结构性特征表示4.4、需要有多少个特征, 五、Deep Learning的基本思想六、浅层学习(Shallow Learning)和深度学习(Deep Learning)七、Deep learning与Neural Network 八、Deep learning训练过程8.1、传统神经网络的训练方法8.2、deep learning训练过程九、Deep Learning的常用模型或者方法9.1、AutoEncoder自动编码器9.2、Sparse Coding稀疏编码9.3、Restricted Boltzmann Machine(RBM)限制波尔兹曼机9.4、Deep BeliefNetworks深信度网络9.5、Convolutional Neural Networks卷积神经网络十、总结与展望十一、参考文献和Deep Learning学习资源一、概述Artificial Intelligence，也就是人工智能，就像长生不老和星际漫游一样，是人类最美好的梦想之一。

虽然计算机技术已经取得了长足的进步，但是到目前为止，还没有一台电脑能产生“自我”的意识。

是的，在人类和大量现成数据的帮助下，电脑可以表现的十分强大，但是离开了这两者，它甚至都不能分辨一个喵星人和一个汪星人。

图灵(图灵，大家都知道吧。

计算机和人工智能的鼻祖，分别对应于其著名的“图灵机”和“图灵测试”)在 1950 年的论文里，提出图灵试验的设想，即，隔墙对话，你将不知道与你谈话的，是人还是电脑。

这无疑给计算机，尤其是人工智能，预设了一个很高的期望值。

但是半个世纪过去了，人工智能的进展，远远没有达到图灵试验的标准。

这不仅让多年翘首以待的人们，心灰意冷，认为人工智能是忽悠，相关领域是“伪科学”。

但是自 2006 年以来，机器学习领域，取得了突破性的进展。

CNN学习笔记

Convolutional Neural Networks卷积神经网络卷积神经网络是人工神经网络的一种，已成为当前语音分析和图像识别领域的研究热点。

它的权值共享网络结构使之更类似于生物神经网络，降低了网络模型的复杂度，减少了权值的数量。

该优点在网络的输入是多维图像时表现的更为明显，使图像可以直接作为网络的输入，避免了传统识别算法中复杂的特征提取和数据重建过程。

卷积网络是为识别二维形状而特殊设计的一个多层感知器，这种网络结构对平移、比例缩放、倾斜或者共他形式的变形具有高度不变性。

CNNs是受早期的延时神经网络（TDNN）的影响。

延时神经网络通过在时间维度上共享权值降低学习复杂度，适用于语音和时间序列信号的处理。

CNNs是第一个真正成功训练多层网络结构的学习算法。

它利用空间关系减少需要学习的参数数目以提高一般前向BP算法的训练性能。

CNNs作为一个深度学习架构提出是为了最小化数据的预处理要求。

在CNN中，图像的一小部分（局部感受区域）作为层级结构的最低层的输入，信息再依次传输到不同的层，每层通过一个数字滤波器去获得观测数据的最显著的特征。

这个方法能够获取对平移、缩放和旋转不变的观测数据的显著特征，因为图像的局部感受区域允许神经元或者处理单元可以访问到最基础的特征，例如定向边缘或者角点。

2）卷积神经网络的网络结构图：卷积神经网络的概念示范：输入图像通过和三个可训练的滤波器和可加偏置进行卷积，滤波过程如图一，卷积后在C1层产生三个特征映射图，然后特征映射图中每组的四个像素再进行求和，加权值，加偏置，通过一个Sigmoid函数得到三个S2层的特征映射图。

这些映射图再进过滤波得到C3层。

这个层级结构再和S2一样产生S4。

最终，这些像素值被光栅化，并连接成一个向量输入到传统的神经网络，得到输出。

一般地，C层为特征提取层，每个神经元的输入与前一层的局部感受野相连，并提取该局部的特征，一旦该局部特征被提取后，它与其他特征间的位置关系也随之确定下来；S层是特征映射层，网络的每个计算层由多个特征映射组成，每个特征映射为一个平面，平面上所有神经元的权值相等。

长短期记忆网络在自然语言处理中的应用

长短期记忆网络在自然语言处理中的应用长短期记忆网络（Long Short-Term Memory, LSTM）是一种神经网络模型，它的出现解决了传统的RNN（Recurrent Neural Network）模型无法处理长序列数据的问题。

在自然语言处理中，LSTM具有极高的应用价值，可以用于文本分类、命名实体识别、机器翻译等任务。

本文将探讨LSTM在自然语言处理中的应用。

一、LSTM的基本原理LSTM的基本模型由一个单位（unit）和记忆细胞（memory cell）组成。

记忆细胞是LSTM的核心，其中包含了一些门（gates）来控制信息的流入与流出。

LSTM的门有三种：输入门（input gate）、遗忘门（forget gate）和输出门（output gate）。

输入门：用于控制新信息的流入。

在每个时间步上，输入门会根据当前输入和上一个时间步的隐藏状态计算出一个0-1之间的权重向量。

这个向量将控制哪些信息会进入到当前的记忆细胞中。

遗忘门：用于控制旧信息的保留。

在每个时间步上，遗忘门会根据当前的输入和上一个时间步的隐藏状态计算出一个0-1之间的权重向量。

这个向量将指定哪些信息从上一个记忆细胞中被保留下来，并传递到当前的记忆细胞中。

输出门：用于控制记忆细胞中哪些信息将被用作当前时间步的输出。

在每个时间步上，输出门会根据当前的输入和隐藏状态计算出一个0-1之间的权重向量。

这个向量将筛选出记忆细胞中需要输出的信息。

通过这些门的控制，LSTM能够有效地处理长序列数据，避免了传统的RNN模型中存在的梯度消失和爆炸的问题。

二、LSTM在文本分类中的应用文本分类是NLP中最基本的任务之一，它的目标是将一些文本分类到预定义的类别之中。

LSTM的输入是一个词向量序列，输出是表示输入文本分类的向量。

在LSTM中，输入序列中的每个词向量都会被传递到一个LSTM单位中。

每个LSTM单位将根据它之前和它接收到的输入计算出一个隐藏状态并输出到下一个LSTM单位。

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

guage text)and can model complex temporal dynamics;yetthey can be optimized with backpropagation.Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual rep-resentations.Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately deﬁned and/or optimized.1.IntroductionRecognition and description of images and videos is a fundamental challenge of computer vision.Dramaticprogress has been achieved by supervised convolutional models on image recognition tasks,and a number of exten-sions to process video have been recently proposed.Ideally,a video model should allow processing of variable lengthinput sequences,and also provide for variable length out-puts,including generation of full-length sentence descrip-tions that go beyond conventional one-versus-all prediction tasks.In this paper we propose long-term recurrent convo-lutional networks(LRCNs),a novel architecture for visual recognition and description which combines convolutional layers and long-range temporal recursion and is end-to-end trainable(see Figure1).We instantiate our architecture for speciﬁc video activity recognition,image caption genera-1a rtion,and video description tasks as described below.To date,CNN models for video processing have success-fully considered learning of3-D spatio-temporalﬁlters over raw sequence data[13,2],and learning of frame-to-frame representations which incorporate instantaneous opticﬂow or trajectory-based models aggregated overﬁxed windows or video shot segments[16,33].Such models explore two extrema of perceptual time-series representation learning: either learn a fully general time-varying weighting,or apply simple temporal pooling.Following the same inspiration that motivates current deep convolutional models,we advo-cate for video recognition and description models which are also deep over temporal dimensions;i.e.,have temporal re-currence of latent variables.RNN models are well known to be“deep in time”;e.g.,explicitly so when unrolled,and form implicit compositional representations in the time do-main.Such“deep”models predated deep spatial convolu-tion models in the literature[31,44].Recurrent Neural Networks have long been explored in perceptual applications for many decades,with varying re-sults.A signiﬁcant limitation of simple RNN models which strictly integrate state information over time is known as the “vanishing gradient”effect:the ability to backpropogate an error signal through a long-range temporal interval becomes increasingly impossible in practice.A class of models which enable long-range learning wasﬁrst proposed in[12], and augments hidden state with nonlinear mechanisms to cause state to propagate without modiﬁcation,be updated, or be reset,using simple memory-cell like neural gates. While this model proved useful for several tasks,its util-ity became apparent in recent results reporting large-scale learning of speech recognition[10]and language transla-tion models[38,5].We show here that long-term recurrent convolutional models are generally applicable to visual time-series mod-eling;we argue that in visual tasks where static orﬂat tem-poral models have previously been employed,long-term RNNs can provide signiﬁcant improvement when ample training data are available to learn or reﬁne the representa-tion.Speciﬁcally,we show LSTM-type models provide for improved recognition on conventional video activity chal-lenges and enable a novel end-to-end optimizable mapping from image pixels to sentence-level natural language de-scriptions.We also show that these models improve gen-eration of descriptions from intermediate visual representa-tions derived from conventional visual models.We instantiate our proposed architecture in three experi-mental settings(see Figure3).First,we show that directly connecting a visual convolutional model to deep LSTM net-works,we are able to train video recognition models that capture complex temporal state dependencies(Figure3left; Section4).While existing labeled video activity datasets may not have actions or activities with extremely complex time dynamics,we nonetheless see improvements on the or-der of4%on conventional benchmarks.Second,we explore direct end-to-end trainable image to sentence mappings.Strong results for machine translation tasks have recently been reported[38,5];such models are encoder/decoder pairs based on LSTM networks.We pro-pose a multimodal analog of this model,and describe an architecture which uses a visual convnet to encode a deep state vector,and an LSTM to decode the vector into an natu-ral language string(Figure3middle;Section5).The result-ing model can be trained end-to-end on large-scale image and text datasets,and even with modest training provides competitive generation results compared to existing meth-ods.Finally,we show that LSTM decoders can be driven di-rectly from conventional computer vision methods which predict higher-level discriminative labels,such as the se-mantic video role tuple predictors in[30](Figure3right; Section6).While not end-to-end trainable,such models offer architectural and performance advantages over previ-ous statistical machine translation-based approaches,as re-ported below.We have realized a generalized“LSTM”-style RNN model in the widely-adopted open source deep learning framework Caffe[14],incorporating the speciﬁc LSTM units of[46,38,5].2.Background:Recurrent Neural Networks(RNNs)Traditional RNNs(Figure2,left)can learn complex tem-poral dynamics by mapping input sequences to a sequence of hidden states,and hidden states to outputs via the follow-ing recurrence equations(Figure2,left):h t=g(W xh x t+W hh h t−1+b h)z t=g(W hz h t+b z)where g is an element-wise non-linearity,such as a sigmoid or hyperbolic tangent,x t is the input,h t∈R N is the hidden state with N hidden units,and y t is the output at time t. For a length T input sequence x1,x2,...,x T ,the updates above are computed sequentially as h1(letting h0=0),y1, h2,y2,...,h T,y T.Though RNNs have proven successful on tasks such as speech recognition[42]and text generation[37],it can be difﬁcult to train them to learn long-term dynamics,likely due in part to the vanishing and exploding gradients prob-lem[12]that can result from propagating the gradients down through the many layers of the recurrent network, each corresponding to a particular timestep.LSTMs pro-vide a solution by incorporating memory units that allow the network to learn when to forget previous hidden states and when to update hidden states given new information.of the architecture described in[9],which was derived from the LSTM initially proposed in[12]).In addition to a hidden unit h t∈R N,the LSTM in-cludes an input gate i t∈R N,forget gate f t∈R N,output gate o t∈R N,input modulation gate g t∈R N,and mem-ory cell c t∈R N.The memory cell unit c t is a summation of two things:the previous memory cell unit c t−1which is modulated by f t,and g t,a function of the current input and previous hidden state,modulated by the input gate i t. Because i t and f t are sigmoidal,their values lie within the range[0,1],and i t and f t can be thought of as knobs that the LSTM learns to selectively forget its previous memory or consider its current input.Likewise,the output gate o t learns how much of the memory cell to transfer to the hid-den state.These additional cells enable the LSTM to learn extremely complex and long-term temporal dynamics the RNN is not capable of learning.Additional depth can be added to LSTMs by stacking them on top of each other,us-ing the hidden state of the LSTM in layer l−1as the input to the LSTM in layer l.Recently,LSTMs have achieved impressive results on language tasks such as speech recognition[10]and ma-chine translation[38,5].Analogous to CNNs,LSTMs are model that can learn to recognize and synthesize temporal dynamics for tasks involving sequential data(inputs or out-puts),visual,linsguistical or otherwise.Figure1depicts the core of our approach.Our LRCN model works by pass-ing each visual input v t(an image in isolation,or a frame from a video)through a feature transformationφV(v t) parametrized by V to produce aﬁxed-length vector rep-resentationφt∈R d.Having computed the feature-space representation of the visual input sequence φ1,φ2,...,φT , the sequence model then takes over.In its most general form,a sequence model parametrized by W maps an input x t and a previous timestep hidden state h t−1to an output z t and updated hidden state h t.There-fore,inference must be run sequentially(i.e.,from top to bottom,in the Sequence Learning box of Figure1),by computing in order:h1=f W(x1,h0)=f W(x1,0),then h2=f W(x2,h1),etc.,up to h T.Some of our models stack multiple LSTMs atop one another as described in Section2.Theﬁnal step in predicting a distribution P(y t)at timestep t is to take a softmax over the outputs z t of the sequential model,producing a distribution over the(in our case,ﬁnite and discrete)space C of possible per-timestepoutputs:P(y t=c)=exp(W zc z t,c+b c)c ∈Cexp(W zc z t,c +b c)The success of recent very deep models for object recog-nition[22,34,39]suggests that strategically composing many“layers”of non-linear functions can result in very powerful models for perceptual problems.For large T, the above recurrence indicates that the last few predictions from a recurrent network with T timesteps are computed by a very“deep”(T-layered)non-linear function,suggesting that the resulting recurrent model may have similar repre-sentational power to a T-layer neural network.Critically, however,the sequential model’s weights W are reused at every timestep,forcing the model to learn generic timestep-to-timestep dynamics(as opposed to dynamics directly con-ditioned on t,the sequence index)and preventing the pa-rameter size from growing in proportion to the maximum number of timesteps.In most of our experiments,the visual feature transfor-mationφcorresponds to the activations in some layer of a large ing a visual transformationφV(.)which is time-invariant and independent at each timestep has the important advantage of making the expensive convolutional inference and training parallelizable over all timesteps of the input,facilitating the use of fast contemporary CNN im-plementations whose efﬁciency relies on independent batch processing,and end-to-end optimization of the visual and sequential model parameters V and W.We consider three vision problems(activity recognition, image description and video description)which instantiate one of the following broad classes of sequential learning tasks:1.Sequential inputs,ﬁxed outputs(Figure3,left):x1,x2,...,x T →y.The visual activity recognition problem can fall under this umbrella,with videos of arbitrary length T as input,but with the goal of pre-dicting a single label like running or jumping drawn from aﬁxed vocabulary.2.Fixed inputs,sequential outputs(Figure3,middle):x→ y1,y2,...,y T .The image description problem ﬁts in this category,with a non-time-varying image as input,but a much larger and richer label space consist-ing of sentences of any length.3.Sequential inputs and outputs(Figure3,right):x1,x2,...,x T → y1,y2,...,y T .Finally,it’s easy to imagine tasks for which both the visual input and output are time-varying,and in general the number of input and output timesteps may differ(i.e.,we may have T=T ).In the video description task,for exam-ple,the input and output are both sequential,and thenumber of frames in the video should not constrain the length of(number of words in)the natural-language description.In the previously described formulation,each instance has T inputs x1,x2,...,x T and T outputs y1,y2,...,y T . We describe how we adapt this formulation in our hybrid model to tackle each of the above three problem settings. With sequential inputs and scalar outputs,we take a late fusion approach to merging the per-timestep predictions y1,y2,...,y T into a single prediction y for the full se-quence.Withﬁxed-size inputs and sequential outputs,we simply duplicate the input x at all T timesteps x t:=x(not-ing this can be done cheaply due to the time-invariant vi-sual feature extractor).Finally,for a sequence-to-sequence problem with(in general)different input and output lengths, we take an“encoder-decoder”approach inspired by[46].In this approach,one sequence model,the encoder,is used to map the input sequence to aﬁxed-length vector,then an-other sequence model,the decoder,is used to unroll this vector to sequential outputs of arbitrary length.Under this model,the system as a whole may be thought of as having T+T timesteps of input and output,wherein the input is processed and the decoder outputs are ignored for theﬁrst T timesteps,and the predictions are made and“dummy”inputs are ignored for the latter T timesteps.Under the proposed system,the weights(V,W)of the model’s visual and sequential components can be learned jointly by maximizing the likelihood of the ground truth outputs y t conditioned on the input data and labels up to that point(x1:t,y1:t−1)In particular,we minimize the negative log likelihood L(V,W)=−log P V,W(y t|x1:t,y1:t−1)of the training data(x,y).One of the most appealing aspects of the described sys-tem is the ability to learn the parameters“end-to-end,”such that the parameters V of the visual feature extractor learn to pick out the aspects of the visual input that are rele-vant to the sequential classiﬁcation problem.We train our LRCN models using stochastic gradient descent with mo-mentum,with backpropagation used to compute the gradi-ent∇L(V,W)of the objective L with respect to all param-eters(V,W).We next demonstrate the power of models which are both deep in space and deep in time by exploring three appli-cations:activity recognition,image description,and video description.4.Activity recognitionActivity recognition is an example of theﬁrst sequen-tial learning task described above;T individual frames are inputs into T convolutional networks which are then con-nected to a single-layer LSTM with256hidden units.A large body of recent work has proposed deep architecturesLSTMC N N building thefrontFigure 3:Task-speciﬁc instantiations of our LRCN model for activity recognition,image description,and video description.for activity recognition ([16,33,13,2,1]).[33,16]both propose convolutional networks which learn ﬁlters based on a stack of N input frames.Though we analyze clips of 16frames in this work,we note that the LRCN system is more ﬂexible than [33,16]since it is not constrained to analyz-ing ﬁxed length inputs and could potentially learn to rec-ognize complex video sequences (e.g .,cooking sequences as presented in 6).[1,2]use recurrent neural networks to learn temporal dynamics of either traditional vision features ([1])or deep features ([2]),but do not train their models end-to-end and do not pre-train on larger object recognition databases for important performance gains.We explore two variants of the LRCN architecture:one in which the LSTM is placed after the ﬁrst fully connected layer of the CNN (LRCN-fc 6)and another in which the LSTM is placed after the second fully connected layer of the CNN (LRCN-fc 7).We train the LRCN networks with video clips of 16frames.The LRCN predicts the video class at each time step and we average these predictions for ﬁnal classiﬁcation.At test time,we extract 16frame clips with a stride of 8frames from each video and average across clips.We also consider both RGB and ﬂow inputs.Flow is computed with [4]and transformed into a “ﬂow image”by centering x and y ﬂow values around 128and mul-tiplying by a scalar such that ﬂow values fall between 0and 255.A third channel for the ﬂow image is created by calculating the ﬂow magnitude.The CNN base of the LRCN is a hybrid of the Caffe [14]reference model,a mi-nor variant of AlexNet [22],and the network used by Zeiler &Fergus [47].The net is pre-trained on the 1.2M image ILSVRC-2012[32]classiﬁcation training subset of the Im-ageNet [7]dataset,giving the network a strong initialization to facilitate faster training and prevent over-ﬁtting to the rel-atively small video datasets.When classifying center crops,the top-1classiﬁcation accuracy is 60.2%and 57.4%for the hybrid and Caffe reference models,respectively.In our baseline model,T video frames are individually classiﬁed by a CNN.As in the LSTM model,whole video classiﬁca-tion is done by averaging scores across all video frames.4.1.EvaluationWe evaluate our architecture on the UCF-101dataset [36]which consists of over 12,000videos categorized into 101human action classes.The dataset is split into three splits,with a little under 8,000videos in the training set for each split.We report accuracy for split-1.Figure 1,columns 2-3,compare video classiﬁcation of our proposed models (LRCN-fc 6,LRCN-fc 7)against the baseline architecture for both RGB and ﬂow inputs.Each LRCN network is trained end-to-end.To determine if end-to-end training is necessary,we also train a LRCN-fc 6network in which only the LSTM parameters are learned.The fully ﬁne-tuned network increases performance from 70.47%to 71.12%,demonstrating that end-to-end ﬁne-tuning is indeed beneﬁcial.The LRCN-fc 6network yields the best results for both RGB and ﬂow and improves upon the baseline network by 2.12%and 4.75%respectively.RGB and ﬂow networks can be combined by comput-ing a weighted average of network scores as proposed in [33].Like [33],we report two weighted averages of the predictions from the RGB and ﬂow networks in Table 1(right).Since the ﬂow network outperforms the RGB net-work,weighting the ﬂow network higher unsurprisingly leads to better accuracy.In this case,LRCN outperforms the baseline single-frame model by 3.88%.The LRCN shows clear improvement over the baseline single-frame system and approaches the accuracy achieved by other deep models.[33]report the results on UCF-101Input Type Weighted Average Model RGB Flow1/2,1/21/3,2/3 Single frame65.4053.20––Single frame(ave.)69.0072.2075.7179.04 LRCN-fc671.1276.9581.9782.92 LRCN-fc770.6869.36––Table1:Activity recognition:Comparing single frame models to LRCN networks for activity recognition in the UCF-101[36] dataset,with both RGB andﬂow inputs.Our LRCN model con-sistently and strongly outperforms a model based on predictions from the underlying convolutional network architecture alone.by computing a weighted average betweenﬂow and RGB networks(86.4%for split1and87.6%averaging over all splits).Though[16]does not report numbers on the sepa-rate splits of UCF-101,the average split accuracy is65.4% which is substantially lower than our LRCN model.5.Image descriptionIn contrast to activity recognition,the static image de-scription task only requires a single convolutional network since the input consists of a single image.A variety of deep and multi-modal models[8,35,19,20,15,25,20,18]have been proposed for image description;in particular,[20,18] combine deep temporal models with convolutional repre-sentations.[20],utilizes a“vanilla”RNN as described in Section2,potentially making learning long-term tempo-ral dependencies difﬁcult.Contemporaneous with and most similar to our work is[18],which proposes a different ar-chitecture that uses the hidden state of an LSTM encoder at time T as the encoded representation of the length T in-put sequence.It then maps this sequence representation, combined with the visual representation from a convnet, into a joint space from which a separate decoder predicts words.This is distinct from our arguably simpler architec-ture,which takes as per-timestep input a copy of the static input image,along with the previous word.We present empirical results showing that our integrated LRCN archi-tecture outperforms these prior approaches,none of which comprise an end-to-end optimizable system over a hierar-chy of visual and temporal parameters.We now describe our instantiation of the LRCN architec-ture for the image description task.At each timestep,both the image features and the previous word are provided as in-puts to the sequential model,in this case a stack of LSTMs (each with1000hidden units),which is used to learn the dynamics of the time-varying output sequence,natural lan-guage.At timestep t,the input to the bottom-most LSTM is the embedded ground truth word from the previous timestep w t−1.For sentence generation,the input becomes a sample ˆw t−1from the model’s predicted distribution at the previous timestep.The second LSTM in the stack fuses the outputs of the bottom-most LSTM with the image representation φV(x)to produce a joint representation of the visual and language inputs up to time t.(The visual modelφV(x)used in this experiment is the base Caffe[14]reference model, very similar to the well-known AlexNet[22],pre-trained on ILSVRC-2012[32]as in Section4.)Any further LSTMs in the stack transform the outputs of the LSTM below,and the fourth LSTM’s outputs are inputs to the softmax which produces a distribution over words p(w t|w1:t−1).Following[19],we refer to the use of the bottom-most LSTM to exclusively process the language input(with no visual input)as the factored version of the model,and study the importance of this by comparing it to an unfactored vari-ant.See Figure4for details on the variants we study.Without any explicit language modeling or deﬁned syn-tax structure,the described LRCN system learns mappings from pixel intensity values to natural language descriptions that are often semantically descriptive and grammatically correct.5.1.EvaluationWe evaluate our image description model on both image retrieval and image annotation generation.Weﬁrst show the effectiveness of our model by quantitatively evaluating it on the image retrieval task proposed by[26]and seen in [25,15,35,8,18].Our model is trained on the combined training sets of the Flickr30k[28](28,000training images) and COCO2014[24]dataset(80,000training images).We report results on Flickr30k[28],with30,000images and ﬁve sentence annotations per image.We use1000images each for test and validation and the remaining28,000for training.Image retrieval results are recorded in Table2and re-port median rank,Med r,of theﬁrst retrieved ground truth image and Recall@K,the number of sentences for which the correct image is retrieved in the top-K.Our model consistently outperforms the strong baselines from recent work[18,25,15,35,8]as can be seen in Table2.Here, we make a note that the new OxfordNet model in[18]out-performs our model on the retrieval task.However,Ox-fordNet[18]utilizes a better-performing convolutional net-work to get the additional edge over the base ConvNet[18]. The strength of our temporal model(and integration of the temporal and visual models)can be more directly measured against the ConvNet[18]result,which uses the same base CNN architecture[22]pretrained on the same data.In Table3,we report image-to-caption retrieval results for each of the architectural variants in Figure4,as well as a four-layer version(LRCN4f)of the factored model.Based on the facts that LRCN2f outperforms the LRCN4f model, and LRCN1u outperforms LRCN2u,there seems to be little to be gained from naively stacking additional LSTM layers atop an existing network.On the other hand,a compari-of the LRCN architectures.See Figure4for diagrams of these ar-chitectures.The results indicate that the“factorization”is impor-tant to the LRCN’s retrieval performance,while simply stacking additional LSTM layers does not seem to improve performance. son of the LRCN2f and LRCN2u results indicatees that the “factorization”in the architecture is quite important to the model’s retrieval performance.To evaluate sentence generation,we use the BLEU[27] metric which was designed for automated evaluation of sta-tistical machine translation.BLEU is a modiﬁed form of precision that compares N-gram fragments of the hypothe-sis translation with multiple reference translations.We use BLEU as a measure of similarity of the descriptions.The unigram scores(B-1)account for the adequacy of(or the information retained)by the translation,while longer N-gram scores(B-2,B-3)account for theﬂuency.We com-pare our results with[25](on Flickr30k),and two strong reported in Table4.Additionally,we report results on the new COCO2014[24]dataset which has80,000training im-ages,and40,000validation images.Similar to Flickr30k, each image is annotated with5or more image annotations. We isolate5,000images from the validation set for testing purposes and the results are reported in Table4.Based on the B-1scores in Table4,generation using LRCN performs comparably with m-RNN[25]in terms of the information conveyed in the description.Furthermore, LRCN signiﬁcantly outperforms the baselines and the m-RNN with regard to theﬂuency(B-2,B-3)of the genera-tion,indicating the LRCN retains more of the bigrams and trigrams from the human-annotated descriptions.In addition to standard quantitative evaluations,we also employ Amazon Mechnical Turkers(AMT)to evaluate the generated sentences.Given an image and a set of descrip-tions from different models,we ask Turkers to rank the sentences based on correctness,grammar and relevance.Correctness Grammar Relevance TreeTalk[23] 4.08 4.35 3.98 OxfordNet[18] 3.71 3.46 3.70 NN[18] 3.44 3.20 3.49 LRCN fc8(ours) 3.74 3.19 3.72 LRCN ft(ours) 3.47 3.01 3.50 Captions 2.55 3.72 2.59Table5:Image description:Human evaluator rankings from1-6 (low is good)averaged for each method and criterion.We eval-uated on785Flickr images selected by the authors of[18]for the purposes of comparison against this similar contemporary ap-proach.We compared sentences from our model to the ones made publicly available by[18].As seen in Table5,ourﬁne-tuned(ft)LRCN model performs on par with the Nearest Neighbour(NN)on correctness and relevance,and better on grammar.We show example sentence generations in Fig-ure6.6.Video descriptionIn video description we must generate a variable length stream of words,similar to Section5.[11,30,17,3,6,17, 40,41]propose methods for generating sentence descrip-tions for video,but to our knowledge we present theﬁrst application of deep models to the vision description task.The LSTM framework allows us to model the video as a variable length input stream as discussed in Section3. However,due to limitations of available video description datasets we take a different path.We rely on more“tra-ditional”activity and video recognition processing for the input and use LSTMs for generating a sentence.Weﬁrst distinguish the following architectures for video description(see Figure5).For each architecture,we assume we have predictions of objects,subjects,and verbs present in the video from a CRF based on the full video input.In this way,we observe the video as whole at each time step, not incrementally frame by frame.(a)LSTM encoder&decoder with CRF max.(Fig-ure5(a))Theﬁrst architecture is motivated by the video de-scription approach presented in[30].Theyﬁrst recognize a semantic representation of the video using the maximum a posterior estimate(MAP)of a CRF taking in video features as unaries.This representation, e.g. person,cut,cutting board ,is then concatenated to a input sentence(person cut cutting board)which is translated to a natural sentence(a person cuts on the board)using phrase-based statistical ma-chine translation(SMT)[21].We replace the SMT with an LSTM,which has shown state-of-the-art performance for machine translation between languages[38,5].The archi-tecture(shown in Figure5(a))has an encoder LSTM(or-Architecture Input BLEU SMT[30]CRF max24.9 SMT[29]CRF prob26.9(a)LSTM Encoder-Decoder(ours)CRF max25.3(b)LSTM Decoder(ours)CRF max27.4(c)LSTM Decoder(ours)CRF prob28.8 Table6:Video description:Results on detailed description of TACoS multilevel[29],in%,see Section6for details.ange)which encodes the one-hot vector(binary index vec-tor in a vocabulary)of the input sentence as done in[38]. This allows for variable-length inputs.(Note that the input sentence might have a different number of words than el-ements of the semantic representation.)At the end of the encoder stage,theﬁnal hidden unit must remember all nec-essary information before being input into the decoder stage (pink)in which the hidden representation is decoded into a sentence,one word at each time step.We use the same two-layer LSTM for encoding and decoding.(b)LSTM decoder with CRF max.(Figure5(b))In this variant we exploit that the semantic representation can be encoded as a singleﬁxed length vector.We provide the entire visual input representation at each time step to the LSTM,analogous to how an entire image is provided as an input to the LSTM in image description.(c)LSTM decoder with CRF prob.(Figure5(c))A beneﬁt of using LSTMs for machine translation compared to phrase-based SMT[21]is that it can naturally incorpo-rate probability vectors during training and test time which allows the LSTM to learn uncertainties in visual generation rather than relying on MAP estimates.The architecture is the the same as in(b),but we replace max predictions with probability distributions.6.1.EvaluationWe evaluate our approach on the TACoS multilevel [29]dataset,which has44,762video/sentence pairs(about 40,000for training/validation).We compare to[30]who use max prediction as well as a variant presented in[29] which takes CRF probabilities at test time and uses a word lattice toﬁnd an optimal sentence prediction.Since we use the max prediction as well as the probability scores pro-vided by[29],we have an identical visual representation.[29]uses dense trajectories[43]and SIFT features as well as temporal context reasoning modeled in a CRF.Table6shows the BLEU-4score.The results show that (1)the LSTM outperforms an SMT-based approach to video description;(2)the simpler decoder architecture(b)and(c) achieve better performance than(a),likely because the in-put does not need to be memorized;and(3)our approach achieves28.8%,clearly outperforming the best reported number of26.9%on TACoS multilevel by[29].More broadly,these results show that our architecture。