Improving collaborative filtering-based recommender systems


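Analysis: Content-Based Recommendation and Collaborative Filtering Recommendation

When a user finishes consuming something they are interested in, a recommender system suggests similar things they may like. This article analyses two such approaches: collaborative filtering recommendation and content-based recommendation.

(1) Background. When a user shows interest in some content, the system should serve an extension of that interest. In a feed product, for example, new content should feel both familiar and novel, which encourages the user to keep consuming content.

Recommendations are usually made when the user finishes a piece of content: after reading a novel, the user is recommended novels of the same genre; after watching an Iron Man film, the user is recommended other films in the Iron Man series.

The core logic of similar-content recommendation is to recommend the content the user is most interested in at that moment, or the content most similar to what they have just consumed.

(2) Business goal. Maximise users' consumption of the recommended content.

(3) Metrics. The simplest metric is click-through rate (CTR): the number of user clicks divided by the number of recommendations shown.

A second kind of metric is the depth of consumption, for example how long a user stays on a page or how long it takes them to finish viewing it.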
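The CTR metric described above can be sketched in a few lines. This is only an illustration: the function name `click_through_rate` and the sample numbers are assumptions, not part of the original article.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return clicks divided by recommendations shown (0.0 if none were shown)."""
    if clicks < 0 or impressions < 0:
        raise ValueError("counts must be non-negative")
    if impressions == 0:
        return 0.0  # nothing was recommended, so no rate can be measured
    return clicks / impressions


# Hypothetical example: 30 clicks on 1200 recommended items.
ctr = click_through_rate(clicks=30, impressions=1200)
```

CTR alone says nothing about the depth of consumption, which is why the article pairs it with dwell-time style measurements.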

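Content-based recommendation

1. Basic principle. Using content metadata, or an analysis of the content itself, compute the pairwise similarity S_ab between any two items A and B, and recommend to the user the N items with the highest similarity.

2. Key steps

(1) Define the measurement dimensions. Dimensions are like coordinate axes. A person, for example, has many attributes: gender, age, height, weight, education level, professional skills and so on.

Together these form a multi-dimensional space. Each person has a specific value on every dimension, which gives a quantified representation of that person.

This maps an individual to an N-dimensional vector. Because requirements differ, the feature space we construct may differ as well.

Continuing the example, to select good soldiers a feature space consisting of gender, age, height and weight is probably sufficient.

To select good product managers, those dimensions are clearly not comprehensive enough.

(2) Quantify the content. Quantify each item, such as an article or a product, along the dimensions defined above.

(3) Compute similarity. With a distance measure, i.e. the distance between items in the feature space, a larger distance means a greater difference between individuals.

With a similarity measure, a smaller value means the individuals are less similar and differ more.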
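The three steps above (define dimensions, quantify items, compute similarity) can be sketched as follows. The article does not name a specific measure, so cosine similarity and Euclidean distance are assumed here as common choices; the item names and feature vectors are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Similarity measure: larger values mean more similar vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Distance measure: larger values mean more different vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_n_similar(target_id, items, n):
    """Return the n items most similar to target_id as (item_id, similarity)."""
    target = items[target_id]
    scored = [
        (other_id, cosine_similarity(target, vector))
        for other_id, vector in items.items()
        if other_id != target_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical items quantified on three made-up dimensions.
items = {
    "iron_man_1": [1.0, 0.9, 0.1],
    "iron_man_2": [1.0, 0.8, 0.2],
    "romance_novel": [0.0, 0.1, 1.0],
}
recommended = top_n_similar("iron_man_1", items, n=1)
```

Note the opposite orientation of the two measures: a recommender ranks by descending similarity or by ascending distance, never a mix of the two.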

Glossary of Chinese and English technical terms in artificial intelligence

名词解释中英文对比<using_information_sources> social networks 社会网络abductive reasoning 溯因推理action recognition(行为识别)active learning(主动学习)adaptive systems 自适应系统adverse drugs reactions(药物不良反应)algorithm design and analysis(算法设计与分析) algorithm(算法)artificial intelligence 人工智能association rule(关联规则)attribute value taxonomy 属性分类规范automomous agent 自动代理automomous systems 自动系统background knowledge 背景知识bayes methods(贝叶斯方法)bayesian inference(贝叶斯推断)bayesian methods(bayes 方法)belief propagation(置信传播)better understanding 内涵理解big data 大数据big data(大数据)biological network(生物网络)biological sciences(生物科学)biomedical domain 生物医学领域biomedical research(生物医学研究)biomedical text(生物医学文本)boltzmann machine(玻尔兹曼机)bootstrapping method 拔靴法case based reasoning 实例推理causual models 因果模型citation matching (引文匹配)classification (分类)classification algorithms(分类算法)clistering algorithms 聚类算法cloud computing(云计算)cluster-based retrieval (聚类检索)clustering (聚类)clustering algorithms(聚类算法)clustering 聚类cognitive science 认知科学collaborative filtering (协同过滤)collaborative filtering(协同过滤)collabrative ontology development 联合本体开发collabrative ontology engineering 联合本体工程commonsense knowledge 常识communication networks(通讯网络)community detection(社区发现)complex data(复杂数据)complex dynamical networks(复杂动态网络)complex network(复杂网络)complex network(复杂网络)computational biology 计算生物学computational biology(计算生物学)computational complexity(计算复杂性) computational intelligence 智能计算computational modeling(计算模型)computer animation(计算机动画)computer networks(计算机网络)computer science 计算机科学concept clustering 概念聚类concept formation 概念形成concept learning 概念学习concept map 概念图concept model 概念模型concept modelling 概念模型conceptual model 概念模型conditional random field(条件随机场模型) conjunctive quries 合取查询constrained least squares (约束最小二乘) convex programming(凸规划)convolutional neural networks(卷积神经网络) customer relationship management(客户关系管理) data analysis(数据分析)data analysis(数据分析)data center(数据中心)data clustering (数据聚类)data compression(数据压缩)data envelopment analysis (数据包络分析)data fusion 数据融合data 
generation(数据生成)data handling(数据处理)data hierarchy (数据层次)data integration(数据整合)data integrity 数据完整性data intensive computing(数据密集型计算)data management 数据管理data management(数据管理)data management(数据管理)data miningdata mining 数据挖掘data model 数据模型data models(数据模型)data partitioning 数据划分data point(数据点)data privacy(数据隐私)data security(数据安全)data stream(数据流)data streams(数据流)data structure( 数据结构)data structure(数据结构)data visualisation(数据可视化)data visualization 数据可视化data visualization(数据可视化)data warehouse(数据仓库)data warehouses(数据仓库)data warehousing(数据仓库)database management systems(数据库管理系统)database management(数据库管理)date interlinking 日期互联date linking 日期链接Decision analysis(决策分析)decision maker 决策者decision making (决策)decision models 决策模型decision models 决策模型decision rule 决策规则decision support system 决策支持系统decision support systems (决策支持系统) decision tree(决策树)decission tree 决策树deep belief network(深度信念网络)deep learning(深度学习)defult reasoning 默认推理density estimation(密度估计)design methodology 设计方法论dimension reduction(降维) dimensionality reduction(降维)directed graph(有向图)disaster management 灾害管理disastrous event(灾难性事件)discovery(知识发现)dissimilarity (相异性)distributed databases 分布式数据库distributed databases(分布式数据库) distributed query 分布式查询document clustering (文档聚类)domain experts 领域专家domain knowledge 领域知识domain specific language 领域专用语言dynamic databases(动态数据库)dynamic logic 动态逻辑dynamic network(动态网络)dynamic system(动态系统)earth mover's distance(EMD 距离) education 教育efficient algorithm(有效算法)electric commerce 电子商务electronic health records(电子健康档案) entity disambiguation 实体消歧entity recognition 实体识别entity recognition(实体识别)entity resolution 实体解析event detection 事件检测event detection(事件检测)event extraction 事件抽取event identificaton 事件识别exhaustive indexing 完整索引expert system 专家系统expert systems(专家系统)explanation based learning 解释学习factor graph(因子图)feature extraction 特征提取feature extraction(特征提取)feature extraction(特征提取)feature selection (特征选择)feature selection 特征选择feature selection(特征选择)feature space 特征空间first order logic 一阶逻辑formal logic 
形式逻辑formal meaning prepresentation 形式意义表示formal semantics 形式语义formal specification 形式描述frame based system 框为本的系统frequent itemsets(频繁项目集)frequent pattern(频繁模式)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy data mining(模糊数据挖掘)fuzzy logic 模糊逻辑fuzzy set theory(模糊集合论)fuzzy set(模糊集)fuzzy sets 模糊集合fuzzy systems 模糊系统gaussian processes(高斯过程)gene expression data 基因表达数据gene expression(基因表达)generative model(生成模型)generative model(生成模型)genetic algorithm 遗传算法genome wide association study(全基因组关联分析) graph classification(图分类)graph classification(图分类)graph clustering(图聚类)graph data(图数据)graph data(图形数据)graph database 图数据库graph database(图数据库)graph mining(图挖掘)graph mining(图挖掘)graph partitioning 图划分graph query 图查询graph structure(图结构)graph theory(图论)graph theory(图论)graph theory(图论)graph theroy 图论graph visualization(图形可视化)graphical user interface 图形用户界面graphical user interfaces(图形用户界面)health care 卫生保健health care(卫生保健)heterogeneous data source 异构数据源heterogeneous data(异构数据)heterogeneous database 异构数据库heterogeneous information network(异构信息网络) heterogeneous network(异构网络)heterogenous ontology 异构本体heuristic rule 启发式规则hidden markov model(隐马尔可夫模型)hidden markov model(隐马尔可夫模型)hidden markov models(隐马尔可夫模型) hierarchical clustering (层次聚类) homogeneous network(同构网络)human centered computing 人机交互技术human computer interaction 人机交互human interaction 人机交互human robot interaction 人机交互image classification(图像分类)image clustering (图像聚类)image mining( 图像挖掘)image reconstruction(图像重建)image retrieval (图像检索)image segmentation(图像分割)inconsistent ontology 本体不一致incremental learning(增量学习)inductive learning (归纳学习)inference mechanisms 推理机制inference mechanisms(推理机制)inference rule 推理规则information cascades(信息追随)information diffusion(信息扩散)information extraction 信息提取information filtering(信息过滤)information filtering(信息过滤)information integration(信息集成)information network analysis(信息网络分析) information network mining(信息网络挖掘) information network(信息网络)information processing 信息处理information processing 信息处理information 
resource management (信息资源管理) information retrieval models(信息检索模型) information retrieval 信息检索information retrieval(信息检索)information retrieval(信息检索)information science 情报科学information sources 信息源information system( 信息系统)information system(信息系统)information technology(信息技术)information visualization(信息可视化)instance matching 实例匹配intelligent assistant 智能辅助intelligent systems 智能系统interaction network(交互网络)interactive visualization(交互式可视化)kernel function(核函数)kernel operator (核算子)keyword search(关键字检索)knowledege reuse 知识再利用knowledgeknowledgeknowledge acquisitionknowledge base 知识库knowledge based system 知识系统knowledge building 知识建构knowledge capture 知识获取knowledge construction 知识建构knowledge discovery(知识发现)knowledge extraction 知识提取knowledge fusion 知识融合knowledge integrationknowledge management systems 知识管理系统knowledge management 知识管理knowledge management(知识管理)knowledge model 知识模型knowledge reasoningknowledge representationknowledge representation(知识表达) knowledge sharing 知识共享knowledge storageknowledge technology 知识技术knowledge verification 知识验证language model(语言模型)language modeling approach(语言模型方法) large graph(大图)large graph(大图)learning(无监督学习)life science 生命科学linear programming(线性规划)link analysis (链接分析)link prediction(链接预测)link prediction(链接预测)link prediction(链接预测)linked data(关联数据)location based service(基于位置的服务) loclation based services(基于位置的服务) logic programming 逻辑编程logical implication 逻辑蕴涵logistic regression(logistic 回归)machine learning 机器学习machine translation(机器翻译)management system(管理系统)management( 知识管理)manifold learning(流形学习)markov chains 马尔可夫链markov processes(马尔可夫过程)matching function 匹配函数matrix decomposition(矩阵分解)matrix decomposition(矩阵分解)maximum likelihood estimation(最大似然估计)medical research(医学研究)mixture of gaussians(混合高斯模型)mobile computing(移动计算)multi agnet systems 多智能体系统multiagent systems 多智能体系统multimedia 多媒体natural language processing 自然语言处理natural language processing(自然语言处理) nearest neighbor (近邻)network analysis( 网络分析)network analysis(网络分析)network analysis(网络分析)network 
formation(组网)network structure(网络结构)network theory(网络理论)network topology(网络拓扑)network visualization(网络可视化)neural network(神经网络)neural networks (神经网络)neural networks(神经网络)nonlinear dynamics(非线性动力学)nonmonotonic reasoning 非单调推理nonnegative matrix factorization (非负矩阵分解) nonnegative matrix factorization(非负矩阵分解) object detection(目标检测)object oriented 面向对象object recognition(目标识别)object recognition(目标识别)online community(网络社区)online social network(在线社交网络)online social networks(在线社交网络)ontology alignment 本体映射ontology development 本体开发ontology engineering 本体工程ontology evolution 本体演化ontology extraction 本体抽取ontology interoperablity 互用性本体ontology language 本体语言ontology mapping 本体映射ontology matching 本体匹配ontology versioning 本体版本ontology 本体论open government data 政府公开数据opinion analysis(舆情分析)opinion mining(意见挖掘)opinion mining(意见挖掘)outlier detection(孤立点检测)parallel processing(并行处理)patient care(病人医疗护理)pattern classification(模式分类)pattern matching(模式匹配)pattern mining(模式挖掘)pattern recognition 模式识别pattern recognition(模式识别)pattern recognition(模式识别)personal data(个人数据)prediction algorithms(预测算法)predictive model 预测模型predictive models(预测模型)privacy preservation(隐私保护)probabilistic logic(概率逻辑)probabilistic logic(概率逻辑)probabilistic model(概率模型)probabilistic model(概率模型)probability distribution(概率分布)probability distribution(概率分布)project management(项目管理)pruning technique(修剪技术)quality management 质量管理query expansion(查询扩展)query language 查询语言query language(查询语言)query processing(查询处理)query rewrite 查询重写question answering system 问答系统random forest(随机森林)random graph(随机图)random processes(随机过程)random walk(随机游走)range query(范围查询)RDF database 资源描述框架数据库RDF query 资源描述框架查询RDF repository 资源描述框架存储库RDF storge 资源描述框架存储real time(实时)recommender system(推荐系统)recommender system(推荐系统)recommender systems 推荐系统recommender systems(推荐系统)record linkage 记录链接recurrent neural network(递归神经网络) regression(回归)reinforcement learning 强化学习reinforcement learning(强化学习)relation extraction 关系抽取relational database 关系数据库relational learning 关系学习relevance 
feedback (相关反馈)resource description framework 资源描述框架restricted boltzmann machines(受限玻尔兹曼机) retrieval models(检索模型)rough set theroy 粗糙集理论rough set 粗糙集rule based system 基于规则系统rule based 基于规则rule induction (规则归纳)rule learning (规则学习)rule learning 规则学习schema mapping 模式映射schema matching 模式匹配scientific domain 科学域search problems(搜索问题)semantic (web) technology 语义技术semantic analysis 语义分析semantic annotation 语义标注semantic computing 语义计算semantic integration 语义集成semantic interpretation 语义解释semantic model 语义模型semantic network 语义网络semantic relatedness 语义相关性semantic relation learning 语义关系学习semantic search 语义检索semantic similarity 语义相似度semantic similarity(语义相似度)semantic web rule language 语义网规则语言semantic web 语义网semantic web(语义网)semantic workflow 语义工作流semi supervised learning(半监督学习)sensor data(传感器数据)sensor networks(传感器网络)sentiment analysis(情感分析)sentiment analysis(情感分析)sequential pattern(序列模式)service oriented architecture 面向服务的体系结构shortest path(最短路径)similar kernel function(相似核函数)similarity measure(相似性度量)similarity relationship (相似关系)similarity search(相似搜索)similarity(相似性)situation aware 情境感知social behavior(社交行为)social influence(社会影响)social interaction(社交互动)social interaction(社交互动)social learning(社会学习)social life networks(社交生活网络)social machine 社交机器social media(社交媒体)social media(社交媒体)social media(社交媒体)social network analysis 社会网络分析social network analysis(社交网络分析)social network(社交网络)social network(社交网络)social science(社会科学)social tagging system(社交标签系统)social tagging(社交标签)social web(社交网页)sparse coding(稀疏编码)sparse matrices(稀疏矩阵)sparse representation(稀疏表示)spatial database(空间数据库)spatial reasoning 空间推理statistical analysis(统计分析)statistical model 统计模型string matching(串匹配)structural risk minimization (结构风险最小化) structured data 结构化数据subgraph matching 子图匹配subspace clustering(子空间聚类)supervised learning( 有support vector machine 支持向量机support vector machines(支持向量机)system dynamics(系统动力学)tag recommendation(标签推荐)taxonmy induction 感应规范temporal logic 时态逻辑temporal reasoning 时序推理text analysis(文本分析)text anaylsis 
文本分析text classification (文本分类)text data(文本数据)text mining technique(文本挖掘技术)text mining 文本挖掘text mining(文本挖掘)text summarization(文本摘要)thesaurus alignment 同义对齐time frequency analysis(时频分析)time series analysis( 时time series data(时间序列数据)time series data(时间序列数据)time series(时间序列)topic model(主题模型)topic modeling(主题模型)transfer learning 迁移学习triple store 三元组存储uncertainty reasoning 不精确推理undirected graph(无向图)unified modeling language 统一建模语言unsupervisedupper bound(上界)user behavior(用户行为)user generated content(用户生成内容)utility mining(效用挖掘)visual analytics(可视化分析)visual content(视觉内容)visual representation(视觉表征)visualisation(可视化)visualization technique(可视化技术) visualization tool(可视化工具)web 2.0(网络2.0)web forum(web 论坛)web mining(网络挖掘)web of data 数据网web ontology lanuage 网络本体语言web pages(web 页面)web resource 网络资源web science 万维科学web search (网络检索)web usage mining(web 使用挖掘)wireless networks 无线网络world knowledge 世界知识world wide web 万维网world wide web(万维网)xml database 可扩展标志语言数据库附录 2 Data Mining 知识图谱（共包含二级节点15 个，三级节点93 个）间序列分析)监督学习)领域二级分类三级分类。

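A Study of Sparsity and Cold-Start Problems in Collaborative Filtering Systems

I. Overview

With the arrival of the big-data era, recommender systems have become an important part of Internet applications, playing an irreplaceable role especially in e-commerce, social media and online video.

Collaborative filtering (CF), one of the most common and classic recommendation techniques, uses users' historical behaviour data to predict their future interests and preferences, and thereby recommends the items they are most likely to be interested in.

In practice, however, collaborative filtering faces two core problems: sparsity and cold start.

This article studies the sparsity and cold-start problems of collaborative filtering systems in depth: it explores their root causes, analyses the strengths and weaknesses of existing solutions, and proposes new improvement strategies, with the aim of improving the accuracy and efficiency of recommender systems.

The article first introduces the basic principles and workflow of collaborative filtering and explains its importance in recommender systems.

It then analyses the sparsity problem: because a large share of the entries in the user-item matrix are 0 (the user has not rated or interacted with the item), the recommender system struggles to capture users' interests accurately.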
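To make the sparsity notion concrete, the share of empty cells in a user-item matrix can be computed as below. This is a minimal sketch; the dictionary layout and the name `sparsity` are assumptions made for illustration.

```python
def sparsity(ratings, n_users, n_items):
    """Fraction of user-item cells with no rating or interaction.

    `ratings` maps (user, item) pairs to a rating and holds only
    the observed cells.
    """
    total_cells = n_users * n_items
    if total_cells == 0:
        raise ValueError("matrix must have at least one cell")
    return 1.0 - len(ratings) / total_cells


# Hypothetical 3-user, 4-item matrix with 4 observed ratings out of 12 cells.
ratings = {(0, 0): 5, (0, 2): 3, (1, 1): 4, (2, 3): 1}
s = sparsity(ratings, n_users=3, n_items=4)
```

Storing only the observed cells, as here, is also how sparse rating data is usually kept in memory.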

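To address this, the article reviews several methods for mitigating sparsity, such as neighbourhood-based collaborative filtering and matrix factorisation, and evaluates their practical effect.

Next, the article examines the cold-start problem of collaborative filtering.

Cold start refers to new users or new items for which, lacking sufficient historical data, the recommender system cannot produce accurate recommendations.

The article analyses the causes of cold start, including data sparsity and uncertainty about user interests, and reviews existing solutions such as initialising recommendations from user registration information or social relationships.

It also points out the limitations of these methods and potential directions for improvement.

Building on this review, the article proposes new strategies for the sparsity and cold-start problems of collaborative filtering.

First, introducing more dimensions of user information and item features strengthens the representational power of the recommender system and thereby mitigates data sparsity.

Second, advanced techniques such as deep learning are used to model and mine the latent features of users and items, improving the accuracy and generalisation of recommendations.

The article also explores a cold-start solution based on social network analysis, which uses the social relationships between users to propagate trust and interest preferences and thereby provides initial recommendations for new users and new items.

In summary, the article aims to analyse the sparsity and cold-start problems of collaborative filtering systems comprehensively, review the strengths and weaknesses of existing solutions, and propose new improvement strategies.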

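A Survey of Collaborative Filtering Algorithms

I. Overview

With the rapid development of information technology, big data has become an indispensable part of modern society.

How to discover the information users are interested in within massive amounts of data has become an important problem for recommender systems.

Collaborative filtering, a classic recommendation technique, is efficient and accurate, and has been widely applied in e-commerce, social networks, music recommendation and many other fields.

This survey reviews the history, basic principles, classification and current applications of collaborative filtering algorithms, in order to give a deeper understanding of them and to provide a useful reference for future research.

It first reviews the history of collaborative filtering, from early user-based collaborative filtering to later item-based collaborative filtering and then model-based collaborative filtering; each stage has its own characteristics and advantages.

It then describes the basic principles in detail, including the key steps of similarity computation, neighbour selection and recommendation generation, together with the techniques commonly used in each step.

Next, according to how they are implemented, it divides collaborative filtering algorithms into two broad classes, memory-based and model-based, and discusses each in detail.

Regarding applications, it analyses how collaborative filtering is actually used in e-commerce, social networks, music recommendation and other fields, summarising its successes and the challenges it faces.

It also discusses future trends, including combination with other recommendation techniques, application in dynamic environments, and privacy protection.

Finally, it summarises the advantages and limitations of collaborative filtering algorithms and looks ahead to future research directions.

Through this survey, readers can gain a comprehensive and thorough understanding of collaborative filtering algorithms, which provides a useful reference for research and practice in related fields.

II. Basic principles of collaborative filtering

Collaborative filtering (CF) is a classic algorithm widely used in recommender systems. Its basic principle is to use users' historical behaviour data to predict their future interests and preferences, and to recommend matching items or services accordingly.

Collaborative filtering algorithms fall mainly into two categories: user-based collaborative filtering (UserCF) and item-based collaborative filtering (ItemCF).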
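The difference between UserCF and ItemCF comes down to which vectors are compared: users' rating rows or items' rating columns. The sketch below assumes cosine similarity over sparse dictionaries; the survey itself does not fix a measure, and the users, items and ratings are invented.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    common = set(a) & set(b)
    dot = sum(a[key] * b[key] for key in common)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def invert(user_ratings):
    """Turn user -> {item: rating} into item -> {user: rating}."""
    item_ratings = defaultdict(dict)
    for user, rated in user_ratings.items():
        for item, rating in rated.items():
            item_ratings[item][user] = rating
    return dict(item_ratings)


# Hypothetical ratings.
user_ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 4, "C": 5},
    "carol": {"C": 2, "D": 5},
}

# UserCF compares users by the items they rated.
user_sim = cosine(user_ratings["alice"], user_ratings["bob"])

# ItemCF compares items by the users who rated them.
item_ratings = invert(user_ratings)
item_sim = cosine(item_ratings["A"], item_ratings["B"])
```

In this toy data, items A and B are rated by the same users in the same proportion, so ItemCF treats them as interchangeable, while alice and carol share no items and have zero user similarity.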

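On the Impact of Deep-Learning-Based Recommendation Algorithms on Online Shopping

With the spread of the Internet and mobile devices, more and more people choose to shop online.

The convenience and speed of online shopping let people shop easily from home.

However, as the e-commerce market keeps growing, consumers sometimes find it increasingly difficult to pick the most suitable product from a vast range of goods.

In this situation, recommendation algorithms have become a widely adopted solution.

Today, deep-learning-based recommendation algorithms are used more and more widely in e-commerce and have a significant impact.

This article discusses the impact of deep-learning-based recommendation algorithms on online shopping.

I. Introduction to deep-learning-based recommendation algorithms

Deep learning is an artificial intelligence technique, modelled on the structure and workings of neural networks in the human brain, that automatically learns and recognises patterns in data.

Researchers in data science and artificial intelligence have long tried to use deep learning to solve real-world problems, including product recommendation.

By analysing user behaviour data, deep learning algorithms can predict each user's interests and preferences and recommend products they are likely to enjoy.

II. Benefits of deep-learning-based recommendation algorithms

1. Improved shopping efficiency. Deep-learning-based recommendation algorithms analyse users' previous behaviour and purchase history, such as search records, purchase records and favourites.

These data let the algorithms understand users' interests and needs more accurately.

Among the huge range of e-commerce products, users can quickly find what they want, which makes shopping more efficient.

2. Improved satisfaction. Deep-learning-based recommendation algorithms can not only find products that match users' needs but also recommend further products similar to those needs.

This lets users find products they like more quickly, making shopping more enjoyable and satisfying.

3. Increased sales. Deep-learning-based recommendation algorithms can encourage users to buy more products.

For example, if a user is browsing smart watches, the algorithm can show other related products, such as smart bands and smart glasses, which increases the likelihood that the user buys other smart devices.

III. Implementation methods for deep-learning-based recommendation algorithms

Deep-learning-based recommendation algorithms are mainly divided into three methods: user-based collaborative filtering, item-based collaborative filtering, and deep learning.

1. User-based collaborative filtering. This method is based on collaborative filtering: using the user's historical behaviour and an assessment of other users' behaviour, it recommends products that other users are also interested in.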

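A Complete Glossary of Computer Terms

1. Computer: an electronic device that accepts data and processes it according to a program.

2. Hardware: the physical devices in a computer system, including the central processing unit, memory, hard disk, monitor and so on.

3. Software: the collection of programs and data in a computer system, including operating systems and application software.

4. Operating system: software that manages the resources of a computer system and lets users interact with it.

5. Network: a system that connects multiple computers so that they can communicate with each other and share resources.

6. Network protocol: a set of rules and conventions for transmitting data in a computer network.

7. Internet: the worldwide computer network, which communicates using the TCP/IP protocol family.

8. Web page: a document displayed on the Internet that can contain text, images, hyperlinks and other content.

9. Website: a collection of information resources, made up of a set of web pages, that can be accessed on the Internet.

10. Database: an organised, stored collection of data that makes large amounts of data easy to access and manage.

11. Algorithm: a sequence of steps or instructions for solving a problem or performing a task.

12. Programming: the process of writing computer programs in a particular language.

13. Program: computer software, made up of a set of instructions, that implements a specific function.

14. Source code: human-readable program code written in a particular programming language.

15. Binary: the number system made up of 0 and 1 that computers use internally.

16. Compiler: a software tool that translates a high-level programming language into machine code.

17. Virtual reality (VR): a computer-generated simulated environment that gives users the feeling of being physically present in it.

18. Augmented reality (AR): combining computer-generated information with real-world scenes to enhance the user's perception.

19. Artificial intelligence (AI): giving computers abilities similar to human intelligence, such as learning, reasoning and understanding.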

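Machine Learning (04): Common Technical Terms

Before starting to study machine learning, it is worth skimming the common technical terms once or twice to pick up some basic knowledge and vocabulary; this basic understanding will help a great deal in later study.

A

A/B testing: A statistical way of comparing two or more techniques, typically the technique currently in use against a new one.

A/B testing aims not only to determine which technique performs better but also to understand whether the difference is statistically significant.

A/B testing usually compares two techniques using one measurement, but it can be applied to any finite number of techniques and measurements.

Accuracy: The fraction of predictions that a classification model got right.

In multi-class classification, accuracy is defined as $\text{Accuracy} = \frac{\text{correct predictions}}{\text{total number of examples}}$. In binary classification, accuracy is defined as $\text{Accuracy} = \frac{TP + TN}{\text{total number of examples}}$. See true positive (TP) and true negative (TN).

Activation function: A function (for example, ReLU or sigmoid) that takes the weighted sum of all the inputs from the previous layer, generates an output value (typically nonlinear) and passes it to the next layer.

AdaGrad: A sophisticated gradient descent algorithm that rescales the gradient of each parameter, effectively giving each parameter an independent learning rate.

For a full explanation, see the paper that introduced AdaGrad.

AUC (Area under the ROC Curve): An evaluation metric that considers all possible classification thresholds.

The area under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.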
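The probabilistic reading of AUC given above can be checked directly by comparing every positive example with every negative one. This is a sketch with invented labels and scores; counting ties as half a win is a common convention assumed here, not something the glossary states.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Runs in O(P * N) time, which is fine for small illustrative inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier outputs: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
value = auc(labels, scores)
```

Because only the ordering of scores matters, AUC is unaffected by any particular classification threshold, which matches the definition in the glossary entry.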

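B

Backpropagation: The primary algorithm for performing gradient descent on neural networks.

The algorithm first computes (and caches) the output value of each node in a forward pass, and then computes the partial derivative of the loss with respect to each parameter in a backward pass through the graph.

Baseline: A simple model or heuristic used as a reference point for comparing how well a model performs.

A baseline helps model developers quantify the minimal expected performance on a particular problem.

Batch: The set of examples used in one iteration (that is, one gradient update) of model training.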

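Explanations of Computer Terminology

In recent years, the rapid development of computer technology has swept us, like a powerful gale, into an era saturated with information and technology.

With it has come a steady stream of new computer terms, which non-specialists often find bewildering.

In this article we explain some common computer terms to help readers better understand the key concepts of computer technology and their applications in modern society.

I. Artificial Intelligence (AI)

Artificial intelligence refers to the ability of computer systems to learn, reason and make decisions autonomously by simulating certain behaviours and functions of human intelligence.

The core of artificial intelligence technology is machine learning and deep learning, which rely on large datasets and complex algorithms to handle complex tasks and problems.

II. Big Data

Big data refers to collections of data that are huge in scale and diverse in origin.

In computing, big data involves storing, processing and analysing data, and requires specific techniques and tools to extract the information and insights the data contains.

The analysis of big data, often referred to as data mining, helps businesses and organisations make better-informed decisions.

III. Cloud Computing

Cloud computing refers to sharing computing resources and services over the Internet.

It lets users access data and applications stored in the cloud from anywhere, at any time and on any device.

Cloud computing provides greater flexibility and scalability, lowers costs and improves availability.

IV. Internet of Things (IoT)

The Internet of Things refers to connecting physical devices over the Internet so that they can communicate and cooperate with each other.

Its applications include smart homes, smart cities and intelligent transport.

Through the Internet of Things, sensors and devices can collect, transmit and analyse data in real time, providing smarter and more efficient services and solutions.

V. Virtual Reality (VR)

Virtual reality is a computer technology that simulates or recreates a real environment so that users can immerse themselves in it and interact with the virtual environment.

Virtual reality usually involves special head-mounted displays and hand-held controllers, and uses sensors and tracking technology to capture the user's body movements and provide feedback on their actions.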

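A Literature Review of Research on Collaborative Filtering Recommendation Algorithms

Wu Jiawei

Abstract: Collaborative filtering recommendation algorithms recommend content of interest to users from vast data resources and are widely used in recommender systems.

However, as the numbers of users and items keep growing, traditional collaborative filtering algorithms expose problems such as data sparsity and cold start, which greatly reduce the accuracy of user-similarity and item-similarity computation.

This article introduces the basic concepts of collaborative filtering, points out its limitations, and describes the series of optimisations researchers have made on this basis.

Keywords: collaborative filtering; recommender system; user similarity; item similarity

I. Introduction

With the rapid development of the Internet and the arrival of the big-data era, data resources are growing geometrically. Personalised recommendation technology [1] addresses the data needs of a huge user base and is widely applied in systems such as digital libraries [2], e-commerce [3] and news websites [4].

Collaborative filtering [5] is the most commonly used technique in recommender systems. Its fundamental idea is to recommend to the target user items they may be interested in, based on similar groups of users or items.

User-based collaborative filtering [6] and item-based collaborative filtering [7,8] are the two main components of traditional collaborative filtering.

In user-based collaborative filtering, the algorithm predicts whether the target user will be interested in an item from the ratings given to that item by users similar to the target user. However, because some users have limited information associated with them, their ratings of items are incomplete; the user-item rating matrix is therefore highly sparse and cannot fully reflect the relationships between users, which makes it harder to select the group of similar users and lowers the efficiency of the recommender system.

Item-based collaborative filtering predicts the target user's rating of an unrated item from the ratings of items similar to it. However, when users have rated few items, the items' own attributes tend to be ignored, which lowers recommendation efficiency.

II. Collaborative filtering recommendation algorithms

(1) Core content

1. Computing similarity. To compute the similarity between users or between items, collaborative filtering recommendation algorithms mainly use the Pearson correlation coefficient (PCC) [9], whose values lie in the range [-1, 1].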
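As an illustration of the similarity step, a Pearson correlation between two users can be computed over the items both have rated. This is a sketch under stated assumptions: means are taken over the co-rated items only (some variants use each user's overall mean), undefined cases return 0.0, and the users and ratings are invented.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over co-rated items; the result lies in [-1, 1].

    Returns 0.0 when fewer than two items are shared or when either
    user's co-rated ratings have zero variance (correlation undefined).
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    xs = [ratings_a[item] for item in common]
    ys = [ratings_b[item] for item in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical users: v rates in step with u, w rates in the opposite direction.
u = {"i1": 1, "i2": 2, "i3": 3, "i4": 5}
v = {"i1": 2, "i2": 4, "i3": 6}
w = {"i1": 3, "i2": 2, "i3": 1}
same_direction = pearson(u, v)
opposite_direction = pearson(u, w)
```

Because the correlation is computed only on shared items, two users with very few items in common can still reach an extreme value, which is one face of the sparsity problem described above.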


Improving collaborative filtering-based recommender systems results using Pareto dominance

Fernando Ortega *, José-Luis Sánchez, Jesús Bobadilla, Abraham Gutiérrez
Universidad Politecnica de Madrid, Crta. De Valencia, Km. 7, 28031 Madrid, Spain

Article info
Article history: Received 15 June 2011; Received in revised form 25 February 2013; Accepted 4 March 2013; Available online 20 March 2013
Keywords: Collaborative filtering; Recommender system; Pareto dominance; Similarity; Neighbour

Abstract
Recommender systems are a type of solution to the information overload problem suffered by users of websites that allow the rating of certain items. The collaborative filtering recommender system is considered to be the most successful approach, as it makes its recommendations based on ratings provided by users who are similar to the active user. Nevertheless, the traditional collaborative filtering method can select insufficiently representative users as neighbours of the active user. This means that recommendations made a posteriori are not sufficiently precise. The method proposed in this paper uses Pareto dominance to perform a pre-filtering process eliminating less representative users from the k-neighbour selection process while retaining the most promising ones. The results from experiments performed on the Movielens and Netflix websites show significant improvements in all tested quality measures when the proposed method is applied.
© 2013 Elsevier Inc. All rights reserved.

1. Introduction

Recommender Systems (RS) are responsible for providing users with a series of personalised suggestions (recommendations) for certain types of items. RS extract the user's relevant characteristics and determine the subset of items that may be of interest to them.

The way RS work can differ to a great extent depending on the types of items to be recommended and the available information about the user's preferences. Therefore, RS must be able to use different mechanisms to obtain the most
promising items for each user. RS are commonly called "filters" because they act as such. Currently, the following types of RS are found in practice [1,12]:

Content-based filtering [3,26]: Recommendations are based on preferences users have expressed in the past about the items (e.g., books purchased).

Demographic filtering [22]: Recommendations are based on positive ratings from other users who share the same age, geographical location, gender, profession, etc. as the active user.

Collaborative filtering (CF) [1,8–10,17,20,23]: Relevant information is stored in a database that contains the ratings of a large number of users on a large number of items (films, books, jokes, study material, holiday destinations, etc.). CF aims to determine which users are similar to each other and then to recommend to the active user those items preferred by users similar to him or her.

Hybrid filtering methods [2,4,11,12,14]: In some cases, content-based filtering and demographic filtering complement CF. By combining these methods with CF, recommendations can be made based on a more extensive set of information.

0020-0255/$ - see front matter © 2013 Elsevier Inc. All rights reserved. doi:10.1016/j.ins.2013.03.011
* Corresponding author. Tel.: +34 913365104. E-mail addresses: fernando.ortega@upm.es, fortegarequena@ (F. Ortega).
F. Ortega et al. / Information Sciences 239 (2013) 50–61

Generally speaking, CF obtains better results than the other filtering systems described [12,16]; thus, it is the most widely used filtering method. However, CF suffers from two main interrelated problems:

(1) It requires a large enough database to guarantee the correct process to search for similar users and to make the recommendations [17].
(2) The rating matrix is typically excessively sparse [24], which makes it difficult to calculate the similarity between pairs of users and reduces the accuracy of the computed recommendations.

In this paper, we present a method that improves CF prediction and recommendation quality measures. The rest of the
paper is structured in the following way: In Section 2, we describe the motivation for and the foundations of the proposed method. In Section 3, we formalise the process by which recommendations are obtained. In Section 4, we define experiments designed to validate the correct functioning of the proposed method. In Section 5, we compare results obtained by applying the new method to results obtained by applying the traditional method. In Section 6, we explain our conclusions and propose future work.

2. Designing the new method

2.1. Introduction

The main objective of CF is to provide a user, whom we will call the active user, with a series of suggestions about items that may be of interest to him or her. Its operation is based on a very simple idea [1]: if two users have similar preferences, it is highly probable that one of them will like items they are not aware of if the other user likes them. Therefore, the purpose of CF is to search for those users who are most similar to the active user and then analyse their ratings to determine which items may interest him or her. This process can be summarised in the following three steps [1,9,16]:

1. Find the k users most similar to the active user (the k-neighbours of the active user). This phase has the most significant impact on the quality of the recommendations. The method proposed in this paper provides a novel approach for obtaining a suitable set of neighbours to the active user.
2. Predict the rating that the active user would give to items they have not yet rated, by observing the ratings of their k-neighbours. When trying to predict an item's value, there will normally be a significant number of neighbours who have not rated the item; therefore, mechanisms must be defined that enable the k-neighbours' ratings to be combined satisfactorily.
3. Find the most suitable N items to be recommended (due to their high rating, novelty, etc.).

2.2. Motivation

One of the main problems faced by CF is the high degree of sparsity [24] in the rating databases
, which arises from the small percentage of available items for which a given user generally provides ratings. Thus, when we want to calculate the similarity between each pair of users, we must do so by only considering the items that both users have rated in common. Traditional similarity metrics [1,12], including Pearson Correlation, Cosine, and Mean Squared Difference (MSD), while applicable in many statistical applications, are not suitable in the field of RS, where not only are the data very sparse, but there is also a very small set of permitted values for the ratings.

Traditional metrics display a marked tendency to show high similarity between users based on the similarity of their ratings on a very small set of items. These metrics can assign maximum similarity to two users who have each rated hundreds of items but who have only rated three items in common.

Using the k-nearest neighbours (KNN) algorithm, it is common to find active users with a significant number of inadequate neighbours (neighbours who have little information in common with the active user). Our hypothesis is that it is possible to improve the quality of the recommendations of a CF RS if we use the Pareto dominance concept, which eliminates the less representative users from the k-neighbours selection process and keeps the most promising ones.

Fig. 1 shows the characteristics of the k-neighbours (k = 150) in the Movielens database and the positive impact of discarding neighbours with a small number of items in common. In this experiment, the following similarity metrics have been used: Pearson Correlation (COR), Cosine (COS), and Mean Squared Difference (MSD). Fig. 1A shows the percentage of items rated by the active user that was included in the calculation of the similarity metric (with each of the neighbours) using traditional CF. Clearly, traditional CF generally identifies the active user's neighbours using a very low percentage of his or her ratings. Fig. 1B displays a decreasing trend in the
Mean Absolute Error (MAE) (that is, the improvement) when the neighbours are identified using a higher percentage of items in common with the active user. Fig. 1C displays an increasing trend in coverage when neighbours are identified using a higher percentage of items in common with the active user (the low coverage values arise from consideration of only a single neighbour). These experiments were repeated with the Netflix database, and very similar results were obtained.

The proposed method attempts to solve this problem by using a novel approach, based on Pareto dominance, to exclude unpromising users in the k-neighbours selection phase.

2.3. The concept of dominance

Some authors have attempted to solve the problem stated in Section 2.2 by including a term in the similarity metric that discriminates against similarities calculated with too few items. In [9], the Jaccard Index is combined with the Mean Squared Difference to ensure that similar users have more items in common. In [10], the probability that the most important users (those with more knowledge about the items) are chosen as neighbours increases. In [16], the similarity metric is multiplied by a weighting factor corresponding to the percentage of common items that both the active user and the other user have rated.

The solution provided here is to use the Pareto dominance concept used in multiobjective optimisation problems [30] to identify those users who correctly represent the user and who therefore must be considered to be candidate neighbours.

In a multiobjective optimisation problem, it is said that a solution $x'$ is Pareto-optimal, efficient or non-dominated (under the minimisation hypothesis) if no other feasible solution exists (we will call the set of all feasible solutions $X$), which takes a lower value in some objective without causing a simultaneous increase in at least one other one.

Formally, we are faced with a problem of optimisation in which we must find the vector $x = (x_1, \ldots, x_n)^T \in X$ that
optimises the multiobjective function $f(x) = (f_1(x), \ldots, f_m(x))^T$.

We say that a solution $x' \in X$ is a non-dominated solution if there is no other solution $x \in X$ such that $f_i(x) \le f_i(x')$ for all $i = 1, \ldots, m$ and there is at least one index $i$ such that $f_i(x) < f_i(x')$.

The concept of dominance has been widely used in the design of algorithms to solve multiobjective optimisation problems. Among the most representative examples we find the algorithms NPGA [19], NSGA I [27] and II [13], SPEA2 [30], PAES [21], MOSA [29] and MOTS [15], which use dominance to make their solutions converge towards the Pareto-optimal.

In the field of RS, Pareto dominance has been used [28] to determine which items may be of interest to the user in a conversational recommender system. Conversational recommender systems are instances of content-based filtering and have been widely used in consumer-oriented e-commerce applications.

In this paper we present a completely different approach that integrates Pareto dominance into the CF RS. Fig. 2 shows a graphical representation of how the proposed method extends and complements traditional CF. Clearly, the proposed method does not modify the traditional method. Instead, it identifies a suitable set of initial users, ensuring that the traditional method obtains the most representative neighbours of the current user. The proposed method performs pre-processing which discards those users that do not provide information in addition to that provided by the finally selected set of users. An important advantage of the proposed method is that it is compatible with any other implementation or improvement of the traditional method, such as similarity measure improvement, and aggregation approach improvement.

Fig. 1. Characteristics of the 150-neighbours in Movielens for different similarity metrics: Pearson Correlation (COR), Cosine (COS) and Mean Squared Difference (MSD). (A) Percentage of items rated by the active user that were used in the
similarity metric calculation using traditional CF. (B) Trend in MAE when the percentage of common items between the active user and his or her neighbours is increased. (C) Trend in coverage when the percentage of common items between the active user and his or her neighbours is increased.

3. Formalisation of the new method

3.1. Introduction

We define an RS based on CF having $l$ users who choose ratings for $m$ items from the interval $\{min, \ldots, max\}$, where the lack of a rating is represented by $\bullet$. The following sets are defined:

$$U = \{u \in \mathbb{N} \mid 1 \le u \le l\}, \quad \text{set of users} \tag{1}$$

$$I = \{i \in \mathbb{N} \mid 1 \le i \le m\}, \quad \text{set of items} \tag{2}$$

$$V = \{v \in \mathbb{N} \mid min \le v \le max\} \cup \{\bullet\}, \quad \text{set of possible ratings} \tag{3}$$

$$R_u = \{(i, v) \mid i \in I, v \in V\}, \quad \text{ratings of user } u \tag{4}$$

We define the rating $v$ of the user $u$ on item $i$ as

$$r_{u,i} = v \tag{5}$$

We define the ratings average of the user $u$ as

$$\bar{r}_u \tag{6}$$

We define the cardinality of a set $C$ as its number of valid elements:

$$\#C = \#\{x \in C \mid x \ne \bullet\} \tag{7}$$

$$\#R_u = \#\{i \in I \mid r_{u,i} \ne \bullet\} \tag{8}$$

3.2. Selecting the candidate neighbours (non-dominated users)

We determine the set of users who are candidate neighbours of the active user or, more formally, the set of non-dominated users with respect to the active user.

Let $I_u$ be the set of items rated by the user $u$:

$$I_u = \{i \in I \mid r_{u,i} \ne \bullet\} \tag{9}$$

Let $d(r_{x,i}, r_{y,i})$ be the absolute difference between the ratings given by user $x$ and user $y$ to the item $i$:

$$d(r_{x,i}, r_{y,i}) = \begin{cases} |r_{x,i} - r_{y,i}| & r_{y,i} \ne \bullet \\ \infty & r_{y,i} = \bullet \end{cases} \tag{10}$$

We say that user $x$ dominates user $y$ with respect to another user $u$ (denoted as $x >_u y$) if the following expression (11) is satisfied:

$$x >_u y \iff \forall i \in I_u : d(r_{u,i}, r_{x,i}) \le d(r_{u,i}, r_{y,i}) \;\land\; \exists j \in I_u \mid d(r_{u,j}, r_{x,j}) < d(r_{u,j}, r_{y,j}) \tag{11}$$

Conceptually, dominated users do not show greater similarity to the active user than the users that dominate them, but they do show lower similarity. In other words, dominated users do not provide any improvement compared to those that dominate them and can therefore be discarded.

Fig. 2. Proposed method location and operation with respect to the traditional method.
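The dominance test (10)–(11) and the resulting filter for non-dominated users can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the dictionary layout (user, then item, then rating, with unrated items simply absent), the function names `distance`, `dominates` and `candidate_neighbours`, and the toy ratings are all hypothetical. The filter compares every pair of users directly, with no attempt at efficiency.

```python
import math


def distance(r_active, r_other):
    """Expression (10): absolute difference, or infinity if the other user has no rating."""
    return math.inf if r_other is None else abs(r_active - r_other)


def dominates(ratings, u, x, y):
    """Expression (11): True if x dominates y with respect to the active user u."""
    items = list(ratings[u])  # I_u, the items rated by u
    d_x = [distance(ratings[u][i], ratings[x].get(i)) for i in items]
    d_y = [distance(ratings[u][i], ratings[y].get(i)) for i in items]
    never_worse = all(a <= b for a, b in zip(d_x, d_y))
    better_once = any(a < b for a, b in zip(d_x, d_y))
    return never_worse and better_once


def candidate_neighbours(ratings, u):
    """Users other than u that no other user dominates with respect to u."""
    others = [v for v in ratings if v != u]
    return {
        y for y in others
        if not any(dominates(ratings, u, x, y) for x in others if x != y)
    }


# Hypothetical toy ratings (not Table 2 of the paper); absent keys mean "not rated".
toy = {
    "u": {"a": 5, "b": 3, "c": 1},
    "x": {"a": 5, "b": 2, "c": 1},  # differences to u: 0, 1, 0
    "y": {"a": 4, "b": 1},          # differences to u: 1, 2, infinity
    "z": {"a": 1, "b": 3, "c": 4},  # differences to u: 4, 0, 3
}

print(sorted(candidate_neighbours(toy, "u")))
```

In this toy data, x is at least as close to u as y on every item and strictly closer on at least one, so y is discarded. Neither x nor z dominates the other, so both remain candidates.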
We define $C_u$ as the set of users who are candidate neighbours to the user $u$ (non-dominated users). Let $D_u$ be the set of users who are dominated by at least one user with respect to user $u$ (12). The following expression must be satisfied:

$$C_u \subset U, \quad u \notin C_u, \quad C_u = U - (D_u \cup \{u\}), \quad \forall y \in D_u, \exists x \in C_u \mid x >_u y \tag{13}$$

3.3. Finding the k-neighbours

To find the active user's k-neighbours we must complete the following steps:

1. Calculate the similarity of the active user to each of the users who are candidate neighbours ($C_u$).
2. Find the k users with the highest similarities to the active user.

The advantage of the proposed method arises from the fact that the process is applied to the subset $C_u \subset U$, to which the most suitable candidate neighbours belong, and not to the whole set of users of the RS, $U$, as in the traditional method.

To perform the first step we will use some of the traditional similarity measures [1,12]. For this paper, we have selected Pearson Correlation (15), Cosine (16) and Mean Squared Difference (17).

Let $A_{x,y}$ be the set of items rated by both users $x$ and $y$:

$$A_{x,y} = \{i \in I \mid r_{x,i} \ne \bullet \land r_{y,i} \ne \bullet\} \tag{14}$$

Let $x$ be the active user:

$$correlation(x, y) = \frac{\sum_{i \in A_{x,y}} (r_{x,i} - \bar{r}_x) \cdot (r_{y,i} - \bar{r}_y)}{\sqrt{\sum_{i \in A_{x,y}} (r_{x,i} - \bar{r}_x)^2 \cdot \sum_{i \in A_{x,y}} (r_{y,i} - \bar{r}_y)^2}} \iff y \in C_x \tag{15}$$

$$cosine(x, y) = \frac{\sum_{i \in A_{x,y}} r_{x,i} \cdot r_{y,i}}{\sqrt{\sum_{i \in A_{x,y}} r_{x,i}^2} \cdot \sqrt{\sum_{i \in A_{x,y}} r_{y,i}^2}} \iff y \in C_x \tag{16}$$

$$msd(x, y) = 1 - \frac{1}{\#A_{x,y}} \sum_{i \in A_{x,y}} \left( \frac{r_{x,i} - r_{y,i}}{max - min} \right)^2 \iff y \in C_x \tag{17}$$

$$sim(x, y) = \bullet \iff y \notin C_x \tag{18}$$

As we can see, the neighbours are chosen only from the set of non-dominated users.

For the second step, we define $K_u$ as the set of k-neighbours of the active user. The following expressions must be true:

$$K_u \subseteq C_u \;\land\; \#K_u \le k \;\land\; u \notin K_u \tag{19}$$

$$\forall x \in K_u, \forall y \in (C_u - K_u) : sim(u, x) \ge sim(u, y) \tag{20}$$

Choosing the optimal value of k is not a straightforward task. The value of this parameter depends on several factors, including the size, sparsity and nature of the database, the singularity of the active user, and the type of
items to be recommended. Brute force is required. For example, Schafer et al. [25] state "...calculating a user's perfect neighbourhood is expensive – requiring comparison against all other users." In this paper we have tested several values of the parameter k to approach an optimum for each experiment.

The candidate neighbours ($C_u$) should cover all of the active user's preferences (ratings). Therefore the number of neighbours required by each active user has a direct relationship to both the number of items rated by the user and the singularity of the user's ratings. In the first case, if the active user has rated more items, more candidates will be needed to cover all the user's preferences. In the second case, if the ratings made are very unusual, it will be very difficult to find users that represent them (the required number of candidate neighbours will be higher), whereas if the ratings are very common, fewer users will be needed to represent them (the number of candidate neighbours will be lower).

Fig. 3 shows the relationship between the number of candidate neighbours determined for active users and the number of items they have rated in the Movielens database. The quotient between the number of candidates and the number of ratings defines an active user's singularity. If the quotient is small, few users will be needed to cover all of the active user's preferences, and, therefore, the active user is not very singular. On the other hand, if the quotient is large, it means many users were needed to cover all of the active user's preferences; therefore, the active user is very singular. The Gaussian distribution displayed in Fig. 3 indicates the balance provided by the proposed singularity measure, which is derived from the "set of candidates ($C_u$)" concept given in this paper.

F. Ortega et al. / Information Sciences 239 (2013) 50–61

3.4. Item recommendation

The process by which recommendations are made to the active user can be divided into two steps:

1. Determine the rating
predictions for the active user based on ratings made by the set of k-neighbours.
2. Find the N items with the highest predictions and recommend these items to the active user.

To complete the first step (prediction), we must define the way in which the ratings of the k-neighbours are combined. Traditionally, we have the following aggregation approaches [1,9,12]: mean (22), weighted mean (23) and deviation from the mean (24).

Let $G_{u,i}$ be the set of neighbours who have rated item $i$:

$$G_{u,i} = \{n \in K_u \mid r_{n,i} \ne \bullet\} \tag{21}$$

Let $p_{u,i}$ be the prediction of item $i$ to user $u$:

$$p_{u,i} = \frac{1}{\#G_{u,i}} \sum_{n \in G_{u,i}} r_{n,i} \iff G_{u,i} \ne \emptyset \tag{22}$$

$$p_{u,i} = \frac{\sum_{n \in G_{u,i}} sim(u, n) \cdot r_{n,i}}{\sum_{n \in G_{u,i}} sim(u, n)} \iff G_{u,i} \ne \emptyset \tag{23}$$

$$p_{u,i} = \bar{r}_u + \frac{\sum_{n \in G_{u,i}} sim(u, n) \cdot (r_{n,i} - \bar{r}_n)}{\sum_{n \in G_{u,i}} sim(u, n)} \iff G_{u,i} \ne \emptyset \tag{24}$$

$$p_{u,i} = \bullet \iff G_{u,i} = \emptyset \tag{25}$$

To develop the second step (recommendation), we define $X_u$ as the set of items likely to be recommended to user $u$, and $Z_u$ as the set of (at most) N items to be recommended. The following expressions must be satisfied:

$$X_u \subset I \;\land\; \forall i \in X_u, \; r_{u,i} = \bullet, \; p_{u,i} \ne \bullet \tag{26}$$

$$Z_u \subseteq X_u, \quad \#Z_u \le N, \quad \forall x \in Z_u, \forall y \in (X_u - Z_u) : p_{u,x} \ge p_{u,y} \tag{27}$$

3.5. Sample computation

To better understand how the proposed method performs, we have developed a sample computation. Table 1 shows the parameters used in this sample computation. These parameters have been selected with the purpose of simulating a small, easily understandable CF RS. We define a traditional rating range (one to five stars) and a reduced number of users, items, neighbours and recommendations. The users' ratings of the items are displayed in Table 2.

Our objective is to determine the recommendations to make for User 1 (U1). First we find the set of candidate neighbours, i.e., those users who are non-dominated with respect to User 1. These relationships are displayed in Table 3, in which the first two columns ($d^i_{x,y}$ and U1 rated items) contain the differences between User 1's ratings and those of the rest of the users. The third column (Dominated by) contains the users who dominate other users with respect to User 1. The last column contains User 1's candidate neighbours ($C_{U1}$).

Fig. 4 displays this process graphically. The horizontal axis shows the items rated by User 1, and the vertical axis shows the absolute difference between User 1's ratings and those of the rest of the users. Gray lines represent dominated users and black lines represent non-dominated users. For example, we can see that User 2 dominates User 4 because the line representing User 2 is always below or at the same height as the line representing User 4. User 3 dominates User 5 for the same reason.

Table 4 shows the set of k-neighbours selected from the set of candidates. We use MSD (17) to measure the similarity between users. Those users who do not belong to the set of candidates of U1 have no similarity to U1 (18).

Table 5 indicates the predictions made for items not rated by U1, using the arithmetic mean (22) as the aggregation approach. $X_u$ refers to the set of items likely to be recommended. $Z_u$ refers to the recommendations actually made.

Fig. 3. Relationship between the number of candidate neighbours and the number of ratings made by active users.

Table 1
Parameters used in the sample computation.

Parameter                        Value
Number of users (l)              6
Number of items (m)              10
Number of neighbours (k)         2
Number of recommendations (N)    2
Maximum vote (max)               5
Minimum vote (min)               1

Table 2
Users' ratings of items in the sample computation.

r_{u,i}   I1 I2 I3 I4 I5 I6 I7 I8 I9 I10
U1        35 24 15
U2        434 412
U3        51542 3 4
U4        432 1 5
U5        345 2 32
U6        52 53154

Table 3
Candidate neighbours of U1 in the sample computation.

d^i_{x,y}    U1 rated items    Dominated by    Candidates (C_U1)

Fig. 4. Difference in ratings between User 1 and the other users of the system, for calculating the dominating and dominated users.

Table 4
Similarities between U1 and the set of candidate neighbours (C_U1) and set of two nearest neighbours obtained.

        Mean squared differences (17)              Neighbours (K_U1)
        U2       U3       U4     U5     U6
U1      0.937    0.859    –      –      0.55       {U2, U3}

Table 5
Recommendations made to U1 in the sample computation.

            I1     I4     I7     I10    X_U1            Z_U1
p_{U1,i}    4.5    4.0    2.0    –      {I1, I4, I7}    {I1, I4}

4. Experimental design

4.1. Quality measures

To validate the behaviour of the proposed method, we use the following prediction and recommendation quality measures: mean absolute error, coverage, precision and recall [5,16–18].

We use the Mean Absolute Error (MAE) to determine the mean error that occurs in the aggregate of all the predictions made. We define a user's MAE as follows. Let $B_u$ be the set of items rated by user $u$ where a prediction can be obtained:

$$B_u = \{i \in I \mid r_{u,i} \ne \bullet \land p_{u,i} \ne \bullet\} \tag{28}$$

$$mae_u = \frac{1}{\#B_u} \sum_{i \in B_u} |r_{u,i} - p_{u,i}| \tag{29}$$

We define the system's MAE as follows:

$$mae = \frac{1}{\#U} \sum_{u \in U} mae_u \tag{30}$$

We use the coverage to measure the capacity of a user's k-neighbours to recommend new items. We define the user's coverage as follows:

$$coverage_u = 100 \times \frac{\#\{i \in I \mid r_{u,i} = \bullet \land p_{u,i} \ne \bullet\}}{\#\{i \in I \mid r_{u,i} = \bullet\}} \tag{31}$$

and the system's coverage as follows:

$$coverage = \frac{1}{\#U} \sum_{u \in U} coverage_u \tag{32}$$

We use the precision and the recall to measure the quality of the recommendations made. The precision indicates the percentage of relevant recommended items with respect to the total number of items recommended. The recall indicates the percentage of relevant recommended items with respect to the total number of relevant items. An item is considered relevant or not relevant according to the parameter $h$. To calculate the precision and the recall, we re-define (26) and (27) as follows:

$$X'_u \subset I \;\land\; \forall i \in X'_u, \; r_{u,i} \ne \bullet, \; p_{u,i} \ne \bullet \tag{33}$$

$$Z'_u \subseteq X'_u, \quad \#Z'_u \le N, \quad \forall x \in Z'_u, \forall y \in (X'_u - Z'_u) : p_{u,x} \ge p_{u,y} \tag{34}$$

We define a user's precision as follows:

$$precision_u = \frac{\#\{i \in Z'_u \mid r_{u,i} \ge h\}}{\#Z'_u} \tag{35}$$

and the system's precision as follows:

$$precision = \frac{1}{\#U} \sum_{u \in U} precision_u \tag{36}$$

We define a user's recall as follows:

$$recall_u = \frac{\#\{i \in Z'_u \mid r_{u,i} \ge h\}}{\#\{i \in I \mid r_{u,i} \ge h\}} \tag{37}$$

and the system's recall as follows:

$$recall = \frac{1}{\#U} \sum_{u \in U} recall_u \tag{38}$$

4.2. Experiments performed

The standard RS baseline is k-nearest neighbours CF, using Pearson Correlation as the similarity measure and deviation from the mean as the aggregation approach [5]. As the proposed
method is a pre-processing step applied prior to classical CF, in our experiments we determine whether this step improves the CF RS results. If our hypothesis is correct, a significant improvement in all quality measures should be observed.

In our experiments we use Pearson Correlation (15), Cosine (16) and MSD (17) as similarity measures. As an aggregation approach we use deviation from the mean (24). One test is performed using the traditional method and one using the proposed method. We then determine whether any improvement is observed in the above-mentioned quality measures: MAE (30), coverage (32), precision (36) and recall (38).

The experiments are performed on the Movielens and Netflix databases, the main parameters of which can be seen in Table 6. All the experiments are performed using cross-validation. In the case of Movielens we use 20% of test users to perform the experiments, whereas in Netflix we use 5% of test users. In both cases, the percentage of test items is 20%. Table 7 shows the main parameters used in the experiments, the databases tested, and the figures where the results are presented.

5. Results

In this section, we present the results obtained from the experiments using the Movielens and Netflix databases (Table 6).

Table 6
Main parameters of the databases used in the experiments.

                       Movielens    Netflix
Number of users        4,382        480,189
Number of items        3,952        17,770
Number of rates        1,000,209    100,480,507
Min and max values     1–5          1–5

... the items rated by the active user that were used in the similarity metric calculation using the proposed CF.

Table 7
Experiment parameters.

             Prediction                Recommendation            Cross-validation         Figures
             k              Step k     N             k      h    Users (%)    Items (%)
Movielens    {50,...,800}   50         {1,...,20}    100    5    20           20           Fig. 6
Netflix      {50,...,800}   50         {1,...,20}    200    5    5            20           Fig. 7

Fig. 6. Improvements produced in quality measures results using the proposed method instead of the traditional method. Database: Movielens. (A) MAE. (B) Coverage. (C) Precision. (D) Recall.

Fig. 7. Improvements produced
in quality measures results using the proposed method instead of the traditional method. Database: Netflix. (A) MAE. (B) Coverage. (C) Precision. (D) Recall.

Fig. 6 shows the improvement produced in the quality measures for the Movielens database when the proposed method is used rather than the traditional method. Fig. 6A shows the improvement produced in the quality of the predictions (MAE). The most notable improvement is obtained with small values of k because, as k increases, the neighbourhoods of the two approaches tend to be more and more similar and the quality of the predictions tends to be the same. Fig. 6B shows an increase in the coverage using the proposed method. For this measure, the most notable improvements are also produced for small values of k. This result is important because we achieve good prediction quality in the interval of k values (low values) where the operation of the RS is more efficient.

Figs. 6C and D show the improvement produced in precision and recall, respectively. For these experiments, we used a value of k = 100 and a relevance threshold of h = 5. The proposed method always produces an improvement.
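As a reading aid for the measures reported in this section, the per-user definitions (29), (31), (35) and (37) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used in the experiments: the function names, the dictionary layout (item to value, with unrated or unpredicted items absent) and the toy numbers are hypothetical. Returning `None` for an empty denominator is a choice made here, since the paper does not say how such users are handled.

```python
def mae_user(ratings, predictions):
    """Expression (29): mean |r - p| over items that have both a rating and a prediction."""
    common = [i for i in ratings if predictions.get(i) is not None]  # B_u
    if not common:
        return None
    return sum(abs(ratings[i] - predictions[i]) for i in common) / len(common)


def coverage_user(ratings, predictions, all_items):
    """Expression (31): percentage of unrated items for which a prediction exists."""
    unrated = [i for i in all_items if i not in ratings]
    if not unrated:
        return None
    predicted = [i for i in unrated if predictions.get(i) is not None]
    return 100.0 * len(predicted) / len(unrated)


def precision_recall_user(ratings, predictions, n, h):
    """Expressions (33)-(37): precision and recall of the top-n rated-and-predicted items."""
    rated_predicted = [i for i in ratings if predictions.get(i) is not None]  # X'_u
    top_n = sorted(rated_predicted, key=lambda i: predictions[i], reverse=True)[:n]  # Z'_u
    hits = sum(1 for i in top_n if ratings[i] >= h)
    relevant = sum(1 for r in ratings.values() if r >= h)
    precision = hits / len(top_n) if top_n else None
    recall = hits / relevant if relevant else None
    return precision, recall


# Hypothetical toy data for one test user (not taken from Movielens or Netflix).
ratings = {"a": 5, "b": 3, "c": 4, "d": 5}
predictions = {"a": 4.5, "b": 3.5, "c": 4.8, "e": 2.0}
all_items = ["a", "b", "c", "d", "e", "f"]

print(mae_user(ratings, predictions))
print(coverage_user(ratings, predictions, all_items))
print(precision_recall_user(ratings, predictions, n=2, h=5))
```

The system-level values (30), (32), (36) and (38) are then plain averages of these per-user results over the test users.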