Hortonworks, Saisai Shao - Spark and YARN: Better Together
- Format: PDF
- Size: 1.32 MB
- Pages: 23
Prediction Model of Execution Time for Batch Applications in Spark
LI Shuo, LIANG Yi
Faculty of Information, Beijing University of Technology, Beijing 100124, China
doi: 10.3778/j.issn.1002-8331.2002-0163  Document code: A  CLC: TP183
Funding: National Key R&D Program of China (2017YFC0803300); National Natural Science Foundation of China (91546111)

Abstract: Predicting the execution time of batch applications in Spark is a key technology for guiding resource allocation and application balancing in Spark. However, existing work applies a single unified prediction model to applications with different behavioral characteristics and considers only a limited set of factors in model learning, which reduces prediction accuracy. To solve these problems, an execution-time prediction model for Spark batch applications is proposed that considers the diversity of application behavior. The model first classifies Spark batch applications by execution time based on strongly correlated metrics, then uses the PCA and GBDT algorithms to predict execution time within each application category. When an ad-hoc application arrives, it is mapped to a specific category and its execution time is predicted with the corresponding model. Experimental results show that, compared with a unified prediction model, the proposed method reduces the root mean square error and mean absolute percentage error of predictions by 32.1% and 33.9% on average.

Key words: Spark; batch application; classification; prediction

The Spark distributed in-memory computing system is widely used in big-data processing scenarios [1-2]. Batch applications are a major class of workload supported by Spark; they are characterized by parallel processing of static datasets under the Directed Acyclic Graph (DAG) computing model. Predicting the execution time of batch applications underpins meeting soft real-time requirements, guiding Spark resource allocation and application-balancing decisions, and guaranteeing the quality of service of batch applications. However, accurately predicting the execution time of Spark batch applications remains an open technical challenge.

Recent research on execution-time prediction for batch applications on big-data systems falls into two categories: prediction based on source-code analysis, and prediction models built from selected correlated factors. Among the source-code-based approaches, the PACE and Pablo systems predict application execution time by analyzing the computational complexity and execution count of each class of operation in the source code [3-4]. However, such methods are white-box analyses and cannot be applied when the source code is unavailable.
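The classify-then-predict pipeline in the abstract can be illustrated with a toy sketch: applications are first bucketed by a strongly correlated metric, then a separate regressor is fit per bucket and an ad-hoc application is scored by its bucket's model. Everything below (metric names, the threshold, and the one-feature least-squares model standing in for the paper's PCA+GBDT stage) is invented for illustration, not the paper's actual model.

```python
def classify(app):
    # Toy rule: shuffle-heavy applications form one category, the rest another.
    return "shuffle_heavy" if app["shuffle_mb"] > 100 else "cpu_bound"

def fit_linear(samples):
    # One-feature least squares: time ~ a * input_mb + b.
    n = len(samples)
    sx = sum(s["input_mb"] for s in samples)
    sy = sum(s["time_s"] for s in samples)
    sxx = sum(s["input_mb"] ** 2 for s in samples)
    sxy = sum(s["input_mb"] * s["time_s"] for s in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Hypothetical execution history of past batch applications.
history = [
    {"shuffle_mb": 500, "input_mb": 10, "time_s": 30},
    {"shuffle_mb": 600, "input_mb": 20, "time_s": 55},
    {"shuffle_mb": 5,   "input_mb": 10, "time_s": 12},
    {"shuffle_mb": 8,   "input_mb": 20, "time_s": 20},
]

# Train one model per application category.
models = {}
for cat in {classify(s) for s in history}:
    models[cat] = fit_linear([s for s in history if classify(s) == cat])

# Predict an ad-hoc application with its own category's model.
adhoc = {"shuffle_mb": 550, "input_mb": 15}
a, b = models[classify(adhoc)]
pred = a * adhoc["input_mb"] + b
```

The point of the per-category split is that a single model averages over heterogeneous behaviors; here the shuffle-heavy bucket learns a steeper time-vs-input slope than the CPU-bound bucket.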
A Parallel Deep Forest Algorithm Based on Spark and an NRSCA Strategy
MAO Yimin; LIU Shaofen
Affiliations: School of Information Engineering, Jiangxi University of Science and Technology; School of Information Engineering, Shaoguan University
Journal: Application Research of Computers, 2024, 41(1), pp. 126-133  CLC: TP181

Abstract: Parallel deep forest in big-data environments suffers from too many redundant and irrelevant features, low utilization of features at the two ends of the feature vector, slow model convergence, and low parallel efficiency of the cascade forest. To address these problems, a parallel deep forest algorithm based on Spark and an NRSCA strategy, PDF-SNRSCA, is proposed. First, a feature-selection strategy based on neighborhood rough sets and the Fisher score (FS-NRS) filters features by measuring their relevance and redundancy, effectively reducing the number of redundant and irrelevant features. Second, a scanning strategy combining random selection and equidistant extraction (S-RSEE) guarantees that all features are used with equal probability, solving the low utilization of end features in multi-grained scanning. Finally, cascade-forest training is parallelized on the Spark framework: a feature-filtering mechanism based on an importance index (FFM-II) screens out non-critical features and balances the dimensionality of the augmented class vector against the original class vector, accelerating model convergence; and an SCA-based task-scheduling mechanism (TSM-SCA) redistributes tasks to keep the cluster load balanced, solving the low parallel efficiency of the cascade forest. Experiments show that PDF-SNRSCA effectively improves the classification performance of deep forest and substantially speeds up its parallel training.

Related work: 1. Parallel transformer fault diagnosis combining the three-ratio method and random forest on Spark. 2. Water-quality anomaly detection based on a parallel deep isolation forest algorithm. 3. Experimental design of a Spark-based parallelization of the extended isolation forest algorithm. 4. A parallel deep forest algorithm based on Spark and three-way interaction information. 5. A parallel deep convolutional neural network optimization algorithm based on Spark and AMPSO.
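The relevance-measurement half of a strategy like FS-NRS can be sketched with a plain-Python Fisher-score filter: features whose class-conditional means are well separated relative to their within-class variance score high and are kept. This is only the Fisher-score part; the neighborhood-rough-set redundancy test is omitted, and the data and threshold below are invented.

```python
def fisher_scores(X, y):
    # Fisher score per feature: between-class scatter / within-class scatter.
    n_features = len(X[0])
    classes = sorted(set(y))
    overall = [sum(row[j] for row in X) / len(X) for j in range(n_features)]
    scores = []
    for j in range(n_features):
        num = den = 0.0
        for c in classes:
            vals = [row[j] for row, label in zip(X, y) if label == c]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals)
            num += len(vals) * (mu - overall[j]) ** 2
            den += len(vals) * var
        scores.append(num / den if den else 0.0)
    return scores

# Toy data: feature 0 separates the two classes cleanly, feature 1 is noise.
X = [[1.0, 5.0], [1.1, 9.0], [3.0, 5.1], [3.1, 8.9]]
y = [0, 0, 1, 1]
scores = fisher_scores(X, y)
keep = [j for j, s in enumerate(scores) if s > 1.0]  # threshold is illustrative
```

On this toy data only feature 0 survives the filter, which is the intended behavior: irrelevant features are dropped before the expensive cascade-forest training.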
SOLUTION BRIEF
Delivering In-Memory Computing Using Mellanox Ethernet Infrastructure and MinIO's Object Storage Solution

EXECUTIVE SUMMARY
Analytic tools such as Spark, Presto and Hive are transforming how enterprises interact with and derive value from their data. Designed to operate in memory, these computing and analytical frameworks process volumes of data up to 100x faster than Hadoop MapReduce and HDFS, transforming batch processing tasks into real-time analysis. These advancements have created new business models while accelerating digital transformation for existing enterprises.

A critical component in this revolution is the performance of the networking and storage infrastructure deployed in support of these modern computing applications. Considering the volumes of data that must be ingested, stored, and analyzed, it quickly becomes evident that the storage architecture must be both highly performant and massively scalable.

This solution brief outlines how the promise of in-memory computing can be delivered using high-speed Mellanox Ethernet infrastructure and MinIO's ultra-high-performance object storage solution.

IN-MEMORY COMPUTING
With data constantly flowing from multiple sources - log files, time-series data, vehicles, sensors, and instruments - the compute infrastructure must constantly improve to analyze data in real time. In-memory computing applications, which load data into the memory of a cluster of servers thereby enabling parallel processing, achieve speeds up to 100x faster than traditional Hadoop clusters that use MapReduce to analyze data and HDFS to store it.

Although Hadoop was critical to helping enterprises understand the art of the possible in big-data analytics, applications such as Spark, Presto, Hive, H2O.ai, and Kafka have proven to be more effective and efficient tools for analyzing data. The reality of running large Hadoop clusters is one of immense complexity, requiring expensive administrators and a highly inefficient aggregation of compute and storage. This has driven the adoption of tools like Spark, which are simpler to use and take advantage of the massive benefits afforded by disaggregating storage and compute. These solutions, based on low-cost, memory-dense compute nodes, allow developers to move analytic workloads into memory where they execute faster, enabling a new class of real-time analytical use cases. These modern applications are built using cloud-native technologies and, in turn, use cloud-native storage.

MinIO and Mellanox: Better Together
High-performance object storage requires the right server and networking components. With industry-leading performance combined with the best innovation to accelerate data infrastructure, Mellanox provides the networking foundation needed to connect in-memory computing applications with MinIO high-performance object storage. Together, they allow in-memory compute applications to access and process large amounts of data to provide high-speed business insights.

Simple to Deploy, Simpler to Manage
MinIO can be installed and configured within minutes simply by downloading a single binary and executing it. The number of configuration options and variations has been kept to a minimum, resulting in near-zero system administration tasks and few paths to failure. Upgrading MinIO is done with a single command which is non-disruptive and incurs zero downtime. MinIO is distributed under the terms of the Apache License Version 2.0 and is actively developed on GitHub. MinIO's development community starts with the MinIO engineering team and includes all of the 4,500 members of MinIO's Slack workspace. Since 2015 MinIO has gathered over 16K stars on GitHub, making it one of the top 25 Golang projects by number of stars.
The emerging standard for both the public and private cloud, object storage is prized for its near-infinite scalability and simplicity - storing data in its native format while offering many of the same features as block or file storage. By pairing object storage with high-speed, high-bandwidth networking and robust compute, enterprises can achieve remarkable price/performance results.

DISAGGREGATE COMPUTE AND STORAGE
Designed in an era of slow 1GbE networks, Hadoop (MapReduce and HDFS) achieved its performance by moving compute tasks closer to the data. A Hadoop cluster often consists of many hundreds or thousands of server nodes that combine both compute and storage. The YARN scheduler first identifies where the data resides, then distributes the jobs to the specific HDFS nodes. This architecture can deliver performance, but at a high price - measured in low compute utilization, management costs, and the costs associated with its complexity at scale. In practice, enterprises also don't experience high levels of data locality, so results are suboptimal.

Thanks to improvements in storage and interconnect speeds, it has become possible to send and receive data remotely at high speed with little (less than 1 microsecond) to no added latency compared to local storage. As a result, it is now possible to separate storage from compute with no performance penalty. Data analysis remains possible in near real time because the interconnect between storage and compute is fast enough to support such demands.

By combining dense compute nodes, large amounts of RAM, ultra-high-speed networks and fast object storage, enterprises are able to disaggregate storage from compute, creating the flexibility to upgrade, replace, or add individual resources independently.
This also allows better planning for future growth, as compute and storage can be added independently and only when necessary, improving utilization and budget control. Multiple processing clusters can now share high-performance object storage, so that different types of processing - advanced queries, AI model training, streaming-data analysis - run on their own independent clusters while sharing the same data on the object store. The result is superior performance and vastly improved economics.

HIGH PERFORMANCE OBJECT STORAGE
With in-memory computing, it is now possible to process volumes of data much faster than with Hadoop MapReduce and HDFS. Supporting these applications requires a modern data infrastructure with a storage foundation that provides both the performance these applications demand and the scalability to handle the immense volume of data created by the modern enterprise. Building large storage clusters is best done by combining simple building blocks, an approach proven out by the hyper-scalers. By joining one cluster with many others, MinIO can grow to provide a single, planet-wide global namespace. MinIO's object storage server has a wide range of optimized, enterprise-grade features, including erasure coding and bitrot protection for data integrity, identity and access management, WORM and encryption for data security, and continuous replication and lambda compute for dynamic, distributed data. MinIO object storage is the only solution that provides throughput rates over 100 GB/sec and scales easily to store thousands of petabytes of data under a single namespace.
MinIO runs Spark queries faster, captures streaming data more effectively, and shortens the time needed to test, train and deploy AI algorithms.

LATENCY AND THROUGHPUT
Industry-leading performance and IT efficiency, combined with the best of open innovation, help accelerate big-data analytics workloads that require intensive processing. Mellanox ConnectX® adapters reduce CPU overhead through advanced hardware-based stateless offloads and flow-steering engines. This allows big-data applications using TCP or UDP over IP transport to achieve the highest throughput, completing heavier analytic workloads in less time so organizations can unlock and efficiently scale data-driven insights while increasing application density.

Mellanox Spectrum® Open Ethernet switches feature consistently low latency and can support a variety of non-blocking, lossless fabric designs while delivering data at line-rate speeds. Spectrum switches can be deployed in a modern spine-leaf topology to efficiently and easily scale for future needs. Spectrum also delivers packet processing without buffer-fairness concerns: the single shared buffer in Mellanox switches eliminates the need to manage port mapping and greatly simplifies deployment. In an object storage environment, fluid resource pools greatly benefit from fair load balancing.
As a result, Mellanox switches deliver optimal and predictable network performance for data-analytics workloads. Mellanox 25, 50 or 100G Ethernet adapters together with Spectrum switches form an industry-leading end-to-end, high-bandwidth, low-latency Ethernet fabric. The combination of in-memory processing for applications and high-performance object storage from MinIO, along with the reduced latency and improved throughput made possible by Mellanox interconnects, creates a modern data-center infrastructure that provides a simple yet highly performant and scalable foundation for AI, ML, and big-data workloads.

CONCLUSION
Advanced applications that use in-memory computing, such as Spark, Presto and Hive, are revealing business opportunities to act in real time on information pulled from large volumes of data. These applications are cloud native: they are designed to run on cloud computing resources, where Hadoop HDFS is being replaced by data infrastructures that disaggregate storage from compute. These applications now use object storage as the primary storage vehicle whether running in the cloud or on-premises. Employing Mellanox networking and MinIO object storage allows enterprises to disaggregate compute from storage, achieving both performance and scalability. By connecting dense processing nodes to MinIO object storage nodes with high-performance Mellanox networking, enterprises can deploy object storage solutions that provide throughput rates over 100 GB/sec and scale easily to store thousands of petabytes of data under a single namespace.
The joint solution allows queries to run faster, captures streaming data more effectively, and shortens the time needed to test, train and deploy AI algorithms, effectively replacing existing Hadoop clusters with an in-memory-computing data infrastructure that consumes a smaller data-center footprint yet provides significantly more performance.

WANT TO LEARN MORE?
Learn more about object storage from MinIO: https://min.io/
Learn more about the Mellanox end-to-end Ethernet storage fabric: /ethernet-storage-fabric/
Research on a Recommendation System Based on the Spark Platform (Part 1)

1. Introduction
With the rapid development of the Internet, information overload has become increasingly serious: users face the challenge of filtering valuable information out of massive data. Recommendation systems, an effective means of addressing information overload, have become an important component of modern Internet services. As a large-scale data processing framework, the Spark platform offers high performance, high reliability, and strong fault tolerance, which makes it well suited to building recommendation systems. This paper studies recommendation systems built on the Spark platform, aiming to improve their performance and accuracy.

2. Background and Related Technologies
1. The Spark platform: Apache Spark is an open-source distributed computing system for processing large-scale datasets. It provides rich APIs and powerful computing capabilities and can process both structured and unstructured data.
2. Recommendation systems: a recommendation system uses user-behavior data and other related information to provide users with personalized recommendations. Common approaches include content-based recommendation, collaborative filtering, and hybrid recommendation.
3. Data processing: data processing is a crucial step in a recommendation system. User-behavior data and item information must be cleaned, transformed, and stored before being used to train recommendation models and make predictions.
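In Spark the clean-transform-store steps would be RDD or DataFrame transformations; the same cleaning logic can be sketched in plain Python. The log format here (comma-separated user, item, rating lines) is invented for illustration.

```python
# Hypothetical raw behavior log: some lines are malformed or empty.
raw = ["u1,i1,5", "u2,i2,bad", "u2,i3,4", ""]

def parse(line):
    # Transform one log line into a (user, item, rating) tuple, or None.
    parts = line.split(",")
    if len(parts) != 3:
        return None
    user, item, score = parts
    try:
        return (user, item, int(score))
    except ValueError:
        return None  # drop records with non-numeric ratings

# Clean: keep only well-formed records (in Spark: rdd.map(parse).filter(...)).
clean = [r for r in map(parse, raw) if r is not None]
```

After this step `clean` holds only valid rating tuples, ready to be stored or fed to model training.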
3. Research on a Spark-Based Recommendation System
1. The data-processing module
The data-processing module is an indispensable part of a Spark-based recommendation system. First, user-behavior data and item information are collected from various data sources and preprocessed, including cleaning, transformation, and storage. Then Spark's distributed computing capability is used to process the data in parallel, improving processing speed and efficiency.

Data sparsity and the cold-start problem must be considered during data processing. They can be addressed with techniques such as clustering users and items via collaborative filtering to reduce sparsity, or using auxiliary information such as users' social networks and item metadata to drive recommendations.

2. The recommendation-algorithm module
The recommendation-algorithm module is the core of the system: it uses user-behavior data and other related information to produce personalized recommendations. A Spark-based recommendation system can employ multiple algorithms, such as content-based recommendation, collaborative filtering, and hybrid recommendation. For collaborative filtering, Spark's distributed computing capability can be used to process the user-item rating matrix in parallel, improving both the efficiency and the accuracy of collaborative filtering.
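The core of collaborative filtering over a rating matrix can be sketched in plain Python (in Spark this would typically be MLlib's parallel ALS or a parallelized similarity computation). The ratings, user names, and item names below are toy data invented for illustration.

```python
from math import sqrt

# Toy user-item rating matrix.
ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 3, "m3": 5, "m4": 3},
    "carol": {"m1": 1, "m2": 5, "m4": 4},
}

def cosine(u, v):
    # Cosine similarity over the items both users rated.
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(u[i] ** 2 for i in common)) * sqrt(sum(v[i] ** 2 for i in common))
    return num / den

def predict(user, item):
    # Score an unseen item by the similarity-weighted ratings of other users.
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and item in r:
            s = cosine(ratings[user], r)
            num += s * r[item]
            den += s
    return num / den if den else 0.0

score = predict("alice", "m4")  # bob and carol both rated m4
```

Alice's predicted rating for "m4" is pulled toward Bob's rating (3) more than Carol's (4), because Alice's rating vector is far more similar to Bob's. In a Spark deployment the pairwise similarity or factorization work is what gets distributed across the cluster.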
A Distributed Parallel Reasoning Algorithm Based on Spark
YE Yixin; WANG Jingbin
College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, China
Journal: Computer Systems & Applications, 2017, 26(5), pp. 97-104
Key words: RDF; OWL; distributed reasoning; TREAT; Spark

Abstract: Most current distributed parallel reasoning algorithms for RDF data need multiple MapReduce tasks; moreover, some of them cannot efficiently perform reasoning over OWL rules whose antecedents contain multiple instance triples, so overall reasoning efficiency is low when processing massive RDF data. To solve these problems, a distributed parallel reasoning algorithm based on Spark with TREAT (DPRS) is proposed. First, alpha registers for schema triples and a rule-markup model are built from the ontology of the RDF data; then, in the OWL reasoning phase, the alpha stage of the TREAT algorithm is implemented with MapReduce; finally, reasoning results are deduplicated, completing one full pass over all OWL rules. Experimental results show that DPRS performs parallel reasoning over large-scale data efficiently and correctly.

The RDF and OWL standards of the Semantic Web are widely used across domains, e.g. general knowledge (DBpedia [1]), medical and life sciences (LODD [2]), bioinformatics (UniProt [3]), geographic information systems (LinkedGeoData), and semantic search engines (Watson). These applications have produced massive volumes of semantic data. Given the complexity and scale of this data, efficiently discovering the information hidden in it through parallel reasoning is a pressing problem. Because Semantic Web data is growing rapidly and centralized environments are memory-limited, centralized reasoning no longer suits large-scale data; distributed parallel RDFS/OWL reasoning is a relatively new research area.

J. Urbani et al. [4-6] reason over the RDFS/OWL rule sets with WebPIE, which scales to big data; but WebPIE launches one or more MapReduce jobs per rule, and since job startup is relatively expensive, overall efficiency is limited as the number of RDFS/OWL rules grows. Gu Rong et al. [7] proposed YARM, an efficient and scalable MapReduce-based semantic reasoning engine that completes RDFS rule reasoning within a single MapReduce job; but it does not handle complex OWL rules, and when a rule regenerates duplicate triples, YARM performs redundant computation and produces useless data. Wang Jingbin et al. [8] proposed a distributed parallel reasoning algorithm for RDF data that combines Rete, building a schema-triple list and rule-markup model from the RDF ontology and implementing the alpha and beta stages of Rete with MapReduce; but joining in the beta network consumes substantial memory and iterates inefficiently, so the algorithm is constrained by cluster memory and platform. Gu Rong et al. [9] proposed Cichlid, an efficient Spark-based parallel reasoning engine that optimizes the parallel reasoning algorithm with the RDD programming model; but it does not consider whether a rule can actually fire, performing inference for every rule and thereby wasting reasoning effort and transmission.

To solve these problems, this paper proposes DPRS (Distributed Parallel Reasoning algorithm based on Spark) for the OWL Horst rules. Combining the TREAT algorithm [10] with the RDF ontology, DPRS builds alpha-register RDDs for schema triples, judges in advance whether each rule can fire and marks it, performs inference only for rules that can fire, and completes one pass of reasoning over all OWL rules within a single MapReduce job. Finally, duplicate triples are removed in real time and conflict-set data is updated into the corresponding registers, further improving the efficiency of subsequent iterations. Experiments show that the algorithm builds the alpha network efficiently and reasons correctly as the data volume grows dynamically.

Definition 1. Schema triple (SchemaTriple): a triple whose subject, predicate and object are all defined in the ontology file (OntologyFile); i.e., for every v ∈ {Si, Pj, Ok}, v ∈ OntologyFile.

Definition 2. Instance triple (InstanceTriple): a triple in which at least one of the subject, predicate and object is not defined in the ontology file; i.e., ∃v ∈ {Si, Pj, Ok} with v ∉ OntologyFile. It represents a concrete instance.

Definition 3. Triple-type flag (Flag_TripleType): a flag distinguishing schema triples from instance triples, defined from Definitions 1 and 2.

Definition 4. Schema-triple list (SchemaRDD): used to obtain the sets of schema triples sharing the same predicate or object. Om_RDD denotes a set of triples with predicate Pj ∈ {rdf:type} and a common object, named after that object; Pt_RDD denotes a set of triples with predicate Pj ∉ {rdf:type} sharing a common predicate, named after that predicate.

Definition 5. Link variable (LinkVar): a schema-triple term used to join two antecedents of an RDFS/OWL rule; a rule may have more than one. For each rule, the link-variable information is stored in Rulem_RDD as <key, value> pairs, where the key stores all schema-triple terms used for antecedent joins and the value stores the schema-triple term of the rule's conclusion.

DPRS classifies the OWL Horst rules by the type of their link variables. A rule is referenced as OWL-<rule number>; e.g. OWL-4 denotes rule 4 in Fig. 1. Each rule is also assigned a rule-name flag (e.g. the rule-name flag of OWL-4 is OWL-4). The classification is:
1) Type 1: rules with only one antecedent, or SchemaTriple/InstanceTriple combinations containing only one InstanceTriple; their results can be output directly in the Map phase (rules OWL-3, OWL-5a, OWL-5b, OWL-6, OWL-8a, OWL-8b, OWL-9, OWL-12a, OWL-12b, OWL-12c, OWL-13a, OWL-13b, OWL-13c, OWL-14a, OWL-14b in Fig. 1).
2) Type 2: SchemaTriple/InstanceTriple combinations containing multiple InstanceTriples; these require both the Map and reduceByKey phases (rules OWL-1, OWL-2, OWL-4, OWL-7, OWL-15, OWL-16 in Fig. 1).

Definition 6. Let Cmn be the n-th schema-triple antecedent of rule m. The antecedent schema flag Indexmn marks whether a schema triple matching that antecedent exists, i.e. whether the SchemaRDD named after Cmn is non-empty (per Definition 4).

Definition 7. Rule flag Flag_Rulem marks whether a rule can possibly fire, defined from Definition 6 with Flag_Rulem ∈ {0, 1, 2}: 0 if the rule cannot fire; 1 if the rule can fire and is Type 1; 2 if the rule can fire and is Type 2.

Since OWL rules 5a and 5b in Fig. 1 do not affect the parallelization of reasoning, they are excluded from the reasoning described below.

In the Rete algorithm, the registers at a rule's join nodes retain a large amount of redundant results, most of which is already captured by the rule instances in the conflict set. If the conflict set is used directly to constrain variable bindings between patterns during partial matching, the number of registers can be reduced and matching accelerated; this idea is called the conflict-set support strategy. Based on it, the TREAT algorithm [10] abandons Rete's use of beta registers to store the intermediate results of inter-pattern variable constraints. Following the characteristics of Spark RDDs and the principles of TREAT, DPRS first builds and broadcasts the alpha registers Om_RDD or Pt_RDD for schema triples from the RDF ontology, then joins the schema antecedents of each rule to generate the joined schema-triple set Rulem_linkvar_RDD, accelerating matching during inference and enabling distributed parallel reasoning over multiple rules.

DPRS consists of the following steps:
Step 1. Load and broadcast the schema-triple sets Pt_RDD, Om_RDD and Rulem_linkvar_RDD.
Step 2. Build and broadcast the rule-markup model Flag_Rulem.
Step 3. Execute OWL Horst rule inference in parallel.
Step 4. Remove duplicate triples.
Step 5. If new schema triples were produced, go to Step 2; if new instance triples were produced, go to Step 3; otherwise go to Step 6.
Step 6. End.
The overall framework of DPRS is shown in Fig. 2.

2.1 Loading schema triples and building the rule-markup model
Because schema triples are far fewer than instance triples, DPRS loads the SchemaTriples into SchemaRDD and broadcasts it, and builds each rule's schema triples or joined schema data (Rulem_linkvar_RDD, Om_RDD or Pt_RDD) as alpha registers, broadcast with the corresponding SchemaTriples. To identify rules that cannot fire as early as possible, DPRS builds, per the OWL rules, the relation Om_RDD or Pt_RDD among the SchemaTriples within each rule, checks whether the SchemaTriples of the rule's antecedents exist in SchemaRDD, generates the rule flag Flag_Rulem, assembles the markup model for all rules, and broadcasts it. SchemaRDD and the rule-markup model together filter out large numbers of InstanceTriples, reducing key-value output in the Map phase and hence useless network transmission, which improves overall reasoning efficiency.

2.2 Map phase
The Map phase performs data selection/filtering and Type-1 inference, emitting filtered results as key-value pairs. The data-distribution and filtering algorithm proceeds as follows:
Step 1. Obtain the broadcast variables Om_RDD, Pt_RDD, Rulem_linkvar_RDD and the rule flags Flag_Rulem.
Step 2. For each input (Si, Pj, Ok) ∈ InstanceTriple, check every Flag_Rulem: if the value is 0, go to Step 3; if 1, go to Step 4; otherwise go to Step 5.
Step 3. Do nothing with (Si, Pj, Ok).
Step 4. Perform Type-1 inference with Om_RDD or Pt_RDD and directly output the triples given by the rule's conclusion.
Step 5. Obtain the rule's alpha registers Om_RDD, Pt_RDD or Rulem_linkvar_RDD and check whether the current instance triple satisfies the link-variable conditions of the antecedents; if so, emit the pair <Rulem_linkvar, (Si, Pj, Ok)>; otherwise do nothing.

Taking rule 8a (inverseOf) in Fig. 1 as an example (pseudocode given in the original): as with rule 8, the produced triples are obtained already in the Map phase, so the Reduce phase only deduplicates and outputs them. For rule 9 (type + sameAs) and rule 15 (someValuesFrom), also given as pseudocode in the original: taking rule 9 as an example, each input triple is keyed by "Rule9 + link variable"; if its predicate is type, the value is flagged type with the link variable as the resource; if its predicate is sameAs, the value is flagged sameAs with the object as the resource.

2.3 Reduce phase
The Reduce phase performs join inference. Using the RDD reduceByKey operation together with the OWL rules, join inference is completed from SchemaRDD, the alpha registers and the InstanceTriple output of the Map phase. The join-inference algorithm proceeds as follows:
Step 1. Obtain the broadcast variables Om_RDD, Pt_RDD, Rulem_linkvar_RDD and the rule flags Flag_Rulem.
Step 2. Obtain the iterator for each key. If the key is Rulem (Type 1), output the value triples directly; if the key is Rulem_linkvar (Type 2), complete join inference according to the corresponding OWL rule and link variables, combining the alpha register Rulem_linkvar_RDD with the value iterator, and output the joined triples.

During join inference, the qualifying SchemaTriples were already joined when the alpha registers were built, so only SchemaTriple-InstanceTriple or InstanceTriple-InstanceTriple joins need to be executed. Taking rules 9 and 15 of Fig. 1 as examples (pseudocode given in the original): for rule 9, in the Reduce phase the flag stored in the values distinguishes the inputs, and the output triples are constructed from the input key and values.

2.4
删除重复三元组和冲突集更新策略在执行算法推理的过程中会产生大量重复的三元组数据到冲突集中,如不删除冲突集中的重复三元组,则更新alpha寄存器时将会产生重复三元组数据,浪费系统资源,降低推理效率.如果每次推理后都能够及时删除冲突集中的重复三元组,那将会减少很大的网络传输开销.本文借助RDD的distinct和subtract完成删除重复三元组算法.通过上述的删除重复三元组后,冲突集中的模式三元组分别更新到对应的alpha寄存器中,实例三元组合则并到实例文件中.2.5 算法的复杂度与完备性复杂性分析是算法分析的核心,DPRS算法的复杂性与集中式算法复杂性的分析不太相同,将DPRS算法的最坏情况下的时间复杂性分为Map阶段的时间复杂性和Reduce阶段的时间复杂性.假设数据集的规模大小为N个三元组,其中模式三元组为n个,在MapReduce中Map阶段的并行数为k,Reduce阶段传入的实例三元组个数为m,Reduce阶段的并行数为t.由于DPRS算法在Map阶段对每个输入的三元组,结合SchemaList、Flag_Rulem扫描一次,即可判断该三元组是该舍弃或是能参与某些规则推理,如能参与后续规则推理,则以该规则名称为key结合此三元组输出.因此,Map阶段的时间复杂性为:O(n*N/k).由于图1 OWL规则中,规则1、2、3、4、15、16都含有两个实例三元组前件,将上述规则称作多实例变量规则,多实例变量规则的Reduce阶段则需要遍历两次输入的实例三元组与模式三元组连接,才能得到推理结果.因此在Reduce阶段的时间复杂性分为单变量和多变量进行分析.Reduce阶段多变量的时间复杂性为:O(n*m/t).由于n的数目非常少,可以认为其量级为常数.DPRS算法首先将数据集中的模式三元组载入内存并广播,根据定义7和OWL规则的描述构建各个规则的Flag_Rulem,从而过滤掉不可能激活的规则.在能被激活的规则并行推理过程中的Map阶段,对于输入的一个三元组,DPRS判断其是否满足某个规则前件,只要满足,就将此规则名称作为键(key),值(value)为该三元组输出;若一个三元组数据满足多个规则前件,我们也将据此方法产生多个不同键(key)的输出,以保障Reduce阶段推理连接的正确性和数据完整性.如果Reduce阶段产生的三元组去重后,有产生新的模式三元组,那么 DPRS算法将重新计算各个规则的Flag_Rulem,再执行规则的并行推理迭代;如果Reduce阶段产生的三元组去重后产生的是实例三元组,那么DPRS算法直接执行规则的并行推理迭代,直到没有新的三元组数据产生为止.因而DPRS算法所得到的推理结果是完备的.实验所使用的软件环境为操作系统Linux Ubuntu,采用scala作为编程语言,开发环境为IntelliJIDEA.在实验环境中,用表1所示配置作为本系统Spark集群的配置,共计8台,其Hadoop集群中1台作为HDFS的名称节点,1台作为JobTracker节点,6台为HDFS的数据节点和TaskTracker节点,Spark集群中1台作为Master 兼Worker节点,7台作为Worker节点.集群工作站的基本配置如表1所列.本文将DPRS算法与DRRM[4]和Cichlid-OWL[9]在相同的实验环境下针对不同的数据集进行对比实验.本实验采用LUBM[11](Lehigh University Benchmark)数据集和DBpedia[1]数据集进行测试.数据集的基本参数说明如表2所列.我们将实验数据集中的模式三元组数进行统计如表2所示,与整个数据集的大小相比,模式三元组的数量非常少,在所测试的数据集范围内,模式三元组数目最高仅仅达到了整个数据集的0.04%.从表3和表4可知,在OWL规则推理结果一致的情况下,DPRS比Cichlid-OWL具有优势.其中,由于LUBM数据集本体比较简单,OWL Horst中的许多规则无法被激活,所以DPRS相比Cichlid-OWL的优势比较微弱;对于比较复杂的DBpedia本体而言,OWL的大部分规则都可被激活,由于本文使用了alpha寄存器广播、连接变量、规则标记和冲突集更新策略,使得DPRS算法的推理时间相对Cichlid-OWL算法最大缩短了21%的时间.另外,DPRS与DRRM相比均有较大的优势.首先, DPRS使用Spark平台比DRRM 使用的Hadoop具有迭代性能优势;再者,DPRS采用冲突集更新策略,避免了beta 
网络的开销,大大减少了传输冗余造成的浪费.使得DPRS算法的推理时间相对DRRM算法最大缩短了73.8%的时间.根据2.5节的复杂度分析,其中k和t为常数,所以推理时间的复杂度与N和m成线性关系.结合表4和表5,考察数据集LUBM50和LUBM200,实例三元组个数N 的比例为1:4.01,传入Reduce的实例三元组数m的比例为1:3.89,推理时间的比例为1:4.20;考察数据集DBpedia3.7和DBpedia3.9,实例三元组个数N的比例为1:1.53,传入Reduce的实例三元组数m的比例为1:1.41,推理时间的比例为1:1.35.可以发现,我们的推理时间基本是与N和m成线性关系.从实验结果上符合了理论的分析,证明了算法的正确性.从图3和图4可知,在执行OWL规则推理时,虽然两种算法都需要多次迭代才能使得推理最终停止,但是DPRS在推理前构建并广播了模式三元组的alpha寄存器,并且在每次迭代中采用高效的过滤机制,过滤掉大量的实例三元组数据,减少了并行计算量和网络传输的开销,使得DPRS算法在最终的推理时间较Cichlid-OWL略占优势,尤其是在DBpedia数据集下,从表2中可以看出,Dbpedia的模式三元组占比相对LUBM高,且数据集较为复杂,其优势更加明显.由于在执行推理过程中会产生重复的三元组数据,重复三元组数据会造成系统资源无谓的浪费并增加网络的开销.文中3.4节提出的删除重复三元组算法,能够减少重复的三元组数据.为了评估算法的有效性,将删除重复三元组前后的数据量进行对比如图5所示.删除重复三元组后的三元组数量少于推理三元组数量,在所测试的数据范围内.本文提出的 DPRS算法能够通过执行一次MapReduce任务就完成OWL所有规则的一次推理,弥补了现有方法大多需要启动多个MapReduce任务以及在大规模数据下无法对OWL规则中含有实例三元组的规则进行推理的问题.DPRS算法能够在MapReduce计算框架下高效地实现大规模数据的并行推理,但无法对流式数据进行推理.下一步将会在此方面进行改进,且研究更深一步的OWLDL推理.1 Auer S,Bizer C,Kobilarov G,et al.Dbpedia:A nucleus for a web of open data.The Semantic Web.Springer Berlin Heidelberg.2007.722–735.2 Jentzsch A,Zhao J,Hassanzadeh O,et al.Linking Open Drug Data.I–SEMANTICS.2009.3 Apweiler R,Bairoch A,Wu CH,et al.UniProt:The universal protein knowledgebase.Nucleic AcidsResearch,2004, 32(s1):D115–D119.4 Urbani J,Kotoulas S,Maassen J,et al.WebPIE:A web-scale parallel inference engine using MapReduce.Web Semantics: Science,Services and Agents on the World Wide Web,2012, (10):59–75.5 Urbani J,Kotoulas S,Maassen J,et al.OWL reasoning withWebPIE:Calculating the closure of 100 billion triples. 
Extended Semantic Web Conference.SpringerBerlin Heidelberg.2010.213–227.6 UrbaniJ.On web-scale reasoning[PhD.dissertation].Amsterdam,Netherlands:Computer Science Department, Vrije Universiteit,2013.7顾荣,王芳芳,袁春风,等.YARM:基于MapReduce的高效可扩展的语义推理引擎.计算机学报,2015,38(1):74–85.8汪璟玢,郑翠春.结合Rete的RDF数据分布式并行推理算法.模式识别与人工智能,2016,(5):5.9 Gu R,Wang S,Wang F,et al.Cichlid:Efficient large scale RDFS/OWL reasoning with spark.Parallel and Distributed ProcessingSymposium(IPDPS),2015 IEEE International. IEEE.2015.700–709.10 Miranker DP.TREAT:A new and efficient match algorithm forAI production system.Morgan Kaufmann,2014.11 Guo Y,Pan Z,Heflin J.LUBM:A benchmark for OWL knowledge base systems.Web Semantics:Science,Services andAgents on the World Wide Web,2005,3(2):158–182.。
基于Spark并行SVM参数寻优算法的研究何经纬;刘黎志;彭贝;付星堡【摘要】针对传统支持向量机(SVM)参数寻优算法在处理大样本数据集时存在的寻优时间过长,内存消耗过大等问题,提出了一种基于Spark通用计算引擎的并行可调SVM参数寻优算法.该算法首先使用Spark集群将训练集以广播变量的形式广播给各个Executor,然后并行化SVM的参数寻优过程,并在在寻优过程中控制Task 并行度,使各个Executor负载均衡,从而加快寻优速度.实验结果表明,本文提出的参数寻优算法,通过设置合理的Task并行度,可以在充分使用集群资源的同时提高最优参数的寻找速度,减少寻优时间.【期刊名称】《武汉工程大学学报》【年(卷),期】2019(041)003【总页数】7页(P283-289)【关键词】支持向量机;参数寻优;Spark;并行度;负载均衡【作者】何经纬;刘黎志;彭贝;付星堡【作者单位】智能机器人湖北省重点实验室(武汉工程大学),湖北武汉 430205;武汉工程大学计算机科学与工程学院,湖北武汉 430205;智能机器人湖北省重点实验室(武汉工程大学),湖北武汉 430205;武汉工程大学计算机科学与工程学院,湖北武汉 430205;智能机器人湖北省重点实验室(武汉工程大学),湖北武汉 430205;武汉工程大学计算机科学与工程学院,湖北武汉 430205;智能机器人湖北省重点实验室(武汉工程大学),湖北武汉 430205;武汉工程大学计算机科学与工程学院,湖北武汉430205【正文语种】中文【中图分类】TP311随着互联网的发展,越来越来的智能设备被接入到网络中来,数以万计的设备每天都在产生大量的数据,如何从海量的数据中获取有价值的信息成为当前研究的热点。
支持向量机[1-5](support vector machine,SVM)算法在参数设置合理的情况下,处理小样本、高维度数据集时表现出很好的性能和准确率,而不合理的参数设置将会导致糟糕的性能和极低的准确率,所以参数的选取是SVM算法中至关重要的一环。
Spark and YARN: Better Together Saisai Shao
sshao@ May 15, 2016
Spark on YARN Recap
Overview of Spark Cluster
[Diagram: a Spark application — the Driver coordinates Executors through a Cluster Manager]
[Diagram: Spark on YARN, client mode — the Driver runs in the Client process; the ResourceManager grants Containers on NodeManagers (NM) for the ApplicationMaster (AM) and the Executors]
[Diagram: Spark on YARN, cluster mode — the Driver runs inside the ApplicationMaster container; Executors run in Containers on the NodeManagers]
Difference Compared to Other Cluster Managers
– Application has to be submitted into a queue
– Jars/files/archives are distributed through the distributed cache
– Additional ApplicationMaster
– …
Better Run Spark On YARN
What Do We Care About?
– Make better use of the resources
– Run better on the cluster
– Easier to debug
Calculate Container Size
What is the size of a Container?
– Memory
– # of CPUs

[Diagram: executor memory layout — the container holds the executor heap (spark.executor.memory) plus off-heap overhead (spark.yarn.executor.memoryOverhead); inside the heap, spark.memory.fraction (0.75) is split between storage (spark.memory.storageFraction, 0.5) and execution memory; spark.memory.offHeap.size sizes any additional off-heap memory]

container memory = spark executor memory + overhead memory
yarn.scheduler.minimum-allocation-mb <= container memory <= yarn.nodemanager.resource.memory-mb
container memory is rounded up to a multiple of yarn.scheduler.increment-allocation-mb
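The sizing rules above can be sketched in a few lines. The 384 MB floor and 10% overhead mirror the documented default of `spark.yarn.executor.memoryOverhead` from this era of Spark; the minimum-allocation and increment values are illustrative defaults, not your cluster's actual settings:

```python
def yarn_container_memory_mb(executor_memory_mb, overhead_mb=None,
                             min_alloc_mb=1024, increment_mb=512):
    """Estimate the YARN container size for a Spark executor (a sketch)."""
    if overhead_mb is None:
        # Assumed default: max(384 MB, 10% of executor memory), mirroring
        # spark.yarn.executor.memoryOverhead's documented default.
        overhead_mb = max(384, executor_memory_mb // 10)
    requested = executor_memory_mb + overhead_mb
    # YARN enforces yarn.scheduler.minimum-allocation-mb as a floor...
    requested = max(requested, min_alloc_mb)
    # ...and rounds the request up to a multiple of
    # yarn.scheduler.increment-allocation-mb.
    return -(-requested // increment_mb) * increment_mb
```

For example, a 4 GB executor with the assumed default overhead becomes a 4608 MB container (4096 + 409, rounded up to the next 512 MB step).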
Calculate Container Size (Cont’d)
Enable CPU Scheduling
– Capacity Scheduler with DefaultResourceCalculator (default)
• Only takes memory into account
• CPU requirements are ignored when carrying out allocations
• The setting of “--executor-cores” is controlled by Spark itself
– Capacity Scheduler with DominantResourceCalculator
• CPU will also be taken into account when calculating
• Container vcores = executor cores

container cores <= yarn.nodemanager.resource.cpu-vcores
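Switching the Capacity Scheduler to the DominantResourceCalculator is a one-property change; a sketch of the capacity-scheduler.xml entry, with the property name and class taken from the Hadoop 2.x Capacity Scheduler documentation:

```xml
<!-- capacity-scheduler.xml: make the Capacity Scheduler consider CPU
     as well as memory when allocating containers. -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```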
Isolate Container Resource
Containers should only be allowed to use the resources they are allocated; they should not be affected by other containers on the node
How do we ensure containers don’t exceed their vcore allocation?
What’s stopping an errant container from spawning a bunch of threads and consuming all the CPU on the node?
CGroups
With the LinuxContainerExecutor and related settings, YARN can use CGroups to constrain CPU usage (https:///docs/current/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html).
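A minimal yarn-site.xml sketch of that setup; the property names come from the Hadoop 2.x NodeManager CGroups documentation, while the hierarchy path and group are illustrative values:

```xml
<!-- yarn-site.xml: run containers through the LinuxContainerExecutor
     and let it place them into CGroups for CPU enforcement. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
```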
Label Based Scheduling
How to specify applications to run on specific nodes? Label based scheduling is what you want.
To use it:
– Enable node labels and label-based scheduling on the YARN side (Hadoop 2.6+)
–Configure node label expression in Spark conf:
•spark.yarn.am.nodeLabelExpression
•spark.yarn.executor.nodeLabelExpression
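Put together, steering a whole application onto labeled nodes might look like this in spark-defaults.conf. Here "gpu" is a hypothetical label name; it must already have been created and attached to nodes on the YARN side (via `yarn rmadmin`):

```properties
# "gpu" is an assumed label name, shown for illustration only.
spark.yarn.am.nodeLabelExpression        gpu
spark.yarn.executor.nodeLabelExpression  gpu
```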
Dynamic Resource Allocation
How to use resources more effectively and more elastically?
Spark supports dynamically requesting or releasing executors according to the current load of jobs.
This is especially useful for long-running applications like Spark shell, Thrift Server, Zeppelin.
To Enable Dynamic Resource Allocation
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
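A few companion settings usually accompany the two switches above; the bounds and timeout below are illustrative values, not defaults to copy:

```properties
# Illustrative bounds on how far the executor count may scale.
spark.dynamicAllocation.minExecutors        1
spark.dynamicAllocation.maxExecutors        50
# Release an executor after it has been idle this long.
spark.dynamicAllocation.executorIdleTimeout 60s
```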
<!-- spark_shuffle must also be registered in yarn.nodemanager.aux-services
     (the value shown assumes the default mapreduce_shuffle is kept) -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>