SQL versus MapReduce
A Comparison between two Approaches to Large Scale Data Analysis

Livio Hobi
University of Zurich
livio.hobi@uzh.ch

May 30, 2013

Abstract

This paper compares two different approaches to large scale data analysis, namely MapReduce and parallel database management systems. Both approaches use a cluster of nodes to compute expensive tasks in parallel. The strengths of MapReduce are fault tolerance, storage-system independence, flexibility and simplicity. While MapReduce can process the data on-the-fly by simply loading it into a distributed file system, the loading phase of parallel database management systems can take a long time. Once the data is loaded, parallel database management systems are robust, support a high degree of parallelism and provide high performance, since they have been developed and improved for over two decades. The user has to think about the fundamental trade-off between data loading costs and task response time. If the dataset is static rather than dynamic and the data has to be queried frequently, a parallel DBMS may be preferable. In contrast, if the data is dynamic and the user is interested in a few queries, the approach of MapReduce can be very useful. Another interesting difference is the failure model. While MapReduce makes extensive use of a fine-grained failure model, parallel DBMSs prefer performance at the cost of possible re-executions of the whole task. If the computation takes a long time or the cluster consists of hundreds or thousands of machines, the approach of MapReduce may be preferable.

1 Introduction

As datasets become larger and larger, there emerges a need to handle such data in parallel over a cluster of machines. Parallel database management systems (DBMSs) have existed for over two decades. They use a cluster of interconnected computing machines to perform queries over large datasets in parallel [PPR09]. A DBMS is installed on each node, which is connected to the other members of the cluster. Once the system has to process a query, it automatically computes a query plan, distributes it and partially executes the query in parallel over the cluster. Parallel DBMSs are robust, support a high degree of parallelism and provide high performance. They use the declarative, high-level query language SQL (Structured Query Language).

Parallel DBMSs perform well because they load and index the data before they process queries. Since this loading phase can take a long time, the question arises whether it is worth it to only process one or two queries over the data. This is where MapReduce comes into play. MapReduce is a programming model for processing and generating large datasets. MapReduce was introduced in 2004 by Dean and Ghemawat [DG08] and embraced by an ever-growing community. Since the launch of MapReduce, users have developed thousands of MapReduce programs. So what makes this programming model so successful? The main benefit of the MapReduce paradigm is simplicity. MapReduce consists basically of only two functions: Map and Reduce. These functions have to be defined by the user. They take as input a key/value pair and output a key/value pair. Hidden from the user, the MapReduce implementation handles parallelism, failures of nodes, data distribution and load balancing. In contrast to the loading and indexing phase of parallel DBMSs, MapReduce only loads the data into a distributed file system and then processes the data on-the-fly. MapReduce divides the dataset into smaller chunks of data and distributes them across the cluster.
While parallel DBMSs make use of the declarative language SQL, MapReduce tasks can be programmed using procedural, object-oriented languages such as Java. It is possible to write almost every parallel database task as an equivalent MapReduce task and vice versa (using user-defined functions (UDFs) in DBMSs) [PPR09]. An interesting question is whether MapReduce should and could replace parallel DBMSs. What are the benefits of each approach? How do they perform? What can they learn from each other?

2 Problem Statement

This section introduces a problem from the banking industry that will be solved with both techniques, parallel DBMSs and MapReduce. The following sections will refer to this problem to show how the two approaches work and where their differences are.

Let us consider we are a bank (e.g. UBS). As in most companies, one problem is to know our customers. Which customers should be involved in which marketing campaign? Where are potential customers? Where are their interests? In this example, our goal is to find customers which have a capital of over one million and are additionally interested in share transactions. The data we have are information about the capital of each customer and a log of clicks of our online banking system. We assume that a customer is interested in shares when he or she clicks more than ten times in a month on the corresponding "Shares" button in the online banking system. The difficulty in this scenario is that the log of clicks in our online banking system grows very fast and changes over time. Therefore we have to deal with a big amount of data.

3 The two Approaches

In this section the details of the two approaches are presented.

3.1 MapReduce

MapReduce is a programming model which allows straightforward implementation of parallel tasks, but hides the details of parallelism, fault tolerance, data distribution and load balancing [DG08]. In practice there is a broad range of applications for MapReduce. Examples are graph processing, text processing, data mining, reports of popular queries (Google Trends) and processing of data from web crawls [SAD+10].

3.1.1 Programming Model

As mentioned before, MapReduce is a programming model developed for processing parallel tasks over a large dataset with a cluster of nodes. The two core functions, Map and Reduce, take as input key/value pairs and produce as output a list of key/value pairs. The Reduce function accepts intermediate key/value pairs, provided by the Map function, performs some computation (such as merging or aggregation) and outputs again key/value pairs. In MapReduce, every intermediate key/value pair with the same key is sent to the same Reduce instance on the same node. An instance means a unique running invocation of a Map or Reduce function, where a node of a MapReduce cluster usually has multiple instances of these functions [PPR09].

It is possible to perform complex computations with the MapReduce programming model by simply chaining multiple Map and Reduce functions. While in SQL, respectively in parallel DBMSs, the user has to write a SQL query (which can be hard if the query is complex), MapReduce follows a simple and more understandable step-by-step manner.
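To make the model concrete, consider the classic word count task: the Map function turns each line of text into (word, 1) pairs, and the Reduce function sums the counts for each word. The following sketch is written against the API of Hadoop, the open-source MapReduce implementation discussed in section 4.6; the class names and the tokenization are our own illustrative choices, not part of the abstract model.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: one input pair (byte offset, line of text) becomes one (word, 1)
    // pair per word in the line.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);  // emit an intermediate key/value pair
            }
        }
    }

    // Reduce: all intermediate pairs with the same word arrive at the same
    // Reduce instance, which aggregates them into a single count.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

Note that neither function mentions parallelism, node failures or data movement; the framework supplies all of that, which is exactly the simplicity argument made above.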
3.1.2 Implementation

In clusters with hundreds or even thousands of nodes, failures of some nodes can be very common. Therefore MapReduce uses replication to provide availability and reliability [DG08]. Let us look at a simple example: in a cluster of 1000 machines, where the probability of a failure of a machine is 0.001 per active hour, and with an average task computation time of 10 hours, the expected number of machine failures during one run is 10 (1000 * 0.001 * 10), so at least one failure is almost certain. If this task is running every night and the result is needed the next day, the reliability of a programming model can be really useful. Of course, 1000 nodes and 10 hours of processing are quite big numbers and maybe not often encountered in a company nowadays, but with ever-growing data this scenario may become more realistic.

A MapReduce cluster consists of one master node and a set of worker nodes. The master node keeps track of the progress of the execution, and in case of a failure it re-executes the specific task on another node. A typical execution contains the following steps:

1. The input files are split into chunks and distributed among the nodes over the distributed file system. Then, copies of the program are started on the cluster.
2. The master node picks idle workers and assigns them Map and Reduce tasks.
3. Workers with Map tasks read the input chunk, parse the data to key/value pairs and pass them to the Map function. Intermediate key/value pairs are buffered in memory.
4. The buffered pairs are written to local disk, and the master gets informed about their location.
5. The master informs Reduce workers about the location of the intermediate values. The Reduce workers then use remote procedure calls to read the data, where the keys of a Reduce instance are all the same.
6. The Reduce worker iterates over the intermediate data and passes the corresponding data to the Reduce function for each unique key. This generates a final output file for every Reduce instance, which is again stored in the distributed file system.
7. After finishing all tasks, the master wakes up the user program.

A MapReduce implementation typically stores one output file per Reduce task, since it is normally not necessary to combine the output. The output is often used for a next MapReduce task, or the next application can deal with it. For each assigned task, the master node keeps track of the identity of the worker machine and the corresponding state (idle, in-progress, completed). Additionally, it stores the location of the R intermediate file regions per Map task, where R is the number of Reduce tasks.
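From the user's point of view, an execution like this starts with configuring and submitting a job. As a rough sketch of what this looks like in Hadoop (the input and output paths are hypothetical, and the word-count classes are the ones from section 3.1.1):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);   // ships the program to the cluster (step 1)
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input chunks live in the distributed file system; the output
            // directory will hold one file per Reduce task.
            FileInputFormat.addInputPath(job, new Path("/input/text"));
            FileOutputFormat.setOutputPath(job, new Path("/output/wordcount"));
            // Block until the master reports completion (step 7).
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }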
3.1.3 Fault Tolerance

One of the key benefits of the MapReduce programming model is its fault tolerance. It uses replications of data chunks to provide availability and reliability. The master pings workers, and in case of no response the worker is marked as failed. All running Map and Reduce tasks of that machine are then assigned to other nodes. Completed Map tasks are re-executed too, because their intermediate results are stored on local disks, whereas completed Reduce tasks do not have to be re-executed, because their results are stored in the distributed file system. Of course, there are some trade-offs between fault tolerance overhead and performance. As data gets larger and hence processing tasks take more time, failures of nodes in big clusters are more likely, and therefore an expensive fault tolerance model can be worth it. We will discuss this aspect later on in section 4.

3.1.4 Solution with MapReduce

Let us come back to our problem from section 2. Our data is stored in comma-separated values (CSV) files. In the file of the customers (Customers.csv) every row stores the id, the name and the capital. The log of clicks is stored in one log file per day. As usual in log files, the date information is stored in the filename. A file of our log could be named "2013-04-02-ClickLog.csv". Every row stores the customer id and the click event. Example files:

    2013-04-02-ClickLog.csv        Customer.csv
    CustomerID,ClickEvent          ID,Name,Capital
    3,Shares                       1,Hans,1'750'000
    2,Shares                       2,Peter,25'000
    2,Home                         3,Julia,3'500'000
    3,Payment
    1,Transfer
    2,Shares
    2,Shares
    1,Shares

The code in MapReduce would look as follows:

    Map1(key, value) {
        // key: filename of Customer
        // value: Object with the entries of a Customer
        Customer c = new Customer()
        c.parseFrom(value)
        if (c.capital > 1'000'000)
            Emit(c.id, c.name)
    }

    Map2(key, value) {
        // key: filename of ClickLog
        // value: Object with the entries of a ClickLog
        ClickLog c = new ClickLog()
        c.parseFrom(value)
        if (c.clickEvent == Shares)
            Emit(c.customerID, c.clickEvent)
    }

    Reduce(key, values) {
        // key: CustomerID
        // values: intermediate values from both Map functions
        String customer = null
        int count = 0
        for each value in values {
            if (value != Shares)
                customer = value   // the value is a customer name from Map1
            else
                count++            // the value is a "Shares" click from Map2
        }
        if (count > 10 AND customer != null)
            Emit(key, customer)
    }

There are two Map functions and one Reduce function. The first Map function takes as input the data from "Customers.csv" and only emits the customer id as key and the customer name as value when the capital of that customer is over one million. This is equivalent to a filter operation in the where clause of a SQL statement. The second Map function processes the data from the log of clicks and emits the customer id as key and the event as value when the click event is of type "Shares".

Intermediate key/value pairs with the same customer id are sent to the same Reduce instance. The Reduce function takes as input the intermediate key/value pairs from both Map functions. It then counts the number of "Shares" events and looks for a customer name. If the variable customer is still null in the end, it means that this customer does not have a capital of more than a million, and therefore the Reduce function does not emit a result. After the computation, there will be a file in the distributed file system for every customer which has more than a million capital and over ten clicks on "Shares", with its customer id and the customer name. This is exactly what we were looking for.

In MapReduce we can make use of the time information stored in the filename in order to not scan through all the ClickLog files. This reduces the amount of work dramatically (less parsing, reading and processing overhead).

Figure 1 shows an example execution of this MapReduce task. In the first column are the input data chunks, stored in the distributed file system. In this example, every chunk consists of three entries of the Customer data or the ClickLog data. These chunks are assigned to the different Map instances, and their output is then stored on local disks on the nodes of the Map instances. Afterwards, the master node informs the Reduce instances where their data is located. They use remote procedure calls to get their intermediate key/value pairs. The output of the Reduce instances is then stored in the distributed file system. To save space, the example shows a result of the Reduce function if there is at least one customer and one "Shares" event.

[Figure 1: MapReduce Execution Example]

3.2 Parallel Database Management Systems

A parallel DBMS uses a cluster of nodes with high-speed interconnects to compute expensive tasks in parallel. These systems support standard relational tables and SQL. Using horizontal partitioning of relational tables, the system can partially execute SQL queries in parallel. Horizontal partitioning means that the rows of a relational table are distributed across the nodes of the cluster [SAD+10].

Parallel DBMSs use different partitioning strategies. Using hash partitioning, a hash function is applied to every row of a relation, and the row is sent to a node according to the hash value. In round-robin partitioning, the rows of a relation are distributed in blocks across the nodes of the cluster. The advantage of round-robin partitioning is better load balancing, since every node stores more or less the same number of records.
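The difference between the two schemes fits in a few lines of code. The following toy sketch is our own illustration of the idea, not any product's internal implementation:

    // Toy illustration of the two partitioning schemes; numNodes is the
    // size of the cluster.
    final class Partitioning {

        // Hash partitioning: the target node is a function of the
        // partitioning column, so all rows with the same key (e.g. a
        // customer id) end up on the same node and can be joined locally.
        static int hashPartition(Object partitioningColumn, int numNodes) {
            return Math.floorMod(partitioningColumn.hashCode(), numNodes);
        }

        // Round-robin partitioning: rows are dealt out in turn, which
        // balances the load evenly but scatters rows with equal keys
        // across the nodes.
        static int roundRobinPartition(long rowNumber, int numNodes) {
            return (int) (rowNumber % numNodes);
        }
    }

Because hash partitioning sends every row with the same key to the same node, a join on that key can run locally on each node; round-robin placement balances storage evenly but forces a redistribution step before such a join, as the scenarios in section 3.2.1 below show.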
The system uses an optimizer that translates the query into a query plan, which is then distributed across the nodes of the cluster. In contrast to the approach of MapReduce, no additional messages or scheduling algorithms are necessary, because the optimizer sends the whole query plan at the beginning of the computation, and therefore every node knows exactly what and when to compute. A key benefit of parallel DBMSs is the automatic generation of the query plan by the optimizer, which takes into account how the data is partitioned. The user doesn't have to think about how the query is executed.

There are many commercial implementations available, including Teradata, Netezza, DATAllegro, ParAccel, Greenplum, Vertica and DB2 [SAD+10].

3.2.1 Solution with parallel DBMSs

This subsection describes how to solve the problem from section 2 with parallel DBMSs. Our data is stored in two relations, Customer and ClickLog. Customer consists of three attributes: ID, Name and Capital. The attributes of ClickLog are CustomerID, Date and ClickEvent. The two example relations could look as follows:

    ClickLog                              Customer
    CustomerID  Date        ClickEvent    ID  Name   Capital
    3           2013-02-15  Shares        1   Hans   1'750'000
    2           2013-03-07  Shares        2   Peter  25'000
    2           2013-04-02  Home          3   Julia  3'500'000
    3           2013-04-08  Payment
    1           2013-04-17  Transfer
    2           2013-04-20  Shares
    2           2013-04-25  Shares
    1           2013-04-26  Shares

This SQL statement solves our problem:

    SELECT Customer.ID, Customer.Name
    FROM Customer, ClickLog
    WHERE Customer.ID = ClickLog.CustomerID
      AND Customer.Capital > 1000000
      AND ClickLog.ClickEvent = 'Shares'
      AND ClickLog.Date BETWEEN '2013-04-01' AND '2013-04-30'
    GROUP BY Customer.ID, Customer.Name
    HAVING COUNT(ClickLog.ClickEvent) > 10

This SQL query basically joins the two relations on the customer id and filters them on the amount of capital, the type of event and the date. It then groups by the id and the name and returns the results which have a count of the event type higher than ten.

Let us now look at three different execution scenarios:

Scenario one: Both relations are hash partitioned on the customer id. If the two relations are hash partitioned on the join attribute, the query can be processed in parallel over the cluster, because all the rows with a given join attribute value are stored on the same node. Therefore the local joins can be processed in parallel, and network traffic is only needed at the beginning, when the query plan is distributed, and in the end, to send all the results to a single node.

Scenario two: Customer and ClickLog are round-robin partitioned. If both relations are round-robin partitioned, the optimizer can choose between different query plans. If one of the two relations is small (e.g. the Customer relation), it could be sent to all the nodes, and every node could then compute its partial results in parallel. If the overhead of sending the whole relation across the cluster is too big, a hash function can be applied to the join attribute of the query (ID, CustomerID) to compute where to send the rows. All the records of a specific id are then sent to the same node, such that the query can be executed in parallel.

Scenario three: ClickLog is hash partitioned on CustomerID and Customer is round-robin partitioned. In this scenario the optimizer would apply the same hash function used to partition ClickLog to Customer and then distribute the records of Customer to the same nodes where the corresponding ClickLog entries are located.

As mentioned before, the optimizer automatically computes the best query plan, depending on how the data is partitioned.
4 Some Key Aspects in Detail

In this section some differences between the two approaches are discussed in detail.

4.1 Schema

MapReduce is schema-free. On one hand this makes the programming model very flexible, but on the other hand the user has to think about how the data is parsed in the Map and Reduce functions. Pavlo et al. stated in their paper that the user often has to write a custom parser to add the required semantics to the data used by the program. This can lead to a significant overhead if the parser is not optimized. For this purpose there exists a Protocol Buffer format, which describes the input and output types but hides the details of encoding and decoding. It uses an optimized binary representation that is more compact and much faster than the textual formats [DG10].
MapReduce can then process data from a specific time period[DG10].This is exactly what we have done to solve our example problem.Of course,this workarounds of MapReduce do not fully compensate for the lack of indices.But the storage-system independence makes it reallyflexible. MapReduce was not built to make extensive use of index structures.This is one of the key differences.Parallel DBMSs use these techniques to improve the performance,but this leads to a longer loading phase.The advantage of MapReduce is the ability to process tasks on-the-fly.4.3Programming LanguageThe programming model of the two approaches are very different.While parallel DBMSs use the high-level,declarative language SQL,MapReduce uses proce-dural languages,such as Java or C++.In SQL,the user has to state what he wants,while the implementation is hidden.On the contrary,in procedural languages the programmer has to implement the algorithm by him self[PPR09]. There are approaches that build a layer on top of MapReduce frameworks and provide some high-level functionalities,such as joins or ing this layer,frequently utilized algorithms do not always have to be implemented by the programmer.Apache Pig[ORS08]is an example.4.4Complex FunctionsUsing MapReduce,the programmer is enjoyingflexibility in how a problem is addressed.In contrast,most of the parallel DBMSs come with afixed set of functionalities,where the usage of user defined functions is limited.Even10though several problems can be solved with these standard functionalities,there are cases where this does not hold.For example processing unstructured input data that does no properlyfit into the relational paradigm(e.g.data-mining or data-clustering,where multiple passes over the same data are required)[PPR09]. In contrast,the programmer doesn’t have to care about the execution order because the optimizer automatically computes a query plan.4.5Execution Strategy and Fault ToleranceThe execution strategy of MapReduce is based on a pull model.Every Map function materializes the output on local disk.Then the Reduce instance gets informed by the master on which node and where on the disk its data is located. The advantage of this materializing is fault tolerance.As mentioned in section 3.1,MapReduce is able to re-execute every failed task.On the other side this can lead to a bottleneck,since a lot of Reduce instances could try to get data from the same node at the same time.Dean and Ghemawat state that as datasets grow and computations take more time,fault tolerance will become more important.In contrast,parallel DBMSs use a push model rather than a pull model. Intermediate results are not stored on local disks,but they are forwarded to the next execution step.This ends up with a much simpler failure model(e.g. 
transactions),where a failure would result in a re-execution of most parts of a computation.Nevertheless,parallel DBMSs gain performance because they do not store intermediate results[PPR09].4.6PerformancePavlo et al.tried to compare the performance of a MapReduce implementation (Hadoop[had])to two parallel DBMSs(DBMS-X and Vertica[ver]).In thefive tasks they tested,on average the two parallel DBMSs outperform Hadoop by a factor of3.2to7.4on a cluster of100nodes.Reasons for the better perfor-mance of the parallel DBMSs are the usage of indices,better compression and storage techniques and mature algorithms for parallel processing.Additionally the high start-up costs of a Hadoop jop can dominate the response time of a task,especially in short-running queries.They were however impressed,how easy a Hadoop environment is installed and how fast a user can start running a job.Dean and Ghemawat criticized different aspects on the performance com-parison.First of all they stated that for many of the tasks,the loading phase of the parallel DBMSs takes aboutfive to50times the time needed to run the job on Hadoop.If the data is processed only once or twice,a Hadoop job would clearly end before the data is even loaded into the parallel DBMSs.Further-more,they claimed that start-up and parsing overhead are not fundamental differences between two programming models.These two aspects are imple-mentation specific and can be addressed in various ways(e.g.Protocol Buffer Format,pools of worker waiting for a new job).Additionally they falsify the11conclusion of[PPR09]that a MapReduce task always has to scan over all the data.MapReduce can use the output of a DBMS as an input or use natural indices such as information stored infilenames in order to not scan through all the data.Even though the performance benchmark of Pavlo et al.may overstate the benefit of parallel DBMSs,it clearly shows the trade-offbetween data loading in parallel DBMSs and processing data on-the-fly in MapReduce.As MapReduce implementations will get more sophisticated,the probable actual performance advantage may shrink and MapReduce may become an interesting alternative to parallel DBMSs.5ConclusionThis paper simplifies and compares the two different approaches,MapReduce and parallel DBMSs.While MapReduce is excellent in the areas of simplicity, fault tolerance andflexibility,there can be still a lack of performance compared to parallel DBMSs,once the data is loaded.MapReduce consists basically of two functions,Map and Reduce.They have to be defined by the program-mer.MapReduce supports procedural programming languages,such as Java and C++.A speciality of MapReduce is to start computations over data on-the-fly.The data has only to be copied in the distributedfile system and then the processing can start.The possibility of starting a task over a huge dataset without a long loading phase is one of the key features.If the data is only processed once and changes frequently over time,this approach can be very useful and preferable.Its fault tolerance model brings an important advantage, as datasets and clusters get bigger.But this fault tolerance is not for free.The materializing of intermediate results on local disks can have a bad impact on the performance.Nevertheless,such a programming model with this fault tolerance will become more preferable as data sets grow.On the other hand,parallel DBMSs have been developed for over two decades and are thus very mature.If the data is static,rather than dynamic,and the same data has to be processed 
multiple times,their overall performance is quite ing the well-known declarative language SQL,the user can state what he wants,while the parallelism,optimization and execution of the query is hidden.Nevertheless,parallel DBMSs reach their limitations when the input data does not properlyfit into the relational paradigm(unstructured data such as plain text or content of a html page).Both approaches have their advantages and disadvantages.It is up to the user to decide which solution is when useful for a specific problem. References[DG08]Jeffrey Dean and Sanjay Ghemawat.MapReduce:simplified data processing on large munications of the ACM,1251(1):107–113,2008.[DG10]Jeffrey Dean and Sanjay Ghemawat.MapReduce:aflexible data processing munications of the ACM,2010.[had]Hadoop./.[ORS08]Christopher Olston,Benjamin Reed,and U Srivastava.Pig latin:a not-so-foreign language for data processing....on Management ofdata,2008.[PPR09]Andrew Pavlo,Erik Paulson,and Alexander Rasin.A comparison of approaches to large-scale data analysis....on Management of data,pages165–178,2009.[SAD+10]Michael Stonebraker,Daniel Abadi,J.Dewitt DeWitt,Sam Madden, Erik Paulson,Andrew Pavlo,and Alexander Rasin.MapReduce andparallel DBMSs:friends or foes?Communications of the ACM,2010.[ver]Vertica./.13。