
On Mitigating TCP Incast in Data Center Networks

Yan Zhang, Student Member, IEEE, and Nirwan Ansari, Fellow, IEEE

Advanced Networking Laboratory, Department of Electrical and Computer Engineering,

New Jersey Institute of Technology, Newark, NJ 07102, United States

{yz45, nirwan.ansari}@njit.edu

Abstract—TCP Incast, also known as TCP throughput collapse, is a term used to describe a link-capacity under-utilization phenomenon in certain many-to-one communication patterns, typically in many datacenter applications. Prior works attribute the main root cause of TCP Incast to packet drops at the congestion switch that result in TCP timeouts. Congestion control algorithms have been developed to reduce or eliminate packet drops at the congestion switch. In this paper, the performance of Quantized Congestion Notification (QCN) with respect to the TCP Incast problem during data access from clustered servers in datacenters is investigated. QCN can effectively control link rates very rapidly in a datacenter environment. However, it performs poorly when TCP Incast is observed. To explain this low link utilization, we examine the rate fluctuation of different flows within one synchronous reading request, and find that the poor TCP throughput performance with QCN is due to the rate unfairness among different flows. Therefore, an enhanced QCN congestion control algorithm, called fair Quantized Congestion Notification (FQCN), is proposed to improve the fairness of multiple flows sharing one bottleneck link. We evaluate the performance of FQCN against that of QCN in terms of fairness and convergence with four simultaneous and eight staggered source flows. As compared to QCN, fairness is improved greatly and the queue length at the bottleneck link converges to the equilibrium queue length very fast. The effects of FQCN on TCP throughput collapse are also investigated. Simulation results show that FQCN significantly enhances TCP throughput performance in a TCP Incast setup.

Index Terms—Data Center Networks (DCN), TCP Incast, TCP throughput collapse, Quantized Congestion Notification (QCN), congestion control, fairness

I. INTRODUCTION

Datacenters are becoming increasingly important in providing a myriad of services and applications to store, access, and process data. Datacenters are typically composed of storage devices, servers, and Ethernet switches that interconnect the servers. Data may be spread or striped across many servers for performance or reliability reasons. During data access from servers, the data need to pass through datacenter Ethernet switches. These switches typically have small buffers, in the range of 32-256 KB, which may overflow during congestion.
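To see why such small buffers matter, a back-of-the-envelope estimate (with hypothetical numbers, not taken from the paper) shows how quickly a shared output buffer fills when N senders transmit at line rate toward a single output of the same rate:

```python
# Rough illustration (hypothetical numbers): with N senders at line
# rate C converging on one output also of rate C, the output queue
# grows at (N - 1) * C, so a small shared buffer fills in microseconds.

def time_to_overflow_us(buffer_bytes, n_senders, link_gbps):
    """Time until the buffer overflows, in microseconds."""
    fill_rate = (n_senders - 1) * link_gbps * 1e9   # queue growth, bits/s
    return buffer_bytes * 8 / fill_rate * 1e6

# A 64 KB buffer fed by 9 senders on 1 Gbps links:
print(round(time_to_overflow_us(64 * 1024, 9, 1.0), 1))  # 65.5 (us)
```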

TCP throughput collapse was observed during synchronous data transfers in early parallel storage networks [1]. This phenomenon is referred to as TCP Incast, and is attributed to multiple senders overwhelming a switch buffer, resulting in TCP timeouts due to packet drops at the congestion switch. TCP Incast has also been observed by others in distributed cluster storage [2], MapReduce [3], and web-search workloads. The behavior of TCP Incast under a variety of conditions, such as buffer size, number of servers, and Server Request Unit (SRU) size, is characterized in [2]. Increasing the buffer size can delay the onset of Incast; however, for any particular switch configuration there is some maximum number of servers that can transmit simultaneously before throughput collapse occurs. By using empirical data to reason about the dynamic system of simultaneously communicating TCP entities, Chen et al. [4] tried to understand the dynamics of Incast. They proposed a quantitative model to account for some of the observed Incast behavior, and provided qualitative refinements to elicit plausible explanations for the other symptoms. Reducing the minimum retransmission timeout (RTO) from the default 200 ms to 200 μs significantly alleviates the problem, as shown in [4]. A similar idea was proposed in [5] to mitigate TCP Incast with fine-grained timers that facilitate sub-millisecond RTO values. However, most systems lack the high-resolution timers required for such low RTO values.
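The benefit of a smaller minimum RTO can be illustrated with a simple estimate (illustrative numbers, not from the paper): if one flow in a synchronized read suffers a single timeout, the whole read stalls for the RTO, so the effective goodput is roughly the data volume divided by the ideal transfer time plus the RTO:

```python
# Back-of-the-envelope illustration (hypothetical numbers): effect of
# the minimum RTO on synchronized-read goodput when the read completes
# only after one RTO-long stall.

def effective_goodput_mbps(data_bytes, link_gbps, rto_s):
    """Goodput when the read finishes one RTO after the ideal time."""
    ideal_s = data_bytes * 8 / (link_gbps * 1e9)  # loss-free transfer time
    return data_bytes * 8 / (ideal_s + rto_s) / 1e6

data = 8 * 256 * 1024   # e.g., 8 servers, 256 KB SRU each, 1 Gbps link
print(effective_goodput_mbps(data, 1, 200e-3))  # default 200 ms RTO: ~77 Mbps
print(effective_goodput_mbps(data, 1, 200e-6))  # 200 us RTO: ~988 Mbps
```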
Congestion control algorithms, such as Backward Congestion Notification (BCN) [6], Forward Explicit Congestion Notification (FECN) [7], the enhanced FECN (E-FECN) [8], and Quantized Congestion Notification (QCN) [9], [10], have been developed to reduce or eliminate packet drops at the congestion switch. It has been shown that BCN achieves only proportional fairness but not max-min fairness [11]. FECN and E-FECN can achieve perfect fairness, but their control message overhead is high. The QCN algorithm has been developed by the IEEE Data Center Bridging Task Group for inclusion in the IEEE 802.1Qau standard, to provide congestion control at the Ethernet layer, or Layer 2 (L2), in data center networks. It is therefore of great interest to investigate the effects of QCN on TCP throughput collapse in the datacenter environment.

In this paper, we first investigate the performance of QCN with respect to the TCP Incast problem during data access from clustered servers in datacenters. QCN can effectively control link rates very rapidly in a datacenter environment. However, it performs poorly when TCP Incast is observed. An enhanced QCN congestion control algorithm, called fair Quantized Congestion Notification (FQCN), is thus proposed to improve the fairness of multiple flows sharing the link capacity under QCN. We evaluate the performance of FQCN and compare it with that of QCN in terms of fairness and convergence with four simultaneous and eight staggered source flows. As compared to QCN, fairness is improved greatly and the queue length at the bottleneck link converges to the equilibrium queue length very fast. The effects of FQCN on TCP throughput collapse are also investigated. Simulation results show that FQCN significantly enhances TCP throughput performance in a TCP Incast setup.

This paper was presented as part of the Mini-Conference at IEEE INFOCOM 2011.

II. QCN

The QCN algorithm is composed of two parts, as shown in Fig. 1a: switch or congestion point (CP) dynamics, and rate limiter (RL) or reaction point (RP) dynamics. At the CP, the switch buffer attached to an oversubscribed link samples incoming packets and feeds the congestion severity level back to the source of each sampled packet. At the RP, the RL associated with a source decreases its sending rate based on the congestion feedback messages received from the CP, and increases its rate voluntarily to recover lost bandwidth and probe for extra available bandwidth.

A. The CP Algorithm

The goal of the CP is to maintain the buffer occupancy at a desired operating point, Q_eq. The CP samples incoming packets with a probability that depends on the severity of congestion, and computes the congestion measure F_b for each sampled packet. Fig. 1b shows the sampling probability as a function of |F_b|. F_b is calculated as follows:

F_b = -(Q_off + w * Q_δ)    (1)

where Q_off = Q - Q_eq, Q_δ = Q - Q_old = Q_a - Q_d, Q denotes the instantaneous queue size, Q_old denotes the instantaneous queue size at the last sampling time, Q_a and Q_d denote the numbers of arriving and departing packets between two consecutive sampling times, respectively, and w is a non-negative constant, taken to be 2 in the baseline implementation. F_b captures a combination of queue-size excess Q_off and rate excess Q_δ. Indeed, Q_δ is the derivative of the queue size, and equals the input rate less the output rate. Thus, when F_b is negative, either the buffer or the link or both are oversubscribed, and a congestion message containing the value of F_b, quantized to 6 bits, is sent back to the source of the sampled packet; otherwise, no feedback message is sent.

B. The RP Algorithm

The RP algorithm adjusts the sending rate in two ways: it decreases the sending rate based on the congestion feedback messages received from the CP, and it increases the sending rate voluntarily to recover lost bandwidth and to probe for extra available bandwidth.

Rate decreases: when a feedback message is received, the current rate (CR) and target rate (TR) are updated as follows:

TR = CR
CR = CR * (1 - G_d * |F_b|)    (2)

where the constant G_d is chosen to ensure that the sending rate cannot decrease by more than 50%, i.e., G_d * |F_b,max| = 1/2.

Rate increases: rate increases at the RL are driven by a Byte Counter (BC) and a Rate Increase Timer (RIT), and proceed in two phases: Fast Recovery (FR) and Active Increase (AI). In FR, the BC completes a cycle every time BC_THRESHOLD bytes are transmitted. After each cycle, the RL increases its rate to recover some of the bandwidth lost at the previous rate decrease episode; thus, the goal of the RP in FR is to rapidly recover the rate lost at the last rate decrease episode. After FR_THRESHOLD cycles (a constant chosen to be 5 cycles in the baseline implementation), the BC enters the AI state to probe for extra bandwidth in the path; in the AI phase, the BC completes a cycle every BC_THRESHOLD/2 bytes. The RIT performs the same function in time for low-rate flows, whose byte-counter cycles would otherwise be very long: in FR, the RIT completes FR_THRESHOLD cycles of T ms duration, and then enters the AI state, where each cycle is set to T/2 ms long.

The BC and RIT jointly determine rate increases at the RL. After a feedback message is received, they operate independently and execute their respective cycles of FR and AI. The BC and RIT determine the state of the RL, and the sending rate is updated as follows:

1) The RL is in FR if both the BC and the RIT are in FR. In this case, when either the BC or the RIT completes a cycle, TR remains unchanged while CR is updated as:

CR = (CR + TR) / 2    (3)

2) The RL is in AI if only one of the BC and the RIT is in AI. In this case, when either of them completes a cycle, TR and CR are updated as follows:

TR = TR + R_AI
CR = (CR + TR) / 2    (4)

3) The RL is in Hyper-Active Increase (HAI) if both the BC and the RIT are in AI. In this case, when the i-th cycle of either the BC or the RIT completes, TR and CR are updated as:

TR = TR + i * R_HAI
CR = (CR + TR) / 2    (5)

where R_HAI is the rate increase step in the HAI state. Note that the HAI state is entered only after HAI_THRESHOLD packets have been sent and 5T ms have passed since the last congestion feedback message was received. This doubly ensures that aggressive rate increases occur only after the RL has provided the network adequate opportunity to send rate decrease signals should there be congestion. This is vital for the stability of the QCN algorithm; although further optimization could improve its responsiveness, stability and simplicity are favored.
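The control laws above can be condensed into a toy sketch (the constants GD, R_AI, and the 6-bit |F_b,max| = 64 are assumptions in the spirit of the baseline parameters, not values taken from the standard, and the variable names are ours):

```python
# Toy sketch of the QCN control laws in this section; not the
# standard's pseudo-code, and all constants are assumed values.

W = 2              # weight w in Eq. (1), baseline value
Q_EQ = 16          # desired operating point Q_eq, in packets
GD = 1.0 / 128     # G_d, so that G_d * |F_b,max| = 1/2 for 6-bit F_b
R_AI = 5.0         # active-increase step (Mbps, assumed)

def compute_fb(q, q_old):
    """Eq. (1): F_b = -(Q_off + w * Q_delta); negative means congestion."""
    return -((q - Q_EQ) + W * (q - q_old))

def rate_decrease(cr, fb):
    """Eq. (2): TR remembers the pre-decrease rate, CR backs off."""
    return cr * (1 - GD * abs(fb)), cr        # new (CR, TR)

def fast_recovery(cr, tr):
    """Eq. (3): CR moves halfway back toward TR; TR unchanged."""
    return (cr + tr) / 2, tr

def active_increase(cr, tr):
    """Eq. (4): TR probes upward by R_AI, then CR averages toward it."""
    return (cr + tr + R_AI) / 2, tr + R_AI

print(compute_fb(q=30, q_old=24))        # -(14 + 2*6) = -26: congested
cr, tr = rate_decrease(1000.0, fb=-64)   # worst-case feedback halves CR
print(cr, tr)                            # 500.0 1000.0
for _ in range(5):                       # five FR cycles recover most of it
    cr, tr = fast_recovery(cr, tr)
print(round(cr, 1))                      # 984.4
```

The halving-toward-TR update is what makes FR recover most of the lost rate within a few cycles, which matches the algorithm's goal stated above.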

III. FQCN

One problem with the QCN congestion control algorithm is the rate unfairness among different flows sharing one bottleneck link. Fig. 2 shows the simulated source goodput of four simultaneous constant bit rate (CBR) source flows carrying User Datagram Protocol (UDP) traffic. In this simulation, each source node is linked to an edge switch, and all edge switches are linked to a single core switch. The capacity of each link is 10 Gbps. The propagation delay of each link is assumed to be 0.5 μs, and the processing delay of each node is assumed to be 1 μs. Each switch adopts the drop-tail queue mechanism. Note that the four sources achieve very different rates under the QCN algorithm, showing the unfairness of QCN. As analyzed above for TCP Incast, the total TCP goodput is limited by the slowest source flow. Therefore, the unfairness of QCN decreases the TCP throughput. In this section, we propose an enhancement of QCN, referred to as FQCN, to improve TCP throughput and thereby avoid or postpone the onset of TCP Incast. FQCN feeds QCN messages back to all flow sources whose sending rates exceed their share of the bottleneck link capacity. FQCN adopts the same RP algorithm as QCN; the main modification is at the congestion switch. The basic algorithm of fair QCN at the CP is summarized as follows:

• Almost all FQCN operations are the same as those in QCN.

• The switch monitors the queue length and the packet arrival rate of each flow.

• The switch calculates the main measure of the congestion severity level, F_b. If F_b < 0, a negative QCN feedback message is sent back to every flow source whose sending rate exceeds its share of the link capacity. For each QCN feedback message, the congestion parameter for flow i is calculated as:

F_b(i) = (A_i / Σ_{j∈Λ} A_j) * F_b

where A_i denotes the monitored packet arrival rate of flow i and Λ denotes the set of flows whose sending rates exceed their share of the link capacity.

Fig. 2: QCN performance of four simultaneous CBR sources: (a) throughput; (b) queue length at the congestion switch.
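Under one reading of the rule above (weighting the feedback by the arrival rates A_i is our assumption, based on the quantities the switch is said to monitor), the per-flow feedback can be sketched as:

```python
# Sketch of FQCN feedback apportionment at the congestion switch: a
# negative F_b is sent only to flows above their fair share, weighted
# here by their monitored arrival rates A_i (our assumption).

def fqcn_feedback(fb, arrival_rates, capacity):
    """Return {flow: F_b(i)} for flows above their fair share."""
    if fb >= 0:
        return {}                          # no congestion: no feedback
    fair_share = capacity / len(arrival_rates)
    over = {i: a for i, a in arrival_rates.items() if a > fair_share}
    total = sum(over.values())
    return {i: fb * a / total for i, a in over.items()}

rates = {"f1": 6.0, "f2": 2.0, "f3": 1.0, "f4": 1.0}   # Gbps, 10G link
print(fqcn_feedback(-20, rates, capacity=10.0))
# only f1 exceeds its 2.5 Gbps fair share, so it absorbs the whole F_b
```

Because only over-share flows are throttled, the under-share flows keep their rates and converge toward the fair allocation, which is the behavior FQCN is designed for.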

Fig. 3: FQCN performance of four simultaneous CBR sources: (a) throughput; (b) queue length at the congestion switch.

IV. SIMULATION RESULTS

A. FQCN Evaluation

We first evaluate the performance of FQCN in terms of fairness and convergence. We conducted two experiments, with four simultaneous and eight staggered source flows, respectively. Each source node is connected to an edge switch, and all edge switches are connected to a single core switch. The capacity of each link is 10 Gbps, the propagation delay of each link is 0.5 μs, and the processing delay of each node is assumed to be 1 μs. Each switch adopts the drop-tail queue mechanism. The switch buffer size is 150,000 bytes, so each switch buffer can hold 100 packets of 1500 bytes each. The equilibrium queue length is set to 16 packets in our simulation. The generated traffic is constant bit rate (CBR) carried by UDP over Ethernet, with one continuous CBR flow generated at each server node. For four simultaneous source flows, the simulation duration is 0.5 sec. For eight staggered source flows, the eight flows are started one after another at intervals of 0.1 sec; four of them end at 0.8 sec and the remaining four end at 1 sec. The simulation results for the former are shown in Fig. 3, and those for the eight staggered source flows are shown in Fig. 4. From the simulation results, it is obvious that the sources share the link capacity almost fairly. As compared to the results of QCN shown in Fig. 2, fairness is improved greatly, and the queue length at the bottleneck link needs only about 10 ms to converge to the equilibrium queue length.

Fig. 4: FQCN performance of eight staggered CBR sources: (a) throughput; (b) queue length at the congestion switch.

B. Effect of FQCN on TCP Incast

Our test application performs synchronized reads over TCP in ns-2 to model a typical striped file system data transfer operation, in order to test the effects of QCN and FQCN on the TCP Incast problem. The TCP throughput collapse for a synchronized reading application performed on the network with each congestion control algorithm is shown in Fig. 5. As a comparison, TCP goodput without any congestion control is also shown in the same figure. In the simulation, an SRU size of 256 KB is chosen to model a production storage system, the link capacity is 1 Gbps, the Round Trip Time (RTT) is 100 μs, the default minimum retransmission timeout (RTO) is 200 ms, and a TCP NewReno implementation with the default packet size of 1000 bytes is used. The buffer size of the bottleneck link is 64 KB, corresponding to a queue length of 64 packets. The equilibrium queue size is set to 22 packets for both the QCN and FQCN congestion control algorithms.

Fig. 5: TCP throughput for a synchronized reading application with congestion control algorithms.

QCN can effectively control link rates very rapidly in a datacenter environment. However, it performs poorly in TCP Incast. The performance of QCN in mitigating TCP Incast is presented as a reference in Fig. 5. The TCP goodput decreases as the number of servers increases, reducing to around 400 Mbps with 16 servers. On the other hand, FQCN is able to mitigate TCP Incast superbly without much degradation of the goodput, maintaining a high goodput of around 900 Mbps even with a large number of servers, as shown in Fig. 5.

The poor TCP throughput performance with QCN is attributed to the rate unfairness among different flows. To explain this low utilization, we examine the rate fluctuation of different flows within one synchronous reading request. At the very beginning, all source flows send packets at the link rate, resulting in rapid queue build-up at the switch; if the switch buffer overflows, packets are dropped, leading to TCP timeouts. Once QCN takes effect, it regulates the sending rates of the TCP flows and prevents further TCP timeouts from occurring. Thereafter, the fairness issue plays a big role in a synchronized reading, as variations in the QCN flow rates affect the overall performance. Fig. 6 shows the rate fluctuation of one synchronized reading over eight servers with QCN. The fastest two flows complete their transmissions at around 17.284 sec, while the slowest flow finishes the data transfer about 12 ms later. After the fastest flow finishes its data transfer, the sending rates of the other flows start to increase only slowly. During this rate increase period, the total throughput is smaller than the link capacity, thus wasting link bandwidth. The average time to complete one synchronous reading is around 27.7 ms, corresponding to 610 Mbps. We also examined the rates of all flows within one barrier-synchronized reading request under the FQCN congestion control algorithm, which facilitates fair sharing of the link capacity among all the source flows.
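The 610 Mbps figure can be sanity-checked from the completion time (a rough estimate assuming each of the eight servers contributes one 256 KB SRU per synchronized read, as in the incast setup above):

```python
# Rough consistency check: goodput implied by the average synchronized
# read completion time, assuming 8 servers each sending one 256 KB SRU.

def goodput_mbps(n_servers, sru_bytes, completion_s):
    """Application-level goodput of one barrier-synchronized read."""
    return n_servers * sru_bytes * 8 / completion_s / 1e6

print(round(goodput_mbps(8, 256 * 1024, 27.7e-3)))  # ~606, close to 610
```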

V. CONCLUSION

An enhanced QCN congestion control algorithm, called fair Quantized Congestion Notification (FQCN), has been proposed to improve the fairness of QCN and the throughput performance in barrier-synchronized reading applications. We have evaluated the performance of FQCN experimentally, and compared it with QCN in terms of fairness and convergence with simultaneous and staggered source flows. As compared to QCN, FQCN greatly improves fairness and facilitates quick convergence of the queue length at the bottleneck link to the equilibrium queue length. We have also investigated the effects of FQCN on TCP throughput collapse with simulations. The results show that FQCN significantly enhances TCP throughput performance with respect to TCP Incast.

Fig. 6: The rate fluctuation of one synchronized reading over eight servers with QCN.

REFERENCES

[1] D. Nagle, D. Serenyi, and A. Matthews, "The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage," in ACM/IEEE Conference on Supercomputing, Washington, DC, 2004.
[2] A. Phanishayee, E. Krevat, V. Vasudevan, D. G. Andersen, G. R. Ganger, G. A. Gibson, and S. Seshan, "Measurement and analysis of TCP throughput collapse in cluster-based storage systems," in the 6th USENIX Conference on File and Storage Technologies, San Jose, CA, 2008, pp. 1-14.
[3] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.
[4] Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph, "Understanding TCP Incast Throughput Collapse in Datacenter Networks," in Proc. of the 1st ACM Workshop on Research on Enterprise Networking, Barcelona, Spain, Aug. 2009, pp. 73-82.
[5] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller, "Safe and effective fine-grained TCP retransmissions for datacenter communication," in ACM SIGCOMM '09, Barcelona, Spain, Aug. 2009, pp. 303-314.
[6] D. Bergamasco, "Data Center Ethernet Congestion Management: Backward Congestion Notification," in IEEE 802.1 Meeting, Berlin, Germany, May 2005.
[7] J. Jiang, R. Jain, and C. So-In, "An Explicit Rate Control Framework for Lossless Ethernet Operation," in International Conference on Communications 2008, Beijing, China, May 19-23, 2008, pp. 5914-5918.
[8] C. So-In, R. Jain, and J. Jiang, "Enhanced Forward Explicit Congestion Notification (E-FECN) Scheme for Datacenter Ethernet Networks," in International Symposium on Performance Evaluation of Computer and Telecommunication Systems, Edinburgh, UK, Jun. 2008, pp. 542-546.
[9] M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M. Seaman, "Data Center Transport Mechanisms: Congestion Control Theory and IEEE Standardization," in the 46th Annual Allerton Conference, Illinois, USA, Sep. 2008, pp. 1270-1277.
[10] "QCN pseudo-code version 2.0," http://www.ieee802.org/1/files/public/docs2008/au-rong-qcn-serial-hai-pseudo-code%20rev2.0.pdf.
[11] J. Jiang and R. Jain, "Analysis of Backward Congestion Notification (BCN) for Ethernet in Datacenter Applications," in IEEE INFOCOM '07, Alaska, USA, May 2007, pp. 2456-2460.
[12] ns-2 Network Simulator, http://www.isi.edu/nsnam/ns/.
