
Friday, June 20, 2014

DC QoS: A Study.

Even years after the Fallacies of Distributed Computing were set down, applications still assume that the fabric provides them unlimited bandwidth. Applications such as Hadoop, map-reduce, VMotion, etc. are a few examples that make such assumptions, leading to problems like TCP Incast, congestion due to Elephant and Mice flows, and micro-bursts in the network.

The fabric provides various tools like QCN, PFC, DCTCP, WRED, ECMP, switch buffers, etc. Yes, there are that many, and many more are not covered here.

Does any one of them, or a combination of a few, solve the problem of congestion? Not really. But a combination of a few can handle the problem better.

QCN, DCTCP and WRED inform the source about congestion in the network through explicit or implicit control messages. QCN and DCTCP generate explicit control messages through QCN feedback and ECN respectively, while WRED informs the source implicitly by dropping a random packet so that the source can figure out there is congestion in the network. In all cases, the source cuts down its sending rate, thereby reducing the congestion in the network. These mechanisms work only for Elephant flows, as they live long enough in the network to react. Latency-sensitive Mice live for too short a time to react to these messages. So, Incast caused by Mice-like workloads such as Hadoop is not solved by QCN, DCTCP or WRED. On the other hand, latency-sensitive Mice get more bandwidth when the Elephants cut down their sending rate.
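
For intuition, below is a minimal sketch of the random-drop (or ECN-mark) decision WRED makes at a congested queue. The thresholds are made-up numbers, and real implementations work on an EWMA-averaged queue depth with per-COS profiles.

```python
import random

# Hypothetical WRED profile: thresholds and max probability are assumptions.
MIN_TH = 20     # avg queue depth (packets) below which nothing is dropped/marked
MAX_TH = 80     # avg queue depth at/above which everything is dropped/marked
MAX_P  = 0.10   # drop/mark probability as the depth approaches MAX_TH

def wred_decision(avg_queue_depth: float) -> bool:
    """Return True if this packet should be dropped (or ECN-marked)."""
    if avg_queue_depth < MIN_TH:
        return False
    if avg_queue_depth >= MAX_TH:
        return True
    # Probability ramps linearly between the two thresholds.
    p = MAX_P * (avg_queue_depth - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < p
```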

PFC is another tool, where the local node informs its immediate neighbors (nbrs) about congestion on a queue through a flow control message (FCM). On receiving an FCM, the nbrs do the following for the COS on which the FCM was received:
- Stop sending
- Start buffering
- Start sending again only after a specific period of time

During the buffering period, if the nbr itself starts experiencing congestion, it sends an FCM to its own nbrs, and so on. In that sense, PFC is hop-by-hop, and it takes a while for the source to learn about congestion in the network and cut down its rate.
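
As a rough illustration of that per-COS behavior, here is a toy model of a sender port honoring FCMs. The class name, timing units and draining logic are simplifications; real PFC expresses pause time in 512-bit quanta.

```python
import time
from collections import defaultdict, deque

class PfcPort:
    """Toy model of a port that honours per-COS pause (FCM) messages."""

    def __init__(self):
        self.paused_until = defaultdict(float)  # per-COS time at which sending may resume
        self.backlog = defaultdict(deque)       # per-COS frames buffered while paused

    def on_fcm(self, cos: int, pause_seconds: float):
        # FCM from the downstream neighbour: stop sending on this COS only.
        self.paused_until[cos] = time.monotonic() + pause_seconds

    def send(self, cos: int, frame):
        if time.monotonic() < self.paused_until[cos]:
            self.backlog[cos].append(frame)     # buffer while the COS is paused
            return
        while self.backlog[cos]:                # drain the backlog before new traffic
            self.transmit(cos, self.backlog[cos].popleft())
        self.transmit(cos, frame)

    def transmit(self, cos: int, frame):
        print(f"tx cos={cos} frame={frame!r}")
```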

It should be noted that QCN, DCTCP and WRED inform the source, whereas PFC is COS based. This COS-based model makes PFC prone to HOL blocking. Take an example where application A (Elephant) and application B (Mice) generate traffic with the same COS. With QCN, DCTCP or WRED, only App-A gets throttled. With PFC, both App-A and App-B get throttled. In short, PFC is *not* suited for Elephant and Mice.

Unlike TCP, FCoE (and RoCE) do not support re-transmission of lost packets, in order to keep the stack light-weight. Such light-weight protocols expect the fabric to provide lossless behavior, which is realized using PFC.

Switch buffering is another place to look when it comes to Incast and micro-bursts. BRCM-T2's smart-buffer performance numbers look great for multiple Incasts happening on the same switch at the same point in time, assuming a given Incast requests 4KB from each of, say, 25 servers.
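
To put a rough number on that example: each Incast converges about 25 × 4 KB = 100 KB at the congested port at nearly the same instant, and concurrent Incasts multiply that demand on the shared buffer. The concurrent-Incast count below is an assumption, purely for illustration.

```python
# Back-of-the-envelope buffer demand for the Incast example above.
servers_per_incast = 25
bytes_per_server   = 4 * 1024        # 4 KB response from each server
concurrent_incasts = 4               # assumed number of simultaneous Incasts

per_incast_burst = servers_per_incast * bytes_per_server   # 102,400 B ~= 100 KB
total_burst      = per_incast_burst * concurrent_incasts   # ~400 KB

print(f"per-Incast burst: {per_incast_burst / 1024:.0f} KB")
print(f"total burst     : {total_burst / 1024:.0f} KB")
```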

Interestingly, cloud vendors build robust L3 fabrics with a large number of ECMP paths using a leaf-spine architecture. By throwing more bandwidth at the leaf-spine layer and having many ECMP paths, one could get rid of complex QoS at that layer. One might still need to handle Incast and bursts at the Access layer. Measurements in DCs have shown that bursts are observed only at the Access layer and not at the leaf-spine layer.

To summarize, one could start off with:
1) Build an L3 fabric with a large number of ECMP paths by throwing more bandwidth at leaf & spine (LS). This takes care of the LS part of the fabric.
2) At the Access layer,
    - Have smart-buffer capability.
    - Enable PFC/DCTCP on ports facing the fabric to handle micro-bursts. It should be noted that PFC will not create HOL blocking here, as the FCM will be sent directly to the server.

But we are still left to deal with Mice Incast at the Access layer, as Mice do not live long enough to react to any of the above techniques. Possibly, SDN could be of help here.

Today, SDN helps to detect Elephants. A few examples are:
- Classifying Elephants & Mice using DSCP (see the sketch after this list)
- Handshake between Underlay & Overlay
    - A priori learning of VMotion
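
A minimal sketch of the first item, DSCP-based tagging at the edge: track per-flow bytes and stamp an "elephant" codepoint once a flow crosses a threshold, so the fabric can queue it separately. The threshold and both DSCP values are assumptions, and a real vSwitch would also age flows out.

```python
from collections import defaultdict

ELEPHANT_BYTES = 10 * 1024 * 1024   # assumed: flows past 10 MB count as Elephants
DSCP_MICE      = 0                  # best effort
DSCP_ELEPHANT  = 8                  # e.g. a bulk/scavenger class

flow_bytes = defaultdict(int)       # keyed by the flow 5-tuple

def classify_and_mark(five_tuple, packet_len: int) -> int:
    """Return the DSCP value to stamp on this packet."""
    flow_bytes[five_tuple] += packet_len
    if flow_bytes[five_tuple] >= ELEPHANT_BYTES:
        return DSCP_ELEPHANT
    return DSCP_MICE
```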

Can SDN help to detect Mice? Perhaps, yes. It would be good to know how Cisco's ACI could help here.

Thursday, June 12, 2014

Short and Long Flows (a.k.a. Mice and Elephants): A Primer

This post is in Q&A format. 

What are Elephant and Mice flows?
Elephants and Mice are flows classified based on their size and duration. Elephants are bigger in size and stay for a longer duration of time. Mice are smaller in size and stay only for a few seconds (in fact, milliseconds).

Are there any defined numbers for size and duration?
Today, it is left to the operator to decide these numbers based on their network, but there are a few texts that define them.

In one text, a flow is considered short if its size is < 10KB and it stays for only a few hundred milliseconds; anything above that is considered a large flow.
In another text, a flow is considered large if it occupies >= 10% of link bandwidth for >= 1 sec; the rest are Mice flows.
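
Here is a sketch of that second definition as a simple classifier. The link speed and the worked example are assumptions for illustration.

```python
# Sketch: "large" means >= 10% of link bandwidth held for >= 1 second.
LINK_BPS       = 10 * 10**9    # assumed 10 Gb/s link
RATE_FRACTION  = 0.10          # 10% of link bandwidth
MIN_DURATION_S = 1.0           # sustained for at least one second

def is_elephant(bytes_sent: int, duration_s: float) -> bool:
    if duration_s < MIN_DURATION_S:
        return False           # too short-lived: a Mouse by this definition
    avg_bps = bytes_sent * 8 / duration_s
    return avg_bps >= RATE_FRACTION * LINK_BPS

# 200 MB moved in 1.2 s ~= 1.33 Gb/s on a 10 Gb/s link -> Elephant
print(is_elephant(200 * 10**6, 1.2))   # True
```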

What are the characteristics of Elephants and Mice?
Typically, Elephants are throughput sensitive whereas Mice are delay sensitive. Given that Elephants stay longer, it makes sense to detect them and do something about them for better throughput. On the other hand, Mice cease to exist even before they are detected.

Also, Elephants occupy a lot of resources in the network in terms of buffers and queues, leading to buffer starvation for Mice. This hurts the delay-sensitive nature of Mice.

What is the ratio between Elephants and Mice in a DC?
According to measurements done in DCs:
- 80% of flows are Mice and the remaining are Elephants
- Most of the bytes come from the top 10% of large flows.

What are examples of Elephants and Mice?
Elephants: File transfers, VMotions, Video Streams, DDoS packets, etc.
Mice: Map-reduce applications, Request-Response protocols like HTTP, etc.

How are Elephants detected and mitigated?
There are multiple solutions available in the market today. A few are discussed below.
1) The vSwitch (OVS) detects Elephants at the edge. The edge is a great place to detect them due to its proximity to applications, and it can detect them more accurately. Once they are detected at the edge, various mitigations can be employed:
    - Use OVSDB to inform the underlay about the new Elephant flow, so this becomes more of an on-the-fly mechanism.
    - Use different VxLAN IDs or IP addresses to traffic-engineer Elephant flows.

2) InMon and Brocade got together to detect DDoS using tools like sFlow and OpenFlow. The Brocade switch exports sFlow samples to InMon's sFlow-RT module. sFlow-RT detects Elephants based on the samples received and sends a signature of the attack to a mitigation application (SDN app). The SDN app installs OF rules onto the switch to stop such attacks.

3) DCTCP is another tool to handle Elephants and Mice better. It leverages the fact that Elephants live long enough to react to ECN. The destination DCTCP uses ECN to detect congestion in the network and informs the source DCTCP about it. The source DCTCP reacts by reducing its window by a factor that depends on the fraction of ECN-marked packets.
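
Concretely, the DCTCP sender keeps a running estimate alpha of the fraction of ECN-marked packets and cuts its window by alpha/2 instead of TCP's blanket halving. A minimal sketch of that sender-side reaction, with the usual weight g = 1/16 and the per-window bookkeeping simplified:

```python
G = 1.0 / 16   # weight for the moving average of the marked fraction

class DctcpSender:
    def __init__(self, cwnd: float):
        self.cwnd = cwnd
        self.alpha = 0.0   # running estimate of the fraction of ECN-marked packets

    def on_window_end(self, acked: int, ecn_marked: int):
        frac = ecn_marked / acked if acked else 0.0
        self.alpha = (1 - G) * self.alpha + G * frac
        if ecn_marked:
            # Cut in proportion to how much of the window saw congestion,
            # rather than halving the window as plain TCP would.
            self.cwnd *= (1 - self.alpha / 2)
```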

What are the differences among the various approaches described?
I don't see much of a difference between #1 & #2 except that the tools used are different. Probably, one plus of the sFlow-based approach is with physical servers connected to ToRs. Most ToRs today support sFlow at the HW level but don't run OVS. If there are Elephants originating from those physical servers (say, storage replication), using the sFlow-based approach would be a plus.

#3 is completely different from #1 & #2. It uses a form of WRED and requires support from the edge in the form of DCTCP. DCTCP is probably more reactive, and the detection happens deep in the network. I would choose #1 or #2 over #3.

Can Elephants be detected a priori?
Elephants cannot be detected a priori in all cases, but in a few cases an Elephant can be detected based on a control message. One such example is VMotion. A priori detection of an Elephant would help traffic-engineer flows better.
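
Purely as an illustration of that idea: an SDN app could subscribe to the vMotion start notification and pre-program the fabric for the coming flow before the bulk transfer begins. Every event field and controller call below is hypothetical, standing in for whatever notification bus and southbound API a real deployment exposes.

```python
def on_vmotion_start(event, controller):
    # The control message reveals the flow before any bulk data moves.
    flow = {
        "src_ip":   event["source_host"],
        "dst_ip":   event["dest_host"],
        "proto":    "tcp",
        "dst_port": 8000,          # vMotion's usual port; adjust per deployment
    }
    # Pre-install a traffic-engineered path/queue for the expected Elephant.
    controller.install_path(flow, queue="bulk", path="least-utilized")
```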