Even years after the Fallacies of Distributed Computing were written down, applications still assume that they have unlimited bandwidth from the fabric. Hadoop, MapReduce, VMotion, etc. are a few examples that make this assumption, leading to problems like TCP Incast, congestion due to Elephants and Mice, and micro-bursts in the network.
The fabric provides various tools such as QCN, PFC, DCTCP, WRED, ECMP, switch buffers, etc. Yes, there are many, and many more are not covered here.
Does any one of them, or some combination, solve the problem of congestion? Not really. But a combination of a few can handle the problem better.
QCN, DCTCP, and WRED inform the source about congestion in the network through explicit or implicit control messages. QCN and DCTCP generate explicit control messages through QCN feedback and ECN respectively, while WRED informs the source implicitly by dropping a random packet so that the source infers congestion in the network. In all cases, the source cuts its sending rate, thereby reducing congestion. These mechanisms work only for Elephant flows, because Elephants live long enough in the network to react. Latency-sensitive Mice are too short-lived to react to these messages, so Incast caused by Mice, as in Hadoop, is not solved by QCN, DCTCP, or WRED. On the other hand, latency-sensitive Mice get more bandwidth when the Elephants cut their sending rate.
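To make the DCTCP reaction concrete, here is a minimal Python sketch of the sender-side window adjustment; the names (`alpha`, `g`) follow the DCTCP paper, not any real stack, and the numbers are made up for illustration. The key idea: the source keeps a running estimate of the fraction of ECN-marked packets and cuts its window in proportion, rather than halving it like classic TCP.

```python
# Minimal sketch of a DCTCP sender's reaction to ECN marks (illustrative;
# constants and names follow the DCTCP paper, not a real stack).

G = 1.0 / 16  # gain for the moving average of the marked fraction

def update_alpha(alpha, marked, total, g=G):
    """alpha <- (1 - g) * alpha + g * F, where F = fraction of marked packets."""
    frac = marked / total if total else 0.0
    return (1 - g) * alpha + g * frac

def react_to_ecn(cwnd, alpha):
    """cwnd <- cwnd * (1 - alpha / 2): a gentle, proportional cut."""
    return cwnd * (1 - alpha / 2)

alpha = update_alpha(alpha=0.0, marked=30, total=100)  # 30% of an RTT's packets marked
cwnd = react_to_ecn(cwnd=100.0, alpha=alpha)
print(f"alpha={alpha:.4f}, cwnd={cwnd:.1f}")  # vs. cwnd=50.0 if we halved like Reno
```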
PFC is another tool, where a local node informs its immediate neighbors (nbrs) about congestion on a queue through a flow control message (FCM). On receiving an FCM, a neighbor does the following for the COS on which the FCM was received:
- Stops sending
- Starts buffering
- Resumes sending only after a specific period of time
During the buffering period, if the neighbor itself starts experiencing congestion, it sends an FCM to its own neighbors, and so on. In that sense, PFC is hop-by-hop, and it takes a while for the source to learn about congestion in the network and cut its rate.
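A toy model of that per-COS pause behavior might look like the sketch below; the class and timer are mine, and note that real PFC expresses pause time in 512-bit quanta per IEEE 802.1Qbb, not seconds.

```python
import time

# Toy per-COS PFC state on the receiving neighbor (illustrative only; real
# PFC expresses pause time in 512-bit quanta per IEEE 802.1Qbb, not seconds).

class PfcPort:
    def __init__(self, num_cos=8):
        self.paused_until = [0.0] * num_cos              # when each COS may resume
        self.buffer = {cos: [] for cos in range(num_cos)}

    def on_fcm(self, cos, pause_seconds):
        """FCM received: stop sending on this COS for the given time."""
        self.paused_until[cos] = time.monotonic() + pause_seconds

    def send(self, cos, frame):
        if time.monotonic() < self.paused_until[cos]:
            self.buffer[cos].append(frame)               # starts buffering
            return False                                 # stops sending
        # Pause expired: flush what was buffered, then send the new frame.
        for queued in self.buffer[cos]:
            self._transmit(cos, queued)
        self.buffer[cos].clear()
        self._transmit(cos, frame)
        return True

    def _transmit(self, cos, frame):
        print(f"COS {cos}: sent {frame!r}")

port = PfcPort()
port.on_fcm(cos=3, pause_seconds=0.5)
port.send(3, "frame-1")   # buffered: COS 3 is paused
port.send(5, "frame-2")   # sent: other COS values are unaffected
```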
It should be noted that QCN, DCTCP, and WRED inform the source, whereas PFC is COS based. This COS-based model makes PFC prone to Head-of-Line (HOL) blocking. Take an example where application A (an Elephant) and application B (a Mouse) generate traffic with the same COS. With QCN, DCTCP, or WRED, only App-A gets throttled. With PFC, both App-A and App-B get throttled. In short, PFC is *not* suited for mixed Elephant and Mice traffic.
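A tiny comparison makes the HOL-blocking point concrete (hypothetical flows; both share COS 3):

```python
# Toy comparison: per-flow signaling (ECN) vs per-COS pausing (PFC).
# App-A is an Elephant, App-B is a Mouse; both use COS 3.

flows = {"App-A": {"cos": 3, "elephant": True},
         "App-B": {"cos": 3, "elephant": False}}

def ecn_throttle(flows):
    # ECN marks land on the flow causing congestion: only the Elephant slows.
    return [name for name, f in flows.items() if f["elephant"]]

def pfc_throttle(flows, paused_cos):
    # A PFC pause stops *everything* on the COS: the Mouse is HOL-blocked too.
    return [name for name, f in flows.items() if f["cos"] == paused_cos]

print("ECN throttles:", ecn_throttle(flows))     # ['App-A']
print("PFC throttles:", pfc_throttle(flows, 3))  # ['App-A', 'App-B']
```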
Unlike TCP, FCoE (or RoCE) does not retransmit lost packets, in order to keep the stack lightweight. Such lightweight protocols expect the fabric to provide lossless behavior, which is realized using PFC.
Switch buffering is another place to look when it comes to Incast and micro-bursts. BRCM-T2's smart-buffer performance numbers look great for multiple Incasts happening on the same switch at the same point in time, assuming a given Incast requests 4KB from each of, say, 25 servers.
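Putting rough numbers on that scenario (my arithmetic, not a vendor benchmark): each Incast dumps 25 × 4KB = 100KB into one egress queue, so a handful of simultaneous Incasts needs a few hundred KB of shared buffer:

```python
# Back-of-the-envelope buffer demand for simultaneous Incasts
# (illustrative arithmetic; not a vendor benchmark).

SERVERS_PER_INCAST = 25
REPLY_BYTES = 4 * 1024          # 4KB requested from each server

def buffer_needed(simultaneous_incasts):
    """Worst case: every reply of every Incast lands in the buffer at once."""
    return simultaneous_incasts * SERVERS_PER_INCAST * REPLY_BYTES

for n in (1, 4, 8):
    print(f"{n} Incast(s): {buffer_needed(n) / 1024:.0f} KB of buffer")
# 1 Incast(s): 100 KB ... 8 Incast(s): 800 KB
```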
Interestingly, cloud vendors build robust L3 fabrics with many ECMP paths using a leaf-spine architecture. By throwing more bandwidth at the leaf-spine layer and having many ECMP paths, one can avoid complex QoS at that layer. One might still need to handle Incast and bursts at the access layer; measurements in data centers have shown that bursts are observed only at the access layer, not at the leaf-spine layer.
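For context, ECMP typically hashes the 5-tuple so that all packets of a flow pin to one path while different flows spread across the spines; a minimal sketch follows (hashlib stands in for the switch's hardware hash):

```python
import hashlib

# Minimal ECMP sketch: hash the 5-tuple, pick one of N equal-cost next hops.
# Real switches use hardware hash functions; hashlib here is illustrative.

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return next_hops[digest % len(next_hops)]  # same flow -> same path

spines = ["spine-1", "spine-2", "spine-3", "spine-4"]
print(ecmp_next_hop("10.0.0.1", "10.0.1.9", 33333, 80, "tcp", spines))
```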
To summarize, one could start off with:
1) Build an L3 fabric with many ECMP paths by throwing more bandwidth at leaf and spine (LS). This takes care of the LS part of the fabric.
2) At the access layer:
- Have smart-buffer capability.
- Enable PFC/DCTCP on the ports facing the fabric to handle micro-bursts. It should be noted that PFC will not create HOL blocking here, since the FCM is sent directly to the server.
But we are still left to deal with Mice-Incast at the access layer, as Mice do not live long enough to react to any of the above techniques. Possibly, SDN could be of help here.
Today, SDN helps to detect Elephants. A few examples:
- Classify Elephants & Mice using DSCP
- Handshake between Underlay & Overlay
- A priori learning of VMotion
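As a sketch of the first approach, an SDN controller (or a vSwitch agent) could simply count bytes per flow and reclassify a flow as an Elephant once it crosses a threshold, then re-mark it via DSCP; the threshold and names below are hypothetical:

```python
from collections import defaultdict

# Hypothetical SDN-style Elephant detector: a flow that has moved more than
# ELEPHANT_BYTES gets classified as an Elephant and could be re-marked via DSCP.

ELEPHANT_BYTES = 10 * 1024 * 1024   # 10MB threshold, an assumption

class ElephantDetector:
    def __init__(self):
        self.bytes_seen = defaultdict(int)

    def on_packet(self, flow, size):
        self.bytes_seen[flow] += size
        return "elephant" if self.bytes_seen[flow] > ELEPHANT_BYTES else "mouse"

det = ElephantDetector()
flow = ("10.0.0.1", "10.0.1.9", 33333, 80)
for _ in range(20):
    kind = det.on_packet(flow, 1_000_000)   # 1MB per step
print(kind)  # 'elephant' once past the 10MB threshold
```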
Can SDN help to detect Mice? Perhaps, yes. It would be good to know how Cisco's ACI could be of help here.