Sunday, June 22, 2014

DC Traffic Types and Their Provisioning

North-South & East-West.
Data center traffic is classified into North-South (NS) and East-West (EW). EW traffic is machine-to-machine traffic that stays within a DC. NS traffic is machine-to-user or machine-to-machine traffic (inter-DC and hybrid clouds are a few examples) that traverses the WAN edge, either to the internet or over a VPN (L2VPN, L3VPN, MPLS VPN or OTV).



Pic Courtesy: Facebook's NANOG Talk

According to Facebook's NANOG talk in 2013:
    - EW constitutes about 95% of DC traffic; NS takes the remaining 5%.
    - EW traffic is growing at an exponential rate, while NS traffic has largely stagnated.
    - For every 2 bytes of data generated by NS traffic, there are a corresponding 98 bytes generated by EW traffic.

These numbers clearly indicate that EW traffic has to be well engineered for good performance.

East-West Traffic Types
EW traffic can be classified into:
    - Tenant Traffic
    - Infrastructure Traffic.

Tenant traffic is between the VMs. Infrastructure traffic consists of Management, Storage & VMotion. Typically, Storage & VMotion are high bandwidth, and Storage is latency sensitive. Management is low bandwidth but is required for managing compute, network and storage nodes.

Tenant traffic is virtualized using protocols like VxLAN, NvGRE, MPLSoGRE (Contrail/Nuage), etc. Infrastructure traffic is not virtualized, as it runs between the hypervisors and there is little to gain from doing so.

North-South Traffic Types
NS traffic arises from:
    - The inter-DC case
    - Traffic to & from the internet

Inter-DC is the case where communicating VMs are located in different DCs, or one of them is located in a public cloud. Usually there is a VPN connection (L2VPN, L3VPN, MPLS VPN or OTV) between the DCs.

Users talk to a webserver through the internet. This traffic goes through a firewall and gets NAT'ed. A few hybrid clouds also use this model.

VLAN Provisioning
VLANs are provisioned at the vSwitch. Each traffic type is placed in a different VLAN and allocated bandwidth based on its requirements. If the fabric is L3, all VLANs get terminated at the ToR, with the exception of the Edge VLAN, which is plumbed from the vSwitch to the WAN edge across all intermediate networking nodes.

All NS traffic is placed in the Edge VLAN. Of course, the Edge VLAN is not required when an MPLSoGRE-based solution is used; Contrail and Nuage are MPLSoGRE based.

Each EW traffic type is placed in a different VLAN on the vSwitch. As VLANs get terminated at the ToR, EW traffic is routed from then on. Yes, VMotion, Storage, Management & Tenant traffic will go over a routed network. Even though tenant traffic shares a single VLAN, its virtualized (overlay) nature provides tenant isolation.

Summary
A server running VMware vSphere as the hypervisor, with a single vSwitch and 4 NICs attached, will have an IP and VLAN created (using the VMkernel option) for each of:
    - Management
    - Storage
    - VMotion
    - Tenant (NVO'ed)
    - Edge VLAN. 

If the fabric is L3, the default GW would be the first-hop ToR, and all traffic would be routed from then on. The Edge VLAN is for NS traffic and is plumbed from the vSwitch to the WAN edge.
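A minimal sketch of such a per-traffic-type plan (in Python, with purely hypothetical VLAN IDs and bandwidth shares) helps show that only the Edge VLAN is carried beyond the ToR:

```python
# Hypothetical per-traffic-type VLAN plan for the vSphere setup described above.
# VLAN IDs and bandwidth shares are illustrative only, not recommendations.
VLAN_PLAN = {
    "Management": {"vlan": 10, "bw_share": 0.05, "terminates_at": "ToR"},
    "Storage":    {"vlan": 20, "bw_share": 0.35, "terminates_at": "ToR"},
    "VMotion":    {"vlan": 30, "bw_share": 0.25, "terminates_at": "ToR"},
    "Tenant":     {"vlan": 40, "bw_share": 0.30, "terminates_at": "ToR"},       # NVO'ed (VxLAN etc.)
    "Edge":       {"vlan": 50, "bw_share": 0.05, "terminates_at": "WAN edge"},  # plumbed end to end
}

def plumbed_beyond_tor(plan):
    """Return the VLANs that must be carried past the ToR (only Edge here)."""
    return [name for name, attrs in plan.items() if attrs["terminates_at"] == "WAN edge"]

print(plumbed_beyond_tor(VLAN_PLAN))   # ['Edge']
```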

According to the NSX design guide, the fabric design should be approved by VMware for VMotion to be supported over L3.

Friday, June 20, 2014

DC QoS: A Study.

Even years after the Fallacies of Distributed Computing were formulated, applications still assume that the fabric provides them with unlimited bandwidth. Hadoop, map-reduce, VMotion, etc. are a few examples of applications that make such assumptions, leading to problems like TCP incast, congestion due to elephant and mice flows, and micro-bursts in the network.

The fabric provides various tools like QCN, PFC, DCTCP, WRED, ECMP, switch buffers, etc. Yes, that is a lot, and many more are not covered here.

Does any one of them, or a combination of a few, solve the problem of congestion? Not really. But a combination of a few can handle the problem better.

QCN, DCTCP and WRED inform the source about congestion in the network through explicit or implicit control messages. QCN and DCTCP generate explicit control messages through QCN feedback and ECN respectively, while WRED informs the source implicitly by dropping a random packet, leaving the source to figure out that there is congestion in the network. In all cases, the source cuts down its sending rate, thereby reducing congestion. These mechanisms work only for elephant flows, as they live long enough in the network; latency-sensitive mice live for too short a time to react to these messages. So incast caused by mice (as in Hadoop) is not solved by QCN, DCTCP or WRED. On the other hand, latency-sensitive mice get more bandwidth when elephants cut down their sending rate.
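As a rough illustration of the implicit signal, here is a minimal WRED sketch in Python; the thresholds and maximum drop probability are made-up numbers, and the count-based probability adjustment that real implementations use is omitted:

```python
import random

def wred_should_drop(avg_queue_kb, min_th_kb=100, max_th_kb=400, max_p=0.1):
    """Minimal WRED sketch: drop probability rises linearly from 0 at min_th
    to max_p at max_th; above max_th every packet is dropped (tail drop).
    The dropped packet is the implicit congestion signal the TCP source sees."""
    if avg_queue_kb < min_th_kb:
        return False
    if avg_queue_kb >= max_th_kb:
        return True
    p = max_p * (avg_queue_kb - min_th_kb) / (max_th_kb - min_th_kb)
    return random.random() < p
```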

PFC is another tool, where a local node informs its immediate neighbors (nbrs) about congestion on a queue through a flow control message (FCM). On receiving an FCM, a neighbor does the following for the CoS on which the FCM was received:
- Stops sending
- Starts buffering
- Resumes sending only after a specific period of time

During the buffering period, if the neighbor itself starts experiencing congestion, it sends an FCM to its own neighbors, and so on. In this sense, PFC works hop by hop, and it takes a while for the source to learn about congestion in the network and cut down its rate.
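A toy simulation of that hop-by-hop back-pressure (a sketch with made-up hop counts, thresholds and rates) could look like this:

```python
def pfc_backpressure(hops=4, xoff_kb=100, ingress_kb_per_tick=40, ticks=10):
    """Toy sketch of PFC back-pressure on one priority.
    The hop next to the congested egress is paused first; a paused hop keeps
    buffering arriving traffic, and once its own buffer crosses the XOFF
    threshold it sends an FCM upstream, pausing the next hop toward the source."""
    buffers = [0.0] * hops
    paused = [False] * hops
    paused[-1] = True                     # congestion starts at the last hop
    for _ in range(ticks):
        for i in reversed(range(hops)):
            if paused[i]:
                buffers[i] += ingress_kb_per_tick       # paused hop buffers instead of sending
                if i > 0 and buffers[i] >= xoff_kb:
                    paused[i - 1] = True                # FCM sent one hop upstream
    return paused

print(pfc_backpressure())   # [True, True, True, True] once the pause has spread to the source side
```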

It should be noted that QCN, DCTCP and WRED inform the source, whereas PFC is CoS based. This CoS-based model makes PFC prone to head-of-line (HOL) blocking. As an example, say application A (an elephant) and application B (a mouse) generate traffic with the same CoS. With QCN, DCTCP or WRED, only App-A gets throttled; with PFC, both App-A and App-B get throttled. In short, PFC is *not* suited for elephant-and-mice traffic.

Unlike TCP, FCoE (and RoCE) do not support retransmission of lost packets, in order to keep the stack lightweight. Such lightweight protocols expect the fabric to provide lossless behavior, which is realized using PFC.

Switch buffers are another place to look when it comes to incast and micro-bursts. BRCM Trident2's smart-buffer performance numbers look great for multiple incasts happening on the same switch at the same point in time, assuming a given incast requests 4KB from each of, say, 25 servers.
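Rough buffer arithmetic for that example (the number of simultaneous incasts is an assumed figure):

```python
# One incast event: 25 servers each return a 4 KB response at nearly the same time.
per_incast_kb = 25 * 4                 # ~100 KB landing on a single egress port
concurrent_incasts = 8                 # assumed number of simultaneous incasts on the switch
print(per_incast_kb * concurrent_incasts, "KB of shared buffer needed")   # 800 KB
```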

Interestingly, cloud vendors build robust L3 fabrics with many ECMP paths using a leaf-spine architecture. By throwing more bandwidth at the leaf-spine layer and having many ECMP paths, one can get rid of complex QoS at that layer. One might still need to handle incast and bursts at the access layer; measurements in DCs have shown that bursts are observed only at the access layer and not at the leaf-spine layer.

To summarize, one could start off with:
1) Build an L3 fabric with many ECMP paths by throwing more bandwidth at leaf & spine (LS). This takes care of the LS part of the fabric.
2) At the access layer:
     - Have smart-buffer capability.
     - Enable PFC/DCTCP on ports facing the fabric to handle micro-bursts. It should be noted that PFC will not create HOL blocking here, as the FCM is sent directly to the server.

But we are still left to deal with mice incast at the access layer, as mice do not live long enough to react to any of the above techniques. Possibly, SDN could be of help here.

Today, SDN helps to detect elephants. A few examples:
- Classify elephants & mice using DSCP
- Handshake between underlay & overlay
   - A priori learning of VMotion

Can SDN help to detect mice? Perhaps, yes. It would be good to know how Cisco's ACI could be of help here.

Sunday, June 15, 2014

Geneve: A Network Virtualization Encapsulation With A Difference

Before Geneve, each vendor had its own virtualization encapsulation, like VxLAN, STT, NvGRE, MPLSoGRE, DOVE, etc. Though there were a few pros & cons associated with each of these encapsulations (encaps), it was believed that the encap does not matter, as it is only used to transport a VM's packet from A to B. Geneve moves away from this premise and also tries to bring all vendors toward a common encap.

To summarize, Geneve addresses:
    1) A common encap that subsumes all virtualization tunneling protocols.
    2) The position that the encap does matter: it does more than just transport the packet.


From the pictures above, it is quite obvious that Geneve adds a few more fields to the VxLAN header. The fields important to this discussion are:
    1) The O & C bits
    2) Protocol Type
    3) Variable-Length Options

Common Encapsulation:
Geneve attempts to bring all vendors under one umbrella with respect to encap. This helps in a lot of ways:
    - DC operators would have a single encap in a heterogeneous hypervisor environment.
    - Vendors such as NVO GWs, NIC vendors, firewall systems, IDSes, etc. have to support only one encap.
    - Life becomes much easier from a monitoring and debugging perspective.

Protocol Type indicates the EtherType of the payload that appears after the Geneve header. The EtherType can be any value defined by IEEE, which provides the ability to encapsulate any protocol that has an EtherType. With VxLAN, it is possible to encapsulate only an Ethernet frame.
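To make the header layout concrete, here is a minimal Python sketch that packs the fixed 8-byte Geneve header as described in the draft (0x6558 is the EtherType for Transparent Ethernet Bridging, i.e. an inner Ethernet frame); treat it as an illustration, not a reference implementation:

```python
import struct

def geneve_header(vni, protocol_type=0x6558, oam=False, critical=False, options=b""):
    """Pack the fixed Geneve header (version 0):
       Ver(2) | Opt Len(6) | O | C | Rsvd(6) | Protocol Type(16) | VNI(24) | Rsvd(8)
    Opt Len counts the variable-length options in 4-byte words."""
    assert len(options) % 4 == 0, "options must be a multiple of 4 bytes"
    byte0 = (0 << 6) | (len(options) // 4)           # Ver = 0, Opt Len
    byte1 = (int(oam) << 7) | (int(critical) << 6)   # O and C bits, Rsvd = 0
    return struct.pack("!BBHI", byte0, byte1, protocol_type, vni << 8) + options

hdr = geneve_header(vni=5001)
print(len(hdr), hdr.hex())   # 8 0000655800138900
```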

Encapsulation Matters:
Traditional protocols carry metadata only on the control plane, not on the data plane, whereas Geneve carries metadata on the data plane. This is where Geneve thinks differently. It views the DC as a single large distributed router with each hypervisor as a line card. It is typical of such distributed systems to carry metadata on the data plane between line cards; Geneve rightly does the same for network virtualization.

Metadata is immensely powerful for things like service chaining, storing context for firewalls, etc. Metadata can do for network virtualization something like what communities and attributes did for BGP.

Interestingly, Geneve does not define this metadata but leaves it to vendors to define their own using the variable-length options field.
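Each option is just a TLV. Below is a hedged sketch of packing one, based on the draft's option layout; the option class 0xFFFF and the "tenant context" payload are made-up values for illustration:

```python
import struct

def geneve_option(opt_class, opt_type, data=b""):
    """Pack one Geneve variable-length option (TLV):
       Option Class(16) | Type(8) | Rsvd(3) + Length(5)
    Length counts the option data in 4-byte words, excluding this 4-byte header."""
    assert len(data) % 4 == 0, "option data must be a multiple of 4 bytes"
    return struct.pack("!HBB", opt_class, opt_type, len(data) // 4) + data

# Hypothetical vendor metadata: option class 0xFFFF carrying a 4-byte tenant context.
opt = geneve_option(0xFFFF, 0x01, struct.pack("!I", 42))
print(len(opt), opt.hex())   # 8 ffff01010000002a
```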

Backward Compatibility:
Geneve is not backward compatible (from a data-plane perspective) with any of the NVO encaps, including VxLAN. Perhaps VxLAN can be made forward compatible with Geneve.

So, what is the future of VxLAN and its support by various HW, SW & NIC vendors? Bruce explains it here.

Remaining Headers:
O is the OAM bit; when set, the packet is consumed by the tunnel endpoint rather than forwarded to the tenant. This is along the lines of the VxLAN router alert. One could possibly implement BFD using the O bit and the options provided.

C is the critical-options bit; when set, the options must be processed by the tunnel endpoint before the packet is forwarded to the tenant. This looks like piggy-backing tunnel control information onto the data plane.

Final Thoughts:
At this point it is not clear how much flexibility HW GWs can provide for the options field. This will pan out as HW vendors start to support Geneve.

I still believe the MPLSoGRE encap that Nuage and Contrail use works well for inter-DC cases, given that ASICs already support MPLSoGRE encap & decap. It would have been good had Geneve considered inter-DC as well. Possibly it does not matter if operators start using vRouters as the WAN edge.


Thursday, June 12, 2014

Short and Long Flows (a.k.a. Mice and Elephants): A Primer

This post is in Q&A format. 

What are elephant and mice flows?
Elephants and mice are flow classes defined by size and duration. Elephants are bigger in size and stay around for a longer duration of time. Mice are smaller in size and stay only for a few seconds (in fact, milliseconds).

Are there any defined numbers for size and duration?
Today, it is left to the operator to decide these numbers based on their network, but there are a few texts that define them.

In one text, a flow is considered short if its size is < 10KB and it stays for only a few hundred milliseconds; anything above that is considered a large flow.
In another text, a flow is considered large if it occupies >= 10% of the link bandwidth for >= 1 sec; the rest are mice flows. A toy classifier combining the two definitions is sketched below.
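This is a minimal Python sketch, assuming the thresholds quoted above (10KB, 10% of link bandwidth, 1 second) plus a made-up cut-off for "short-lived":

```python
def classify_flow(bytes_sent, duration_s, link_bps):
    """Toy flow classifier combining the two definitions quoted above.
    Thresholds are illustrative, not standardized."""
    if duration_s >= 1.0 and (bytes_sent * 8) / duration_s >= 0.10 * link_bps:
        return "elephant"          # sustained >= 10% of the link for >= 1 s
    if bytes_sent < 10 * 1024 and duration_s < 0.5:
        return "mice"              # small and short-lived (0.5 s cut-off is assumed)
    return "unclassified"          # operators tune the grey area in between

# On a 10G link: a 2 GB transfer over 5 s is an elephant, an 8 KB HTTP exchange is a mouse.
print(classify_flow(2 * 1024**3, 5.0, 10e9))   # elephant
print(classify_flow(8 * 1024, 0.05, 10e9))     # mice
```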

What are the characteristics of elephants and mice?
Typically, elephants are throughput sensitive whereas mice are delay sensitive. Given that elephants stay longer, it makes sense to detect them and do something about them for better throughput. Mice, on the other hand, cease to exist even before they are detected.

Also, elephants occupy a lot of resources in the network in terms of buffers and queues, leading to buffer starvation for mice. This hurts the delay-sensitive nature of mice.

What is the ratio between elephants and mice in a DC?
According to measurements done in DCs:
    - 80% of flows are mice and the remaining are elephants.
    - Most of the bytes come from the top 10% of large flows.

What are examples of elephants and mice?
Elephants: file transfers, VMotions, video streams, DDoS packets, etc.
Mice: map-reduce applications, request-response protocols like HTTP, etc.

How are elephants detected and mitigated?
There are multiple solutions available in the market today. A few are discussed below.
1) The vSwitch (OVS) detects elephants at the edge. The edge is a great place to detect them, due to its proximity to applications, and can detect them more accurately. Once an elephant is detected at the edge, various mitigations can be employed:
        - Use OVSDB to inform the underlay about the new elephant flow, making mitigation an on-the-fly operation.
        - Use different VxLAN IDs or IP addresses to traffic-engineer elephant flows.

2) InMon and Brocade got together to detect DDoS attacks using tools like sFlow and OpenFlow. The Brocade switch exports sFlow samples to InMon's sFlow-RT module. sFlow-RT detects elephants based on the samples received and sends a signature of the attack to a mitigation application (an SDN app). The SDN app installs OF rules onto the switch to stop such attacks.

3) DCTCP is another tool to handle elephants and mice better. It leverages the fact that an elephant lives long enough to react to ECN. The destination DCTCP uses ECN marks to detect congestion in the network and informs the source DCTCP about it. The source DCTCP reacts by reducing the window by a factor that depends on the fraction of ECN-marked packets.
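A minimal sketch of that window reduction, assuming the commonly cited gain g = 1/16 (an illustration of the control law, not the full per-RTT state machine):

```python
def dctcp_step(alpha, cwnd, frac_marked, g=1.0 / 16):
    """One DCTCP control step (sketch).
    alpha is a moving estimate of the fraction of ECN-marked packets:
        alpha <- (1 - g) * alpha + g * F
    The window is cut by alpha/2 instead of TCP's blanket 1/2, so lightly
    marked (mildly congested) flows back off only a little."""
    alpha = (1 - g) * alpha + g * frac_marked
    cwnd = cwnd * (1 - alpha / 2)
    return alpha, cwnd

# With 30% of the last window marked, a 100-segment window shrinks only slightly:
print(dctcp_step(alpha=0.0, cwnd=100, frac_marked=0.3))   # (0.01875, 99.0625)
```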

What are the differences among the approaches described?
I don't see much of a difference between #1 & #2, except that the tools used are different. Probably one plus of the sFlow-based approach is with physical servers connected to ToRs: most ToRs today support sFlow at the HW level but don't support OVS. If elephants originate from those physical servers (say, storage replication), the sFlow-based approach would be a plus.

#3 is completely different from #1 & #2. It uses a form of WRED and requires support from the edge in the form of DCTCP. DCTCP is probably more reactive and acts deeper in the network. I would choose #1 or #2 over #3.

Can elephants be detected a priori?
Elephants cannot be detected a priori in all cases, but in a few cases an elephant can be detected from control messages. One such example is VMotion. A priori detection of an elephant would help to traffic-engineer flows better.

Thursday, June 5, 2014

Opendaylight Hydrogen Editions.

As of the Hydrogen release, ODL has 3 editions:
- Base
- Virtualization
- Service Provider

Base Edition:

The ODL wiki claims that the Base edition is only for testing and experimentation purposes. On the southbound side, the Base edition supports OF1.x, OVSDB & Netconf.

*run.sh* is the command to bring up this edition. OF1.x sessions are established between the node and the ODL controller; OVSDB does not show up. Ignoring the wiki claims for now, it is fair to assume that the Base edition is used only for physical networks (a.k.a. the underlay). All flows are orchestrated using OF1.x, and it works like a traditional reactive controller.

Virtualization Edition:

According to the wiki, the Virtualization edition is for data centers. On the southbound side, it supports OVSDB, OF1.x, Netconf, VTN and DOVE. It additionally has support for OpenStack.

*./run.sh -virt ovsdb* is the command to bring it up with OVSDB. Apart from OVSDB, one could use either VTN or OpenDove.

With a soft switch like OVS, ODL establishes both OVSDB & OF1.x sessions to install ports/tunnels/flows (the creation of ports/tunnels/flows will be dealt with in a different post). As of the Hydrogen release, how ODL works with HW switches is unclear; expect ODL to evolve in this area as it partners with HW vendors.
Given Hydrogen's lean HW support, it would probably make sense to use this edition as an NVO solution for now.

Service Provider Edition:

The Service Provider edition is for network operators. On the southbound side, it supports SNMP, BGP-LS, PCEP, LISP, etc.

I have not played much with this edition yet.