Friday, July 11, 2014

Automation, Scripting, SDN, CLI, API, etc.

Greg blogged about the current state of automation and its pain points. Marten followed it up with his thoughts. While I agree with Marten, there is still a lot more to be thought out. Hence this post.

Top-Down (vs) Bottom-Up
When a product is conceptualized, it usually takes a top-down approach where architecture is always at the top of the list. The goal of any architecture is to take into account aspects like framework, performance, scalability, flexibility, extensibility, usability, etc. Such an architecture is made of many pieces called modules. Each module is designed independently to achieve its purpose while adhering to the overall architecture goals. Modules could be realized in any language like C, Java, Python, etc.

Automation is like an architecture: a top-down format should be followed, and all the goals of an architecture apply to automation as well.
Scripting, on the other hand, is somewhat like the modules that fit into the automation framework. A bottom-up approach of building scripts and gluing them all together to arrive at automation is not the right way to go.
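
To make the point concrete, here is a minimal Python sketch of the top-down idea, assuming a made-up framework: the framework owns the contract and the ordering, and modules plug into it. None of the names below come from any real product.

```python
# A minimal sketch of the top-down idea: the framework defines the
# contract first, and every module plugs into it. All names here are
# illustrative, not from any real product.

from abc import ABC, abstractmethod


class Module(ABC):
    """Contract every module must honor, set by the architecture."""

    @abstractmethod
    def validate(self, intent: dict) -> None: ...

    @abstractmethod
    def apply(self, intent: dict) -> None: ...


class VlanModule(Module):
    def validate(self, intent: dict) -> None:
        for vid in intent.get("vlans", []):
            assert 1 <= vid <= 4094, f"bad VLAN id {vid}"

    def apply(self, intent: dict) -> None:
        print(f"provisioning VLANs {intent.get('vlans', [])}")


class Framework:
    """Top-down driver: owns ordering and validation across modules."""

    def __init__(self, modules: list[Module]):
        self.modules = modules

    def run(self, intent: dict) -> None:
        for m in self.modules:      # validate everything first...
            m.validate(intent)
        for m in self.modules:      # ...then apply as one unit
            m.apply(intent)


Framework([VlanModule()]).run({"vlans": [10, 20, 30]})
```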

Lesson-1: A top-down approach is to be followed.

Macro (vs) Micro  
CLI is the traditional way to configure a device, and it beautifully abstracts NPU programming. From a single NPU's perspective, CLI has a macro view, but when it comes to the fabric, CLI operates at the micro level. This micro view leads to a lot of touch points, and the more touch points, the larger the Opex. So basing automation on CLI is a bottom-up approach that is also laborious and tedious to develop and maintain. An API model is a little better, as it provides a consistent format across releases, but its overall constructs are similar to CLI.

Hence, a framework is required that has a macro view, provides a top-down format, and reduces the number of touch points. Tools like Network Management (NM) and SNMP have a macro view but have always played the game of the fabric (read: CLI), which has always been bottom-up. Hence, these tools did not realize their potential for data center networks. Additionally, they merely moved the touch-point complexity from the device to a browser.

Lesson-2: A macro view is required.

Can SDN help ?
SDN has many hands; DevOps is one of them. NM is a good starting point for DevOps, but it does not leverage its macro view. For NM to be effective, there must be synergy between NM and the fabric: the fabric provides handles for a top-down format and reduces the number of touch points, and NM leverages the handles provided by the fabric. This is where we get into the dark, innovative areas of DevOps.

It might take a while for the industry to arrive at such a synergy, but there are solutions heading in this direction. A few are listed below.

  • PTM: an open source project from Cumulus Networks for cable management and physical connection verification.
  • Unnumbered interfaces: an age-old technology supported by most vendors. Interestingly, Cumulus Networks is pushing it in the DC from a DevOps perspective.
  • Converged infrastructure: vendors like Dell are championing this cause. On the open source side, OpenStack-Neutron-ODL is focusing on such an approach.
  • Zero touch provisioning: vendors support different methods to provision with zero or no touch. All solutions can be broadly classified into the Network Way (e.g., PXE boot) or the Management Way (e.g., Puppet/Chef/CFEngine). The network/PXE way has its own set of use cases but attempts to solve the problem at the device level. The Management Way, on the other hand, has the potential to look at the solution from a macro level (see the sketch after this list).
  • Big cloud builders like Facebook, Google, etc. have automated to the extent that it reduces their Opex costs.
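
Here is a hedged sketch of the Management Way, assuming a hypothetical fabric-wide inventory: one macro intent is rendered into per-device configs instead of touching each box individually. The device names and the rendering logic are made up for illustration.

```python
# Hedged sketch of the "Management Way": one fabric-wide (macro) intent
# is rendered into per-device configs, instead of touching each box by
# hand. The inventory and the toy config syntax below are hypothetical.

FABRIC_INTENT = {
    "leaf-1": {"loopback": "10.0.0.1", "uplinks": ["spine-1", "spine-2"]},
    "leaf-2": {"loopback": "10.0.0.2", "uplinks": ["spine-1", "spine-2"]},
}


def render_config(device: str, attrs: dict) -> str:
    """Render a (toy) config snippet for one device from the macro intent."""
    lines = [f"hostname {device}",
             f"interface loopback0 ip {attrs['loopback']}/32"]
    # Unnumbered uplinks borrow the loopback address (see bullet above).
    lines += [f"interface to-{u} unnumbered loopback0" for u in attrs["uplinks"]]
    return "\n".join(lines)


for device, attrs in FABRIC_INTENT.items():
    # A real tool (Puppet/Chef/CFEngine) would push this at boot time;
    # here we just print the rendered config.
    print(render_config(device, attrs), end="\n\n")
```
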
Lesson-3: Innovative interaction between the fabric and the orchestration tool is the name of the game.

Final Thoughts
DevOps is taking networking head-on. Though the fundamentals of networking remain the same, the way we sleep, eat, and drink networking is definitely changing.

Sunday, June 22, 2014

DC Traffic Types and Their Provisioning

North-South & East-West.
Data Center traffic is characterized as North-South (NS) and East-West (EW). EW is machine-to-machine traffic localized to a DC. NS is machine-to-user and machine-to-machine traffic (Inter-DC and hybrid clouds are a few examples) that traverses the WAN edge, either to the internet or over a VPN (using L2/L3/MPLS VPN or OTV).



Pic Courtesy: Facebook's NANOG Talk

According to Facebook's NANOG talk in 2013:
    - EW constitutes about 95% of DC traffic; NS takes the remaining 5%.
    - EW traffic is growing at an exponential rate while NS traffic has pretty much stagnated.
    - For every 2 bytes of data generated by NS, there are a corresponding 98 bytes generated by EW.

These numbers clearly indicate that EW traffic has to be well engineered for better performance.

East-West Traffic Types
EW traffic can be classified into:
    - Tenant Traffic
    - Infrastructure Traffic.

Tenant traffic is between the VMs. Infrastructure traffic consists of management, storage, and VMotion. Typically, storage and VMotion are high in bandwidth, and storage is latency sensitive. Management is low in bandwidth but required for managing the compute, network, and storage nodes.

Tenant traffic is virtualized using protocols like VxLAN, NvGRE, MPLSoGRE (Contrail/Nuage), etc. Infrastructure traffic is not virtualized, as it flows between the hypervisors themselves and it does not make sense to do so.

North-South Traffic Types
NS traffic is due to:
    - The Inter-DC case.
    - Traffic to and from the internet.

Inter-DC is the case where the VMs that communicate are located in different DCs, or one of them is located in a public cloud. Usually, there is a VPN connection (L2VPN, L3VPN, MPLS-VPN, or OTV) between the DCs.

Users talk to a webserver through the internet. This traffic goes through a firewall and gets NAT'ed. A few hybrid clouds also use this model.

VLAN Provisioning
VLANs are provisioned at the vSwitch. Each traffic type is placed in a different VLAN and allocated a different bandwidth based on its requirement. If the fabric is L3, all VLANs get terminated at the ToR, with the exception of the Edge VLAN, which gets plumbed from the vSwitch to the WAN edge, including all intermediate networking nodes.

All NS traffic is placed in the Edge VLAN. Of course, the Edge VLAN is not required when an MPLSoGRE-based solution is used; Contrail and Nuage are MPLSoGRE based.

Each EW traffic type is placed in a different VLAN on the vSwitch. As the VLANs get terminated at the ToR, EW traffic is routed from then on. Yes: VMotion, storage, management, and tenant traffic will go over a routed network. Even though tenant traffic is part of a single VLAN, its virtualized nature provides tenant isolation.
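
A toy sketch of such a plan is below. The VLAN IDs and bandwidth shares are made-up examples, not recommendations.

```python
# A toy VLAN plan for the traffic types above. VLAN IDs and bandwidth
# shares are made-up examples, not recommendations.

VLAN_PLAN = {
    # traffic type: (vlan id, bandwidth share %, routed at ToR?)
    "management": (10, 5,  True),
    "storage":    (20, 30, True),
    "vmotion":    (30, 25, True),
    "tenant":     (40, 30, True),   # NVO'ed; isolation comes from VxLAN/NvGRE
    "edge":       (50, 10, False),  # plumbed end-to-end to the WAN edge
}

assert sum(share for _, share, _ in VLAN_PLAN.values()) == 100

for traffic, (vid, share, routed) in VLAN_PLAN.items():
    scope = "terminates at ToR" if routed else "plumbed to WAN edge"
    print(f"{traffic:<10} vlan {vid:<4} {share:>2}% of NIC bandwidth, {scope}")
```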

Summary
A server running VMware vSphere as the hypervisor, with a single vSwitch and 4 NICs attached, will have an IP and VLAN created using the VMkernel option for each of:
    - Management
    - Storage
    - VMotion
    - Tenant (NVO'ed)
    - Edge VLAN. 

If the fabric is L3, the default GW would be the first-hop ToR, and all traffic would be routed from then on. The Edge VLAN is for NS traffic and would have its VLANs plumbed from the vSwitch to the WAN edge.

According to the NSX design guide, the fabric design should be approved by VMware for VMotion to be supported over L3.

Friday, June 20, 2014

DC QoS: A Study.

Even years after the Fallacies of Distributed Computing were set down, applications still assume they have unlimited bandwidth from the fabric. Applications such as Hadoop map-reduce, VMotion, etc. are a few examples that make such assumptions, leading to problems like TCP incast, congestion due to elephants and mice, and micro-bursts in the network.

The fabric provides various tools like QCN, PFC, DCTCP, WRED, ECMP, switch buffers, etc. Yes, there are that many, and many more are not covered here.

Does any one of them, or a combination of a few, solve the problem of congestion? Not really. But a combination of a few can handle the problem better.

QCN, DCTCP, and WRED inform the source about congestion in the network through explicit or implicit control messages. QCN and DCTCP generate explicit control messages through QCN feedback and ECN respectively, while WRED informs the source implicitly by dropping a random packet, leaving the source to figure out that there is congestion in the network. In all cases, the source cuts down its sending rate, thereby reducing congestion. These mechanisms work only for elephant flows, as they live long enough in the network; latency-sensitive mice live too short a time to react to these messages. So incast due to mice, as in Hadoop, is not solved by QCN, DCTCP, or WRED. On the other hand, latency-sensitive mice get more bandwidth when elephants cut down their sending rate.
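
As an illustration of WRED's implicit signal, here is a minimal sketch: the drop probability ramps linearly with the average queue depth between two thresholds. The thresholds and max drop probability are illustrative, not recommended values.

```python
import random

# Minimal sketch of (W)RED's implicit congestion signal: as the average
# queue depth grows between two thresholds, the probability of dropping
# a packet ramps up linearly. Thresholds and MAX_P are illustrative.

MIN_TH, MAX_TH, MAX_P = 40, 120, 0.1   # packets, packets, max drop prob


def red_drop(avg_queue_depth: float) -> bool:
    """Return True if this packet should be dropped."""
    if avg_queue_depth < MIN_TH:
        return False                      # no congestion: never drop
    if avg_queue_depth >= MAX_TH:
        return True                       # severe congestion: always drop
    # In between: linear ramp. A long-lived elephant sends enough packets
    # to get hit eventually; a short mice flow is usually gone before then.
    p = MAX_P * (avg_queue_depth - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < p


print([red_drop(80) for _ in range(10)])  # ~5% drops at mid-ramp
```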

PFC is another tool, where a local node informs its immediate neighbors (nbrs) about congestion on a queue through a flow control message (FCM). On receiving an FCM, a nbr does the following for the COS on which the FCM was received:
- Stops sending
- Starts buffering
- Starts sending again only after a specific period of time

During the buffering period, if the nbr starts experiencing congestion, it sends an FCM to its own nbrs, and so on. In some sense, PFC is hop-by-hop, and it takes a while for the source to learn about congestion in the network and cut down its rate.

It should be noted that QCN, DCTCP, and WRED inform the source, whereas PFC is COS based. This COS-based model makes PFC prone to HOL blocking. An example: applications A (elephant) and B (mice) generate traffic with the same COS. With QCN, DCTCP, or WRED, only App-A gets throttled; with PFC, both App-A and App-B get throttled. In short, PFC is *not* suited for elephants and mice.

Unlike TCP, FCoE (and RoCE) does not support retransmission of lost packets, in order to keep the stack lightweight. Such lightweight protocols expect the fabric to provide lossless behavior, which is realized using PFC.

Switch buffers are another place to look when it comes to incast and micro-bursts. Broadcom Trident-2's smart-buffer performance numbers look great for multiple incasts happening on the same switch at the same point in time. Assume a given incast requests 4KB from each of, say, 25 servers.
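
Here is the back-of-the-envelope arithmetic for that example. The shared-buffer figure below is a hypothetical stand-in, not a quoted Trident-2 number.

```python
# Back-of-the-envelope arithmetic for the incast example above:
# simultaneous incasts, each fanning in 4KB from 25 servers, all
# landing on the same switch at once. The 12MB shared-buffer figure
# is a hypothetical stand-in, not a quoted Trident-2 number.

SERVERS_PER_INCAST = 25
BYTES_PER_SERVER = 4 * 1024          # 4KB response per server
SHARED_BUFFER = 12 * 1024 * 1024     # hypothetical shared packet buffer

burst_per_incast = SERVERS_PER_INCAST * BYTES_PER_SERVER   # 100KB
max_parallel_incasts = SHARED_BUFFER // burst_per_incast

print(f"one incast burst : {burst_per_incast / 1024:.0f} KB")
print(f"buffer absorbs   : {max_parallel_incasts} simultaneous incasts")
```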

Interestingly, cloud vendors build robust L3 fabrics with a large number of ECMP paths using a leaf-spine architecture. By throwing more bandwidth at the leaf and spine and having many ECMP paths, one can get rid of complex QoS at the leaf-spine layer. One might still need to handle incast and bursts at the access layer: measurements in DCs have shown that bursts are observed only at the access layer and not at the leaf-spine layer.

To summarize, one could start off with:
1) Build an L3 fabric with many ECMP paths by throwing more bandwidth at the leaf and spine (LS). This takes care of the LS part of the fabric.
2) At the access layer:
    - Have smart-buffer capability.
    - Enable PFC/DCTCP on ports facing the fabric to handle micro-bursts. It should be noted that PFC will not create HOL blocking here, as the FCM will be sent directly to the server.

But we are still left to deal with mice incast at the access layer, as mice do not live long enough to react to any of the above techniques. Possibly, SDN could be of help here.

Today, SDN helps to detect elephants. A few examples:
- Classify elephants and mice using DSCP
- Handshake between underlay and overlay
   - A priori learning of VMotion

Can SDN help to detect mice? Perhaps, yes. It would be good to know how Cisco's ACI could help here.

Sunday, June 15, 2014

Geneve: A Network Virtualization Encapsulation With A Difference

Before Geneve, each vendor had its own virtualization encap, like VxLAN, STT, NvGRE, MPLSoGRE, DOVE, etc. Though there were a few pros and cons associated with each of these encapsulations (encaps), it was believed that the encap does not matter, as it is merely used to transport a VM's packet from A to B. Geneve moves away from this premise and also tries to bring all vendors toward a common encap.

To summarize, Geneve addresses:
    1) A common encap that subsumes all virtualization tunneling protocols.
    2) The view that encap does matter: it does more than just transport the packet.


From the header pictures above, it is quite obvious that Geneve adds a few more fields to the VxLAN header. The fields important to this discussion are:
    1) The O & C bits
    2) Protocol Type
    3) Variable-length options
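
For reference, here is a minimal sketch of packing the 8-byte Geneve base header as laid out in the draft (Ver, Opt Len, O, C, Rsvd, Protocol Type, VNI, Reserved). The helper function is illustrative.

```python
import struct

# Hedged sketch of the Geneve base header, per the draft's layout:
# Ver(2) | Opt Len(6) | O(1) | C(1) | Rsvd(6) | Protocol Type(16)
# VNI(24) | Reserved(8), followed by variable-length options.


def geneve_header(vni: int, proto: int = 0x6558,   # Transparent Ethernet
                  opt_len_4b: int = 0, oam: bool = False,
                  critical: bool = False) -> bytes:
    """Pack the 8-byte Geneve base header (options not included)."""
    word0 = (0 << 30) | (opt_len_4b << 24) | (oam << 23) \
            | (critical << 22) | proto
    word1 = vni << 8                                # low 8 bits reserved
    return struct.pack("!II", word0, word1)


hdr = geneve_header(vni=5000, oam=True)
print(hdr.hex())   # -> 0080655800138800
```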

Common Encapsulation:
Geneve attempts to bring all vendors under one umbrella with respect to encap. This helps in a lot of ways:
    - DC operators would have a single encap in a heterogeneous hypervisor environment.
    - Vendors of NVO GWs, NICs, firewall systems, IDS, etc. have to support only one encap.
    - Life becomes much easier from a monitoring and debugging perspective.

Protocol Type indicates the EtherType that appears after the Geneve header, and it could be any EtherType defined by IEEE. This provides the ability to encapsulate any protocol that has an EtherType; with VxLAN, it is possible only to encapsulate an Ethernet frame.

Encapsulation Matters:
Traditional protocols carry metadata only in the control plane, not the data plane, whereas Geneve carries metadata in the data plane. This is where Geneve thinks differently. It views the DC as a single large distributed router with each hypervisor as a line card. It is typical of such distributed systems to carry metadata on the data plane between line cards; Geneve rightly does that for network virtualization too.

Metadata is immensely powerful for things like service chaining, storing context for firewalls, etc. Metadata can do something like what communities and attributes did for BGP.

Interestingly, Geneve does not define this metadata but leaves it to vendors to define their own using the variable-length options field.
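
A sketch of building one such vendor-defined option, following the draft's TLV layout (Option Class, Type, then Length and data in 4-byte multiples); the class/type values below are made up.

```python
import struct

# Hedged sketch of a vendor-defined Geneve option TLV, per the draft:
# Option Class(16) | Type(8) | R(3)+Length(5), then data in 4-byte
# multiples. The class/type values below are made up for illustration.


def geneve_option(opt_class: int, opt_type: int, data: bytes) -> bytes:
    assert len(data) % 4 == 0, "option data is in 4-byte multiples"
    length_4b = len(data) // 4
    return struct.pack("!HBB", opt_class, opt_type, length_4b) + data


# e.g. a hypothetical "service-chain hop" carried as in-band metadata
opt = geneve_option(0x0101, 0x01, struct.pack("!I", 7))
print(opt.hex())   # -> 0101010100000007
```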

Backward Compatibility:
Geneve is not backward compatible (from a data plane perspective) with any NVO encap, including VxLAN. Perhaps VxLAN can be made forward compatible with Geneve.

So, what is the future of VxLAN and its support by various HW, SW, and NIC vendors? Bruce explains it here.

Remaining Headers:
O is the OAM bit; when set, the packet is consumed by the tunnel endpoint rather than forwarded to the tenant. This is along the lines of a VxLAN router alert. One could possibly simulate BFD using the O bit and the options provided.

C is set when the packet's options must be consumed by the tunnel endpoint before the packet is forwarded to the tenant. This seems more like piggybacking tunnel options onto the data plane.

Final Thoughts:
At this point it is not clear how much flexibility HW GWs can provide for the options field. This will pan out as HW vendors start to support Geneve.

I still believe the MPLSoGRE that Nuage and Contrail use works well for Inter-DC cases, given that ASICs already support MPLSoGRE encap and decap. It would have been good had Geneve considered Inter-DC as well. Possibly it does not matter if operators start using vRouters as the WAN edge.


Thursday, June 12, 2014

Short and Long Flows (a.k.a. Mice and Elephants): A Primer

This post is in Q&A format. 

What are Elephants and Mice flows?
Elephants and mice are flows defined based on their size and duration. Elephants are bigger in size and stay for a longer duration of time; mice are smaller in size and stay only for a few seconds (in fact, milliseconds).

Are there any defined numbers for size and duration?
Today, it is left to the operator to decide these numbers based on their network, but there are a few texts that define them.

In one text, a flow is considered short if its size is < 10KB and it stays only a few hundred milliseconds; anything above that is considered a large flow.
In another text, a flow is considered large if it occupies >= 10% of the link bandwidth for >= 1 sec; the rest are mice flows.
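
Both definitions are easy to express in code. A toy classifier is below; the thresholds come straight from the texts, while the flow-record fields are hypothetical.

```python
# Toy classifier for the two definitions quoted above. Thresholds come
# from the texts; the flow-record fields are hypothetical.

def is_elephant_by_size(flow_bytes: int, duration_s: float) -> bool:
    """First definition: short if < 10KB and only a few hundred ms."""
    return not (flow_bytes < 10 * 1024 and duration_s < 0.5)


def is_elephant_by_rate(flow_bytes: int, duration_s: float,
                        link_bps: float) -> bool:
    """Second definition: large if >= 10% of link bandwidth for >= 1s."""
    if duration_s < 1.0:
        return False
    return (flow_bytes * 8 / duration_s) >= 0.10 * link_bps


# A 1GB transfer over 5s on a 10G link: elephant under both definitions.
print(is_elephant_by_size(10**9, 5))               # True
print(is_elephant_by_rate(10**9, 5, 10 * 10**9))   # True
```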

What are the characteristics of elephants and mice?
Typically, elephants are throughput sensitive whereas mice are delay sensitive. Given that elephants stay longer, it makes sense to detect them and do something about them for better throughput. Mice, on the other hand, cease to exist even before they are detected.

Also, elephants occupy a lot of resources in the network in terms of buffers and queues, leading to buffer starvation for mice. This impacts the delay-sensitive nature of mice.

What is the ratio between elephants and mice in a DC?
According to measurements done in DCs:
             - 80% of flows are mice and the rest are elephants.
             - Most of the bytes come from the top 10% of large flows.

What are examples of elephants and mice?
Elephants: file transfers, VMotions, video streams, DDoS packets, etc.
Mice: map-reduce applications, request-response protocols like HTTP, etc.

How are elephants detected and mitigated?
There are multiple solutions available in the market today. A few are discussed below.
1) A vSwitch (OVS) detects elephants at the edge. The edge is a great place for detection due to its proximity to the applications, and it can detect elephants more accurately. Once an elephant is detected at the edge, various mitigations can be employed:
        - Use OVSDB to inform the underlay about the new elephant flow, making mitigation more of an on-the-fly affair.
        - Use different VxLAN IDs or IP addresses to traffic-engineer elephant flows.

2) InMon and Brocade got together to detect DDoS using tools like sFlow and OpenFlow. The Brocade switch exports sFlow samples to the sFlow-RT module of InMon; sFlow-RT detects elephants based on the samples received and sends the signature of the attack to a mitigation application (an SDN app). The SDN app installs OF rules onto the switch to stop such attacks.
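
The idea behind sample-based detection is simple enough to sketch without any real sFlow-RT API: count samples per flow key and flag the keys that dominate. The threshold and flow keys below are made up.

```python
from collections import Counter

# Toy version of the sample-based detection idea behind sFlow: count
# packet samples per flow key and flag keys that dominate the samples.
# The dominance threshold and flow-key format are made up.

SAMPLES = [("10.0.0.5", "10.0.1.9")] * 900 + [("10.0.0.7", "10.0.1.2")] * 100


def detect_elephants(samples, dominance=0.5):
    counts = Counter(samples)
    total = len(samples)
    return [key for key, n in counts.items() if n / total >= dominance]


print(detect_elephants(SAMPLES))   # [('10.0.0.5', '10.0.1.9')]
```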

3) DCTCP is another tool to handle elephants and mice better. It leverages the fact that elephants live long enough to react to ECN. The destination DCTCP uses ECN to detect congestion in the network and informs the source DCTCP about it; the source reacts by reducing the window by a factor that depends on the fraction of ECN-marked packets.
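
For the curious, the source-side math from the DCTCP paper is compact enough to sketch: alpha is a moving average of the fraction of ECN-marked packets, and the window is cut in proportion to it.

```python
# Sketch of DCTCP's source-side reaction, per the SIGCOMM'10 paper:
# alpha <- (1-g)*alpha + g*F, where F is the fraction of ECN-marked
# packets in the last window, then cwnd <- cwnd * (1 - alpha/2).

G = 1 / 16          # smoothing gain; a commonly cited value


def dctcp_update(cwnd: float, alpha: float, marked: int, acked: int):
    """One window's update; returns (new_cwnd, new_alpha)."""
    frac = marked / acked if acked else 0.0
    alpha = (1 - G) * alpha + G * frac
    if marked:                       # cut proportionally to congestion...
        cwnd *= (1 - alpha / 2)      # ...instead of TCP's blunt halving
    return cwnd, alpha


cwnd, alpha = 100.0, 0.0
for _ in range(20):                  # persistent 30% marking
    cwnd, alpha = dctcp_update(cwnd, alpha, marked=30, acked=100)
print(f"cwnd={cwnd:.1f}, alpha={alpha:.2f}")
```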

What are the differences among the various approaches described?
I don't see much of a difference between #1 and #2 except that the tools used are different. Probably one plus of the sFlow-based approach is with physical servers connected to ToRs: most ToRs today support sFlow at the HW level but don't support OVS. If there are elephants originating from those physical servers (say, storage replication), the sFlow-based approach would be a plus.

#3 is completely different from #1 and #2. It uses a form of WRED and requires support from the edge in the form of DCTCP. DCTCP is probably more reactive and acts deep in the network; I would choose #1 or #2 over #3.

Can elephants be detected a priori?
Elephants cannot be detected a priori in all cases, but in a few cases an elephant can be detected based on a control message. One such example is VMotion. A priori detection of elephants would help to traffic-engineer flows better.

Thursday, June 5, 2014

Opendaylight Hydrogen Editions.

As of the Hydrogen release, ODL has 3 editions:
- Base
- Virtualization
- Service Provider

Base Edition:

The ODL wiki claims that the base edition is only for testing and experimentation purposes. On the southbound side, the base edition supports OF1.x, OVSDB, and Netconf.

*run.sh* is the command to bring up this edition. OF1.x sessions are established between the node and the ODL controller; OVSDB does not show up. Ignoring the wiki claims for now, it is fair to assume that the base edition is used only for physical networks (a.k.a. the underlay), with all flows orchestrated using OF1.x, working like a traditional reactive controller.

Virtualization Edition:

According to the wiki, the virtualization edition is for data centers. On the southbound side, it supports OVSDB, OF1.x, Netconf, VTN, and DOVE. It additionally has support for OpenStack.

*./run.sh -virt ovsdb* is the command to bring it up with OVSDB. Apart from OVSDB, one could use either VTN or OpenDove.

With a soft switch like OVS, ODL establishes both OVSDB and OF1.x sessions to install ports/tunnels/flows (the creation of ports, tunnels, and flows will be dealt with in a different post). As of the Hydrogen release, how ODL works with HW switches is unclear; expect ODL to evolve in this area as it partners with HW vendors.
Given Hydrogen's lean HW support, it would probably make sense to use this edition as an NVO solution for now.

Service Provider Edition:

The Service Provider edition is for network operators. On the southbound side, it supports SNMP, BGP-LS, PCEP, LISP, etc.

I have not played much with this edition yet.