Network Field Day #NFD11

I've been invited to attend Gestalt IT's Network Field Day 11, held in Silicon Valley on January 20th – 24th 2016.

For those of you who haven't heard of the Tech Field Day events so far: the idea behind Network Field Day is to bring together a group of delegates who attend a week of presentations by different top IT vendors. All the presentations are streamed live and uploaded to YouTube after the event, including the Q&A sessions after the presentations. Listening to Tech Field Day presentations has helped me a lot before scheduling an appointment with a possible vendor.

I am honored to be part of Network Field Day 11 and am looking forward to meeting up with the other delegates:
Ethan Banks
Greg Ferro
Ivan Pepelnjak
John Herbert
Jon Langemak
Brandon Carroll
Terry Slattery
Jason Edelman
Matt Oswald
Jordan Martin
Michael McNamara

 

You can follow Network Field Day 11 on Twitter @NFD11

The following disclosure also needs to be made: my travel and lodging costs will be covered by Gestalt IT and the sponsors of Network Field Day 11.

NFD11 Presenting Sponsors

Big Switch Networks
NetScout
Skyport Systems
Dell
Cisco

Posted in All | 2 Comments

End of Sale Avaya ERS8000

Avaya has recently published an end-of-sale notice for the ERS8000 product line. The ERS8000 was introduced as the Passport 8000 in 2000, so the product has now been available for nearly 16 years. I configured a lot of new technologies for the first time on the Passport/ERS8k. For me, the two most amazing features that were first introduced on this platform were SMLT and SPBm. The SMLT SwitchCluster feature, which was introduced in 2001, was the first multi-chassis link aggregation technology. In 2001 SMLT was real cutting-edge technology that was ahead of most of the competitors. For example, Cisco introduced its multi-chassis link aggregation technology with VSS in 2008, which by IT standards is ages later. Ten years after SMLT, the second next-generation technology, SPBm, was introduced: in 2011 the first pre-standard SPBm implementation showed up on the ERS8800 platform.

So it is time to say goodbye to the Passport/ERS8000. At the end of the day, the whole industry is shifting to Linux-based switch operating systems, and the switches based on an old monolithic OS are fading away.

Some of the ERS8k developers from Avaya have created a goodbye ode, which I recently saw in an Avaya presentation:

“When we first turned you on SMLT was quite new
We had some tough times but we made it through
Alone in the rack looking naked and small
Before we knew it ERS modules populated all
Bandwidth demands came quick and came swift
When we gave you E modules you just wouldn't quit
Who would have thought 10Gig to come fast
Your poor little E modules just wouldn't last
When R modules came so did netflow
You got super mezz cards but had problems below
Slot 10 was tired and couldn't keep up
So your body was replaced and you were a brand new pup
They lauded and loved you and gave you a new name
8800 they said but you were still the same
Ten days before retirement a power supply quit
We knew at the time we had to be quick
After more than a decade you served us well
Oh the good times we had and stories we tell
You're out of commission but you still stand tall
Your performance and reliability will be remembered by all”

Here is the link to the EoS notice:
https://downloads.avaya.com/css/P8/documents/101015430

Posted in All | Leave a comment

MicroBursts A Troubleshooting Nightmare

Microbursts are among the most difficult problems to troubleshoot in a network. This is a really tough one. So what is actually the problem with microbursts? You have a traffic peak in the network that is only present for a fraction of a second. Sometimes these spikes can fill up a 10 Gigabit interface at full line rate. The typical result is a high rate of TCP retransmits and resets on multiple devices in a VLAN/subnet, which causes ~25% packet loss. In most cases the server/application teams are the first to detect performance problems that occur sporadically.

When this is reported to the network team, the problem is nearly invisible on the network side. The normal sources for statistics and troubleshooting show nothing. For example, a monitoring server that polls the interface statistics every minute will show nothing. On most vendors the show commands in the CLI also report statistics averaged over a period of 10 seconds, which smooths out a burst that was only present for a fraction of a second. So it looks like there is no problem in the network.

To find the problem, it helps to have sniffer traces taken during a microburst that show the TCP retransmits and resets. At this point you have to think in a different direction to hunt down the microburst. Depending on the switch vendor, you have to look at different error counters. The root problem is that an ASIC reaches its maximum throughput and starts to drop packets. If you are lucky, you have a counter for those drops, like “Drops on no Resources”.
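To see why one-minute polling and 10-second averages hide a microburst, here is a minimal Python sketch of the arithmetic. The 100 ms burst length is an assumption chosen for illustration (the post only says the burst lasts less than a second), and `average_utilization` is a hypothetical helper, not a vendor tool.

```python
LINK_BPS = 10e9  # 10 Gbit/s interface, as in the example above


def average_utilization(burst_seconds, window_seconds,
                        burst_bps=LINK_BPS, link_bps=LINK_BPS):
    """Utilization (0..1) a counter reports when one burst is averaged over a window."""
    bits_in_burst = burst_bps * burst_seconds
    return bits_in_burst / (link_bps * window_seconds)


if __name__ == "__main__":
    burst = 0.1  # assumed burst length in seconds (illustrative only)
    for window in (0.1, 10, 60):
        util = average_utilization(burst, window)
        print(f"{window:>5} s window: {util:.2%} utilization")
```

Under these assumptions, a burst that saturates the link for 100 ms shows up as roughly 1% load in a 10-second average and well below 1% in a one-minute poll, which is why the usual graphs look clean.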

What can you do to resolve the problem?

On the server/application side it is possible to change the traffic profile to remove the bursty behaviour. That is really hard to achieve and can only be done with applications that you can change and control. If you can do that, it resolves the microburst issue for one type of server/application. Be aware that you can run into the same problem again in that network, for example when you deploy a new application.

The other method to avoid the problem is to spread the uplinks of the switches that connect the bursty servers across different devices or ASICs. It also helps to have more bandwidth available on the uplinks than the burst can fill. So when the microburst spikes up to 10Gig, a 25Gig or 40Gig uplink also resolves the issue.
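The effect of a bigger uplink can be sketched with a little arithmetic. The scenario below (two servers bursting at 10 Gbit/s each for 50 ms towards one uplink) is purely an assumption for illustration, and `excess_bytes` is a hypothetical helper; real switches also differ in how much buffer an ASIC can actually offer.

```python
def excess_bytes(offered_bps, uplink_bps, burst_seconds):
    """Bytes that must be buffered (or get dropped) while the offered load exceeds the uplink rate."""
    excess_bps = max(0.0, offered_bps - uplink_bps)
    return excess_bps * burst_seconds / 8  # bits -> bytes


if __name__ == "__main__":
    offered = 2 * 10e9  # assumed: two servers bursting at 10 Gbit/s each
    burst = 0.05        # assumed burst length in seconds
    for uplink in (10e9, 25e9, 40e9):
        mb = excess_bytes(offered, uplink, burst) / 1e6
        print(f"{uplink / 1e9:.0f}G uplink: {mb:.1f} MB to buffer or drop")
```

As soon as the uplink rate is at or above the offered burst rate, the excess is zero and nothing has to be buffered or dropped.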

Sometimes microbursts occur sporadically in a network for years without being detected, leaving strange performance tickets unsolved for a long, long time. This is really hard to detect, so keep microbursts in mind when you are dealing with this kind of problem.

 

Posted in All | 2 Comments

Switch OS Reverse Engineering

It is really hard to get information about the proprietary OS that runs on many switches. The vendors don't give away much information about how it actually works under the hood. The old model of security by obscurity is still applied here. At the 25C3 conference in Berlin I saw the “Cisco IOS attack and defense” talk by Felix “FX” Lindner, which completely changed my mindset about code quality inside switch operating systems. FX reverse engineered the IOS code and showed in great detail how IOS works and which attack vectors can be leveraged to get control over an IOS-based device. Felix is one of the most talented people in the community when it comes to reverse engineering, and I am very thankful for all the time and effort that he has spent on this project. The talk is about 1 hour long and is a really deep dive into Cisco's IOS code. I learned more about how IOS works from this talk than from all the presentations that I have ever seen from Cisco.

This talk is from 2008 and was the first in a series of switch OS reverse engineering projects from FX. The next target was the Huawei VRP OS. FX presented the results at DEFCON 2012. Huawei had a joint venture with 3Com (H3C), which later became part of HP, and it looks like most of the results also apply to the H3C devices from HP. The myth that Huawei copied the IOS code was disproved by FX. He found out that the Huawei VRP OS is based on VxWorks. At the end of the talk his devastating summary is “90's style bugs, 90's style exploitation, 0 operating system hardening … no security advisories..”.

Beyond the physical switches, FX also reverse engineered the Cisco Nexus 1000v virtual switch. In the talk “Cisco in the sky with diamonds”, FX presented the results of that research at the SIGINT 2013 conference. The NX-OS based Nexus 1000v is built on a MontaVista Linux that runs a 2.6.10 kernel. FX and Greg found a jailbreak, which they show in the talk, and they mention that the same jailbreak also works on the physical Nexus devices.

This shows that the level of security embedded in the switches that FX has investigated is very poor. Since I became aware of FX's research, I think very differently about protecting a switch from getting owned by an attacker. It also explains a lot of the bugs that I have experienced in the past. Hopefully FX and Greg will continue their excellent work in the future.

 

 

 

Posted in All, Blog | Leave a comment

Packet Pushers Podcast Show 250 – How To Document A Network

I recently took part in Packet Pushers podcast show 250 – How To Document A Network, with the Packet Pushers hosts Ethan @ecbanks and Greg @etherealmind. It was the third time that I have been on the Packet Pushers podcast. We had an interesting discussion that gives a pretty good overview of the most important topics regarding network documentation and documentation tools. We nearly hit the 90-minute mark. The show can be downloaded here:

http://packetpushers.net/podcast/podcasts/show-250-document-network/

If we missed your favorite documentation tool, feel free to leave a comment.

 

Posted in All, Blog | Leave a comment

Whitebox vs Blackbox

At the moment there are a lot of discussions about whitebox switches and how they change the networking industry. Essentially the idea is that you buy your switch hardware and software separately. Most network vendors already use merchant silicon, e.g. chips from Broadcom, inside their switches, and you can buy the same silicon inside a whitebox switch. The main benefit of whitebox switches is that they are cheaper and that you can use any SFP/QSFP modules from the open market that you like. In addition to the hardware you also need an operating system that runs on top of your whitebox switches, like Cumulus Linux, or an OS toolbox like the Facebook Open Switching System (FBOSS). Another aspect is that you can change your hardware or software supplier separately and are not dependent on a single vendor. With the traditional blackbox appliance model, on the other hand, the hardware and software come as a tested package from a single vendor. For the vendors the challenge is that most of them have the same chips inside their switches. So they need to provide additional goodies on their switches, like management, support, features and protocols, to convince customers to buy their product.

 

The Networkautobahn View

For me the idea of whitebox is not new. In the firewall space we have had the whitebox model for a long time. And I have been burned many times in situations where something did not work and the hardware and software vendors pointed fingers at each other instead of troubleshooting the actual problem. We are facing the same challenges with whitebox switching. I doubt that whitebox will revolutionize the complete network industry.

For customers that have the staff and the drive to go down the whitebox road, it has the obvious benefits of lower costs and independence. But that comes with some restrictions. All the testing and feature development that a vendor does for a blackbox solution is now outsourced to the customer. It can be a benefit to be able to add new features to your switches with your own developers. On the other hand, not every organisation has developers available to do that job. In the classic blackbox model these tasks would be handled by the vendor.

As always, your choice will depend on your needs. I think the whitebox model is more attractive for large organisations that have their own developers and enough resources to do extensive testing in their own lab environment. If you don't have these resources available, the traditional blackbox switches will serve your needs better in most cases. I expect whitebox switching to have a market impact on a specific type of customer, like large cloud and datacenter providers. IMHO whitebox switching will be more of a niche product than something that revolutionizes the entire network industry. If the demand is high enough, we will definitely see more vendors add a separately available switching OS that runs on whitebox switches to their portfolio.

Posted in All, Blog | Leave a comment

Interview with Paul Unbehagen

At the Avaya ATF in Vienna I had the chance to interview Paul Unbehagen, Roger Lapuh and Randy Cross. This trio has been deeply involved in the development of SPB inside Avaya and the IEEE.


Networkautobahn: What are typical topologies that are deployed in SPB networks? What have you seen in real-world deployments? Is leaf-spine the usual topology, or do you also see other solutions?

Paul Unbehagen: So there are not any typical designs. There are a lot of networks already in place, based on SMLT designs or dual hub and spoke, that we simply upgraded to our fabric, and it just keeps on running. Often we found that once they had been upgraded, they added links in different places where they couldn't do it before because the old SMLT design wouldn't allow it. As our reference architecture, we have recently tended to steer people more towards a double-helix-looking design. Something that looks more like square SMLTs stacked on each other, and with that comes more organic growth. SPB doesn't care what the topology is. In fact, because we don't have IP addresses on any of the interfaces between the SPB nodes, you can redesign the fabric on the fly. You take one core link and swap it to another one, and the fabric will just adapt to that automatically. We actually do that as a great demo at conferences, where the video surveillance guys are watching a video and we just unplug it. I wouldn't say there is a typical design, because there are so many different ones out there, so many ways.

Roger Lapuh: So maybe I can add there: if you take the 7200 or 8400 as a typical 10Gig/40Gig datacenter switch, you will probably see the 7200 as the top-of-rack switch. Since we have 6x 40Gig links, you can use some of the 40Gig links to interconnect them horizontally for fast east/west traffic, and then you have an 8400 as the aggregation layer, also connected with 40Gig, probably dual-homed. So it's not a leaf-spine; it is a leaf with a horizontal detour.

Paul Unbehagen: We call it distributed top of rack.

Networkautobahn: That is exactly my experience. One day you see that there is a lot of traffic between these 2 servers. So OK, I plug in another direct link between these two switches to add more bandwidth.

Paul Unbehagen: So we look at the 7200 as a great example of why it has the 6x 40Gig links at the front. The VSP7k that we are using uses the Fabric Interconnect links; there are actually 8x 40Gig interfaces bundled that are exposed when we are running in SPB mode. In the new 7200 model we have native 40Gig interfaces that you can plug into the 8k, 9k..., so distributed top of rack becomes more than just a VSP7k. It now becomes end-of-row, middle-of-row and top-of-rack all integrated together, so it becomes an even more flexible detour model.

Networkautobahn: How do you deploy your L3 instances in an SPB network? You also have plenty of choices. In the classic model you have a centralized design. For example, in a classic SMLT design you would have 2 cores with RSMLT. In an SPB network you can also have L3 in the access. What have you seen here? Is it the same answer as in the topology discussion?

Paul Unbehagen: Yes, the answer is very similar. It all depends on the environment and the emotional attachments.

Randy Cross: And the religion of the people doing it…

Paul Unbehagen: So by religion, the question is the "do you route or switch at the edge" kind of conversation. The point with the fabric was to no longer make it a religious debate. Make it do what you need to do where you need to do it, and marry the two together. Putting the two worlds together. The idea was that you can switch or route at one point, so people did some routing, and now we get more mobility in Layer 2. When you are in the fabric you are not in a 2-dimensional world, you are in a 3-dimensional world. So you can create a Layer 2 anywhere you want underneath the Layer 3. You are not bound by that anymore. You can, for example, put a VRF right on the top-of-the-rack in the datacenter, like with the VSP 7200 or 8200. You can also put the same 8200 in the distribution or core and have an 8400 in the closet doing Layer 2. You can also have a Layer 2 that starts in the 8400 and spans multiple closets; for wireless LAN, for example, that is important. We call it unified networking. You have probably heard me talking about that in previous presentations: the problem set in wireless LAN is identical to the one in the datacenter. VMs moving around is the same problem set as moving around with these devices. We are not putting you on the native LAN in the campuses, just as we do in the datacenter. The LAN stretch that we do in the datacenter makes perfect sense when we are talking about the campus as well. It gives you more flexibility. Where you do the routing depends on the application type. You might want the device you are holding in your hand to be on the same subnet as your PC in the cube in your office. When you pick up your tablet you are still roaming on the same subnet, using the same DNS and DHCP, so it becomes a simpler but more robust design. When we are not going back to a wireless controller in the datacenter, you are not trying to manage tunnels over a real network and going down to the speed of the tunnel. You are using native switching to do native forwarding.
So that means you get more flexibility in where you put your routing. You are doing it more 3-dimensionally: here is where I want my routing, and here is my switching stretching beneath.

Roger Lapuh: From a product perspective we will map this. Basically, so far the VSP7000 is Layer 2 only, so you are forced into the spine being Layer 3 and the edge being Layer 2. With the new VSP7200 coming, we will give you the freedom to really push routing to the edge. It will have VRFs and all the capabilities that you know from the VSP4k or VSP8400 right at that switch, so that you can have a top-of-the-rack with the VRFs right there.

Networkautobahn: What can you tell us about the new ONA devices that Avaya has recently introduced and how they are connected to an SPB fabric?

Paul Unbehagen: So an ONA is kind of interesting, because it has multiple use cases depending on your need. For some people it is about security, for others automation or simplicity. For example, the ONA uses a technology that we call Fabric Attach, which we have actually taken to the IEEE and which will become 802.1qsg. Basically it is an extension to LLDP that gives a device the capability to say: I want to join VSN 53. It might be your medical or secure network or whatever you are using it for. It is not limited to the ONAs. This allows things like video surveillance cameras to start embedding the FA technology. Then there are devices that don't have the ability to integrate the FA technology quickly, like for example an MRI; it is a multi-million-dollar device that doesn't change very often. For these, the ONA allows you to bring the Fabric Attach concepts to devices that normally wouldn't get them. There are two modes, Fabric Attach and Fabric Extend. In Fabric Attach mode, it allows an MRI to be attached automatically to the medical devices VSN, and I can make sure that the only communication allowed for this MRI is Server-to-MRI and MRI-to-Server and nothing else. For healthcare this is a huge benefit. This applies to healthcare, education, stock exchanges…

Randy Cross: PCI and anything where you need device isolation, manufacturing.

Networkautobahn: All you need for deploying that is a routed connection?

Paul Unbehagen: In Fabric Attach mode all you really need is one of our switches to plug into. You don't even need a fabric. You can just take an ERS4800 and have it automatically configure the VLAN attachments. If you have a full SPB fabric it gets more powerful, because you can say "connect me to the right Layer 2 VSN" and it attaches you wherever you need. The ONA itself is very powerful, because what it is running on is Open vSwitch. It is the same Open vSwitch that we demoed at VMware last year. The 2.4 release of Open vSwitch has Fabric Attach embedded as well. So you can run this in your datacenter and have Fabric Attach going to your 7200 in the future.

Networkautobahn: Can the ONA do encryption?

Paul Unbehagen: Stay tuned, not yet. The other interesting aspect is asset tracking. Suddenly you know where everything is in your network, and if you ever need to move a device to another room, no one has to be involved in that change. Just unplug the ONA, roll it to the next room and plug it back in, and it is automatically attached.

Roger Lapuh: I just want to touch on how it is managed. Basically it doesn't have a CLI and is not manually configured. It is dumb: you plug it in and it gets all its configuration from a centralized controller. It is really leveraging the SDN approach. So you have a central controller that has some rules for how these particular ONAs connect. If someone steals it, all configuration is lost and it can't be used anymore. It is really a tie-in between the customer's infrastructure controller and the ONA that happens while you connect the ONA to the network.

Paul Unbehagen: Also, if someone tries to hack it: there is no CLI or even a console port on it. If they try to start manipulating the software on it, it will brick itself, because it is a secure boot device. So a lot of thought was put into it, trying to make sure it is a secure environment, because we don't want someone stealing it from one place and trying to take it to another place to hack into our virtual network. So in this case we are trying to make sure that it not only provides segmentation and security, it also makes it easier for you to sleep at night.

Networkautobahn: SPB is now getting connected to SDN solutions. Is this the future of SPB? How is it connected to OpenFlow?

Paul Unbehagen: So the communication to the controller that you are talking about looks like this: the ONA talks to the controller via OpenFlow. When you plug it in, the ONA downloads a rule set via OpenFlow that triggers a Fabric Attach message going up to the first SPB switch, and the SPB switch provides the needed VSN. It is a combination of technologies in the right way, versus simply saying there is only one way. It allows the right mixture of tools to make the solution work.

Networkautobahn: At the moment everybody is talking about SDN. For me, SPB has solved most of the problems that SDN is trying to address.

Paul Unbehagen: That is our SDN FX story. The SPB protocol has already solved what everybody else is trying to solve with SDN. We took a lot of time to ask our customers: what is it really that you think you need SDN for? It really comes down to "we need to automate the edge and the connections". That is really what our SDN is.

Posted in All, Avaya, Blog | 4 Comments

10 Days of Troubleshooting

I spent the last 10 days troubleshooting. Sometimes you hit a problem, and until it is fixed you will not get a lot of sleep and coffee is one of your best friends. My little war story starts with a planned upgrade in one of our datacenters: we added 30 additional switches to our SPB-based fabric. The planned job itself worked out as planned. We started on Friday and had nearly finished adding the new switches to the network by Saturday evening. We left with a good feeling and had only 4 switches left for Sunday, which looked like a short day.

During the night, several ERS4800 stacks that had already been deployed in that network 12 months earlier lost their ISIS adjacencies and went offline. All the newly added single ERS4800 switches worked without any issues. On the console of the ERS4800 stacks that had gone offline we saw immediately that the CPU utilization was at nearly 100%. With 100% CPU utilization, the ERS4800 stacks didn't send the ISIS hello packets and the adjacency went down. At the start of the problem we also had VLACP configured on the uplinks, and here we had the same problem: the CPU was too busy to send VLACP packets, and the connected core switch shut down the ports because it no longer received VLACP packets from the connected ERS4k. Disabling VLACP didn't help very much, because we then ran into the same problem with the ISIS hello mechanism.

The workaround that helped to get the connected switches back online was to split up the stacks into single units and reduce the number of adjacencies per stack. That was a very time-consuming task, especially with large stacks. It was strange that stacks that had worked for 12 months without any problems showed this connection loss after we added devices to a different part of the network.

Bug description

Avaya support was able to find the root cause of the problem. On the ERS4k with SW 5.7.x or 5.8.x, the process when an ISIS adjacency is formed looks like this: the ERS4k performs a lookup on all the interfaces and then programs the ASICs per I-SID, so the interface lookup is repeated for every I-SID. For example, an ERS4k with 140 configured I-SIDs has to do the interface lookup 140 times. We saw the problem with devices that run ~140 I-SIDs. It also depends on how many other devices have the same I-SID configured: in the background, a path calculation runs for all the devices that have that particular I-SID configured, which also causes high CPU load. In my experience the ERS4k runs stably in the network until you reach a certain breaking point where the CPU spike from the path and interface calculation lasts longer than the ISIS timer. When you reach that point, you end up in a situation where the CPU stays at 100% indefinitely, because when you lose the adjacency, the deconfiguration produces the same CPU spike as it pulls back the I-SID assignments. Avaya was able to provide a bugfix release. Basically, the bugfix makes sure that the interface lookup is done only once, regardless of how many I-SIDs are configured. We tested this bugfix release and could see that there are now only very short CPU spikes, which solves the problem. The same bugfix is already included in the VSP7k 10.3.3 release, since you could run into the same problem on that platform.
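The scaling behaviour described above can be sketched in a few lines of Python. This is only a toy model, not Avaya's code: the number of interfaces (400), the cost per lookup (0.5 ms) and the 27-second hold time are assumptions picked for illustration, and the function names are hypothetical.

```python
def interface_lookups(num_isids, num_interfaces, fixed):
    """Count the interface lookups performed while an adjacency forms."""
    if fixed:
        # Behaviour of the bugfix as described: one pass, independent of the I-SID count.
        return num_interfaces
    # Behaviour described for SW 5.7.x/5.8.x: one full interface pass per I-SID.
    return num_isids * num_interfaces


def adjacency_survives(lookups, seconds_per_lookup, hold_seconds):
    """True if the CPU spike ends before the neighbour's hold timer expires."""
    return lookups * seconds_per_lookup < hold_seconds


if __name__ == "__main__":
    ISIDS = 140        # from the post
    INTERFACES = 400   # assumed: a large stack
    COST = 0.0005      # assumed: seconds of CPU per lookup
    HOLD = 27.0        # assumed: ISIS hold time in seconds
    for fixed in (False, True):
        n = interface_lookups(ISIDS, INTERFACES, fixed)
        print(f"fixed={fixed}: {n} lookups, {n * COST:.1f} s spike, "
              f"adjacency survives: {adjacency_survives(n, COST, HOLD)}")
```

The point of the sketch is the multiplication: the spike grows with I-SIDs times interfaces before the fix, so a stack can run fine for months and then cross the timer threshold after only a small increase, while after the fix the spike no longer depends on the I-SID count.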

Here is the link to the release notes of the fixed SW 5.8.1.301s: https://downloads.avaya.com/css/P8/documents/101012182

The Aftermath

With every network outage you lose a lot of trust. When the network has run stably for a long time, the customers expect that it will run all the time without any interruption. When you have to scale your network on the fly, you sometimes run into problems. Often that ends in the “it's always the network's fault” discussion. On the day we successfully updated all switches to the fixed software release and solved the problem, a new problem came up on our centralized storage system. That was a good reminder that nearly all IT systems have hidden bugs inside. Finger-pointing doesn't help anyone; we are all in the same boat. I have to say that all the people involved in this troubleshooting hunt did an amazing job. Everybody worked after hours and did everything to manage the crisis as well as possible. Everybody from the different IT departments, the management and Avaya support put serious effort into fixing the problem as fast as possible. Thanks to everybody who was involved.

Posted in All, Avaya, Blog | 4 Comments