10 Days of Troubleshooting

I spent the last 10 days troubleshooting. Sometimes you hit a problem, and until it is fixed you don't get much sleep and coffee is one of your best friends. My little war story starts with a planned upgrade in one of our datacenters: we added 30 additional switches to our SPB-based fabric. The planned job itself went as planned. We started on Friday and had nearly finished adding the new switches to the network by Saturday evening. We left with a good feeling and had only 4 switches left for Sunday, which looked like a short day.

During the night, several ERS4800 stacks that had been deployed in that network 12 months earlier lost their ISIS adjacencies and went offline. All the newly added single ERS4800 switches worked without any issues. On the console of the ERS4800 stacks that had gone offline we saw immediately that CPU utilization was at nearly 100%. At 100% CPU utilization the ERS4800 stacks didn't send their ISIS hello packets, and the adjacencies went down. When the problem started we also had VLACP configured on the uplinks, and there we had the same problem: the CPU was too busy to send VLACP packets, and the connected core switch shut down the ports because it no longer received VLACP packets from the connected ERS4k. Disabling VLACP didn't help much, because we then ran into the same problem with the ISIS hello mechanism.

The workaround that got the connected switches back online was to split the stacks into single units, reducing the number of adjacencies per stack. That was a very time-consuming task, especially with large stacks. It was strange that stacks that had worked for 12 months without any problems lost their connections after we added devices to a different part of the network.

Bug description

Avaya support was able to find the root cause of the problem. On ERS4k software 5.7.x or 5.8.x, the process when an ISIS adjacency forms looks like this: the ERS4k performs a lookup on all interfaces and then programs the ASICs per I-SID, so the interface lookup is repeated for every I-SID. For example, an ERS4k with 140 configured I-SIDs has to do the interface lookup 140 times. We saw the problem on devices that run ~140 I-SIDs. It also depends on how many other devices have the same I-SID configured: in the background, a path calculation runs for all devices that have that particular I-SID configured, and that also causes high CPU load.

In my experience the ERS4k runs stably in the network until you reach a breaking point where the CPU spike from the path and interface calculation lasts longer than the ISIS timer. Once you reach that point, the CPU stays at 100% endlessly. When you lose the adjacency, the deconfiguration produces the same CPU spike as it pulls back the I-SID assignments.

Avaya was able to provide a bugfix release. Basically, the bugfix ensures that the interface lookup is done only once, regardless of how many I-SIDs are configured. We tested this bugfix release and saw that there is now only a very short CPU spike, which solves the problem. The same bugfix is already in place in the VSP7k 10.3.3 release, because you could run into the same problem on that platform.
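
To make the scaling effect described above more concrete, here is a small Python sketch of a toy model. It is not Avaya's implementation. The interface count, the cost per port check, and the hold time are hypothetical numbers chosen only for illustration, and the model ignores the background path calculation entirely. It only shows why one lookup per I-SID can push the CPU spike past the ISIS timer, while a single lookup does not.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    interfaces: int  # ports scanned by one interface lookup (hypothetical)
    isids: int       # number of I-SIDs configured on the switch


def lookups_before_fix(sw: Switch) -> int:
    """Port checks when the full interface lookup is repeated per I-SID."""
    return sw.interfaces * sw.isids


def lookups_after_fix(sw: Switch) -> int:
    """Port checks when the interface lookup runs once for all I-SIDs."""
    return sw.interfaces


def spike_seconds(port_checks: int, cost_per_check_s: float) -> float:
    """Length of the CPU spike, assuming a fixed cost per port check."""
    return port_checks * cost_per_check_s


def adjacency_survives(spike_s: float, hold_time_s: float) -> bool:
    """The adjacency stays up only if the spike ends before the timer expires."""
    return spike_s < hold_time_s


if __name__ == "__main__":
    # All values below are illustrative assumptions, not measurements.
    stack = Switch(interfaces=400, isids=140)
    cost_per_check_s = 0.001
    hold_time_s = 27.0

    for label, count in (
        ("before fix", lookups_before_fix(stack)),
        ("after fix", lookups_after_fix(stack)),
    ):
        spike = spike_seconds(count, cost_per_check_s)
        print(label, count, "port checks,",
              "adjacency survives:", adjacency_survives(spike, hold_time_s))
```

In this model the work before the fix grows with the product of interfaces and I-SIDs, which is why a large stack with many I-SIDs can cross the breaking point while a single switch with the same configuration stays below it.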

Here is the link to the release notes of the fixed software 5.8.1.301s: https://downloads.avaya.com/css/P8/documents/101012182

The Aftermath

With every network outage you lose a lot of trust. When the network has run stably for a long time, customers expect that it will keep running without any interruption. When you have to scale your network on the fly, you sometimes run into problems. Often that ends in the “it's always the network's fault” discussion. On the day we successfully updated all switches to the fixed software release and solved the problem, a new problem came up on our centralized storage system. That was a good reminder that nearly all IT systems have hidden bugs. Finger-pointing doesn't help anyone; we are all in the same boat. I have to say that everyone involved in this troubleshooting hunt did an amazing job. Everybody worked after hours and did everything to manage the crisis as well as possible. Everybody from the different IT departments, the management, and Avaya Support put serious effort into fixing the problem as fast as possible. Thanks to everybody who was involved.

4 Responses to 10 Days of Troubleshooting

Michael McNamara says:

06/25/2015 at 22:04

It’s the age old question of ‘does it scale’. Thanks for sharing Dominik! Happy to hear that Avaya brought the right resources to the table and put forth some real customer support.

Cheers!

- Dominik says:
  
  06/29/2015 at 12:46
  
  Lessons to be learned: nearly all hardware and software products have bugs; it is only a matter of time until you hit one. The moment you scale up or simply use more features, there is always a chance that you hit a bug. We had some stacks in SMLT configuration that also ran under high CPU utilization. There it was only a management problem for the stacks. In the basic SMLT configuration the switch still forwards at 100% CPU utilization without any problems. There is no hello mechanism like you have with ISIS, where a hello packet has to be sent out in the right time slot.
  Cheers
  
Sja says:

10/09/2016 at 9:10

Well, good question: does it scale?
How many switches are running in that SPB domain today?

Sja

- Dominik says:
  
  10/10/2016 at 20:50
  
  The answer to “does it scale” is: it depends.
  It is not the number of devices that is critical here, but rather the number of configured I-SIDs.
  Transit devices that have no access ports participating in an I-SID are unproblematic.
  If you have a high number of access ports with many different I-SIDs, I would recommend using
  a switch with more CPU horsepower.