One of the most difficult problems to troubleshoot in a network is microbursts. This is a really tough one. So what actually is the problem with microbursts? You have a traffic peak in the network that is only present for a fraction of a second. Sometimes these spikes can fill up a 10 Gigabit interface at full line rate. The result is typically a high rate of TCP retransmits and resets on multiple devices in a VLAN/subnet, accompanied by roughly 25% packet loss.
In most cases the server / application teams are the first to notice performance problems that occur sporadically. When this is reported to the network team, the problem is nearly invisible on the network side. The normal sources for statistics and troubleshooting will show nothing. For example, a monitoring server that polls the interface statistics every minute will show nothing. The show commands in the CLI on most vendors also display statistics averaged over a period of 10 seconds, which smooths out a burst that was only present for a fraction of a second. So it looks as if there is no problem in the network.
To find the problem, it helps to have sniffer traces taken during a microburst that show the TCP retransmits and resets. At this point you have to think in a different direction to hunt down the microburst. Depending on the switch vendor, you have to look at a different error counter. The root problem is that an ASIC reaches its maximum throughput and starts to drop packets. If you are lucky, you have a counter for those drops, such as “Drops on no Ressources”.
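To make the averaging effect concrete, here is a minimal Python sketch that measures the same traffic with 1 ms windows and with one 10 second window. The trace is synthetic and every number in it is an assumption chosen for easy arithmetic: 1250-byte packets, 100 Mbit/s of background traffic and a single 50 ms burst at 10 Gbit/s. The function names `make_trace` and `peak_rate_bps` are made up for this illustration.

```python
from collections import defaultdict

PACKET_BYTES = 1250  # 10,000 bits per packet, chosen to keep the arithmetic simple


def make_trace():
    """Build a hypothetical 10 s trace of (timestamp_us, size_bytes) tuples."""
    burst_start, burst_end = 5_000_000, 5_050_000  # an assumed 50 ms burst
    # Background: one packet every 100 us (100 Mbit/s), paused during the burst.
    trace = [(t, PACKET_BYTES) for t in range(0, 10_000_000, 100)
             if not burst_start <= t < burst_end]
    # Burst: one packet every microsecond, which is 10 Gbit/s.
    trace += [(t, PACKET_BYTES) for t in range(burst_start, burst_end)]
    return trace


def peak_rate_bps(trace, window_us):
    """Return the highest average bit rate seen in any window of window_us."""
    bits = defaultdict(int)
    for ts_us, size in trace:
        bits[ts_us // window_us] += size * 8
    if not bits:
        return 0.0
    return max(bits.values()) * 1_000_000 / window_us


if __name__ == "__main__":
    trace = make_trace()
    for label, window_us in (("1 ms", 1_000), ("10 s", 10_000_000)):
        rate = peak_rate_bps(trace, window_us)
        print(f"{label:>5} window: peak {rate / 1e9:.4f} Gbit/s")
```

With these assumed numbers the 1 ms view shows the interface at 10 Gbit/s, while the 10 second view shows about 0.15 Gbit/s for the very same traffic. The same bucketing idea can be applied to timestamps and frame lengths exported from a real sniffer trace.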
What can you do to resolve the problem?
On the server / application side it is possible to change the traffic profile to remove the bursty behaviour. That is really hard to achieve and can only be done with applications that you can change and control. If you can do that, it will resolve the issue with microbursts for one type of server / application. Be aware that you can run into the same problem again in that network when you deploy, for example, a new application.
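One way to change the traffic profile on the sender, assuming you control the application, is pacing: spreading the chunks out over time instead of handing them all to the network at once. The sketch below only computes the send schedule. The function name `paced_offsets_us` and the target rate are hypothetical, and a real application would still have to wait until each offset before sending.

```python
def paced_offsets_us(chunk_sizes, rate_bps):
    """Return send offsets in microseconds so the output stays at or below rate_bps.

    chunk_sizes: sizes in bytes of the chunks the application wants to send.
    rate_bps:    assumed target rate in bits per second.
    """
    offsets = []
    sent_bits = 0
    for size in chunk_sizes:
        # A chunk may leave once everything before it has had time to drain
        # at the target rate.
        offsets.append(sent_bits * 1_000_000 // rate_bps)
        sent_bits += size * 8
    return offsets


if __name__ == "__main__":
    # Three 1250-byte chunks paced to 1 Gbit/s instead of being sent back to back.
    print(paced_offsets_us([1250, 1250, 1250], 1_000_000_000))
```

The trade-off is added latency for the later chunks, which is one reason this is hard to retrofit into an application you do not own.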
The other method to avoid the problem is to split up the uplinks that are connected to servers showing the bursty behaviour across different devices or ASICs. It also helps to have more bandwidth available on the uplinks than the burst can fill. So when the microburst spikes up to 10 Gbit/s, a 25 or 40 Gbit/s uplink also resolves the issue.
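A rough back-of-the-envelope model shows why the extra uplink bandwidth helps. This sketch treats traffic as a fluid and assumes a single dedicated buffer in front of the uplink. The 2 MB buffer size, the 10 ms burst length and the function name `dropped_bytes` are assumptions for illustration, not values from any particular switch.

```python
def dropped_bytes(burst_mbps, burst_us, uplink_mbps, buffer_bytes):
    """Estimate bytes dropped when a burst exceeds the uplink rate.

    burst_mbps:   combined arrival rate during the burst, in Mbit/s
    burst_us:     burst duration in microseconds
    uplink_mbps:  rate at which the uplink drains the queue, in Mbit/s
    buffer_bytes: assumed buffer available to absorb the excess
    """
    excess_mbps = max(0, burst_mbps - uplink_mbps)
    # Mbit/s multiplied by microseconds gives bits; divide by 8 for bytes.
    excess_bytes = excess_mbps * burst_us // 8
    return max(0, excess_bytes - buffer_bytes)


if __name__ == "__main__":
    # Two servers bursting at 10 Gbit/s each for 10 ms towards one uplink.
    for uplink in (10_000, 25_000, 40_000):
        lost = dropped_bytes(20_000, 10_000, uplink, 2_000_000)
        print(f"{uplink // 1000} Gbit/s uplink: {lost} bytes dropped")
```

In this simplified model the 10 Gbit/s uplink has to drop about 10.5 MB, while the 25 and 40 Gbit/s uplinks drain the burst without loss. Real devices with shared buffers will behave less predictably.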
Sometimes microbursts occur sporadically in a network for years without being detected, leaving strange performance tickets unsolved for a long, long time. They are really hard to detect, so keep microbursts in mind when you are dealing with this kind of problem.
An additional challenge is shared resources, like shared buffers and oversubscribed switch fabric connections. The source of the microbursts may be a different interface from the one experiencing the problems.
Yeah, you have multiple shared resources in the hardware and software that can cause problems or
performance issues. Most of the time the vendors have to make compromises and trade-offs that can bite you