Weird MLT (aka EtherChannel) issues with vSphere / ESX 3.5
February 2, 2010 at 22:42 3 comments
I am at a site that uses Nortel Networking kit for the L2/L3 switching. It’s a simple config with about 5 VLANs in total. The ESX/i servers are mostly vSphere 4 / u1 servers with one remaining ESX 3.5 Update 4 server.
We have been having problems over the last few weeks across a specific Nortel stack. Suffice to say the issue has been a pain to troubleshoot since we still can’t isolate it. There seems to be corruption in the backplane and we think it has to do with the MLT links that are configured on the stack.
The problem manifests itself as follows. Port 1/1, for example, may have a problem communicating with a physical server on 4/40, while port 1/4 has no issues. Their physical switching path means that all traffic has to go via the stack uplink/downlink cables. A similar issue can happen with other ports on other switches in the stack, but it is not consistent! Additionally, you may have no issues accessing a server, but certain portions of that server are nigh on unresponsive. Example: connecting to our SharePoint site, everything on the site works except one specific portion that takes forever to return, and SharePoint runs off 1 server!
Troubleshooting this issue has forced us to reboot our switch stack to see if the problem could be cleared.
However, a new problem reared its ugly head and is the reason for this post.
All ESX (HP DL 360 G5 / 380 G5) servers in this site have 6 NICs: 2 onboard and a Quad Port Intel Gigabit NIC.
They are configured as follows:
——————————————————————————
pNic0 (On Board) – primary service console
pNic1 (On Board) – Guest VM traffic (3 VLANS)
pNic2 (Quad Port) – Guest VM traffic (3 VLANS)
pNic3 (Quad Port) – Guest VM traffic (3 VLANS)
pNic4 (Quad Port) – Fault Tolerant port
pNic5 (Quad Port) – vMotion Port / Secondary Service Console
——————————————————————————
The Guest VM traffic ports are bonded into a MultiLink Trunk (MLT), which is Nortel’s equivalent of EtherChannel.
The vSwitch config for this MLT is as follows:
- Use IP Hash Load Balancing
- Notify Switches = Yes
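To illustrate why the switch side has to treat these ports as one trunk, here is a rough sketch of how IP hash load balancing picks an uplink. It assumes the commonly described model (XOR of the source and destination IP, modulo the number of uplinks); the exact hash the vSwitch uses is an assumption here, and the function name and addresses are made up for the example.

```python
import ipaddress


def ip_hash_uplink(src_ip: str, dst_ip: str, uplink_count: int) -> int:
    """Return the index of the uplink a source/destination IP pair maps to.

    Simplified model (an assumption, not the vSwitch's confirmed internals):
    XOR the two IPv4 addresses as integers, then take the result modulo
    the number of uplinks in the team.
    """
    if uplink_count < 1:
        raise ValueError("uplink_count must be at least 1")
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count


if __name__ == "__main__":
    # Hypothetical guest VM talking to three hypothetical clients
    # over a team of 3 uplinks (like pNic1, pNic2 and pNic3 above).
    vm = "10.0.0.10"
    for client in ("10.0.0.20", "10.0.0.21", "10.0.0.22"):
        print(client, "-> uplink", ip_hash_uplink(vm, client, 3))
```

The point of the sketch is that a single VM can send from any of the three uplinks depending on who it is talking to, so the switch will see the same MAC address on several ports. That only works if the switch knows those ports form one MLT.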
This configuration had no issues until the recent problem caused us to reboot the switch stack (which is of course a very rare occurrence).
When the stack comes up, all NIC ports on the Quad Port NIC only negotiate at 100Mbps. Every ESX server exhibited the same problem. And this is consistent since we had to restart our switch stack a few times over the course of the last few weeks!!
The Quad Port NICs negotiating at 100Mbps of course causes a few problems:
- vMotion is impossible
- FT doesn’t work
- The MLT seems to be working but it is not! Every member of an MLT link must be at 1000Mbps. However, we have the on-board NIC negotiating at 1Gbps but the other two members negotiating at 100Mbps. This creates an unintended and nasty side-effect: all traffic on the MLT is sent out across the entire switch stack, causing the switch to act like a hub! Running a Wireshark trace on my desktop, I can see all the traffic between the guest VMs on my PC (when all I should be seeing is broadcast, multicast and traffic addressed directly to my PC). I found this out only by accident, as I was trying to troubleshoot some other issue!
The only way to fix this is to take each NIC and set it to auto-negotiate its link speed using the vSphere client, after which it negotiates at 1Gbps. This resolves the issue and the VM traffic disappears from my Wireshark trace.
On the ESX 3.5 server, all traffic across the MLT and the separate vSwitches (Service Console and vMotion ports) was running through the primary port of the MLT. Even if you disabled the vMotion interface on the switch, we could still ping the vMotion interface since it was somehow being routed via the MLT. The only way to resolve this was to reboot the ESX server for normal behaviour to resume.
I don’t know what causes the add-on NICs to only negotiate at 100Mbps, but some good lessons have been learnt as a result of our troubles.
Lessons Learnt:
1) Statically hardcode all switch ports connecting to ESX servers to 1000Mbps Full Duplex
2) Check networking and other config on each ESX server after a reboot or any other major system event
3) Pay attention to the events log on ESX servers. We did not notice that vMotions were failing since the vMotion ports were running at 100Mbps.
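Lesson 2 is easy to automate. Below is a minimal sketch of the check, assuming you already have each NIC's negotiated speed in Mbps (for example read off the vSphere client or the host's NIC listing). The function names and the sample data are made up for illustration; the sample simply mirrors what we saw after a stack reboot.

```python
from typing import Dict, Iterable, List

EXPECTED_MBPS = 1000  # every port in this setup should link at 1Gbps


def slow_nics(link_speeds: Dict[str, int], expected: int = EXPECTED_MBPS) -> List[str]:
    """Return the names of NICs that did not link at the expected speed."""
    return sorted(name for name, mbps in link_speeds.items() if mbps != expected)


def trunk_is_consistent(link_speeds: Dict[str, int], members: Iterable[str]) -> bool:
    """Return True if every member of the trunk linked at the same speed."""
    return len({link_speeds[name] for name in members}) == 1


if __name__ == "__main__":
    # Illustrative data: on-board NICs at 1Gbps, Quad Port NICs at 100Mbps.
    after_stack_reboot = {
        "pNic0": 1000, "pNic1": 1000,
        "pNic2": 100, "pNic3": 100, "pNic4": 100, "pNic5": 100,
    }
    mlt_members = ["pNic1", "pNic2", "pNic3"]

    print("NICs below expected speed:", slow_nics(after_stack_reboot))
    print("MLT members consistent:", trunk_is_consistent(after_stack_reboot, mlt_members))
```

Running something like this after every switch or host reboot would have flagged both the 100Mbps vMotion/FT ports and the mixed-speed MLT straight away.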
I would like to thank Michael McNamara for his help in trying to figure out our issues with our Nortel switch stack. The post is here:
If you guys have any clues as to what would cause some ports not to communicate with another port in a stack, I would appreciate it as we are still experiencing the issue and none of our checks can point to either a faulty switch or a stack cable.
Entry filed under: Troubleshooting. Tags: MLT, Nortel, Quad Port Nics, vSphere.

1.
Max Lukjanenko | May 9, 2011 at 14:02
Your problem is Layer 1, so forget about VLANs and routing.
If you are using Nortel 5510/5520 BayStack switches, they autonegotiate perfectly with all new equipment.
Your problem will occur if: cables are damaged/not punched-in correctly or are not connected properly in any way. Or if the signal is too weak due to attenuation caused by the distance of the cable.
You must make sure that you are using fully straight-thru cable configuration between the switch and the server. You must also make sure you are using CAT6 cables.
Alternatively, you may use fiber, which is more expensive but does not have the same L1 issues as copper cables.
2.
MkH | May 12, 2011 at 17:59
Hi Max.
Thanks for taking the time to reply. The issue actually was caused by a faulty backplane cable that was used to connect the switches into a stack. The traces and logs never showed an error, but replacing the cable fixed it.
3.
MkH | May 12, 2011 at 18:03
Ignore my last post. It was so long since I wrote the post that I forgot what the problem was!
We still don’t know why the add-on NICs don’t auto-negotiate correctly if the switch stack is rebooted. We have at least 20 servers with the same hardware spec/configuration and all the add-on NIC cards exhibit the same behaviour.
Thankfully, once we sorted out the stack issue, we have never had to reboot it again since!