A few days ago I mentioned in a post that the customer and I ran into massive hardware problems while configuring a backup server. To be more precise, we were working on two identical Dell PowerEdge R710 servers, both of which had worked fine for months. We decided to reconfigure the local NIC teams (BACS) to use all four onboard Broadcom BCM5709C NetXtreme II interfaces.
About two hours later the first system started to crash randomly. A day later we got the same problem with the second system. Both servers performed complete power cycles and stopped at POST with a critical error notification.
- No Bluescreen
- No memory dump
- No helpful Windows logs
The only errors were logged by the iDRAC / OMSA log:
Critical,"Wed Jan 29 2014 05:53:25","A bus fatal error was detected on a component at bus 0 device 0 function 0."
Critical,"Tue Jan 28 2014 06:37:05","CPU 1 has an internal error (IERR)."
The error was identical on both servers, but the primary system, which usually handles far more load, crashed more often.
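If you want to compare how often each error shows up on the two servers, a few lines of Python are enough to put the exported log entries on a timeline. This is only a minimal sketch: it assumes the log is exported as comma-separated lines in exactly the shape of the two entries above (severity, quoted timestamp, quoted message), which may not match what your iDRAC / OMSA version produces. The names `SAMPLE_LOG`, `parse_log` and `count_messages` are made up for this example.

```python
import csv
from collections import Counter
from datetime import datetime

# Sample data: the two entries quoted in this post (assumed export format).
SAMPLE_LOG = '''Critical,"Wed Jan 29 2014 05:53:25","A bus fatal error was detected on a component at bus 0 device 0 function 0."
Critical,"Tue Jan 28 2014 06:37:05","CPU 1 has an internal error (IERR)."
'''


def parse_log(text):
    """Return (timestamp, severity, message) tuples, oldest entry first."""
    entries = []
    for row in csv.reader(text.splitlines()):
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        severity, stamp, message = row
        # Timestamp layout assumed from the sample lines (English day/month names).
        when = datetime.strptime(stamp, "%a %b %d %Y %H:%M:%S")
        entries.append((when, severity, message))
    return sorted(entries)


def count_messages(entries, severity="Critical"):
    """Count how often each message occurs at the given severity."""
    return Counter(msg for _, sev, msg in entries if sev == severity)


entries = parse_log(SAMPLE_LOG)
for when, severity, message in entries:
    print(when.isoformat(), severity, message)
```

Running the same script against the export of each server makes it easy to see whether the bus error and the CPU IERR always appear together or, as in our case, only some of the time.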
Because the error messages didn’t indicate a problem with a PCIe device (such as a RAID controller) or with the RAM, and the CPU error was NOT logged every time the system crashed, Dell decided to replace the mainboard. The system was not even back in production when it crashed again. Next try: this time Dell replaced the CPU. Guess what? Right, it took less than an hour and the system was offline again.
The next step was to run some Dell and 3rd party hardware diagnostics and load tests, and all of them passed with NO errors.
Then we reviewed all the changes we had made on both systems, and the only hardware-related thing the two servers had in common was the change to the NIC teams. We had reconfigured the teams as mentioned above and attached two additional cables (going from 2 to 4). As soon as the customer removed those two cables (from 4 back to 2) the system became much more stable, but not completely: it stayed up for one or two days before it crashed again.
So it had to be related to the onboard network adapter. The final step was to upgrade to the latest firmware version 7.8.0 and driver 18.2.0, and since then the system has been running fine! Hope this helps you save some time …