This week I got a call from a colleague on the Microsoft team, who asked me to check a virtual machine that seemed to have network problems.
The first thing I checked were the basics like CPU, RAM, SAN & network utilization. None of those resources was heavily used so I moved on to the Events tab of the affected VM, which showed the following:
Actually, it was the snapshot removal, not the creation, whose timing I was interested in, but I didn’t pay much attention to that at first. I also checked the vDS “Events” tab, where some entries made me curious:
The dvPort 10810 link was up in the vSphere Distributed Switch in DATACENTER Info DATE 02:19:10
The dvPort 10810 was unblocked in the vSphere Distributed Switch in DATACENTER. Info DATE 02:19:10
The dvPort 10810 was not in passthrough mode in the vSphere Distributed Switch in DATACENTER. Info DATE 02:19:10
The dvPort 10810 was not in passthrough mode in the vSphere Distributed Switch in DATACENTER. Info DATE 02:18:45
The dvPort 10810 link was down in the vSphere Distributed Switch in DATACENTER Info DATE 02:18:45
The vmkernel.log also showed corresponding messages:
*Note: the vmkernel.log timestamps are in UTC, so add one hour to match the event times above.
DATET01:19:10.798Z cpu50:11074)NetPort: 2599: resuming traffic on DV port 10810
DATET01:19:10.798Z cpu50:11074)NetPort: 1237: enabled port 0x200000e with mac 00:50:56:XX:XX:XX
My first thought: “25 seconds of network downtime, there must be something wrong!?” But it didn’t take long to find the explanation:
According to KB2011040 and VMware support, all of those messages can be safely ignored.
OT: Why do I post log messages which can be ignored?
So what was causing the network issues? In the end, the mystery was solved by the virtual machine log:
DATET01:19:10.755Z| vcpu-0| Checkpoint_Unstun: vm stopped for 24733733 us
Sometimes one cannot see the wood for the trees… So the problem was really “just” caused by the snapshot removal and the corresponding virtual machine stun: the VM was stopped for 24,733,733 µs, which is roughly the 25 seconds between the dvPort link down and link up events. The process is basically the same as when creating a virtual machine snapshot: the VM gets stunned for a certain time to freeze I/O and to switch to the delta.vmdk. Unfortunately, I had missed that the VM already had two snapshots, which unnecessarily increased the duration of the snapshot removal. I hope this helps if you see those vDS event messages and start to wonder whether they might be the cause of your problem…
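If you want to check a virtual machine log for stuns without scrolling through it by hand, something like the following Python sketch might help. It is not an official tool. It assumes the log lines look like the Checkpoint_Unstun entry quoted above, and the function name `stun_durations` and the one-second threshold are my own choices for illustration.

```python
import re

# Matches entries such as:
#   ...| vcpu-0| Checkpoint_Unstun: vm stopped for 24733733 us
# The pattern is based only on the single log line quoted in this post.
UNSTUN_RE = re.compile(r"Checkpoint_Unstun: vm stopped for (\d+) us")


def stun_durations(lines, threshold_s=1.0):
    """Yield (timestamp, seconds) for each unstun entry of at least threshold_s.

    threshold_s is an arbitrary cut-off (an assumption, not a VMware value)
    that hides the very short stuns and keeps the noticeable ones.
    """
    for line in lines:
        match = UNSTUN_RE.search(line)
        if match is None:
            continue
        seconds = int(match.group(1)) / 1_000_000  # microseconds to seconds
        if seconds >= threshold_s:
            # Everything before the first "|" is the timestamp column.
            timestamp = line.split("|", 1)[0].strip()
            yield timestamp, seconds


# Usage with the line from this post; pass an open file object instead of
# the list to scan a whole log.
sample = [
    "DATET01:19:10.755Z| vcpu-0| Checkpoint_Unstun: vm stopped for 24733733 us",
]
for timestamp, seconds in stun_durations(sample):
    print(f"{timestamp}  VM stunned for {seconds:.1f} s")
```

Any stun that lines up with the dvPort link down and link up events, as it did here, is a strong hint that the network "outage" was really the VM being paused.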