We had an issue at a customer where, sometimes after a vMotion, some of the VM's traffic was dropped for a short period of time (approx. 5 minutes). After that short period, everything worked as expected (NSX-T 3.2.0.1). When we looked in Log Insight, we could see that the traffic was dropped by a rule with a higher number than the allow rule. So we had a strong indication that this had something to do with the firewall rules and address sets that get pushed to the interface(s) of the VMs after the vMotion.

To confirm that the issue we were facing was indeed related to vMotion, we set DRS to its least aggressive setting and postponed all the planned ESXi host upgrades. We didn’t see the issue anymore, so we knew we were on the right track.

We logged a case with GSS and did some test migrations. For each migration, we captured the complete list of firewall rules and address sets of the interface before and after, to see if there was any difference that would prove our theory. Unfortunately, the problem did not occur.

So we knew we had to step up our game. We decided to place all the hosts in maintenance mode, one after another, on a lovely Sunday morning. You can imagine that it is not much fun to manually collect all the firewall rules and address sets from the interfaces before and after each migration. So I came up with a little script.

In the past, I had some fun with the Posh-SSH PowerShell module. This module allows you to run SSH sessions from PowerShell and captures the output so you can use it.

This PowerShell script runs through an entire cluster and collects, per host and per interface, the firewall address sets and rules. It then logs them to a text file per interface. We did not know if we would be able to catch the issue, because we only had a window of about 5 minutes, but the script ran for a total of 2 minutes, so we just needed a little luck.

We ran the script before we placed a host in maintenance mode and again after all the vMotions were done. If we saw the problem, we could check the corresponding log file of that interface to determine which firewall rules and address sets were active before and after the migration, so we could send them to GSS for analysis.
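To spot what changed for one interface, the "before" and "after" log files can be diffed line by line. The following is a minimal sketch and was not part of the original troubleshooting. The function name `Compare-FirewallSnapshot` and the file paths in the usage example are hypothetical, and it assumes both files are non-empty text files.

```powershell
# Hypothetical helper (not from the original script): list the lines that
# appear in only one of two snapshot files for the same interface.
function Compare-FirewallSnapshot {
    param(
        [Parameter(Mandatory)] [string] $BeforePath,
        [Parameter(Mandatory)] [string] $AfterPath
    )

    # Assumption: both files exist and contain at least one line.
    $before = Get-Content -Path $BeforePath
    $after  = Get-Content -Path $AfterPath

    # Compare-Object marks lines only in the reference set with '<='
    # and lines only in the difference set with '=>'.
    Compare-Object -ReferenceObject $before -DifferenceObject $after |
        ForEach-Object {
            [pscustomobject]@{
                Line    = $_.InputObject
                FoundIn = if ($_.SideIndicator -eq '<=') { 'before' } else { 'after' }
            }
        }
}
```

A call could look like `Compare-FirewallSnapshot -BeforePath '.\before.txt' -AfterPath '.\after.txt'`, where both file names are placeholders. Identical files produce no output. Because the comparison ignores line order, a rule that only moved position will not show up as a difference.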

When we put the last host in maintenance mode, we were finally able to reproduce the issue. So it seems it only gets triggered after a lot of vMotions. We delivered the data to GSS and got the answer we were expecting: we were hitting a bug :). https://kb.vmware.com/s/article/88228?lang=en_US

We haven’t upgraded the customer yet, so we are not 100% certain that this is our issue, but it sure seems like the same problem.

Finally, the code I used. It sure can use some optimization, but it did the trick :).

sorry for the code
#variables
$vcenter = 'vcentername'
$cluster = 'clustername' 
$getallnics = "summarize-dvfilter | grep -A 3 'vmm'"
$date = get-date -format HHmmddMM

#Connect to vCenter
connect-viserver $vcenter
$hosts = get-cluster $cluster |get-vmhost

#user and password of the ESXi Hosts
$user = 'root'
$pswd = 'password'

$pswdSec = ConvertTo-SecureString -String $pswd -AsPlainText -Force
$cred = New-Object System.Management.Automation.PSCredential($user,$pswdSec)

#enable SSH on all hosts
Get-VMHost -Name $hosts| Foreach {Start-VMHostService -HostService ($_ | Get-VMHostService | Where { $_.Key -eq "TSM-SSH"} )}

#foreach loop through all the hosts
foreach ($esxhost in $hosts){
#list all the VMs on the host so we know where each VM started
Get-VMHost $esxhost | ForEach-Object -Process {
    get-vmhost $esxhost |get-vm| select name,id| out-file ".logging$esxhost.VMs.$date.txt"

    #Build SSH Session to host
    if((Get-VMHostService -VMHost $_).where({$_.Key -eq 'TSM-SSH'}).Running){
    $ssh = New-SSHSession -ComputerName $esxhost.name -Credential $cred -AcceptKey -KeepAliveInterval 5
        #Collect all interfaces/NICs on the specific host
         $getalnics0 = Invoke-SSHCommand -SessionId $ssh.SessionId -Command $getallnics -TimeOut 30
         $getalnics0.output | out-file ".loggingVMs.niclist.$date.txt" -Append
        #Some not-so-fancy trimming so only the interface name is left
         $trimnics = $($getalnics0.output |select-string Name) -split(":")|select-string nic
        
        #Foreach loop so we can get the firewall rules and addressets per interface
         Foreach ($trimnic in $trimnics){
        #Get-addresset
         $addresset = "vsipioctl getaddrsets -f $trimnic"
         $addresset0 = Invoke-SSHCommand -SessionId $ssh.SessionId -Command $addresset  -TimeOut 30
        #Output to text file
         $addresset0.output |out-file .logging$trimnic.$esxhost.$date.txt
        #get-ruleset
         $ruleset = "vsipioctl getrules -f $trimnic"
         $ruleset0 = Invoke-SSHCommand -SessionId $ssh.SessionId -Command $ruleset -TimeOut 30
        #Output to textfile
         $ruleset0.output |out-file .logging$trimnic.$esxhost.$date.txt -Append
         }
        #Remove the SSH session
        Remove-SSHSession -SessionId $ssh.SessionId
    }
    }
}
#As you can imagine, this creates a bunch of log files. To keep it somewhat organized, I put everything in a ZIP file
$compress = @{
  Path = ".logging*.txt"
  CompressionLevel = "Fastest"
  DestinationPath = ".logging$date.zip"
}
Compress-Archive @compress
Remove-item ".logging*.txt"

#disable SSH on all hosts again
Get-VMHost -Name $hosts| ForEach {Stop-VMHostService -HostService ($_ | Get-VMHostService | Where {$_.Key -eq "TSM-SSH"}) -Confirm:$false}

The original article was posted on: www.ruudharreman.nl

Ruud Harreman Virtualization Consultant
