VMware Tanzu In My Homelab: Excessive Disk I/O

I’ve been working on getting VMware Tanzu Kubernetes Grid deployed in my homelab for a bit now. While I have been able to successfully enable Workload Management and deploy the supervisor cluster VMs, I have noticed excessive disk I/O from those VMs.

I never would have noticed if I hadn’t looked at my NAS. After deploying the supervisor cluster VMs, the disk activity light for the onboard SATA was on solid. The onboard SATA is used by the ZIL on my FreeNAS box. The activity lights for the spinning rust were not abnormal, though.

So, I decided to start looking at some metrics, as I wanted to validate that the supervisor cluster VMs were indeed the source.
NOTE: Another oddity I have found is that only the local administrator@vsphere.local account has permission to even view the supervisor cluster VMs. Is this normal or expected? My domain account has the Administrator role as well, but it is unable to view the VMs or even the Workspace folder in the VMs & Templates view. It can see the vCLS VMs, though.

Reviewing the virtual disk rate, we could see that there was a constant write rate of around 15-17MB/s on one of the VMs.

Now let’s look at the disk requests. We can see that this is pretty consistent at 4,000 IOPS.

Now we will look at one of the other VMs. This one is right around 10MB/s write and only 2,000 write requests per second. The third VM looks like the first, with 15-17MB/s and 4,000 IOPS.
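As an aside, the same per-VM disk numbers can also be watched live from the ESXi host with esxtop, assuming SSH is enabled on the host. This isn’t something I captured here, just another way to cross-check the charts:

# from an SSH session on the ESXi host running the supervisor VMs
esxtop
# press 'v' for the virtual disk view; the WRITES/s and MBWRTN/s columns
# line up with the write requests and write rate shown in the charts above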

Let’s take a look at the storage statistics on the FreeNAS box. ada8 is one of the ZIL SSDs, and we see a constant write rate of roughly 38MB/s.

Now let’s take a look at the operations per second. I had just redeployed the supervisor cluster VMs, and you can clearly see that. About 4,000 ops/sec.
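The same write load can also be watched from a shell on the FreeNAS box with zpool iostat; a quick sketch, with “tank” standing in for whatever the pool is actually named:

# per-vdev bandwidth and operations, refreshing every 5 seconds
zpool iostat -v tank 5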

Just for fun, let’s look at the network interface statistics. Constant receive of roughly 350Mbps.

So I thought I would delve into the individual VMs and see if I could find the source.

First, we need to find the password for the root account on those VMs.
To do this, there’s a script you can run from the vCenter shell that outputs the password.
SSH to your vCenter shell, then run: /usr/lib/vmware-wcp/decryptK8Pwd.py

The output should look something like:

root@vcenter [ ~ ]# /usr/lib/vmware-wcp/decryptK8Pwd.py
 Read key from file
 Connected to PSQL
 Cluster: domain-c7:122bff4a-443d-4094-9035-be4e244286cd
 IP: 10.0.13.129
 PWD: Ho+BBOatjFWS2s68flutg7xHYewt/Vy8BAvHNMUQcMhcqG3Havk4Lta4Sdf37wXA/fcmr4rVWBsc1l8LqSSfQrHqmsHfDpolgz8aI269Who6L43RNF283TvEpityLVWgYo+Ai2hOUKUdapzKgG5WPnBhSDyLttMUYAiVO9rrHSg=

Now that we have the password for the root account, we can SSH to the individual VMs.
SSH to one, log in as root with the password from the script, and we can use the dstat utility:

/usr/bin/python3 /usr/bin/dstat

This will output disk and network activity. We could see and validate that, sure enough, we are writing to disk at an excessive rate, even hitting 20MB/s:

root@420631715fd538d484265a9a44e451cb [ ~ ]# /usr/bin/python3 /usr/bin/dstat
 You did not select any stats, using -cdngy by default.
 Color support is disabled as python-curses is not installed.
 /usr/bin/dstat:515: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
   if isinstance(self.val[name], collections.Sequence) and not isinstance(self.val[name], six.string_types):
 --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
 usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw 
  14  10  41  35   0| 216k   16M|   0     0 |   0     0 |  17k   47k
   9   8  48  34   0|   0    14M|  37k   91k|   0     0 |  19k   44k
  12  13  41  34   0|   0    14M|  25k   57k|   0     0 |  15k   43k
   8   9  46  37   0|   0    15M|  25k   80k|   0     0 |  14k   47k
   9   9  46  36   0|   0    15M|  31k   69k|   0     0 |  16k   49k
   9   9  46  35   0|   0    16M|  35k   87k|   0     0 |  15k   46k
  19  12  38  31   0|   0    15M|  34k   72k|   0     0 |  16k   47k
  11  11  42  36   0|   0    15M|  35k   83k|   0     0 |  15k   47k
  29  18  25  28   0|   0    20M|  69k  222k|   0     0 |  19k   45k

We can run the same utility on the other VMs, which show similar results, matching what we saw in the per-VM statistics.
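As another option, dstat itself can point at the process generating the most block I/O via its --top-bio plugin. I went with pidstat below instead, but something like this should work too:

# disk totals plus the most expensive block I/O process each interval
/usr/bin/python3 /usr/bin/dstat -d --top-bio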

Let’s try another utility to see if we can find where all those writes are going! We can use pidstat for this one. I’m not going to include the full output, as a number of the processes are at 0 writes; here are just the processes that are writing. We can absolutely see the culprit from the write rate and I/O delay: fluent-bit, with 15,408 kB/s written and an I/O delay of 7,177,811.

root@420631715fd538d484265a9a44e451cb [ ~ ]# pidstat -d
 Linux 4.19.132-5.ph3-esx (420631715fd538d484265a9a44e451cb) 01/26/2021 x86_64(2 CPU)
 06:03:33 AM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
 06:03:33 AM     0         1      1.20      0.14      0.00      13  systemd
 06:03:33 AM     0        67      0.00     10.62      0.00   50003  jbd2/sda3-8
 06:03:33 AM     0        90      0.00     18.12      0.00    4155  systemd-journal
 06:03:33 AM     0       167      0.52      8.63      0.00       0  crond
 06:03:33 AM     0       173      1.88      3.91      0.00     232  vmtoolsd
 06:03:33 AM     0       285      0.34      6.91      0.10       0  sshd
 06:03:33 AM     0      5149      9.59    323.49      2.12      16  containerd
 06:03:33 AM     0      8320      7.62      0.00      0.00      36  registry
 06:03:33 AM     0      8469      0.01    155.83      0.00   18639  etcd
 06:03:33 AM     0     11639      0.00  15408.21      0.00 7177811  fluent-bit
 06:03:33 AM     0     12709      0.00      0.02      0.00       0  registry-agent
 06:03:33 AM     0     16287      0.06      5.28      0.00      25  python3
 06:03:33 AM     0     16363      0.00      0.06      0.00      57  python3
 06:03:33 AM     0     19825      0.31     11.29      0.00       0  kubelet
 06:03:33 AM     0     20703      0.01      0.67      0.00       1  kube-apiserver
 06:03:33 AM     0     23636      0.01      0.08      0.00       1  lwregd
 06:03:33 AM     0     23740      0.00      0.01      0.00       2  lsassd

Again, we can run the same utility on one of the other VMs and find nearly identical results.
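To dig a little further into what that process is writing, the /proc filesystem is a quick follow-up. The PID 11639 is just the one from my pidstat output above and will differ on your VMs:

# list the files fluent-bit currently has open
ls -l /proc/11639/fd
# per-process I/O counters; write_bytes climbs quickly if this is the writer
cat /proc/11639/io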

I spent a lot of time searching, and was not able to find any other reports of this behavior. Surely I can’t be the first one to experience this, I thought. So, let’s go make a thread on the VMware communities.

https://communities.vmware.com/t5/VMware-vSphere-Discussions/TKG-Supervisor-Cluster-Constant-Disk-IO/m-p/2827512

The one community user who responded stated that they had experienced the same issue and found it was related to using a .local domain. Well, my homelab does have a .local domain.
They also referenced a couple of VMware KB articles related to DNS issues with .local domains.

https://kb.vmware.com/s/article/78611

https://kb.vmware.com/s/article/50120424

But when reviewing both of those articles, the issue they describe is related to the appliances not having a valid or usable DNS server configured. In both cases, the cause is described as either not configuring DNS during deployment or having decommissioned the DNS servers.

Well, I did provide DNS servers as well as a search domain during deployment, and I also validated that the supervisor cluster VMs could indeed resolve the vCenter hostname, whether using the short hostname or the FQDN.

root@420631715fd538d484265a9a44e451cb [ ~ ]# ping vcenter
PING vcenter.incendiary.local (192.168.1.100) 56(84) bytes of data.
64 bytes from vcenter.incendiary.local (192.168.1.100): icmp_seq=1 ttl=62 time=4.51 ms
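For completeness, the DNS server and search domain the VM is actually using can be confirmed from the VM itself:

# nameserver and search domain configured on the supervisor VM
cat /etc/resolv.conf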

So DNS didn’t appear to be the issue. Additionally, I attempted to search the log file /var/log/vmware/fluentbit/consolidated.log, and oddly it kept returning zero results for my vCenter.

root@420631715fd538d484265a9a44e451cb [ ~ ]# cat /var/log/vmware/fluentbit/consolidated.log | grep vcenter.incendiary.local
root@420631715fd538d484265a9a44e451cb [ ~ ]# cat /var/log/vmware/fluentbit/consolidated.log | grep vcenter
root@420631715fd538d484265a9a44e451cb [ ~ ]# cat /var/log/vmware/fluentbit/consolidated.log | grep 192.168.1.100
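I never did figure out why those searches came back empty. For anyone else trying this, a recursive sweep of the whole fluent-bit log directory, forcing text mode in case of binary content, is a cheap sanity check; it’s only a suggestion and not something that solved it for me:

# -r sweeps rotated logs as well, -a forces grep to treat binary files as text
grep -ra 'vcenter' /var/log/vmware/fluentbit/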

I was, however, able to find references to my vCenter when I viewed the log in a text editor, but all I could ever find were POST errors.

kube.var.log.containers.vsphere-csi-controller-7c646978db-dv8tn_vmware-system-csi_vsphere-syncer-117430760f80fea02e666c34a71f32c1f88887151b5e78639e9ff047a9e7b1b2.log: [1611621903.859503, {"log":"2021-01-26T00:45:03.859325213Z stderr F {\"level\":\"error\",\"time\":\"2021-01-26T00:45:03.859202328Z\",\"caller\":\"storagepool/intended_state.go:183\",\"msg\":\"Error fetching cluster properties for domain-c7. Err: Post \\\"https://vcenter.incendiary.local:443/sdk\\\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\",\"TraceId\":\"2ed71762-b82f-4cfc-ad35-d82394e9ab6e\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.isRemoteVsan\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/intended_state.go:183\\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.newIntendedState\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/intended...

Well, I then got a response from a VMware staffer: depping, who runs yellow-bricks.com and is in the Cloud Platforms CTO Office. Neat!

He agreed that this sounds extreme, raised it with some folks, and then had engineering looking into it!

Let me raise this with a few folks, that sounds very extreme indeed.

Agreed, engineering is looking into it.

Sure enough, on February 4 he confirmed that this is indeed a bug! Engineering was able to reproduce the issue and is now investigating. I still don’t know whether it’s actually related to a .local domain or something else. I have not yet seen a related KB article, though I’ve been keeping my eyes open for one.

Issue has been confirmed in our engineering environment as well, it is being investigated and hopefully fixed in a future release. If you run this in production I would recommend filing an SR for it and submitting all details, that way it can be attached to the engineering issue.

So, there is a bug that causes an excessive amount of disk I/O, specifically disk writes due to logging.

It seems you may be able to decrease the logging level, though I have no idea what the full impact would be. The other community user who responded did provide a way to do it. Their post is no longer there; I don’t know whether the user removed it or a mod did. I did receive the post in an email notification, though.

I take no responsibility for this; use at your own risk:

I solved the problem with these steps:
Change Log_Level for fluentbit from debug to info using:
kubectl edit configmap fluentbit-config -n vmware-system-logging
Then delete and recreate daemonset.apps/fluentbit in the namespace vmware-system-logging.
No more useless thousands of log lines per second.
It’s down from 1,500 IOPS to 50 IOPS for each VM, and CPU usage went down from ~1500 MHz to ~800 MHz.
I would recommend opening a case with VMware before implementing this. I don’t want to struggle with VMware support; this is just for my VMUG homelab, and I will only use this modification in my VMUG lab.
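For reference, spelled out as commands run from one of the supervisor control plane VMs, that workaround would look roughly like the sketch below. I have not applied it myself, and I’ve substituted a rollout restart for the delete-and-recreate step described in the quote:

# edit the fluent-bit config; in the [SERVICE] section, change Log_Level debug -> info
kubectl -n vmware-system-logging edit configmap fluentbit-config

# restart the daemonset so the pods pick up the new config
kubectl -n vmware-system-logging rollout restart daemonset/fluentbit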

If or when I see/find the KB article, I will post the link here.

Update 02/26/2021: The issue is only related to the default logging level of fluent-bit; it’s doing debug logging instead of warning-level logging.
https://www.virten.net/2021/02/vsphere-with-tanzu-supervisorcontrolplanevm-excessive-disk-write-io/

I’m still unsure about the KB article; I’ve been keeping an eye open but have not yet seen or found one.

In a production environment, it seems VMware support just suggests disabling fluent-bit logging, per a post by a user on my VMware communities thread.

I got an answer back on my SR. It was a simple fix for us:
You can disable fluentbit logging by:

1. SSH into the VCSA
2. Edit /etc/vmware/wcp/wcpsvc.yaml on the VCSA and change logging_fluentbit_enabled: true => logging_fluentbit_enabled: false
3. service-control --restart wcp
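In shell terms, on the VCSA that boils down to something like the following; again, this is just a sketch, and the same use-at-your-own-risk caveat applies:

# flip the fluent-bit toggle in the WCP service configuration
sed -i 's/logging_fluentbit_enabled: true/logging_fluentbit_enabled: false/' /etc/vmware/wcp/wcpsvc.yaml
# restart the Workload Control Plane service so the change takes effect
service-control --restart wcp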
