Symptoms
You see errors similar to the following in the Tail input plugin in Fluent-bit:
No space left on device
When you check the Fluent-bit pods by running kubectl logs <fluent-bit-pod> -n pks-system, you see entries similar to:
[2020/03/04 20:16:17] [error] [in_tail] could not register file into fs_events
[2020/03/04 20:16:17] [error] [plugins/in_tail/tail_fs.c:219 errno=28] No space left on device
[2020/03/04 20:16:17] [error] [in_tail] could not register file into fs_events
[2020/03/04 20:16:17] [error] [plugins/in_tail/tail_fs.c:219 errno=28] No space left on device
[2020/03/04 20:16:17] [error] [in_tail] could not register file into fs_events
You see that there is enough free space on /var/log inside the pod and the worker nodes also have enough free space.
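To gauge how often the registration error described above occurs, the log output can be piped through a small counting helper. This is a minimal sketch: the function name count_fs_events_errors is hypothetical, and the match string is taken from the sample log entries in this article.

```shell
# Count log lines that report a failed inotify registration by the Tail plugin.
# Reads Fluent-bit log text from stdin and prints the number of matching lines.
# Hypothetical helper; the pattern is copied from the sample entries above.
count_fs_events_errors() {
  # grep -c prints 0 and exits non-zero when nothing matches, so mask the status.
  grep -c 'could not register file into fs_events' || true
}

# Example (replace <fluent-bit-pod> with a real pod name):
#   kubectl logs <fluent-bit-pod> -n pks-system | count_fs_events_errors
```

A count that keeps growing while df -h /var/log still shows free space matches the symptoms listed in this section.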
Impact / Risks
If this situation occurs, the underlying log files are not lost or deleted; they remain on disk. However, Fluent-bit no longer monitors them after the limit is reached. The error occurs because the kernel has reached the per-user limit on inotify watches (fs.inotify.max_user_watches), not a limit on storage space.
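The workaround in this article adjusts fs.inotify.max_user_watches, so it can help to see how many inotify watches are currently in use on a worker node. The sketch below assumes a Linux /proc layout in which each watch appears as a line beginning with "inotify" in /proc/<pid>/fdinfo/<fd>. The function name count_inotify_watches is hypothetical. It totals watches across all processes it can read, whereas the kernel limit is applied per user, so treat the result as a rough indicator only.

```shell
# Print the total number of inotify watches held by all readable processes.
# Optional argument: an alternative proc root (defaults to /proc).
# Hypothetical helper; run as root to see watches owned by other users.
count_inotify_watches() {
  proc_root="${1:-/proc}"
  total=0
  for f in "$proc_root"/[0-9]*/fdinfo/*; do
    # Skip unmatched globs and descriptors that vanished or are unreadable.
    [ -r "$f" ] || continue
    # Each watch is one line starting with "inotify" in the fdinfo file.
    n=$(grep -c '^inotify' "$f" 2>/dev/null) || true
    total=$((total + ${n:-0}))
  done
  echo "$total"
}

# Example:
#   count_inotify_watches
#   sysctl fs.inotify.max_user_watches
```

If the reported total is close to the configured limit, the "No space left on device" message is most likely caused by that limit rather than by disk usage.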
Resolution
This is a known issue affecting VMware Enterprise PKS / VMware Tanzu Kubernetes Grid Integrated Edition. There is currently no resolution.
Workaround
Note: This workaround will not persist across PKS upgrades or node recreation.
As a workaround, increase the sysctl parameter fs.inotify.max_user_watches to 16384 as a starting point and check whether this resolves the issue. For more information, see https://github.com/fluent/fluent-bit/issues/1018
To increase the sysctl parameter fs.inotify.max_user_watches on all the worker nodes:
Check the current value:
sysctl -a | grep fs.inotify.max_user_watches
Increase the value to 16384 (sysctl -w applies the new value to the running kernel immediately):
sysctl -w fs.inotify.max_user_watches=16384
Reload the settings from /etc/sysctl.conf:
sysctl -p
Check the updated value:
sysctl -a | grep fs.inotify.max_user_watches
Alternatively, edit the file /etc/sysctl.conf, find this parameter, overwrite the existing value, and then apply the change by running sysctl -p. If /etc/sysctl.conf already contains a lower value for this parameter, update it there as well, because sysctl -p re-applies the values stored in that file.
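The manual edit of /etc/sysctl.conf described above can also be scripted. The following is a minimal sketch, not a supported tool: the function name set_sysctl_conf_value is hypothetical, and it assumes a sed that supports -E (GNU and BSD sed do). It overwrites the parameter if a line for it already exists and appends it otherwise.

```shell
# Set "key = value" in a sysctl-style configuration file.
# Usage: set_sysctl_conf_value <file> <key> <value>
# Hypothetical helper: replaces an existing entry or appends a new one.
set_sysctl_conf_value() {
  conf_file="$1"
  key="$2"
  value="$3"
  # Escape dots so the key is matched literally in the regular expressions.
  key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
  if grep -Eq "^[[:space:]]*${key_re}[[:space:]]*=" "$conf_file" 2>/dev/null; then
    tmp_file=$(mktemp) || return 1
    # Rewrite the matching line, then copy back so file permissions are kept.
    if sed -E "s|^[[:space:]]*${key_re}[[:space:]]*=.*|${key} = ${value}|" \
        "$conf_file" > "$tmp_file" && cat "$tmp_file" > "$conf_file"; then
      rm -f "$tmp_file"
    else
      rm -f "$tmp_file"
      return 1
    fi
  else
    printf '%s = %s\n' "$key" "$value" >> "$conf_file"
  fi
}

# Example (as root on a worker node):
#   set_sysctl_conf_value /etc/sysctl.conf fs.inotify.max_user_watches 16384
#   sysctl -p
```

As with the manual steps, this change is made on the node itself, so it does not persist across PKS upgrades or node recreation.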