Issue
[clone of RHELPLAN-152599]
Description of problem:
Customer reported a performance regression from RHEL 7 to RHEL 8 on Intel Skylake processors.
Version-Release number of selected component (if applicable):
How reproducible:
The customer used the following example to demonstrate the problem.
perf bench mem memcpy -f default --nr_loops 500 --size 3MB
That test achieved 8.5 GB/sec on RHEL-7.5 but only 5.3 GB/sec on RHEL-8.4. This is easily reproducible.
Steps to Reproduce:
Run the above test on RHEL-7.5 and again on RHEL-8.4. The customer used a 2-socket Skylake server; I have also been able to reproduce this on a 2-socket Cascade Lake server.
Additional info:
Thanks to great triaging help from Carlos O'Donell, the problem is understood.
It turns out glibc is selecting a sub-optimal memcpy routine for that processor.
On RHEL-7.5, it used the "__memcpy_ssse3_back()" routine, which was the optimal choice then.
On RHEL-8.4, the glibc memcpy routine used is "__memmove_avx_unaligned_erms()".
On RHEL-8.4, if the Prefer_ERMS hwcap tunable is set for glibc, the faster "__memmove_erms()" routine is used instead.
For example, slow and fast cases:
perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
5.468937 GB/sec
GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS \
> perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
12.508272 GB/sec
I've also attached a simple memcpy reproducer to demonstrate the problem, as shown below:
gcc -O memcpy.c -o memcpy
./memcpy --help
USAGE: ./memcpy size-in-MB loop-iterations
./memcpy 3 500
Rate for 500 3MB memcpy iterations: 7.30 GB/sec
GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 27.29 GB/sec
The customer's system was booted with mitigations=off and with transparent hugepages (THP) disabled. Neither setting is needed to reproduce this problem, but disabling THP does let the simple memcpy reproducer achieve much higher rates.