Issue
[clone of RHELPLAN-152599]
Description of problem:
Customer reported a performance regression from RHEL 7 to RHEL 8 on Intel Skylake processors.
Version-Release number of selected component (if applicable):
How reproducible:
The customer used the following example to demonstrate the problem.
perf bench mem memcpy -f default --nr_loops 500 --size 3MB
That test achieved 8.5 GB/sec on RHEL-7.5 but only 5.3 GB/sec on RHEL-8.4. This is easily reproducible.
Steps to Reproduce:
Run the above test on RHEL-7.5 and again on RHEL-8.4. The customer used a 2-socket Skylake server; I have also been able to reproduce this on a 2-socket Cascade Lake server.
Additional info:
Thanks to great triaging help from Carlos O'Donell, the problem is understood.
It turns out glibc is selecting a sub-optimal memcpy routine for that processor.
On RHEL-7.5, it used the "__memcpy_ssse3_back()" routine, which was the optimal choice then.
On RHEL-8.4, the glibc memcpy routine used is "__memmove_avx_unaligned_erms()".
On RHEL-8.4, if the Prefer_ERMS hwcap tunable is set for glibc, the faster "__memmove_erms()" routine is used instead.
For example, slow and fast cases:
perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
5.468937 GB/sec
GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS \
> perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
12.508272 GB/sec
I've also attached a simple memcpy reproducer to demonstrate the problem, as shown below:
gcc -O memcpy.c -o memcpy
./memcpy --help
USAGE: ./memcpy size-in-MB loop-iterations
./memcpy 3 500
Rate for 500 3MB memcpy iterations: 7.30 GB/sec
GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 27.29 GB/sec
The customer's system was booted with mitigations=off and with transparent hugepages (THP) disabled. Neither setting is needed to reproduce this problem, but disabling THP does let the simple memcpy reproducer achieve much higher rates.