8000 Add libxmp_crc32c function and fast module hashing mode. by AliceLR · Pull Request #651 · libxmp/libxmp · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Add libxmp_crc32c function and fast module hashing mode. #651

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

AliceLR
Copy link
Contributor
@AliceLR AliceLR commented May 18, 2023

Adds a CRC-32C calculation function simplified from Mark Adler's fast CRC-32C function, with fast hardware ARM and RISC-V Zbc implementations added.

Also adds an optional fast module hashing mode to libxmp, allowing the MD5 to be skipped unless a module matches both the expected length and CRC-32C of a module in the quirks list. This reduces loading time significantly for most modules, at the cost of users opting out of being able to use the MD5 for their own module quirks lists. Fast hashing mode is intended to accelerate fuzzing and loading on embedded platforms, and is disabled by default.

Finally, I've removed an old copy of nebulos by Audiomonster from the quirks list. I've been unable to locate this copy, and all existing copies are correctly detected as VBlank by scan compare.

See libxmp issue #400 for more information.

TODO:

  • Locate missing copy of "No Mercy" by Alf.
  • Add xmp_set_player setting to configure hashing mode and update documentation.
  • Win32 detection of ARMv8.0 extensions?
  • Define LIBXMP_NO_HARDWARE_CRC in Makefile.vc and watcom.mif?
  • CMake detection of getauxval/asm/hwcap.h/sysctlbyname.
  • Vectorized x86 CRC-32C implementation. Newer versions of Mark Adler's CRC-32C use the faster vectorized method so this shouldn't be much trouble to adapt. This increases the architecture requirements of the algorithm slightly (Westmere rather than Nehalem).
  • Vectorized ARM CRC-32C implementation. This needs an extra sysctlbyname call, as PMULL is optional. Done with pure C right now, but it might be helpful to also have a GCC/clang inline assembly implementation too.
  • Fix unfinished RISC-V Zbc CRC-32C implementation. I'm not aware of any useful boards that implement B yet, and Linux hasn't added a HWCAP flag for B or Zbc yet, so this is mostly just for QEMU :^)

When the fast routine is enabled, I've seen this reduce full loads of the Modland Protracker directory from ~50s to ~30s (i7-7700, Windows 10, test-dev/xmpchk, filesystem cached in RAM). Fuzzing improvement won't be quite this good since most inputs already fail to load, and the fuzzing routine now uses the CRC-32C to seed player settings. All fast hashing code is currently disabled in the patch to pass the regression tests.

Other useless things:

  • Power ISA has the instruction vpmsumd (intrinsic vec_pmsum_be), which is almost equivalent to the entire fold function in crc32c_arm.h (the products are XORed into bits 1-127 of the result so the required modulos are different, and the result still needs to be XORed with the next input). This seems to have been added in Power ISA 2.07 (POWER8 and up), which as far as I can tell is used exclusively in servers and expensive workstations, so this is very low priority.
  • SPARC64 has the instructions xmulx, xmulxhi, and xmpmul which can perform the required polynomial math for this. Buy me a workstation that supports these instructions and I will implement it :)

@AliceLR AliceLR marked this pull request as draft May 18, 2023 21:15
@sezero
Copy link
Collaborator
sezero commented May 18, 2023

Some notes:

  • Build failure on i686-linux:
src/crc32c.c: In function ‘libxmp_crc32c’:
src/crc32c.c:331: error: PIC register ‘%ebx’ clobbered in ‘asm’
  • ix86 in general: detect cpuid existence first?

  • OW, x86: has its own inline asm syntax (#pragma aux ....)

  • MSVC, x86: has its own inline asm syntax

  • MSVC, x64: has no inline asm at all. uses intrinsics instead
    as of VS 2008 version. (include intrin.h or possibly nmmintrin.h for
    _mm_crc32_u8|16|32 )

@AliceLR
Copy link
Contributor Author
AliceLR commented May 18, 2023

Some notes:

Thanks, I will check into this. I don't know if I'll get back to this right away. This branch is a tumor that grew off of MOD VBlank detection and my old CRC-32C testing, I had to get it out of my working tree :-)

@sezero
Copy link
Collaborator
sezero commented May 18, 2023

For calling cpuid, and also the PIC issue, see SDL's code:
https://github.com/libsdl-org/SDL/blob/SDL2/src/cpuinfo/SDL_cpuinfo.c

@sezero
Copy link
Collaborator
sezero commented May 18, 2023

For RISC stuff, CC: @ccawley2011

@codecov
Copy link
codecov bot commented May 19, 2023

Codecov Report

Merging #651 (c6ad1ac) into master (e5e79e3) will decrease coverage by 0%.
The diff coverage is 69%.

❗ Current head c6ad1ac differs from pull request most recent head 04fcb51. Consider uploading reports for the commit 04fcb51 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@          Coverage Diff           @@
##           master    #651   +/-   ##
======================================
- Coverage      84%     84%   -0%     
======================================
  Files         154     156    +2     
  Lines       30051   30135   +84     
======================================
+ Hits        25302   25356   +54     
- Misses       4749    4779   +30     
Impacted Files Coverage Δ
src/common.h 100% <ø> (ø)
src/crc32c.c 19% <19%> (ø)
src/control.c 83% <70%> (-<1%) ⬇️
src/crc32c_x86.h 86% <86%> (ø)
src/load_helpers.c 76% <95%> (+3%) ⬆️
src/load.c 92% <100%> (-<1%) ⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b0e822c...04fcb51. Read the comment docs.

@ccawley2011
Copy link
Contributor

For RISC stuff, CC: @ccawley2011

I don't know anything about RISC-V. I'm more familiar with RISC OS, which is unrelated.

That said, it would be interesting to get the ARMv8 code working on RISC OS, although GCC 4.7.4 predates the arm_acle.h header so it may be necessary to provide an inline ASM fallback for that particular use case.

@AliceLR
Copy link
Contributor Author
AliceLR commented May 19, 2023
  • OW, x86: has its own inline asm syntax (#pragma aux ....)
  • MSVC, x86: has its own inline asm syntax
  • MSVC, x64: has no inline asm at all. uses intrinsics instead
    GCC 4.7.4 predates the arm_acle.h header so it may be necessary to provide an inline ASM fallback for that particular use case.

I suspect supporting all of these is going to get messy—maybe an src/crc32c/ and a header per-implementation is justified...

That said, it would be interesting to get the ARMv8 code working on RISC OS

Does RISC OS allow detection of the required CPU features at runtime? (System call, user code runs at EL1 or higher, RISC OS guarantees access to ID_AA64ISAR0_EL1 at EL0, RISC OS allows EL0 to set a custom fault handler, etc?). Due to the current situation with RISC-V I've been considering a compile option to remove the runtime feature check. I don't know why ARM and RISC-V both decided to make this difficult.

@ccawley2011
Copy link
Contributor

Does RISC OS allow detection of the required CPU features at runtime? (System call, user code runs at EL1 or higher, RISC OS guarantees access to ID_AA64ISAR0_EL1 at EL0, RISC OS allows EL0 to set a custom fault handler, etc?).

It should be possible using the OS_PlatformFeatures SWI.

@AliceLR AliceLR force-pushed the add-libxmp-crc32c branch from c6276a5 to ca34b1e Compare May 24, 2023 22:27
@AliceLR
Copy link
Contributor Author
AliceLR commented May 24, 2023

I fixed the RISC-V CRC-32C implementation and fixed minor issues that caused some of the Linux regression tests to fail. I have a much better idea of how to handle the vectorized x86 implementation now, and I'll fix the MSVC/Watcom tests when I get to that.

I found this regarding an ARM vectorized implementation though:

The PMULL instruction extracts each source vector from the lower half of each source register, while the PMULL2 instruction extracts each source vector from the upper half of each source register.

Depending on the settings in the CPACR_EL1, CPTR_EL2, and CPTR_EL3 registers, and the current Security state and Exception level, an attempt to execute the instruction might be trapped.

PMULL and PMULL2 can be detected by the same getauxval call as the CRC-32 instructions, but I don't know if these access control registers are going to be a problem.

@AliceLR AliceLR force-pushed the add-libxmp-crc32c branch 7 times, most recently from 4683847 to 799ab5f Compare June 12, 2023 19:31
Adds a CRC-32C calculation function simplified from Mark Adler's
fast CRC-32C function, with fast hardware ARM and RISC-V Zbc
implementations added.

Also adds an optional fast module hashing mode to libxmp, allowing
the MD5 to be skipped unless a module matches both the expected
length and CRC-32C of a module in the quirks list. This reduces
loading time significantly for most modules, at the cost of users
opting out of being able to use the MD5 for their own module quirks
lists. Fast hashing mode is intended to accelerate fuzzing and
loading on embedded platforms, and is disabled by default.

Finally, I've removed an old copy of nebulos by Audiomonster from the
quirks list. I've been unable to locate this copy, and all existing
copies are correctly detected as VBlank by scan compare.

See libxmp issue libxmp#400 for more information.

TODO:
* Locate missing copy of "No Mercy" by Alf.
* Documentation for xmp_set_player setting to configure hashing mode.
* Vectorized x86, ASM vectorized ARM.
@AliceLR AliceLR force-pushed the add-libxmp-crc32c branch from 799ab5f to 04fcb51 Compare June 14, 2023 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0