Skip to content

Centralize SIMD with Highway and vendor all dependencies as submodules#39

Closed
KimBioInfoStudio wants to merge 14 commits intoOpenGene:mainfrom
KimBioInfoStudio:main
Closed

Centralize SIMD with Highway and vendor all dependencies as submodules#39
KimBioInfoStudio wants to merge 14 commits intoOpenGene:mainfrom
KimBioInfoStudio:main

Conversation

@KimBioInfoStudio
Copy link
Member

@KimBioInfoStudio KimBioInfoStudio commented Feb 26, 2026

Summary

  • Centralized SIMD architecture: Refactored scattered inline Highway SIMD code into a single simd.h/simd.cpp module using foreach_target.h + HWY_DYNAMIC_DISPATCH for runtime CPU target selection, ported from fastp's proven pattern
  • Vendored all dependencies as git submodules: Highway v1.3, isa-l v2.31.0, libdeflate v1.25, googletest v1.17.0 — zero system library requirements, static-only build eliminates header/library version mismatch risks
  • Simplified build: Single make -j builds everything from submodules; git clone --recursive is all that's needed
  • Fixed isa-l NEON assembly on ARM macOS: macOS reports arm64 but isa-l expects aarch64; without the fix only C fallback code is compiled (~60% slower gzip decompression)

Details

SIMD Refactoring

  • Created src/simd.h with 4 public functions: countQualityMetrics, reverseComplement, countAdjacentDiffs, countMismatches, plus testSimd for validation
  • Created src/simd.cpp with Highway multi-target dispatch (compiles for multiple SIMD targets, selects best at runtime)
  • Refactored filter.cpp, sequence.cpp, adaptertrimmer.cpp to use centralized SIMD functions
  • Removed obsolete simdutil.h
  • Added test/simd_test.cpp gtest wrapper

Build System

  • Vendored highway, isa-l, libdeflate, googletest as git submodules under third_party/
  • Rewrote Makefile for static-only build from submodules (no system -lisal -ldeflate -lhwy needed)
  • Upgraded C++ standard from C++14 to C++17 (required by googletest 1.17)
  • Removed stale src/libdeflate.h (v1.6) in favor of submodule's v1.25
  • Fixed igzip include path to match fastp's flat include style

Other

  • Updated README build instructions
  • Added ONT dummy data generator (testdata/gen_ont_reads.py) for benchmarking
  • All 13 unit tests pass

E2E Benchmark

Dataset: 10,000 ONT reads (64.5M bases, median 5kb, N50 8.3kb)
Platform: Apple M4 Pro (14 cores), 48 GB RAM, macOS

FASTQ (uncompressed, 124.5 MB)

Version Run 1 Run 2 Run 3 Avg wall
Baseline (v0.4.1) 0.75s 0.74s 0.74s 0.74s
This PR 0.74s 0.74s 0.74s 0.74s

FASTQ.GZ (compressed, 60.8 MB)

Version Run 1 Run 2 Run 3 Avg wall
Baseline (v0.4.1) 1.12s 1.11s 1.11s 1.11s
This PR 1.12s 1.12s 1.15s 1.13s

Result: No performance regression. The SIMD refactoring is architectural (centralization, portability, maintainability) — the hot path (adapter search) was already SIMD-accelerated, so raw throughput is equivalent.

Note: During development we discovered that isa-l's Makefile.unx only builds C baseline code on ARM macOS (because uname -m returns arm64 not aarch64). This caused a ~60% gzip decompression regression. Fixed by passing host_cpu=aarch64 on ARM macOS. This fix benefits all ARM macOS users building from source.

Test plan

  • make -j builds successfully (static linking from submodules)
  • make test — all 13 tests pass
  • Smoke test: ./fastplong -i testdata/ont_10k.fastq.gz -o /dev/null works correctly
  • E2E benchmark: no performance regression vs baseline (see above)
  • Verified on macOS ARM64 (Apple M4 Pro)

🤖 Generated with Claude Code

KimBioInfoStudio and others added 13 commits February 26, 2026 21:50
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…module

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Refactor filter.cpp: use countQualityMetrics/countAdjacentDiffs
- Refactor sequence.cpp: use centralized reverseComplement
- Refactor adaptertrimmer.cpp: use countMismatches
- Remove simdutil.h (merged into simd.cpp)
- Add simd_test.cpp
- Upgrade C++ standard to C++17 for gtest compatibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add isa-l (v2.31.0) and libdeflate (v1.25) as git submodules matching
fastp's vendoring pattern. Default `make` links system libs; `make static`
builds from submodules. Remove stale src/libdeflate.h (v1.6) in favor of
third_party/libdeflate/libdeflate.h (v1.25). Fix igzip include to match
fastp's flat include style.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace outdated conda/system library instructions with dual-mode
build documentation matching fastp: Option A (static build from
submodules, recommended) and Option B (link against system libraries).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove dual-mode build (system libs vs submodules) to eliminate
potential header/library version mismatch. Default `make` now builds
isa-l and libdeflate from submodules and links statically. Simplify
README build instructions accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Vendor googletest v1.17.0 as a git submodule under third_party/.
Build it from source via cmake during `make test`, eliminating the
need to install googletest on the system.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
macOS reports arm64 via uname -m but isa-l's Makefile.unx expects
aarch64 for the NEON/SVE assembly paths. Without this fix, only the
C baseline code is compiled, resulting in ~60% slower gzip decompression.
Pass host_cpu=aarch64 arch=aarch64 on ARM macOS to build the full
optimized library (85 objects vs 23 C-only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use actions/checkout@v4 with submodules: recursive
- Remove system library installs (isa-l, libdeflate, highway, googletest)
- Only need build-essential, cmake, nasm on Ubuntu
- macOS runners have cmake/Xcode toolchain preinstalled

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
isa-l dev branch uses __arm_streaming (SVE) and ARM assembly syntax
that requires Xcode 17+. macos-14 (Xcode 15.4) and macos-13 are
too old. macos-15 provides Xcode 16.2+ which supports these features.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sfchen
Copy link
Member

sfchen commented Feb 27, 2026

No performance improvement?

@KimBioInfoStudio
Copy link
Member Author

KimBioInfoStudio commented Feb 27, 2026

No performance improvement?

Seems in v0.4.1 we already use highway for SIMD, so we do not have perf improve here

@sfchen
Copy link
Member

sfchen commented Feb 27, 2026

please address the compilation errors::

/Applications/Xcode.app/Contents/Developer/usr/bin/make -C ./third_party/isa-l -f Makefile.unx lib host_cpu=aarch64 arch=aarch64
cd ./third_party/libdeflate && cmake -B build
-DCMAKE_BUILD_TYPE=Release
-DLIBDEFLATE_BUILD_SHARED_LIB=OFF
-DLIBDEFLATE_BUILD_GZIP=OFF &&
cmake --build build
make[1]: Makefile.unx: No such file or directory
make[1]: *** No rule to make target `Makefile.unx'. Stop.
make: *** [third_party/isa-l/bin/isa-l.a] Error 2
make: *** Waiting for unfinished jobs....
CMake Error: The source directory "/Users/shifu/github/fastplong/third_party/libdeflate" does not appear to contain CMakeLists.txt.

@KimBioInfoStudio
Copy link
Member Author

please address the compilation errors::

/Applications/Xcode.app/Contents/Developer/usr/bin/make -C ./third_party/isa-l -f Makefile.unx lib host_cpu=aarch64 arch=aarch64 cd ./third_party/libdeflate && cmake -B build -DCMAKE_BUILD_TYPE=Release -DLIBDEFLATE_BUILD_SHARED_LIB=OFF -DLIBDEFLATE_BUILD_GZIP=OFF && cmake --build build make[1]: Makefile.unx: No such file or directory make[1]: *** No rule to make target `Makefile.unx'. Stop. make: *** [third_party/isa-l/bin/isa-l.a] Error 2 make: *** Waiting for unfinished jobs.... CMake Error: The source directory "/Users/shifu/github/fastplong/third_party/libdeflate" does not appear to contain CMakeLists.txt.

Seems u clone without recursive to fetch submodule, pls clone it with --recursive or

cd fastplong && git submodule update --init --recursive

BTW, better to have a clear tmpdir for this branch, if u use upstream repo maybe some conflicts

@sfchen
Copy link
Member

sfchen commented Feb 27, 2026

No performance improvement?

oh yes, I almost forgot about that.

Replace 8x Eq + 8x IfThenElse comparison chain with a single
And + TableLookupBytes using the low nibble of DNA base ASCII codes.
DNA bases A/a(1), C/c(3), T/t(4), G/g(7) have unique low nibbles,
enabling a 16-byte lookup table for complement mapping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants