buffer: improve performance of multiple Buffer operations#61871
buffer: improve performance of multiple Buffer operations#61871thisalihassan wants to merge 10 commits intonodejs:mainfrom
Conversation
|
Review requested:
|
d2ba38f to
495feb5
Compare
| void FastSwap16(Local<Value> receiver, | ||
| Local<Value> buffer_obj, | ||
| // NOLINTNEXTLINE(runtime/references) | ||
| FastApiCallbackOptions& options) { | ||
| HandleScope scope(options.isolate); | ||
| ArrayBufferViewContents<char> buffer(buffer_obj); | ||
| CHECK(nbytes::SwapBytes16(const_cast<char*>(buffer.data()), buffer.length())); | ||
| } |
There was a problem hiding this comment.
These fast callbacks are non-identical to the conventional callbacks they shadow.
- The existing callbacks validate their argument and throws to JS if invalid, whereas your fast callbacks hard-crash the process. It might be better to validate in the JS layer, then use the same unwrapping logic on both sides.
- Your fast callback cannot have a different return convention to the conventional callback. You will need to remove the return value from the conventional callback.
There was a problem hiding this comment.
Was removing the validation instead of moving it to js intentional?
Also fast paths can keep the validation and fall back to slow i think, so we can still validate on the src side?
There was a problem hiding this comment.
Also fast paths can keep the validation and fall back to slow i think, so we can still validate on the src side?
This is outdated, the fast API doesn't use fallback any more (since Node.js v23.x).
However, any validation in a JS wrapper should be shadowed by a CHECK or DCHECK in the C++ binding.
There was a problem hiding this comment.
SPREAD_BUFFER_ARG already contains CHECK((val)->IsArrayBufferView()) on all Slow & Fast swap methods
| void FastSwap16(Local<Value> receiver, | ||
| Local<Value> buffer_obj, | ||
| // NOLINTNEXTLINE(runtime/references) | ||
| FastApiCallbackOptions& options) { | ||
| HandleScope scope(options.isolate); | ||
| ArrayBufferViewContents<char> buffer(buffer_obj); | ||
| CHECK(nbytes::SwapBytes16(const_cast<char*>(buffer.data()), buffer.length())); | ||
| } |
There was a problem hiding this comment.
Fast callbacks should include debug tracking and call tests.
There was a problem hiding this comment.
Thanks for sharing these I will update the code
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #61871 +/- ##
==========================================
- Coverage 91.60% 89.68% -1.92%
==========================================
Files 337 676 +339
Lines 140745 206716 +65971
Branches 21802 39585 +17783
==========================================
+ Hits 128925 185400 +56475
- Misses 11595 13449 +1854
- Partials 225 7867 +7642
🚀 New features to boost your workflow:
|
1395d2f to
01ba74f
Compare
src/node_buffer.cc
Outdated
| FastApiCallbackOptions& options) { | ||
| TRACK_V8_FAST_API_CALL("buffer.swap16"); | ||
| HandleScope scope(options.isolate); | ||
| ArrayBufferViewContents<char> buffer(buffer_obj); |
There was a problem hiding this comment.
ArrayBufferViewContents is wrong here, as buffer.data() may be a stack-allocated copy of the byte data rather than the data itself. SPREAD_BUFFER_ARG is the correct macro to use here, as per the conventional callback.
There was a problem hiding this comment.
@Renegade334 replaced ArrayBufferViewContents with SPREAD_BUFFER_ARG in all three fast swap callbacks
There was a problem hiding this comment.
For toHex, wait until #61609, which improves native perf significantly (more than Uint8Array.prototype.toHex)
See also #60249 (comment)
| } else if (value.length === 1) { | ||
| // Fast path: If `value` fits into a single byte, use that numeric value. | ||
| if (normalizedEncoding === 'utf8') { | ||
| if (normalizedEncoding === 'utf8' || normalizedEncoding === 'ascii') { |
There was a problem hiding this comment.
Currently, ascii behaves exactly like latin1
Unsure if by design or accidentally
There was a problem hiding this comment.
yes, this is safe by design I am just extending the existing single byte numeric optimization to cover ASCII, since the guard already constrains it to the valid ASCII range.
|
Note on toBase64 / toBase64url: I also tried replacing the C++ base64Slice/base64urlSlice bindings with V8's Uint8Array.prototype.toBase64() (similar to the toHex change) but it caused a 35-54% regression across all buffer sizes so I reverted base64/base64url and kept only the toHex optimization which showed a clear +26-37% win. |
|
@thisalihassan toHex doesn't show a win anymore with Instead, it's ~3x slower. See nodejs/nbytes#12 |
|
Hi @ChALkeR thanks for flagging I was not aware. I benchmarked the nibble approach locally and it's indeed a much bigger win (~3x vs my ~30% with toHex). Reverted the toHex path entirely the other changes in this PR are unaffected. Should I include the nbytes nibble HexEncode optimization in this PR or keep them as separate PRs?
|
Hi @ChALkeR I see that your changes landed, congratualtions on that huge win. Is there benchmark-ci that we can run on this PR? I can rebase it |
|
I re-ran the benchmark against the latest main (which contains the nibble improvement) and found out there are few regressions caused by my code:
I tried fixing this regression but unavoidable Attaching Details below: buffer-benchmark-all-rebased.csv -- raw benchmark CSV |
|
compare-R-output.txt-- full output from Rscript benchmark/compare.R (shared as a file to avoid cluttering the PR comment). |
59f5d09 to
48abfc5
Compare
|
PS: I resolved some conflicts |
|
Hi @ChALkeR @anonrig @Renegade334 is this PR ready to land? can you also please check the latest benchmarks i have posted, I have compiled PDF for this benchmark in one my comments above for more detail
|
Qard
left a comment
There was a problem hiding this comment.
Generally LGTM, but one small nit.
Failed to start CI⚠ Commits were pushed since the last approving review: ⚠ - buffer: improve performance of multiple Buffer operations ⚠ - buffer: address PR feedback ⚠ - buffer: revert toHex in favour of nbytes HexEncode update ⚠ - resolve feedback ⚠ - resolve feedback on fast callbacks ⚠ - added comments and updated tests ⚠ - resolve feedback ✘ Refusing to run CI on potentially unsafe PRhttps://github.com/nodejs/node/actions/runs/22977319444 |
|
CI: https://ci.nodejs.org/job/node-test-pull-request/71839/ Results |
lib/buffer.js
Outdated
| if (length === undefined && typeof offset === 'string') { | ||
| encoding = offset; | ||
| length = this.length; | ||
| length = len; |
There was a problem hiding this comment.
This is extremely confusing. Please use more readable variable names.
|
|
||
| Buffer.prototype.swap16 = function swap16() { | ||
| // For Buffer.length < 128, it's generally faster to | ||
| // For Buffer.length <= 32, it's generally faster to |
There was a problem hiding this comment.
Can you add a reference link to this pull-request above this line
| void Swap16(const FunctionCallbackInfo<Value>& args) { | ||
| Environment* env = Environment::GetCurrent(args); | ||
| THROW_AND_RETURN_UNLESS_BUFFER(env, args[0]); | ||
| DCHECK(args[0]->IsArrayBufferView()); |
There was a problem hiding this comment.
Although I agree that DCHECK is sufficient here, we always use non-debug CHECK's. If we want to replace this, we should do it in all of the codebase.
| DCHECK(args[0]->IsArrayBufferView()); | |
| CHECK(args[0]->IsArrayBufferView()); |
There was a problem hiding this comment.
we already haveSPREAD_BUFFER_ARG for Check thenDCheck looks redudand want me to remove this?
| FastApiCallbackOptions& options) { | ||
| TRACK_V8_FAST_API_CALL("buffer.swap16"); | ||
| HandleScope scope(options.isolate); | ||
| SPREAD_BUFFER_ARG(buffer_obj, ts_obj); |
There was a problem hiding this comment.
This fast api call is missing DCHECK(buffer_obj->IsArrayBufferView());
There was a problem hiding this comment.
SPREAD_BUFFER_ARG already do CHECK((val)->IsArrayBufferView()); which is why I didn't add DCheck here
| void Swap32(const FunctionCallbackInfo<Value>& args) { | ||
| Environment* env = Environment::GetCurrent(args); | ||
| THROW_AND_RETURN_UNLESS_BUFFER(env, args[0]); | ||
| DCHECK(args[0]->IsArrayBufferView()); |
There was a problem hiding this comment.
| DCHECK(args[0]->IsArrayBufferView()); | |
| CHECK(args[0]->IsArrayBufferView()); |
| FastApiCallbackOptions& options) { | ||
| TRACK_V8_FAST_API_CALL("buffer.swap32"); | ||
| HandleScope scope(options.isolate); | ||
| SPREAD_BUFFER_ARG(buffer_obj, ts_obj); |
There was a problem hiding this comment.
check is arraybufferview is missing.
There was a problem hiding this comment.
SPREAD_BUFFER_ARG already do CHECK((val)->IsArrayBufferView()); which is why I didn't add DCheck here
| FastApiCallbackOptions& options) { | ||
| TRACK_V8_FAST_API_CALL("buffer.swap64"); | ||
| HandleScope scope(options.isolate); | ||
| SPREAD_BUFFER_ARG(buffer_obj, ts_obj); |
There was a problem hiding this comment.
check is arraybufferview is missing
There was a problem hiding this comment.
same as above
| void Swap64(const FunctionCallbackInfo<Value>& args) { | ||
| Environment* env = Environment::GetCurrent(args); | ||
| THROW_AND_RETURN_UNLESS_BUFFER(env, args[0]); | ||
| DCHECK(args[0]->IsArrayBufferView()); |
There was a problem hiding this comment.
| DCHECK(args[0]->IsArrayBufferView()); | |
| CHECK(args[0]->IsArrayBufferView()); |
- copyBytesFrom: calculate byte offsets directly instead of
slicing into an intermediate typed array
- toString('hex'): use V8 Uint8Array.prototype.toHex() builtin
- fill: add single-char ASCII fast path
- indexOf: use indexOfString directly for ASCII encoding
- swap16/32/64: add V8 Fast API functions
- Guard ensureUint8ArrayToHex against --no-js-base-64 flag by falling back to C++ hexSlice when toHex is unavailable - Remove THROW_AND_RETURN_UNLESS_BUFFER and return value from slow Swap16/32/64 to match fast path conventions (JS validates) - Add TRACK_V8_FAST_API_CALL to FastSwap16/32/64 - Add test/parallel/test-buffer-swap-fast.js for fast API verification
Remove V8 Uint8Array.prototype.toHex() path for Buffer.toString('hex')
in favour of the upcoming nbytes HexEncode improvement (nodejs/nbytes#12)
which is ~3x faster through the existing C++ hexSlice path.
Refs: nodejs/nbytes#12
Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com>
1d16ad4 to
f1d938b
Compare
| let end = viewLength; | ||
|
|
||
| view = TypedArrayPrototypeSlice(view, offset, end); | ||
| if (offset !== undefined) { |
There was a problem hiding this comment.
Would switching from TypedArrayPrototypeSlice to TypedArrayPrototypeSubarray be easier here? It avoids the extraneous copy possibly eliminates the more complicated manual processing here.
There was a problem hiding this comment.
Ok @jasnell I tried as you suggested, here's my implementation did bench with the current implementation and there was consistant 14-20% of regresssion with TypedArrayPrototypeSubarray although I like the code, it's much cleaner
Buffer.copyBytesFrom = function copyBytesFrom(view, offset, length) {
if (!isTypedArray(view)) {
throw new ERR_INVALID_ARG_TYPE('view', [ 'TypedArray' ], view);
}
const viewLength = TypedArrayPrototypeGetLength(view);
if (viewLength === 0) {
return new FastBuffer();
}
let start = 0;
let end = viewLength;
if (offset !== undefined) {
validateInteger(offset, 'offset', 0);
if (offset >= viewLength) return new FastBuffer();
start = offset;
}
if (length !== undefined) {
validateInteger(length, 'length', 0);
end = start + length;
}
view = TypedArrayPrototypeSubarray(view, start, end);
return fromArrayLike(new Uint8Array(
TypedArrayPrototypeGetBuffer(view),
TypedArrayPrototypeGetByteOffset(view),
TypedArrayPrototypeGetByteLength(view)));
};There was a problem hiding this comment.
Here's the bench between current imlementation vs your suggestion:
thisalihassan@Alis-MacBook-Pro node % cat local/benchmarks/buffers/jsnell-copybytes-subarray.csv | Rscript benchmark/compare.R
confidence improvement accuracy (*) (**) (***)
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=2048 type='Float64Array' -0.38 % ±3.41% ±4.53% ±5.90%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=2048 type='Uint16Array' -2.45 % ±2.93% ±3.90% ±5.07%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=2048 type='Uint32Array' * 4.21 % ±3.69% ±4.92% ±6.41%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=2048 type='Uint8Array' *** -8.81 % ±2.34% ±3.11% ±4.06%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=256 type='Float64Array' *** -9.65 % ±2.32% ±3.08% ±4.02%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=256 type='Uint16Array' *** -14.88 % ±1.09% ±1.45% ±1.89%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=256 type='Uint32Array' *** -12.90 % ±2.06% ±2.75% ±3.62%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=256 type='Uint8Array' *** -18.38 % ±2.80% ±3.76% ±4.96%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=64 type='Float64Array' *** -14.73 % ±1.08% ±1.44% ±1.88%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=64 type='Uint16Array' *** -21.16 % ±0.91% ±1.21% ±1.58%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=64 type='Uint32Array' *** -19.46 % ±2.12% ±2.84% ±3.74%
buffers/buffer-copy-bytes-from.js n=600000 partial='none' len=64 type='Uint8Array' *** -21.50 % ±0.94% ±1.25% ±1.64%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=2048 type='Float64Array' 1.55 % ±2.88% ±3.83% ±4.99%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=2048 type='Uint16Array' *** -9.22 % ±1.52% ±2.03% ±2.64%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=2048 type='Uint32Array' -0.56 % ±2.80% ±3.72% ±4.84%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=2048 type='Uint8Array' *** -13.41 % ±1.26% ±1.68% ±2.19%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=256 type='Float64Array' *** -14.31 % ±1.74% ±2.32% ±3.04%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=256 type='Uint16Array' *** -18.77 % ±0.86% ±1.15% ±1.49%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=256 type='Uint32Array' *** -14.93 % ±2.59% ±3.48% ±4.57%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=256 type='Uint8Array' *** -19.45 % ±0.81% ±1.08% ±1.41%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=64 type='Float64Array' *** -16.75 % ±2.95% ±3.94% ±5.14%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=64 type='Uint16Array' *** -20.57 % ±0.92% ±1.23% ±1.61%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=64 type='Uint32Array' *** -19.43 % ±0.74% ±0.98% ±1.28%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset-length' len=64 type='Uint8Array' *** -20.58 % ±1.11% ±1.48% ±1.92%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=2048 type='Float64Array' *** 13.69 % ±4.00% ±5.32% ±6.93%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=2048 type='Uint16Array' *** -8.51 % ±2.33% ±3.10% ±4.03%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=2048 type='Uint32Array' -2.65 % ±2.74% ±3.65% ±4.77%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=2048 type='Uint8Array' *** -10.05 % ±1.26% ±1.68% ±2.19%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=256 type='Float64Array' *** -10.48 % ±1.17% ±1.55% ±2.02%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=256 type='Uint16Array' *** -15.20 % ±1.14% ±1.52% ±1.98%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=256 type='Uint32Array' *** -13.07 % ±1.19% ±1.59% ±2.07%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=256 type='Uint8Array' *** -19.86 % ±1.38% ±1.84% ±2.42%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=64 type='Float64Array' *** -14.71 % ±0.92% ±1.23% ±1.60%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=64 type='Uint16Array' *** -21.59 % ±1.08% ±1.44% ±1.88%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=64 type='Uint32Array' *** -19.28 % ±1.95% ±2.59% ±3.38%
buffers/buffer-copy-bytes-from.js n=600000 partial='offset' len=64 type='Uint8Array' *** -22.33 % ±1.03% ±1.37% ±1.79%
Be aware that when doing many comparisons the risk of a false-positive
result increases. In this case, there are 36 comparisons, you can thus
expect the following amount of false-positive results:
1.80 false positives, when considering a 5% risk acceptance (*, **, ***),
0.36 false positives, when considering a 1% risk acceptance (**, ***),
0.04 false positives, when considering a 0.1% risk acceptance (***)
There was a problem hiding this comment.
hmm... that would still make it a net improvement over the current implementation. I'll leave it up to you, I'm just not a big fan of the manual handling.
There was a problem hiding this comment.
Since I was aiming to maxing perf, I will keep it if the regression was less I would have kept your solution
| // do the swap in javascript. For larger buffers, | ||
| // dropping down to the native code is faster. | ||
| const len = this.length; | ||
| const len = TypedArrayPrototypeGetLength(this); |
There was a problem hiding this comment.
While this should provide a viable defense against crashing if this isn't a typed array, the error is going to be somewhat non-obvious...
Welcome to Node.js v26.0.0-pre.
Type ".help" for more information.
> const { swap16 } = Buffer.prototype
undefined
> swap16(1)
Uncaught:
TypeError: Method get TypedArray.prototype.length called on incompatible receiver undefined
at get length (<anonymous>)
at swap16 (node:buffer:1265:15)
> Having this instead be a ERR_INVALID_THIS error with a more specific error message would be nicer.
There was a problem hiding this comment.
This would have a performance impact wouldn't it?
There was a problem hiding this comment.
These are used so infrequently that I'm not concerned about the performance impact of an additional isArrayBufferView() type check
There was a problem hiding this comment.
Added isArrayBufferView guard with ERR_INVALID_THIS to all three swap methods. Here's the behavior comparison:
swap16
| Receiver | main branch | This PR |
|---|---|---|
| Buffer(4) | OK | OK |
| Uint8Array(4) | OK | OK |
| Float32Array(4) | OK | OK |
| Int16Array(4) | OK | OK |
| Array(4) | OK | ERR_INVALID_THIS |
| Array(200) | ERR_INVALID_ARG_TYPE | ERR_INVALID_THIS |
| number(1) | ERR_INVALID_BUFFER_SIZE | ERR_INVALID_THIS |
| string("ab") | TypeError (read-only property) | ERR_INVALID_THIS |
| plain object | OK (silent garbage) | ERR_INVALID_THIS |
| null | TypeError | ERR_INVALID_THIS |
| undefined | TypeError | ERR_INVALID_THIS |
I don't think any of this is now considered breaking changes, some used to work but that shouldn't have worked realistically @jasnell
cc: @aduh95
| if (len % 8 !== 0) | ||
| throw new ERR_INVALID_BUFFER_SIZE('64-bits'); | ||
| if (len < 192) { | ||
| if (len < 48) { |
There was a problem hiding this comment.
In the swap16 and swap32, the threshold check is len <=, while this is len < ... is there a particular reason for the difference?
There was a problem hiding this comment.
Yes intentional the crossover point for swap64 is between 40 and 48, check the bench here
There was a problem hiding this comment.
A comment here would be helpful then.




Multiple performance improvements to Buffer operations, verified with benchmarks (15-30 runs, comparing old vs new binaries built from same tree).
Buffer.copyBytesFrom()(+100-210%)Avoid intermediate
TypedArrayPrototypeSliceallocation by calculating byte offsets directly into the source TypedArray's underlying ArrayBuffer.Buffer.prototype.fill("t", "ascii")(+26-37%)ASCII
indexOf(+14-46%)Call
indexOfStringdirectly for ASCII encoding instead of first converting the search value to a Buffer viafromStringFastand then callingindexOfBuffer. ASCII and Latin-1 share the same byte values for characters 0-127.swap16/32/64(+3-38%)Add V8 Fast API C++ functions (
FastSwap16/32/64) alongside the existing slow path. Largest gains at len=256 (+35%).Benchmark results
Key results (15-30 runs,
***= p < 0.001):Attaching Details below:
detail.pdf -- visual breakdown
buffer-benchmark-all-rebased.csv -- raw benchmark CSV
compare-R-output.txt-- full output