Andrii's Blog

Optimization catalog. How 4 bytes of padding make array clearing 49% faster

A surprising alignment quirk I learned the hard way: adding 4 bytes of struct padding makes Go's array clearing 49% faster on Intel, all thanks to REP STOSQ.

Part of the Optimization catalog series:

  1. When float division beats integer division
  2. How 4 bytes of padding make array clearing 49% faster (this post)

"Psst, do you want your array clearing function to be 49% faster? Just shift the array by 4 bytes."

Let's take a look at this Go snippet:

const words = 1 << 19

type block struct {
	n    uint32
	data [words]uint32
}

func (b *block) reset() {
	b.data = [words]uint32{}
}

What if I were to tell you that if you just add dummy 4 bytes between n and data you'll speed up the reset function by 49% (on intel machines at least)? Sounds crazy?

Let's benchmark it. Let's actually try different paddings - with a 4-byte step:

type (
	blk00 struct {
		_    [0]byte
		data [words]uint32
	}
	blk04 struct {
		_    [4]byte
		data [words]uint32
	}
	// ...
	blk28 struct {
		_    [28]byte
		data [words]uint32
	}
)

func clear00(b *blk00) { b.data = [words]uint32{} }
// ...
func clear28(b *blk28) { b.data = [words]uint32{} }

So what do we have here? (on benchmark methodology. I denoised my machine before running the benchmarks: pinned core, turbo off, fixed uncore frequency, SMT sibling offline, THP disabled. Here's exactly how.)

goos: linux
goarch: amd64
pkg: clearalign
cpu: 12th Gen Intel(R) Core(TM) i5-12500
                    │ bench.intel.txt │
                    │       B/s       │
Clear/off=00/mod8=0      28.22Gi ± 0%
Clear/off=04/mod8=4      18.94Gi ± 0%
Clear/off=08/mod8=0      28.22Gi ± 0%
Clear/off=12/mod8=4      18.94Gi ± 0%
Clear/off=16/mod8=0      28.22Gi ± 0%
Clear/off=20/mod8=4      18.94Gi ± 0%
Clear/off=24/mod8=0      28.22Gi ± 0%
Clear/off=28/mod8=4      18.93Gi ± 0%

Do you see the zebra pattern? Every padding divisible by 8 gives us much higher throughput.

Do we see the same behavior on the AMD chips? Yes! But the effect is milder, only ~9% here (I didn't investigate why, maybe AMD is just better 🤪).

goos: linux
goarch: amd64
pkg: clearalign
cpu: AMD Ryzen 5 7535HS with Radeon Graphics
                    │ bench.amd.txt │
                    │      B/s      │
Clear/off=00/mod8=0    62.78Gi ± 0%
Clear/off=04/mod8=4    57.67Gi ± 0%
Clear/off=08/mod8=0    62.74Gi ± 0%
Clear/off=12/mod8=4    57.61Gi ± 0%
Clear/off=16/mod8=0    62.74Gi ± 0%
Clear/off=20/mod8=4    57.65Gi ± 0%
Clear/off=24/mod8=0    62.73Gi ± 0%
Clear/off=28/mod8=4    57.57Gi ± 0%

I even rented aws c6g.large instance to test if this reproduces on ARM chips. It's not.

Expand to see the ARM results
goos: linux
goarch: arm64
pkg: clearalign
                    │ bench.arm.txt │
                    │      B/s      │
Clear/off=00/mod8=0    34.25Gi ± 0%
Clear/off=04/mod8=4    34.18Gi ± 0%
Clear/off=08/mod8=0    34.15Gi ± 0%
Clear/off=12/mod8=4    34.19Gi ± 0%
Clear/off=16/mod8=0    34.25Gi ± 0%
Clear/off=20/mod8=4    34.17Gi ± 0%
Clear/off=24/mod8=0    34.19Gi ± 0%
Clear/off=28/mod8=4    34.16Gi ± 0%
geomean                34.19Gi

What on earth is going on, Intel? Let's check the assembly. The most important line there is this:

REP; STOSQ AX, ES:0(DI)

The whole clearing is governed by a single instruction: REP (repeat) STOSQ on the provided address range. What's so special about the REP STOSQ and why it sucks when it's not 8-byte aligned? The Intel's Optimization Reference says:

3.7.6.4 Memset Considerations

When the destination buffer is misaligned, memset() performance using Enhanced REP MOVSB and STOSB can degrade about 20% relative to aligned case, for processors based on Ivy Bridge microarchitecture.

Nice! We've got confirmation that the misalignment causes performance degradation. But what is "Enhanced REP MOVSB and STOSB" (aka ERMSB)? As far as I understand it ERMSB is the special handling Intel added to make REP STOSB fast in the first place. (Intel's doc says STOSB, but as we'll see it applies to STOSQ too)

So how exactly does Intel enhance REP STOSB? Well, Intel's Optimization Reference doesn't spell it out (God, that thing is so hard to read). But I think I have a good guess on what happens behind the scenes. This whole topic originated when I was investigating a regression in my brotli library. Back then I dug into the perf counters and saw that the 8-misaligned clear showed +139% L2 read-for-ownership and +65% cycles vs the aligned one. So the guess is that enhancement mostly consists of writing whole cache lines without reading them first. This, I think, explains why breaking 8-byte alignment hurts: the no-RFO path works on whole cache lines, but with a 4-byte offset one of the 8-byte stores crosses every 64-byte boundary, so no line is ever cleanly covered and the fast path can't kick in.

One more thing about the alignment. The padding controlls offset within the struct but what about the address of the struct itself? What if it lands on an address that is not 8-byte-aligned? Turns out that can't happen, Go allocator makes sure it's always page-aligned. I verified it by this targeted test.

STOSB vs STOSQ

STOSB (B=Byte, clears one byte at a time) is mentioned in the reference as the Enhanced one, but the Go compiler chooses STOSQ (Q=Quadword, clears 8 bytes at a time). Apparently when the reference mentions STOSB it applies to the whole family, including STOSQ. Anyways, worth benchmarking whether STOSB produces better throughput on an aligned target:

stosq    28.21 GiB/s ± 0%
stosb    28.29 GiB/s ± 0%   +0.28%

There is no difference in speed, so I assume Intel "enhances" both variants. The stosb version is only slightly faster, which is likely just noise. But it doesn't matter, because you wouldn't want to use stosb in your Go code anyway: the Go compiler chooses stosq for a good reason (https://go.dev/src/runtime/memclr_amd64.s):

// STOSQ is used to guarantee that the whole zeroed pointer-sized word is visible for a memory subsystem as the GC requires this.

I'm sure the next question will be asked a lot of times so why don't I clear it beforehand.

But what about SIMD?

Either I'm too dumb to understand it, or the Intel reference says that ERMSB is so enhanced that it beats SIMD, but it depends on the destination size.

Go1.26 comes with experimental support of SIMD intrinsics so why don't we try it for this microbenchmark on our 2 MiB array. I also hand-rolled the assembly version of the SIMD variants so that I know the throughput with, e.g. bound checks eliminated, that are present in the experimental intrinsics version.

stosq           28.21 GiB/s

// intrinsics
simd128_sse     26.24 GiB/s    -6.97%
simd256_avx2    28.17 GiB/s    -0.14%

// handrolled (cached)
avx2_hand       28.40 GiB/s    +0.67%
avx2_hand_pft0  20.30 GiB/s   -28.02%
avx2_hand_pfw   27.69 GiB/s    -1.82%

// handrolled (streaming).
// ⚠ NOT reproducible, see below
nt128_sse       31.77 GiB/s   +12.64%
nt256_avx2      32.07 GiB/s   +13.68%

What do I mean by "cached" and "streaming" here? The "cached" variant uses the VMOVDQU SIMD instruction, which first writes to the CPU cache. The "streaming" variant uses the NT (non-temporal) instructions MOVNTDQ/VMOVNTDQ, which write straight to memory.

The intrinsics versions simd256_avx2 (it uses the "cache" VMOVDQU variant btw) showed the same speed as the stock stosq version. Interestingly stock stosq beats the 128 simd variant. The handrolled version showed the same results. I had a wild guess that the whole thing is memory-access bound so the prefetch could be beneficial. It wasn't, see the _pft0 and _pfw variants.

The best results were shown by the "streaming" simd variants. BUT. Only on that run.

// run 2
nt256_avx2   19.60 GiB/s

// run 3
nt256_avx2   35.76 GiB/s

the next runs showed wild jumps. This was happening only to the streaming nt variants.

I will not bore you with all the intermediary steps I took to determine wtf is actually going on. The root cause turned out to be whether the kernel hands you a contiguous block of memory or a scattered one. I created a targeted test to verify this. The test runs the benchmarks in 2 modes: a) scattered 4KiB pages and b) one contiguous 2MiB huge page. As the NT variant writes direct to DRAM, the contiguous 2MiB huge page was consistently fast across runs, whereas the scattered mode kept producing random throughput.

So SIMD didn't beat the stock STOSQ. The NT SIMD variant does beat the stock STOSQ but inconsistently - the result is basically random and depends on your kernel. But SIMD might be a wrong solution to your problem anyways because

You might not need to clear the array

Since you have that reset method it's by definition that the data in your array is temporary. There is a widely used technique to avoid expensive array clearing when dealing with temporary data: mark the data entries with generation tags. In this toy example the array is being cleared once per 255 resets. Yes, 255 and not 256, as zero entry can't be used as you can't distinguish it from just unset data.

const maxGen = 255

type dataEntry struct {
	gen uint8 // generation tag
	// ...
}

type block struct {
	// entries tagged with anything else are stale
	currentGen uint8
	data       [words]dataEntry
	// ...
}

func (b *block) reset() {
	if b.currentGen >= maxGen {
		// out of generations: must use real clear
		b.data = [words]dataEntry{}
		b.currentGen = 0
	}
	b.currentGen++
}

func (b *block) writeDataEntry(d *dataEntry) {
	d.gen = b.currentGen
	// ... some useful work
}

func (b *block) processDataEntry(d *dataEntry) {
	// stale? treat as empty
	if d.gen != b.currentGen {
		return
	}
	// ... some useful work
}

So why do we even need to care about STOSQ zebra pattern? The effect is negligible if we clear the array only once per 255 times.

Unfortunately you can't always afford a tag as wide as a full uint8 for your generations. There is a fundamental tradeoff here. Introducing or enlarging the gen tag inevitably enlarges the data array. If your data array was neatly sitting in L2 cache, enlarging it may cause catastrophic performance degradation once the data doesn't fit the cache anymore.

In my last project I could afford only 4 bits for the gen tag, which gave me a full clear once per 15 resets and there the zebra effect was not negligible.


This alignment-dependent behavior of array clearing, once I discovered it the hard way, made me angry at first and then frustrated. I've only been doing optimization work for a couple of months, and cases like this make me think: do I need to know all these pitfalls in advance to be good at it? What if I'd never discovered this behavior? I'd have left a couple of throughput percent on the table. You don't add dummy 4-byte fields to your structs at random. In fact I discovered it only because I was making the fieldalignment linter happy while adding a new match finder to my brotli go library: I reordered the struct fields and saw the performance degradation. Even then I ignored it for a while and waved it off, because I thought it was one of those weird codegen alignment issues, like e.g. this one. I'm glad that in the end I decided to chase down the root cause. Now I know more than I did before.

P.S. all the code related to this blog post can be found here.