Andrii's Blog

Optimization catalog. When float division beats integer division

A counterintuitive hot-path optimization: Swapping IDIVQ for DIVSD to divide integers faster

I did quite a lot of hot path optimizations while working on my last project and some of them were genuinely interesting ones. So why don't I write about some of them. I'm starting the series with an interesting case where speedup was achieved by unintuitive replacement of int operation with float operation.

Avoiding IDIVQ with float division

Let's take a look at this optimization (ignore the 3x claim, I was wrong in that doc comment):

diff --git a/huffman.go b/huffman.go
index 33f8263..180dffc 100644
--- a/huffman.go
+++ b/huffman.go
@@ -378,7 +378,11 @@ func optimizeHuffmanCountsForRLE(counts []uint32, goodForRLEBuf *[]bool) {
 		if i != length {
 			sum += int(counts[i])
 			if stride >= 4 {
-				limit = (256*sum + stride/2) / stride
+				// float64 division is ~3x faster than IDIVQ for variable
+				// divisors, and exact for our input range (numerator < 2^25,
+				// denominator < 2^9, quotient < 2^17 — all exactly
+				// representable in float64's 52-bit mantissa).
+				limit = int(float64(256*sum+stride/2) / float64(stride))
 			}
 			if stride == 4 {
 				limit += 120

I'm replacing integer division (256*sum + stride/2) / stride with float division and converting the result back to int: int(float64(256*sum+stride/2) / float64(stride)). But aren't the integer operations inherently faster than the float operations? Well yes, but actually no. Not in this case.

Before we do anything let's run benchmarks to confirm that the optimization is real (benchmark code):

goos: linux
goarch: amd64
pkg: idivq
cpu: 12th Gen Intel(R) Core(TM) i5-12500
          │    idivq    │               float                │
          │   sec/op    │   sec/op     vs base               │
DivInline   3.358n ± 0%   2.414n ± 0%  -28.10% (p=0.002 n=6)

Ok it's real, 28% win on the microbenchmark. Let's also run the benchmark on AMD CPU:

goos: linux
goarch: amd64
pkg: idivq
cpu: AMD Ryzen 5 7535HS with Radeon Graphics
          │    idivq    │               float                │
          │   sec/op    │   sec/op     vs base               │
DivInline   2.178n ± 0%   1.959n ± 0%  -10.08% (p=0.002 n=6)

~10% win.

So why is it faster? Let's compare the assembly:

Integer (IDIVQ) Float (DIVSD)
Compute 256*sum + stride/2
SHLQ $8, AX
MOVQ BX, CX
SHRQ $63, BX
LEAQ (BX)(CX*1), DX
SARQ $1, DX
ADDQ DX, AX
Compute 256*sum + stride/2
SHLQ $8, AX
MOVQ BX, CX
SHRQ $63, BX
LEAQ (CX)(BX*1), DX
SARQ $1, DX
ADDQ AX, DX
TESTQ CX, CX
JEQ panicdivide
CMPQ CX, $-1
JNE do_div
NEGQ AX
XORL DX, DX
JMP ret
The division by zero check is missing in the float version. As well as handling of overflow special case stride == -1.
Uses very expensive IDIVQ instruction to do the actual division
CQO
IDIVQ CX
Uses relatively cheap DIVSD instruction but adds a bunch of code to do int<-->float conversions
XORPS X0, X0
CVTSQ2SD DX, X0
XORPS X1, X1
CVTSQ2SD CX, X1
DIVSD X1, X0
CVTTSD2SQ X0, AX

The key point is that IDIVQ is replaced with DIVSD. According to uops.info the IDIVQ gives latency of 14-18 and throughput of 10 cycles (see link). And DIVSD has latency 13 and throughput of 4 cycles (link). So I thought that the number I should look for is throughput.

Why throughput and not latency? Because in this microbenchmark the divisions are independent. One result doesn't depend on the previous one. So the divider unit stays "saturated" as it receives the work at the fastest possible rate. And the rate is bottlenecked by the throughput and not latency.

So the throughput was supposed to drop from 10.0 to 4.0 and that's why I wrote 3x in the doc comment above (should have written 2.5x to be more precise as 10/4=2.5). So why then didn't I get the speedup of 2.5x but rather got only ~1.4x (the 28% Intel win)?

To answer this question I ran perf stat and collected the arith.divider_active metric. The arith.divider_active counts cycles during which the cpu divide unit was busy.

idivqfloat
arith.divider_active / op10.00 cyc6.35 cyc
total cycles / op10.03 cyc7.29 cyc

The perf says that 10 cycles is spent on IDIVQ exactly as uops.info has predicted (using throughput rather than latency was a right choice after all). While we are here note that total cycles in this case is only 0.03 cycles extra which means that all that machinery compiler added to check the divide-by-zero and overflow is negligible (that was another question I wanted to ask - if the speedup is actually caused by that machinery elimination and perf showed it was not the case. The speedup is truly because of using more performant DIVSD instruction). In the float (DIVSD) case the cpu spends 6.35 cycles on division, not 4.0 cycles as I was expecting. Also it spends ~1 cycle more on some extra work making it 7.29 cycles per op in total. Is uops.info wrong then?

I think it's not. I'm fairly confident that DIVSD timing is operand-dependent

  • the 6.35 vs 4 gap is itself a sign of it, since uops.info measures throughput with one fixed operand set. I don't have a decisive explanation for the exact rule, though.

Caveat

Integer division and float-division-then-cast-to-int are not always equivalent. It's due to how the floating point numbers are represented in the hardware and the rounding limitation. Every integer up to 2^53 is exactly representable as an fp number (valid integers can be found after 2^53 too, but above that gaps appear - only even numbers, then multiples of 4, and so on). As we'll see below, the operands of DIVSD must stay <= 2^53 - and the representable values above 2^53 don't save you.

Now the rounding limitation: the result of division (let's call it q for quotient) is q = floor(a/b). But the true value of a/b can be somewhere in the range [q, q+1). The problem arises when a/b is so close to q+1 that hardware can't determine that it is actually smaller than q+1. Then instead of correct q you get off-by-one q+1. The a/b becomes indistinguishable from q+1 when ulp(q) >= 2/b where the ulp is the unit in the last place and that happens when bits(q) + bits(b) > 54, which forces a ≈ q*b > 2^53.

So basically when you divide a by b you have these two limitations:

  • a <= 2^53
  • b <= 2^53

If these conditions hold then you are golden and you can replace IDIVQ with DIVSD.

Wrapping up

So this is one of the few cases where the "integer ops are obviously faster" intuition is wrong: IDIVQ is one of the most expensive integer instructions, and DIVSD plus the int<-->float conversions turns out to be faster. Of course benchmark before applying.

Oh yeah, forgot to mention. This optimization is only applicable when you divide by a runtime variable. If it's a compile-time constant, then the compiler already replaces it with a better optimization.