Branchless UTF-8 Encoding

(cceckman.com)

88 points | by vortex_ape 6 hours ago

15 comments

  • comex 5 hours ago
    Incidentally, this automatic branch-if-zero from LLVM is being improved.

    First of all, a recent LLVM patch apparently changes codegen to use CMOV instead of a branch:

    https://github.com/llvm/llvm-project/pull/102885

    Beyond that, Intel recently updated their manual to retroactively define the behavior of BSR/BSF on zero inputs: it leaves the destination register unmodified. This matches the AMD manual, and I suspect it matches the behavior of all existing x86-64 processors (but that will need to be tested, I guess).

    If so, you don't need either a branch or CMOV. Just set a register to 32, then run BSF with the same register as the destination. If the BSF input is nonzero, the 32 is overwritten with the trailing-zero count. If the BSF input is zero, then BSF leaves the register unmodified and you get 32.
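
    A minimal sketch of that trick in Rust inline assembly (hypothetical helper name; it depends on exactly that unmodified-destination behavior):

        #[cfg(target_arch = "x86_64")]
        fn bsf_or_32(x: u32) -> u32 {
            let mut count: u32 = 32; // fallback for a zero input
            unsafe {
                // BSF overwrites `count` with the trailing-zero count when
                // x != 0 and leaves the preset 32 in place when x == 0.
                core::arch::asm!(
                    "bsf {count:e}, {x:e}",
                    count = inout(reg) count,
                    x = in(reg) x,
                    options(pure, nomem, nostack),
                );
            }
            count
        }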

    Since this behavior is now guaranteed for future x86-64 processors, and assuming it's indeed compatible with all existing x86-64 processors (maybe even all x86 processors period?), LLVM will no longer need the old path regardless of what it's targeting.

    Note that if you're targeting a newer x86-64 version, LLVM will just emit TZCNT, which just does what you'd expect and returns 32 if the input is zero (or 64 for a 64-bit TZCNT). But as the blog post demonstrates, many people still build for baseline x86_64.

    (Intel does document one discrepancy between processors: "On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.")

    • hinkley 2 hours ago
      I was watching a video ranting about bad benchmarks yesterday, and in an aside they pointed out that the (gcc-)generated code used a conditional move (CMOV) in several places to handle an if/else-if chain in the code with no branches.

      I think the days of trying to remove branches by rewriting conditional assignments are either gone or close to it. You may still have a data dependency afterwards, but the conditional assignment isn't your biggest problem with throughput.
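
      A tiny illustration (hedged; whether the compiler actually picks CMOV depends on the target and optimization level):

          // With optimizations, a simple select like this is typically lowered
          // to a conditional move (or other branchless code) rather than a branch.
          pub fn select_min(a: u32, b: u32) -> u32 {
              if a < b { a } else { b }
          }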

      • achierius 47 minutes ago
        What makes you say that? I've seen several cases where an over-usage of branchless programming actually slowed things down. Especially once you get past 2 nested conditionals (so 4+ pathways) you do just end up executing a lot of ultimately-unused code. In fact this has been going the other direction, in some ways, for a little while now: people overestimate how much branches cost, particularly small, local, and easy-to-predict ones.
  • deathanatos 1 hour ago

      /// Encode a UTF-8 codepoint.
      /// […]
      /// Returns a length of zero for invalid codepoints (surrogates and out-of-bounds values);
      /// it's up to the caller to turn that into U+FFFD, or return an error.
    
    It's not a "UTF-8 codepoint", that's horridly mangling the terminology. Code points are just code points.

    The input to a UTF-8 encode is a scalar value, not a code point, and encoding a scalar value is infallible. What doubly kills me is that Rust has a dedicated type for scalar values. (`char`.)

    (In languages with non-[USV]-strings…, Python raises an exception, JS emits garbage.)
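
    As an illustration (a hedged sketch, not from the article): routing the encode through `char` makes it infallible, since surrogates and out-of-range values can't be represented at all:

        // `char` is a Unicode scalar value by construction, so encoding can't fail.
        fn encode(c: char) -> ([u8; 4], usize) {
            let mut buf = [0u8; 4];
            let len = c.encode_utf8(&mut buf).len();
            (buf, len)
        }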

  • orlp 4 hours ago
    If you have access to the BMI2 instruction set, I can do branchless UTF-8 encoding like in the article using only 9 instructions and 73 bytes of lookup tables:

        branchless_utf8:
            mov     rax, rdi
            lzcnt   ecx, esi
            lea     rdx, [rip + .L__unnamed_1]
            movzx   ecx, byte ptr [rcx + rdx]
            lea     rdx, [rip + example::DEP_AND_OR::h78cbe1dc7fe823a9]
            pdep    esi, esi, dword ptr [rdx + 8*rcx]
            or      esi, dword ptr [rdx + 8*rcx + 4]
            movbe   dword ptr [rdi], esi
            mov     qword ptr [rdi + 8], rcx
            ret
    
    
    The code:

        static DEP_AND_OR: [(u32, u32); 5] = [
            (0, 0),
            (0b01111111_00000000_00000000_00000000, 0b00000000_00000000_00000000_00000000),
            (0b00011111_00111111_00000000_00000000, 0b11000000_10000000_00000000_00000000),
            (0b00001111_00111111_00111111_00000000, 0b11100000_10000000_10000000_00000000),
            (0b00000111_00111111_00111111_00111111, 0b11110000_10000000_10000000_10000000),
        ];
    
        const LEN: [u8; 33] = [
            // 0-10 leading zeros: not valid.
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            // 11-15 leading zeros: 4 bytes.
            4, 4, 4, 4, 4,
            // 16-20 leading zeros: 3 bytes.
            3, 3, 3, 3, 3,
            // 21-24 leading zeros: 2 bytes.
            2, 2, 2, 2,
            // 25-32 leading zeros: 1 byte.
            1, 1, 1, 1, 1, 1, 1, 1,
        ];
    
        pub unsafe fn branchless_utf8(codepoint: u32) -> ([u8; 4], usize) {
            let leading_zeros = codepoint.leading_zeros() as usize;
            let bytes = LEN[leading_zeros] as usize;
            let (mask, or) = *DEP_AND_OR.get_unchecked(bytes);
            let ret = core::arch::x86_64::_pdep_u32(codepoint, mask) | or;
            (ret.swap_bytes().to_le_bytes(), bytes)
        }
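
    A hedged usage sketch (assumes the caller verifies BMI2 support at runtime before calling):

        fn main() {
            if std::arch::is_x86_feature_detected!("bmi2") {
                // Safety: BMI2 (PDEP) is available on this CPU.
                let (bytes, len) = unsafe { branchless_utf8('é' as u32) };
                assert_eq!(&bytes[..len], "é".as_bytes());
            }
        }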
  • koala_man 4 hours ago
    I'm surprised there are no UTF-8 specific decode instructions yet, the way ARM has "FJCVTZS - Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero"
  • xeeeeeeeeeeenu 5 hours ago
    > So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”.

    Not if you're targeting x86-64-v3 or higher. Haswell (Intel) and Piledriver (AMD) introduced the LZCNT instruction, which doesn't have this problem.
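
    For example (a sketch; the exact codegen depends on the compiler version), building with RUSTFLAGS="-C target-cpu=x86-64-v3" or "-C target-feature=+lzcnt" lets this compile to a single LZCNT with no branch:

        pub fn clz(x: u32) -> u32 {
            // With LZCNT available there's no special case: LZCNT of 0 is defined as 32.
            x.leading_zeros()
        }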

    • pklausler 5 hours ago
      Easy to count leading zeroes in a branch-free manner without a hardware instruction using a conditional move and a de Bruijn sequence; see https://github.com/llvm/llvm-project/blob/main/flang/include... .
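
      A minimal sketch of that shape (the constants are the well-known ones from the public-domain "Bit Twiddling Hacks" page, not necessarily what the linked flang header uses):

          // Multiply-and-lookup table for floor(log2(x)) after bit-smearing.
          const DEBRUIJN_LOG2: [u32; 32] = [
              0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
              8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31,
          ];

          fn clz32_branch_free(x: u32) -> u32 {
              // Smear the highest set bit into every lower position.
              let mut v = x;
              v |= v >> 1;
              v |= v >> 2;
              v |= v >> 4;
              v |= v >> 8;
              v |= v >> 16;
              // De Bruijn multiply + lookup gives floor(log2(x)) for nonzero x.
              let log2 = DEBRUIJN_LOG2[(v.wrapping_mul(0x07C4_ACDD) >> 27) as usize];
              // The zero-input fix-up is a select, which compilers usually turn
              // into a conditional move rather than a branch.
              if x == 0 { 32 } else { 31 - log2 }
          }
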
      • hinkley 2 hours ago

            x |= x >> 1;
            x |= x >> 2;
            x |= x >> 4;
            x |= x >> 8;
            x |= x >> 16;
            x |= x >> 32;
        
        Isn't there another way to do this without so many data races?

        I feel like this should be

           x |= x >> 1 | x >> ??? ...
        • gpderetta 2 hours ago
          By data races I assume you actually mean data dependencies?
    • sltkr 5 hours ago
      You can also very trivially do (codepoint | 1).leading_zeros(), and then you can also shave one byte off the LEN table. (This doesn't affect the result because LEN[31] == LEN[32] == 1.)
  • Arnavion 6 hours ago
    >So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”. Put differently, the “count leading zeros” intrinsic isn’t necessarily a branchless instruction. This might look nicer on another architecture!

    Yes, RISC-V for example defines the instructions for counting leading / trailing zeros (clz, clzw, ctz, ctzw) such that an N-bit zero value has N of them.

    I don't know if I can show it on Rust Godbolt because none of the default RISC-V targets that Rust has support the Zbb extension, but I checked with a custom target that I use locally for my emulator, and `leading_zeros()` indeed compiles to just one `clz` without any further branches. Here's a C demonstration though: https://gcc.godbolt.org/z/cKx3ajsjh

  • purplesyringa 1 hour ago
    Instead of

        let surrogate_mask = surrogate_bit << 2 | surrogate_bit << 1 | surrogate_bit;
        len &= !surrogate_mask;
    
    consider

        len &= surrogate_bit.wrapping_sub(1);
    
    This should still work better. Alternatively, invert the conditions and do

        len &= non_surrogate_bit.wrapping_neg();
  • Validark 4 hours ago
  • RenThraysk 4 hours ago
    Or-ing 1 onto the codepoint before calling leading_zeros() should get a decent compiler to remove the branch.
  • lxgr 5 hours ago
    > Compiler explorer confirms that, with optimizations enabled, this function is branchless.

    Only if you don't consider conditional move instructions branching/cheating :)

  • decafbad 1 hour ago
    Check out Erlang's bit syntax for parsing binaries.
  • Dwedit 4 hours ago
    Wouldn't branchless UTF-8 encoding always write 3 bytes to RAM for every character (possibly to the same address)?
    • ngoldbaum 4 hours ago
      You could do two passes over the string, first get the total length in bytes, then fill it in codepoint by codepoint.

      You could also pessimistically over-allocate assuming four bytes per character and then resize afterwards.

      With the API in the linked blog post it's up to the user to decide how they want to use the output [u8;4] array.
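
      For instance, something along these lines (a sketch; `push_encoded` is a hypothetical helper, not from the post):

          // Append only the `len` meaningful bytes of the fixed-size output.
          fn push_encoded(out: &mut Vec<u8>, encoded: ([u8; 4], usize)) {
              let (bytes, len) = encoded;
              out.extend_from_slice(&bytes[..len]);
          }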

  • emilfihlman 38 minutes ago
    I mean, isn't the trivial answer to just collapse the if/else tree into math that's always evaluated?

      u32 a = (code <= 0x7F);
      u32 b = (code <= 0x07FF);
      u32 c = ((code < 0xD800) || (0xDFFF < code));
      u32 d = (code <= 0xFFFF) * c;
      u32 e = (code <= 0x10FFFF);
      u32 v = (c && e);
      return(-1 * !v + v * (4 - a - b - d));
    
    Highly likely easy to optimise.
  • ThatGuyRaion 6 hours ago
    So is this potentially performance-improving?
    • not2b 5 hours ago
      Usually people are interested in branchless implementations for cryptography applications, to avoid timing side channels (though you then have to make sure that the generated instructions don't have different timing for different input values), and will pay some time penalty if they have to.
    • PhilipRoman 6 hours ago
      Last time I tested branchless UTF-8 algorithms, I came to the conclusion that they only perform [slightly] better for text consisting of foreign multibyte characters. Unless you expect lots of such inputs on the hot path, just go with traditional algorithms instead. Even in the worst case the difference isn't that big.

      Sometimes people fail to appreciate how insanely fast a predictable branch really is.

      • dbcurtis 2 hours ago
        Pretty much. A strongly predicted branch is as fast as straight-line code, for most practical purposes (in modern processors). It is the mis-predicted branch that causes a pipeline flush and a re-fetch and so forth. The whole point of instructions like CMOV is to replace "flakey" branches with a CMOV so that you can execute both code paths and the test condition path all in parallel and grab the right answer at the end. This avoids paying the mis-predict penalty, and gives more time to compute the test condition, which for a branch is almost always only available awkwardly late in the pipeline. So as long as the compiler can do a decent job of identifying "flakey" branches up front for replacement with CMOV, it is a win. And many branches are easy for the compiler to classify. For instance -- if(SomeRareExceptionCondition) handle_exception(); -- for bonus points, move the exception handling code way the heck out to a different text page so that it isn't hanging around taking up I-cache space for no good reason.
  • jan_haker 5 hours ago
    [flagged]