26 comments

  • MathMonkeyMan 1 day ago
    Chandler Carruth told a similar story in one of his talks.

    He met Ken Thompson and saw beautiful C code for the first time because he had encountered a performance problem in a service. The service had to choose a policy to enforce (or something) based on the incoming request. It was taking too long to match the criteria of each potential policy against the request.

    Ken wrote a finite automata based pattern matcher that would simultaneously advance through all of the policies' predicates. It was perfect, and it was much faster than the existing code.

    Then somebody noticed that 99.9% of requests matched a particular policy, so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution.

    • tasn 1 day ago
      This is such a great anecdote, thanks for sharing!

      Somehow relatedly, I still remember the first time I heard about profile-guided optimization which is essentially the same but for all of your code at once (well, same idea, not sure it's aggressive enough to reach the same result as the anecdote you shared).

      • scottlamb 15 hours ago
        > Somehow relatedly, I still remember the first time I heard about profile-guided optimization which is essentially the same but for all of your code at once (well, same idea, not sure it's aggressive enough to reach the same result as the anecdote you shared).

        Exactly: profile-guided optimization is pretty awesome if you have the right infrastructure. You can get maybe 15% speedup on your program without going deep into all the code. But it has limits. IIUC, it gathers information about which code is hot, including which branches are taken, and enables certain compiler optimizations based on that. [1] But it's not going to make huge algorithm changes, it's not going to change data structure definitions [2], and it's not going to prepare good microbenchmark input for you to compare your new algorithm against the current one.

        Also, this article is about Java, and profile-guided optimization is something Java just has baked into its JIT already.

        [1] I don't know exactly which, and they likely vary by compiler version. But I think a variety of stuff to better use cache: breaking code apart into likely-"hot" vs likely-"cold" pages, deciding which loops are worth aligning, and where inlining is worthwhile. Also where branches should be optimized for likely vs unlikely vs unpredictable (conditional instructions). When it's worth autovectorizing. But it's not as good at SIMD as a human who puts two weeks into it, and SIMD usage is constrained by the inability to change data structures.

        [2] in theory, in Rust's default #[repr(Rust)] it could reorder a struct's members, but it's not going to, say, change from array-of-struct to struct-of-array format or hashmap to btree.

    • RedNifre 20 hours ago
      Okay, but how about checking the 99.9% policy first and if it doesn't match, run Ken's solution?
      • fifilura 19 hours ago
        You'd be better off with some stupid code that junior devs (or Grug brained senior devs) could easily understand for the 0.1% cases.
        • smcin 19 hours ago
          (In that specific case, with that specific very skewed data distribution, yes. But not necessarily in the general case. So "they" would be better off with that solution, but not a generalized "you").
        • username135 19 hours ago
          Then how do they learn
          • mannykannot 17 hours ago
            My guess is that they would be more likely to “fix” and “improve” Ken’s algorithm than understand it (Jon Bentley mentioned, in passing, a similar case where this was the outcome.)
          • chii 17 hours ago
            They can learn finite automata from computer science text books.
      • TrapLord_Rhodo 15 hours ago
        That's what he said they did. That's the joke.

        >so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution.

        But you've now spent all this time developing some beautiful solution for 0.1% of the actual data flow. Not the best use of dev time, essentially.

      • lispitillo 14 hours ago
        If a case doesn't match, then speeding up the remaining 0.1% is not going to change the overall speed much. Hence, a non-optimized algorithm may be enough. But when speed and speed variance are critical, then optimization is a must.
      • hinkley 15 hours ago
        The devil is in the details. A pair of my coworkers spent almost two years working on our page generation to split it into a static part and a dynamic part. Similar to Amazon’s landing pages. Static skeleton, but lots of lazy loaded blocks. When they finally finished it, the work was mediocre for two reasons. One, they ended up splitting one request into several (but usually two). Two, they showed average response times dropping 20% but neglected to show that the overall request rate also went up by more than 10%. They changed the denominator. Rookie mistake.

        Meanwhile, one of my favorite coworkers smelled a caching opportunity in one of our libraries that we use absolutely everywhere, and I discovered a whole can of worms around it. The call was too expensive, which was most of the reason a cache seemed desirable. My team had been practicing Libraries Over Frameworks, and we had been chipping away for years at a NIH framework that a few of them, and a lot of people who had since left, had created in-house. At some point we lost sight of how we had sped up the “meat” of workloads at the expense of higher setup time initializing these library functions over, and over, and over again. Example: the logging code depended on the feature toggle code, which depended on a smaller module that depended on a smaller one still. Most modules depended on the logging and the feature toggle modules. Those leaf node modules were being initialized more than fifty times per request.

        Three months later I had legitimately cut page load time by 20% by running down bad code patterns. 1/3 of that gain was two dumb mistakes. In one we were reinitializing a third party library once per instantiation instead of once per process.

        So on the day they shipped, they claimed a 20% response-time reduction (which was less than the proposal expected to achieve), but a lot of their upside came from avoiding work I’d already made cheaper. They missed that the last of my changes went out in the same release, so their real contribution was 11%, but because of the extra requests it only resulted in a 3% reduction in cluster size, after three man years of work. Versus 20% and 7% respectively for my three months of work spread over five.

        Always pay attention to the raw numbers and not just the percentages in benchmarks, kids.

    • snvzz 1 day ago
      My take is that code is disposable, and often needs to be written to figure out what is really needed, then disposed of.

      Learning this is worth writing the otherwise useless code.

      • BigJ1211 22 hours ago
        This is why I strongly adhere to WET¹, you discover so much about the domain the first time you write something. By the time you have it 'working' you realize you should've written vastly different code to solve or cut-off edge cases.

        That's why these days I just start writing code, get it to 'it works!' level. To then throw it all away and start again. Takes a little more effort, but the payoff is far less problematic code and tech debt.

        ¹: Write Everything Twice

        • wizzwizz4 21 hours ago
          Don't do WET with your entire system, though, or you're likely to end up with the Second System Effect.
          • MathMonkeyMan 17 hours ago
            I think the idea is you never release the first version.
      • Cthulhu_ 23 hours ago
        Or one should spend a little more time at first to fully understand the problem; in both the mentioned case and the OP's, that means using realistic data.
        • TillE 23 hours ago
          Writing code is very very often a big part of understanding any new problem. It's just that you should be quite comfortable discarding that code. Refactor mercilessly and all that.
        • scott_w 21 hours ago
          Writing code that solves the problem is a huge way of proving you understand the problem.
        • yetihehe 23 hours ago
          Sometimes you can't have realistic data without writing some code first, that will allow you to gather that data.
      • 5pl1n73r 18 hours ago
        My reaction to the article title was to recall how old assembly teachers would drill into our heads that we should have proof of 200% gains before implementing an optimization. However, the article shows how sometimes such planning doesn't work.
      • ddalex 23 hours ago
        I came to the same conclusion months ago with the advent of coding LLMs.... I used to treat code as precious, carefully saving, cataloguing and referencing git commits and SQL scripts and command line histories, because they were painful to write, and thus they are valuable. I mean, they were, because when I had the same need again, I could find previous instances and reuse stuff.

        Now? I just dump the problem into an LLM and it spits out a script that solves it, and then I throw away the script because it's horribly bugged for any case other than my particular problem, but hey, I just solved my problem

    • ralegh 1 day ago
      This is fine assuming the popular request types don’t change, but arguably if both new versions of matching are sufficiently fast then I would prefer Ken’s long term as the other could become slow again if the distribution of request types changes.
      • sfilmeyer 1 day ago
        As a counterpoint, what fraction of the future engineers who will touch the project are likely to be able to competently edit the finite automata based version without introducing bugs and what fraction will be able to competently edit the if statement that checks the particular policy?
        • mikepurvis 1 day ago
          A further question mark is whether any of this has sufficient instrumentation to be able to notice and act on a change if and when it occurs.
      • andrepd 1 day ago
        Nonsense. The pre-check can literally be one line (if common_case {fast_path()} else {slow_path()}), and thus enabling or disabling it is dead simple and obvious if the problem changes in the future.

        Lines of thinking like that are part of the reason most modern software is so sloooow :)

        • Rendello 1 day ago
          This situation where two paths produce the same output but one is optimized is the easiest case in property-based testing, as the property is just:

            normal(x) == optimized(x)
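
          A minimal Java sketch of that check, using plain java.util.Random rather than a property-testing library (normalEncode/optimizedEncode are hypothetical stand-ins for the two implementations under test):

            // assumes: import java.util.Arrays; import java.util.Random;
            static void checkEquivalence() {
                Random rng = new Random(42);                 // fixed seed so failures are reproducible
                for (int i = 0; i < 1_000_000; i++) {
                    long x = rng.nextLong();
                    byte[] a = normalEncode(x);              // slow, obviously-correct version
                    byte[] b = optimizedEncode(x);           // fast version under test
                    if (!Arrays.equals(a, b)) {
                        throw new AssertionError("mismatch for input " + x);
                    }
                }
            }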
          • Stratoscope 1 day ago
            I have sometimes done just this. First I write the simplest possible brute force code, something that any competent programmer can look at and say, "Yes, this may be slow, but it is straightforward and obvious that it will handle all cases."

            Then I write the optimized code and use a test like yours to compare the results between the simple code and the optimized code.

            One time I needed to write a function to search for a specific pattern in a binary file and change it. So I wrote the brute force code as a first step, the same code that anyone would probably write as a simple solution. It worked the first time, and a couple of people reviewed the code and said "yep, even if it's slow, it is correct."

            But this code took more than a second to run!

            Of course I thought about optimizing it with Boyer-Moore or the like. Then I went, "Hold on to your horses. This isn't something like a web page load where one second matters. It's part of a build process that only runs a few times a day and already takes several minutes to run. One extra second is nothing!"

            In the wise words of Kenny Rogers in The Gambler:

              You got to know when to hold 'em,
              know when to fold 'em
              Know when to walk away
              and know when to run
            • mark-r 19 hours ago
              Knuth's famous quote about premature optimization captures this perfectly.

              "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

            • scott_w 21 hours ago
              Also true for deciding whether to write code at all! About 15 years ago I was working with a junior who'd spent 3 hours trying to automate some data editing until I said "mate, you can just edit it all by hand in about an hour!"
              • taeric 15 hours ago
                I had a similar one where a billing mistake had happened in a system and the other devs were trying to find out how to determine all of the bad values. I asked if they had simply looked at the top N that could fit on their screen sorted by value. The incorrect values were the obvious outliers. Only about 4 of them. They had been blocked on identifying them for about an hour.
              • VBprogrammer 19 hours ago
                Funny enough, with LLMs this trade-off may well have flipped. For simple tasks like "given a string like X, format it like Y", they work amazingly well.
                • scott_w 16 hours ago
                  Absolutely. Even back then, trying to script it would sometimes be fine. I've had plenty of situations where I'd try to write a script, give it 30 minutes then say "fuck it, I'm hand-fixing it." The same probably applies to LLMs, you just need to set a cut-off point before it stops being worthwhile.
              • lettuceconstant 16 hours ago
                Sure, but I have to obtain my dopamine somehow.
            • throwaway-kg8d5 23 hours ago
              Very true. I would have gone with

                Every coder knows
                the secret to optimizin’
                is knowing what to throw away
                and knowing what to keep
          • andrepd 22 hours ago
            Absolutely! Proptesting is a great but underutilised feature. I use it a lot wherever I'm in a language which makes it easy (and with a mature library for that), like Haskell, Ocaml, or Rust.
            • Rendello 13 hours ago
              The simple cases are great demos, and then the complex cases melt your brain! But still worth it a lot of the time.
        • vlovich123 1 day ago
          Hyperoptimizing for the fast path today and ignoring that hardware and usage patterns change is the reason modern software is so slooow :)

          A more robust strategy would at least be to check if the rule was the same as the previous one (or use a small hash table) so that the system is self-healing.

          Ken’s solution is at least robust, and for that property I would prefer it, since it’s just as fast but doesn’t have any weird tail latencies: requests outside your cached distribution are as fast as the ones in it.

          • Jean-Papoulos 1 day ago
            You were shown an example of exactly why this thinking is incorrect but you still insist...

            Also, it's trivial to keep Ken's implementation as the slow path. If request patterns change, dig up the new fast path and put the old one in Ken's slow path code. Most of the performance will still come from the initial `if`.

            • vlovich123 6 hours ago
              It’s ungenerous to assume I would be against the if statement + Ken’s. But Ken’s approach is critically important and the “if” statement should just be a 1-entry cache instead of being hard-coded. Getting this stuff right in a future-proofed, durable way is actually quite hard even when you notice the opportunity.
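
              A sketch of the 1-entry-cache idea (Policy, PolicyMatcher, and Request are hypothetical types):

                private final PolicyMatcher fullMatcher;  // Ken-style general matcher
                private Policy lastHit;                   // 1-entry cache: last policy that matched

                Policy match(Request req) {
                    Policy cached = lastHit;
                    if (cached != null && cached.matches(req)) {
                        return cached;                    // common case: same policy as last time
                    }
                    Policy p = fullMatcher.match(req);    // fall back to the general matcher
                    lastHit = p;                          // self-healing: adapts if the request mix shifts
                    return p;
                }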
          • necovek 1 day ago
            Nobody is hyperoptimizing the fast path today.

            Ken's solution was stated to have been slower than the alternative optimization.

            • akie 1 day ago
              Ken's solution optimized the general case, basically everything that doesn't match the if-statement.
        • ants_everywhere 1 day ago
          You can even track the request statistics live and disable the fast path if the distribution of requests changes significantly.
      • scott_w 21 hours ago
        I think you missed the point. Ken's version wasn't removed, it was simply prepended with something like:

          if value in most_common_range:
            take_fast_path()
          else:
            use_kens_solution()
    • underdeserver 20 hours ago
      This is a good place to mention profile-guided optimization.

      In PGO you profile your application running in the real world (via automatic instrumentation) and the compiler uses the output to make optimization decisions.

      I'm not sure what exactly they use it for, but I imagine it could be used for loop unrolling, switch case ordering, branch prediction optimization, memory ordering and more.

      • menaerus 20 hours ago
        What if your "application" is a server-side code hosting multitude of different types of workloads, e.g. databases? I was never really convinced by the PGO benefits for such open-ended and complex workloads.
        • ltbarcly3 15 hours ago
          PGO operates at the CPU branch level. The main benefit, if I understand correctly, is to reduce branch mispredictions which lead to pipeline stalls.

          There's absolutely no reason to think that, for the specific workload of a system, a random/arbitrary/heuristic guess is generally the best way to pick the most likely branch for a given branching instruction.

          For example, when scanning data you might check if the current value is equal to a test value, and branch on that to either code that handles the match, or else code that moves to the next entry to test that. This may be extremely likely to match (95%) as in a hash table, or very likely to not match (string search). A compiler can't tell which is true, and your workload literally has no impact on this unless it is pathological.

    • n3t 22 hours ago
      Does anyone have a link to the talk?
    • sylware 23 hours ago
      I write assembly, and often you have many code paths/data structures which "fit". Each combination of code paths/data structures will favor some specific usage/data at the expense of the others. And this is not real "optimization", since usually those code paths have roughly "the same" cost. The bottom line: it is what you think about the semantics and real-life usage of this code that will drive those "choices".

      That said, you can still have generic optimizations which will benefit nearly all (for instance quick sort).

  • jasonthorsness 1 day ago
    Identifying a representative usage scenario to optimize towards and then implementing that scenario in a microbenchmark test driver are both massively difficult to get right, and a "failure" in this regard, as the author found, can be hard to detect before you sink a lot of time into it.

    Even for seemingly trivial scenarios like searching an array, the contents and length of the array make a massive difference in results and how to optimize (as shown in the last section of this write-up where I tried to benchmark search algorithms correctly https://www.jasonthorsness.com/23).

    I've not seen a perfect solution to this that isn't just "thinking carefully about the test setup" (profile-guided optimization/production profiles replayed for benchmarks seem like maybe it could be an alternative, but I haven't seen that used much).

    • bgirard 1 day ago
      > Identifying a representative usage scenario to optimize towards and then implementing that scenario in a microbenchmark test driver are both massively difficult to get right, and a "failure" in this regard, as the author found, can be hard to detect before you sink a lot of time into it.

      I can vouch for this. I've been doing web performance for nearly 15 years and finding clear representative problems is one the hardest parts of my job. Once I understood this and worked on getting better at finding representative issues, it became the single thing that I did to boost my productivity the most.

      • jasonthorsness 1 day ago
        What have you found are the most useful approaches to collecting the representative issues in a way you can reuse and test? I haven’t worked as much in the web space.
        • bgirard 1 day ago
          A combination of: 1) RUM, 2) Field tracing, 3) Local tracing with throttling (CPU, Network), 4) Focusing either on known problems, or if I'm on less certain territory I will do more vetting to make sure I'm chasing something real. I'll minimize the time spent on issues that I can only catch in one trace, but sometimes it's also all you get so I'll time box that work carefully. It's more of an art than a science.
      • viraptor 1 day ago
        Tracing and continuous profiling made this task significantly easier than it was in the past, fortunately.
    • sfink 1 day ago
      It's a subset of profile-guided optimization, but I fairly frequently will gather summary histograms from more or less representative cases and then use those to validate my assumptions. (And once in a while, write a benchmark that produces a similar distribution, but honestly it usually comes down to "I think this case is really rare" and checking whether it is indeed rare.)
    • anonymoushn 1 day ago
      I have asked for like the ability to apply a filter that maps real strings from real users to a very small number of character classes, so that I may run some codec on data that is equivalent to user data for the purposes of the codec. I have heard, this is not a good use of my time, and if I really care so much I should instead make various guesses about the production data distribution of user-created strings (that need to be json-escaped) and then we can deploy each of them and keep the best, if we can even tell the difference.
      • jasonthorsness 1 day ago
        If the real data is sensitive, it's hard to distill test data from it that succeeds in fully removing the sensitivity but that is still useful. Depending on the domain even median string length could be sensitive.
    • kazinator 1 day ago
      If they were forking that JVM in order to offer it as a development tool to a broad developer base, then such an optimization might be worth keeping.

      Someone out there might have an application in which they are serializing large numbers of large integers.

      The case for abandoning it is not as strong, since you don't know who is doing what.

      It's a guessing game in programming language run-times and compilers: "will anyone need this improvement?" You have room to invent "yes" vibes. :)

      • grogers 1 day ago
        If the distribution of numbers you are serializing is uniformly random, you generally wouldn't choose a variable length encoding. Sometimes the choice is made for you of course, but not usually.
      • necovek 1 day ago
        The problem is that this is a complex assembly implementation that needs to be maintained over the course of decades in the future. Unless you have a good reason to keep it, you should drop it.
    • spott 1 day ago
      Why is just capturing real values not the answer here?

      Something like grab .01% of the inputs to that function for a day or something like that (maybe a lower fraction over a week or month).

      Is the cost of grabbing this that high?

      • adrianN 1 day ago
        That works for some cases, but if you have a service that processes customer data you probably don’t want to make copies without consulting with legal.
  • heipei 1 day ago
    Counterpoint: I once wrote a paper on accelerating block ciphers (AES et al.) using CUDA, and while doing so I realised that most (if not all) previous academic work which had claimed incredible speedups had done so by benchmarking exclusively on zero bytes. Since these block ciphers are implemented using lookup tables, this meant perfect cache hits on every block to be encrypted. Benchmarking on random data painted a very different, and in my opinion more realistic, picture.
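
    The effect is easy to demonstrate even on a CPU; a rough Java sketch of the zeros-vs-random comparison (not a rigorous benchmark, and the gap won't show where the JVM uses AES-NI hardware instructions instead of lookup tables):

      import javax.crypto.Cipher;
      import javax.crypto.spec.SecretKeySpec;
      import java.util.concurrent.ThreadLocalRandom;

      public class ZeroVsRandom {
          static long timeEncrypt(byte[] plaintext) throws Exception {
              Cipher aes = Cipher.getInstance("AES/ECB/NoPadding");
              aes.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(new byte[16], "AES"));
              long t0 = System.nanoTime();
              aes.doFinal(plaintext);          // table-based AES touches the same cache lines for every all-zero block
              return System.nanoTime() - t0;
          }

          public static void main(String[] args) throws Exception {
              byte[] zeros = new byte[1 << 24];                 // 16 MiB of zero blocks
              byte[] random = new byte[1 << 24];
              ThreadLocalRandom.current().nextBytes(random);
              timeEncrypt(zeros);                               // warm-up
              System.out.println("zeros:  " + timeEncrypt(zeros) + " ns");
              System.out.println("random: " + timeEncrypt(random) + " ns");
          }
      }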
    • atiedebee 1 day ago
      Why not use real world data instead? Grab a large pdf or archive and use that as the benchmark.
      • bluGill 1 day ago
        A common case is to compress data before encrypting it, so random is realistic. A PDF might allow some optimization (how, I don't know) that is not representative.
      • jbaber 1 day ago
        Or at least the canonical test vectors. All zeros is a choice.
    • rdc12 1 day ago
      Do you have a link to that paper?
    • almostgotcaught 1 day ago
      There are an enormous number of papers like this. I wrote a paper accelerating a small CNN classifier on FPGA, and when I compared against the previous SOTA GPU implementation, the numbers were way off from the paper. I did a git blame on their repo and found that after the paper was published they deleted the lines that short-circuited eval if the sample was all 0 (which much of their synthetic data was ¯\_(ツ)_/¯).
    • Marazan 1 day ago
      Back when Genetic Algorithms were a hot topic I discovered that a large number of papers discussing optimal parameterisation of the approach (mutation rate, cross-over, population size, etc.) were written using '1-Max' as the "problem" to be solved by the GA. (1-Max being attempting to make every bit of the candidate string a 1 rather than 0)

      This literally made the GA encoding exactly the same as the solution and also very, very obviously favoured techniques that would MAKE ALL THE BITS 1!

      • imtringued 22 hours ago
        This reminds me of this paper:

        https://link.springer.com/article/10.1007/s10710-021-09425-5

        They do the convergence analysis on the linear system Ax = 0, which means any iteration matrix (including a zero matrix) that produces shrinking numbers will converge to the obvious solution x = 0 and the genetic program just happens to produce iteration matrices with lots of zeros.

  • scottlamb 1 day ago
    > For example, there was minimal use of java.lang.String because ain’t nobody got time for a second allocation of a char[] with associated indirection and GC churn.

    An example of why I'd prefer to avoid Java for something like that. Java has a reputation for being slow because of JIT compilation and GC, but I also think a lot of it comes down to all non-primitive types being boxed Objects (boxed = in a separate allocation). This means it works that garbage collector so much harder, there's poorer locality, it fills RAM/cache with extra Object stuff like mutexes that no one ever uses. (Apparently now 8 bytes after a lot of effort to reduce it. https://openjdk.org/jeps/450)

    • deepsun 1 day ago
      > it fills RAM/cache with extra Object stuff

      But it has much less impact than in, say, Rust, where it's really an allocation (asking the kernel for more RAM). Java's Object "allocation" happens in its own heap, which is a big chunk of RAM already pre-allocated from the kernel. So in the JVM boxing is really cheap, often just a pointer increment. Also oftentimes the wrapper object and its value are located near each other in the same memory page, so we're not adding a RAM access, more like an L2 access.

      PS: In some cases it can be even faster than Rust or C++, because it pre-allocates larger pages and drops them in chunks within generational GC (e.g. all values "allocated" to process one HTTP request can be GCed immediately after), while C++ is eager to destruct each object immediately. Also, a GC sweep can happen in a separate thread, not bothering the main thread the user is waiting for. One can do the same in Rust and C++ using arenas, of course.

      • scottlamb 1 day ago
        > But it has much lesser impact than in, say, Rust, where it's really an allocation (asking kernel for more RAM). Java's Object "allocation" happens in its own heap, which is a big chunk of RAM already pre-allocated from kernel.

        What? No. Rust doesn't ask the kernel for each allocation individually. That'd be insane; besides the crushing system call overhead, the kernel only does allocations in whole pages (4 KiB on x86-64) so it'd be incredibly wasteful of RAM.

        Rust does the same thing as virtually every non-GCed language. It uses a memory allocator [1] that does bookkeeping in userspace and asks the kernel for big chunks via sbrk and/or mmap/munmap. Probably not the whole heap as a single chunk of virtual memory as in Java, but much closer to that than to the other extreme of a separate kernel call for each allocation.

        [1] by default, just libc's malloc and free, although you can override this, and many people choose jemalloc or mimalloc instead. My high-level description applies equally well to any of the three.

        • gf000 17 hours ago
          While Java just does a thread local pointer bump, which will still be more efficient, and closer to stack allocation.

          Of course you can optimize better with Rust/C++/etc., but it is not trivial and you definitely don't get it out of the box for free. My point is, how much overhead Java has is a bit overblown.

        • deepsun 13 hours ago
          Yes, my mistake, thanks for pointing out, upvoting. I meant asking memory allocator, not kernel.

          I meant that Java usually already has that memory allocated, JVM is a memory allocator of its own. It operates within its own heap. One can do that within Rust of course (or easier in Zig, as it welcomes passing an allocator around), it's just built-in in JVM already. Drawback is that it's more complex to release that memory back from the JVM process, so Java apps (not AOT-compiled) usually consume more RAM.

        • labadal 17 hours ago
          I'm glad that I'm over that phase I had in university where I wanted to write a custom memory allocator for everything because "I understand my usage better". I will admit that it was a good bit of fun though.
      • IshKebab 1 day ago
        Aside from Rust not working like that (as scottlamb said), Rust is faster than Java even if Java has a faster allocator because Rust code usually does much less allocation in the first place.
        • jeroenhd 16 hours ago
          I don't know if Rust code allocates more or less in general. It really depends on what kind of code you write. Once Rust code reaches the complexity of the Java stacks it's replacing, you get a lot of wrapper objects, locks, and intermediates to cross thread boundaries and to prove soundness to the borrow checker.

          I recently encountered an example of someone writing a Rust version of a popular Java library by just taking the Java code, commenting it out, and writing the Rust equivalent almost line for line. The approach works great (no need to reinvent the wheel and you can point to the existing documentation and code samples) but in terms of allocations, you're not going to find many improvements.

          There's a type of Java code that looks more like C code than anything else and runs blazing fast with minimal overhead. It's not the type of Java code you'll probably encounter when writing Java applications, but if you use Java as a kind of cross-platform C target, you can get pretty close to Rust (and for some use cases even beat it). Java has a LOT of tricks up its sleeve (pointer compression, dynamic realignment) that Rust can't automatically take advantage of.

          Your AbstractFunctorClassFactoryProducer isn't going to be very allocation efficient, but once you start seeing volatile ints all over the place, things quickly become a lot faster.

    • mcosta 22 hours ago
      An example of why I'd prefer to avoid Java for somethi [Core dumped]
    • pjc50 1 day ago
      I like how dotnet has acknowledged this and provided a whole lot of machinery recently for things like value types on the stack and Span<T> to reduce array copying.
      • pjmlp 19 hours ago
        I would have liked both of them to have acknowledged this to the extent of languages like Oberon (all variants predate them), Eiffel, Modula-3, CLU, Cedar, ...

        While those were used as inspiration in their design, the AOT approach, with automatic memory management and programming language features for low-level coding, really took a while to come back from them; even NGEN/unsafe from C# v1.0, while welcome, wasn't that great.

        I always think that in an alternate reality, had they had those features already in version 1.0, C and C++ would have lost even more mindshare at the turn of the century, and something like C++11 would not have been as impactful.

    • snickerdoodle12 1 day ago
      > but I also think a lot of it comes down to all non-primitive types being boxed Objects

      They're working on it: https://openjdk.org/projects/valhalla/

      • eulgro 20 hours ago
        Project Valhalla started over 10 years ago. It keeps being mentioned, but with seemingly no end in sight. Is there an ETA?
    • bsder 1 day ago
      It's not like C++ users don't say the exact same thing about String, though.

      The problem is that String really isn't a language primitive.

      Different people and programs always have a different notion of what a String should do (Should it be mutable? Should it always be valid UTF-8? Which operations should be O(1), O(n) or O(n log n)? etc.)

      • scottlamb 1 day ago
        > It's not like C++ users don't say the exact same thing about String, though.

        If they do, they're wrong, as the two languages are quite different here. In Java, String requires two allocations: your variable is implicitly a pointer to String allocation, which in turn has a pointer to a char[] allocation. In C++, the std::string itself is a value type. The actual bytes might be inline (short string optimization) or behind a single allocation accessible from the std::string.

        Rust's std::string::String is somewhere between: it's a value type but does not have a short string optimization (unless you count the empty string returned by String::new).

        > Different people and programs always have a different notion of what a String should do (Should it be mutable? Should it always be valid UTF-8? Which operations should be O(1), O(n) or O(n log n)? etc.)

        Sure, there can be call for writing your own String type. But what's unique about Java as compared to say C, C++, Go, Rust, even to some extent C# is that you can't have a class or struct that bundles up the parts of your data structure (in the case of a mutable string, two fields: data pointer/capacity + the used length) without boxing. There's a heavy cost to any non-primitive data type.

        • josephg 1 day ago
          > Sure, there can be call for writing your own String type. But what's unique about Java as compared to say C, C++, Go, Rust, even to some extent C# is that …

          You also can’t make a first-class string type in most of those languages because “hi” is defined to be of a system-specified class. You can make a different type to store strings but it’ll never be as ergonomic to use.

          It’s even worse in JavaScript, where the standard library is written in a different language (usually C++). It’s impossible to write JavaScript code that matches the performance of the built in string primitive. At least outside of specific niche use cases.

          Rust has lots of alternate string like libraries in cargo that are optimized better for different use cases - like smallstring or ropey. They’re fast, convenient and ergonomic. I imagine C++ is the same.

        • layer8 1 day ago
          > In Java, String requires two allocations

          That’s true, but thanks to GC an allocation also is just a pointer bump in Java, and the two allocations are likely to be close to each other. For short-lived strings, the GC cost is furthermore zero, because only the longer-lived objects need to be tracked and copied with generational GC. So, “heavy cost” is relative.

        • grogers 1 day ago
          Also, you can't construct a Java String without copying the data into it, because String is immutable. In other languages like C++ or Rust the string type is mutable, so you don't need an extra copy. Java doesn't even special-case blessed APIs like StringBuilder to avoid the extra copy: since StringBuilder itself doesn't have a method to consume the inner buffer, you can only create a string from it by copying, leaving the buffer untouched, even though creating multiple strings from a given StringBuilder isn't the normal usage.
          • throeaway9 22 hours ago
            For cases where a string repeats itself, you can use String.intern() to reuse data.
            • layer8 17 hours ago
              String.intern() doesn't reuse the data per se. It merely gives you a canonical instance with the same value as the String instance you invoke it on (unless it's the first time it is invoked for that value, in which case it returns the same instance you already have in hand). At that point the latter duplicate instance has already been constructed. The only benefit in terms of memory is that the duplicate instance might be garbage-collected earlier if the program stops referencing it and uses the interned instance instead.

              Also, String.intern() can be quite slow compared to something like ConcurrentHashMap.putIfAbsent().
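
              A minimal sketch of the map-based alternative (a grow-only pool, so real code would also want some eviction strategy):

                // assumes: import java.util.concurrent.ConcurrentHashMap;
                private static final ConcurrentHashMap<String, String> POOL = new ConcurrentHashMap<>();

                static String canonicalize(String s) {
                    String prev = POOL.putIfAbsent(s, s);  // first caller's instance becomes the canonical one
                    return prev != null ? prev : s;        // later callers get that canonical instance back
                }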

        • pjmlp 18 hours ago
          Except many aren't aware that as long as std::string fulfills the ISO C++ complexity requirements for its implementation, anything goes.

          This is less problematic nowadays, because for many folks there are only three compilers that still matter, or even a single one.

  • charleslmunger 1 day ago
    I tried implementing an optimized varint encoder that did something similar, by populating an 8-byte value and then storing it to RAM, but unaligned overlapping stores caused big regressions. The approach that worked required a different technique. This one is for encoding backwards:

    1. One branch for the one-byte case, always inlined, just store the byte

    2. Calculate the size of the unencoded zero prefix with a branch-free construction: (((uint32_t)clz + 7) * 9) >> 6

    3. Hand-roll a jump table, taking advantage of ARM's fixed instruction width to calculate the jump target with a shift.

    https://github.com/protocolbuffers/protobuf/commit/b039dfe26...

    This results in one branch for 1 byte varints, plus one additional branch for any larger size, and the branch predictor does not have to track a varying trip count through a loop. This approach resulted in a 2.64% speedup for overall encoding (which includes a lot more than just varints) on mid sized arm cores.

    I think it's very difficult to beat a single comparison and branch for the 1 byte case for actual encoding forwards or backwards, unless you know there's going to be long runs of one-byte values.
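
    For comparison, the analogous branch-free length computation for plain LEB128 (a Java sketch, not the protobuf code):

      // Bytes needed to LEB128-encode v: ceil(bitLength(v) / 7), computed without a branch.
      static int varintSize(long v) {
          int bits = 64 - Long.numberOfLeadingZeros(v | 1);  // bit length, at least 1
          return (bits * 9 + 64) >>> 6;                      // == ceil(bits / 7) for bits in 1..64
      }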

  • chaosmachine 1 day ago
    This reminds me of Benford's law[1]

    "Benford's law, also known as the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small."

    [1] https://en.m.wikipedia.org/wiki/Benford%27s_law

  • OhMeadhbh 1 day ago
    <snark>Time spent optimizing assembly language is never wasted.</snark>

    That's only half-snark. It was "wasted" in the sense that it wasn't as useful as the OP thought it would be. But diving deep into a technical problem? That left an indelible pattern on the OP's psyche. If they ever have to do something like that again, it will certainly be easier. And since AI is going to be doing all the simple coding tasks, humans will have to dive into the deep, hard tasks.

    • bluGill 18 hours ago
      Also there is no way for the author to know that his optimizations wouldn't work without trying them. We are not even sure that his analysis of why they didn't work is correct (though it sounds reasonable). There is also the possibility that the author could use the newfound information to fix his optimizations so they are better.
    • xandrius 1 day ago
      It could also burn someone out and take out the real reason technology exists: to help someone.
      • labadal 17 hours ago
        Kind of like math, huh? Some people burn out on "meaningless" impossible problems. Others get obsessed with them and make significant contributions along the way to failure.
  • corysama 1 day ago
    Reminds me of https://ridiculousfish.com/blog/posts/old-age-and-treachery....

    Data condition can definitely affect runtime performance. All the way down to the micro architecture level. Random, sorted, reverse-sorted, uniform data sets often lead to dramatic differences in perf. Then you get into “Ooops, my dataset had a surprisingly large number of NaNs or denormalized floats!”

  • zem 1 day ago
    > We were in the early phases of forking the JVM

    definitely made me go :-O - how big a company do you need to be for that to be a realistic thing to maintain in house?

    • cbzoiav 1 day ago
      A fork can be many things.

      I've maintained in house forks of major code bases with a handful of minor code changes while we've tried to convince upstream to accept them.

    • dcrazy 1 day ago
      You don’t need to be big, you just need to handle lots of money.
      • zem 1 day ago
        fair point, I guess this was a high frequency trading firm or similar, who could justify paying a few engineers to maintain a jvm fork
    • scheme271 1 day ago
      Off-hand I think that there were at least 7 forks/implementations out there: Oracle/Sun, Oracle's Graal, IBM, OpenJDK, Amazon Corretto, Twitter's fork, and Azul. There's also a GNU effort but I don't think that really went very far. I think there's probably a few others that aren't publicly well known, for trading firms, etc.
  • titzer 1 day ago
    I'm surprised they didn't just add a fast path for 1- and 2-byte LEBs to the hard-won assembly they generated. These fast paths are absolutely worth it. E.g. Wizard's Wasm interpreter uses a 3-instruction fast path to decode 1-byte LEBs and then jumps out-of-line to handle the general case. It is worth 5-10% for the interpreter performance.
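
    The shape of that fast path, as a Java sketch rather than Wizard's actual code (readLebSlow is a hypothetical out-of-line helper):

      static long readLeb(byte[] buf, int pos) {
          byte b = buf[pos];
          if (b >= 0) {                      // high bit clear: 1-byte LEB, decoded inline
              return b;
          }
          return readLebSlow(buf, pos);      // out-of-line general case for multi-byte LEBs
      }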
    • menaerus 1 day ago
      The issue was that they were hand-optimizing the SIMD for a workload that doesn't even exist in their case, so they ended up seeing one big nothing once they profiled the code with the new optimization - a classic trap with 95% of "optimizations". It turned out that the distribution of their data is in those 1- and 2-byte LEBs, and in that case their SIMD approach doesn't give the gains it does with 9- and 10-byte LEBs.

      I am wondering more about the bit-by-bit compatibility part from "This was verified in a benchmark that ran both versions on billions of random numbers, confirming both the huge speedup and the bit-for-bit compatibility of the result.".

      How does one test for 18.4 quintillion numbers? This isn't possible.

    • phire 1 day ago
      The existing implementation was already "highly optimised", so it would have had a fast path for short ints (they probably profiled on realistic data sets).

      Adding the fast path to the hard-won assembly is just optimising the slow path, and there is little point to optimising it. Complex optimisations are also a form of technical debt, so they are probably better off not integrating it.

  • Const-me 16 hours ago
    LEB128 is slow and there’s no way around that. I don’t know why so many data formats are using that codec.

    The good variable integer encoding is the one used in the MKV format: https://www.rfc-editor.org/rfc/rfc8794.html#name-variable-si... The compression win, i.e. the encoded size, is equal to LEB128's; however, both encoding and decoding are way more efficient.

    The encoder is a few fast instructions for all numbers no matter large or small: bit scan reverse (or leading zero count) to find the count of bits required to store the input number, division by 7 rounding up to compute the count of bytes (modern compilers reliably optimize this into integer multiplication and bit shifts), bitwise shift + OR to set the VINT_MARKER bit, then byte swap to store in big-endian format.

    The decoder is equally efficient: compare first byte with 0 to detect numbers longer than 8 bytes (very rarely happens), bit scan reverse of either first or second byte to find encoded length, byte swap to load the big-endian format, then reset the most significant bit in the integer to get rid of the VINT_MARKER.
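
    A compact Java sketch of both directions (ignoring the reserved all-ones "unknown length" value and the more-than-8-byte case):

      // Encode v as an EBML/MKV-style VINT; assumes v fits in at most 8 bytes (v < 2^56).
      static byte[] writeVint(long v) {
          int bits = 64 - Long.numberOfLeadingZeros(v | 1);  // bit length, at least 1
          int n = (bits * 9 + 64) >>> 6;                     // ceil(bits / 7) = byte count
          long withMarker = v | (1L << (7 * n));             // set the VINT_MARKER bit
          byte[] out = new byte[n];
          for (int i = 0; i < n; i++) {
              out[i] = (byte) (withMarker >>> (8 * (n - 1 - i)));  // big-endian byte order
          }
          return out;
      }

      // Decode: the length comes from the leading-zero count of the first byte.
      static long readVint(byte[] buf, int pos) {
          int first = buf[pos] & 0xFF;                       // first == 0 would mean > 8 bytes, not handled here
          int n = Integer.numberOfLeadingZeros(first) - 23;  // 1..8
          long v = 0;
          for (int i = 0; i < n; i++) {
              v = (v << 8) | (buf[pos + i] & 0xFF);
          }
          return v & ~(1L << (7 * n));                       // clear the VINT_MARKER bit
      }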

  • duxup 1 day ago
    I feel like this is the problem with a lot of what I do, although it's not as cool as the author's work, and it's more due to human issues.

    I'm always looking for / wondering about valid data representing the problem so that I can measure success.

    A little of last and much of this week I've spent addressing a performance problem. At first I was told it was always happening in some specific situation, it very much is not. Then I was told it frequently happens ... nope, maybe 2 to 4 times a day at most.

    I think I've managed to track the issue down. I think so, pretty sure? On the other hand this client has a couple of performance-related issues ongoing, and listening to them talk on a conference call... I'm not really sure they differentiate between them. It's very possible my neighbor a few seats away from me is actually addressing "the problem" and I'm chasing ghosts. But the client is too busy to provide good data. Meanwhile this issue is "too important" for us to do nothing and wait; granted ... we might be doing nothing anyway.

  • grapesodaaaaa 1 day ago
    I can’t help but wonder what the use-case is for optimizing Java to these levels.

    I would expect C or asm to be a better choice at ultra high levels of optimization. I know Java can get somewhat similar performance if you put the work in, but I’d assume you want to do manual memory management at minimum.

    • pjmlp 18 hours ago
      Using Java gets many companies almost to the goal line, while paying less, and having a bigger pool to hire from.

      "Why we chose Java for our High-Frequency Trading application"

      https://medium.com/@jadsarmo/why-we-chose-java-for-our-high-...

      • grapesodaaaaa 18 hours ago
        I’m not saying it can’t be done, and I understand the economics can work.

        I still just feel like a non-GC language (maybe even Rust) is a better choice when nanoseconds matter.

        • pjmlp 18 hours ago
          The only non-GC language that offered Java-like capabilities in productivity, without being either C or C++, was Delphi, which has been stuck being mostly for Windows shops, and is tainted by the way Borland went through reinventing itself.

          Ada is not something that these companies would bother using, even though there are about 7 vendors still in business.

          Likewise with either FreePascal or Modula-2.

          Most likely a few of these exchanges are writing new code in Rust for such deployment scenarios.

    • mcosta 21 hours ago
      C or asm IS a better choice at ultra high levels of optimization and crashing the process.
      • grapesodaaaaa 18 hours ago
        Yes, C or asm get a bad reputation. They are definitely harder.

        I’m just saying that this company seems to be throwing a lot of money at the problem, and at those levels you can squeeze more performance out of them and get them reasonably right.

        After all, many, many pieces of reliable software are written in C. If you wanted a “safer” modern language, you could pick Rust.

        I would just assume those would be easier picks rather than optimizing around GC pauses, which can be notoriously difficult when nanoseconds matter.

        • Joel_Mckay 13 hours ago
          High-level languages tend to hammer the allocator like a piñata. C++ can improve the code smell of C, but different languages have their own specific use cases. Why LLVM- and GC-based languages are a problem in low-level real-time code is bewildering for many.

          Java is unique in that it often ran a dual stack for byte code, and platform specific binary objects.

          My advice often is to use high-level languages if you can get away with the convenience, and a low-level language with proper design patterns if you must.

          Many programmers I've met shouldn't write C/C++, but mostly because they can't be trusted to use memory or thread safe code.

          Cheers, =3

          • grapesodaaaaa 9 hours ago
            Agreed on a lot of that.

            C doesn’t inherently have to be bad (and can often be better than C++) if you are strict about adhering to standards.

            I like to point to OpenBSD as a successful and well-done C project. As a bonus, if you develop on OpenBSD, your program is more likely to have correct memory handling operations (OpenBSD has a lot of very aggressive memory protections. Many issues in ported applications like Firefox were found and fixed by OpenBSD users, but would happily keep running on Linux).

            • Joel_Mckay 8 hours ago
              Indeed, many wished the OpenBSD Debian project had succeeded. =3
  • gorset 1 day ago
    In a previous project we used fixedint32/64 instead of varint values in the schema for messages where performance was critical.

    That left only varint used for tags. For encoding we already know the size at compile time. We also don’t have to decode them since we can match on the raw bytes (again, known at compile time).

    A final optimization is to make sure the message has a fixed size and then just mutate the bytes of a template directly before shipping it off. Hard to beat.
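
    For example, something along these lines (a sketch with made-up field numbers; protobuf fixed64 fields are a 1-byte tag plus 8 little-endian bytes, so every offset is known up front):

      // Template for a message with two fixed64 fields (field 1 = tag 0x09, field 2 = tag 0x11):
      // 1 tag byte + 8 value bytes each, 18 bytes total, layout fixed at compile time.
      static final byte[] TEMPLATE = new byte[18];
      static { TEMPLATE[0] = 0x09; TEMPLATE[9] = 0x11; }

      static byte[] encode(long price, long qty) {
          byte[] msg = TEMPLATE.clone();    // or reuse one buffer per connection and mutate in place
          putFixed64(msg, 1, price);        // bytes 1..8
          putFixed64(msg, 10, qty);         // bytes 10..17
          return msg;
      }

      static void putFixed64(byte[] b, int off, long v) {
          for (int i = 0; i < 8; i++) b[off + i] = (byte) (v >>> (8 * i));  // little-endian
      }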

    In a newer project we just use SBE, but it lacks some of the ergonomics of protobuf.

  • anonymoushn 1 day ago
    Yes, it is very important to test performance work on data distributions that are representative of the production data distribution. I hope the author has another go at using SIMD that actually makes it into use :)
  • kccqzy 1 day ago
    Good story! I totally didn't know about the trick of changing what the JIT produces to bypass JNI overhead.

    > Never have I ever seen such a highly optimized Java codebase. Not before, not since.

    This makes me wonder where it was. I know in the fintech world there's at least one hedge fund (not HFT, not market maker) where all their latency-sensitive trading code is written in Java, and when I asked them about GC pauses they said that their hot path produces no garbage so GC doesn't kick in.

    • pengaru 1 day ago
      I too knew someone in HFT ages ago that spoke of using java even in the hot path. He told me they disabled GC altogether in combination with not generating garbage in the hot path, with large enough machines to not OOM before markets closed for the day.

      From what I saw poking at the development trading boxes they over-provisioned both cpu ghz and ram to a ridiculous level. The machines looked nearly idle at their busiest.

      • sshine 1 day ago
        I worked with low-latency trading with C# in the hot path; they had started with C++ and eventually migrated to C# and just kept tight control of allocations.

        Later they’d go back to using C++ for training deep neural nets.

        • pjmlp 18 hours ago
          Avoiding going back to C++ is exactly the reason why modern .NET has native support for tensors and various SIMD flavours.
      • perihelions 1 day ago
        > "large enough machines to not OOM before markets closed for the day"

        Reminded of that missile defense system that was designed to be software-reset once a day, except the design assumptions changed (a war started) and it was ordered to be left running nonstop, as an emergency measure; after being left on for a week, failed, causing a large number of deaths. That one had some kind of integrator that accumulated floating-point roundoff over time.

        • peterfirefly 1 day ago
          • scottlamb 1 day ago
            I interpret that write-up to mean the daily reboot was a temporary user-suggested workaround for a bug discovered in the field, rather than something in the product specs from the beginning. And it makes more sense to me that no one realized the errors would accumulate or thought to test long operation in general than it would for them to have explicitly said it wasn't important.
        • 0xffany 1 day ago
          Also on the topic of missiles, it reminded me of Raymond Chen's classic Null Garbage Collector

          https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...

        • mohaba 1 day ago
          Patriot missile range gate. I remember covering that in a numerical analysis class.
        • scottlamb 1 day ago
          > Reminded of that missile defense system that was designed to be software-reset once a day, except the design assumptions changed (a war started) and it was ordered to be left running nonstop, as an emergency measure

          I'm sure you've simplified the story, but it seems like a bit of process failure for a missile defense system to assume peacetime. There's writing software that implements the requirements, and then there's making sure the requirements are right both up front and when you really rely on them in a critical way.

      • anonymousDan 1 day ago
        Yeah it always felt like they were jumping through hoops to basically write C code through Java. I expect some of this might shift to Rust..
        • kccqzy 1 day ago
          Well you do get guaranteed memory safety.
          • saati 1 day ago
            If they were mucking around with asm and custom jit those memory safety guarantees are gone.
      • potato3732842 1 day ago
        It's a vicious cycle between provisioning more servers and devs figuring out how to get more speed by using the resources for pre-computation.
      • renewiltord 1 day ago
        Disabling GC is wildly inefficient without making sure you don't have garbage, because allocation will slow down the application to human scale. The magic is more in the no-allocation code than in the disable-GC side. It is very annoying in Java because any non-primitive is a boxed type. Object pools, etc. Problem is it's easy to start in Java, but then you hit a wall. JNI has huge overhead. Exasperating. But hard to argue: sometimes if you don't start in Java you never get the problem, because you never manage to bootstrap.
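
        The object-pool pattern referred to is roughly this (a sketch; Order and its reset() are hypothetical):

          // assumes: import java.util.ArrayDeque;
          final ArrayDeque<Order> pool = new ArrayDeque<>();

          Order acquire() {
              Order o = pool.poll();                 // reuse a pooled instance if one is free
              return (o != null) ? o : new Order();  // allocate only when the pool is empty
          }

          void release(Order o) {
              o.reset();                             // clear fields so no stale state leaks out
              pool.push(o);                          // return it to the pool instead of making garbage
          }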
    • scheme271 1 day ago
      Azul has a JVM that doesn't have GC pauses and has fairly good memory allocation / collection throughput.
    • renewiltord 1 day ago
      There's quite a few. Teza definitely does, but I've spoken to many people who have done this.
  • necovek 1 day ago
    I am guessing this came out of misunderstanding "profile before optimizing": you don't profile a benchmark, but a live, production system instead.

    Obviously, it becomes hard in such an aggressively optimized system, but if that's the case, trying and failing is an acceptable approach (I've seen 2 weeks wasted in many worse ways, to be honest).

    • brabel 1 day ago
      Optimizing for the JVM is a nightmare (or trivial depending on your point-of-view).

      Another anecdote: I profiled our Java-based server and concluded that most of the time was being spent on creating temporary DI containers for request handling. I spent several days trying to avoid copying the "app" scope components to the "request" scope container (we forked and heavily modified our DI container implementation) so that creating request containers was cheaper. I wrote a JMH benchmark that tried to emulate the real thing based on the real data (with something like 40% of components in the "app" scope, 60% in the "request" scope). My optimisation made the code noticeably faster, I can't remember by how much exactly, but probably something like 25% faster.

      When I ran the "real" performance test that simulates real-world usage of the server (not just the containers), the performance was slightly worse!! Ended up throwing the code away, as the OP did. I had already spent too much time on it so I couldn't dig much further into why that was. But from other experiments, I can say that the JVM, thanks to JIT compilation of bytecode, will heavily optimize the hot path anyway based on the actual usage patterns, so much so that the Java code you wrote may bear very little resemblance to the machine code that is actually executing in the end. That's why I said it may also be "trivial" to optimize JVM code: just let the JIT do its job.

      Of course you may still obtain big gains when you optimize code that was doing stupid things, but because of JIT "magic", even your stupid code might run really fast given enough JIT passes.

      • necovek 14 hours ago
        Definitely, sometimes, testing in production is the only way to know something works or doesn't (or the only non-cost-prohibitive way).
  • kazinator 1 day ago
    > I ended up rolling back the change, and wrote off the time spent as a proof-of-concept for developing and productionizing a custom JIT optimization.

    In Canada, your org can get government money for this under the SR&ED program:

    https://www.canada.ca/en/revenue-agency/services/scientific-...

    • m00x 1 day ago
      You can get SR&ED for anything in Canada. Even CRUD apps get SR&ED money.
      • Joel_Mckay 1 day ago
        Except the people that review your project don't necessarily keep it secret. The tax credit also won't cover Patents and Trademarks etc. which means as a startup you are vulnerable to getting scooped. YMMV =3
  • lqr 1 day ago
    In this example, random numbers provoked the worst case too often. However, in other situations random numbers are "too nice". For example, a random matrix with independent identically-distributed zero-mean entries is overwhelmingly likely to be well-conditioned, have no repeated eigenvalues, etc. Testing numerical algorithms on random data alone is a bad idea.
  • Havoc 21 hours ago
    Learned that lesson with ZFS.

    Better to do janky testing with a cobbled-together script on your own data & usage scenarios than to use standard tools that measure standard metrics.

  • necovek 1 day ago
    How is this encoding different from UTF-8 other than the upper boundary? There are certainly good enough encoding implementations for UTF-8, but perhaps the boundary matters.
    • advisedwang 1 day ago
      In UTF-8 the first byte isn't just 1xxxxxxx to mark continuation; it's either 110xxxxx, 1110xxxx, or 11110xxx depending on how many bytes that character will take up (and continuation bytes are 10xxxxxx).
      • necovek 14 hours ago
        Good point, I didn't pay close attention.

        In a sense, it's a shame this encoding wasn't structured like UTF-8, or, the other way around, a shame UTF-8 wasn't structured in this more generic way.

        • advisedwang 3 hours ago
          ULEB128 has several deficiencies for strings:

          * it's not "self-synchronizing". If you jump into the middle of a ULEB128 stream and see a byte starting with 0 you can't tell whether it is a single-byte code point or the last byte of a multi-byte code-point. You have to back up to determine the meaning of the byte, which is often not viable. In fact you can't even be sure of the first 3 bytes!

          * Relatedly, some code points are subsets of other code points (0AAAAAAA appears inside of 1BBBBBBB 0AAAAAAA), so you can't search for a code point in a string without checking backwards to make sure you are comparing against a whole code point.

          * You can't tell the difference between a string cut at the end of a 2 byte code point and one cut part way through a 3 byte code point.

  • mark-r 19 hours ago
    The problem is that random numbers produced by a computer don't come close to producing naturally occurring random distributions. Your data should at the very least conform to Benford's Law (https://en.wikipedia.org/wiki/Benford%27s_law).
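
    One cheap way to get test values whose leading digits follow Benford's law is to sample the exponent uniformly over whole decades (a Java sketch):

      // assumes: import java.util.concurrent.ThreadLocalRandom;
      // Log-uniform over [1, 10^9): leading digits come out Benford-distributed.
      static long benfordSample() {
          double exponent = ThreadLocalRandom.current().nextDouble() * 9.0;  // uniform in [0, 9)
          return (long) Math.pow(10.0, exponent);
      }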
  • eulgro 20 hours ago
    > Never have I ever seen such a highly optimized Java codebase. Not before, not since.

    That must be awful to work with. Why couldn't they just pick a more adequate language if they wanted performance? Or write bindings for performance critical parts?

    • gf000 17 hours ago
      Binding can be very expensive.
  • rob_c 23 hours ago
    Pebkac?