Making Deep Learning Go Brrrr from First Principles

(horace.io)

41 points | by tosh 2 hours ago

3 comments

  • tosh 1 hour ago
    > in the time that Python can perform a single FLOP, an A100 could have chewed through 9.75 million FLOPS

    wild

    • tosh 7 minutes ago
      re comments:

      yes of course this is apples to oranges but that's kind of the point

      it shows the vast gap between the throughput of specialized hardware, IFF you can use an A100 at its limit, and the overhead of one of the most popular programming languages in use today, which eventually does the "same thing" on a CPU

      the interesting thing is why that is so

      CPU vs GPU (latency vs throughput), boxing vs dense representation, interpreter overhead, scalar execution, layers upon layers, …
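
For anyone wondering where the 9.75 million comes from, here is a rough sketch of the arithmetic. The two input figures are assumptions that are not stated in this thread: roughly 32 million Python additions per second and roughly 312 TFLOPS of peak A100 throughput. The helper `measure_python_adds_per_sec` is a hypothetical name, and it times loop overhead along with the additions, so treat its result as an order-of-magnitude estimate only.

```python
import timeit

# Assumed figures, not taken from this thread:
PYTHON_ADDS_PER_SEC = 32e6   # assumption: ~32 million Python additions per second
A100_PEAK_FLOPS = 312e12     # assumption: ~312 TFLOPS peak (tensor-core spec)


def flops_ratio(fast_flops_per_sec, slow_flops_per_sec):
    """How many FLOPs the fast device does in the time the slow one does one."""
    return fast_flops_per_sec / slow_flops_per_sec


def measure_python_adds_per_sec(n=1_000_000):
    """Rough local estimate: time n float additions in a plain Python loop.

    The timing includes loop overhead, so this is only a ballpark figure.
    """
    def loop():
        x = 0.0
        for _ in range(n):
            x += 1.0
        return x

    seconds = timeit.timeit(loop, number=1)
    return n / seconds


ratio = flops_ratio(A100_PEAK_FLOPS, PYTHON_ADDS_PER_SEC)
print(f"{ratio:,.0f} A100 FLOPs per Python FLOP")
```

Swapping in the result of `measure_python_adds_per_sec()` for the assumed constant gives the same comparison for your own machine.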

    • patmorgan23 1 hour ago
      Why are we comparing a programming language and a GPU? This is a category error. Programming languages do not do any operations. They perform no FLOPs; they are the thing the FLOPs are performing.

      "The i7-4770K can perform 20k more FLOPs than C++" is an equally sensible statement (i.e. not).

    • p1esk 1 hour ago
      This statement makes zero sense
    • xyzsparetimexyz 1 hour ago
      Single core vs multi core accounts for much of this
      • cdavid 1 hour ago
        Not really. The many cores of a GPU, at least for fp32, give you 2 to 4 orders of magnitude compared to a high-speed CPU.

        The rest will come from going from "python float" (i.e. not numpy) to C, which already gives you a 2 to 3 orders of magnitude difference, and then another 2 to 3 from plain C to optimized SIMD.

        See e.g. https://github.com/Avafly/optimize-gemm for how you can get 2 to 3 orders of magnitude just from C.
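
To get a feel for the "python float" end of that range, here is a minimal sketch of a naive triple-loop matrix multiply in pure Python that also reports the achieved FLOP/s. The function names and the default size of 64 are hypothetical choices for illustration, and this is not taken from the linked repository. It counts one multiply and one add per inner iteration, so 2·n³ FLOPs for square matrices.

```python
import random
import time


def matmul_naive(a, b):
    """Multiply two matrices stored as lists of lists, one scalar op at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply and one add
            c[i][j] = acc
    return c


def measure_flops(n=64):
    """Return (flop_count, flops_per_second) for an n x n naive matmul."""
    rng = random.Random(0)  # fixed seed so the inputs are reproducible
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    start = time.perf_counter()
    matmul_naive(a, b)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    flop_count = 2 * n ** 3
    return flop_count, flop_count / elapsed


if __name__ == "__main__":
    count, rate = measure_flops()
    print(f"{count} FLOPs at roughly {rate:,.0f} FLOP/s")
```

Comparing the printed rate with the peak rating of your CPU or GPU gives a rough idea of how many orders of magnitude are on the table.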

  • jdw64 1 hour ago
    Right now, all I know how to do is pull models from Hugging Face, but someday I want to build my own small LLM from scratch
    • max-amb 48 minutes ago
      If you want a written resource, I have a blog post about the mathematics behind building a feed-forward network from scratch: https://max-amb.github.io/blog/the_maths_behind_the_mlp/. It kinda focuses on the translation from individual components to matrix operations.
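
As a rough illustration of that translation (a sketch only, not code from the blog post), the same dense layer can be written neuron by neuron or as a matrix-vector product plus a bias. The sigmoid activation and all function names here are assumptions chosen for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_per_neuron(weights, biases, x):
    """Compute each neuron separately: a_i = sigmoid(b_i + sum_j w_ij * x_j)."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = b
        for w, xi in zip(w_row, x):
            z += w * xi
        outputs.append(sigmoid(z))
    return outputs


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(w * xi for w, xi in zip(row, vector)) for row in matrix]


def layer_matrix(weights, biases, x):
    """Same layer in matrix form: a = sigmoid(W x + b)."""
    z = [zi + b for zi, b in zip(matvec(weights, x), biases)]
    return [sigmoid(zi) for zi in z]


W = [[0.5, -1.0], [2.0, 0.0]]  # 2 neurons, 2 inputs each
b = [0.0, -2.0]
x = [1.0, 0.5]
print(layer_per_neuron(W, b, x))
print(layer_matrix(W, b, x))
```

Both calls should agree up to floating-point rounding; the matrix form is the one that maps onto optimized linear algebra routines.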
    • kflansburg 59 minutes ago
      If you aren't already aware, Karpathy has several videos that could get you there in a few hours https://www.youtube.com/@AndrejKarpathy
      • jdw64 57 minutes ago
        Thank you very much!
    • glouwbug 1 hour ago
      It’s just linear algebra. Work your way from feed forward to CNN to RNN to LSTM to attention, then maybe a small inference engine. Karpathy’s llama2.c is only ~300 lines on the latter and it pragma simds so you don’t need fancy GPUs
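
To make the "just linear algebra" point concrete for the attention step, here is a minimal sketch of single-head scaled dot-product attention in pure Python. It assumes no masking, no batching, and no learned projections, and the function names are hypothetical; it is not how llama2.c is written.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V, row by row."""
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs


# Identical keys give equal weights, so the output is the mean of the values.
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
queries = [[1.0, 1.0]]
print(attention(queries, keys, values))
```

Everything here is dot products, a softmax, and a weighted sum, which is why the same computation ports directly to matrix libraries or SIMD loops.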
  • noosphr 1 hour ago
    >For example, getting good performance on a dataset with deep learning also involves a lot of guesswork. But, if your training loss is way lower than your test loss, you're in the "overfitting" regime, and you're wasting your time if you try to increase the capacity of your model.

    https://arxiv.org/abs/1912.02292

    • appplication 1 hour ago
      Generally, posting a link-only reply without further elaboration comes across as a bit rude. Are you providing support for the above point? Refuting it? You felt compelled to comment, so a few words to indicate what you’re actually trying to say would go a long way.
      • noosphr 1 hour ago
        >We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better.
        • ForceBru 36 minutes ago
          Right, isn't double descent one of the reasons why modern Extremely Large Language Models work at all? I think I heard somewhere that basically all of today's "smart" (reasoning, solving math problems, etc.) LLMs are trained in the "double descent" territory (whatever this means, I'm not entirely sure).
          • SiempreViernes 8 minutes ago
            No, double descent is a symptom of whatever it is that makes the deep models work at all. It's just the name for something you see happen when it works. The reason it works has something to do with how all those extra dimensions work as a regularisation term in the fit.