Our New Sam Audio Model Transforms Audio Editing

(about.fb.com)

77 points | by ushakov 6 days ago

7 comments

  • ks2048 3 hours ago
    I recently discovered Audacity includes plug-ins for audio separation that work great (e.g. split into a vocals track and an instruments track). The model it uses also originated at Facebook (demucs).
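
    If you want to script it outside Audacity, here's a minimal sketch of driving the Demucs CLI from Python (a sketch only: it assumes pip install demucs put a demucs command on PATH, and the flags/output layout are from memory, so check demucs --help for your version):

      # Hedged sketch: run the Demucs CLI on one file from Python.
      import subprocess
      from pathlib import Path

      def separate(track: str, out_dir: str = "separated") -> Path:
          # --two-stems vocals asks for a vocals / everything-else split instead of four stems
          subprocess.run(["demucs", "--two-stems", "vocals", "-o", out_dir, track], check=True)
          return Path(out_dir)  # stems land under <out_dir>/<model name>/<track name>/

      separate("song.mp3")  # placeholder filename
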
    • tantalor 3 hours ago
      Is "demucs" a pun on demux (demultiplexer)?
  • yunwal 3 hours ago
    This is hilariously bad with music. Like I can type in the most basic thing like "string instruments" which should theoretically be super easy to isolate. You can generally one-shot this using spectral analysis libraries. And it just totally fails.
    • photon_garden 2 hours ago
      I had the same experience. It did okay at isolating vocals, but it failed or only half-succeeded at everything else.
    • duped 3 hours ago
      What in theory makes those "super easy" to isolate? Humans are terrible at this to begin with; it takes years to train one of them to do it mildly well. Computers are even worse: blind source separation and the cocktail party problem have been the white whale of audio DSP for decades (and only very recently did tools become passable).
      • yunwal 2 hours ago
        The fact that you can do it with spectral analysis libraries, no LLM required.

        This is much easier than source separation. It would be different if I were asking to isolate a violin from a viola or another violin; you’d have to get much more specific about the timbre of each instrument and potentially understand what each instrument’s part was.

        But a vibrating string produces a very distinctive waveform that is easy to pick out in a file.
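
        As a toy illustration of the kind of spectral masking I'm talking about (not a robust separator, just the "harmonic comb" idea; librosa and soundfile are assumed installed, and the 220 Hz fundamental and 20 Hz tolerance are numbers picked by hand):

          import numpy as np
          import librosa
          import soundfile as sf

          y, sr = librosa.load("mix.wav", sr=None, mono=True)   # placeholder filename
          S = librosa.stft(y, n_fft=4096, hop_length=1024)
          freqs = librosa.fft_frequencies(sr=sr, n_fft=4096)

          f0 = 220.0                                  # assumed fundamental (A3), chosen by hand
          keep = np.zeros_like(freqs, dtype=bool)
          for k in range(1, 40):                      # first 40 harmonics of f0
              keep |= np.abs(freqs - k * f0) < 20.0   # keep bins within 20 Hz of each harmonic

          y_comb = librosa.istft(S * keep[:, None], hop_length=1024, length=len(y))
          sf.write("harmonic_comb.wav", y_comb, sr)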

        • duped 2 hours ago
          Are you making this up? What spectral analysis libraries or tools?

          String instruments create harmonic series similar to horns, winds, and voice (because everything is a string in some dimension), and the major differences are in the spectral envelope, something that STFT tools are only okay at approximating because of the time/frequency tradeoff (aka the uncertainty principle).

          This is a very hard problem "in theory" to me, and I'm just above casually versed in it.
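
          To put rough numbers on that tradeoff (sample rate and window sizes here are arbitrary, just for illustration):

            # Back-of-envelope STFT time/frequency tradeoff: a longer window resolves
            # closely spaced partials but smears fast attacks, and vice versa.
            sr = 44100
            for n_fft in (512, 4096):
                freq_res = sr / n_fft           # spacing between FFT bins, in Hz
                time_res = 1000.0 * n_fft / sr  # window length, in milliseconds
                print(f"n_fft={n_fft}: ~{freq_res:.1f} Hz per bin over a ~{time_res:.1f} ms window")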

          • jb1991 17 minutes ago
            If you look at the actual harmonics of a string and of a horn, you will see how wrong you are. There is a reason they sound different to the ear.

            It’s because of this that you can have a relatively inexpensive synthesizer (not sample or PCM based) that does a crude job of mimicking these different instruments by just changing the harmonics.

          • 613style 2 hours ago
            He's not making it up, and there's no reason for that tone. Strings are more straightforward to isolate than vocals/horns/etc. because they produce a near-perfect harmonic series that shows up as parallel lines in a spectrogram. The time/frequency tradeoff exists, but it's less of a problem for strings because of their slow attack.

            You can look up HPSS and Python libraries like Essentia and Librosa.
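
            For example, a minimal HPSS sketch with Librosa (filename is a placeholder; note HPSS gives you a harmonic/percussive split, not a strings-vs-everything separator):

              import librosa
              import soundfile as sf

              y, sr = librosa.load("mix.wav", sr=None, mono=True)
              # Split into sustained (harmonic) and transient (percussive) layers
              y_harm, y_perc = librosa.effects.hpss(y, margin=3.0)
              sf.write("harmonic.wav", y_harm, sr)
              sf.write("percussive.wav", y_perc, sr)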

            • IndySun 1 hour ago
              Hmmm... was 'tone' a pun?

              Why mention a string's 'slow attack' as less of a problem? No isolation software considers this an easy route.

              Vocals are more effectively isolated by virtue of the fact that they are unique-sounding. Strings (and other sounds) are similar in some ways but far more generic. All software out there indicates this, including the examples mentioned.

          • dleeftink 1 hour ago
            I might misremember, but iZotope RX and Melodyne were pretty useful in this regard.
  • yjftsjthsd-h 5 hours ago
    > Visual prompting: Click on the person or object in the video that’s making a sound to isolate their audio.

    How does that work? Correlating sound with movement?

    • janalsncm 2 hours ago
      If it’s anything like the original SAM, thousands of hours of annotator time.

      If I had to do it synthetically, I'd take single subjects each making a single sound and mix them together, then train a model to separate them again.
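
      Roughly, a toy version of that recipe (filenames are placeholders; a real pipeline would also randomize gains, offsets, room effects, etc.):

        import numpy as np
        import soundfile as sf

        def make_training_pair(path_a: str, path_b: str):
            # Read two isolated mono recordings and sum them into a synthetic mixture;
            # the originals become the separation targets.
            a, sr_a = sf.read(path_a, dtype="float32")
            b, sr_b = sf.read(path_b, dtype="float32")
            assert sr_a == sr_b, "resample first if sample rates differ"
            n = min(len(a), len(b))
            a, b = a[:n], b[:n]
            mix = a + b
            mix /= max(1.0, float(np.abs(mix).max()))  # avoid clipping
            return mix, (a, b), sr_a                   # model input, targets, sample rate

        mix, targets, sr = make_training_pair("dog_bark.wav", "person_talking.wav")
        sf.write("mixture.wav", mix, sr)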

    • yodon 4 hours ago
      Think about it conceptually:

      Could you watch a music video and say "that's the snare drum, that's the lead singer, keyboard, bass, that's the truck that's making the engine noise, that's the crowd that's cheering, oh and that's a jackhammer in the background"? So can AI.

      Could you point out who is lead guitar and who is rhythm guitar? So can AI.

      • scarecrowbob 2 hours ago
        I mean, sometimes I'm -mixing- a show and I couldn't tell you where a specific sound is coming from....
        • yodon 2 hours ago
          > sometimes I'm -mixing- a show and I couldn't tell you where a specific sound is coming from

          And in those situations it won't work. Is any of this really a surprise?

  • ajcp 5 hours ago
    Given TikTok's insane creator adoption rate, is Meta developing these models to build out a content creation platform to compete?
    • mgraczyk 3 hours ago
      I doubt it. Although it's possible these models will be used for creator tools, I believe the main idea is to use them for data labeling.

      At the time the first SAM was created, Meta was already spending over $2B/year on human labelers. Surely that number is higher now, and research like this can dramatically increase data labeling volume.

  • ac2u 4 hours ago
    I wonder if the segmentation would work with a video of a ventriloquist and a dummy?
  • teeray 4 hours ago
    I wonder if this would be useful to hearing aid users for reducing the background restaurant babble that overwhelms the people you want to hear.
  • m3kw9 4 hours ago
    Can I create a continuous “who farted” detector? Would be great at parties
    • rmnclmnt 27 minutes ago
      Bighead is back! « Fart Alert »!
    • IncreasePosts 1 hour ago
      Each person's unique fartprint is yet another way big tech will be tracking us