50 comments

  • cyp0633 3 days ago
    The same happens with whisper-large-v3 on Chinese transcription: silence is transcribed to something like "please upvote, share and favourite this video". I suspect they trained the model on random YouTube videos without carefully selecting useful data.
    • ttflee 3 days ago
      In Chinese, it always added something like "For study/research purpose only. Please delete after 48 hours." This is what those volunteers added in subtitles of (pirated) movies/shows.
      • codedokode 3 days ago
        Fair: if AI companies are allowed to download pirated content for "learning", why can't ordinary people?
        • snickerdoodle12 3 days ago
          There is so much damning evidence that AI companies have committed absolutely shocking amounts of piracy, yet nothing is being done.

          It only highlights how the world really works. If you have money you get to do whatever the fuck you want. If you're just a normal person you get to spend years in jail or worse.

          Reminds me of https://www.youtube.com/watch?v=8GptobqPsvg

          • shadowgovt 3 days ago
            There's actually a lot of court activity on this topic, but the law moves slowly and is reluctant to issue injunctions where harm is not obvious.

            It's more that the law about "one guy decides to pirate twelve movies to watch them at home and share with his buddies" is already well-settled, but the law about "a company pirates 10,000,000 pieces to use as training data for an AI model (a practice that the law already says is legal in an academic setting, i.e. universities do this all the time and nobody bats an eye)" is more complicated and requires additional trials to resolve. And no, even though the right answer may be self-evident to you or me, it's not settled law, and if the force of law is applied poorly suddenly what the universities are doing runs afoul of it and basically nobody wants that outcome.

            • BobbyTables2 1 day ago
              What’s ironic to me is that had these companies pirated only a single work, wouldn’t that be a chargeable crime?

              Clearly Bonnie and Clyde shouldn’t have been prosecuted. Imagine they were just robbing banks for literary research purposes. They could have then used the learnings to write a book and sell it commercially…

              Or imagine one cracks 10000 copyrighted DVDs and then sells 30 second clips… (a derived work).

              To me, for profit companies and universities have a huge difference — the latter is not seeking to directly commercially profit from copyrighted data.

            • Workaccount2 3 days ago
              There is a distinction that must be made that very few people make, but thankfully the courts seem to grasp:

              Training on copyrighted material is a separate claim from skirting payment for copyrighted material.

              Which pretty much boils down to: "If they put it out there for everyone to see, it's probably OK to train on it, if they put it behind a paywall and you don't pay, the training part doesn't matter, it's a violation."

              • Analemma_ 3 days ago
                Whether it’s legal slash fair use to train on copyrighted material is only one of the questions currently being asked though. There’s a separate issue at play where these companies are pirating the material for the training process.

                By comparison, someone here brought up that it might be transformative fair use to write a play heavily based on Blood Meridian, but you still need to buy a copy of the book. It would still be infringement to pirate the e-book for your writing process, even if the end result was legal.

                • codedokode 3 days ago
                  If they bought material at large scale, the seller might require them to sign a contract that demands royalties if the material is used for training an AI. So buying legally is a way to put yourself into a trap.
                  • brookst 3 days ago
                    They can buy individual works like anyone else.

                    Or they can negotiate a deal at scale with whatever price / restrictions make sense to both parties.

                    I don’t see a way they could be “trapped”. Worst case they pay retail price.

                  • shadowgovt 3 days ago
                    What is the precedent on that kind of agreement?

                    The only thing I've been able to find is the note that since copyright is federal law, state contract law actually can't supersede it, to wit: if you try to put a clause in the contract that says the contract is void if I use your work to make transformative fair-use works (or I owe you a fee), that clause is functionally unenforceable (for the same reason that I don't owe you a fee if I make transformative fair-use works of your creations in general).

              • snickerdoodle12 3 days ago
                So if I download copyrighted material like the new disney movie with fansubs and watch it for training purposes instead of enjoyment purposes it's fine? In that case I've just been training myself, your honor. No, no, I'm not enjoying these TV shows.

                Because it's important to grasp the scale of these copyright violations:

                * They downloaded, and admitted to using, Anna's Archive: Millions of books and papers, most of which are paywalled but they pirated it instead

                * They acquired Movies and TV shows and used unofficial subtitles distributed by websites such as OpenSubtitles, which are typically used for pirated media. Official releases such as DVDs tend to have official subtitles that don't sign off with "For study/research purpose only. Please delete after 48 hours" or "Subtitles by %some_username%"

                • moralestapia 2 days ago
                  OpenSubtitles has nothing to do with pirated media. Transcripts/translations are fair use. Their own use case is fair use as well.
                  • snickerdoodle12 2 days ago
                    OpenSubtitles is almost exclusively used with pirated media. Official copies come with official subtitles. OpenSubtitles itself is legal, but that's not the point at all.
                • Workaccount2 3 days ago
                  I don't know what is confusing here, perhaps my comment isn't clear.

                  If you skirt payment, it's a violation. If it's free, but still copyrighted, it's likely not a violation.

                  • snickerdoodle12 3 days ago
                    They've done both, so my confusion is about why you are bringing this up?
              • kmeisthax 3 days ago
                [dead]
          • 4gotunameagain 3 days ago
            If you owe the bank $1,000 you have a problem.

            If you owe the bank $100,000,000 the bank has a problem.

            We live in an era where the president of the United States uses his position to pump crypto scams purely for personal profit.

            • kyleee 2 days ago
              10% for the big don
          • NoMoreNicksLeft 3 days ago
            The dead corpses of filmmakers and authors and actors are buried in unmarked graves out behind those companies' corporate headquarters. Unimaginable horror, that piracy. Why has no one intervened?

            >If you're just a normal person you get to spend years in jail or worse.

            Not that I'm a big fan of the criminalization of copyright infringement in the United States, but who has ever spent years in jail for this?

            Besides, if it really bothered you, then we might not see this weird tone-switch from one sentence to the next, where you seem to think that piracy is shocking and "something should be done" and then "it's not good that someone should spend time in jail for it". What gives?

            • degamad 2 days ago
              > who has ever spent years in jail for this?

              Aaron Swartz?

              EDIT: apparently he wasn't in jail, he was on bail while the case was ongoing - but the shortest plea deal would still have had him in jail for 6 months, and the penalty was 35 to 50 years.

            • snickerdoodle12 3 days ago
              > Besides, if it really bothered you, then we might not see this weird tone-switch from one sentence to the next, where you seem to think that piracy is shocking and "something should be done" and then "it's not good that someone should spend time in jail for it". What gives?

              What a weirdly condescending way to interpret my post. My point boils down to: Either prosecute copyright infringement or don't. The current status quo of individuals getting their lives ruined while companies get to make billions is disgusting.

              • brookst 3 days ago
                > Either prosecute copyright infringement or don't

                This is the absolute core of the issue. Technical people see law as code, where context can be disregarded and all that matters is specifying the outputs for a given set of inputs.

                But law doesn’t work that way, and it should not work that way. Context matters, and it needs to.

                If you go down the road of “the law is the law and billion dollar companies working on product should be treated the same as individual consumers”, it follows that individuals should do SEC filings (“either require 10q’s or don’t!”), and surgeons should be jailed (“either prosecute cutting people with knives or don’t!”).

                There is a lot to dislike about AI companies, and while I believe that training models is transformative, I don’t believe that maintaining libraries of pirated content is OK just because it’s an ingredient to training.

                But insisting that individual piracy to enjoy entertainment without paying must be treated exactly the same as datasets for model training is the absolute weakest possible argument here. The law is not that reductive.

                • wormius 2 days ago
                  > But law doesn’t work that way, and it should not work that way. Context matters, and it needs to.

                  As Anatole France famously quipped:

                  "The law, in its majestic equality, forbids the rich and poor alike to sleep under bridges, to beg in the streets, and to steal bread."

                • snickerdoodle12 2 days ago
                  Pretty funny that your argument boils down to: It's okay to break the law if you do it as a company.

                  Copyright laws target everyone. SEC laws don't.

                • HWR_14 2 days ago
                  It doesn't matter whether it's transformative. Copyright covers derivative works.
          • alphan0n 3 days ago
            No one (in the US) has been jailed for downloading copyrighted material.
            • snickerdoodle12 3 days ago
              https://en.wikipedia.org/wiki/Aaron_Swartz

              And the US is not the only jurisdiction

              • gruez 3 days ago
                That's not the same as piracy though. He wasn't downloading millions of scientific papers from libgen or sci-hub, he was downloading them directly from jstor. Indeed, none of his charges were for copyright infringement. They were for stuff like "breaking and entering" and "unauthorized access to a computer network".
                • snickerdoodle12 3 days ago
                  The exact same charges could apply to the AI scrapers illegitimately accessing random websites.
                  • dragonwriter 3 days ago
                    No, they couldn't, since the then-novel and untested strained interpretation of the CFAA that the prosecutor was relying on has since been tested in the courts and soundly rejected.
                  • kube-system 3 days ago
                    I haven’t seen any accusations that they’ve done that, though. Usually people get pirated material from sources that intentionally share pirated material.
                    • snickerdoodle12 3 days ago
                      They're not just training on pirated content, they've also scraped literally the entire internet and used that too.
                      • kube-system 2 days ago
                        Scraping the public internet is also not a CFAA violation
                        • snickerdoodle12 2 days ago
                          CFAA bans accessing a protected computer without authorization. Hitting URLs denied by robots.txt has been argued to be just that.
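
                          For reference, a minimal sketch of the check a crawler would run to see whether a URL is "denied by robots.txt", using Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent, and the example.com URLs are made up for illustration; this only shows the technical test, not anything about how a court would treat it.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, not taken from any real site.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.com/public/index.html",
                "https://example.com/private/page"):
        print(url, allowed(EXAMPLE_RULES, "ExampleBot", url))
```

                          Nothing enforces the answer: a scraper that skips this check can still request the disallowed URL, which is why the argument is about authorization rather than access control.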
                          • dragonwriter 2 days ago
                            > Hitting URLs denied by robots.txt has been argued to be just that.

                            "Has been argued" -- sure, but never successfully; in fact, in HiQ v. LinkedIn, the 9th Circuit ruled (twice, both before and on remand again after and applying the Supreme Court ruling in Van Buren v. US) against a cease and desist on top of robots.txt to stop accessing data on a public website constituting "without authorization" under the CFAA.

                            • snickerdoodle12 2 days ago
                              Now do every other jurisdiction
                              • gruez 2 days ago
                                CFAA was mentioned specifically, which means only US jurisdiction is relevant here.
                  • gruez 3 days ago
                    Part of the accusation comes from the fact that Swartz accessed the downloads through an MIT network closet, which AI companies weren't doing. The equivalent to that would be if openai broke into a wiring closet at Disneyland to download Disney movies.
                    • snickerdoodle12 3 days ago
                      The CFAA is vague enough to punish unauthorized access to a computer system. I don't have an example case in mind, but people have gotten in trouble for scraping websites before while ignoring e.g. robots.txt
                      • gruez 3 days ago
                        The CFAA might be vague, but the case law on scraping pretty much has been resolved to "it's pretty much legal except in very limited circumstances". It's regrettable that less resourced defendants were harassed before large corporations were able to secure such rulings, but the rulings that allowed scraping occurred before AI companies' scraping was done, so it's unclear why AI companies in particular should be getting flak here.
              • alphan0n 3 days ago
                Aaron Swartz was not jailed or even charged for copyright infringement. The discussion and the comment I replied to is centered around US companies and jurisdiction.
                • snickerdoodle12 3 days ago
                  The thread is centered around US companies, but not US jurisdiction.
            • codedokode 3 days ago
              There could be a moral question. For example a researcher might not want to download a pirated paper and cause loss to a fellow researcher. But it becomes pretty stupid to pay when everyone, including large reputable companies endorsed by the government, is just downloading the content for free. Maybe his research will help develop faster chips to win against China, so why should he pay?

              Would it be a "fair use" to download pirated papers for research instead of buying?

              Also I was gradually migrating from obtaining software from questionable sources to open source software, thinking that this is going out of trend and nobody torrents apps anymore, but it seems I was wrong?

              Or another example: if someone wants to make contributions to Wine but needs a copy of Windows for developing the patch, what would be the right choice, buy it or download a free copy from a questionable source?

              • immibis 2 days ago
                Researchers don't get paid when their papers are downloaded, though. They pay to have their papers downloaded, and the middleman makes money on both sides. Piracy is the only moral option for them. There is a reason every single competent professor in the western world will email you a free copy of their papers if you ask nicely.
            • codedokode 3 days ago
              What about people filming movies in the cinema (for learning of course)? [1]

              [1] https://www.thefederalcriminalattorneys.com/unauthorized-rec...

          • CamperBob2 3 days ago
            No, if you revolutionize both the practice and philosophy of computing and advance mankind to the next stage of its own intellectual evolution, you get to do whatever the fuck you want.

            Seems fair.

            • recursive 3 days ago
              Hm. Not a given that it's an advance.
              • Nevermark 3 days ago
                I get the common cynical response to new tech, and the reasons for it.

                We wish we lived in a world where change was reliably positive for our lives. Often changes are sold that way, but they rarely are.

                But when new things introduce dramatic capabilities that former things couldn't match (every chatbot before LLMs), it is as clear of an objective technological advance as has ever happened.

                --

                Not every technical advance reliably or immediately makes society better.

                But whether or when technology improves the human condition is far more likely to be a function of human choices than the bare technology. Outcomes are strongly dependent on the trajectories of who has a technology, when they do, and how they use it. And what would be the realistic (not wished for) outcome of not having or using it.

                For instance, even something as corrosive as social media, as it is today, could have existed in strongly constructive forms instead. If society viewed private surveillance, unpermissioned collation across third parties, and weaponizing of dossiers via personalized manipulation of media, increased ad impact and addictive-type responses, as ALL being violations of human rights to privacy and freedom from coercion or manipulation. And worth legally banning.

                Ergo, if we want tech to more reliably improve lives, we need to ban obviously perverse human/corporate behaviors and conflicts of interest.

                (Not just shade tech. Which despite being a pervasive response, doesn't seem to improve anything.)

              • CamperBob2 3 days ago
                At the risk of stepping on a well-known land mine around here, how'd you do on the IMO problem set this year?
                • recursive 2 days ago
                  I didn't participate. I probably wouldn't have done well. I disagree with your framing.
                  • CamperBob2 2 days ago
                    Well, wait, if somebody writes a computer program that answers 5 of 6 IMO questions/proofs correctly, and you don't consider it an "advance," what would qualify?

                    Either both AI teams cheated, in which case there's nothing to worry about, or they didn't, in which case you've set a pretty high bar. Where is that bar, exactly? What exactly does it take to justify blowing off copyright law in the larger interest of progress? (I have my own answers to that question, including equitable access to the resulting models regardless of how impressive their performance might be, but am curious to hear yours.)

                    • recursive 2 days ago
                      The technology is capable in a way that never existed before. We haven't yet begun to see the impacts of that. I don't think it will be good for humanity.

                      Social networks as they exist today represent technology that didn't exist decades ago. I wouldn't call it an "advancement" though. I think social media is terrible for humans in aggregate.

                    • immibis 2 days ago
                      I notice you've motte-and-baileyed from "revolutionize both the practice and philosophy of computing and advance mankind to the next stage of its own intellectual evolution" to simply "is considered an 'advance'".
                      • CamperBob2 2 days ago
                        You may have meant to reply to someone else. recursive is the one who questioned whether an advance had really been made, and I just asked for clarification (which they provided).

                        I'm pretty bullish on ML progress in general, but I'm finding it harder every day to disagree with recursive's take on social media.

            • verandaguy 3 days ago
              Except that the jury’s (at best) still out on whether the influence of LLMs and similar tech on knowledge workers is actually a net good, since it might stunt our ability to think critically and solve problems while confidently spewing hallucinations at random, all while model alignment is unregulated, haphazard, and (again at best) more of an art than a science.
              • CamperBob2 3 days ago
                Well, if it's no big deal, you and the other copyright maximalists who have popped out of the woodwork lately have nothing to worry about, at least in the long run. Right?
                • verandaguy 2 days ago
                  It's not about copyright _maximalism,_ it's about having _literally any regard for copyright_ and enforcing the law in a proportionate way regardless of who's breaking the laws.

                  Everyone I know has stories about their ISP sending nastygrams threatening legal action over torrenting, but now that corporations (whose US legal personhood appears to matter only when it benefits them) are doing it as part of the development of a commercial product that they expect to charge people for, that's fine?

                  And in any case, my argument had nothing to do with copyright (though I do hate the hypocrisy of the situation), and whether or not it's "nothing to worry about" in the long run, it seems like it'll cause a lot of harm before the benefits are felt in society at large. Whatever purported benefits actually come of this, we'll have to deal with:

                  - Even more mass layoffs that use LLMs as justification (not just in software, either). These are people's livelihoods; we're coming off of several nearly-consecutive "once-in-a-generation" financial crises, a growing affordability crisis in much of the developed world, and stagnating wages. Many people will be hit very hard by layoffs.

                  - A seniority crisis as companies increasingly try to replace entry-level jobs with LLMs, meaning that people in a crucial learning stage of their jobs will have to either replace much of the learning curve for their domain with the learning curve of using LLMs (which is dubiously a good thing), or face unemployment, and leaving industries to deal with the aging-out of their talent pools

                  - We've already been heading towards something of an information apocalypse, but now it seems more real than ever, and the industry's response seems to broadly be "let's make the lying machines lie even more convincingly"

                  - The financial viability of these products seems... questionable right now, at best, and given that the people running the show are opening up data centres in some of the most expensive energy markets around (and in the US's case, one that uniquely disincentivizes the development of affordable clean energy), I'm not sure that anyone's really interested in a path to financial sustainability for this tech

                  - The environmental impact of these projects is getting to be significant. It's not as bad as Bitcoin mining yet, AFAIK, but if we keep on, it'll get there.

                  - Recent reports show that the LLM industry is starting to take up a significant slice of the US economy, and that's never a good sign for an industry that seems to be backed by so much speculation rather than real-world profitability. This is how market crashes happen.

        • gruez 3 days ago
          >why ordinary people cannot

          They can. I don't think anyone got prosecuted for using an illegal streaming site or downloading from sci-hub, for instance. What people do get sued for is seeding, which counts as distribution. If anything AI companies are getting prosecuted more aggressively than "ordinary people", presumably because of their scale. In a recent lawsuit Anthropic won on the part about AI training on books, but lost on the part where they used pirated books.

          • codedokode 3 days ago
            As I understand it, people have gotten in trouble for filming in the cinema; there is a separate law for that.
            • gruez 3 days ago
              But in that case even though filming isn't technically distribution, it's clearly a step to distributing copies? To take this to the extreme, suppose you ripped a blu-ray, made a thousand copies, but haven't packaged or sold them yet. If the FBI busted in, you'd probably be prosecuted for "conspiracy to commit copyright infringement" at the very least.
              • snickerdoodle12 3 days ago
                It's just "training"
                • gruez 3 days ago
                  You seem to equate "training" (with scare quotes) with someone actually pirating a blu-ray, but they really aren't equivalent. Courts so far have ruled that training is fair use and it's not hard to see why. Unlike copying a movie almost verbatim (as with ripping a blu-ray), AI companies are actually producing something transformative in the form of AI models. You don't have to like AI models, or the AI companies' business models, but it strains credulity to pretend ripping a blu-ray is somehow equivalent to training an AI model.
                  • snickerdoodle12 3 days ago
                    Who's to say why I downloaded and am now watching a movie? Is it for my enjoyment? Is it because I'm training my brain? How is me training my brain any different from companies training their LLMs?

                    Same goes for recording: I'm just training my skills of recording. Or maybe I'm just recording it so I can rewatch it later, for training purposes, of course.

                    • gruez 2 days ago
                      >Who's to say why I downloaded and am now watching a movie? Is it for my enjoyment? Is it because I'm training my brain? How is me training my brain any different from companies training their LLMs?

                      None of this is relevant because Anthropic was only let off the hook for training, and not for pirating the books itself. So far as the court cases are playing out, there doesn't appear to be a special piracy exemption for AI companies.

                      >Same goes for recording: I'm just training my skills of recording. Or maybe I'm just recording it so I can rewatch it later, for training purposes, of course.

                      You can certainly use that as a defense. That's why we have judges, otherwise there's going to be some smartass caught with 1KG of coke and claiming it's for "personal consumption" rather than distribution.

                      None of this matters in reality, though. If you're caught with AV gear in a movie theater once, you'd likely be ejected and banned from the establishment/chain, not have the FBI/MPAA go after you for piracy. If you come again, you'd likely be prosecuted for trespassing. In the cases where they're going after someone in particular for making these rips, they usually have a dossier of evidence, like surveillance/transaction history showing that the same individual has been repeatedly recording movies, and watermarks correlating the screenings that the person has been in to files showing up on torrent sites.

        • shadowgovt 3 days ago
          IANAL, but reading a bit on this topic: the relevant part of the copyright law for AI isn't academia, it's transformative work. The AI created by training on copyrighted material transforms the material so much that it is no longer the original protected work (collage and sampling are the analogous transformations in the visual-arts and music industries).

          As for actually gathering the copyrighted material: I believe the jury hasn't even been empaneled for that yet (in the OpenAI case), but the latest ruling from the court is that copyright may have been violated in the creation of their training corpus.

        • robswc 2 days ago
          AFAIK, downloading or watching pirated stuff isn't something you'll get in trouble for. Hosting and distributing it is what will get you.
        • 0x457 3 days ago
          Well, it just shows that they've downloaded subtitles.
      • kgeist 3 days ago
        Interesting, in Russian, it often ends with "Subtitles by %some_username%"
      • cyp0633 3 days ago
        That is not the case here - I never encountered this with whisper-large-v3 or similar ASR models. Part of the reason, I guess, is that those subs are burnt into the movie, which makes them hard to extract. Standalone subs need the corresponding video resource to match the audio and text. So nothing is better than YouTube videos which are already aligned.
        • simsla 3 days ago
          At least for English, those "fansubs" aren't typically burnt into the movie*, but ride along in the video container (MP4/MKV) as subtitle streams. They can typically be extracted as SRT files (plain text with sentence level timestamps).

          *Although it used to be more common for AVI files in the olden days.
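
          A rough sketch of what that looks like in practice, assuming the usual SRT layout (cue index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then text, with blank lines between cues). The `Cue` type, the helper names, and the credit patterns are hypothetical, chosen only to match the sign-off lines quoted upthread.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: str
    end: str
    text: str


# Timing line of an SRT cue, e.g. "00:01:02,500 --> 00:01:04,000".
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})")

# Illustrative credit patterns only; real fansub sign-offs vary widely.
CREDIT = re.compile(r"subtitles by|delete after 48 hours", re.IGNORECASE)


def parse_srt(data: str) -> List[Cue]:
    """Split SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    normalized = data.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.splitlines()
        for idx, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if match:
                text = "\n".join(lines[idx + 1:]).strip()
                cues.append(Cue(match.group(1), match.group(2), text))
                break
    return cues


def drop_credits(cues: List[Cue]) -> List[Cue]:
    """Remove cues whose text looks like a subtitler credit or notice."""
    return [cue for cue in cues if not CREDIT.search(cue.text)]


SAMPLE = """\
1
00:00:01,000 --> 00:00:03,000
Hello there.

2
00:59:58,000 --> 01:00:00,000
Subtitles by some_username
"""

if __name__ == "__main__":
    for cue in drop_credits(parse_srt(SAMPLE)):
        print(cue.start, cue.end, cue.text)
```

          A filter like this over the final cues is the kind of cleanup that would keep credit lines out of an audio/text training pair; whether any particular dataset did so is not something this thread establishes.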

          • Maken 3 days ago
            SRT is ancient. Nowadays everyone uses ASS subtitles which can be randomly styled.
            • simsla 3 days ago
              In general? In the past I've known ASS to be used a lot for things like anime, but less for live action shows.
              • Maken 3 days ago
                I have also found them inside mkvs as the subtitle track. I think SRT was the default because most content was ripped from DVD/BD, but now most of the content is from streaming sources and you need to convert the subtitles anyway.
            • conradev 3 days ago
              WebVTT (a SubRip successor) is probably more widely used than ASS
              • Maken 3 days ago
                By legit providers, probably.
          • ethbr1 3 days ago
            flashbacks of trying to track down subs sync’d to a specific release
    • isoprophlex 3 days ago
      Indeed, with another model I would get persistent transcriptions of silent parts into 'Thanks for watching!' or '[MUSIC]'. Pretty dumb that this failure mode wasn't caught in some QA process, and there are now multiple transcription models suffering from the same issue. Having silent parts in your input audio seems like it should be a very common occurrence...
      • rollcat 3 days ago
        When I was taught mathematics, the zero value was always considered the most important edge case. You prove something for N=0 (or N=1), then for N=M+1 assuming it holds for N=M.

        It's even more important in audio DSP: processing near-zeroes can end up being extremely CPU intensive, look up denormal/subnormal floats.
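
        To make the subnormal range concrete, here is a small sketch assuming IEEE 754 doubles (which is what Python's `float` is on common platforms). The helper names are made up for illustration. It halves a signal from 1.0 until it reaches zero, counting how many values land in the subnormal range, and shows a software flush-to-zero. It does not measure the CPU cost, which depends on the hardware.

```python
import struct
import sys


def is_subnormal(x: float) -> bool:
    """True if x is a nonzero double whose exponent field is all zeros."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return exponent == 0 and mantissa != 0


def flush_to_zero(x: float) -> float:
    """Software version of the flush-to-zero mode some hardware offers."""
    return 0.0 if is_subnormal(x) else x


def count_subnormal_steps(start: float = 1.0, factor: float = 0.5) -> int:
    """Decay start by factor until it hits 0.0; count subnormal values seen."""
    count = 0
    x = start
    while x != 0.0:
        x *= factor
        if is_subnormal(x):
            count += 1
    return count


if __name__ == "__main__":
    print("smallest normal double:", sys.float_info.min)
    print("half of that is subnormal:", is_subnormal(sys.float_info.min / 2))
    print("subnormal values in a halving decay:", count_subnormal_steps())
```

        A decaying reverb or filter tail passes through exactly this range on its way to silence, which is why audio code often flushes such values to zero or adds a tiny offset.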

        • inglor_cz 3 days ago
          Yeah, I studied mathematics (algebra and number theory) and zero is the point, often sporting discontinuities, or weird asymptotic behavior.

          Quite a lot of algorithms use some form of division, and zero is the only number in our typical structures (Z, Q, R, C) that cannot be used as a divisor.

          • edwcross 3 days ago
            In machine integer arithmetic, one must also beware of division by -1, which can turn MIN_INT into MIN_INT via signed overflow and violate some arithmetic invariants, such as sign (negative divided by negative is _usually_ positive).
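
            A sketch of that edge case, simulating 32-bit two's complement in Python (whose own integers never overflow). The function names are made up for illustration. Note the assumption: the wrapping result shown here is one possible behaviour, the one Rust's `wrapping_div` defines; in C the same division is undefined behaviour and on x86 it typically traps instead.

```python
from typing import Optional

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reduce an unbounded Python int to 32-bit two's complement."""
    return (value + 2**31) % 2**32 - 2**31


def div_int32_wrapping(a: int, b: int) -> int:
    """Truncating division (rounds toward zero) with a wrapped 32-bit result."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    return wrap_int32(quotient)


def div_int32_checked(a: int, b: int) -> Optional[int]:
    """Return None when the true quotient does not fit in 32 bits."""
    if b == 0:
        return None
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    return quotient if INT32_MIN <= quotient <= INT32_MAX else None


if __name__ == "__main__":
    # The true quotient is 2**31, one past INT32_MAX, so it wraps back to INT32_MIN.
    print(div_int32_wrapping(INT32_MIN, -1) == INT32_MIN)
    print(div_int32_checked(INT32_MIN, -1))
```

            With 32-bit operands, INT32_MIN divided by -1 is the only input pair whose quotient is out of range, so a checked division needs just that one test besides the zero divisor.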
          • isoprophlex 3 days ago
            Well, now in this brave new age of AI we can enjoy computer programs crashing with an

                Error: division by please upvote, share and like!
            • xyproto 3 days ago
              This also works; I upvoted your comment.
              • o1bf2k25n8g5 3 days ago
                I have discovered a truly marvelous proof of how to smash that like and subscribe button, which this comment box is too small to contain.
        • KeplerBoy 3 days ago
          Denormals are flushed to zero by default on most GPUs by the way.
          • rollcat 2 days ago
            Makes total sense, execution time is bounded. The point is it's still a case you must consider (what if near-zero is distinct from zero and significant?)
      • wahnfrieden 3 days ago
        whisper MUST be combined with silence detection / VAD
        • pferde 3 days ago
          Ah, the good old "you're holding it wrong".

          What good is a speech recognition tool that literally hears imaginary voices?

          • zettabomb 3 days ago
            Considering that if you DO use VAD (voice activity detection), it's the best open weights voice recognition model by a very wide margin, it's quite good. I'd be willing to bet that commercial products that "don't have this problem" are using VAD as well, and that this is well known to them. But Whisper is just the weights, and I suppose a simple reference implementation, not a full product.
          • bmacho 3 days ago
            > What good is a speech recognition tool that literally hears imaginary voices?

            Well, if it is supposed to work after silence detection, then it is good for speech recognition I guess. It's like blaming a wheel for being circular because you can't sit on it. It's a part of a larger machine.

            • dumbfounder 3 days ago
              Just lay the wheel on its side and it makes a fine seat.
          • nhecker 3 days ago
            >imaginary voices

            On the other hand, I can imagine that when things get quiet and the signal-to-noise ratio gets close to zero, random background audio (or randomness introduced in the transcription model) will be enough to tickle a critical number of neurons and elicit hallucinations.

            The related thought exercise is this: Try scanning across the band with an AM or sideband radio, and after a while your brain will start to wonder "was that a voice I just heard, or music perhaps?" when in reality it was just environmental static.

          • wahnfrieden 3 days ago
            Yes, you are holding it wrong. The good of it is that it does not output imaginary voices when used with VAD.

            Show us a technology with better results that does not use VAD. If you can’t, then I’m not sure what you’re arguing against except superficialities so inconsequential that I can’t comprehend the condescension. The results speak for themselves.

          • Xmd5a 3 days ago
            faster-whisper has a min_silence_duration_ms option
          • xandrius 3 days ago
            So if a tool has a process to have it perform at its best then it's a problem?

            Do you also moan that you need to prepare a surface before applying glue or it won't stick? Or that you need to drill a guiding hole before making a larger one in wood? Or that you need to use truly prime numbers for a security key to actually be safe?

        • DANmode 2 days ago
          What's a good starter VAD lib, and if you know, the best implementation of something like this to use in a browser-based app?

          Say if I wanted to use it for Voice Nav, or Voice Input, but not piss off random people speaking the wrong language.

        • cmiles74 3 days ago
          If that's truly the case then they should make it part of the product, IMHO.
          • wahnfrieden 3 days ago
            How is it not the case? It is unusable without VAD or editing. I don't understand what you're questioning

            I agree their products could be better "end to end" integrated. Meanwhile there is a continuously-improving field of work for detecting speech (which Whisper is incapable of). They offer official "cookbooks" with guidance on an approach they recommend: https://cookbook.openai.com/examples/whisper_processing_guid...

            > At times, files with long silences at the beginning can cause Whisper to transcribe the audio incorrectly. We'll use Pydub to detect and trim the silence.

            (Official OpenAI quote)

        • DANmode 3 days ago
          What's VAD?
          • maxbond 3 days ago
            Voice Activity Detection (it predicts whether a short clip contains speech, eg to mute your microphone when you aren't speaking).
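
            To make the idea concrete, here is a toy energy gate, which is the crudest possible stand-in for VAD. It is an assumption-laden sketch: real VADs are trained classifiers, while this only checks loudness and so cannot tell speech from music or noise. The function names, the 160-sample frame length and the 0.01 threshold are all arbitrary choices for illustration.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_flags(samples, frame_len=160, threshold=0.01):
    """Return one boolean per frame: True if the frame is loud enough."""
    flags = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_rms(frame) >= threshold)
    return flags


def drop_silence(samples, frame_len=160, threshold=0.01):
    """Keep only the frames flagged as active, e.g. before transcription."""
    kept = []
    flags = speech_flags(samples, frame_len, threshold)
    for index, is_active in enumerate(flags):
        if is_active:
            start = index * frame_len
            kept.extend(samples[start:start + frame_len])
    return kept


# Two frames of digital silence followed by two frames of a 440 Hz tone
# (sample rate assumed to be 16 kHz).
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
print(speech_flags(silence + tone))
print(len(drop_silence(silence + tone)))
```

            The point is only the pipeline shape: whatever decides "speech or not" runs first, and the transcription model never sees the silent frames.
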
    • xigoi 3 days ago
      Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?
      • madcaptenor 3 days ago
        I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
        • immibis 3 days ago
          I can. He was asking if Babbage was cheating.

          You put in 2+2 - the right figures. The machine says 4 - the right answer. If you put in the wrong figures, like 3+3, will the machine still say 4? It's easy to make a machine that always says 4.

          The people who asked him that question, however, probably got a different scam demonstrated to them every day. Remember the Mechanical Turk? Babbage's reply paints him very honestly. It shows that he couldn't even conceive that someone might try to trick the royal court (or whoever it was) into accepting a fake device.

        • Workaccount2 3 days ago
          Having had zero exposure to any form of computation for your entire life, like the vast majority of people in the early 19th century.
    • indrora 3 days ago
      When YouTube began building automatic transcriptions for captions, it regularly flagged any noise or music -- typically industrial noise -- with "[foreign]"

      If it couldn't understand it, it was "foreign" for the longest time.

      • the_af 3 days ago
        Hey, Netflix occasionally still puts in its English subtitles "[foreign music]", it always cracks me up.
        • 0x457 3 days ago
          [speaks japanese]

          To be fair, there is a difference between when subtitles match the source language and when they don't. Former are often verbatim.

          • the_af 3 days ago
            Haha, yes, it's fair when English subtitles write something like [speaks Japanese], especially when at least one of the characters is not supposed to understand what's being said (when they do, it's more appropriate to write "[in Japanese]: let's go shopping!").

            Netflix sometimes takes the cake with what I consider the most outrageous option: writing "[in English]" when they mean "in whatever language the protagonist considers native", which is mind-bogglingly wrong and hilarious at the same time.

            They do this with the English subtitles of the German production "Die Kaiserin" ("The Empress"): whenever Sisi is speaking in another language, say French, the subtitles will say "[in French] I love you...", and when she switches back to German they will say "[in English] I love you...". WTF, Netflix? Note this is unrelated to understanding German; it's mostly Netflix looking down on its customers and assuming they cannot comprehend there are people in the world for whom their native tongue is different to the viewer's native tongue.

            This has happened in more shows, enough to know it's not a fluke, though Netflix is inconsistent about it.

      • stndef 3 days ago
        Yeah, I can confirm seeing that a fair bit specifically during non-verbal parts of videos when someone is using a tool.
        • TurkTurkleton 3 days ago
          Can confirm as well, although to my recollection it just shows up as if it's a word the transcription model heard, not "[foreign]" in brackets like with "[Music]" or "[Applause]". It's especially weird to me because I recall the auto-transcriptions being reasonably serviceable when they first rolled them out, only to degrade over time to the point where it was hallucinating the word "foreign" and dropping letters from words or using weird abbreviations (like "koby" for "kilobyte", "TBTE" for "terabyte", or, most memorably weirdly, transcribing the phrase "nanosecond-by-nanosecond" as "nond by nanc") if it didn't decide it heard another one entirely.

          I also noticed a couple of months ago that YouTube seems to have quietly rolled out a new auto-transcription model that can make reasonable guesses at where capitalization, punctuation, and sentence boundaries should go. It seems to have degraded even more rapidly than the old one, falling victim to the same kinds of transcription errors. Although the new one has a different hallucination in silence and noise that it wasn't able to classify (which, incidentally, its ability to recognize things like music and applause seems worse than the old one's): where the old model would have hallucinated the word "foreign", the new one thinks it's hearing the word "heat", often repeated ("Heat. Heat.").

    • st_goliath 3 days ago
      That's interesting, the few times I tried playing with whisper, I had the impression that YouTube style videos or random cellphone videos were something it did particularly badly with (compared to movies). My guess at the time was that most of the training material might be subtitles and raw screenplays.

      The videos I tried to transcribe were also Mandarin Chinese, using whisper-large-v3. Besides the usual complaints that it would phonetically "mishear" things and generate nonsense, it was still surprisingly good, compared to other software I played around with.

      That said, it would often invent names for the speakers and prefix their lines, or randomly switch between simplified and traditional Chinese. For the videos I tested, intermittent silence would often result in repeating the last line several times, or occasionally, it would insert direction cues (in English for some reason). I've never seen credits or anything like that.

      In one video I transcribed, somebody had a cold and was sniffling. Whisper decided the person was crying (transcribed as "* crying *", a cough was turned into "* door closing *"). It then transcribed the next line as something quite unfriendly. It didn't do that anymore after I cut the sniffling out (but then the output switched back to traditional Chinese again).

    • mmcwilliams 3 days ago
      Similar in the English model. Pretty clear they trained on YouTube videos where creators will put that in otherwise silent sections to ensure it shows up for people with CC on.
      • probably_wrong 3 days ago
        The number one hallucination in my transcriptions was "Subtitles by the Amara.org community".
    • philipwhiuk 3 days ago
      > I suspect they trained the model on some random YouTube video without carefully picking really useful data.

      They trained the model on every YouTube video they could, and hoped the aggregate was useful data.

    • PhasmaFelis 3 days ago
      This reminds me, some years ago as Google was expanding its translation service, someone tried translating text into and out of an obscure African language (don't recall which) and it always came out as weird Biblical-sounding semi-gibberish.

      My revelation was that machine translation needs a corpus of bilingual documents to learn from, and if the language is sufficiently obscure, there may not be any bilingual documents except for the Bible, which missionaries have translated into just about every language on Earth.

    • danirod 3 days ago
      This is totally happening with other models too, at least with Spanish. Many transcriptions will end with something that roughly translates to "Thanks for watching!" even if it's never present in the original audio.
    • horseradish7k 3 days ago
      oh yeah this happens a lot on reddit on videos in foreign languages
    • tonyhart7 3 days ago
      lmao
  • dlcarrier 3 days ago
    Classic overfitting

    It's the LLM equivalent of thinking that an out-of-office reply is the translation: https://www.theguardian.com/theguardian/2008/nov/01/5

    • stingraycharles 3 days ago
      How is this overfitting, rather than a data quality / classification issue?
      • bGl2YW5j 3 days ago
        If the model was able to generalise, you’d expect it to output something like “[silence]” or “…”, in response to silence.

        Instead, it reverted to what it has seen before (in the training data), hence the overfit.

        • stingraycharles 3 days ago
          Right, maybe my definition of overfitting was wrong, I always understood it more as trying to optimize for a specific benchmark / use case, and then it starts failing in other areas.

          But the way you phrase it, “the model is not properly able to generalize”, i.e. it doesn’t understand the concept of silence, also makes sense.

          But couldn’t you then argue that any type of mistake / unknown could be explained as “overfitting” ? Where do you draw the line ?

          • RamblingCTO 2 days ago
            I don't think so. Overfitting = the model was too closely aligned to the training data and can't generalize towards *unseen* data. I think it saw "silence" before, so it's not overfitting but just garbage in, garbage out.
          • heavyset_go 3 days ago
            Your definition is one, but the one the OP is using is overfitting to training data.
            • stingraycharles 3 days ago
              That’s exactly my point: by that definition any incorrect answer can be explained by “overfitting to training data”.

              Where do you draw the line between “overfitting to training data” and “incorrect data” ?

              • tempaccount420 3 days ago
                > That’s exactly my point: by that definition any incorrect answer can be explained by “overfitting to training data”.

                Not really, getting 94381294*123=... wrong, but close to the actual answer, cannot be overfitting since it wasn't in the training data.

              • maxbond 3 days ago
                > [By] that definition any incorrect answer can be explained by “overfitting to training data”.

                No, it can't; for instance, some errors would be caused by underfitting. The data could also be correct but your hyperparameters (such as the learning rate or dropout rate) could cause your model to overfit.

                > Where do you draw the line between “overfitting to training data” and “incorrect data” ?

                There's no need to draw a line between two explanations that aren't mutually exclusive. They can (as in this case) both be true. Overfitting is the symptom; dirty data is the cause.

        • mywittyname 3 days ago
          I think it's a classification issue.

          Silence is never put in the subtitles of a film, since it isn't necessary. The viewers can tell that nothing is being said if there are actors on the screen. And in situations where there are no actors, then there will be a subtitle to indicate what is going on, like "[rock music plays]".

          Subtitle authors use this silence to fit in meta information and have done so since the closed captions era.

          Proper data cleaning procedures would be to strip this meta data from any subtitle sources. Since this wasn't done, this is fundamentally a classification issue. It may also be an over-fitting issue, but that is secondary to the classification problem.

        • xg15 2 days ago
          I think it's a data quality problem first, which might lead to a sort of overfitting as a consequence.

          How would the AI know that a series of zero-amplitude audio samples should generate the string "[silence]"?

          It can only know that if the vast majority of silent audio segments in the training set are consistently labelled with that string. But that doesn't seem to be the case: silence is either not labeled at all, or labeled with all kinds of different markers, or labeled with unrelated things, like copyright credits.

          So even if the model successfully learns a generalized representation of the concept of "silence", it's not clear at all which of all the different labels it should use for that concept.

          So what might happen is that the model then starts to overfit on the tiny variations of the individual silence segments, in a desperate attempt to devise some kind of system behind all the different "silence" labels - which will of course go wrong spectacularly as such a system doesn't exist. (Or if it does, it is entirely accidental and not something that should be learned.)

        • alienbaby 3 days ago
          It's actually because it is incapable of recognising when it does not know the answer. It will give you the nearest match, even if that is completely incorrect.
      • hsn915 3 days ago
        The Arabic text is the translator's self credit

        "Translated by Nancy Qanfar"

        • efitz 3 days ago
          I know it’s off topic, but it reminded me that translators like to put in Easter eggs, or at least they used to: https://learn.microsoft.com/en-us/archive/blogs/ericfitz/i-a...
        • wongarsu 3 days ago
          And the German is “subtitles of [public broadcaster] for [content network], 2017”

          I'm not sure this is really overfitting, the network does exactly what the training data demands. According to the training data, silence at the end transcribes to a copyright notice or subtitle credits.

          • baobabKoodaa 3 days ago
            > I'm not sure this is really overfitting, the network does exactly what the training data demands.

            What do you think overfitting is, if not that?

            • wongarsu 3 days ago
              Overfitting would be replicating overly specific details. Like if a specific pattern of silence (or quiet noise) matched to specific copyright notices.

              But in this case the behavior seems to generalize over multiple languages, with the model choosing representative "outro silence" captions depending on the language. Which is consistent with the training data showing that outro silence is captioned.

              If the model was generalizing perfectly it would show something like "[subtitle credits here]" but that'd be demanding a bit much.

              Transcribing outro silence as silence despite the training data consistently transcribing outro silence differently from regular silence would be underfitting

              • maxbond 3 days ago
                The optimizer is functioning correctly, and the pattern really exists in the training data. But consider:

                - This behavior damages the model's performance on out of sample data; every word you predict during silence increases the transcript's Word Error Rate.

                - These translation credits are an artifact of our training data, and not a reflection of the process we are modeling (spoken language).

                So, while you are correct about the mechanism at work here, it is still correct to call learning a spurious pattern which damages our performance "overfitting".
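
                A small sketch of the first bullet, using the usual definition of Word Error Rate as word-level edit distance divided by the number of reference words. The function name and the example strings are made up; this is not the scoring code of any particular benchmark.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # With an empty reference (pure silence) the ratio is undefined.
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


spoken = "hello world"
clean = "hello world"
hallucinated = "hello world thanks for watching"
print(word_error_rate(spoken, clean))
print(word_error_rate(spoken, hallucinated))
```

                Every hallucinated word counts as an insertion, so a short clip followed by trailing silence can end up with a WER above 100%.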

            • bmacho 3 days ago
              Overfitting is achieving better and better scores on the training material and worse and worse scores on unseen tasks. More at: https://en.wikipedia.org/wiki/Overfitting#Machine_learning

              This is just wrong training data.
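
              As a toy illustration of that definition (better on the training material, worse on unseen inputs), here is a made-up dataset where the true rule is "x >= 5" and two training labels are deliberately wrong. The data, the 1-nearest-neighbour "memorizer" and the one-parameter threshold model are all assumptions chosen to make the effect visible; none of it comes from Whisper.

```python
# True rule: label is (x >= 5). The labels for x=2 and x=7 are noise.
TRAIN = [(0, False), (1, False), (2, True), (3, False), (4, False),
         (5, True), (6, True), (7, False), (8, True), (9, True)]
# Unseen points, labelled with the true rule.
TEST = [(0.5, False), (1.9, False), (2.1, False), (4.0, False),
        (6.9, True), (7.1, True), (9.5, True)]


def memorizer(x):
    """1-nearest-neighbour: reproduces the training set, noise included."""
    _, label = min(TRAIN, key=lambda pair: abs(pair[0] - x))
    return label


def fit_threshold(data):
    """Pick the integer threshold with the fewest training errors."""
    def errors(t):
        return sum((x >= t) != label for x, label in data)
    return min(range(0, 11), key=errors)


THRESHOLD = fit_threshold(TRAIN)


def threshold_model(x):
    """Simple model that cannot represent the two noisy labels."""
    return x >= THRESHOLD


def accuracy(model, data):
    return sum(model(x) == label for x, label in data) / len(data)


for name, model in [("memorizer", memorizer), ("threshold", threshold_model)]:
    print(name, accuracy(model, TRAIN), accuracy(model, TEST))
```

              The memorizer scores perfectly on the training set and poorly on the unseen points, while the simpler model does the reverse. Whether the Whisper behaviour is the same phenomenon or just bad labels is exactly what this subthread is arguing about.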

          • samrus 3 days ago
            fitting on noise in the training data is exactly what overfitting is. underfitting is smoothing out signal
            • xigoi 3 days ago
              Overfitting implies a failure to properly generalize the training data. Here it generalized them correctly. Garbage in, garbage out.
              • samrus 3 days ago
                No. Because there would have been instances in the data where silence was labelled correctly. But the model couldn't handle the null case, so it overfit on the outros. More generally, it fit on the random error in the label of the null feature, which is what overfitting is.
            • wongarsu 3 days ago
              Exactly. Underfitting would be if the model doesn't pick up on the fact that outro silence is labeled differently from regular silence and transcribes them the same
          • andrepd 3 days ago
            That's literally what overfitting means.

            Side-note: it's also yet more evidence that AI companies hoover up all data with no regard for legality or copyright status, the very same offences that landed other people in jail or with heavy fines.

      • mort96 3 days ago
        Isn't overfitting just when the model picks up on an unintended pattern in the training data? Isn't that precisely what this is?
        • RamblingCTO 2 days ago
          not necessarily, no. if you have 60% of examples for silence being the hallucination, it just learns the (what you detect as) wrong connection.
          • mort96 1 day ago
            Which ... would be overfitting. It picks up on a pattern in the training data that we don't want it to pick up on and which causes it to generalize poorly.
            • RamblingCTO 4 hours ago
              How is it overfitting if the data is garbage in the first place? Saying it's overfitting in this context has no meaning as there is no alternative that maximizes the utility function we're training for?
      • maxbond 3 days ago
        It is a data quality issue which caused the model to overfit.
    • RamblingCTO 2 days ago
      As I didn't see one correct definition of overfitting:

      overfitting means that the model is too closely aligned to the training data, picked up noise and does not generalize well to *new, unseen* data. think students that learn to reproduce questions and their answers for a test instead of learning concepts and to transfer knowledge to new questions that include the same concepts.

      while this sounds like overfitting, I'd just say it's garbage in, garbage out; wrong classification. the training data is shit and didn't have (enough) correct examples to learn from.

  • sivers 3 days ago
    to save you a lookup:

    The Arabic text "ترجمة نانسي قنقر" translates to English as: "Nancy Qanqar's translation" or "Translation by Nancy Qanqar"

    "ترجمة" means "translation" and "نانسي قنقر" is the name "Nancy Qanqar"

    • mormegil 3 days ago
      In Czech, Whisper usually transcribes music as "Titulky vytvořil JohnyX" ("subtitles made by JohnyX") for the same reason.
      • actionfromafar 3 days ago
        Haha, trained on torrented movies! :-D

        The MPA must be so proud.

        • Incipient 3 days ago
          It's absolutely insane that these companies can't be held liable for what is obvious piracy.
          • jdiff 3 days ago
            That's the magic of money. Download your favorite artist's discography for personal use? If the MPAA had its way (and it occasionally has), torrenting that could bankrupt you.

            The AI industry - soaking up every bit of media available online for commercial purposes, often reproducing it nearly identically - has enough money and capital to influence things its way. And only its way, in case anyone was hoping this might change anything at all for the little guy.

            • d1sxeyes 3 days ago
              > Download your favorite artist's discography for personal use? If the MPAA had its way (and it occasionally has), torrenting that could bankrupt you.

              I don't think that there are any clear examples of cases where ONLY downloading has resulted in huge fines. All the big bankrupting level fines have been for both downloading and sharing.

              You mention that 'torrenting' could bankrupt you, and that is true, but the main reason for the huge fines are that you are taking part in distribution rather than just 'downloading for personal use'.

              • 0points 3 days ago
                > I don't think that there are any clear examples of cases where ONLY downloading has resulted in huge fines.

                They [1, and others] have been hunting and fining downloaders for over a decade now, with the only "evidence" being IP addresses connected with the torrent [2].

                1: https://www.njordlaw.com/filesharing-and-downloading-films/q...

                2: https://admin.ovpn.com/en/blog/online-integrity-new-threats-...

                • gruez 3 days ago
                  >with the only "evidence" being IP addresses connected with the torrent [2].

                  Is that an unreasonable assumption? As much as people like to come up with excuses like "I had open wifi!" or "I was running a TOR node", judges don't seem inclined to believe them, probably for the same reason they don't seem inclined to believe excuses like "somebody took my car on a joyride and then returned it!" for parking tickets. Remember, both non-commercial copyright infringement lawsuits and parking tickets are tried in civil court, which means the standard is "preponderance of evidence", not "beyond reasonable doubt".

                  • actionfromafar 3 days ago
                    DHCP addresses often shuffle on reboots. I don't trust ISPs to keep completely accurate records or give them out in a correct manner if they do.
                    • gruez 3 days ago
                      >I don't trust ISPs to keep completely accurate records or give them out in a correct manner if they do.

                      How hard could it be to keep DHCP logs? Assuming they exist at all, what would cause it to be incorrect?

                      • d1sxeyes 3 days ago
                        I'm sure they exist. I think the point is more that you shouldn't need to trust your ISP's record-keeping to avoid life-alteringly big fines.
                  • 0points 3 days ago
                    You are missing the point I was replying to, specifically that parent suggested people were only hunted for creating/uploading pirated content, not merely participating in the torrent.
                    • gruez 3 days ago
                      >specifically that parent suggested people were only hunted for creating/uploading pirated content, not merely participating in the torrent.

                      For all intents and purposes, participating in the torrent almost guarantees that you seeded, because all torrent clients upload as you download.

                      • 0points 3 days ago
                        These are two separate things:

                        * Making content available for unauthorized distribution

                        * Distributing unauthorized content that someone else already made available

                        Seeding isn't making content available, it's keeping content available.

                        • d1sxeyes 3 days ago
                          That’s a really interesting distinction. Clearly there’s an “original crime”, the first person to rip the CD and put it online (or whatever kids do to pirate music nowadays).

                          But then if I download a file, create a copy, and share it with you, have I done anything wrong?

                          To all intents and purposes, seeding is an act of reproduction. You, while keeping your copy, create copies of (parts of) the file and share them with someone else to allow them to assemble a new, second copy.

                          Whether this is, or should be, a crime is a different question altogether. The main point I was making is that it’s the copying/sharing to other people which seems to be a crucial element in these prosecutions.

                          That’s likely intentional: the last thing the *AA folks want is a decision that creating a copy of a copyrighted work for your own personal use is not a crime. But it does seem the courts have decided: making a copy for someone else is indeed illegal.

                        • gruez 3 days ago
                          But both are illegal? I suspect if it came out that some torrent seeder was actually part of some sort of piracy ring responsible for ripping the movies, they'd get far stiffer penalties than the few thousand $ fine that typical torrenters get. Moreover isn't AI companies also "keeping content available"?
                          • 0points 3 days ago
                            Both are illegal, yes.

                            That still doesn't make them the same thing. There are different shades of grey, etc.

                            > Moreover isn't AI companies also "keeping content available"?

                            I don't know what you mean by that.

                            • gruez 3 days ago
                              >I don't know what you mean by that.

                              The whole point of the thread is that AI companies are getting away with piracy but individuals aren't. But the reality is that AI companies aren't getting away with it (a judge ruled that Anthropic must face trial over their use of pirated books).

                              More specific to this thread is the claim that "ONLY downloading" hasn't resulted in fines for anyone. So far as I can tell, this is true. People are just quibbling over how someone who's torrenting somehow counts as "only downloading", even though their client is uploading.

                • d1sxeyes 3 days ago
                  Yes, but torrenting is not ONLY downloading, it's both. The articles you link are very clearly talking about 'Sharing' (from link 2: "File sharing consists of both download and upload of a file.").
                  • 0points 3 days ago
                    Yes, that's lawyer speak to make clients/victims believe there is no distinction.

                    Hint: there is a distinction.

                    • d1sxeyes 3 days ago
                      There is indeed, but not when you’re torrenting (i.e. you can’t download without also uploading).
                      • 0points 3 days ago
                        Even when you are torrenting, there is a clear distinction of the different roles.

                        Copying from another comment I wrote here:

                        > These are two separate things:

                        > * Making content available for unauthorized distribution

                        > * Distributing unauthorized content that someone else already made available

                        > Seeding isn't making content available, it's keeping content available.

                        • d1sxeyes 3 days ago
                          Replied to your other comment (sorry, didn’t clock that we had two threads ongoing)
              • jdiff 3 days ago
                Given the lack of sense in treating each peer as a lost sale for damages, I think we can safely say they're only interested in making examples out of people and would absolutely go after people for only downloading if the law permitted. Thankfully it doesn't, but maybe they'll lobby for changes in that direction to try and curb future AI industry shenanigans.
              • jajko 3 days ago
                You contradict yourself. There were numerous public cases where they chased people downloading few mp3s just for themselves, and made them into example cases with massive fines.

                If you don't understand how torrents work on technical level I suggest at least some shallow reading. Property rights holders don't care about details, as long as you tick the box of sending a single packet to somebody, off to court with ya.

                • d1sxeyes 3 days ago
                  > There were numerous public cases where they chased people downloading few mp3s just for themselves

                  If this is true, I have been unable to find any. Can you please share? In all of the cases I was able to find, the huge fines were based on also uploading.

                  > If you don't understand how torrents work on technical level I suggest at least some shallow reading

                  This is a bit patronising, and I'm not sure what point you're trying to make. My point is that the only prosecutions I've been able to find are where they were able to prove uploading as well as downloading (and yes, the fact that someone used BitTorrent makes it a slam-dunk, because the protocol makes it impossible to download without also uploading). Are you trying to argue that someone who torrents a copyrighted work doesn't also share it?

            • shadowgovt 3 days ago
              It's more the magic of precedent.

              The fights about digitized media for personal (entertainment / informational) use were in the early aughts. The precedents crafted then don't immediately translate to these cases (novel transformative work from protected materials), and the new precedents have to account for the fact that universities have been training via "piracy" for ages.

              (The magic of money factors in to the extent that they can afford the lawyers to remind the court that this isn't settled law yet).

            • 1718627440 3 days ago
              The movie industry also has some money and lobbying power. Surely this is a way larger threat than any single torrenter could ever be?
              • jdiff 3 days ago
                The fact that this is propping up the entire AI industry adds additional weight. When legislating or deciding court cases, some won't be willing to pop the cash cow, some will be worried about falling behind countries that don't enforce copyright evenly. IP owners are trying to go after the AI industry, with only mixed to poor success.
                • verzali 3 days ago
                  Hard to justify that they can't afford to pay when they have multi-billion dollar valuations and are apparently paying hundreds of millions to get a single engineer.
                  • jdiff 3 days ago
                    Maybe. But we are talking about the whole of copyrighted creative works created and sold by humanity. That'll get expensive no matter who you are.
            • anon191928 3 days ago
              court judges agree to this
          • pavon 3 days ago
            Anthropic is going to trial over pirating books for training. The judge was pretty clear that even if training is fair use, the training material must be obtained legally.

            These regurgitations combined with proof that a model is familiar with a work could be sufficient evidence to force discovery to determine if the work was pirated.

          • scotty79 3 days ago
            What's insane is copyright. How come you can own intellectual property but not pay a property tax? The ecosystem would be much healthier if to get copyright protections you should declare value of your IP (that you are obligated to sell for if the buyer pops up) and pay tax on this for every year you hold the IP.
            • retsibsi 3 days ago
              > if to get copyright protections you should declare value of your IP (that you are obligated to sell for if the buyer pops up) and pay tax on this for every year you hold the IP

              I think this would have some unpalatable consequences. Let's say an author is writing a modestly successful book series: it's not going to make them rich, but it's commercially viable and they care a lot about it for its own sake. Under this system, if the author declares a value commensurate with the (quite small) pure economic value of the IP, they have to live in fear of their right to continue working on their creation being abruptly taken away from them at any point. If they instead declare a value commensurate with the economic value + the extra value that it has to them personally, the resulting tax liability could easily tip the balance and destroy their ability to pursue their writing as a career.

              • scotty79 3 days ago
                You are always free to update the value before paying tax. If somebody is willing to pay more than it's worth to you they probably have an idea how to turn it into more economic value for the society. So the society should allow them to do that, for the price of the tax. What I'm proposing is about the financial rights. Individual rights, like the right to call yourself the author of any given creation, should be inalienable.

                There are always some cases on the edge. The question is if saving them is worth the cost of the major players running rampant.

              • immibis 2 days ago
                Indeed, this is a general problem with a lot of these schemes.

                We shouldn't abandon the line of investigation, however. We should continue thinking of ways to do this until we find one that works well.

                There's a chance it ends up being something that requires a judge to interpret each individual case...

              • Workaccount2 3 days ago
                Perhaps the tax would start a decade after the first sale.
            • gruez 3 days ago
              >What's insane is copyright. How come you can own intellectual property but not pay a property tax?

              Most jurisdictions that have "property tax" only apply it on certain types of property, most commonly real estate. So it's not that weird that IP isn't taxed.

            • thunderfork 3 days ago
              Can you imagine if we evaluated property taxes this way? Yeah, nice single family home, better hope nobody offers you the same amount you paid for it or it's back to apartment living for you and your kids.
              • scotty79 1 day ago
                If you live there there should be some protections. But when it comes to rentals or vacation homes I think those rules could be great as well.
          • boredhedgehog 3 days ago
            It's an indication of how few people consider license infringements as a matter of actual moral import. Those tend to evoke strong feelings.
          • ACCount36 3 days ago
            It's the way it should be.
          • codedokode 3 days ago
            This is corsairy, not piracy, do not be mistaken.
    • aprilthird2021 3 days ago
      And it seems to be because the training data is largely unofficial subtitles from movies. Which often have a string like "Translated by X" at the end of the movie which is often silent while credits roll.
      • rob74 3 days ago
        Looks like they used more official sources for German - there, silence is apparently hallucinated as "Untertitelung des ZDF für funk, 2017" according to one of the comments on the issue. Which makes sense, as the public broadcasters' "Mediathek" is probably the largest freely available resource of subtitled videos in Germany. I wonder if the ZDF gave its approval for it being used for LLM training though?
        • MrGilbert 3 days ago
          > I wonder if the ZDF gave its approval for it being used for LLM training though?

          I am pretty sure they didn't get asked.

          • h784gljf 3 days ago
            Just like the people forced to pay for ZDF under threat of imprisonment.
            • wongarsu 3 days ago
              I'm being made to pay for Autobahnen I barely use, finance kindergartens despite not having a child, and made to pay into public pensions with little hope of getting close to the same value out. All under threat of imprisonment, many without a way to even refuse (not that I'd want to). The only thing that sets the public broadcasting fee apart is that it's collected separately from taxes in an attempt to reduce the influence politicians have on broadcasters.
            • eclecticfrank 3 days ago
              This person refers to the German television and radio fee (Rundfunkgebühren).[1] It is a state-mandated system that ensures free (as in free speech) and (relatively) neutral public broadcasting institutions. There is a constant and engaged discussion, because every household in Germany has to pay this fee. Exceptions are made only for low-income households.

              [1] https://en.wikipedia.org/wiki/ARD_ZDF_Deutschlandradio_Beitr...

              • rob74 3 days ago
                A constant discussion, lately fueled by extremist parties (AfD) who feel treated unfairly by (amongst others) the public broadcasters (which has parallels to Trump's recent campaign against public broadcasters in the US).
                • blueflow 3 days ago
                  Can't argue with them - Tagesschau always has been trashtalking people with the wrong opinion.

                  Back in 2011, Tagesschau openly rallied against Muslims and wanting public broadcasting gone was a leftist position. The whole thing is completely asinine to anyone who remembers.

            • throwaway290 3 days ago
              So, people are made to pay for it, and it makes it fair if billion USD corporations don't?
            • TheBicPen 3 days ago
              Just like any other public service paid for with public funds?
              • h43z 3 days ago
                Oh it's like any other. Then just add another one!
        • unusual-name 3 days ago
          Most content from Funk (YouTubers funded by public German broadcasters) is available on YouTube without any geoblocking or other limitations.
          • rob74 3 days ago
            Ah, ok, thanks for the info, TIL! "We are funk – the first public service content network that started on October 1, 2016. We create online-only content on social networks and third-party platforms, including YouTube, Instagram, Snapchat, TikTok, Spotify, Apple Music or Twitch for 14-29 year-olds." (https://presse.funk.net/das-ist-funk/, scroll down for the English version). I live in Germany, and I even watch public broadcasters regularly, but this is the first time I have heard about funk (I even initially thought it was misspelled, usually it's written with a capital F). But I'm not part of the targeted audience (not now, nor even back in 2016 when it was launched), so all good...
          • layer8 3 days ago
            I’m pretty sure that content doesn’t come with a license granting unlimited usage rights.
          • darkwater 3 days ago
            from the link[1] another user posted:

            > We have a public service mandate, which means that we have very clear responsibilities according to the state media treaty. For us, this means that our top priority is actually reaching our target audience, namely approximately 15 million people living in Germany between the age of 14 and 29 who have internet access

            It's not a binding contract for sure but I don't think that OpenAI or other AI scraper is their target.

            [1] https://presse.funk.net/das-ist-funk/

        • Zacharias030 3 days ago
          definitely not! The media platform of the German public television networks is even geoblocking anyone outside of Germany.

          https://www.ardmediathek.de/

        • bigiain 3 days ago
          A more appropriate output might be ``4'33" -- John Cage, 1952``
        • aprilthird2021 3 days ago
          > I wonder if the ZDF gave its approval for it being used for LLM training though?

          Obviously a rhetorical question. The AI grifters of this decade take what they want and laugh at your pitiful future

      • 4gotunameagain 3 days ago
        I'm sure they totally did not pirate the audio of said movies.
      • iqfareez 3 days ago
        Makes sense.
    • beshrkayali 3 days ago
      You've got a little typo, it's not "رجمة", it's "ترجمة" that means translation, the ت at the beginning is missing.
  • nottorp 3 days ago
    Title should be changed to "OpenAI publishes evidence they trained on pirated movies".
    • pjc50 3 days ago
      Of course. Piracy is legal when you have a bigger pile of money than the studios.
      • codedokode 3 days ago
        Let's not forget that some of real pirates (for example, corsairs) also were legal and performed legitimate pirate activities to ships of foreign countries.
        • gruez 3 days ago
          >Let's not forget that some of real pirates (for example, corsairs) also were legal and performed legitimate pirate activities to ships of foreign countries.

          In other words there are activities that are legal or not depending on whether you have authorization from the state. That describes many things. For instance, if you synthesize meth without a license from the DEA/FDA, you're a "drug cartel" or whatever. But if you do it with a license you're a "pharmaceutical company", and you're not making "meth", you're making "desoxyn".

          • codedokode 3 days ago
            Synthesizing chemical substances doesn't involve murdering people though.
            • gruez 3 days ago
              Families of fentanyl overdose victims would disagree. Moreover it's not hard to find examples of "legal if the government authorizes it" for killings. Cops and soldiers, for instance.
              • immibis 2 days ago
                It takes quite a leap of logic to blame someone ingesting a toxic quantity of a substance on the person who manufactured the substance. When someone drinks bleach do we blame the company that makes the bleach?
              • codedokode 3 days ago
                It is a bad comparison because a substance vendor doesn't kill anyone, just as a gun store doesn't. Soldiers are a better analogy though.
        • jowea 3 days ago
          I suspect privateers would have been offended at being called pirates, but is this what is going on? If it's specifically a Chinese AI company pirating Hollywood for example, sure, but it seems it's more of an everyone-firing-at-everyone situation.
      • onlyrealcuzzo 3 days ago
        Isn't Piracy legal in many parts of the world?

        Legally, why wouldn't they be able to do the piracy parts in one of those jurisdictions and then ship the outputs back to the mothership?

      • foogazi 2 days ago
        Too big to nail
    • berkes 3 days ago
      How is this evidence of that fact? Honest question.

      I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used. But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?

      • 0points 3 days ago
        > I can see how this might show that subtitles from online sub communities are used, or that maybe even original subtitles from e.g. DVDs are used.

        Indeed, the captioning is copyrighted work and you are not legally allowed to copy and redistribute it.

        > But isn't it already known and admitted (and allowed?)

        No, and I don't see where you got that from. Meta [1], OpenAI [2] and everybody else are being sued as we speak.

        1: https://petapixel.com/2025/01/10/lawsuit-alleges-mark-zucker...

        2: https://www.reuters.com/legal/litigation/openai-hit-with-new...

        • skeezyboy 3 days ago
          > Indeed, the captioning is copyrighted work and you are not legally allowed to copy and redistribute it.

          Unless you qualify for one of the many exceptions, such as fair use.
          • kranke155 3 days ago
            It’s not clear that training is fair use. That’s being contested in court I think.
            • whamlastxmas 3 days ago
              Training isn’t recreating or distributing so copyright won’t apply if the ruling is actually consistent with the intention of the law, which it may not.

              Using copyrighted materials and then meaningfully transforming them isn’t infringement. LLMs only recreate original work in the same way I did when I wrote the first sentence of this paragraph, because it probably exists word for word somewhere else too.

              • kranke155 6 hours ago
                That's your interpretation, not the law.
        • lcnPylGDnU4H9OF 3 days ago
          > I don't see where you got that from

          It’s been determined by the judge in the Meta case that training on the material is fair use. The suit in that case is ongoing to determine the extent of the copyright damages from downloading the material. I would not be surprised if there is an appeal to the fair use ruling but that hasn’t happened yet, as far as I know. Just saying that there is good reason for them to think it’s been allowed because it kind of has; that can be reversed but it happened.

          • 0points 3 days ago
            That was specifically involving 13 authors.

            There haven't been any trials yet about the millions of copyrighted books, movies and other content they evidently used.

            • lcnPylGDnU4H9OF 3 days ago
              There's no reason to think those cases will go any differently. As far as I know, the ruling would have to be appealed at this point. I am only commenting to say that there is reason to think this is true:

              > But isn't it already known and admitted (and allowed?)

              You seemed to be confused about why this person believed that:

              > No, and I don't see where you got that from.

              And I wrote a comment intended to dispel your confusion. The above commenter thought that it was allowed because a judge said it was allowed; that can be appealed but that's the reason someone thinks it's allowed.

              • dragonwriter 3 days ago
                > There's no reason to think those cases will go any differently. As far as I know, the ruling would have to be appealed at this point.

                Trial court rulings aren't binding precedent even on the same court in different cases, so it's quite possible that different cases at the trial level can reach different conclusions on fair use on fairly similar facts, given the lack of appellate precedent directly on point with AI training.

              • 0points 3 days ago
                Yea, no. I don't think I am confused.

                A single verdict about a specific case (13 authors vs META) does not mean it's legal for companies to steal IP from other companies which has evidently been going on for some years now.

                Those other companies have lawyers powerful enough to change jurisdiction in many countries in order to "protect their IP".

      • nemomarx 3 days ago
        The Chinese subtitles for silence use a common mark for pirated media in that language, according to other commenters here. In general it's pretty likely that if you're finding non-professional subtitles, they were distributed with pirated media in some form; that's where you get the most fansubs, after all.
        • berkes 10 hours ago
          > were distributed with pirated media in some form,

          I disagree with this conclusion. I've used e.g. the opensubtitles dataset for some data analysis in the past. It's a huge dataset, freely available and precisely intended for such use. Now, whether all the data in the opensubtitles dataset is legal is another point.

          So one might argue that using this opensubtitles dataset makes one complicit in the illegal activities of opensubtitles themselves. IDK: IANAL.

      • jcranmer 3 days ago
        > How is this evidence of that fact?

        The contention is that the specific translated text appears largely from illegal translations (i.e., fansubs) and not from authorized translations. And from a legal perspective, that would basically mean there's no way they could legally have appropriated that material.

        > But isn't it already known and admitted (and allowed?) that AI uses all sorts of copyrighted material to train models?

        Technically, everything is copyrighted. But your question is really about permission. Some of the known corpuses for AI training include known pirate materials (e.g., libgen), but it's not known whether the AI companies are filtering out those materials from training. There's a large clutch of cases ongoing right now about whether AI training is fair use, and the ones that have resolved at this point have done so on technical grounds rather than answering the question at stake.

    • Hnrobert42 3 days ago
      HN is pretty strict about not editorializing titles. Even if your statement was unequivocally correct, the post would get flagged.
    • aaron695 3 days ago
      [dead]
  • dandiep 3 days ago
    Whisper is unusable IMO because of the hallucinations. Widely documented. Removing silence from audio clips helps, but even then it will auto-correct grammar, translate bilingual speech, etc. Improved in the latest audio models but not solved [1]

    1. https://news.ycombinator.com/item?id=43427376

    • ilyakaminsky 3 days ago
      I wouldn't describe it as "unusable" so much as needing to understand its constraints and how to work around them. I built a business on top of Whisper [1] and one of the early key insights was to implement a good voice activity detection (VAD) model in order to reduce Whisper's hallucinations on silence.

      [1] https://speechischeap.com

      • poly2it 3 days ago
        How does this make a profit? Whisper should be $0.006 to $0.010 per minute, but you rate less than $0.001? Do you 10x the audio?
        • ilyakaminsky 3 days ago
          Thanks for noticing. It took a lot of effort to optimize the pipeline every step of the way. VAD, inference server, hardware optimization, etc. But nothing that would compromise on quality. The audio is currently transcribed at its original speed. I'll be sure to publish something if I manage to speed it up without incurring any losses to the WER.
    • eric-burel 3 days ago
      That's the problem with raw large models: they should always be coupled with small satellite models and logic. It's (probably) easier to detect hallucinations using a traditional ML/DL model that can catch mismatches (it's easy to build a synthetic dataset for this) than it is to transcribe. And the simplest piece of code can detect silence and check that it matches no text.
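
A minimal sketch of that last point, assuming audio arrives as chunks of float samples in [-1.0, 1.0] paired with whatever text the model produced for them. The `(samples, text)` shape, the function names and the 0.01 threshold are illustrative assumptions, not part of Whisper's API, and a plain energy check is a much cruder stand-in for a real VAD model:

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(samples, threshold=0.01):
    """Treat a chunk as silence when its RMS level is below the threshold.

    The threshold is an arbitrary illustrative value; real recordings
    would need it tuned against their noise floor.
    """
    return rms(samples) < threshold


def filter_segments(segments, threshold=0.01):
    """Blank out any text attached to a silent chunk.

    `segments` is a list of (samples, text) pairs; this layout is
    hypothetical and chosen only to keep the example self-contained.
    """
    return [
        (samples, "" if is_silent(samples, threshold) else text)
        for samples, text in segments
    ]
```

Energy alone will misfire on background noise, ringing or music, which is presumably why a trained VAD model is the more robust choice in practice.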
    • horseradish7k 3 days ago
      well, auto correcting grammar happens in normal subtitles too... "Why don't subtitles match dubbing?" by Tom Scott: https://youtu.be/pU9sHwNKc2c
  • Hobadee 3 days ago
    Little did you all know, this is just being mechanical turked by Nancy Qunqar.

    Way to go Nancy! Keep up the good work, ya crazy bastard!

    • whamlastxmas 3 days ago
      Is this spam? That name only shows as an instagram account and this thread. If you pay for insta followers is this how they get them now? Haha
  • haiku2077 3 days ago
    I've noticed this also happens in English Whisper models with the phrases:

    "[ sub by sk cn2 ]"

    or

    "Anyways, thanks for watching! Please subscribe and like! Thanks for watching! Bye!"

    or

    "This is the end of the video. Thank you for watching. If you enjoyed this video, please subscribe to the channel. Thank you."

    • OSDeveloper 3 days ago
      Because they train on pirated media and/or YouTube videos. Good method, until you get slop or get caught.
  • flexagoon 3 days ago
    In Russian it often hallucinates "Субтитры сделал DimaTorzok" ("Subtitles by DimaTorzok") at the end of things. Interestingly, I wasn't able to find any YouTube videos with that name in the subtitles, so it's not like it's in a lot of training data.
    • codedokode 3 days ago
      I tried googling this and found questions from Telegram users asking why voice message recognition sometimes produces this phrase and who this person is. Also I found this thread [1] claiming that the subtitles by DimaTorzok are coming from some Russian YouTube videos on gaming like [2].

      [1] https://github.com/openai/whisper/discussions/2372

      [2] https://www.youtube.com/watch?v=FAqyUuahMlc&t=401s

      • flexagoon 3 days ago
        Yeah, I know about this from Telegram, because they use Whisper for voice message recognition. There are a bunch of other artifacts it often produces.
    • berkes 3 days ago
      Could it be someone distributing subs online, e.g. showing up in the opensubtitles.org dataset?
      • voidUpdate 3 days ago
        Or possibly someone subtitling pirated movies? That seems to be a common thing according to other comments
  • arnejenssen 3 days ago
    • io84 3 days ago
      "In the future, everyone will be world-famous for 15 minutes" _in a microniche techno-linguistic community, at a time and choosing of the swirling AI clouds_
    • herdandcurdle 1 day ago
      TL;DR: Whisper occasionally hallucinates and credits “Nicolai Winther” at the ends of Norwegian transcriptions during silent audio segments, likely because the real Nicolai Winther, a former YouTuber who created subtitles, appears frequently in its (likely YouTube‑based) training data. This highlights how limited Norwegian training data (only ~266 hours) can cause the model to overfit on specific names and phrases when uncertain.
  • Lucasoato 3 days ago
    In Italian as well there are random hallucinations when parsing silence, something like: “Thank you for watching”, “Subtitles by…”
    • userbinator 3 days ago
      I wouldn't be surprised if "like share and subscribe" also shows up at some point.
      • madcaptenor 3 days ago
        In the comments to that Github issue, by alentodorov:

        in romanian, i’ve noticed multiple instances where the transcripts ends with “nu uitati sa da-ti like si subscribe” which, as you might easily infer , translates to “don’t forget to like and subscribe”.

      • Muromec 3 days ago
        "наша зброя в цей момент -- вподобайка і комент"
  • vanschelven 3 days ago
    I wonder if hallucinated copyright claims (esp. like the ZDF one at the bottom of the OP) will be introduced as evidence in one of the court cases against "big AI"
    • berkes 3 days ago
      Evidence against what?

      "Big AI" is transparent and open about the fact they use all sorts of copyrighted material to train their models. How would "we see an exact chunk of text from our copyrighted material" add to that?

      • pbmonster 3 days ago
        It appears they have not been training on the official studio subtitle files, but on community transcriptions/translations commonly distributed with torrents.

        So not only are they training on copyrighted material, but they didn't even pay for it once, and then they didn't even do minimal data cleaning before training. Which, by the way, is the type of cleaning their LLMs could have done.

        • berkes 10 hours ago
          > commonly distributed with torrents

          This is the key part. And it's not certain this happened. Not defending AI data gobbling, but if we truly and honestly want to fight big-AI use of content, we cannot just presume bad faith. OpenSubtitles.org has a large dataset that is "public". It is a dataset perfectly suitable for, intended for, and therefore used for, training and data analysis.

          I've used it for data analysis.

      • sofixa 3 days ago
        Their main defence is that it's fair use because it's transformative (like a human reading a book, getting inspired, and writing something of their own) and not a copypaste illegal distribution (like a human scanning that book and selling it themselves).

        Having models hallucinate copyright notices shows that some content is being copypasted as is, which kind of goes against the transformative argument.

        (Note: I think that trying to litigate AI with current copyright laws is weird. They were created before LLMs were even imagined, so of course they can't handle them clearly. New laws are needed around this, not trying to bend over backwards to think about what a lawmaker a century ago would have thought about how transformative a thing they couldn't have imagined is.)

        • berkes 10 hours ago
          > which kind of goes against the transformative argument.

          Indeed a good example. We've seen several examples of code snippets where this happens too, mentioned on HN.

          But it does not prove that they infringed copyright by ingesting "illegal" stuff, as GP tried to argue. Seeing a verbatim string only "proves" that it came from a specific source, not whether this source was illegally acquired, which was my point.

    • staplers 3 days ago
      It already has been and meta won the lawsuit because corporations are sacrosanct.
      • vanschelven 3 days ago
        Do you mean that specifically a hallucinated text "copyright by not-meta" made it into evidence? Or are you talking about copyright generally?
  • 1718627440 3 days ago
    Well, I fail to see how the LLM is in the wrong here. Surely if a sufficiently large part of the training data comes from a single source, it is correct to credit them for the output.
  • 0points 3 days ago
    Interesting! I used Whisper last year to attempt to build an audio transcription tool but gave up due to the excessive amount of hallucinated output no matter what model I used.

    It would produce seemingly ok output until you started paying attention.

    One example, it insisted that Biggie Smalls sings "Puttin five carrots in my baby girl ear". (it's "carats").

    It's apparently not useful in transcription as it don't reason [sic].

    • IshKebab 3 days ago
      That example is not hallucination, it's just a homonym with insufficiently clear context for the model to disambiguate it.
      • 0points 3 days ago
        I'm well aware mishearing "carats" as "carrots" is not a hallucination.

        That's an example I gave after having used Whisper, the topic of discussion.

        • dpoloncsak 3 days ago
          An example of what you claimed was a hallucination
  • kristjank 3 days ago
    roses are red

    violets are blue

    unregistered hypercam 2

    • maxbond 3 days ago
      Roses are red,

      Silence is golden,

      Translated by Nancy,

      To copyright, we aren't beholden

  • chris_wot 3 days ago
    My wife, who speaks and reads Arabic, got a real kick out of this.

    But honestly, this is the AI equivalent of “please send for translating” in Welsh on a Welsh street sign.

    https://www.theguardian.com/theguardian/2008/nov/01/5

  • sandspar 3 days ago
    Neat, we finally know the answer! What is the sound of one hand clapping? Translation by Nancy Qunqar.
    • layer8 3 days ago
      I can clap with one hand (fingers on palm) and it produces a clapping sound.
      • boomlinde 3 days ago
        Your brain merely hallucinates a clapping sound as "Translation by Nancy Qunqar" enters your ears.
      • sandspar 1 day ago
        Ah, so the sound of one hand clapping is: clapping! A little underwhelming, to be honest. You mean I climbed the Zhen Zi mountains and performed the Seven Labors to learn... this?
  • GaggiX 3 days ago
    Whisper frequently generates random credits. I guess they didn't curate the dataset much at the time.
  • Dwedit 3 days ago
    I've seen Japanese translation models that translate empty string "" into "I'm Sorry I'm Sorry"
  • flkiwi 3 days ago
    > [In English] it also happens a lot with hallucinations saying stuff like "This is the end of the video, remember to like and subscribe

    Well now I know how I’m going to start filling awkward silences in meetings.

  • tarikozket 3 days ago
    this happens in Turkish too. I believe the reason is that the movie subtitles were used for training without cleaning up the comments / intros subtitle authors leave in them.

    leaving personal comments, jokes, reactions, intros in subtitles is very common in eastern cultures.

    Turkish readers will probably remember “esekadam iyi seyirler diler” :)

    • jdiff 3 days ago
      Kind of mindblowing considering who it is we're talking about. Of all companies, OpenAI couldn't be bothered to throw an LLM at this problem? Finding amorphously phrased but clearly recognizable needles in large numbers of haystacks seems like a patently perfect task for them.
      • sofixa 3 days ago
        Don't even need an LLM, a regex would have sufficed (I've used my fair share of community-sourced subtitles, and comments are almost always in a different font, colour, between brackets, etc etc).
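
A rough sketch of what such a regex pass could look like, assuming the subtitle text has already been split into one cue per line. The phrase list is only drawn from English examples quoted in this thread and is nowhere near exhaustive; the function names are made up for illustration, and nothing here reflects how any actual training pipeline was built:

```python
import re

# Illustrative credit/outro phrases taken from examples in this thread.
# A real cleanup would need per-language lists and much broader coverage.
CREDIT_PATTERN = re.compile(
    r"\b(?:translated by"
    r"|subtitles? by"
    r"|subs? by"
    r"|thank(?:s| you) for watching"
    r"|like and subscribe)\b",
    re.IGNORECASE,
)


def is_credit_line(line):
    """True when a cue looks like a translator credit or an outro plea."""
    return CREDIT_PATTERN.search(line) is not None


def clean_cues(lines):
    """Drop cues matching the credit pattern and keep everything else."""
    return [line for line in lines if not is_credit_line(line)]
```

For example, `clean_cues(["Hello there.", "Translated by X", "[ sub by sk cn2 ]"])` keeps only the first cue. A phrase list like this also throws away genuine dialogue that happens to contain one of the phrases, so it trades some recall on real speech for cleaner labels.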
    • verzali 3 days ago
      That name translates as "Donkey Man" btw :D
  • PatchworkCasino 2 days ago
    I've found if the first 30 seconds of a recorded phone call is ringing and/or DTMF (almost always happens if you call a business) the system will select either Nynorsk or Welsh as the language. Never bothered to check what the text translated to, but it's probably something similar. Not a practical issue for me but I can see it being a pain for any bilingual business or call center.
  • jovial_cavalier 3 days ago
    Looks like it's some random user who has generated some lyrics translations between Arabic and English. It's strange, they don't seem to have many contributions. I would have imagined them to be more prolific.

    https://lyricstranslate.com/en/translator/nancy-qunqar

  • shadowgovt 3 days ago
    Interesting. This is similar to the Google Translate bug where it would translate lorem ipsum as bits of political text (because it found most of its lorem ipsum examples flipping between languages on sites where one language was a news story but the not-yet-translated languages would output a lorem-ipsum file instead of a 404 when you toggled over to them).
  • arblor 3 days ago
    Just to add some trivia: ChatGPT interprets(/ed) silence as "Sottotitoli e Revisione a cura di QTSS". Now many videos (mainly dailymotion) with autogenerated subtitles have their transcripts full of the same message.

    e.g. https://www.dailymotion.com/video/x9g9d6u

  • boredumb 3 days ago
    Since it says "Translated by Nancy Qanqar" I'd be willing to bet they're training on some audiobooks with a transcript, and somewhere in there it consistently has "Translated by Nancy Qanqar" in the transcript where there is dead air in the audiobook.
  • junon 2 days ago
    It's a common problem with many languages. If you speak gibberish fake Chinese at ChatGPT and ask it to translate, it'll happily say you're saying coherent things.
  • majke 3 days ago
    I've spent some time with whisper, and indeed this happens all the time. To my untrained eye it seems like:

    - they indeed seem to have trained on movies/subtitles

    - you absolutely positively must use Voice Activity Detection (VAD) in front of whisper

  • xg15 3 days ago
    Yeah, the subtitle "credits" occur very frequently. I found with whisper-2, they're also triggered by music.

    I suppose the cause is the same: subtitle creators generally add all kinds of stuff during the credits that is NOT a transcript.

    Seems to me it could have been filtered out relatively easily during training, by clipping the first and last few minutes of all audios. But I guess that's just in hindsight.
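
    A minimal sketch of that clipping idea, assuming each subtitle cue carries start/end times in seconds. The `Cue` type, the `clip_edges` name and the 3-minute margin are made up for illustration:

    ```python
    from dataclasses import dataclass


    @dataclass
    class Cue:
        start: float  # seconds from the beginning of the audio
        end: float
        text: str


    def clip_edges(cues: list[Cue], duration: float, margin: float = 180.0) -> list[Cue]:
        """Keep only cues that lie fully inside [margin, duration - margin]."""
        return [c for c in cues if c.start >= margin and c.end <= duration - margin]


    cues = [
        Cue(5.0, 9.0, "Subtitles by someone"),
        Cue(600.0, 603.5, "We need to talk."),
        Cue(5390.0, 5395.0, "Thanks for watching"),
    ]
    kept = clip_edges(cues, duration=5400.0)
    print([c.text for c in kept])
    ```

    This throws away real dialogue near the edges too, and does nothing about credits inserted mid-film, so it trades some training data for a cleaner set.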

    Whisper also likes to transcribe cut off speech or unintelligible noise as "Thank you". I have no idea where that is coming from, but I guess it's a very polite model...

  • shinycode 3 days ago
    In French it’s « Sous-titrage Société Radio-Canada »
  • tornikeo 3 days ago
    Garbage in, garbage out. If the training dataset (accidentally) paired silence (`X_train`) with `ترجمة نانسي قنقر` tokens (`y_train`), then any silence will be transcribed as that. Fortunately, this particular problem is easy to fix: just detect and remove silent parts before the API call. This also has the side benefit of saving you money on transcription.
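
    A crude sketch of the "remove silent parts" step, assuming mono audio already decoded to floats in [-1, 1]. It is a plain RMS energy threshold rather than a proper VAD, and `drop_silence`, the frame length and the threshold are illustrative choices, not tuned values:

    ```python
    import math


    def drop_silence(samples: list[float], frame_len: int = 400, threshold: float = 0.01) -> list[float]:
        """Return the samples with low-energy frames removed.

        frame_len=400 is 25 ms at 16 kHz; both defaults are assumptions.
        """
        voiced: list[float] = []
        for i in range(0, len(samples), frame_len):
            frame = samples[i:i + frame_len]
            # Root-mean-square energy of this frame.
            rms = math.sqrt(sum(s * s for s in frame) / len(frame))
            if rms >= threshold:
                voiced.extend(frame)
        return voiced


    # One second of silence followed by one second of a 440 Hz tone at 16 kHz.
    rate = 16000
    silence = [0.0] * rate
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
    trimmed = drop_silence(silence + tone)
    print(len(silence + tone), "->", len(trimmed))
    ```

    Ringing, music or background noise will sail past an energy threshold like this, which is why people tend to reach for a trained VAD model instead.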
  • zahrevsky 2 days ago
    When using ChatGPT audio transcription, sometimes it adds to the end “Subtitles created by ...”, and then some username. Obviously, an artefact of training on a subtitles dataset.
  • GodelNumbering 3 days ago
    Interesting that this happens even on large v3. I once did a deep dive into STT, and Whisper Large was the only model that could correctly transcribe Yann LeCun (it was a Lex Fridman podcast). Ever since, I've held the belief that it was the best STT model. This was over 2 years ago.
  • terribleperson 3 days ago
    Using Whisper to sub Japanese vtuber concerts for my enjoyment, I've noticed a similar trend. Not one specific phrase, but several. Some are strange ("I'm going to make a hole in the back of the head"), some are clearly from lyrics websites.
  • tacone 3 days ago
    Same happened to me with English: I've got "Thanks for watching" many times.
    • withinboredom 3 days ago
      Super annoying when it happens with voice chat -- it'll just be explaining something and suddenly stop to say "you're welcome! Feel free to come back any time you want to chat" and that conversation is done.
  • kranner 3 days ago
    I've run lots of guided meditations through whisper-large-v3 and anything with long periods of silence gets a "© Mooji Media" line added at the end of the transcript. None of these have actually been from Mooji.
  • abdussamadbello 3 days ago
    You can either fine-tune the model or filter the response from whisper

    ```python
    # Remove the known hallucinated credit from the transcript.
    text = "helo helo hello ."
    target_phrase = "ترجمة نانسي قنقر"
    replacement = ""

    updated_text = text.replace(target_phrase, replacement)

    print(updated_text)
    ```

  • johtso 3 days ago
    I get the same with Welsh, when having some network issues in voice chat it hallucinated me saying "Diolch yn fawr am wylio'r fideo." which translates as "Thank you very much for watching the video."
  • blindstitch 3 days ago
    The fork that I've been using, WhisperX, seems to do better. I've used it on clean splits of mic tracks (ie total silence when the other is talking) with far fewer hallucinations.
    • ethan_smith 3 days ago
      WhisperX works better because it implements a robust VAD (Voice Activity Detection) preprocessing step that effectively filters out silence segments before they reach the model, preventing the hallucination triggers entirely.
  • Oras 3 days ago
    Searching Google for older posts, found many DailyMotion links for translated movies in Arabic with "ترجمة نانسي قنقر".

    I suspect, as others mentioned, that these were extracted from torrented movies.

    • undersuit 3 days ago
      All my Google searches for Oracle support pages have been labelled with 'الموارد البشرية والتنمية الاجتماعية ' which translates to 'Human Resources and Social Development' for a few months now. Wonder how much this is related.
  • bilekas 3 days ago
    This is a nice reminder that there is no real reasoning in the "AI"; it is still just guessing the next word. Training on subtitle files is, I guess, actually a clever idea, as they convey real conversations without pirating; subtitles are, after all, freely distributed by dedicated translators. Good to see they're the ones getting credit though!
  • ninetyninenine 3 days ago
  • h1fra 3 days ago
    Time to weaponize this: publish thousands of videos, add a referral link or AI instructions in subtitles when there is a silent section, ???, profit
  • VMG 3 days ago
    related: Google's song detection algorithm detects my phone vibrating as the song "Montagem Dilatação Hipnótica"
  • michalpleban 3 days ago
    The same thing happens in Dutch as well; it brings up some kind of radio channel name, if I recall correctly.
  • theanirudh 3 days ago
    In English, silence is transcribed to "Please like and subscribe"
    • cheschire 3 days ago
      I get thanks for watching a lot when using speech to text on ChatGPT
  • jacobgorm 3 days ago
    In Danish I get credits to a known subtitler.
  • DonHopkins 3 days ago
    More like reminiscing than hallucinating.
  • dipierro 3 days ago
    Субтитры сделал Dima Torzhok ("Subtitles made by Dima Torzhok")
  • doodlecricket 3 days ago
    [dead]
  • dangus 3 days ago
    Hey guys, AI by 2027 is going to be superhuman AGI Agentic mega-intelligence, you better fire all your employees and get ready for AI to take your job and embrace your spouse at a Coldplay concert.

    Big data. Machine learning. Blockchain. Artificial intelligence. Digital manufacturing. Big data analysis. Quantum communication and…Internet of things.

    This time the hype cycle won’t be a massive exaggerated disappointment, for real this time.