Pocket TTS: A high quality TTS that gives your CPU a voice

(kyutai.org)

156 points | by pain_perdu 19 hours ago

9 comments

singpolyma3 1 hour ago
Love this.
It says MIT license but then readme has a separate section on prohibited use that maybe adds restrictions to make it nonfree? Not sure the legal implications here.
[-]
- CGamesPlay 32 minutes ago
  For reference, the MIT license contains this text: "Permission is hereby granted... to deal in the Software without restriction, including without limitation the rights to use". So the README containing a "Prohibited Use" section definitely creates a conflicting statement.
- jandrese 32 minutes ago
  The "prohibited uses" section seems to be basically "not to be used for crime", which probably doesn't have much legal weight one way or another.
- iamrobertismo 13 minutes ago
  Yeah, I don't understand the point of the prohibited use section at all, seems like unnecessary fluff.
- Buttons840 37 minutes ago
  Good question.
  If a license says "you may use this, you are prohibited from using this", and I use it, did I break the license?
lukebechtel 1 hour ago
Nice!
Just made it an MCP server so claude can tell me when it's done with something :)
https://github.com/Marviel/speak_when_done
armcat 1 hour ago
Oh this is sweet, thanks for sharing! I've been a huge fan of Kokoro and event setup my own fully-local voice assistant [1]. Will definitely give Pocket TTS a go!
[1] https://github.com/acatovic/ova
[-]
- gropo 1 hour ago
  Kokoro is better for tts by far
  For voice cloning, pocket tts is walled so I can't tell
  [-]
  - echelon 36 minutes ago
    What are the advantages of PocketTTS over Kokoro?
    It seems like Kokoro is the smaller model, also runs on CPU in real time, and is more open and fine tunable. More scripts and extensions, etc., whereas this is new and doesn't have any fine tuning code yet.
    I couldn't tell an audio quality difference.
- amrrs 1 hour ago
  Thanks for sharing your repo..looks super cool.. I'm planning to try out. Is it based on mlx or just hf transformers?
  [-]
  - armcat 1 hour ago
    Thank you, just transformers.
dust42 1 hour ago
Good quality but unfortunately it is single language English only.
[-]
- phoronixrly 1 hour ago
  I echo this. For a TTS system to be in any way useful outside the tiny population of the world that speaks exclusively English, it must be multilingual and dynamically switch between languages pretty much per word.
  Cool tech demo though!
  [-]
  - kamranjon 49 minutes ago
    That's a pretty crazy requirement for something to be "useful" especially something that runs so efficiently on cpu. Many content creators from non-english speaking countries can benefit from this type of release by translating transcripts of their content to english and then running it through a model like this to dub their videos in a language that can reach many more people.
    [-]
    - phoronixrly 10 minutes ago
      You mean youtubers? And have to (manually) synchronise the text to their video, and especially when youtube apparently offers voice-voice translation out of the box to mine and many others' annoyance?
  - Levitz 39 minutes ago
    But it wouldn't be for those who "speak exclusively English", rather, for those who speak English. Not only that but it's also common to have system language set to English, even if one's language is different.
    There's about 1.5B English speakers in the planet.
    [-]
    - phoronixrly 19 minutes ago
      Let's indeed limit the use case to the system language, let's say of a mobile phone.
      You pull up a map. All the street names are in the local language, and no, transliterating the local names to the English alphabet does not make them understandable when spoken by TTS.
      You pull up a browser, open up an article to read during your commute in your local language. You now have to reach for a translation model first before passing the data to the English-only TTS software.
      You're driving, one of your friends Signals you. Your phone UI is in English, you get a notification (interrupting your Spotify) saying 'Signal message', followed by 5 minutes of gibberish.
      But let's say you have a TTS model that supports your local language natively. Well due to the fact that '1.5B English speakers' apparently exist in the planet, many texts in other languages include English or Latin names and words. Now you have the opposite issue -- your TTS software needs to switch to English to pronounce these correctly...
      And mind you, these are just very simple use cases for TTS. If you delve into use cases for people with limited sight that experience the entire Internet, and all mobile and desktop applications (often having poor localisation) via TTS you see how mono-lingual TTS is mostly useless and would be switched for a robotic old-school TTS in a flash...
      > only that but it's also common to have system language set to English
      Ask a German whether their system language is English. Ask a French person. I can go on.
  - echelon 32 minutes ago
    English has more users than all but a few products.
tschellenbach 1 hour ago
It's cool how lightweight it is. Recently added support to Vision Agents for Pocket. https://github.com/GetStream/Vision-Agents/tree/main/plugins...
syntaxing 1 hour ago
Is there something similar for STT? I’m using whisper distill models and they work ok. Sometimes it gets what I say completely wrong.
[-]
- daemonologist 53 minutes ago
  Parakeet is not really more accurate than Whisper, but it's much faster - faster than realtime even on CPU: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3 . You have to use Nemo though, or mess around with third-party conversions. (Also has a big brother Canary: https://huggingface.co/nvidia/canary-1b-v2. There's also the confusingly named/positioned Nemotron speech: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0...)
  [-]
  - satvikpendem 45 minutes ago
    Keep in mind Parakeet is pretty limited in the number of languages it supports compared to Whisper.
- phoronixrly 1 hour ago
  from the other day https://github.com/cjpais/Handy
GaggiX 2 hours ago
I love that everyone is making their own TTS model as they are not as expensive as many other models to train. Also there are plenty of different architecture.
Another recent example: https://github.com/supertone-inc/supertonic
[-]
- andai 1 hour ago
  In-browser demo of Supertonic with WASM:
  https://huggingface.co/spaces/Supertone/supertonic-2
- coder543 1 hour ago
  Another one is Soprano-1.1.
  It seems like it is being trained by one person, and it is surprisingly natural for such a small model.
  I remember when TTS always meant the most robotic, barely comprehensible voices.
  https://www.reddit.com/r/LocalLLaMA/comments/1qcusnt/soprano...
  https://huggingface.co/ekwek/Soprano-1.1-80M
- nunobrito 2 hours ago
  Thank you. Very good suggestion with code available and bindings for so many languages.
oybng 58 minutes ago
>If you want access to the model with voice cloning, go to https://huggingface.co/kyutai/pocket-tts and accept the terms, then make sure you're logged in locally with `uvx hf auth login` lol
snvzz 1 hour ago
Relative to AmigaOS translator.device + narrator.device, this sure seems bloated.