I've been using MacWhisper for this, with a huge variety of transcription options and things like speaker detection. It works great for all the 1 hour and shorter videos I've fed it, but does this have more to offer?
I haven't tried a 4+ hour video with MacWhisper but I presume that would work the same.
MacWhisper handles multiple-hour-long recordings just fine for me. I regularly process 4hrs on MacWhisper. Even whisper-cpp works fine these days for long recordings too.
Cool product, but it would be better if you stopped spreading misinformation to support it.
As a side project, I just launched a privacy-first web-based meeting transcriber (https://basilai.app/app). Everything runs entirely in your browser — both the transcription and AI summarization — so no audio or text ever leaves your device.
I'm using the browser built in transcription service plus downloading a model and running it via webgpu. No login. At the end of your meeting, you get a zip file with the audio, transcript and summary.
So, can it handle multiple languages in one video, or do you need to segment the different languages using LID first? This has been a thorny issue for people working in multilingual audio (there are at least two or three of us).
I haven't tested that specific edge case, I'm sorry. I tested 2 languages in a normal conversation and that worked fine. "Auto" or "English" handles multiple languages the best.
You use the word "transcribe" but the page doesn't appear to support that claim? This looks like straightforward STT? Or does it actually support transcription (diarization, etc.)?
(Also, the text is completely illegible on your site.)
One thing that Rev and other online services have as well as MacWhisper is a good interface for editing the text to correct inevitable errors. Being able to click on the text and have it sync to the correct place in the audio is a must for my use case of transcribing interviews. Also speaker diarization.
Scribers’ iCloud system automatically backs up each transcription and organizes them in a three-pane folder view, somewhat inspired by Bars’ layout. This structure allows a surprising degree of customization for all your data needs, especially when transcribing interviews. It would probably make for a very comfortable workflow here.
Yes, but by a negligible margin. My program is designed for multi-track audio, which means I run this in parallel on multiple 3-hour recordings and get results in 12 minutes.
You haven’t shared any architectural details. What model? What size? How can anyone be sure that what you’re building is truly offline?
What does that even mean? Why would OSS make it slower? Why would it be an overkill?
This is not Producthunt, you have to give at least some kind of explanation for your claims.
MacWhisper crashes at about an hour of context.
This uses smart, invisible regex in the text generation pipe, which makes it fast. Bonus: there is no context limit.
I haven't worked in a while with transcription, but whisper.cpp itself (which I assume is the underlying tech behind MacWhisper) does realtime transcription on my MBP with an M1 Pro chip. When I first started writing my last completed novel, I fired it up and just started telling the story to test it out. Realtime.
That was back in 2023. I assume things work better now.
"Smart, invisible regex" sounds like a lot of bs... could you give a more technical explanation?
Also the Whisper model doesn't really have a context window, it already segments the audio with a certain amount of overlap between the chunks, I really have a hard time understanding what you are trying to say here.
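For anyone following along, the chunking idea described here can be sketched roughly like this. It is a generic illustration of overlapping windows, not Whisper's actual code; `chunk_bounds` is a made-up helper, and the 5-second overlap is an assumed value picked for the example.

```python
def chunk_bounds(total_s: float, window_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) times in seconds covering a recording of total_s seconds.

    Consecutive windows share overlap_s seconds, so a word cut off at one
    boundary is heard whole in the next window. The defaults are assumptions
    for illustration only.
    """
    if window_s <= 0 or not 0 <= overlap_s < window_s:
        raise ValueError("need window_s > 0 and 0 <= overlap_s < window_s")
    step = window_s - overlap_s
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # the last window reached the end of the audio
        start += step
    return bounds


if __name__ == "__main__":
    # A 4-hour recording is just more windows of the same size.
    print(len(chunk_bounds(4 * 60 * 60)))
```

The point of the sketch: each window is the same size no matter how long the recording is, so length alone should not create a "context limit".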
This is just plain wrong. I have my own Whisper App in the AppStore (on iOS, with very limited memory capacity) and there are no problems at all with longer Audio / Video files.
FYI it works now because I just brought it up. The website mentions there were HN promo/discount codes, so I honestly expected the app to be like $20+, so color me shocked when it's $3.99.
What language do you have the model architecture and implementation in? Feel like it would be the biggest proportion of the codebase, curious if you did it in Swift?
At $3.99 this was an instant buy for me until the App Store told me I couldn't. I think the venn diagram between HN users and those holding off on Tahoe is probably a pretty big overlap. ;-)
Sharing my experience with only supporting the latest OS:
I have launched apps focused on a new feature in the latest OS and regretted it. The # of people who have the latest OS is much smaller than the full install base for much longer than I thought. As a result, my marketing conversion was unnaturally low - people who liked the app idea but couldn't install because they had the wrong OS. This causes two problems: potential users I activated but couldn't convert and this signal gets internalized by the App Store, pushing down future impressions.
Now I always have a fallback implementation of the feature so I can target the prior OS. Both Mac and iOS.
Thanks. An upcoming update will let a user give it a link to a page that contains web video. It will do dynamic link discovery (with a Safari extension) and pull in the video automatically (M3U8 discovery and retrieval). There are lots of online lecture videos that need transcription.
I will include better version support (probably to os 13).
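As a rough idea of what the M3U8 retrieval step mentioned above involves (a minimal sketch, not the author's implementation): in an HLS playlist, every non-empty line that does not start with `#` is a URI, and relative URIs resolve against the playlist's own URL. The helper name `playlist_uris` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def playlist_uris(playlist_text: str, base_url: str) -> list[str]:
    """Extract absolute URIs from the text of an M3U8 playlist.

    Lines starting with '#' are tags or comments; every other non-empty
    line is a URI (a media segment, or a variant playlist in a master
    playlist). Relative URIs are resolved against base_url.
    """
    lines = [ln.strip() for ln in playlist_text.splitlines()]
    if not lines or lines[0] != "#EXTM3U":
        raise ValueError("not an M3U8 playlist: first line must be #EXTM3U")
    return [
        urljoin(base_url, ln)
        for ln in lines[1:]
        if ln and not ln.startswith("#")
    ]


if __name__ == "__main__":
    sample = "#EXTM3U\n#EXTINF:9.0,\nseg0.ts\n#EXT-X-ENDLIST\n"
    print(playlist_uris(sample, "https://example.com/video/index.m3u8"))
```

A real downloader would also have to handle master playlists that point at variant playlists, encryption tags, and so on; this only shows the discovery part.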
Ah, I think you're asking whether a user who wants timestamps can further adjust the output to be per word?
Timestamps are currently set around each sentence (2-5s), but that is absolutely doable and a great idea. In the next update (~3-4 weeks) I’ll definitely include the ability to control that.
Also, disabling scrolling and using swipe for sections instead _at a font size that causes text to overflow_, depending on phone screen size, meaning a bunch of the site is _literally_ unreadable, since it's off the screen with no way to get there.
Reds people choose are usually between dark and light, which doesn't contrast particularly well on anything because for good contrast you need a dark color vs a light color.
Green, yellow, red or whatever hue is fine, as long as it's dark or light enough. Colorblind and non-colorblind people can see how dark or light a color is (luminance), but they might not agree on the hue. That's why WCAG contrast checks require luminance contrast and not hue contrast.
It's best to use a contrast checker because it's not always intuitive how dark or light a color is e.g. yellow and lime are almost as light as white.
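If you want to see what a contrast checker does under the hood, here is a small sketch of the WCAG 2.x relative luminance and contrast ratio formulas for 8-bit sRGB colors. The function names are my own; the constants come from the WCAG definition.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to its linear-light value."""
    c = channel_8bit / 255.0
    # Threshold and constants as given in the WCAG 2.x definition.
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """WCAG relative luminance: 0.0 for black, 1.0 for white."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    # Illustrative colors only: yellow and lime next to white, and white
    # text versus black text on an arbitrary mid red.
    for name, rgb in [("white", (255, 255, 255)), ("yellow", (255, 255, 0)), ("lime", (0, 255, 0))]:
        print(name, round(relative_luminance(rgb), 3))
    red = (200, 30, 30)
    print("white on red", round(contrast_ratio((255, 255, 255), red), 2))
    print("black on red", round(contrast_ratio((0, 0, 0), red), 2))
```

Note that hue never appears in the ratio, only luminance. WCAG AA asks for at least 4.5:1 for normal-size body text.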
I can’t read this site. A red background with dark grey/black text is a terrible, terrible choice of colors for readability. There were some words there, but all I could make out was the header on my mobile phone.
I don’t see this sort of thing, has the page changed? Edit: the comments here…
The drop shadow on the pages does make it deeply unpleasant to read.
For example, could it support a video that included spoken Latin, ancient Greek, German, and Italian?
It even supports speaker differentiation/recognition and is open source on mac/windows/linux;
https://github.com/thewh1teagle/vibe
What's the stack, if I may ask? (I believe Whisper-X does the diarization thing)
https://github.com/naveedn/audio-transcriber
The elapsed-time timestamps didn't correlate well with other data sources. I figured it was a mistake on my end, and just brushed it off.
This is not true. (I've been a MacWhisper user since 2023. I've hit two bugs during that time, which the author addressed quickly.)
Would also love to hear what you mean by “smart invisible regex,” sounds like AI slop to me.
I've never heard a regex person speak this way of a regex.
Please tell me you didn't vibecode the regex... one of the areas it's still not good at
Neither Whisper nor MacWhisper has any context limit.
Actually I would be happy if it could just identify occurrences (timestamps) of a specific word or a small set of words.
Thanks.
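If a tool can export word-level timestamps, finding every occurrence of a few words is a small post-processing step. A minimal sketch, assuming the transcript is available as a list of words with start and end times in seconds; the `Word` type and `find_occurrences` are hypothetical names, not any particular app's export format.

```python
import re
from typing import Iterable, NamedTuple


class Word(NamedTuple):
    """One transcribed word with its timing (assumed export format)."""
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


def find_occurrences(words: Iterable[Word], targets: Iterable[str]) -> list[tuple[str, float]]:
    """Return (word, start_time) for every word matching one of targets.

    Matching ignores case and any punctuation attached to the word.
    """
    wanted = {t.casefold() for t in targets}
    hits = []
    for w in words:
        normalized = re.sub(r"[^\w']+", "", w.text).casefold()
        if normalized in wanted:
            hits.append((normalized, w.start))
    return hits


if __name__ == "__main__":
    transcript = [
        Word("Hello,", 0.0, 0.4),
        Word("world.", 0.5, 0.9),
        Word("hello", 1.2, 1.5),
    ]
    for word, t in find_occurrences(transcript, ["hello"]):
        print(f"{t:6.2f}s  {word}")
```

With sentence-level timestamps only, the same approach works but can only place a hit within a sentence window of a few seconds.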
Thanks for sharing.
Looking forward to the "Speaker Detection" feature release. ;)
Cool project. I am using ChatGPT for recording/summarising meetings, but the limit there is 2 hours.
Also, that color is color(display-p3 .768627 .031373 .031373 / 1). It is technically redder than red actually is.
But that's a fair point. I drank the Liquid Glass Kool-Aid. I'll aim for more compatibility in the next upgrade.
https://iili.io/KkoKBCx.png
I have changed the bg color.
https://scriberpro.cc/about/ Are you trolling people with this page's design? Unreadable colors AND a wobble effect? :D
Typically white text works better on red.
The first thought I had when it loaded was, "Did we forget how to make webpages?"
Sorry. I'm sure the software is great, but yeah.
Oh, I see the title ate your h2 text. That's no good. Thanks for showing me.