How to do Free Speech-to-Text Transcription Better Than Google Premium API with OpenAI Whisper Model #322
FurkanGozukara
announced in
Tutorials
Full tutorial: https://www.youtube.com/watch?v=msj3wuYf3d8
If you want to transcribe your videos and audio into text for free but with high quality, you have come to the right video.
In this tutorial video, I will guide you on how to use the #OpenAI #Whisper model. I will show you how to install and run OpenAI's Whisper from scratch. I will demonstrate how to convert audio/speech into text.
Whisper is a general-purpose speech recognition model released for free by OpenAI. I claim that Whisper is the best Speech-to-Text model (Natural Language Processing - #NLP) available to the public, including premium paid services such as Amazon Web Services, Microsoft Azure Cloud Platform, or Google Cloud API. And Whisper is free to use.
I will show you how to install the necessary Python code and the dependent libraries. I will show you how to download a video from YouTube with yt-dlp, how to cut certain parts of the video with LosslessCut, and how to extract the audio of a video with FFmpeg. I will show you how to transcribe a video or an audio file. I will show you how to generate subtitles for any video. Finally, I will show you how to generate translated transcriptions and subtitles for a video in any supported language.
With the translation feature of the Whisper model, you can watch videos in any supported language (Whisper supports 99 languages) with English subtitles. Let's say you can't find English subtitles for your favorite video in German, Japanese, or Arabic. It is not a problem. Just follow my tutorial and generate English-translated subtitles.
To be precise, Whisper can transcribe speech to text in all of the following languages, and can therefore translate from any of them into English:
{af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,hi,hr,ht,hu,hy,id,is,it,iw,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
The links and the commands shown in the video are below:
OpenAI Whisper : https://openai.com/blog/whisper/
Whisper Code : https://github.com/openai/whisper
Python : https://www.python.org/downloads/release/python-399/
Whisper install : pip install git+https://github.com/openai/whisper.git
How to install CUDA support to use the GPU when transcribing audio :
First, uninstall the existing PyTorch : pip3 uninstall torch
Then install PyTorch with CUDA support : pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
FFmpeg : https://github.com/BtbN/FFmpeg-Builds/releases
LosslessCut : https://github.com/mifi/lossless-cut/releases
How to extract the audio of any video with FFmpeg : ffmpeg -i "test_video.webm" -q:a 0 -map a test_video.mp3
How to transcribe an English video : whisper "C:\speech to text\test_video.mp3" --language en --model base.en --device cpu --task transcribe
How to transcribe an English video with CUDA support : whisper "C:\speech to text\test_video.mp3" --language en --model base.en --device cuda --task transcribe
How to transcribe a Turkish video (use a multilingual model, since the .en models are English-only) : whisper "C:\speech to text\test_video.mp3" --language tr --model small --device cpu --task transcribe
How to transcribe a Turkish video with translation : whisper "C:\speech to text\test.mp3" --language tr --model small --device cuda -o "C:\speech to text" --task translate
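The same transcribe and translate commands can also be run from Python instead of the command line. The following is a minimal sketch, assuming the whisper package installed with the pip command above. The helper names pick_model, pick_device, and run_whisper are hypothetical and are not part of Whisper itself. The sketch picks an English-only model (.en) only when the audio is English, because the .en models cannot handle other languages.

```python
import importlib.util


def pick_model(language: str, size: str = "base") -> str:
    """Return a Whisper model name, using the English-only variant only for English audio."""
    # English-only checkpoints exist for tiny, base, small and medium, but not for large.
    if language == "en" and size in {"tiny", "base", "small", "medium"}:
        return size + ".en"
    return size


def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch is available, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


def run_whisper(audio_path: str, language: str, task: str = "transcribe", size: str = "base") -> str:
    """Transcribe one audio file (or translate it to English) and return the text."""
    import whisper  # imported here so the helpers above also work without Whisper installed

    model = whisper.load_model(pick_model(language, size), device=pick_device())
    result = model.transcribe(audio_path, language=language, task=task)
    return result["text"]


# Example calls (the paths are the ones used in the commands above):
# print(run_whisper(r"C:\speech to text\test_video.mp3", language="en"))
# print(run_whisper(r"C:\speech to text\test.mp3", language="tr", task="translate", size="small"))
```

Passing task="translate" produces English text from non-English speech, which mirrors the --task translate option of the command line.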
Our Discord for SECourses : https://discord.gg/rfttctFewW
If you are interested in programming but you lack experience and skills I suggest you watch our playlists: https://www.youtube.com/c/SECourses/playlists
[1] Introduction to Programming Full Course with C# playlist
[2] Advanced Programming with C# Full Course Playlist
[3] Object Oriented Programming Full Course with C# playlist
[4] Asp.NET Core V5 - MVC Pattern - Bootstrap V5 - Responsive Web Programming with C# Full Course Playlist
[5] Artificial Intelligence (AI) and Machine Learning (ML) Full Course with C# Examples playlist
[6] Software Engineering Full Course playlist
[7] Security of Information Systems Full Course playlist
Video Transcription
00:00:00 Hello everyone. Welcome to the Software Engineering Courses channel. I am Dr. Furkan Gözükara.
00:00:07 Today I will be presenting to you the ultimate guide for speech-to-text transcribing for
00:00:12 free on any Windows operating system. I will be using Windows 10 for demonstration. We
00:00:18 will use Whisper, which is a general-purpose speech recognition model released for free
00:00:23 by OpenAI. Whisper was released to the public two days ago. OpenAI is an artificial
00:00:30 intelligence company. To download Whisper, just Google "Whisper OpenAI", as can be
00:00:35 seen here. Let me show you. And there is a blog post and a GitHub repository. I will explain to you
00:00:46 from scratch how to transcribe a video. Moreover, I will show you how to generate subtitles
00:00:52 for any video in any of the major languages spoken in the world. To be precise, Whisper
00:00:59 supports 99 languages. Furthermore, I will demonstrate to you how to also generate speech-to-text
00:01:06 translation to English from any of the supported languages. So let's say you have a Japanese
00:01:12 TV series and you can't find English subtitles for this TV series. With Whisper, you can
00:01:18 easily generate English subtitles for your favorite TV series with correct timing without
00:01:23 any hassle and without paying a cent. Also, let's say you are a video producer and you
00:01:29 are a non-English speaker. Then you can easily generate subtitles for your videos and also
00:01:36 English-translated subtitles as well. I have already tested and compared the Whisper model
00:01:43 to both free and paid existing speech transcription and recognition services. The Whisper model
00:01:50 is so good that it is better than the automatic subtitle generation system of YouTube or even
00:01:55 better than the premium paid speech-to-text cloud API service of Google. Yes, I have used
00:02:01 both services in my previous videos. Therefore, I know what they are capable of and what Whisper
00:02:08 is capable of. For example, I used Whisper to generate subtitles for my previous video.
00:02:15 Let me show you that. How to debug your Python code properly by using Visual Studio Community
00:02:31 Edition 2022. You can select the Whisper generated subtitle from here as I am showing you right
00:02:40 now. English, United States, English untouched by OpenAI. I didn't change even a letter in
00:02:48 this subtitle and it is so good. Whisper made only 5 minor wording errors; I know because I also
00:02:57 uploaded the manually fixed version of the subtitle generated by Whisper. I will also
00:03:05 upload Whisper-generated subtitles for this video that I am recording right now as well, for
00:03:12 you to check out and evaluate yourself. Okay, so let's start with installing all the necessary
00:03:19 files. First, we need to download Python because Whisper runs on Python. They mentioned that
00:03:28 they have used Python 3.9.9. Let me find it. Okay, here. By the way, to access the OpenAI GitHub
00:03:42 repository, you can click code on their blog and you can also read their paper if you are
00:03:48 interested in how they developed it. They have used 680,000 hours of multilingual and
00:03:59 multitask supervised data. Supervised means that the data is labeled: it has manually
00:04:07 generated subtitles or is manually transcribed. This is a huge task and this requires huge
00:04:17 hardware computation power. I thank them for releasing this to the public for free. Okay,
00:04:27 so as can be seen in the setup section of the GitHub repository, they have used Python 3.9.9 and PyTorch
00:04:38 and other things. I will download and install everything in this video so that it will
00:04:44 be the ultimate guide for you. Download Python 3.9.9. Okay. All right, I will pick the Windows
00:04:57 installer (64-bit) to download. Okay, the download has been completed. I will install it as an
00:05:06 administrator. Okay, customize installation. You see I am picking everything. Next, I am
00:05:15 picking these options as well, and I will change the directory to a new folder on the C
00:05:25 drive. I will name it Python 3.9.9. Okay. And I will pick that folder as you can see
00:05:37 here and install. Okay, so the installation has been successful. I will pause the video while
00:05:45 I am downloading or installing something. Okay, so what else do we need? Then we need to
00:05:53 install pip. Actually, it is already installed on my system. But if you want to install pip,
00:06:00 just download pip like this. And here it will give us the command. I will open CMD; we will
00:06:13 be working a lot with the command prompt. When I type pip install pip, it says it's already
00:06:19 installed. Okay. Okay, it has been installed with the Python package. Then we need to run
00:06:28 this command, as you can see, which is posted on the GitHub repository. By the way, I will put
00:06:35 every command and every link that I am using here into the description of the video. So don't
00:06:41 worry about that. Copy it. And let's just run it in our command prompt. Okay. Okay, so
00:06:52 the installation of Whisper has been completed successfully, as can be seen here. Now we
00:06:58 can start transcribing speech to text. Okay,
00:07:10 so I will download a video of mine, and I will extract its audio for you. I will show
00:07:16 you, and then I will show you how to use OpenAI Whisper. Alright, so let's download my latest video.
00:07:26 To download my video from YouTube, we will use yt-dlp. Just type yt-dlp like
00:07:36 this. It's an open-source project. And let's download its release like this. Okay. Okay,
00:07:49 I think this one. Okay, it has been downloaded. I will put this into my folder on C, which
00:08:04 I have named "speech to text", like this. So I am opening another command prompt, I
00:08:11 am moving into C and then I'm moving into "speech to text". I have typed S P E and then
00:08:21 hit the Tab key on my keyboard, and it auto-completes. Okay, now then yt-dlp.exe, for which I also
00:08:31 use Tab to auto-complete. I copy and paste the video link. I right-click and it
00:08:40 pastes it into the command prompt. Okay, downloading. Meanwhile, let's also download FFmpeg, which
00:08:51 we will be using for extracting audio. To download FFmpeg, click download, pick Windows. And
00:09:04 yes, the second link, Windows builds by BtbN, is the better one. I will download the biggest
00:09:16 one, which is not the Linux one or this one. Okay, it is downloading. And
00:09:27 let's see, okay, our video has been downloaded into our folder as you can see here. Okay.
00:09:38 And FFmpeg has also been downloaded. Let's also copy and paste it into our folder and
00:09:47 extract it. Okay, I need these three exes. Okay. And the video is, let's see, 16 minutes.
00:10:02 Yeah, it's a decent length. I think we can work on that, but we can also cut it and work on
00:10:09 a small part of it. To cut a video I use another open-source project, which is a great one.
00:10:16 So you will also learn this in this video: LosslessCut. Okay, LosslessCut is also another
00:10:23 open source project. This is the link of it. And let's just download its release here.
00:10:35 And let's pick the correct file, which is this one I think. Okay, it is getting downloaded.
00:10:47 These are small files. Yeah, oh, I need to download this one, not this one, actually. Okay,
00:10:59 it's almost ready. Okay, it's ready. Downloaded. Let's also cut and paste it into
00:11:07 our folder and extract it. Okay, I'm opening LosslessCut. Then I will open the downloaded video,
00:11:19 or just drag and drop it like here. Then I will cut its first two minutes. Okay, so it
00:11:28 will be fast. Okay, currently it says this part will be saved and this part will be ignored.
00:11:37 I set it like this. Export. And here I will name it as our test video like this.
00:11:48 It is saved as test_video.webm because we downloaded it as WebM. Now it is time to extract
00:11:55 its audio to feed it into Whisper. Okay. So the command is here. I have already written
00:12:07 it. It is ffmpeg -i, the video file name, and the output file name will be like
00:12:18 this. Okay, let's run it. Before running it, let's open another command prompt and move
00:12:23 into our folder. This time type cd, drag and drop the folder like this, hit Enter, and
00:12:31 we are there. When you type dir you can see its content. I am copying and pasting
00:12:39 the copied text. Okay, oh, we have a naming error. Okay, let's fix it. Okay, it is done.
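The audio extraction step can also be scripted. This is a minimal sketch in Python that builds the same FFmpeg command listed in the video description and runs it with subprocess. The function names are hypothetical, and it assumes ffmpeg is on the PATH or in the current folder. The trimming command is an assumption: later in the video an audio file is cut without re-encoding, but the exact command is not shown.

```python
import subprocess


def extract_audio_cmd(video: str, audio: str) -> list[str]:
    """Build the FFmpeg command that extracts the audio track of a video."""
    # -q:a 0 requests the highest variable-bitrate quality, -map a keeps only audio streams.
    return ["ffmpeg", "-i", video, "-q:a", "0", "-map", "a", audio]


def trim_cmd(source: str, target: str, seconds: int) -> list[str]:
    """Build an FFmpeg command that keeps only the first `seconds` of a file."""
    # -t limits the output duration, -c copy copies the stream without re-encoding.
    return ["ffmpeg", "-i", source, "-t", str(seconds), "-c", "copy", target]


def run(cmd: list[str]) -> None:
    """Run a command and raise an error if it fails."""
    subprocess.run(cmd, check=True)


# Example calls (file names follow the ones used in the video):
# run(extract_audio_cmd("test_video.webm", "test_video.mp3"))
# run(trim_cmd("lecture.m4a", "lecture_short.m4a", 180))
```

Passing the arguments as a list avoids quoting problems with folder names that contain spaces, such as "speech to text".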
00:12:55 And now our test video MP3 file is ready. Time to transcribe it. Okay. So, by the
00:13:06 way, when you install Whisper by default, it only supports running on the CPU. But if you have
00:13:16 a GPU that supports CUDA, then you can also use your GPU for speech-to-text transcribing.
00:13:28 So first I will show you with the CPU, then I will show you with the GPU as well. Okay, so our command
00:13:38 will be like this. I will provide the language as well. The language of this video is English,
00:13:49 therefore it will be like this. Okay, so there is model small. What that means is they
00:13:56 released several models, here: tiny, base, small, medium, and large. For English they
00:14:04 say that the English-only models work better. And if you have time, I suggest you use the
00:14:12 biggest one, which is the medium English-only model. And for multilingual, if you are going to transcribe
00:14:21 a video in a language other than English, I suggest you use the large model. And these
00:14:26 are the video RAM requirements if you use your GPU. By the way, a GPU is much faster
00:14:34 than a CPU. Okay, it is many, many times faster than a CPU. Currently, I have been running CPU
00:14:43 transcription on another computer for over 24 hours and it has only transcribed about four hours of speech
00:14:53 on the CPU. Therefore I just purchased another graphics card today, which has 12 GB of VRAM,
00:15:00 and I will use Whisper to generate and improve subtitles of my other existing
00:15:09 lecture videos as well. Okay, so let's start with the base model, which is a decent one, and
00:15:16 it should work fast. So therefore I am going to provide the model base.en. I will select
00:15:26 the device as CPU; currently only the CPU is available. And the task will be transcribe. By the way,
00:15:37 if you don't provide any task, it will be transcribe by default. If you don't provide any
00:15:41 device, it will be CPU by default. If you don't provide any model, it will be by default, let's
00:15:48 check it out with whisper --help, the default model will be small. Okay. And if
00:16:03 you don't provide any language, it will try to detect the language of the provided audio.
00:16:09 So these are the defaults. Moreover, you also need to move into the folder of the file that you
00:16:15 are going to transcribe, otherwise it won't save it; that was the case the last time I tried. So let's just
00:16:25 run our command, which is this one. Okay, it's going to start. Okay, I will pause the video. Okay, it has
00:16:41 started converting speech into text, as you can see: "Hello everyone, welcome to my channel
00:16:53 again, this is Dr. Furkan Gözükara." It failed to understand my name; it is Turkish and our
00:17:02 model is not the best one. When you transcribe a video with a better model, a bigger model,
00:17:12 it does a better job for sure; I have tested it. If you have a GPU it is much faster, many
00:17:20 times faster. My computer is also strong. You see, I have a Core i7-10700F CPU, which
00:17:38 has 16 cores, 8 real cores and 8 logical cores. It runs at 4.59 gigahertz. You see it is using
00:17:48 100% CPU right now. Okay, the transcribing has finished because we have used a small model
00:17:56 and a short video. This is the transcription. It is pretty good for this model. And let's
00:18:07 open our folder, and as you can see the transcript is showing in EmEditor. Okay, the transcript
00:18:18 is here and the generated subtitle file is also here. This works directly in YouTube;
00:18:31 I have tested it, because it has the correct timestamps for the sentences. It is awesome,
00:18:38 believe me, and it is working. Okay, so we have successfully transcribed an English video
00:18:51 and we have generated subtitles for that video. Let's test our subtitle. I am opening it with
00:19:05 Media Player Classic. It automatically loaded the subtitle because it is in the same folder
00:19:12 with the same name. You see, it is awesome. If we use the best
00:19:34 model I am sure those minor mistakes will also get fixed. Okay, now it is time to show you how
00:19:44 to use the GPU on a Windows installation. To be able to use the GPU, first we need to uninstall the installed
00:19:54 torch. Okay, okay, let's run the command. Okay, it says these are depending on torch. I'm
00:20:07 just saying yes and it is uninstalled. Then we install the latest torch. To get this command
00:20:17 you can just go to torch, just type torch download, PyTorch actually, PyTorch. Yes, just type
00:20:28 PyTorch. Then you can pick the versions here: stable, LTS, preview, your operating system, your
00:20:37 CUDA version, Python, C++, and it is going to give you a download link. Okay,
00:20:52 and I'm going to select pip and this is it. Okay, this is the same link that I have just
00:20:59 copied, and just click install. Okay, it is going to download all the necessary files and install
00:21:07 them. I think this was over two gigabytes if I remember correctly. It says collecting torch.
00:21:15 I'll just pause. Okay, so it has been installed successfully. Now we can also set the device
00:21:27 with our transcribe command. So I am going to change this; I will just copy and paste
00:21:34 it and change this, and let's see its speed. Okay, so I will delete the older files. I'm
00:21:44 not sure if it will overwrite them or not, therefore I am deleting them. And let's just
00:21:50 hit Enter. Okay, and it says that, oh, this was not happening. I think since I uninstalled
00:22:06 and installed again there is a problem. Okay, I have found the error. No matter how
00:22:20 senior we get, we still make such minor mistakes, but they take time to figure out.
00:22:27 You see, I have typed GPU as the device but it should be CUDA. So when I write it as
00:22:36 CUDA, now it will work. Let's delete the existing file and let's run the command with
00:22:47 CUDA, and now you will see how fast it works. It has an initialization period like this
00:22:56 and then it is super fast. Okay, as you can see it's working. Okay, and one final thing
00:23:17 is that I will show you how to do translation. Okay, for translation I will use one of my
00:23:27 Turkish videos. Okay, I also have some Turkish lectures, for example here, and let's download
00:23:40 one of the short ones, like the latest one here. Okay, and let's get the command we will just
00:23:49 run. Okay, one second, let's move into our folder. yt-dlp. By the way, for translation to
00:24:02 work you should probably be in the same folder. I'm not sure, but always being in the same folder
00:24:10 is better. It is getting downloaded. Okay, so the download was taking too long, so I decided
00:24:20 to download only the audio file, and you can do that with yt-dlp. So let's just copy the
00:24:32 link again, and we are going to add -F (uppercase F), and it will give us all the options,
00:24:43 as you can see. So I will just download the audio file, which is, let's see, audio only. Let's
00:24:56 download the best one. So the best audio is this one, I think. Yes, or the best one is this
00:25:13 one. So let's just give the command like this. Yes, now it will download only the audio
00:25:28 of the video, but I am not sure if Whisper supports this format. It could be
00:25:44 a problem. It is also taking time. Okay, so the download has been completed. Let's test whether
00:26:12 Whisper is able to use M4A audio. So I will name this lecture 14 TR. Okay, and
00:26:36 let's try it. Okay, I won't give an output folder. No, I will also give an output folder. Yes, I will
00:26:44 use device CUDA, and first let's try transcribing. Oh, by the way, we should probably cut it. I
00:27:01 wonder if LosslessCut can cut it. Yeah, probably. Let's cut the first three minutes. Oh, it can't
00:27:15 cut it. Okay, we need to cut this. Okay, so this is the command to cut an audio file with FFmpeg.
00:27:37 It is cut immediately, I think, because we didn't re-encode. Yes, here. Now we can use this short
00:27:45 file, which is 180 seconds. Okay, so first we will transcribe it, then we will translate
00:27:56 it. This is in Turkish. By the way, it is not downloading the model because I have already
00:28:07 downloaded it and it uses the cache, but it automatically downloads it if you don't have
00:28:18 the model in the cache, so it is not a problem. Okay, it
00:28:24 is still initializing, I think. Yeah, we can see the GPU VRAM usage. Okay, you
00:28:34 see, currently it is printing the lecture. It is in Turkish, so it is printing in Turkish.
00:28:42 It can be any language, one of the supported languages. Let me show you the full list of
00:28:47 the supported languages. Okay, let's open a command prompt, type whisper first, then type
00:28:56 --help. Yeah, and these are all the languages that it supports. It supports both
00:29:11 the language code and the full language name, like Afrikaans, Albanian, Amharic, Arabic, Armenian,
00:29:20 Azerbaijani, Basque, Dutch, English, Danish, Estonian, Finnish. I think there are 99 languages. Okay,
00:29:33 so it is still processing. I think non-English processing is a little bit slower than English
00:29:42 itself. By the way, it supports M4A audio as well. Okay, it should get done in a minute.
00:29:56 Yeah, okay, we have cut it to three minutes. Okay, so it has been completed. Now it is time to
00:30:11 translate. We can see the generated files here, like this. Okay, I will just delete them, and
00:30:22 now let's run the translation command, which will be like this. Okay, okay, now it will translate
00:30:37 this text into English. Unfortunately it only supports translation to English from other
00:30:45 languages; it doesn't support translation into non-English languages. Translation from English
00:30:53 to non-English languages would be super awesome if they supported it; however, they
00:31:00 do not support that. Okay, so the first sentence, this one, is translated as this one. I think
00:31:10 it is decent but not the best, because we are using only the small model. With the big
00:31:17 model, the large model, I am pretty sure we would have much better translation,
00:31:25 transcription, and speech-to-text generation. Okay, and one final thing is that they are
00:31:41 updating the source code from time to time. You see, the latest commit was 12 hours ago, so you
00:31:50 should update your code from time to time. You can do that with Code, Download ZIP. It is downloaded,
00:32:01 and then go to the Whisper folder. This Whisper folder is located in Python, then Lib, and then,
00:32:19 I think it is inside, let me find it, okay, inside the site-packages folder, and then there is a Whisper
00:32:37 folder. Just drag and drop, replace, and okay, it's updated. They are fixing errors. Actually,
00:32:49 they have fixed the error in writing the translation into a file recently, so you really should
00:32:58 pay attention to the latest commit after your initial download. Okay, this is all.
00:33:08 I would appreciate it if you join and subscribe to my channel. Okay, sorry about that, sorry about
00:33:16 that, and hopefully see you later. End of the video. I am waiting for your comments, opinions,
00:33:24 and questions. I also answer your questions; you can ask through our Discord. And if you
00:33:30 wonder where to find our Discord, you can join it with the link here or from here,
00:33:37 or many of my videos have a Discord link. Actually, I didn't put it on this one, but in many
00:33:47 of them there is a Discord link. Okay, see you.
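The subtitle files mentioned in the video are plain text: one numbered block per sentence, each with a start and end time. As a rough illustration, here is a small Python sketch that turns Whisper-style segments (dictionaries with start, end, and text, like the entries of the segments list in the Python API result) into SRT text. The helper names are hypothetical, and the whisper command already writes subtitle files for you, so this only shows the idea.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Turn a list of {"start", "end", "text"} dictionaries into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        # Each SRT block is: counter, time range, text; blocks are separated by a blank line.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Example with hand-written segments (timings are illustrative only):
example = [
    {"start": 0.0, "end": 7.0, "text": " Hello everyone."},
    {"start": 7.0, "end": 12.5, "text": " Today I will show you Whisper."},
]
print(segments_to_srt(example))
```

A player such as Media Player Classic loads a subtitle file automatically when it sits in the same folder as the video and has the same name, as shown in the video.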