Mel Spectrograms Explained Easily

Valerio Velardo - The Sound of AI

97,249 views

Added 5 September 2024

Comments • 132

@romainpattyn4528 3 years ago ⁺⁴⁷
Really nice video, thank you. I like the way you explain things. Just wanted to mention that there is an error at 13:47: in the formula to go from Hz to mel, the frequency should be divided by 700, not by 500. 😉
@ValerioVelardoTheSoundofAI 3 years ago ⁺⁴
Thanks a lot! Yep, that is a mistake - thank you for pointing that out :)
@zhouzhou3785 3 years ago ⁺¹³
Thank god, your videos make my learning curve for speech processing much flatter, just like the mel scale does.
@ValerioVelardoTheSoundofAI 3 years ago ⁺³
It's nice to be your Mel scale :D
@jennifer6278 3 years ago ⁺²⁴
I was struggling so much trying to understand this for my speech recognition class, I can't believe I understood everything within only 30 minutes! Thank you so much! :) This is incredibly well explained. Now on to MFCCs ...
@aoliveira_ 1 year ago
Don't forget that when you came here you already had previous knowledge. I also consider that these videos are very good in explaining things that I've struggled to understand in other places. But I didn't begin here. Most likely you are complementing these explanations with your previous knowledge.
@nedzadhadziosmanovic3785 3 years ago ⁺⁴
In this video, and the next video called "Extracting Mel Spectrograms with Python", you explain what a mel band, the mel scale, a mel filter bank etc. mean, but in my opinion there is a single step missing for understanding what is really done when using mel filter banks to construct a mel spectrogram.
The process you are referring to:
1. Find the smallest and biggest frequency expressed in Hz, which we got from the output of the STFT
2. Convert these two values from Hz to the mel scale
3. Choose the number of mel bands we want to use
4. According to the chosen number of mel bands, construct a mel filter bank
And now comes the part which is not clear to me: the use of mel filter banks on the outputs of the STFT to get an output of some other kind, which will be used to construct a mel spectrogram.
At this point let's just go back and look at a single output of the STFT (which is equivalent to performing a DFT on one frame of an audio wave). As a result we get a set of complex numbers, and by finding their magnitudes we are able to construct an amplitude-vs-frequency graph (also called a "frequency domain graph"), by simply plotting the magnitudes as the amplitude for a certain frequency. In other words, each of the magnitudes of the complex numbers (to be clear, one magnitude per complex number) is responsible for the height of one bin inside the amplitude-vs-frequency graph.
Now we have this single amplitude-vs-frequency graph, and we want to use it in combination with mel filter banks to construct an output of some kind. The first question is how to apply a mel filter bank to a single output of the STFT (i.e. to one amplitude-vs-frequency graph). In other words, how do we combine these two to get an output of some kind? (I know that it is basically a multiplication of two vectors, but how would you represent this visually, using a mel filter bank and a single amplitude-vs-frequency graph?) Secondly, what is this output representing, the amplitude for a single mel band? Lastly, I think it would be much clearer if we used mel bands on the y-axis and mel as the measuring unit (but I don't know whether this would be correct). In my opinion, putting frequency in Hz on the y-axis of a mel spectrogram is completely misleading (and is making me think I didn't understand anything).
I wanted to ask whether you would be so kind as to make a single graph which is the output of combining a single amplitude-vs-frequency graph (which we got from the STFT) with a mel filter bank, also expressed visually as a graph (I suppose, but I am not sure, that it would then be an amplitude-vs-frequency graph, but this time with mel frequencies on the x-axis), as I think that it could help both me and a lot of your viewers.
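The step asked about here can be sketched in a few lines of NumPy. This is only a minimal illustration, not the exact code from the video or from librosa: it assumes the 700-based Hz-to-mel formula discussed in this thread, builds triangles that are linear in Hz between mel-spaced edge frequencies, and uses made-up values for `frame_size`, `sample_rate` and `n_mels`. The helper names are hypothetical. Each row of the filter bank holds the weights of one triangle, and multiplying the bank by one frame's magnitudes gives one weighted sum (one value) per mel band.

```python
import numpy as np

def hz_to_mel(f):
    # 700-based formula, as corrected in the comments of this thread
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_mels, frame_size, sample_rate):
    """Hypothetical helper: triangular filters, shape (n_mels, frame_size // 2 + 1)."""
    n_bins = frame_size // 2 + 1
    # Centre frequency (Hz) of every STFT bin
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_mels + 2 points equally spaced in mels, converted back to Hz
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, centre, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (bin_freqs - left) / (centre - left)
        falling = (right - bin_freqs) / (right - centre)
        # Weight is 0 outside [left, right] and peaks at 1 at the centre
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

# Assumed example values, not taken from the video
frame_size, sample_rate, n_mels = 1024, 22050, 10

rng = np.random.default_rng(0)
frame = rng.standard_normal(frame_size)      # stand-in for one audio frame
magnitudes = np.abs(np.fft.rfft(frame))      # one amplitude-vs-frequency "graph"

bank = mel_filter_bank(n_mels, frame_size, sample_rate)
mel_frame = bank @ magnitudes                # one weighted sum per mel band

print(bank.shape, magnitudes.shape, mel_frame.shape)  # (10, 513) (513,) (10,)
```

Stacking one such `mel_frame` per STFT frame, column by column, gives the mel spectrogram. In practice the power spectrum and a dB conversion are often used instead of raw magnitudes, and library filter banks may use different scaling or normalisation.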
@MathStatsMe 1 month ago
Citing you in my master's thesis. Thank you for these videos!
@aussieronnied 4 years ago ⁺³
Thanks Valerio! The triangular filter bank visualisation helped me connect the dots in understanding what is happening behind the scenes. Keep up the great work :)
@ValerioVelardoTheSoundofAI 4 years ago
Nice to hear that Ronald!
@kirdiekirdie 4 months ago
Fantastic explanation! Needed this as a prerequisite to understand the OpenAI Whisper paper.
@superhorstful 3 years ago ⁺⁴
Isn't there an error in the formula for the mel frequency? I mean it should be f/700 and not f/500?
@tiagobeltraolacerda5034 3 years ago
I noticed that too. Using 700 we get 1 kHz = 1000 mel, but using 500 we get 1 kHz = 1238.1 mel. I didn't understand why.
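The two numbers quoted in this thread can be checked directly. The snippet below is a small sketch of the formula m = 2595 · log10(1 + f / 700) and its inverse; the `divisor` parameter is a hypothetical addition, included only so the mistaken 500 can be compared with the corrected 700.

```python
import math

def hz_to_mel(f_hz, divisor=700.0):
    """Hz to mel: m = 2595 * log10(1 + f / divisor)."""
    return 2595.0 * math.log10(1.0 + f_hz / divisor)

def mel_to_hz(m, divisor=700.0):
    """Inverse mapping: f = divisor * (10 ** (m / 2595) - 1)."""
    return divisor * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000.0), 1))                  # 1000.0
print(round(hz_to_mel(1000.0, divisor=500.0), 1))   # 1238.1
```

With 700 the scale is anchored so that 1000 Hz maps to (almost exactly) 1000 mel, which is why 700 is the value that agrees with the inverse formula shown in the video.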
@user-up3gx6nf5b 1 year ago ⁺¹
At 2:20, you mention that the higher frequencies sound similar, but I hear the opposite. The lower frequencies I can't distinguish; the higher ones, I can.
Edit: had to wear headphones to hear the difference x_x
@StefaanHimpe 4 years ago ⁺²¹
8:15: is it 500 or 700?
@ValerioVelardoTheSoundofAI 4 years ago ⁺¹³
Great catch Stefan! It's supposed to be '700' not '500'. Thank you for pointing this out!
@kazmzengin5176 3 years ago
@@ValerioVelardoTheSoundofAI Hi Valerio, I would have asked the same question if I hadn't read your correction. Maybe you should correct it in the video too. Thank you very much for your video series.
@magnuspierrau2466 3 years ago ⁺³
This was just awesome! Thank you so much for explaining this concept so clearly, intuitively and passionately! Great stuff! :)
@oscarwjy5084 3 years ago
Man you really helped me a lot for my thesis related to auditory filter bank
@yuu_808 4 months ago
What a good explanation. It helped me to understand mel spectrograms. Thank you so much.
@kenand330 2 years ago ⁺³
Sir, there is something I don't understand here. We do not perceive the pitch difference between the first two notes you play. We can perceive the pitch difference between the second pair of notes. But shouldn't it be the other way around? Am I the only one hearing this?
@Beatitat 1 year ago
Could we use these features to identify keys and chords? I'm watching your videos so I can learn a way to make a simple program that detects the chord progressions of songs. Thanks for the videos!
@andreeamadalina8509 3 years ago ⁺³
Is it just me, or for the first pair do I hear only one sound, while for the second one I hear two sounds? Shouldn't it have been the other way around? Lol
@Mattews1119 3 years ago ⁺²
Thank you Valerio for the amazing content! I'm really grateful for the time and work you're spending on these videos. The way you teach is very clear and simple, I like that a lot :D
Also, if you don't mind, I have a question. I was wondering if extracting frequency features (Spectral Centroid, Rolloff, ...) from a mel spectrogram, instead of a regular spectrogram, would be more beneficial for a MIR application?
@ValerioVelardoTheSoundofAI 3 years ago
Thank you Mateus :)
@Saitomar 3 years ago ⁺²
Hi Valerio. How is a mel spectrogram better than a vanilla spectrogram in terms of deep learning? I understand that it is better in terms of how we perceive audio as humans. But in deep learning, the models pick up the features that are most relevant to them, just as for images we simply provide the image as a 3D array and the model identifies the underlying pattern. Is there any paper that compares mel spectrograms and vanilla spectrograms in terms of deep learning?
Thank you for the video
@ValerioVelardoTheSoundofAI 3 years ago ⁺¹
In general, people tend to use MelSpecs over vanilla specs. I'm not aware of any paper that compares the two across the board. Performance of the 2 representations depends on each task. The empirical approach is the best way to check which is best for you. Try both representations for your use case on the same architecture.
@Saitomar 3 years ago
@@ValerioVelardoTheSoundofAI how does it depend on the given task? I am assuming the time-domain representations performance in DL to be task agnostic
@ValerioVelardoTheSoundofAI 3 years ago
@@Saitomar Unfortunately, it's not agnostic and it depends on the task.
@Saitomar 3 years ago
Thanks for the reply, I am working on a model which was used in image classification and trying to use it for audio classification, which is why I was curious. Hopefully the results will be good.
@sailfromsurigao 11 months ago
I greatly appreciate the content you've been sharing on audio processing for machine learning; it's incredibly insightful. I am particularly interested in the intersection of audio and image data. Would it be possible to discuss methods for transforming an image into a Mel spectrogram or a standard spectrogram?
@erkangjing2124 3 years ago
Thank you for sharing. It's really useful for my learning of audio signal processing. Other things, such as mel bands, mel filter banks, frequency resolution, and the frequency range that can be perceived by human beings, are sometimes hard to tell apart and to determine. I hope that I can find the answers in the discussion board or in your other videos. Finally, thanks again for sharing.
@DavidKalinex 3 years ago
Very useful video! No doubt I will be revisiting for the rest of the year to finish my thesis
@ValerioVelardoTheSoundofAI 3 years ago
Thank you David :)
@melverys 1 year ago
This is how I found your video: I recently got into learning the Japanese language and I thought it would be cool to see the spelling of my name in Japanese. Seems like Mel translates to Meru and the definition of my name in Japanese is a logarithmic transformation of a signal's frequency. Kind of an interesting rabbit hole to go down since I'm a math geek and a musician too lol
@ValerioVelardoTheSoundofAI 1 year ago
Fantastic story - thank you for sharing :)
@ebrukeklek3237 3 years ago
Incredibly good work Mr.
Sometimes it was hard to understand you because of you talking really fast and with a dialect 🤣🙈 but your devotion is fantastic ❤️
@ValerioVelardoTheSoundofAI 3 years ago
Thanks!
@markusbuchholz3518 4 years ago ⁺³
Perfect! Thanks Valerio for this interesting video. I wouldn't be myself if I didn't ask... There is a "long pipeline" in signal processing for deep learning. We "lose" info while sampling, quantising, performing the STFT, and now using triangular filters. Afterwards we perform convolutions and again some important info is lost. Do you think that this process is "smart" enough and energy efficient? I assume that the question is directly related to how we want to apply deep learning, I mean what we want to do with the signals: classification, generation, filtering, prediction and so forth.
Great channel and community!
@ValerioVelardoTheSoundofAI 4 years ago ⁺¹
You're spot on! The pre-processing audio pipeline can be quite convoluted. That's why some researchers are experimenting with raw audio signals. The problem with this is that audio is highly dimensional. The preprocessing steps we usually take with spectrograms trade "perfect" information with lowered data dimensionality.
@markusbuchholz3518 4 years ago ⁺¹
@@ValerioVelardoTheSoundofAI Thanks for feedback and clarification. Anyway, in order to improve something it is great to be familiar with principles. Thanks!
@andres-ab 3 years ago ⁺³
I have one question. Given the desire for the NN to learn or catch a pattern that the human ear may not recognize (e.g. classifying coughs from different diseases, or positive/negative cases of one disease), what's the need to feed the NN a spectrogram with "humanly perceived coherence"? Could it be possible to avoid the frequency and amplitude correction? Does it make sense to do so?
Thanks a lot. I really love this series.
@avidreader100 3 years ago
I guess there can be any number of features suitably defined based on our objective and current insight. Mel would be one such based on human perception. It could have a great fit for applications where the human perception is relevant. There is no compulsion to use it for classifying cough. I would imagine a differently defined scale can very well be used.
@ash3844 2 years ago
Amazing!!! Loving all the series of your videos. Thanks a ton!!!
@Sam-jk5dw 3 years ago
I wish there was a frequency conversion example for the Mel filter bank. Like just one example where you take a frequency (which doesn't have a weight of 0 or 1) and convert it to mels. I felt like I didn't quite know what you were trying to say.
@nezardasan5015 3 years ago
DANKE Valerio, always shining
@ValerioVelardoTheSoundofAI 3 years ago
Thank you Nezar!
@lenam317 4 months ago
Thanks for the great video. I am also trying to implement a kind of ASR for my project, but I am unable to find any C/C++ libraries that support MFCC features from a live audio source. It'd be great if you could give me some pointers here.
@Underscore_1234 3 months ago
Hi, nice stuff (I didn't know anything about mels), but I wonder: I guess you apply triangular filters in the mel domain. If so, the filter is not triangular in the (linear) frequency domain, right? I believe the shape shouldn't be a triangle anymore in the linear frequency domain (in other words, you apply the mel transformation before applying a filter, right?)
@rprantoine 2 years ago ⁺²
Hi Valerio,
Thank you for your content, first of all!
One thing I struggle to understand though is the need to have bands for Mel, and then the use of filters.
Intuitively, to convert frequencies to Mels, I would have just applied the given Mel=f(frequency) formula to my discrete frequency vector and used the resultant discrete Mel vector as my y-axis.
How is that not correct?
Why do we need bands?
Thanks in advance
Antoine
@antonselitskii8351 2 years ago
Don't forget, we work with discretized data. Notice that the number of mels (0, m_1, ..., m_63, 64 in total) is smaller than the number of frequencies (0, f_1, ..., f_512, 513 = 1024/2+1 in total). The intervals [0, f_1), [f_1, f_2), ..., [f_511, f_512), ..., [f_1023, f_1024) are called linear frequency bins; each interval is associated with its left boundary. Because of the symmetry of the STFT, we use only half of them: 0, f_1, ..., f_512. We want to have mel frequency bins [0, m_1), [m_1, m_2), ..., [m_63, m_64). Obviously, some linear frequency bins will collapse into one mel bin; that is why we need a convolution with filters.
In TorchAudio, this is done by a matrix relating the 513 frequency bins to the 64 mel bins. Use ms = torchaudio.transforms.MelScale(n_mels=64, sample_rate=sr, n_stft=1024//2+1); the matrix is saved in the ms.fb variable (stored with shape 513x64).
@Waffano 1 year ago
@@antonselitskii8351 Great answer. Made me wonder: why do we not have # mel frequency bins = # frequency bins? Then we could just apply the mel function to all the frequency bins like @Antoine suggests, right?
@antonselitskii8351 1 year ago ⁺¹
@@Waffano You can think about this as a dimension reduction: you have vector f (say 1024) and m (say 80 mels) and transformation matrix T of size 80x1024. Then m = Tf. Yes, it will transform all linear frequencies. It's clear that we can do the inverse transformation, but it will not be precise, because we'll go from vector of size 80 to a vector of size 1024.
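The dimension-reduction view m = Tf can be sketched with NumPy. This is an illustration under assumptions, not torchaudio's implementation: the matrix `T` below is a simple triangular filter bank built with the 700-based mel formula, the sizes (80 mels, 513 bins, 22050 Hz) are example values, and the "inverse" is taken to be the Moore-Penrose pseudo-inverse, which is one common choice among several.

```python
import numpy as np

n_bins, n_mels, sample_rate = 513, 80, 22050   # assumed example sizes

# Frequency (Hz) of each linear bin, and n_mels + 2 edges equally spaced in mels
bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2.0) / 700.0)
edges_hz = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, n_mels + 2) / 2595.0) - 1.0)

# Row i is a triangle rising from edges_hz[i] to edges_hz[i + 1] and falling to edges_hz[i + 2]
T = np.stack([
    np.interp(bin_freqs, edges_hz[i:i + 3], [0.0, 1.0, 0.0])
    for i in range(n_mels)
])                                   # shape (80, 513)

rng = np.random.default_rng(0)
f = rng.random(n_bins)               # stand-in for one frame of linear-frequency magnitudes
m = T @ f                            # forward transform: 513 values -> 80 values

f_approx = np.linalg.pinv(T) @ m     # approximate inverse: 80 values -> 513 values
rel_error = np.linalg.norm(f - f_approx) / np.linalg.norm(f)

print(T.shape, m.shape, f_approx.shape)
print("relative reconstruction error:", rel_error)
```

Because `T` has far fewer rows than columns, many different vectors `f` map to the same `m`, so the reconstruction cannot match the original in general. That is the sense in which the inverse "will not be precise".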
@Deathlydave 3 years ago ⁺¹
Great video and great series. I really learned a lot from watching these videos. One thing that I am a little unclear about is why is the shape of the mel filter band (# bands , frame size / 2 + 1)? Are the values of the mel filter band simply the weights for the triangle filters? If so, since the triangle filters cover an increasing range of frequencies in Hz, how do we maintain the fixed frame size / 2 + 1 size?
@ehtashamulhaque5002 1 year ago
Edit: Okay, I also had this confusion, but remember we are doing an STFT? The frame_size is actually what dictates how many bins we produce in the spectrogram. It is easy to get confused when there is so much stuff to look out for.
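The relationship between frame size and bin count can be checked with NumPy's real FFT. This is a small sketch with assumed frame sizes; a frame of N real samples yields N / 2 + 1 frequency bins, whatever the audio content is.

```python
import numpy as np

n_bins = {}
for frame_size in (512, 1024, 2048):
    frame = np.zeros(frame_size)                    # the content does not affect the bin count
    n_bins[frame_size] = np.fft.rfft(frame).shape[0]

print(n_bins)  # {512: 257, 1024: 513, 2048: 1025}
```

Each row of a mel filter bank stores one weight for every one of these bins, so its length is always frame size / 2 + 1. A wider triangle simply has more non-zero weights in its row, and the remaining entries are zero.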
@muntazirmehdi503 3 years ago
You mentioned for the piano that we can use 40 mel bands as the notes are similar, but if we are working on speech data and have recordings of different people with different voices, how can we determine the number of mel bands in that case?
TIA
@mukundsrinivas8426 2 years ago
Amazing series of videos. Did u cover how to deal with audio of varying lengths in any video?
@mahathibodela 10 months ago
As usual, it's a really informative, easy to understand video. But I have a doubt. The spectrogram you showed in the last video had log ranges for frequency, and this mel spectrogram also has the same. Why can't we just do it the way you described in the last video?
@canernm 3 years ago ⁺¹
Hi Valerio, thanks for the videos. I have one question: in the previous video of the playlist, we took a vanilla spectrogram and transformed it to be both a log-amplitude and log-frequency spectrogram. Is the difference between the Mel spectrogram and the transformed one simply that in the latter we use a simple log2 scale?
@andreiplatonov7689 2 years ago ⁺²
Thank you for your videos!
However, if you place f=1000 in the formula of 'frequency to mel' conversion, you do not get 1000 mel..
@pjmmccann 2 years ago
* It should be 700, not 500 in the formula (see the inverse function, for example)
@zzhou4621 1 year ago
Formulation: mel = 1/log(2) * (log(1 + (Hz/1000))) * 1000 [Reference: Traunmueller, H. (1990) "Analytical expressions for the tonotopic sensory scale" J. Acoust. Soc. Am. 88: 97-100]
@alfredoalarconyanez4896 2 years ago
Thank you Valerio for this super nice video
@tetlleyplus 8 months ago
Is filtering using the mel banks just (algebraically) multiplied because convolution in the time domain is equivalent to multiplication in the frequency domain?
@antonnaumov4889 3 years ago ⁺¹
Hi, Valerio!
Thanks a lot for your videos! Can you please explain why we are still using Hz units on the mel spectrogram (at 26:47)?
@ValerioVelardoTheSoundofAI 3 years ago
That's just a convention to indicate how the different Mel bands are mapped to in terms of frequency.
@user-sx4ew3sm5u 3 years ago ⁺¹
Thank you for the excellent explanation. One quick question, is mel-spectrogram always good for deep learning? What I mean is that regardless of the sound classes(speech, ambient sound ...), is mel-spectrogram always better than using spectrogram?
@ValerioVelardoTheSoundofAI 3 years ago ⁺¹
That will depend on the particular problem. For that reason, it's always advisable to try out different audio representations.
@kirdiekirdie 4 months ago
Tried to listen to the C2 note several times until I figured out that my Lenovo laptop speakers apparently don't go that low, but my cheap headphones do :-)
@minired4611 3 years ago
Thanks for your clear explanation. It helped me a lot.
@burak4799 1 year ago
You are a life saver! Thank you very much for the detailed lecture :)
@SonGoku-rl9qf 7 months ago
At 27:40 the Mel spectrogram has Hz on its coordinate axis. I thought it should be mel?
@luandesouzasilva565 3 years ago
Thank you so much for these videos!
@qin7280 4 years ago
Hi Valerio, thanks so much for your effort making these videos! I keep learning by watching all your videos.
May I ask a simple question about Mel spectrograms? Are they also useful if I want to detect the sound of a heartbeat?
Actually that's what I am doing at the moment, but I am a total beginner.
I would really appreciate it if you could share your ideas or any other good materials on heartbeat detection!!
@ValerioVelardoTheSoundofAI 4 years ago
Yes, Mel spectrograms (usually!) work well with most audio classification problems.
@maddai1764 3 years ago ⁺²
Me again. Why not just use the frequency-to-mel equation to convert Hz to mel, just as you did in the previous videos to convert Hz to log frequency? Why go through all this hassle? I know there should be a reason, but I don't grasp it.
@zzhou4621 1 year ago
me toooo!
@Moonwalkerrabhi 3 years ago
At 18:55, I think the x-axis frequency is in kHz, not Hz, coz 1000 kHz = 1000 mel. I'm not sure though, but I think it is.
@armanz.9182 1 year ago
How well would rhythm be represented in mel spectrograms? I can imagine 'pure' rhythm information to be stored in the low frequencies, but these are compromised in these spectrograms, right? I had the idea that maybe rhythm information can be found between 0.55 Hz (33 bpm, lowest perceivable tempo) and 20 Hz (lowest perceivable tone). I have no idea though as to how valid this is.
I would love to hear if anyone knows a valid way to analyze just rhythm, thanks!
@IamAayam-rz8md 2 months ago
In the formula for mel, there should be f/700 right?
@arvindramanathan329 3 years ago
clear and intuitive explanation, thanks!
@ValerioVelardoTheSoundofAI 3 years ago
Thanks!
@shahnaz1981fat 2 years ago
Hi Valerio. Nice explanation of Mel spectrograms. But I could not understand the triangular filter banks.
They give a visualization of the transformation from Hz to mels. But as the triangles are overlapping, is it a one-to-many transformation? I am preparing for a PhD interview, and unless this is clear to me I cannot be confident. Please clarify…
@damdidum2601 3 years ago
Excellent video, u r really good at explaining this stuff!
@zzhou4621 1 year ago
Oh, why do we need to use the triangular filters? It seems we could also get the Mel spectrogram by applying the formula directly. Does anybody know?
@aayushchheda8689 1 year ago
I don't really understand the psychoacoustic experiment. Can you explain it here? I do not perceive the pitch difference between the first two notes you play. I can perceive the pitch difference between the second pair of notes. So shouldn't it be the other way around, or am I getting something wrong?
@jaydeepchauhan2737 3 years ago
What is the difference between a filter bank feature and a Mel-spectrogram feature? Are they the same?
@sarathanurahiyarehewage4642 1 year ago
I have a question. When m = 2595·log(1 + f/500), f should be equal to 500(10^(m/2595) - 1). Where does the 700 in f = 700(10^(m/2595) - 1) come from? Is it a mistake? In your video, it shows 700 in two places. Or am I missing something?
@Waffano 1 year ago
Valerio wrote in a comment above that the first formula had a typo. It should be 700 instead of 500.
@manjulakumari953 2 years ago
great video. Must watch
@uthsingi 7 months ago
I'd like to politely confirm: at 2:20, it seems like the note played as C2 might actually be C1. I'm not very familiar with musical notes, but the C2 played in your video sounds lower.
@ashinkajay 2 years ago
Thank you so much !
@matthewsmalatji5994 4 years ago ⁺¹
Hey man. I love the series. I need some help. I want to obtain AUDIO FRAMES and generate SPECTROGRAMS for each frame, SO I CAN FEED the spectrograms to a CNN to do music transcription. Please help. I am able to generate spectrograms using VQT; the issue comes with generating frames and a spectrogram for each frame.
@disturbedeyebrow5977 4 years ago ⁺¹
Thanks dude, you didn't mention the optimal number of MFCCs to use for image processing. In one of your previous videos you said that 13 MFCCs is the best choice for audio processing. Why 13? And how do we determine the optimal number?
@ValerioVelardoTheSoundofAI 4 years ago ⁺²
I'll post a couple of videos (theory + implementation) about MFCCs in the coming weeks. (Stay tuned for those!)
The short answer to your question is that 13 is a number traditionally used in earlier AI music research. Sometimes this number goes up to 48 or even 90.
As I mentioned for the number of Mel bands in this video, these numbers are somewhat arbitrary and must be treated as hyperparameters, which should be optimised.
@disturbedeyebrow5977 4 years ago ⁺¹
Thank you for answering so fast! I'll wait patiently for the upcoming vids!
@iioiggtrt9085 4 years ago
How do I save it as a CSV file for ML?
@bashhad2633 2 years ago
This is a great video
@preethamgali3023 3 years ago
Great explanation. 🔥🔥
@henoknigatu7121 1 year ago
Can you show us how to convert a mel spectrogram back to audio using Python, like a vocoder?
@pranavsingh1081 3 years ago
Could u please tell us the difference between a log spectrogram and a mel spectrogram?
@chrischang1980 3 years ago ⁺²
I think the difference is that the mel spectrogram applies the mel filters: the result for a specific mel frequency is a weighted sum of the original frequencies. The log spectrogram only changes the scale from linear to log.
@razvandumitrugrecea9388 3 years ago
nice one :)
somebody who shares :)
@jamalseyedmohammadi6681 3 years ago
Hi. Great video. I have one question. What is the difference between log frequency spectrogram and mel spectrogram? Thanks
@ValerioVelardoTheSoundofAI 3 years ago ⁺¹
I suggest you check out the previous videos on STFT, where I introduce the concept of the (Log) Spectrogram. In a nutshell, the Mel Spectrogram is a normal spectrogram to which we apply Mel filter banks.
@pranavsingh1081 3 years ago
@@ValerioVelardoTheSoundofAI It is not clear. Please explain the difference between a log spectrogram and a mel spectrogram.
@shreyaskulkarni5823 2 years ago
It should actually have been 2052 to get the constant difference of 263, when you showed the graph of mel and frequency.
@user-ul2gm5np3i 3 years ago
Thanks, you are a genius, and everyone can understand the concept of the Mel spectrogram by watching your video. However, it actually takes too long to understand a single concept, because it seems that you repeat certain words or sentences several times and offer too much extra information from time to time. If you can deal with that, I am sure that you will get way more subscribers. Anyway, thank you so much.
@ratfuk9340 1 year ago
Why is f = 700(10^(m/2595) - 1)? Shouldn't it be f = 500(10^(m/2595) - 1) if m = 2595*log(1 + f/500)?
@deepikasingh3122 9 months ago
But what are filter banks?
@mangomonkey7830 3 years ago
Hi, What if my audio files are an hour long. When I use librosa to load them, I only obtain the first 3 mins. What's the standard practice to generate mel spectrograms for hour-long audio recordings?
@ValerioVelardoTheSoundofAI 3 years ago
I would suggest segmenting the audio files, if possible.
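One possible way to segment a long recording is sketched below. It is a minimal illustration under assumptions: the audio is already loaded as a 1-D NumPy array, `split_into_segments` is a hypothetical helper (not a librosa function), and any trailing part shorter than one segment is simply dropped. A mel spectrogram can then be computed for each segment separately.

```python
import numpy as np

def split_into_segments(signal, sample_rate, segment_seconds):
    """Hypothetical helper: split a 1-D signal into equal, non-overlapping segments."""
    seg_len = int(segment_seconds * sample_rate)
    n_full = len(signal) // seg_len          # the trailing remainder is dropped
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]

# Assumed toy example: 10 seconds of "audio" at 100 samples per second
sample_rate = 100
signal = np.arange(10 * sample_rate, dtype=float)

segments = split_into_segments(signal, sample_rate, segment_seconds=3.0)
print(len(segments), [len(s) for s in segments])  # 3 [300, 300, 300]
```

For files too large to load at once, `librosa.load` also accepts `offset` and `duration` arguments (in seconds), so each segment can be read from disk separately instead of slicing one big array.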
@LewisWolstanholme 3 years ago ⁺¹
Your formula for converting frequency to mel (m = ...) is wrong. Your formula for mel to Hz, however, is correct (f = ...).
@laithswais7172 2 months ago
❤❤❤
@user-fh7tg3gf5p 2 months ago
There was supposed to be a pair of notes, C2 and C4, but there was only one. Bad editing?
@harshitjuneja9462 1 year ago
If we use a CNN model (let's say), shouldn't they automatically learn any such mathematical transformations?
@seohopa 1 year ago ⁺¹
czcams.com/video/3HzgUx9jdy8/video.html
This is about making a spectrogram with the ChatGPT interpreter.
@pranavsingh1081 3 years ago
what is this vanilla spectrogram?
@ValerioVelardoTheSoundofAI 3 years ago ⁺¹
It's just the "basic" spectrogram without any manipulation (e.g., applying log, transforming amplitude to dBs).
@pranavsingh1081 3 years ago
@@ValerioVelardoTheSoundofAI thank u so much
@barbara-su 3 months ago
Very good video, love from China
@berankilic 2 years ago
You are like watching chess videos. And I like chess xd
@ValerioVelardoTheSoundofAI 2 years ago ⁺¹
I love it too!
@HorrorArmor 2 days ago
Moore Kevin Walker Paul Young Kimberly
@oguzynx 2 years ago
what da f are mel bands..... dude, do not confuse us..
@mdevelde 3 years ago
Wrong explanation with many errors. You clearly have no real understanding of what you're talking about.
First of all. Everybody knows since ancient times we perceive frequency mostly logarithmic. For instance octaves / musical intervals / musical instrument tuning etc are based on this.
So the question is not how the Mel scale (a recent invention) differs from linear frequency but how it differs from logarithmic frequency. So your whole video is nonsense and fails to explain the actual difference between the Mel scale and the logarithmic scale.
And many other errors in explaining things and choice of filterbank type etc etc.
@ValerioVelardoTheSoundofAI 3 years ago ⁺⁴
I'll wait for your explanation to learn more.
@mdevelde 3 years ago ⁺¹
@@ValerioVelardoTheSoundofAI Too large a list to respond to here. But a simple look at the Mel scale wikipedia page should inform you.
As for musical intervals they are based on a division of octaves. Octaves are 2/1 ratio, so 100Hz - 200Hz - 400Hz - 800Hz - etc. A logarithmic scale. Again, as I already said, one should compare the Mel scale to a logarithmic scale not to a linear scale.
And further, number of filterbands are not just randomly chosen they have good reason. It has to do with ringing of the filters or in other words you cannot zoom in on a narrow frequency band without introducing errors in other ways namely amplitude and time. It always works like this it is the law of nature there's no getting around it. And the choice of triangular filters is a particularly poor and naïve one but understandable as many examples have been written using them.
One more thing about the Mel scale. It's likely not a great model for equidistance hearing. Errors were made in the studies when inventing it over 50 years ago. But again, understandable to use it.
And apologies for the unfriendly tone of my previous message. I just read it back and could have written it in another way. I was a bit tired and grumpy.
@ValerioVelardoTheSoundofAI 3 years ago ⁺¹¹
@@mdevelde I'll avoid commenting on your smug attitude. It speaks volumes by itself.
I don't see how the "arguments" you raise clash with the content of the video. What superior power ordered that we should "compare the Mel scale to a logarithmic scale not to a linear scale"? Also, what does this mean? The Mel scale IS a logarithmic scale. Or, do you think that applying a few scaling factors to a logarithm (as in the case of the Mel scale) modifies the nature of the logarithm? If you're referring to the difference between the Mel scale and a log2 function, of course I could have shown that. However, people are usually familiar with linear scales, and they probably have an easier time appreciating the difference between a linear scale and the Mel scale, than they have between the latter and a log2 function. BTW, thank you very much for letting me know about the 2/1 octave ratio. In my 25+ years of study in music and my PhD I never encountered this information. Have you thought of publishing this revolutionary result? Oh wait... I mentioned this revolutionary property in a previous video in the series.
Your comment regarding the number of filter bands makes little sense in the context of this video. I'm not sure what's your background, but in AI audio we use a wide array of filter bands (from as little as 40, to as much as 128+), depending on what works best for the problem at hand.
I've read papers that suggest that errors were made while working on the experiments for the Mel scale. I'm also aware that triangular filters are not ideal. Nonetheless, Mel spectrograms are used in Machine Learning these days and achieve state-of-the art results in several audio classification problems. This is why I introduced this feature in this series (Audio Signal Processing for ML). I'm not sure if this is clear, but this video approaches the Mel scale from the perspective of machine learning and audio processing, not music cognition.
