Great work, and thank you for sharing it as a video on this medium! Such helpful content.
After years of preparation, I'm excited to share that my online course on Speaker Recognition is now open for enrollment on Udemy: www.udemy.com/course/speaker-recognition/?referralCode=1914766AF241CE15D19A
There is also this Udemy online course on Speaker Diarization: www.udemy.com/course/diarization/?referralCode=21D7CC0AEABB7FE3680F
Please contact me if you need a coupon. Looking forward to seeing you in the lectures!
This is really interesting! I'm pretty new to deep learning and am probably in over my head, so pardon the simple question, but I wanted to know whether, after training, the model can verify a completely new user's utterances, or whether it needs to be retrained on the new user's data as well. Is enrollment just another way of saying retraining? I would appreciate any clarification.
Is there any reason you do not L2-normalize the speaker embeddings, as opposed to the utterance embeddings? Is it because it doesn't matter when computing the cosine similarity?
Also, when you say that you group batches by segment lengths, you still compute the similarity matrix from utterances of the same length (1.6s), right?
"Is it because it doesn't matter when computing the cosine similarity?" - Yes, exactly. Actually in many cases we DO L2 normalize the embeddings. But for verification, cosine similarity has L2 normalization as part of the process. So it does not matter.
"from utterances of the same length (1.6s), right?" - These are not `utterances` of 1.6s. These are randomly positioned segments that we extracted from usually longer utterances. And these segments are all 1.6s. (in some settings the batch size ranges from a smaller length to a larger length, but each individual batch has all segments of same length)
Thank you. I asked the first question because in "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" you condition Tacotron on speaker embeddings during training, but at inference time on utterance embeddings instead. If speaker embeddings are not L2-normalized during the training, don't you think it could cause a problem? Also, have you considered training Tacotron on utterance embeddings instead?
@@CorentinJemine Hi, Tacotron IS conditioned on utterance embeddings. We compute segment embeddings from sliding windows of the utterance, then L2-normalize and average them to form a single embedding for the entire utterance.
@@CorentinJemine I do not remember whether we did a final L2 normalization on the entire utterance embedding. My impression is that the impact was small.
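A rough sketch of that sliding-window scheme, assuming a hypothetical `embed_segment` function that maps one fixed-length window to a d-vector; the window and hop sizes are illustrative, and the final normalization is left optional per the discussion above:

```python
import numpy as np

def utterance_dvector(features, embed_segment, win_frames=160, hop_frames=80,
                      final_normalize=True):
    """Aggregate sliding-window segment embeddings into one utterance embedding.

    features: (num_frames, num_mels) log-mel features of an utterance assumed
    to be at least win_frames long.
    embed_segment: callable mapping a (win_frames, num_mels) window to a 1-D d-vector.
    """
    windows = [features[start:start + win_frames]
               for start in range(0, features.shape[0] - win_frames + 1, hop_frames)]

    seg_embeddings = np.stack([embed_segment(w) for w in windows])
    # L2-normalize each window embedding, then average them.
    seg_embeddings /= np.linalg.norm(seg_embeddings, axis=1, keepdims=True)
    utterance_embedding = seg_embeddings.mean(axis=0)

    if final_normalize:  # optional; per the discussion above, the impact is small
        utterance_embedding /= np.linalg.norm(utterance_embedding)
    return utterance_embedding
```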
From my notes summarizing TE2E, GE2E and SV2TTS ("Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis"), I have that
- Partial utterance (1.6s) embeddings are normalized
- Utterance embeddings (from several partial utterances) are also normalized
- Speaker embeddings are not normalized
And in section 2.2 of SV2TTS it says "An embedding vector for the target speaker is concatenated with the synthesizer encoder output at each time step". But I agree with you, it probably does not have much of an impact, although with a lot of utterances the average embedding can differ quite a bit from its normalized form.
By the way, I'd like to show you my projections of the embeddings computed during the training of the model: i.imgur.com/vQ8suSX.png. This is on LibriSpeech+Voxceleb1+Voxceleb2. I am very impressed by the clustering, even on unseen speakers. I really find the entire approach elegant.
How is the EER calculated?
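For context, a generic sketch of how EER is commonly computed from verification trials (sweep a decision threshold until the false accept rate and false reject rate meet); this is a general recipe, not necessarily the exact evaluation code behind the numbers in the talk:

```python
import numpy as np

def compute_eer(scores, labels):
    """Approximate equal error rate.

    scores: similarity scores (higher means 'same speaker').
    labels: 1 for target (same-speaker) trials, 0 for impostor trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[labels == 0])       # impostors falsely accepted
        frr = np.mean(~accept[labels == 1])      # targets falsely rejected
        if abs(far - frr) < best_gap:            # keep the threshold where FAR ~= FRR
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```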
These videos are quite useful, thanks! Is there any python implementation of the d-vector system?
No official one. But there are some third-party implementations: github.com/wq2012/awesome-diarization#speaker-embedding
Thanks for the interesting talk! I am currently implementing GE2E with attention (shared weights, tanh) as discussed in your paper (Attention-Based Models for Text-Dependent Speaker Verification), and it showed a huge improvement on VoxCeleb data over using GE2E alone.
I am wondering if you have any insights/suggestions on how to improve the performance further. I have tried large-margin methods like AM-Softmax, but they don't seem to work well with GE2E. Looking forward to new papers on GE2E if your team is still working on them!
Thanks for your interest! There have been many new efforts on improving the training, but we are not allowed to disclose the work before it is approved for publication. I will share the work once we get approved.
@@QuanWang Thanks for the quick response! Wish you guys all the best in developing new methods!
catmkf09 could you share your results on the VoxCeleb dataset for GE2E and GE2E with attention?
Hi Quan, thanks for making this video series, it's extremely educational! Question at 10:08: can you elaborate on how you arrived at the equation for probability P? (I already understand what p1 and p2 are.) Thanks!
Thanks for the nice words! In the equation, P is the probability that at least two speakers are from the smaller dataset. So it is equal to: 1 - P(all N speakers from the larger dataset) - P(one from the smaller dataset and the N-1 others from the larger dataset). Does that make sense?
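A small sketch of that calculation, under the assumption (implied by the reply above) that each of the N speakers in a batch is drawn independently, coming from the smaller dataset with probability p2 and from the larger one with p1 = 1 - p2; the example numbers are hypothetical:

```python
def prob_at_least_two_from_smaller(N, p2):
    """P(at least two of N independently sampled speakers are from the smaller dataset)."""
    p1 = 1.0 - p2
    p_all_larger = p1 ** N                  # all N speakers from the larger dataset
    p_one_smaller = N * p2 * p1 ** (N - 1)  # exactly one speaker from the smaller dataset
    return 1.0 - p_all_larger - p_one_smaller

# Example: with N = 64 speakers per batch and p2 = 0.01,
# prob_at_least_two_from_smaller(64, 0.01) is roughly 0.13.
```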
@@QuanWang Yes, that was very clear. Thank you! Another, more theoretical question if you don't mind. My goal is to identify the same speakers across a set of multiple recordings in different environments (e.g. quiet, noisy) with a varying number of speakers in each recording. In general, can I assume that the d-vectors I calculate for a specific speaker in recording #1 (quiet) will be similar to the same speaker's d-vectors in recording #2 (noisy), or will they occupy different areas in vector space (and therefore be dissimilar)? In other words, how do the noise or the number of speakers in a single recording affect d-vector similarity across multiple recordings? Thanks!
@@Scranny Different acoustic environments do make it more difficult. The way we handle it is to cover different acoustic environments in the training data as well. We also have a process that randomly adds some noise to each training utterance, which further mitigates the problem. As for the quantity of speakers in each recording, it should always be one. The multi-speaker problem should NOT be solved by speaker recognition - it should be solved by speaker diarization or source separation.
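A minimal sketch of that kind of additive-noise augmentation, mixing a random noise clip into each training utterance at a random signal-to-noise ratio; the SNR range and helper signature are illustrative assumptions, not the authors' exact recipe:

```python
import numpy as np

def add_noise(speech, noise, snr_db_range=(5.0, 25.0)):
    """Mix a noise waveform into a speech waveform at a random SNR.

    speech, noise: 1-D float arrays at the same sample rate; the noise is
    tiled/cropped to match the speech length. The SNR range is an illustrative choice.
    """
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    snr_db = np.random.uniform(*snr_db_range)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that speech_power / scaled_noise_power matches the target SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```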
@@QuanWang Thanks for your replies and suggestions!
Nice video! Is there a specific reason why cosine similarity is used instead of euclidean distance? FaceNet uses euclidean distance tho.
Cosine similarity is more common for speaker recognition systems. It's also easier to apply a threshold since the value is always between -1 and 1.
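A small illustration of why cosine similarity is convenient: L2 normalization is built into the score, the value is always in [-1, 1], and for unit-norm embeddings it carries the same information as squared Euclidean distance. This is a generic sketch, not the scoring code from the paper:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity; L2 normalization is built into the formula,
    so the result is always in [-1, 1] regardless of the vectors' norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.random.randn(256)
b = np.random.randn(256)
cos = cosine_similarity(a, b)

# For unit-norm vectors, squared Euclidean distance is a simple function of cosine:
# ||a_hat - b_hat||^2 = 2 - 2 * cos(a, b)
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
print(np.allclose(np.sum((a_hat - b_hat) ** 2), 2 - 2 * cos))  # True
```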
@@QuanWang Thank you for the quick answer. I didn't expect it this quick :)
In addition, the embedding vector contains both text information and speaker information. Even though the loss function encourages the text information to degenerate in the d-vector, I feel there is a better way to extract only the speaker information from the embedding, for example a Gram matrix as in CNN style transfer. What do you think?
@@dongseonghwang7870 As long as the training data cover a sufficiently large range of content, the text information shouldn't be a problem. The style transfer idea may also help. We didn't try that.
Is contrast loss the same as contrastive loss?
Kind of a similar idea.
@@QuanWang are you the author of the paper?
@@imranparuk5580 Yes, I'm one of the authors. If you have more questions, feel free to email us.
@@QuanWang Thank you, I'll do so.
I gave the 100th like!
Could you share your slides?
Ha, found them. google.github.io/speaker-id/publications/GE2E/resources/ICASSP%202018%20GE2E.pptx
Please download them from the official site: google.github.io/speaker-id/publications/GE2E/
@@QuanWang Thank you very much!