[ICASSP 2018] Google's D-Vector System: Generalized End-to-End Loss for Speaker Verification

  • Published 13. 09. 2024

Comments • 36

  • @aliacar4200 • 4 years ago (+1)

    Great work, and thank you for sharing it as a video!! Such helpful content.

  • @QuanWang • 2 years ago (+2)

    After years of preparation, I'm excited to share that my online course on Speaker Recognition is now open for enrollment on Udemy: www.udemy.com/course/speaker-recognition/?referralCode=1914766AF241CE15D19A
    There is also this Udemy online course on Speaker Diarization: www.udemy.com/course/diarization/?referralCode=21D7CC0AEABB7FE3680F
    Please contact me if you need a coupon. Looking forward to seeing you in the lectures!

  • @taimuribrahim2146 • 4 years ago

    This is really interesting! I'm pretty new to deep learning and am probably in over my head, so pardon the simple question: after training, can the model verify a completely new user's utterances, or does it need to be retrained on the new user's data as well? Is enrollment just another way of saying retraining? I would appreciate any clarification.

  • @CorentinJemine • 5 years ago (+1)

    Is there any reason you do not L2-normalize the speaker embeddings, as opposed to the utterance embeddings? Is it because it doesn't matter when computing the cosine similarity?
    Also, when you say that you group batches by segment lengths, you still compute the similarity matrix from utterances of the same length (1.6s), right?

    • @QuanWang • 5 years ago (+1)

      "Is it because it doesn't matter when computing the cosine similarity?" - Yes, exactly. Actually in many cases we DO L2 normalize the embeddings. But for verification, cosine similarity has L2 normalization as part of the process. So it does not matter.
      "from utterances of the same length (1.6s), right?" - These are not `utterances` of 1.6s. These are randomly positioned segments that we extracted from usually longer utterances. And these segments are all 1.6s. (in some settings the batch size ranges from a smaller length to a larger length, but each individual batch has all segments of same length)

    • @CorentinJemine • 5 years ago

      Thank you. I asked the first question because in "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" you condition Tacotron on speaker embeddings during training, but at inference time on utterance embeddings instead. If speaker embeddings are not L2-normalized during the training, don't you think it could cause a problem? Also, have you considered training Tacotron on utterance embeddings instead?

    • @QuanWang • 5 years ago

      @@CorentinJemine Hi, Tacotron IS conditioned on utterance embeddings. We compute segment embeddings from sliding windows of the utterance, then L2-normalize and average them to form a single embedding for the entire utterance.

    • @QuanWang • 5 years ago

      @@CorentinJemine I do not remember whether we did a final L2 normalization on the entire utterance embedding. My impression is that the impact was small.
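
A minimal sketch of the sliding-window aggregation described in the two replies above. The `segment_embedding` callable is a hypothetical stand-in for the trained encoder, and the window/step sizes are placeholders (a 1.6 s window is roughly 160 frames at a 10 ms frame hop); the final L2 normalization is left optional, matching the reply that its impact was small.

```python
import numpy as np

def utterance_dvector(frames, segment_embedding, window=160, step=80,
                      final_l2_norm=False):
    """Aggregate sliding-window d-vectors into one utterance embedding.

    frames: [num_frames, num_features] acoustic features of one utterance.
    segment_embedding: hypothetical callable mapping a [window, num_features]
    slice to a fixed-size d-vector (stand-in for the trained encoder).
    """
    window_vecs = []
    for start in range(0, max(len(frames) - window, 0) + 1, step):
        vec = segment_embedding(frames[start:start + window])
        window_vecs.append(vec / np.linalg.norm(vec))   # L2-normalize each window
    utterance_vec = np.mean(window_vecs, axis=0)        # average the window d-vectors
    if final_l2_norm:                                   # optional, per the reply above
        utterance_vec = utterance_vec / np.linalg.norm(utterance_vec)
    return utterance_vec

# Example with a dummy encoder that just averages the frames of each window:
rng = np.random.default_rng(0)
frames = rng.normal(size=(400, 40))                     # placeholder log-mel features
dvec = utterance_dvector(frames, lambda seg: seg.mean(axis=0))
```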

    • @CorentinJemine • 5 years ago

      From my notes summarizing TE2E, GE2E and SV2TTS ("Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis"), I have that
      - Partial utterance (1.6s) embeddings are normalized
      - Utterance embeddings (from several partial utterances) are also normalized
      - Speaker embeddings are not normalized
      And in section 2.2 of SV2TTS it says "An embedding vector for the target speaker is concatenated with the synthesizer encoder output at each time step". But I agree with you, it probably does not have much of an impact, although with a lot of utterances the average embedding can differ quite a bit from its normalized form.
      By the way, I'd like to show you my projections of the embeddings computed during the training of the model: i.imgur.com/vQ8suSX.png. This is on LibriSpeech+Voxceleb1+Voxceleb2. I am very impressed by the clustering, even on unseen speakers. I really find the entire approach elegant.

  • @salmasalem7338 • 2 years ago

    How is the EER calculated?
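
The question above goes unanswered in the thread, so for general background only (not the authors' evaluation code): the equal error rate is the operating point where the false accept rate equals the false reject rate. A minimal sketch over a list of trial scores and labels:

```python
import numpy as np

def compute_eer(scores, labels):
    """EER from verification trial scores (higher score = more likely same speaker).

    labels: 1 for target (same-speaker) trials, 0 for impostor trials.
    Generic sketch; real evaluations often interpolate the FAR/FRR curves.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))             # sweep candidate thresholds
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))             # closest crossing point
    return (far[idx] + frr[idx]) / 2.0

print(compute_eer([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0]))  # perfectly separated -> 0.0
```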

  • @the_synapse • 4 years ago

    These videos are quite useful, thanks! Is there any python implementation of the d-vector system?

    • @QuanWang • 4 years ago

      No official one. But there are some third-party implementations: github.com/wq2012/awesome-diarization#speaker-embedding

  • @catmkf09 • 5 years ago

    Thanks for the interesting talk! I am currently implementing GE2E with attention (shared weights, tanh) as discussed in your paper (Attention-Based Models for Text-Dependent Speaker Verification), and it showed a huge improvement on VoxCeleb data over using GE2E alone.
    I am wondering if you have any insights/suggestions on how to improve the current performance further. I have tried margin-based methods like AM-Softmax, but they don't seem to work well with GE2E. Looking forward to new papers on GE2E if your team is still working on them!
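
For readers curious what the "shared weights, tanh" attention mentioned above might look like, here is a rough numpy sketch of shared-parameter non-linear attention pooling over frame-level LSTM outputs: one scalar score per frame, softmaxed into weights, with the weighted average replacing the last-frame d-vector. This is only my own reading of that setup; parameter shapes and names are illustrative, not the commenter's or the authors' code.

```python
import numpy as np

def attention_pool(frame_outputs, W, b, v):
    """Shared-parameter non-linear (tanh) attention pooling.

    frame_outputs: [T, D] LSTM outputs for one segment.
    W: [D_att, D], b: [D_att], v: [D_att] are parameters shared across frames
    (learned in practice; random below only to make the sketch runnable).
    """
    scores = np.tanh(frame_outputs @ W.T + b) @ v    # one scalar score per frame
    weights = np.exp(scores - scores.max())          # numerically stable softmax
    weights = weights / weights.sum()
    return weights @ frame_outputs                   # [D] pooled segment embedding

rng = np.random.default_rng(0)
T, D, D_att = 180, 256, 64                           # placeholder sizes
pooled = attention_pool(rng.normal(size=(T, D)), rng.normal(size=(D_att, D)),
                        np.zeros(D_att), rng.normal(size=D_att))
```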

    • @QuanWang • 5 years ago

      Thanks for your interest! There have been many new efforts on improving the training, but we are not allowed to disclose the work before it is approved for publication. I will share the work once we get approval.

    • @catmkf09 • 5 years ago

      @@QuanWang Thanks for the quick response! Wish you guys all the best in developing new methods!

    • @sumukbadam7664 • 4 years ago

      catmkf09, could you share your results on the VoxCeleb dataset for GE2E and GE2E with attention?

  • @Scranny • 5 years ago

    Hi Quan, thanks for making this video series, it's extremely educational! Question about 10:08: can you elaborate on how you arrived at the equation for the probability P? (I already understand p1 and p2.) Thanks!

    • @QuanWang • 5 years ago (+2)

      Thanks for the nice words! In that equation, P is the probability that at least two speakers are from the smaller dataset. So it is equal to: 1 - P(all N speakers from the larger dataset) - P(one from the smaller and N-1 others from the larger). Does that make sense?
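
A small worked sketch of that reasoning, assuming each of the N speakers in a batch is drawn independently and comes from the smaller dataset with probability q (my notation; the video's p1/p2 may be defined slightly differently). Then P = 1 - (1-q)^N - N*q*(1-q)^(N-1), which the Monte Carlo check below confirms for placeholder values of N and q.

```python
import random

def prob_at_least_two_from_smaller(N, q):
    """P(at least 2 of N independently drawn speakers come from the smaller set)."""
    p_none = (1 - q) ** N                       # all N from the larger dataset
    p_exactly_one = N * q * (1 - q) ** (N - 1)  # one from smaller, N-1 from larger
    return 1 - p_none - p_exactly_one

# Monte Carlo sanity check with placeholder values.
N, q, trials = 8, 0.2, 200_000
hits = sum(sum(random.random() < q for _ in range(N)) >= 2 for _ in range(trials))
print(prob_at_least_two_from_smaller(N, q), hits / trials)  # should be close
```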

    • @Scranny • 5 years ago

      @@QuanWang Yes, that was very clear. Thank you! Another, more theoretical question if you don't mind. My goal is to identify the same speakers across a set of multiple recordings in different environments (e.g. quiet, noisy) with a varying number of speakers in each recording. In general, can I assume that the d-vectors I calculate for a specific speaker in recording #1 (quiet) will be similar to the same speaker's d-vectors in recording #2 (noisy), or will they occupy different areas of the vector space (and therefore be dissimilar)? In other words, how do the noise level and the number of speakers in a single recording affect d-vector similarity across multiple recordings? Thanks!

    • @QuanWang • 5 years ago (+1)

      @@Scranny Different acoustic environments do make it more difficult. The way we handle it is to cover different acoustic environments in the training data as well. We also have a process that randomly adds some noise to each training utterance, which further mitigates the problem. As for the number of speakers in each recording, it should always be one. The multi-speaker problem should NOT be solved by speaker recognition - it should be solved by speaker diarization or source separation.
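
A minimal sketch of the kind of additive-noise augmentation mentioned above, mixing a noise clip into a training utterance at a randomly chosen signal-to-noise ratio. The SNR range, waveforms, and function names here are placeholders, not the actual internal pipeline.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise waveform into a speech waveform at the given SNR (in dB)."""
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]      # match the speech length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that speech_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
utterance = rng.normal(size=16000)                  # placeholder: 1 s of 16 kHz audio
noise_clip = rng.normal(size=8000)                  # placeholder noise recording
augmented = add_noise(utterance, noise_clip, snr_db=rng.uniform(5, 25))
```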

    • @Scranny • 5 years ago

      @@QuanWang Thanks for your replies and suggestions!

  • @dongseonghwang7870 • 5 years ago

    Nice video! Is there a specific reason why cosine similarity is used instead of Euclidean distance? FaceNet uses Euclidean distance, though.

    • @QuanWang • 5 years ago

      Cosine similarity is more common for speaker recognition systems. It's also easier to apply a threshold since the value is always between -1 and 1.

    • @dongseonghwang7870 • 5 years ago

      @@QuanWang Thank you for the quick answer. I didn't expect it this quick :)
      In addition, the embedding vector contains both text information and speaker information. Even though the loss function encourages the text information to be suppressed in the d-vector, I feel there is a better way to extract only the speaker information from the embedding, for example, a Gram matrix as in CNN style transfer. What do you think?

    • @QuanWang • 5 years ago (+1)

      @@dongseonghwang7870 As long as the training data cover a sufficiently large range of content, the text information shouldn't be a problem. The style transfer idea may also help. We didn't try that.

  • @imranparuk5580 • 5 years ago (+1)

    Is contrast loss the same as contrastive loss?

    • @QuanWang • 5 years ago

      Kind of a similar idea.

    • @imranparuk5580 • 5 years ago

      @@QuanWang are you the author of the paper?

    • @QuanWang • 5 years ago

      @@imranparuk5580 Yes I'm one of the authors. If you have more questions feel free to email us.

    • @imranparuk5580 • 5 years ago

      @@QuanWang Thank you, I'll do so.

  • @meghashreebhattacharya7376

    I was the 100th like!

  • @法外狂徒张三-x5y

    Could you please share your slides?

    • @法外狂徒张三-x5y • 4 years ago

      Ha, found it: google.github.io/speaker-id/publications/GE2E/resources/ICASSP%202018%20GE2E.pptx

    • @QuanWang • 4 years ago

      Please download it from the official site: google.github.io/speaker-id/publications/GE2E/

    • @法外狂徒张三-x5y • 4 years ago

      @@QuanWang Thank you very much!