Deep Learning for Speech Recognition (Adam Coates, Baidu)

  • Added 13. 09. 2024

Comments • 28

  • @muditjain7667
    @muditjain7667 7 years ago +14

    Very comprehensive overview of speech recognition!

  • @fbrault7
    @fbrault7 7 years ago +40

    47:58 guy caressing his friend's head

    • @evil1717
      @evil1717 7 years ago +3

      lmao

    • @nutelina
      @nutelina 6 years ago

      You have not been paying attention to the talk. -points! ;)

  • @95guignol
    @95guignol 7 years ago +20

    Andrew Ng first row

    • @AnirbanKar4294
      @AnirbanKar4294 4 years ago

      I was about to ask that in the comments

    • @susmitislam1910
      @susmitislam1910 3 years ago

      He himself gave one of the previous lectures that day, so I would've been surprised if he weren't.

  • @deepakbabupr1173
    @deepakbabupr1173 5 years ago

    A good overview of DL-based speech recognition. It's ironic that the machine transcription of this video keeps decoding "Baidu" as "I do". Where's Google's CTC?

  • @taufiquzzamanpeyash6008

    Decoding Techniques start at 41:34

  • @dianaamiri9520
    @dianaamiri9520 4 years ago

    Thank you for sharing this video. It helped me grasp the whole idea quickly.

  • @nutelina
    @nutelina 6 years ago +1

    Wow, what a great talk; a little light on the math explanations, as is typical of American universities, but great overall, well done. Thank you.

  • @giannagiavelli5098
    @giannagiavelli5098 7 years ago +2

    At Noonean we use holographic recognition, which takes just a few minutes to train with a 2-GPU, 50 TFLOP Noonean cube and gives us one degree of recognition. We can encode a few hundred features into a single holographic plane (for vision it's 4k x 3k x 200; for speech it's a similar number, but the dimensions are more square). That still eats up 2.5 billion neurons, and for depth in vision we use 2, so that's 5 billion of our 8 billion possible neurons. So doing just vision or speech eats up one full cube's processing power today.

    We use about five machines to have dozens of holographic planes for shapes, texture, hair, furniture, animals, etc., but for speech just 100 teraflops works, or 2 machines: one for about a dozen trained speech holograms and one for the ontic processor. So we might take the word "Hello" and train it with 1000 different native speakers saying it, positive reinforcement with holographic reinforcement with diminishment. Training takes about five minutes. Then we work on 5000 words for basic English, including as parts of sentence fragments, and we get training done in about a day, but it's actually just a few hours of actual compute time. The thought of having to use 60-GPU clusters to achieve a week of training time is just ridiculous and backwards. If we had a Noonean supercube of 64 cubes delivering 3 PFLOPS, our training time would be milliseconds. This works for visual or speech features.

    A second optimization with our ontic (concept/language) processor fixes the porkchop/portshop issues. As the ontic relationships are pre-created, getting a score between options is very fast. However, a 1000-watt desktop machine is still far too big and heat-generating for embedded android brains, so we still struggle, knowing the technology to build full cybertrons is at least 8 years off unless the low-watt synapse-type hardware scales better (maybe the new Google breakaway team can help us get it!).

    Holographic recognition and its peculiar reinforcement patterns work especially well for vision problems but also apply to audio if you think of spatial distortion and bi-aural classification of sounds. Our hope is that the 10-15 Noonean cubes it would take for vision, speech, and thought will in 8 years become one large desktop machine, and in another few years become a small embeddable machine. Our standard Noonean cube, which is not for vision, is a 2k^3, 8 billion neural unit, fully interconnected. We use both neural Darwinism and dynamic new-synapse association creation on proximal excited areas. So it is more cognitive-science brain modeling based than machine learning CNN based.

  • @chackothomas8757
    @chackothomas8757 4 years ago

    Are you not using any windowing (Hamming, Hanning, etc.) on the speech frames to smooth them before calculating the spectrogram?
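
On the windowing question above: applying a tapered window (Hamming, Hann, etc.) to each overlapping frame before the FFT is the standard way to reduce spectral leakage in a spectrogram front end. Below is a minimal sketch of that step, assuming NumPy and a typical 25 ms / 10 ms framing at 16 kHz; the talk's exact settings may differ, and the function name and parameters are only illustrative.

    import numpy as np

    def spectrogram(signal, frame_len=400, hop=160, window=np.hamming):
        """Log-magnitude spectrogram from overlapping, windowed frames.

        frame_len=400 and hop=160 correspond to 25 ms / 10 ms at 16 kHz,
        a common choice; the settings used in the talk may differ.
        """
        win = window(frame_len)                       # taper each frame to reduce spectral leakage
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[i * hop: i * hop + frame_len] * win
                           for i in range(n_frames)])
        spec = np.abs(np.fft.rfft(frames, axis=1))    # magnitude spectrum per frame
        return np.log(spec + 1e-8)                    # log compression

    # Toy usage: one second of noise at 16 kHz.
    x = np.random.randn(16000)
    print(spectrogram(x).shape)                       # (n_frames, frame_len // 2 + 1)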

  • @srijonishampriti3473
    @srijonishampriti3473 4 years ago

    What is the basic difference between Deep Speech and Deep Speech 2 in terms of model architecture?
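
On the question above: roughly, Deep Speech (2014) feeds spectrogram frames through a few fully connected layers with a clipped ReLU and a single bidirectional recurrent layer before the CTC output, while Deep Speech 2 (2015) adds a convolutional front end and a much deeper stack of bidirectional (GRU) recurrent layers with batch normalization, and is trained on both English and Mandarin. The sketch below is a loose reading of the two papers rather than anything stated in this talk; layer counts, sizes, and nonlinearities are illustrative only (assumes PyTorch).

    import torch
    import torch.nn as nn

    class DeepSpeech1(nn.Module):
        """Deep Speech (2014), roughly: a few fully connected layers with a
        clipped ReLU, one bidirectional recurrent layer, then the CTC output."""
        def __init__(self, n_feats=161, hidden=1024, n_chars=29):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(n_feats, hidden), nn.Hardtanh(0, 20),   # clipped ReLU
                nn.Linear(hidden, hidden), nn.Hardtanh(0, 20),
                nn.Linear(hidden, hidden), nn.Hardtanh(0, 20),
            )
            self.birnn = nn.RNN(hidden, hidden, bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, n_chars)             # CTC loss applied outside

        def forward(self, x):                                     # x: (batch, time, n_feats)
            h = self.fc(x)
            h, _ = self.birnn(h)
            return self.out(h)

    class DeepSpeech2(nn.Module):
        """Deep Speech 2 (2015), roughly: a convolutional front end, then a deep
        stack of bidirectional (GRU) recurrent layers, then the CTC output."""
        def __init__(self, n_feats=161, hidden=800, n_chars=29, n_rnn=5):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
                nn.BatchNorm2d(32), nn.Hardtanh(0, 20),
            )
            conv_feats = 32 * ((n_feats + 1) // 2)
            self.rnns = nn.ModuleList(
                nn.GRU(conv_feats if i == 0 else 2 * hidden, hidden,
                       bidirectional=True, batch_first=True)
                for i in range(n_rnn))
            self.out = nn.Linear(2 * hidden, n_chars)

        def forward(self, x):                                     # x: (batch, time, n_feats)
            h = self.conv(x.unsqueeze(1))                         # (batch, 32, time', feats')
            b, c, t, f = h.shape
            h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
            for rnn in self.rnns:
                h, _ = rnn(h)
            return self.out(h)

    # Toy usage: batch of 2 utterances, 100 frames, 161 spectrogram bins.
    x = torch.randn(2, 100, 161)
    print(DeepSpeech1()(x).shape, DeepSpeech2()(x).shape)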

  • @diyuanlu6107
    @diyuanlu6107 7 years ago

    Around the 33-minute mark, the speaker talks about summing over all the possible alignments "c" given one input sequence "x" to get the final P(y|x). How do you get all the possible "c"s? Isn't it the case that at every time step the softmax layer outputs a probability distribution over all possible characters, so after "t" steps you get an output matrix "O" of size t x 27 (26 letters + blank)? If you take argmax(O, axis=1), you only get the single most probable transcription sequence "c". How can you get all the possible "c"s?

    • @opinoynated
      @opinoynated 6 years ago

      I think, as stated in the paper, during training you can programmatically generate the possible c's by first figuring out all possible alignments and then generating the possibilities from those alignments (www.cs.toronto.edu/~fritz/absps/RNN13.pdf). (See the toy enumeration sketch below this thread.)
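
On the alignment question in the thread above: CTC defines P(y|x) as a sum over every frame-level alignment c that collapses to y (merge repeated symbols, then drop blanks); the per-frame softmax probabilities are multiplied along each alignment and the products are summed. Real implementations compute this with the CTC forward-backward dynamic program rather than by enumeration, but a brute-force toy makes the marginalization concrete. Everything below (the 3-symbol alphabet, the made-up softmax matrix, the function names) is illustrative only.

    import itertools
    import numpy as np

    BLANK = "_"

    def collapse(alignment):
        """CTC collapse: merge repeated symbols, then drop blanks."""
        out = []
        for ch in alignment:
            if out and ch == out[-1]:
                continue
            out.append(ch)
        return "".join(c for c in out if c != BLANK)

    def ctc_prob_bruteforce(softmax, alphabet, target):
        """P(target | x): sum over every length-T alignment that collapses to target.

        softmax is the (T, |alphabet|) matrix of per-frame output distributions
        (the "O" matrix in the question, here with a 3-symbol toy alphabet).
        Exponential in T -- only for tiny examples; real CTC implementations use
        the forward-backward dynamic program instead of enumeration.
        """
        T = softmax.shape[0]
        total = 0.0
        for align in itertools.product(range(len(alphabet)), repeat=T):
            if collapse(alphabet[i] for i in align) == target:
                total += np.prod([softmax[t, i] for t, i in enumerate(align)])
        return total

    # Toy example: 4 frames, alphabet {blank, 'h', 'i'}, target "hi".
    alphabet = [BLANK, "h", "i"]
    softmax = np.array([[0.1, 0.8, 0.1],
                        [0.6, 0.3, 0.1],
                        [0.2, 0.1, 0.7],
                        [0.7, 0.1, 0.2]])
    print(ctc_prob_bruteforce(softmax, alphabet, "hi"))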

  • @fanjerry8100
    @fanjerry8100 7 years ago

    So, are the closed captions of this video auto-generated or manually generated?

  • @siddharthkotwal8823
    @siddharthkotwal8823 7 years ago +8

    Andrew Ng walked in late at 0:13

  • @fedsummer90
    @fedsummer90 5 years ago +1

    Thanks!!!

  • @806aman
    @806aman 2 years ago

    Hi Lex,
    Can you help me with an English-to-Mandarin dataset?

  • @deepamohan3157
    @deepamohan3157 7 years ago +1

    Is the architecture similar for keyword spotting, where the input is a text query? Does this work for Indian languages?

  • @megadebtification
    @megadebtification 7 years ago +3

    Ghost at 47:57-47:58, 3rd row on the left, in front of the mic

  • @dragonnaturallyspeakingsup8959

    nice...

  • @kenichimori8533
    @kenichimori8533 7 years ago +1

    Compophorism0

  • @arunnambiar2315
    @arunnambiar2315 4 years ago

    Speak louder next time

  • @arjunsinghyadav4273
    @arjunsinghyadav4273 1 year ago +1

    Going back to this problem today: yes, LLMs have solved this.
    me: transcribe this for me please
    hhhhhhhheeee. lllll. ooooo. iiiii
    chatGPT: "Hello."
    me: transcribe this for me please
    primi miniter nerner modi
    chatGPT: "Prime Minister Narendra Modi."