Accidentally Training Tortoise TTS on Crappy Audio Data

  • Published 6 Sep 2024
  • Hardware for my PC:
    Graphics Card - amzn.to/3pcREux
    CPU - amzn.to/43O66Ir
    Cooler - amzn.to/3p98TwX
    RAM - amzn.to/3NBAsIq
    SSD Storage - amzn.to/42NgMFR
    Power Supply (PSU) - amzn.to/430bIhy
    PC Case - amzn.to/447499T
    Motherboard - amzn.to/3CziMXI
    Alternative prebuilds to my PC:
    Corsair Vengeance i7400 - amzn.to/3p64r22
    MSI MPG Velox - amzn.to/42MnJHl
    Cheapest recommended PC:
    Cyberpower 3060 - amzn.to/3XjtZoP
    Come join The Learning Journey!
    Discord - / discord
    Github - github.com/Jar...
    TikTok - / jarodsjourney
    If you found anything helpful, please consider supporting me and the content I am trying to produce!
    www.buymeacoff...

Comments • 18

  • @GraveUypo • 6 months ago +5

    the famous "garbage in, garbage out" saying rearing its ugly head

  • @dougmaisner • 6 months ago +5

    happens to the best of us

  • @nbase2652 • 5 months ago +1

    Have you thought about using IRs to bring back some depth when the dataset is too sterile or thin-sounding after all that UVR stuff? Still won't turn garbage into gold, but if high quality audio just isn't available for whatever reason, you can at least make it a bit better, and even smooth out those dead cutoffs a bit.
    (Impulse Responses are small .wav files that basically capture the characteristics of a recording environment. They're usually used for "convolution reverb", but their uses go way beyond just reverb: an IR can be the frequency response of a certain microphone, how a guitar amp cabinet sounds when recorded from the back, and so on. I recall using a short damp hit on the body of an acoustic guitar to fatten up thin vocals, for example.)
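    A minimal sketch of that idea, applying an IR by convolution with scipy and soundfile (the file names are placeholders, and the wet/dry mix is just a starting point to tweak by ear):

        import numpy as np
        import soundfile as sf
        from scipy.signal import fftconvolve

        # Load the dry (post-UVR) clip and an impulse response at the same sample rate.
        dry, sr = sf.read("voice_clip.wav")
        ir, ir_sr = sf.read("room_ir.wav")
        assert sr == ir_sr, "resample the IR to match the clip first"
        if dry.ndim > 1: dry = dry.mean(axis=1)   # fold to mono for simplicity
        if ir.ndim > 1: ir = ir.mean(axis=1)

        # Convolve with the IR and trim back to the original length.
        wet = fftconvolve(dry, ir)[: len(dry)]

        # Blend a little of the wet signal back in and normalize to avoid clipping.
        mix = 0.85 * dry + 0.15 * wet
        mix /= max(1e-9, float(np.abs(mix).max()))
        sf.write("voice_clip_ir.wav", mix, sr)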

  • @RobAgrees • 5 months ago +1

    Hey Jarod! Been following your channel for a while since you always have the best AI voice content. I'm curious what you think is the highest-fidelity voice clone repo out there currently? Is it still mrq's Tortoise TTS fork?

  • @lo7o7xenpai76 • 3 months ago

    Sounds better than my first model. I just threw a bunch of audio into Colab and it was shit.

  • @shovonjamali7854 • 5 months ago

    How did you segment the audio for Vietnamese? That language isn't supported in whisperx, and I believe you're using whisperx for the segmentation here?

    • @Jarods_Journey • 5 months ago

      You can run whisperx with --no_align. Vietnamese just isn't supported well enough to have an alignment model, but Whisper itself supports the language.
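      A rough sketch of the same thing through the whisperx Python API (model name, device, and file name are assumptions; the point is simply to skip the alignment step, which is what --no_align does):

          import whisperx

          model = whisperx.load_model("large-v2", device="cuda", compute_type="float16")
          audio = whisperx.load_audio("vietnamese_clip.wav")
          result = model.transcribe(audio, language="vi")

          # No whisperx.load_align_model / whisperx.align call here; just use the
          # coarse segment timestamps Whisper itself produces to cut the dataset.
          for seg in result["segments"]:
              print(seg["start"], seg["end"], seg["text"])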

  • @StringerBell • 6 months ago

    Hey, Jarod. I've been failing miserably for months to train a Bulgarian voice model for Tortoise TTS. I have absolutely no issue training RVC models with great success, but for some reason my TTS models are borderline unusable, no matter what I try. My dataset consists of studio-quality voice recordings, so quality is not the issue.
    Is there any way to hire you for a consultation to help me out? Thanks!

  • @Zegur • 6 months ago

    I need some help. I trained a model today on an i9 9900K and an RTX 2070 Super. The training went fine, but actually using the text-to-speech just seems to take ages. I'm trying to do 7 sentences, have been waiting for about 2 hours, and I'm at 4/7, while you seem to get almost instant results.

    • @Jarods_Journey • 6 months ago

      Your samples # is probably too high; reduce that to 2. You may also have too many audio samples it's trying to make latents from: move those audio files to a backup folder, or create a new voice folder and place 2 small audio files there for inference.
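      For reference, a hedged sketch of the same settings with the plain tortoise-tts Python API rather than the web UI (the reference file names are placeholders):

          import torchaudio
          from tortoise.api import TextToSpeech
          from tortoise.utils.audio import load_audio

          tts = TextToSpeech()
          # Only a couple of short reference clips for the conditioning latents.
          voice_samples = [load_audio(p, 22050) for p in ["ref_1.wav", "ref_2.wav"]]

          gen = tts.tts(
              "This is a quick test sentence.",
              voice_samples=voice_samples,
              num_autoregressive_samples=2,   # the "samples" knob; large values are very slow
              diffusion_iterations=30,
          )
          torchaudio.save("out.wav", gen.squeeze(0).cpu(), 24000)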

    • @Zegur • 6 months ago

      @@Jarods_Journey Thanks, I completely missed this step. Hope it works now

  • @user-iv2sp2gl1z • 5 months ago

    Can your method be applied to Chinese?

    • @Jarods_Journey • 5 months ago

      Should be fine, but I'd recommend using Pinyin. The tokenizer isn't wide enough to accept all the Chinese characters.
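      A small sketch of that preprocessing step using the pypinyin package (the sample sentence is arbitrary); the idea is to romanize the transcripts so they stay inside the tokenizer's character set:

          from pypinyin import Style, lazy_pinyin

          text = "你好世界"
          # Tone-numbered Pinyin keeps tonal information in plain ASCII.
          romanized = " ".join(lazy_pinyin(text, style=Style.TONE3))
          print(romanized)  # -> "ni3 hao3 shi4 jie4"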

    • @user-iv2sp2gl1z • 5 months ago

      @@Jarods_Journey Thank you for your reply. A few days ago, I used about 900 hours of Chinese voice data to fine-tune the tortoise-tts model. However, the voice generated by the model ended up with a severe foreign accent, as if a foreigner were speaking Chinese. It's not authentic enough. What could be the reason for this?

    • @user-iv2sp2gl1z • 4 months ago

      @@Jarods_Journey Can you show the mel/text loss on the validation set from when you trained Japanese tortoise-tts on the 840-hour speech corpus? When I train the model on a 720-hour Chinese speech corpus, I see a similar mel/text loss on the training set. However, when I added the counterpart to the validation set, the mel/text loss on the validation set didn't decrease but increased dramatically. Why? Did you observe a similar phenomenon?

  • @mrpokemon517 • 6 months ago

    Anime female girl voice

  • @mrpokemon517 • 6 months ago

    Generate website