3 Seconds of Audio Can Clone Any Voice - Speech Editing with VoiceCraft

  • Added on 6 September 2024
  • Links referenced in the video:
    VoiceCraft Demo - jasonppy.githu...
    VoiceCraft Github - github.com/jas...
    Hardware for my PC:
    Graphics Card - amzn.to/3pcREux
    CPU - amzn.to/43O66Ir
    Cooler - amzn.to/3p98TwX
    RAM - amzn.to/3NBAsIq
    SSD Storage - amzn.to/42NgMFR
    Power Supply (PSU) - amzn.to/430bIhy
    PC Case - amzn.to/447499T
    Motherboard - amzn.to/3CziMXI
    Alternative prebuilds to my PC:
    Corsair Vengeance i7400 - amzn.to/3p64r22
    MSI MPG Velox - amzn.to/42MnJHl
    Cheapest recommended PC:
    Cyberpower 3060 - amzn.to/3XjtZoP
    Come join The Learning Journey!
    Discord - / discord
    Github - github.com/Jar...
    TikTok - / jarodsjourney
    If you found anything helpful, please consider supporting me and the content I am trying to produce!
    www.buymeacoff...

Comments • 78

  • @pawelszpyt1640 5 months ago +9

    It is worth noting that neither VoiceCraft nor AudioCraft can be used commercially (the weights have non-commercial licenses). Most people probably will not care, as it is hard to prove, but I guess some people would like to know that. AFAIK you can use Tortoise (and probably RVC) commercially. Thanks for the video though; the results from just 3s of audio are just stunning. I get that you can copy voice patterns pretty well if you have a few minutes of speech, but what you can get here with just 3 seconds is crazy.

  • @giovannif2567 5 months ago +6

    Awesome quality for zero-shot! I'll give it a try. Also, thanks for the update on the upcoming voice clone video! I'm glad it's coming, hopefully this week!

  • @classic_sci_fi 4 months ago +2

    VoiceCraft seems to be the best one I've heard so far -- at least on YouTube.

    • @Ovyron 4 months ago

      Lyrebird was at this level a long time ago, but it was commercial. Finally, open source has caught up!

  • @mygamecomputer1691 5 months ago +10

    Any chance you can do your magic and create a one-click install that will work on Windows if you have an Nvidia GPU? Pretty please?

    • @JaysterJayster 2 months ago +1

      Just found a YouTuber who did it on his Patreon :)

  • @hehe42069-k 5 months ago +5

    SWEET finally, we are so BACK!

  • @psalmy26 4 months ago +1

    I can't get Docker working; it's breaking on the first cell. Would LOVE a walkthrough.

  • @BeyondTheLastPage-zm4up 5 months ago +2

    Cool stuff. Appreciate the update on the next voice clone repository video

  • @dhrumil5977 5 months ago +1

    I think the Voice Engine by OpenAI is just hype right now because it's from OpenAI.

  • @geraldcortez826 5 months ago +1

    Please do a video on Windows setup. I really want something to clone voices from small datasets. I've been replacing the game audio of Castlevania SOTN, and the PSX voice for Dracula doesn't clone because there is only one minute of audio. I would really like to use all the PSX voices, but right now I've been using PSP Dracula. Thank you for any help, and thank you for the video.

  • @ukaszLiniewicz 5 months ago +4

    I made an API server for VoiceCraft and also added it to my audiobook/dubbing generation app, Pandrator. Both run on Windows, and Pandrator has a one-click installer. I'm not sure what I think about it yet, to be honest. I achieve very good results with XTTS, but I cannot experiment with VoiceCraft too much, because generation is very slow on my measly 4GB 3050 (laptop), slower even than processing XTTS results with RVC. I have only tried the smaller model (though, according to the author, the difference in quality is negligible). Sometimes it drastically changes the pitch; it sounds as though a sentence or a part of one was generated using a different voice altogether. It can probably be mitigated by playing with the parameters a little. The quality of voice cloning alone is much better than XTTS (I'm not sure about XTTS+RVC), but consistency seems to be worse.

    • @Ravisidharthan 5 months ago

      Hi man, thanks for the work.
      Does it support Apple silicon?
      Can you pack a one-click installer for Mac (MPS)?
      Is there any demo online that Mac users can test, e.g. on Hugging Face?
      And does it support Silero Indic languages?
      ❤ Great work

    • @vineyardworker 4 months ago

      Thanks for Pandrator, much appreciated.

  • @wakandaPeter 2 months ago

    Thanks, man, for all you do.

  • @few2012few 5 months ago +1

    Thank you for the good content.
    I'm just curious about commercial use.
    Since the Coqui Public Model License is for non-commercial use only, is it OK to use the output of this model on YouTube?
    I'm also planning to use TTS in my content.

  • @tylerchambliss8379 5 months ago +2

    Hey Jarod. Do you think this would be a viable replacement for Tortoise for making audiobooks? I really need something soon.

    • @FenrirRobu 5 months ago

      The license is "Attribution-NonCommercial-ShareAlike 4.0 International". Whether or not that affects you is a complicated question. Tortoise is free to use for almost *anything*.

    • @Jarods_Journey 5 months ago

      If the audiobooks are for personal use, I don't see why not. But unfortunately, I won't be able to implement this in any of my projects anytime soon.

  • @GraveUypo 5 months ago

    Oh, awesome. Progress. We need more of it!

  • @nicknightly336 5 months ago

    How does the cloning do on voices with special effects tied to them, e.g. SHODAN from System Shock? Will a large enough fine-tuning dataset give good results, or will the underlying training cause issues because the model was not trained on voices with special effects or pattern variances?

  • @nightknight8651 5 months ago

    Very cool.
    I have to say I learned a lot from you, especially when it comes to RVC, so thank you very much. But even with all of that, I could never get a voice model that is identical to the original voice.
    Any help with that?

  • @clarkkent12880 5 months ago +2

    What's the lowest-power Nvidia card needed to make this work reasonably well?

    • @zachary3603 2 months ago

      If you don't know the answer to this, installing it is going to be a NIGHTMARE.

  • @adamrastrand9409 5 months ago

    Hello, why do I need one voice sample in the voices folder for the Tortoise autoregressive fine-tuned model? Why can't I just select none? And will the voice sample affect the quality of the trained voice, I mean when you put voice samples in the voices folder after training and selecting the model? Also, where is the "prepare a new language" tab, and how does it work?

  • @MrcVicM 5 months ago

    Thanks for all the insights!!

  • @manhattanmyaa-mufc5555 5 months ago

    Love your stuff. I am using this repo and was able to reproduce the output wav files. I need to be able to load the model once and then make many inference calls without reloading it every time. Do you have any insight on how to do this, or can you share your Visual Studio Code project on GitHub?
    So far, I am loading the model using FastAPI, with uvicorn to host it. I can do inference via a POST call the first time, but when I try to run it again, I get WARNING:phonemizer:words count mismatch
    I think I need to unload or re-execute something to reset it, without removing the entire model.

  • @jonathandaudin 3 months ago +1

    Does it work only in English?

  • @user-ms8ek1ju1g 5 months ago

    Thank you for this useful and wonderful channel. I have been searching for a long time for a text-to-speech program for Arabic. I have tried many programs and found that none reach the desired level for the Arabic language. The best one was ElevenLabs, but it's not free. Coqui XTTS is also good for Arabic, but it needs improvement. I am currently looking for a text-to-speech program that can be trained to pronounce Arabic. I do not know how to train Coqui XTTS on the Arabic language.

  • @Interprestor 5 months ago

    Can this do things like laughing, sneezing, coughing or things like that?

  • @user-on3sy6gv8m 3 months ago

    Have you tried something like an Indian accent or a Singlish accent? Will it follow?

  • @Benbobr 5 months ago

    Can you share your build? This is the quickest way to use VoiceCraft

  • @dougmaisner 5 months ago

    Great stuff!

  • @justriseandgrind6910 5 months ago

    It's getting scarier and scarier.

  • @joshuashepherd7189 5 months ago +1

    Jarod, how was the inference speed and GPU utilization on your 4090 (I assume you're using one)?

    • @Jarods_Journey 5 months ago +1

      Eh, about 5-15 seconds once alignment was finished using MFA. It is pretty fast generally, but I'm sure it's not optimized either.

    • @joshuashepherd7189 5 months ago

      @Jarods_Journey That's not too bad at all! I'm trying to move away from freaking ElevenLabs for my project, so I'm about to get real heavy into RVC and Tortoise. Gonna hafta binge your videos XD

    • @HarryClipzFilmz 4 months ago

      @joshuashepherd7189 What is the difference between ElevenLabs and this? I am new to voice cloning and just want to understand which programs to use for it.

  • @aziacomics 5 months ago

    Cool. Does it run on CPU, or must you have a graphics card?

    • @Jarods_Journey 5 months ago

      You need a GPU. It will run on CPU, but way too slowly.

  • @encapsulatio 5 months ago

    What about the best open-source speech-to-text? Is there nothing more accurate than Whisper?

  • @dthSinthoras 5 months ago

    Before I try to install this: how does it perform with other languages?

  • @TheSpartan-tu2fn 5 months ago

    What do you think about using a cloud GPU?

  • @tr1pod623 5 months ago

    Could you make a one-click installer for us? That would be greatly appreciated!

  • @ryannpulido4004 5 months ago

    Do you have suggestions for real-time TTS models that are open source?

    • @pawelszpyt1640 5 months ago +1

      Tortoise with the right settings I guess. Select DeepSpeed and low samples (2) / iterations (

    • @AEONIC_MUSIC 5 months ago +1

      AllTalk is the best I have seen, better than Tortoise, and it has a streaming mode: it does one word at a time and plays it, so you could probably do real time. It's also faster.

    • @pawelszpyt1640 5 months ago

      @AEONIC_MUSIC It uses XTTSv2, which is a very powerful model, but it was released under the terrible Coqui license. Not open source at all. I considered paying Coqui for it, but before I pulled the trigger, they shut down their business, and now you have no way to legally use it at all.

    • @Jarods_Journey 5 months ago +2

      Tortoise TTS would be it, or XTTS if you don't care about licensing. StyleTTS2 is a good contender as well.

  • @FernandoOliveira-jd5il 5 months ago

    Hi, can it speak other languages? By the way, thanks for all the work. Nice channel for creators.

  • @user-xj5gz7ln3q 5 months ago +1

    This license allows only non-commercial use of a machine learning model and its outputs. Why even bother...

    • @FenrirRobu 5 months ago

      Did they specify it for outputs? Usually it's been vague but risky. Did they finally admit that it's just not usable?

    • @user-xj5gz7ln3q 5 months ago +1

      @FenrirRobu Yup, for output.

    • @yuyutsurao 5 months ago

      You can use it; they are not going to check up on you.

    • @Jarods_Journey 5 months ago

      It's cool tech; the ability to maintain prosody and pitch from 3 seconds of audio is pretty wild. Putting it out in the open means that another project will build upon it, possibly with a more permissive license.

    • @FenrirRobu 5 months ago

      @Jarods_Journey Let's see if lucidrains takes up the task.

  • @NaitorStudios 5 months ago

    Does it work well for other languages?

  • @MadeEasyTube 4 months ago

    Please, can it be used with an Arabic voice?

  • @warpsol 5 months ago

    How does this compare to my boy Tortoise TTS + RVC?

    • @AEONIC_MUSIC 5 months ago

      AllTalk is way better and faster at everything than Tortoise, but I'm curious to see what this one is like compared to AllTalk.

    • @Jarods_Journey 5 months ago

      Haven't done any thorough comparisons, but it's pretty darn good lol

  • @nandu18157 5 months ago +1

    How do I run it in Google Colab?

    • @Jarods_Journey 5 months ago

      Since it was built on Linux, you can probably just follow the README.

  • @Random_person_07 5 months ago

    Does this come with the web UI?

    • @greypsyche5255 5 months ago

      No, it comes with a Jupyter notebook, and I have no idea how to use this thing.

    • @Jarods_Journey 5 months ago

      No, I built the web UI for it to make it easier to use. Someone said they were going to PR their own version of one on the repo, so we'll have to wait and see.

  • @yuyutsurao 5 months ago

    Is there any way I can run this without a GPU? 😢

    • @Jarods_Journey 5 months ago

      I don't recommend it. It seems to run on CPU, but the outputs take much too long to generate.

  • @user-rt6nk9sc4y 5 months ago

    Does it support Chinese, Cantonese, and Vietnamese?
