Videos: 242 · Total views: 7,958,084
Prompt Engineering
United States
Joined 14 Jan 2023
Ph.D., Artificial Intelligence & Coding.
Building cool stuff!
▶️ Subscribe: www.youtube.com/@engineerprompt?sub_confirmation=1
Want to discuss your next AI project with me? BOOK NOW:
calendly.com/engineerprompt/consulting-call
For business inquiries email: engineerprompt@gmail.com
Why Cartesia-AI's Voice Tech is a Game-Changer You Can't Ignore!
In this video, I'm excited to introduce Cartesia AI's revolutionary real-time text-to-speech system, Sonic, which offers 135ms model latency and lifelike generative voice capabilities. I'll demonstrate how this versatile API can be integrated into your projects, including a step-by-step guide on obtaining and using the API key. With a variety of voices to choose from, including options for emotion customization, this platform stands out for its quality and speed. I'll also cover setting up a voice-to-voice chat assistant and how you can configure the voices for your needs. Stay tuned for more on voice cloning and advanced setups in upcoming videos!
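The "stream as you go" pattern the description hints at can be sketched without any vendor SDK. In this sketch, `synthesize` is a stand-in for a real TTS call (the actual Cartesia client interface may differ); the point is to send text sentence by sentence so audio can start playing before the full reply is generated:

```python
import re

def split_sentences(text):
    """Naively split text on sentence-ending punctuation so each piece
    can be handed to the TTS API as soon as it is ready."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def synthesize(sentence):
    """Stand-in for a real TTS request; returns fake audio bytes."""
    return b"\x00" * len(sentence)

def stream_tts(text):
    """Yield one audio chunk per sentence instead of one blob at the end."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)

chunks = list(stream_tts("Hello there. How are you today? I am fine."))
print(len(chunks))  # 3 chunks, one per sentence
```

With a real client, each `synthesize` call would hit the API (ideally with streaming enabled), and playback of chunk 1 overlaps generation of chunk 2, which is how the low perceived latency in the demo is achieved.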
#tts #aivoice #voicechat
🦾 Discord: discord.com/invite/t4eYQRUcXB
☕ Buy me a Coffee: ko-fi.com/promptengineering
🔴 Patreon: www.patreon.com/PromptEngineering
💼 Consulting: calendly.com/engineerprompt/consulting-call
📧 Business Contact: engineerprompt@gmail.com
Become Member: tinyurl.com/y5h28s6h
💻 Pre-configured localGPT VM: bit.ly/localGPT (use Code: PromptEngineering for 50% off).
Signup for Advanced RAG:
tally.so/r/3y9bb0a
LINKS:
play.cartesia.ai/
Verbi Github: github.com/PromtEngineer/Verbi
TIMESTAMPS:
00:00 Introduction to Cartesia AI's Text-to-Speech System
00:51 Demonstrating Voice Generation Speed and Quality
01:20 Exploring Different Voice Profiles
03:03 Setting Up Your Account and API Key
03:40 Customizing Voice Parameters
05:22 Implementing the Text-to-Speech System
05:53 Running the Standalone Example
10:36 Voice-to-Voice Chat Assistant Project
13:03 Conclusion and Future Plans
All Interesting Videos:
Everything LangChain: czcams.com/play/PLVEEucA9MYhOu89CX8H3MBZqayTbcCTMr.html
Everything LLM: czcams.com/play/PLVEEucA9MYhNF5-zeb4Iw2Nl1OKTH-Txw.html
Everything Midjourney: czcams.com/play/PLVEEucA9MYhMdrdHZtFeEebl20LPkaSmw.html
AI Image Generation: czcams.com/play/PLVEEucA9MYhPVgYazU5hx6emMXtargd4z.html
Views: 8,280
Video
Marker: This Open-Source Tool will make your PDFs LLM Ready
17K views · 4 hours ago
In this video, I discuss the challenges of working with PDFs for LLM applications and introduce you to an open-source tool called Marker. Marker simplifies the conversion of complex PDF files into structured Markdown, making data extraction much easier. I compare Marker with Nougat, showing its superior performance in preserving document structure accurately. Additionally, I give a detailed tuto...
Master Fine-Tuning Mistral AI Models with Official Mistral-FineTune Package
5K views · 14 hours ago
In this video, I walk you through the official Mistral AI fine-tuning guide using their new Mistral FineTune package. This lightweight code base enables memory-efficient and high-performance fine-tuning of Mistral models. I delve into the detailed data preparation process and explain how to format your datasets correctly in JSONL format to get the best results. We'll also set up an example trai...
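As a rough illustration of the JSONL data preparation the description mentions, here is a minimal round-trip sketch. The chat-style `messages` schema shown is an assumption based on common fine-tuning conventions; check the mistral-finetune repo for the exact fields it expects:

```python
import json

# Hypothetical chat-style training examples (schema is an assumption,
# not the verified mistral-finetune spec).
examples = [
    {"messages": [
        {"role": "user", "content": "What is JSONL?"},
        {"role": "assistant", "content": "One JSON object per line."},
    ]},
    {"messages": [
        {"role": "user", "content": "Why use it for fine-tuning?"},
        {"role": "assistant", "content": "It streams line by line without loading everything."},
    ]},
]

def to_jsonl(records):
    # One compact JSON object per line: no wrapping array, no trailing commas.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def from_jsonl(text):
    # Parse each non-empty line independently.
    return [json.loads(line) for line in text.splitlines() if line.strip()]

jsonl = to_jsonl(examples)
assert from_jsonl(jsonl) == examples  # lossless round trip
```

Writing this out with `open("train.jsonl", "w")` and one `f.write(line + "\n")` per record gives a file most fine-tuning toolchains can consume directly.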
Advanced Function Calling with Mistral-7B - Multi function and Nested Tool Usage
4.8K views · 19 hours ago
Testing multi and nested function calls with Mistral 7B. In this video, I explore the advanced function calling capabilities of the Mistral 7B v0.3 model, including multi-function and nested function calls. Using a Google Colab notebook by Uncle Code, I demonstrate how to set up, install the Mistral inference package, and log into the Hugging Face hub. Practical examples of the model handling multiple...
ChatGPT Desktop App: First Impressions and What's Missing!!!
3.8K views · 21 hours ago
Official ChatGPT desktop app for macOS: early access review and features. In this video, I explore and review the newly released official ChatGPT desktop app for macOS. After downloading it from the ChatGPT website, I walk you through the installation process, launching the app, logging in, and using its various features such as text input, uploading files, and voice conversations. I also...
NEW MISTRAL: Uncensored and Powerful with Function Calling
7K views · 1 day ago
In this video, I explore the new Mistral 7B-v0.3 model, now available on Hugging Face. I'll show you how to install the Mistral inference package, download the model, and run initial queries. We also test its performance and highlight its new features like uncensored responses and function calling. Stay tuned for future videos on fine-tuning this model! #mistral #functioncalling #llm
INSANELY FAST Talking AI: Powered by Groq & Deepgram
7K views · 1 day ago
Fastest voice chat inference with Groq and Deepgram. In this video, I show how to achieve the fastest voice chat inference using the Groq and Deepgram APIs. I compare their speeds to OpenAI's Whisper and demonstrate how to set up and code the process. Learn about handling rate limits, buffering issues, and how to get started with these services. Stay tuned for future videos on local model implementa...
Creating JARVIS - Your Voice Assistant with Memory
6K views · 14 days ago
In this video, you will see a demo of a voice assistant that can remember past conversations. We use external APIs like OpenAI's Whisper for audio transcription, GPT-4 for generating responses, and OpenAI's voice engine for text-to-speech conversion. The main focus is on using modular code and OpenAI's tools to construct a conversational assistant with a memory feature.
Creating J.A.R.V.I.S.
3.5K views · 14 days ago
A sneak peek of the voice-to-voice chat assistant.
First Impressions of Gemini Flash 1.5 - The Fastest 1 Million Token Model
7K views · 14 days ago
Just checked out Google's new Gemini Flash at Google I/O. It's a super-fast AI model designed for handling big tasks: think processing video, audio, or huge codebases, all while keeping costs low. I put it through its paces against giants like GPT-3.5 and GPT-4o, looking at performance, costs, and how it handles real-world tasks. I even tried confusing it with tricky questions and coding ch...
Google IO: Agents is The Future - Demos
3.4K views · 14 days ago
Google I/O was all about agents. Here are some of the example demos shown.
Getting Started with GPT-4o API, Image Understanding, Function Calling and MORE
8K views · 14 days ago
Getting started with GPT-4o: a comprehensive tutorial. This video guides you through the basics of getting started with the GPT-4o API, including comparisons with GPT-4 Turbo, exploring capabilities like text generation, image understanding, and function calling.
GPT-4o: OpenAI's NEW OMNI-MODEL Can DO it ALL
4.3K views · 14 days ago
In this video we look at GPT-4 OmniModel, a groundbreaking AI model capable of processing and responding to audio, vision, and text in real-time. Demonstrating its versatility, the video showcases various scenarios including customer support, language translation, and educational tutoring, highlighting the OmniModel's ability to understand and interact in near-human response times.
Yi-1.5: True Apache 2.0 Competitor to LLAMA-3
6K views · 14 days ago
In this video, we will look at the Yi-1.5 series models, which were just released by 01-AI. This update includes 3 different models with sizes ranging from 6 billion to 34 billion parameters, trained on up to 4.1 trillion tokens. All models are released under the Apache 2.0 license.
NVIDIA ChatRTX: Private Chatbot for Your Files, Image Search via Voice | How to get started
8K views · 21 days ago
This video provides an in-depth review and tutorial of NVIDIA's ChatRTX, a new tool designed for users with RTX GPUs on Windows PCs. The tool leverages Retrieval-Augmented Generation technology and TensorRT-LLM alongside RTX acceleration to chat with documents and use voice interaction. It now supports local photo and image search with improvements in its features. The application requires spe...
Free LOCAL Copilot to Take Your Coding to the NEXT LEVEL
6K views · 21 days ago
Free Copilot to Take Your Coding to the NEXT LEVEL
13K views · 21 days ago
Llama-3 🦙 with LocalGPT: Chat with YOUR Documents in Private
9K views · 28 days ago
Extending Llama-3 to 1M+ Tokens - Does it Impact the Performance?
11K views · 1 month ago
Get your own custom Phi-3-mini for your use cases
11K views · 1 month ago
How Good is LLAMA-3 for RAG, Routing, and Function Calling
8K views · 1 month ago
How Good is Phi-3-Mini for RAG, Routing, Agents
10K views · 1 month ago
Does Size Matter? Phi-3-Mini Punching Above its Size on "BENCHMARKS"
5K views · 1 month ago
MIXTRAL 8x22B: The BEST MoE Just got Better | RAG and Function Calling
4.1K views · 1 month ago
Insanely Fast LLAMA-3 on Groq Playground and API for FREE
24K views · 1 month ago
LLAMA-3 🦙: EASIEST WAY To FINE-TUNE ON YOUR DATA 🙌
45K views · 1 month ago
LLAMA 3 Released - All You Need to Know
11K views · 1 month ago
WizardLM 2 - First Open Model Outperforming GPT-4
16K views · 1 month ago
Create Financial Agents with Vision 👀 - Powered by Claude 3 Haiku & Opus
6K views · 1 month ago
I'm all in. Better price than ElevenLabs.
I was working on a project where I need to use my local language, but I'm having issues with the Coqui AI TTS library. Any other alternative that would be helpful and easy to use? Thank you.
Try meloTTS
Thank you, I will try it
is it faster than Deepgram?
Yes, on the playground. The Cartesia team recommends streaming. I am going to test that and report.
This is not completely open source, so don't report it as such without a clarification midway in the vid.
What software do you use for those super smooth zooms?
It's called screen studio. It's only for mac
Hey , Can we use this offline?
Yes
Thank you for sharing these excellent tools!
Why do they censor these models? AI should remain unbiased and present facts when asked, not give you reasons why it cannot answer a question just because the truth may offend. Facts don't care about feelings. Glad they have overcome censorship.
Ok, as usual, the lack of a GUI destroys it for me... 😢
Grow out of that and a world will open up
I am uncertain about Marker. It is for scientific use, but it says it removes footers, and that is where you normally put your sources and appendix links... so?!
This sounds very interesting, but I will need to learn some Python environment basics before I can put this to the test. A solution like this could help me a lot!
Thanks for posting Verbi. I wanted to get it to speak more than just English. I couldn't find any Cartesia models that were anything other than English or American, but ElevenLabs has great multilingual support. The following change in text_to_speach() enables ElevenLabs to speak quite a few languages:

elif model == 'elevenlabs':
    ELEVENLABS_VOICE_ID = "Rachel"
    client = ElevenLabs(api_key=api_key)
    audio = client.generate(
        text=text,
        voice=ELEVENLABS_VOICE_ID,
        output_format="mp3_22050_32",
        model="eleven_multilingual_v2",
    )
    elevenlabs.save(audio, output_file_path)
Hi, do you know how much RAM is required for this application? I tried, but it said that it was out of memory. My laptop has 16 GB RAM w/o Nvidia GPU. Thanks a lot
Meh, I thought it was a better local TTS... oh well.
Nobody would pay for services when we can do it on our own PC locally.
No thanks for advertising.
Oh boy, it's three times more expensive than Google's premium voices and only includes English. Skipped.
Thanks for another great video!! Can you please make a video or at least share the material on fine-tuning a quantized mistral v0.3 model
In general, you want to load the model in 4-bit. Look at my finetuning videos using unsloth.
They still have natural cadence issues, which is a hard problem to solve.
Yes, I think this is just the alpha version so hopefully will get better over time.
thanks
I'm interested in open source only... can't finish watching. Thumbs down, sorry.
Dude, I wanna deploy this on Hugging Face as an API. Make a tutorial on this.
The deployment series is coming soon; it will give you an idea of how to do this.
Brilliant vid, it is a godsend. OCRing a PDF is just not workable, period. I gave up on attempting to parse PDFs. This new information is amazing and I am once again excited.
Glad it was helpful!
If you wanted to plug this into a chatbot, the pricing does not add up. I've done some crunching, and it won't even get you far with a basic smallish customer doing, say, 1,000-3,000 chats a month, which isn't a lot. Most engines price an audio sequence every 15s or 1m. More good engines are emerging. For our low-end customers, we usually see 3 to 5 concurrency anyway, and that's like the smallest model. Currently we have done hundreds of millions of chats, hundreds of millions of live chats too. So getting into the billions. The market is competitive. Some of the new Google Studio voices are comparable, Deepgram too. Sure, these are nice voices, but for a streaming API, at cost and competitive, sorry but no! Unless the pricing model radically improves. It's early days, so hopefully there will be new models, new options, and a realization. I suggest you take, say, 5,000, 10,000, 30,000, and 100,000 chats, work out the average transcript text size on the bot side, and average out the characters. You will see my point!
that's a valid argument. Hopefully they will be able to reduce their price as they scale.
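The commenter's back-of-the-envelope math can be made concrete. All numbers below are illustrative assumptions (chats per month, replies per chat, characters per reply, characters per dollar), not any provider's actual pricing:

```python
def monthly_characters(chats_per_month, avg_chars_per_reply, replies_per_chat=5):
    """Estimate TTS characters consumed per month if every bot reply is spoken."""
    return chats_per_month * replies_per_chat * avg_chars_per_reply

def cost_usd(chars, chars_per_dollar=20_000):
    # Assumes a plan giving 100,000 characters for $5, i.e. 20,000 chars/dollar.
    return chars / chars_per_dollar

# Hypothetical "smallish customer": 3,000 chats/month, ~300 chars per spoken reply.
chars = monthly_characters(3_000, 300)
print(f"{chars:,} chars -> ${cost_usd(chars):.2f}/month")
```

At these assumed figures the bill lands in the low hundreds of dollars per month, which is the commenter's point: per-character TTS pricing scales linearly with chat volume.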
I wonder how far away we are from arbitrarily high accuracy on tasks like this.
To be honest, when it comes to voice models, open source models are lagging behind!
If you do make a video about scraping data, please go over content that requires JavaScript to load. It's been difficult to find a clear guide specifically for capturing this data for LLM usage. I loved this video, thank you!
I haven't looked into it before, so let me see what I can come up with.
They do not sound good at all....
Thanks for the feature! Super excited to keep building here. For the best experience with the API, I recommend using `stream=True` to get the first audio back super fast. Audio will come back in chunks. We'll add more info about how to use this to our docs.
Thanks for pointing it out. I do feel the docs need more work; I am going to explore it further. Thanks for putting it together.
Has anyone done a demo of a single Cartesia voice outputting something like podcast length? 20 to 30 minutes? The human quality on short text is stunning but I worry that over longer text it will fall into repetitive cadence. The fact that voices are cloned on just a 20 second sample reinforces my concern. Have you tested that?
Interesting point, I will do a test and report back. It will be a fun experiment.
Anything to convert to epub? Getting rid of headers and footers
Markdowns for PDF for LLM😁
:)
Thanks but I'm Brazilian and didn't find portuguese in it
At the moment, its only English.
@@engineerprompt Hi, ok. Thank you for your attention and answer
Hey Prompt Engineer, if you don't mind, could I also be a contributor to your project? I have some wonderful features which could help you make your Verbi AI better and a perfect voice assistant 🥹 It's a request to add me to the group. I wouldn't disappoint you 😼
Yes, would love contributions. Please open a PR. We have a dedicated channel on the discord server. Feel free to join the discussion there.
Reminds me of ElevenLabs' early days. I think they use stream mode in their playground, measuring the time it takes to generate the first audio segment. That's why it seems very fast. What do you think?
That's exactly how they are doing it. Their cofounder pointed it out and suggested enabling streaming via the API as well. On Discord, a contributor to project Verbi said it's possible to get about 200-400ms with streaming. I might redo this.
This... incredible... awesome, NICE WORK !!
Appreciate your efforts, but why the heck would you need an API call to get the ID of the voice you want to use, or other seemingly static parameters? Also, the API latency is terrible compared to their playground. Either you're still doing something unnecessary, or their infrastructure is poor, which defeats the purpose of their supposedly low latency. Further, the text-to-speech piece should be chunked into sentences and streamed to the TTS service instead of waiting for the full response. This is OK for one- or two-sentence responses, but if latency increases linearly then it's no good. Is there endpointing? Interruption?
Thanks for the feedback @avi7278.
1. You can get the voice_id straight from the playground! We'll have support very soon for passing that in directly.
2. For the best experience with the API, I recommend using `stream=True` to get the first audio back super fast 🚀. Audio will come back in chunks. We'll add more info about this to our docs.
3. You can definitely send text chunks over the wire; we'll have more native support for text streaming soon.
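The `stream=True` advice above can be illustrated with a stand-in generator (the real client's streaming interface may differ). The point is that perceived latency is governed by the arrival of the first chunk, not by the length of the whole clip:

```python
def fake_stream():
    """Stand-in for a streaming TTS response that yields audio in chunks."""
    for chunk in (b"abc", b"def", b"gh"):
        yield chunk

def play_as_it_arrives(stream):
    """Begin 'playback' on the first chunk instead of waiting for the full clip.

    Returns the assembled audio plus the size of the first chunk, which is
    what the user actually waits for before hearing anything.
    """
    audio = bytearray()
    first_chunk_len = None
    for chunk in stream:
        if first_chunk_len is None:
            first_chunk_len = len(chunk)  # time-to-first-audio ends here
        audio.extend(chunk)               # a real app would play, not buffer
    return bytes(audio), first_chunk_len

audio, first = play_as_it_arrives(fake_stream())
assert audio == b"abcdefgh" and first == 3
```

With a blocking call, the wait scales with total audio length; with chunked streaming it scales only with the first chunk, which is why the playground feels so fast.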
The free plan is 10,000 characters, while the lowest $5-per-month tier gets you 100,000 characters per month. I re-read that again; it's in characters, not words. Am I dreaming? So one letter is one character, right? Is that correct? Isn't that super expensive?
It's cheap compared to the other paid voice service (ElevenLabs), which gives only 30K characters for 5 dollars; the same 100K characters cost over 20 dollars on ElevenLabs, 4x more expensive. But yeah, compared to other AI services where you pay once and get almost unlimited usage, like Infermatic for text AI, it's expensive.
And with characters, they can count every space as a character.
Yup, that's only characters. On average, 1,000 characters is about 1 min of audio IIRC, so the free tier is 10 min of audio. For the same price ($5), the starter pack of ElevenLabs is only 30,000 characters per month, so only half an hour.
@@BackTiVi I'll stay with Coqui.
I had another question, are you also on Udemy?
I am not on Udemy but just launching my RAG course here: prompt-s-site.thinkific.com/courses/rag
Thank you very much!!! I was looking for something like this for a long time. I work for a large bank but with a very small budget for my project. Due to the budget crunch we cannot afford to buy third-party tools; this sounds like a perfect fit, but since there is a limit of $5MN we may not qualify to use this for free. Would you suggest going with Nougat, or do you have a better alternative for my use case? Really appreciate your content!
Nougat can be an option, or look into unstructured-io. I would also recommend looking into Claude or GPT-4o with vision if data privacy is not a big issue. Some of these proprietary tools have good data privacy based on their TOS.
@@engineerprompt Thanks for the prompt response!!
I'm always so impressed by models like this. But where are all the open-source solutions on this topic? Research is crazy!
Multi-language?
No
Coming soon 🚁
The more models I use the less I want to pay for the apis
Yeah, we really need this stuff free and open source. The only real limiting factor is the affordability of the GPU(s) needed to run this locally. There's stuff out there, but local open source audio stuff is behind text and image based models sadly, but maybe soon that'll change.
@@14supersonic actually you can use stable diffusion locally with a mid range 12 gb commercial gpu for image generation, audio models as well, also quantized llm models are very good for simpler tasks like summarizations
@Zale370 I know, that's why I said audio based AI models are behind text and image based solutions. When you compare something like local Llama 3 or SD3 to local audio based AI models, there's no audio modality comparable to them yet in terms of local usage.
Indeed, there are no optimal and swift text-to-speech (TTS) solutions for local LLM inference. I personally believe this is not solely due to GPU memory constraints but also driven by security considerations.
Yeah, I'm a security consultant and the risks inherent in this are just insane. I won't ever say open source should slow down but I appreciate the time we are getting to communicate what's coming. Amazingly, the EU AI legislation classifies voice cloner AI as lower risk. I don't think they've ever got a phone call from their doctor asking them to stop a particular medication or their wife saying they're being held hostage and they're demanding all the money. It gets darker from there.
Liking without watching coz I know this is gonna be amazing
This is very useful! Now, in the next video, show how to fine-tune a model (with a long context length like "Phi-3-mini-128k-instruct") with this Markdown data 😍😍
let me see what i can do :)
If you are interested in learning more about how to build robust RAG applications, check out this course: prompt-s-site.thinkific.com/courses/rag
Around 7 minutes in, having installed a conda environment, you select pip, not conda, when installing PyTorch. Any reason why? If there's a working conda option, doesn't it make sense to keep using conda and only use pip when you absolutely have to? Just wondering. (Thanks for the video, btw. I had just been wondering about effective ways of making content reliably available to RAG, and the video is super useful.)
I usually use pip because it has most of the Python packages available; conda is somewhat limited in available Python packages. conda would also work in this case, but it's more of my own habit at this point :)
On my M1 Mac I have tried this out, installing

dependencies = [
    "torch>=2.3.0",
    "torchvision>=0.18.0",
    "torchaudio>=2.3.0",
    "marker-pdf>=0.2.13",
]

Then when I try out just a single PDF it fails on a simple Python import:

marker_single 26572517.pdf OUTPUT --max_pages 2 --langs English

Traceback (most recent call last):
  File "marker/.venv/bin/marker_single", line 5, in <module>
    from convert_single import main
  File "marker/.venv/lib/python3.12/site-packages/convert_single.py", line 5, in <module>
    from marker.convert import convert_single_pdf
ModuleNotFoundError: No module named 'marker.convert'

Anyone getting the same? Tried with Python 3.10 and 3.12.
Are you using a virtual environment? Use this command: python -m pip install marker-pdf. This will ensure it's installing the package into the current virtual env.
Using rye, and yes it is there in my virtual env.
The marker scripts are there to be called.
@@engineerprompt Ok, the problem seems to be with the way Rye handled the imports, sorry about that. Creating the virtual env normally, I can run the commands. Thanks for the video; I have been looking for how to do this for a long time.
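The ModuleNotFoundError in this thread is typically an interpreter mismatch: pip installed the package into one environment while the script ran under another. A quick, tool-agnostic check (the `marker` lookup simply reports None if marker-pdf isn't importable by the current interpreter):

```python
import importlib.util
import sys

def where_is(module_name):
    """Report where THIS interpreter would import a module from, or None."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec else None

# `pip install` must target the same interpreter shown here
# (which is exactly what `python -m pip install ...` guarantees).
print("interpreter:", sys.executable)
print("json from:  ", where_is("json"))    # stdlib, always importable
print("marker from:", where_is("marker"))  # None unless marker-pdf is installed
```

If `where_is("marker")` is None inside an activated environment where `pip show marker-pdf` succeeds, pip and python are pointing at different environments.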
What I want is for LLMs to cook my next meal.
Docker version please