![Reda Marzouk](/img/default-banner.jpg)
Reda Marzouk
France
Joined 11 Oct 2015
Hi there! My name is Reda and I'm an RPA Developer.
I'm thrilled to share my knowledge with you through this YouTube channel. Making videos is an incredible way for me to learn, and I'm hoping that my videos can help you do the same.
My channel focuses on UiPath and Power Automate, two incredible tools for automation.
RPA is expanding at an unprecedented rate, and I want to make sure you're staying up to date with all its news. That's why I create high-level videos that cover everything you need to know.
But I understand that sometimes you need more than just an overview. That's why I also make shorter, more specific videos that go deep into particular problems. I'm here to help you solve even the toughest issues and become an RPA pro.
I'm so excited to have you on this journey of learning and sharing. And if you have any ideas for future videos, please let me know!
Cheers,
Reda
This AI Agent can Scrape ANY WEBSITE!!!
In this video, we'll create a Python script together that can scrape any website with only minor modifications.
________ 👇 Links 👇 ________
🤝 Discord: discord.gg/jUe948xsv4
💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/
📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa
🤖 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos
Website: www.automation-campus.com/
FireCrawl: www.firecrawl.dev/
Github repo: github.com/redamarzouk/Scraping_Agent
________ 👇 Content 👇 ________
Introduction to Web Scraping with AI - 0:00
Advantages Over Traditional Methods - 0:36
Overview of FireCrawl Library - 1:13
Setting Up FireCrawl Account and API Key - 1:24
Scraping with FireCrawl: Example and Explanation - 1:36
Universal Web Scraping Agent Workflow - 2:33
Setting Up the Project in VS Code - 3:52
Writing the Scrape Data Function - 5:41
Formatting and Saving Data - 6:58
Running the Code: First Example - 10:14
Handling Large Data and Foreign Languages - 13:17
Conclusion and Recap - 17:21
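The workflow in the chapters above (scrape the page to markdown with FireCrawl, then have an LLM structure the requested fields) can be sketched roughly as follows. This is a hedged sketch, not the repo's actual code: the function names and the exact shape of `scrape_url`'s return value are assumptions (they vary between firecrawl-py versions), so check the GitHub repo above for the real implementation.

```python
import json

def build_prompt(fields, markdown):
    """Ask the model for a JSON object whose 'listings' key holds objects
    with exactly the requested field names as keys."""
    return (
        "Extract the following fields from the page content below and return "
        f"a JSON object whose 'listings' key is an array of objects with "
        f"exactly these keys: {fields}.\n\nPage content:\n{markdown}"
    )

def scrape_and_extract(url, fields, firecrawl_key, openai_key):
    """End-to-end sketch: FireCrawl turns the page into markdown, the LLM
    structures it. Requires the firecrawl-py and openai packages."""
    from firecrawl import FirecrawlApp
    from openai import OpenAI

    page = FirecrawlApp(api_key=firecrawl_key).scrape_url(url)
    client = OpenAI(api_key=openai_key)
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # nudges the model toward valid JSON
        messages=[{"role": "user",
                   "content": build_prompt(fields, page["markdown"])}],
    )
    return json.loads(resp.choices[0].message.content)
```

Changing the `fields` list and the target URL is what makes the same script reusable across different websites.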
42,765 views
Video
NEW GPT-4o: Prepare to be SHOCKED!!
1.5K views · a month ago
In this video, we dive into the launch of GPT-4o by OpenAI, covering its new features and capabilities. We'll check out its real-time conversational speech, top-notch benchmark performance, and availability for free users. Stick around as we react to live demos and chat about it. 👇 Links 👇 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: instagram....
Llama 3 FULLY LOCAL on your Machine | Run Llama3 locally
1K views · 2 months ago
FULLY Local Llama 3, on your machine. Run Llama 3-8B in a local server and integrate it inside your AI Agent project. 👇 Links 👇 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos LMStudio: lmstudio.ai/ Ollama: ollama.com/ www.automation-campus.com/ Introduction Llama3: 00:00...
Llama 3 BREAKS the industry !!! | Llama3 fully Tested
2.3K views · 2 months ago
FULLY Tested Llama 3, the flagship model from Meta. Benchmark of GPT-4 vs GPT-4 Turbo vs Llama 3. 👇 Links 👇 lmstudio.ai/ 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos www.automation-campus.com/ 👇 Content👇 00:00 Introduction to Llama3 00:30 All you need to know about Lla...
GPT-4 Surpassed Claude 3 (Again) | GPT-4 Turbo fully tested
3.2K views · 2 months ago
FULLY Tested GPT-4 Turbo, the flagship model from OPENAI. Benchmark of GPT-4 vs GPT-4 Turbo 👇 Links 👇 lmstudio.ai/ 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos www.automation-campus.com/ 👇 Content👇 00:00 Introduction and AI Week News 00:11 Launch of a new AI model by M...
AUTOGEN STUDIO : The Complete GUIDE (Build AI AGENTS in minutes)
8K views · 2 months ago
The full guide to get started with Autogen Studio, Create Powerful AI Agents in a couple of minutes with real life projects. 👇 Links 👇 lmstudio.ai/ 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos www.automation-campus.com/ 👇 Content👇 00:00 Introduction to Agentic Workflow...
Easily Run LOCAL Open-Source LLMs for Free
3K views · 3 months ago
Run locally hosted open source LLM for free. LMStudio helps you download and run private Models from huggingFace in a no code environment, it's a solid Free Chatgpt alternative. 👇 Links 👇 lmstudio.ai/ 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos www.automation-campus.c...
Elon Does The Unthinkable, Grok-1 is officially the LARGEST Open Source model!!
2.7K views · 3 months ago
Elon musk has just launched Grok-1 to the rest of the world. #elonmusk #grok #chatgpt #openai 👇 Links 👇 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos www.automation-campus.com/downloads
This Agent can create Dalle Images at SCALE!!
748 views · 3 months ago
Agent to generate images on Dalle 3 automatically. #chatgpt #gpt #dalle3 #automation 👇 Websites👇 Cloud.uipath.com 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos www.automation-campus.com/downloads 👇 Content👇 00:00 Dalle Agent 01:05 Prerequisites 02:34 Agent steps 06:39 R...
The HARSH REALITY of being an RPA Developer!!
2.6K views · 3 months ago
Introducing the latest innovation in AI technology: Digital Agents that can control your desktop! With ChatGPT, you can now have a virtual assistant that can perform tasks on your computer, just by chatting with it. and this one can operate all of your desktop/web apps. 👇 Websites👇 github.com/OthersideAI/self-operating-computer 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/re...
This AI Agent can CONTROL your ENTIRE DESKTOP!!!
8K views · 4 months ago
Introducing the latest innovation in AI technology: Digital Agents that can control your desktop! With ChatGPT, you can now have a virtual assistant that can perform tasks on your computer, just by chatting with it. and this one can operate all of your desktop/web apps. 👇 Websites👇 github.com/OthersideAI/self-operating-computer 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/re...
10x your PRODUCTIVITY with this NEW AI tool !!!
15K views · 4 months ago
Improve your productivity with this amazing new AI tool! This tutorial will show you how to use this tool to copy and paste from any document, screen, or application. Say goodbye to time-consuming tasks and hello to increased efficiency with this game-changing tool! 👇 Websites👇 www.uipath.com/product/clipboard-ai 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/...
A Free Personal AI Agent that actually WORKS!!!
26K views · 4 months ago
Learn about the future of digital automation with autonomous agents, large action models. Discover how these technologies are transforming industries and improving efficiency and productivity. Don't get left behind, stay ahead of the game and find out what the future holds for digital automation! 👇 Websites👇 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻...
UiPath joins Large Action Model Race
1.8K views · 5 months ago
In this video you'll learn how to create a robot to fill forms automatically on any website and with only minimal changes. 👇 Websites👇 www.automation-campus.com/downloads cloud.uipath.com/ 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos 👇 Content👇 00:00 Intro 00:49 Data E...
UiPath made PDF Extraction a lot easier - Document Understanding UiPath
2K views · 5 months ago
Extract any pdf using 4 simple UiPath Activities, follow the steps in the video and you'll have a single process to interact with any pdf file. 👇 Websites👇 www.automation-campus.com/downloads cloud.uipath.com/ 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos 👇 Content👇 00:...
UiPath Errors Troubleshoot - The only Trick you'll EVER need
843 views · 9 months ago
Object reference not set to an instance of an object - UiPath - BEST SOLUTION!!!
8K views · 9 months ago
ChatGPT API Advanced Configuration | GPT-4 API Pricing
652 views · 10 months ago
REVOLUTIONARY!!!!! UiPath Document Understanding and Generative AI - Invoice Data Extraction
4.5K views · 10 months ago
Don't Miss Out: UiPath Document Understanding and Generative AI. (Game-Changer!!!!)
3.4K views · 10 months ago
UiPath Advanced Certification | Activities and Properties Part 2 | Questions And Answers
372 views · 10 months ago
UiPath Advanced Certification | Activities and Properties | Practice Test Solutions
367 views · 11 months ago
UiPath Advanced Certification | UiPath Studio | UiPath Practice Exam
499 views · 11 months ago
UiPath Advanced Certification | State Machine, Flowchart and Sequence in UiPath | UiPath RPA
460 views · 11 months ago
UiPath Advanced Certification | How to get certified in UiPath RPA in 2023
1.6K views · 11 months ago
Resume Screener - Extract data from CV PDF documents using UiPath and ChatGPT
2.6K views · 11 months ago
UiPath - Download File From URL | How to download file from website using UiPath
6K views · a year ago
UiPath Excel Add-In | UiPath Attended Automation inside Excel
665 views · a year ago
Top 3 Changes in the MODERN Design of UiPath Studio
535 views · a year ago
Nice project. I worked on your code base for a while and used Groq Mixtral instead, with multiple keys to get past rate limits. Firecrawl is not automatic when it comes to pagination; you still need to add HTML code, which defeats the purpose. Slow, but OK for a free tier, and I think I got around that. The next step is to use it in the front end. Zillow's API is only available for property developers, so scraping with manual inputs is the only way, though working with the live API functionality would be the best way forward. Nice job!
Having my computer managed by an AI I can naturally communicate with is one of my biggest dreams for the short-term future.
I can't find an OpenAI model that works for me. I've tried gpt-3, gpt-3.5, gpt-3.5-turbo-1186; I always get a 404 "does not exist or you don't have access to it". GPT says use davinci or curie. Any suggestions?
Hey, I'm getting an "Invoke Code: Exception has been thrown by the target of an invocation" error when I run this.
The AI agent is unable to bypass Cloudflare, even after trying Ollama.
Dear sir, can I use my local LLM models instead of OpenAI API?
Thanks bro
You’re welcome 😄
Hello Si Reda, all the best insh'Allah :)
Thank you so much and to you too 😄
Good technology to keep in good book!
Can we use this for email and phone number extraction?
Absolutely, you just need to change the websites and the fields and you're good to go.
What other options are there besides Firecrawl? Thanks!
Just found it in the comments: "Firecrawl has 5K stars on GitHub, Jina ai has 4k and scrapegraph has 9k."
Exactly, Jina AI and ScrapeGraph AI are also options.
I like what you did, but for no-code people this is hard because we don't know what we should install on Windows, etc. Really, really nice video.
Can you scrape multiple URLs at once? For example, if you wanted to scrape all the Zillow pages, not just the first page with a few houses. @redamarzouk
This is a nice video and very useful. In my application, I'm looking for the system to have ALL the customer PDF invoices uploaded, or better yet, stored as a SalesOrder table in a database. This seems like a lot of work for just one customer and one email. Is there a way to create agents that could filter out which customer order? Etc.
I'm curious: what do you do after structuring the data? Do you store it in a vector DB? If so, do you store the JSON as-is or something else? And can it actually be completely universal? By that I mean, can it structure the data without us providing the fields it should structure it on? Can we make it so we upload a website and it understands the data and structures it accordingly?
I'm just getting "An error occurred: name 'phone_fields' is not defined".
Nice! Any idea how to self-host Firecrawl, like with Docker? Also, can it be coupled with n8n? How?
I gotta be honest, I didn't even try. I tried to self-host an agentic software tool before and my PC was going crazy; it couldn't take the load of Llama3-8B running on LM Studio, plus Docker, plus filming at the same time. I simply don't have the hardware for it. If you want to self-host, here is the link: github.com/mendableai/firecrawl/blob/main/SELF_HOST.md (it uses Docker).
@@redamarzouk thanks. Is there any sense to use it with n8n? or maybe n8n can do the same without firecrawl? (noob here)
@@redamarzouk or maybe with things like Flowise?
Can I use Llama 3 / Phi-3 on a local PC?
You can theoretically use it for the data extraction, but you will need a large-context-window version of Llama 3 or Phi-3. I've seen a model where they extended the context length to 1M tokens for Llama3-8B. You need to keep in mind that your hardware has to match the requirements.
awesome bro
Glad you liked it
Nope, it's not better than GPT.
You're right, for now it's not. These models are beating each other like there's no tomorrow; as of today, GPT-4o is the one at the top.
@@redamarzouk Before GPT-4o, GPT-4 Turbo was still better; the only real advantage of Llama is that it's a free model :)
Nice video. Another helpful video on the same topic czcams.com/video/dSX5eoD4-u4/video.htmlsi=8iKzgqHG97Ivf8wK
Very helpful. How do you work around the output limit of 4096 tokens?
Hello, if you're using the OpenAI API, you need to add the parameter max_tokens inside your OpenAI client call and set it to a number that doesn't exceed the token limit of the model you're using (128,000 for gpt-4o, for example).
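To make the reply above concrete, here is a hedged sketch of both ideas: setting `max_tokens` on the chat completion call, and chunking long input so no single completion has to exceed the output cap. Note that 128,000 tokens is gpt-4o's context window; the per-completion output cap is smaller, which is why chunking long extractions helps. `summarize_chunks` and the character-based splitter are illustrative, not part of the video's code.

```python
def chunk_text(text, max_chars=12000):
    """Naive character-based splitter; a real version would count tokens
    (e.g. with tiktoken) rather than characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_chunks(client, text, model="gpt-4o"):
    """One completion per chunk, each capped with max_tokens, then joined.
    `client` is an openai.OpenAI instance (package not imported here)."""
    parts = []
    for chunk in chunk_text(text):
        resp = client.chat.completions.create(
            model=model,
            max_tokens=4096,  # caps the *generated* tokens, not the context window
            messages=[{"role": "user", "content": f"Extract the data:\n{chunk}"}],
        )
        parts.append(resp.choices[0].message.content)
    return "\n".join(parts)
```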
Helpful, Thank you
Glad it helped!
Does it work on mbp
I get Error code: 429 when running the code: "You exceeded your current quota, ..."
In case you haven't used your OpenAI API key in a while: they changed the way it works, you need to pay in advance to refill your quota
Can you use this for self operating pc ? Thanks
Believe me I tried, but my NVIDIA RTX 3050 4Gb simply can’t withstand filming and running Llava at the same time. Hopefully I’ll upgrade my setup soon and be able to do it.
So it is possible; it's just a matter of programming and PC specs.
Does it handle JavaScript, infinite scroll, button-click navigations?
Yes, you can ask LLMs to do all that like a human would.
In the US, a “bedroom” is a room with a closet, a window, and a door that can be closed.
Neat overview. Curious about API costs associated with these demos. Try zooming into your code for viewers.
Watch on a big monitor, as most coders do.
For just the demo you've seen, I spent $0.50; for creating the code and launching it 60+ times, I spent $3. I will zoom in next time.
Wow! The AI was even clever enough to convert square meters into square feet, no need to write a conversion function!
Web scraping as it is right now is here to stay, and AI will not replace it (it can only enhance it in certain scenarios).

First, the term "scraping" gets tossed around and used vaguely. When you "scrape", all you do is move information from one place to another, for example getting a website's HTML into your computer's memory. Then comes "parsing", which is extracting different entities from that information, for example extracting product price and title from the HTML we "scraped". These are separate actions; they are not interchangeable, one is not more important than the other, and one can't work without the other. Both come with their own challenges. What these kinds of videos promise to fix is the "parsing" part. It doesn't matter how advanced AI gets, there is only ONE way to "scrape" information, and that is to make a connection to the place the information is stored (whether it's an HTTP request, browser navigation, an RSS feed request, an FTP download, or a stream of data). It's just semi-automated in the background.

Now that we have the fundamentals, let me state this clearly: for the vast majority (99%) of cases, "web scraping with AI" is a waste of time, money, resources, and our environment.

Time: it's deceiving. While AI promises to extract information with a "simple prompt", you'll need to iterate over that prompt quite a few times to build a somewhat reliable data-parsing solution. In that time you could have built a simple Python script to extract the required data. More complicated scenarios will affect both the AI and the traditional route.

Money: you either use third-party services for LLM inference or you self-host an LLM. Both, in the long term, will be orders of magnitude more expensive than a traditional Python script.

Resources: a lot of people don't realize that running an LLM for cases where an LLM is not needed is extremely wasteful. I've run scrapers on old computers, Raspberry Pis, and serverless functions; that's a speck of dust in hardware requirements compared to running an LLM on an industrial-grade computer with powerful GPU(s).

Environment: given the resources needed, this affects our environment greatly, as newer and more powerful hardware has to be invented, manufactured, and run. For the people who don't know, AI inference machines (whether self-hosted or third-party) are powerhouses, so a lot of watt-hours are wasted, fossil fuels burnt, etc.

Reliability: "parsing" information with AI is quite unreliable, mainly because of the nature of how LLMs work, but also because many more points of failure are introduced (information has to travel multiple times between services, LLM models change, you hit usage and/or budget limits, LLMs experience high load and inference speed suffers or it fails altogether, etc.).

Finally: most "AI extraction" is marketing BS letting you believe you'll achieve something that requires a human brain and workforce with just "a simple prompt". I've been doing web automation and data extraction for more than a decade for a living, and I've also started incorporating AI in some rare cases where traditional methods just don't cut it.

All that being said, for the last 1% of cases where it does make sense to use AI for data parsing, here's what I typically do (after the information is already scraped):

1. First I remove the vast majority of the HTML. If you need an article from a website, it's not going to be in the <script>, <style>, <head>, or <footer> tags (you get the idea), so using a Python library (I love lxml) I remove all these tags along with their content. Since we are just looking for an article, I also remove ALL of the HTML attributes, like classes (a big one), ids, and so on. After that I remove all the parent/sibling cases that look like a useless staircase of tags. I've tried converting to markdown and parsing, and I've tried parsing from a screenshot, but this method is vastly superior because the important HTML elements are still present and LLMs have general HTML knowledge. This step makes each request at least 10 times cheaper and lets us use models with lower context sizes.

2. I then manually copy the article content that I need and put it, along with the resulting string from step 1, into a JSON object plus prompts to extract an article from the given HTML. I do this at least 15 times. This is the step where training data is created.

3. Then I fine-tune a GPT-3.5 Turbo model with that JSON data. After about 10 minutes of fine-tuning and around $5-10, I have an "article extraction fine-tuned model" that will always outperform any agentic solution in all areas (price, speed, accuracy, reliability). Then I just feed the model a new (unseen) piece of HTML that has passed step 1, and it reliably spits out an article for a fraction of a cent in a single step (no agents needed).

I have a few of those running in production for clients (for different datapoints), and they do very well, but it's important that a human goes over the results every now and again. Also, if there is an edge case and the fine-tune did not perform well, you just iterate and feed it more training data, and it works.
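Step 1 of the comment above (stripping noise tags and all attributes before the HTML ever reaches an LLM) could look roughly like this with lxml, the library the commenter mentions. This is a sketch of my reading of that step: the exact tag list is an assumption based on "you get the idea", and the "useless staircase of tags" collapse is omitted for brevity.

```python
from lxml import html
from lxml.etree import strip_elements

# Assumed noise tags; extend to suit the sites you scrape.
NOISE_TAGS = ("script", "style", "head", "footer", "nav", "iframe")

def slim_html(raw_html):
    """Drop noise tags (with their content) and every attribute, so the
    remaining HTML is far cheaper to send to an LLM."""
    tree = html.fromstring(raw_html)
    strip_elements(tree, *NOISE_TAGS, with_tail=False)
    for el in tree.iter():
        el.attrib.clear()  # classes, ids, data-* all removed
    return html.tostring(tree, encoding="unicode")
```

The slimmed string is then what gets paired with hand-copied article text to build the fine-tuning examples in step 2.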
Thanks for taking the time to explain this! Very useful to clarify!
Thanks, man. I'm specializing in web scraping in my career. Do you have a blog or similar where you share content about web scraping as a career?
Nonsense. Scraping has for 10 years included both fetching data and then structuring it in some format, XML or JSON; then we can do whatever we want with that structured data. Introducing "parsing" as some distinct construct is inane. More importantly, the way scraping can work today is leagues better than what the likes of Apify used to do until two years ago, and yes, this uses LLMs. Expand your reading.
@@ilianos his "explanation" is stupid.
@@rafael_tg watch more sensible videos and comments.
Dumpling AI is a startup doing the same! I'm swapping to it; they're $50 a month for 10,000, and 6 a min.
It seems Zillow is blocking my access: "Press & Hold to confirm you are a human (and not a bot)". I was able to run it on Trulia, but only without my VPN.
God bless you, bro, mashallah.
May God protect you.
Thanks for the helpful content.
You're most welcome!
You said that sometimes the model returns the response with different key names, but if you pass the pydantic model to the OpenAI model as a function, you can expect a consistent object with exactly the keys you need.
Also, pydantic models can be scripted to have a nested structure, in contrast to JSON schemas.
Correct. I actually used them while playing around with my code (alongside function calling). The issue I found is that I'd have to explain both the pydantic schema and how I made it dynamic, because I wanted a universal web scraper that can use different fields every time we scrape a different website. That would ultimately have made the video 30+ minutes long, so I opted for the easier, less performant way.
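For readers curious what this thread is describing, here is a hedged sketch of the fixed-schema variant (not the video's dynamic-fields code): a pydantic model pins the key names, and its JSON schema is passed to the chat completions API as a forced tool call. The model name, the `Listing` fields, and the `save_listings` tool name are illustrative.

```python
from typing import List
from pydantic import BaseModel

class Listing(BaseModel):
    address: str
    price: str

class Listings(BaseModel):
    listings: List[Listing]

def extract_listings(client, markdown, model="gpt-4o"):
    """Force the model to 'call' save_listings, so the reply has to match
    the pydantic schema's key names. `client` is an openai.OpenAI instance."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Extract all listings:\n{markdown}"}],
        tools=[{"type": "function",
                "function": {"name": "save_listings",
                             "parameters": Listings.model_json_schema()}}],
        tool_choice={"type": "function", "function": {"name": "save_listings"}},
    )
    args = resp.choices[0].message.tool_calls[0].function.arguments
    return Listings.model_validate_json(args)  # raises if keys don't match
```

Making this "universal" means generating the pydantic model (or raw JSON schema) from a runtime field list, which is the extra complexity the reply above chose to avoid.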
What about Angie list 😢
What about captcha
Websites don't like scrapers in general, so extensive scraping will need a VPN (one that can handle the volume of your scraping).
@@redamarzouk Also, a VPN won't protect you from captchas. They are there for a good reason, but it would be interesting to find a way around them to build tools for customers.
Thank you Reda for sharing the knowledge! Very appreciated!
Really appreciate the kind words, My pleasure 🙏🙏
In my experience, function calling is way better at extracting consistent JSON than just prompting. Anyway, God bless my fellow countryman.
Good idea
You're on point with this; using function calling is always better for JSON consistency. I actually used it when I was creating my original code. The issue is that I have a parameter, fields, that can change depending on the type of website being scraped. So to account for that in my code, I'd either make the schema inside the function call generic (not so great) or make it dynamic (I really didn't want to go there; it would make the tutorial much more complicated). I also tried using pydantic expressions, since Firecrawl has its own LLM extractor that can use them, but it didn't perform as well. But yeah, you're right, function calling is always better. May God protect you, man.
@@redamarzouk You have a knack for this, bro. Keep up the good work. May God grant you success.
You could just ask GPT-4 one time to generate the extraction code, or the tags to look for, per website, so that it doesn't need to always use AI for scraping; you might get better results. Then, if that code fails, you fall back to regenerating it and cache it again.
Creating a dedicated script for a website is the best way to get the exact data you want, you're right in that sense, and you can always fix it with GPT-4 as well. But let's say you're actively scraping 10 competitor websites where you only want their pricing updates and new offerings: does it make sense to maintain 10 different scripts rather than 1 script that can do the job with minimal intervention? It depends on the use case, but there are times when customized scraping code isn't the best approach.
@@redamarzouk I didn't mean it like that. I meant you would basically do the same thing as your technique, but you could use the AI once per domain, asking it what the CSS selectors are for the elements you're interested in. That way, when you're looking for updates, you don't even need to call the LLM unless it fails because the structure changed. You don't have to maintain multiple scripts; just make a dictionary with the domain name and the CSS paths and there you go. Of course, different pages may have different structures, but you could feed in the HTML from a few different pages of the site and prompt GPT-4 with the URLs and the markup, telling it to figure out the URL pattern that matches the specific stuff to look for. You could even still do this with GPT-3.5-Turbo. Basically, the only idea I'm throwing out there is to ask the AI for the tag names and have your code extract the info using BeautifulSoup, or something else that can grab info out of tags based on CSS selectors. That way, you can cache that info and scrape faster after the initial run. It would be only a little more work but might be a lot better for some use cases. Just thought it was a cool idea.
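A minimal sketch of the caching idea in this comment, under the assumption that a one-off LLM call has already produced the per-domain CSS selectors (the dictionary and domain below are hypothetical, filled in by hand here):

```python
from bs4 import BeautifulSoup

# domain -> field -> CSS selector; in the real flow an LLM fills this in once
SELECTOR_CACHE = {
    "example.com": {"title": "h1.listing-title", "price": "span.price"},
}

def extract_with_cache(domain, page_html):
    """Cheap, LLM-free extraction using cached selectors. A None value means
    the selector missed (site structure changed); the caller can then fall
    back to the LLM to regenerate the selectors."""
    soup = BeautifulSoup(page_html, "html.parser")
    out = {}
    for field, css in SELECTOR_CACHE[domain].items():
        node = soup.select_one(css)
        out[field] = node.get_text(strip=True) if node else None
    return out
```

On a cache miss, the comment's suggestion is to re-ask GPT-4 for fresh selectors and update the dictionary, keeping LLM calls to roughly one per domain per structure change.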
Thank you. I have a use case: can I use the tool to run queries against a database, save the results as your tutorial shows, and also print the result of every query to PDF?
If you already have a database you want to query, you don't need any scraping (unless you need to scrape websites to create that database). But yeah, it sounds like you can do that without any AI in the loop.
While it is a practical solution, it still requires access to the OpenAI API or another LLM API, which can incur costs. BeautifulSoup and Selenium remain free alternatives. However, using LM Studio locally provides the advantage of running your own LLM, offering greater flexibility and control.
Nice idea. Now wake me up when there are no credits involved (completely free).
It's open source; this is how you can run it locally and contribute to the project: github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md. But honestly, as IT folks we've got to stop going after each other for wanting to charge for an app we've created. Granted, I'm not recommending this to my clients yet and $50/month is high, but if that's what they want to charge, it's really up to them.
Just use an LLM: pass it the source code of the page and generate the scraping function à la carte. The LLM is the secret sauce.
I just want to understand: what is Firecrawl's role in all of this? It seems to me there's nothing special about it at all!!!
Awesome! thanks for sharing.
My pleasure!
It's easy to do with a free Python library: read the HTML, convert it to markdown, and even convert it for free to vectors with a transformer, etc.
Exactly, I didn't really understand the point of Firecrawl in this solution. Does Firecrawl do anything better than a free Python library? Any suggestions on Python libraries, btw?
Have you used it on complex websites with many ads, or logins, or progressive JS-based loads, or infinite scrolls? Clearly not.
Firecrawl has 5K stars on GitHub, Jina AI has 4K, and ScrapeGraph has 9K. Saying you can just implement these tools easily is frankly disrespectful to the developers who created these libraries and made them open source for the rest of us. In the example I covered, I didn't show the ability to filter the markdown to keep only the main content of a page, nor did I show how to scrape using a search query. I've done scraping professionally for 7+ years, and the number of problems you can encounter is immense, from websites blocking you, to table-looking elements that are in fact just a chaos of divs, to iframes... About vectorizing your markdown: I once did that on my machine in a "chat with PDF" project, and with just 1,024 dimensions and 20 pages of PDF I had to wait long minutes to generate the vector store, which then has to be searched for every request, also locally (not everyone has the hardware for it).
@@redamarzouk FireCrawl doesn't offer much value when there are free Python resources and paid tools that let you scrape websites without needing your own API key. You still have to input your OpenAI API key with FireCrawl, making it less appealing. Why pay for something when there are free or cheaper options that are easier to use? Thanks for sharing, but I'll stick with the alternatives.
$50 monthly fee 🎉😂😅
I actually filmed an hour and wanted to go through the financials of this method and whether it makes sense, but I edited that part out so the video stays under 30 minutes. I agree $50 is high, and the markdowns should be of high quality so the tokens are fewer and the LLM cost stays cheap. By the way, I'm not sponsored in any way by Firecrawl; I was going to talk about Jina AI or ScrapeGraph-AI, which do the same thing, before deciding on Firecrawl.
Firecrawl is not open source!!!
You nailed it too. We need to refuse these false open-source codebases that are in reality commercial endeavours. I use only FREE and OPEN code.
Except it is. Refer to its repo, it shows how to run it locally github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md
@@redamarzouk I'll take a look. Thanks.
But you are not using the open-source version, you are using their API... perhaps next time you could run it locally.
@@redamarzouk The open-source repo is still not ready for self-hosting.