Reda Marzouk
This AI Agent can Scrape ANY WEBSITE!!!
In this video, we'll create a Python script together that can scrape any website with only minor modifications.
________ 👇 Links 👇 ________
🤝 Discord: discord.gg/jUe948xsv4
💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/
📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa
🤖 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos
Website: www.automation-campus.com/
FireCrawl: www.firecrawl.dev/
Github repo: github.com/redamarzouk/Scraping_Agent
________ 👇 Content 👇 ________
Introduction to Web Scraping with AI - 0:00
Advantages Over Traditional Methods - 0:36
Overview of FireCrawl Library - 1:13
Setting Up FireCrawl Account and API Key - 1:24
Scraping with FireCrawl: Example and Explanation - 1:36
Universal Web Scraping Agent Workflow - 2:33
Setting Up the Project in VS Code - 3:52
Writing the Scrape Data Function - 5:41
Formatting and Saving Data - 6:58
Running the Code: First Example - 10:14
Handling Large Data and Foreign Languages - 13:17
Conclusion and Recap - 17:21
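The description above promises a scrape-then-extract loop; the snippet below is a rough sketch of that idea, assuming the firecrawl-py and openai packages, with a placeholder URL, field names and prompt rather than the repo's exact code.

```python
# Hedged sketch of a FireCrawl-plus-OpenAI extraction pass; the URL, field
# names and prompt are placeholders, not the repo's exact code.
import json
import os

from firecrawl import FirecrawlApp
from openai import OpenAI

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# scrape_url returned a dict with a 'markdown' key in the firecrawl-py
# version current at the time; newer releases may wrap the result differently.
page = app.scrape_url("https://www.example.com/listings")
markdown = page["markdown"]

fields = ["Address", "Price", "Beds", "Baths"]  # change per target website
prompt = (
    "Extract the following fields from the page content below and reply with "
    f"a JSON object of the form {{\"listings\": [...]}} using exactly these keys: "
    f"{fields}\n\n{markdown}"
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)
print(json.loads(response.choices[0].message.content)["listings"])
```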
42,765 views

Video

NEW GPT-4o: Prepare to be SHOCKED!!
1.5K views · a month ago
In this video, we dive into the launch of GPT-4o by OpenAI, covering its new features and capabilities. We'll check out its real-time conversational speech, top-notch benchmark performance, and availability for free users. Stick around as we react to live demos and chat about it. 👇 Links 👇 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: instagram....
Llama 3 FULLY LOCAL on your Machine | Run Llama3 locally
1K views · 2 months ago
FULLY Local Llama 3, on your machine. Run Llama 3-8B in a local server and integrate it inside your AI Agent project. 👇 Links 👇 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos LMStudio: lmstudio.ai/ Ollama: ollama.com/ www.automation-campus.com/ Introduction Llama3: 00:00...
Llama 3 BREAKS the industry !!! | Llama3 fully Tested
2.3K views · 2 months ago
FULLY Tested Llama 3, the flagship model from Meta. Benchmark of GPT-4 vs GPT-4 Turbo vs Llama 3. 👇 Links 👇 lmstudio.ai/ 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos www.automation-campus.com/ 👇 Content👇 00:00 Introduction to Llama3 00:30 All you need to know about Lla...
GPT-4 Surpassed Claude 3 (Again) | GPT-4 Turbo fully tested
3.2K views · 2 months ago
FULLY Tested GPT-4 Turbo, the flagship model from OPENAI. Benchmark of GPT-4 vs GPT-4 Turbo 👇 Links 👇 lmstudio.ai/ 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos www.automation-campus.com/ 👇 Content👇 00:00 Introduction and AI Week News 00:11 Launch of a new AI model by M...
AUTOGEN STUDIO : The Complete GUIDE (Build AI AGENTS in minutes)
8K views · 2 months ago
The full guide to get started with Autogen Studio, Create Powerful AI Agents in a couple of minutes with real life projects. 👇 Links 👇 lmstudio.ai/ 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos www.automation-campus.com/ 👇 Content👇 00:00 Introduction to Agentic Workflow...
Easily Run LOCAL Open-Source LLMs for Free
3K views · 3 months ago
Run locally hosted open source LLM for free. LMStudio helps you download and run private Models from huggingFace in a no code environment, it's a solid Free Chatgpt alternative. 👇 Links 👇 lmstudio.ai/ 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos www.automation-campus.c...
Elon Does The Unthinkable, Grok-1 is officially the LARGEST Open Source mode!!
2.7K views · 3 months ago
Elon musk has just launched Grok-1 to the rest of the world. #elonmusk #grok #chatgpt #openai 👇 Links 👇 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos www.automation-campus.com/downloads
This Agent can create Dalle Images at SCALE!!
748 views · 3 months ago
Agent to generate images on Dalle 3 automatically. #chatgpt #gpt #dalle3 #automation 👇 Websites👇 Cloud.uipath.com 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos www.automation-campus.com/downloads 👇 Content👇 00:00 Dalle Agent 01:05 Prerequisites 02:34 Agent steps 06:39 R...
The HARSH REALITY of being an RPA Developer!!
2.6K views · 3 months ago
Introducing the latest innovation in AI technology: Digital Agents that can control your desktop! With ChatGPT, you can now have a virtual assistant that can perform tasks on your computer, just by chatting with it. and this one can operate all of your desktop/web apps. 👇 Websites👇 github.com/OthersideAI/self-operating-computer 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/re...
This AI Agent can CONTROL your ENTIRE DESKTOP!!!
8K views · 4 months ago
Introducing the latest innovation in AI technology: Digital Agents that can control your desktop! With ChatGPT, you can now have a virtual assistant that can perform tasks on your computer, just by chatting with it. and this one can operate all of your desktop/web apps. 👇 Websites👇 github.com/OthersideAI/self-operating-computer 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/re...
10x your PRODUCTIVITY with this NEW AI tool !!!
15K views · 4 months ago
Improve your productivity with this amazing new AI tool! This tutorial will show you how to use this tool to copy and paste from any document, screen, or application. Say goodbye to time-consuming tasks and hello to increased efficiency with this game-changing tool! 👇 Websites👇 www.uipath.com/product/clipboard-ai 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/...
A Free Personal AI Agent that actually WORKS!!!
26K views · 4 months ago
Learn about the future of digital automation with autonomous agents, large action models. Discover how these technologies are transforming industries and improving efficiency and productivity. Don't get left behind, stay ahead of the game and find out what the future holds for digital automation! 👇 Websites👇 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻...
UiPath joins Large Action Model Race
1.8K views · 5 months ago
In this video you'll learn how to create a robot to fill forms automatically on any website and with only minimal changes. 👇 Websites👇 www.automation-campus.com/downloads cloud.uipath.com/ 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos 👇 Content👇 00:00 Intro 00:49 Data E...
UiPath made PDF Extraction a lot easier - Document Understanding UiPath
2K views · 5 months ago
Extract any pdf using 4 simple UiPath Activities, follow the steps in the video and you'll have a single process to interact with any pdf file. 👇 Websites👇 www.automation-campus.com/downloads cloud.uipath.com/ 🤝 Discord: discord.gg/jUe948xsv4 💼 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻: www.linkedin.com/in/reda-marzouk-rpa/ 📸 𝗜𝗻𝘀𝘁𝗮𝗴𝗿𝗮𝗺: redamarzouk.rpa 𝗬𝗼𝘂𝗧𝘂𝗯𝗲: www.youtube.com/@redamarzouk/videos 👇 Content👇 00:...
UiPath Queues For Beginners
2.1K views · 8 months ago
UiPath Errors Troubleshoot - The only Trick you'll EVER need
843 views · 9 months ago
Object reference not set to an instance of an object - UiPath - BEST SOLUTION!!!
8K views · 9 months ago
ChatGPT API Advanced Configuration | GPT-4 API Pricing
652 views · 10 months ago
REVOLUTIONARY!!!!! UiPath Document Understanding and Generative AI - Invoice Data Extraction
4.5K views · 10 months ago
Don't Miss Out: UiPath Document Understanding and Generative AI. (Game-Changer!!!!)
3.4K views · 10 months ago
UiPath Advanced Certification | Activities and Properties Part 2 | Questions And Answers
372 views · 10 months ago
UiPath Advanced Certification | Activities and Properties | Practice Test Solutions
367 views · 11 months ago
UiPath Advanced Certification | UiPath Studio | UiPath Practice Exam
499 views · 11 months ago
UiPath Advanced Certification | State Machine, Flowchart and Sequence in UiPath | UiPath RPA
460 views · 11 months ago
UiPath Advanced Certification | How to get certified in UiPath RPA in 2023
1.6K views · 11 months ago
Resume Screener - Extract data from CV PDF documents using UiPath and ChatGPT
2.6K views · 11 months ago
UiPath - Download File From URL | How to download file from website using UiPath
6K views · a year ago
UiPath Excel Add-In | UiPath Attended Automation inside Excel
665 views · a year ago
Top 3 Changes in the MODERN Design of UiPath Studio
535 views · a year ago

Comments

  • @aimattant
    @aimattant 23 hours ago

    Nice project. I worked on your code base for a while and used Groq Mixtral instead, with multiple keys to get past rate limits. Firecrawl is not automatic when it comes to pagination; you still need to add HTML code, which defeats the purpose. Slow, but OK for a free option. But I got around that, I think. The next step is to use it in the front end. Zillow's API is only available for property developers, so scraping with manual inputs is the only way. However, working with the live API functionality would be the best way forward. Nice job!

  • @j.d.4697
    @j.d.4697 3 days ago

    Having my computer managed by an AI I can naturally communicate with is one of my biggest dreams for the short-term future.

  • @EricAiken-oq4vu
    @EricAiken-oq4vu 10 days ago

    I can't find an OpenAI model that works for me. I've tried gpt-3, gpt-3.5, gpt-3.5-turbo-1186; I always get a 404 "does not exist or you don't have access to it". GPT says use davinci or curie. Any suggestions?

  • @hemenths.k9009
    @hemenths.k9009 11 days ago

    Hey, I am getting an "Invoke Code: Exception has been thrown by the target of an invocation" error when I ran this.

  • @Ashort12345
    @Ashort12345 16 days ago

    The AI agent is unable to bypass Cloudflare, even after trying Ollama.

  • @dungtrananh1522
    @dungtrananh1522 16 days ago

    Dear sir, can I use my local LLM models instead of the OpenAI API?

  • @smokedoutmotions_
    @smokedoutmotions_ 17 days ago

    Thanks bro

  • @LearnAvecAmeen
    @LearnAvecAmeen 21 days ago

    Hello Si Reda, all the best insh'Allah :)

    • @redamarzouk
      @redamarzouk 18 days ago

      Thank you so much and to you too 😄

  • @sharifulislam7441
    @sharifulislam7441 21 days ago

    Good technology to keep in good book!

  • @jatinsongara4459
    @jatinsongara4459 23 days ago

    Can we use this for email and phone number extraction?

    • @redamarzouk
      @redamarzouk 18 days ago

      Absolutely, you just need to change the websites and the fields and you're good to go.

  • @JoaquinTorroba
    @JoaquinTorroba 24 days ago

    What other options are there besides Firecrawl? Thanks!

    • @JoaquinTorroba
      @JoaquinTorroba 24 days ago

      Just found it in the comments: "Firecrawl has 5K stars on GitHub, Jina ai has 4k and scrapegraph has 9k."

    • @redamarzouk
      @redamarzouk 18 days ago

      Exactly, Jina AI and ScrapeGraph AI are also options.

  • @FaithfulStreaming
    @FaithfulStreaming 26 days ago

    I like what you did, but for no-code people this is so hard because we don't know what we should install for Windows, etc. Really, really nice video.

  • @benom3
    @benom3 29 days ago

    Can you scrape multiple URLs at once? For example, if you wanted to scrape all the Zillow pages, not just the first page with a few houses. @redamarzouk

  • @avramgrossman6084
    @avramgrossman6084 29 days ago

    This is a nice video and very useful. In my applications I'm looking for the 'system' to have ALL the customer PDF invoices uploaded, or better yet, as a SalesOrder table in a database. This seems like a lot of work for just one customer and one email. Is there a way to create agents that could filter out which customer order? Etc.

  • @AmanShrivastava23
    @AmanShrivastava23 a month ago

    I'm curious: what do you do after structuring the data? Do you store it in a vector DB? If so, do you store the JSON as it is, or something else? And can it actually be completely universal? By that I mean, can it structure data without us providing the fields on which it should structure the data? Can we make it in some way where we upload a website and it understands the data and structures it accordingly?

  • @ilanlee3025
    @ilanlee3025 a month ago

    I'm just getting "An error occurred: name 'phone_fields' is not defined".

  • @nkofr
    @nkofr a month ago

    Nice! Any idea how to self-host Firecrawl, like with Docker? Also, can it be coupled with n8n? How?

    • @redamarzouk
      @redamarzouk a month ago

      I gotta be honest, I didn't even try. I tried to self-host an agentic software tool before and my PC was going crazy; it couldn't take the load of Llama3-8B running on LM Studio plus Docker plus filming at the same time. I simply don't have the hardware for it. If you want to self-host, here is the link: github.com/mendableai/firecrawl/blob/main/SELF_HOST.md; it is with Docker.

    • @nkofr
      @nkofr a month ago

      @@redamarzouk Thanks. Does it make sense to use it with n8n? Or maybe n8n can do the same without Firecrawl? (noob here)

    • @nkofr
      @nkofr a month ago

      @@redamarzouk or maybe with things like Flowise?

  • @zvickyhac
    @zvickyhac a month ago

    Can I use Llama 3 / Phi-3 on a local PC?

    • @redamarzouk
      @redamarzouk a month ago

      You theoretically can use it when it comes to data extraction, but you will need a large-context-window version of Llama 3 or Phi-3. I've seen a model where they have extended the context length to 1M tokens for Llama3-7B. You need to keep in mind that your hardware needs to match the requirements.

  • @karthickb1973
    @karthickb1973 a month ago

    awesome bro

  • @kamalkamals
    @kamalkamals a month ago

    Nope, it's not better than GPT.

    • @redamarzouk
      @redamarzouk a month ago

      You're right, right now it's not; these models are beating each other like there's no tomorrow. As of today, GPT-4o is the one at the top.

    • @kamalkamals
      @kamalkamals a month ago

      @@redamarzouk Before GPT-4 Omni, GPT-4 Turbo was still better; the only real advantage of Llama is that it's a free model :)

  • @titubhowmick9977
    @titubhowmick9977 a month ago

    Nice video. Another helpful video on the same topic czcams.com/video/dSX5eoD4-u4/video.htmlsi=8iKzgqHG97Ivf8wK

  • @titubhowmick9977
    @titubhowmick9977 a month ago

    Very helpful. How do you work around the output limit of 4096 tokens?

    • @redamarzouk
      @redamarzouk a month ago

      Hello, if you're using the OpenAI API, you need to add the parameter (max_tokens=xxxxxxxx) inside your OpenAI client call and define a number that doesn't exceed the maximum token count of the model you're using (128,000 for gpt-4o, for example).
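For readers hitting the same limit, here is a minimal sketch of that parameter, assuming the official openai package (v1+) and an illustrative prompt; note that max_tokens bounds the generated completion, while the 128K figure is gpt-4o's overall context window.

```python
# Minimal sketch of capping the completion length with max_tokens,
# assuming an OPENAI_API_KEY in the environment; the prompt is illustrative.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise the scraped markdown ..."}],
    max_tokens=4096,  # caps the generated output, not the model's context window
)
print(response.choices[0].message.content)
```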

  • @YOGiiZA
    @YOGiiZA a month ago

    Helpful, Thank you

  • @IlllIlllIlllIlll
    @IlllIlllIlllIlll a month ago

    Does it work on a MacBook Pro?

  • @santhoshkumar995
    @santhoshkumar995 a month ago

    I get Error code: 429 when running the code. -'You exceeded your current quota,...

    • @ilianos
      @ilianos a month ago

      In case you haven't used your OpenAI API key in a while: they changed the way it works, you need to pay in advance to refill your quota

  • @ArisimoV
    @ArisimoV a month ago

    Can you use this for a self-operating PC? Thanks

    • @redamarzouk
      @redamarzouk a month ago

      Believe me I tried, but my NVIDIA RTX 3050 4Gb simply can’t withstand filming and running Llava at the same time. Hopefully I’ll upgrade my setup soon and be able to do it.

    • @ArisimoV
      @ArisimoV a month ago

      So it is possible; it's just a matter of programming and PC specs.

  • @PointlessMuffin
    @PointlessMuffin a month ago

    Does it handle JavaScript, infinite scroll, and button-click navigation?

    • @morespinach9832
      @morespinach9832 a month ago

      Yes, you can ask LLMs to do all that like a human would.

  • @SJ-rp2bq
    @SJ-rp2bq a month ago

    In the US, a “bedroom” is a room with a closet, a window, and a door that can be closed.

  • @bls512
    @bls512 a month ago

    Neat overview. Curious about API costs associated with these demos. Try zooming into your code for viewers.

    • @morespinach9832
      @morespinach9832 a month ago

      watch on big monitors as most coders do

    • @redamarzouk
      @redamarzouk a month ago

      For only the demo you've seen, I spent $0.50; for creating the code and launching it 60+ times, I spent $3. I will zoom in next time.

  • @shauntritton9541
    @shauntritton9541 a month ago

    Wow! The AI was even clever enough to convert square meters into square feet, no need to write a conversion function!

  • @todordonev
    @todordonev a month ago

    Web scraping as it is right now is here to stay, and AI will not replace it (it can just enhance it in certain scenarios). First of all, the term "scraping" is tossed around everywhere and used vaguely. When you "scrape", all you do is move information from one place to another, for example getting a website's HTML into your computer's memory. Then comes "parsing", which is extracting different entities from that information, for example extracting the product price and title from the HTML we "scraped". These are separate actions; they are not interchangeable, one is not more important than the other, and one can't work without the other. Both actions come with their own challenges. What these kinds of videos promise to fix is the "parsing" part of it. It doesn't matter how advanced AI gets, there is only ONE way to "scrape" information, and that is to make a connection to the place the information is stored (whether it's an HTTP request, browser navigation, an RSS feed request, an FTP download or a stream of data). It's just semi-automated in the background.

    Now that we have the fundamentals, let me clearly state this: for the vast majority (99%) of cases, "web scraping with AI" is a waste of time, money, resources and our environment.

    Time: it's deceiving. As AI promises to extract information with a "simple prompt", you'll need to iterate over that prompt quite a few times in order to make a somewhat reliable data-parsing solution. In that time you could have built a simple Python script to extract the data required. More complicated scenarios will affect both the AI and the traditional route.

    Money: you either use 3rd-party services for LLM inference or you self-host an LLM. Both solutions in the long term will be orders of magnitude more expensive than a traditional Python script.

    Resources: a lot of people don't realize this, but running an LLM for cases in which an LLM is not needed is extremely wasteful. I've run scrapers on old computers, Raspberry Pis and serverless functions; that is a speck of dust in hardware requirements compared to running an LLM on an industrial-grade computer with powerful GPU(s).

    Environment: as per the resources needed, this affects our environment greatly, as new and more powerful hardware needs to be invented, manufactured and run. For the people that don't know, AI inference machines (whether self-hosted or 3rd-party) are powerhouses, thus a lot of watt-hours wasted, fossil fuels burnt, etc.

    Reliability: "parsing" information with AI is quite unreliable, mainly because of the nature of how LLMs work, but also because a lot more points of failure are introduced (information has to travel multiple times between services, LLM models change, you hit usage and/or budget limits, LLMs experience high loads and inference speed sucks or it fails altogether, etc.).

    Finally: most "AI extraction" is just marketing BS letting you believe that you'll achieve something that requires a human brain and workforce with just "a simple prompt". I've been doing web automation and data extraction for more than a decade for a living. I've also started incorporating AI in some rare cases, where traditional methods just don't cut it. All that being said, for the last 1% of cases where it does make sense to use AI for data parsing, here's what I typically do (after the information is already scraped):

    1. First I remove the vast majority of the HTML. If you need an article from a website, it's not going to be in the <script>, <style>, <head> or <footer> tags (you get the idea), so using a Python library (I love lxml) I remove all these tags along with their content. Since we are just looking for an article, I will also remove ALL of the HTML attributes, like classes (a big one), ids, and so on. After that I will remove all the parent/sibling cases where it looks like a useless staircase of tags. I've tried converting to markdown and parsing, I've tried parsing with a screenshot, but this method is vastly superior due to important HTML elements still being present, and the general HTML knowledge of LLMs. This step will make each request at least 10 times cheaper, and will allow us to use models with lower context sizes.

    2. I will then manually copy the article content that I need and put it, along with the above resulting string, into a JSON object plus prompts to extract an article from the given HTML. I will do this at least 15 times. This is the step where training data is created.

    3. Then I will fine-tune a GPT-3.5 Turbo model with that JSON data. After 10-ish minutes of fine-tuning and around $5-10, I have an "article extraction fine-tuned model" that will always outperform any agentic solution in all areas (price, speed, accuracy, reliability). Then I just feed the model a new (unseen) piece of HTML that has passed step 1 (above) and it will reliably spew out an article for a fraction of a cent in a single step (no agents needed). I have a few of those running in production for clients (for different datapoints), and they do very well, but it's important that a human goes over the results every now and again. Also, if there is an edge case and the fine-tune did not perform well, you just iterate and feed it more training data, and it just works.
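A rough sketch of the HTML-shrinking step described in point 1 above, assuming the lxml library; the tag list and attribute handling are illustrative, not the commenter's exact code.

```python
# Shrink raw HTML before any LLM call: drop non-content tags and all
# attributes so far fewer tokens are sent to the model.
from lxml import etree, html


def shrink_html(raw_html: str) -> str:
    tree = html.fromstring(raw_html)
    # Remove tags whose content never holds the article body.
    etree.strip_elements(
        tree, "script", "style", "head", "footer", "nav", with_tail=False
    )
    # Drop every attribute (classes, ids, data-*): most of the token bloat.
    for el in tree.iter():
        if isinstance(el.tag, str):  # skip comments / processing instructions
            el.attrib.clear()
    return html.tostring(tree, encoding="unicode", pretty_print=True)
```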

    • @ilianos
      @ilianos a month ago

      Thanks for taking the time to explain this! Very useful to clarify!

    • @rafael_tg
      @rafael_tg a month ago

      Thanks man. I am specializing in web scraping in my career. Do you have a blog or similar where you share content about web scraping as a career?

    • @morespinach9832
      @morespinach9832 a month ago

      Nonsense. Scraping has for 10 years included both fetching data and then structuring it in some format, XML or JSON. Then we can do whatever we want with that structured data. Introducing "parsing" as some distinct construct is inane. More importantly, the way scraping can work today is leagues better than what the likes of APIFY used to do until 2 years ago, and yes, this uses LLMs. Expand your reading.

    • @morespinach9832
      @morespinach9832 a month ago

      @@ilianos his "explanation" is stupid.

    • @morespinach9832
      @morespinach9832 a month ago

      @@rafael_tg watch more sensible videos and comments.

  • @6lack5ushi
    @6lack5ushi a month ago

    Dumpling AI is a startup doing the same! I'm swapping to this; they are $50 a month for 10,000, and 6 a minute.

  • @ajax0116
    @ajax0116 a month ago

    It seems Zillow is blocking my access --> "Press & Hold to confirm you are a human (and not a bot)". I was able to run on Trulia, but without my VPN.

  • @nabil-nc9sl
    @nabil-nc9sl a month ago

    God bless you, bro, mashallah.

  • @tirthb
    @tirthb a month ago

    Thanks for the helpful content.

  • @user-se9qv5pi1q
    @user-se9qv5pi1q a month ago

    You said that sometimes the model returns the response with different key names, but if you pass the pydantic model to the OpenAI model as a function, you can expect an invariable object with the keys that you need.

    • @user-se9qv5pi1q
      @user-se9qv5pi1q a month ago

      Also, pydantic models can be scripted to have a nested structure, in contrast to JSON schemas.

    • @redamarzouk
      @redamarzouk a month ago

      Correct, I've actually used them while I was playing around with my code (alongside function calling). The issue I found is that I would have to explain both the pydantic schema and how I made it dynamic, because I want a universal web scraper that can use different fields every time we're scraping a different website. That ultimately would've made the video a 30+ minute video, so I opted for the easier, less performant way.
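As a hedged illustration of the "dynamic schema" trade-off mentioned in this reply, pydantic (v2 assumed here) can build the model at runtime from whatever field list the user supplies; the field names below are placeholders.

```python
# Build a pydantic model at runtime from a user-supplied field list; its JSON
# schema can then feed a function-calling request or an LLM extractor.
from pydantic import create_model

fields = ["Address", "Price", "Beds", "Baths"]  # changes per website
Listing = create_model("Listing", **{f: (str, ...) for f in fields})

print(Listing.model_json_schema())
```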

  • @EddieGillies
    @EddieGillies a month ago

    What about Angie list 😢

  • @egimessito
    @egimessito a month ago

    What about captchas?

    • @redamarzouk
      @redamarzouk a month ago

      Websites don't like scrapers in general, so extensive scraping will need a VPN (one that can handle the volume of your scraping).

    • @egimessito
      @egimessito a month ago

      @@redamarzouk Also, a VPN would not protect against captchas. They are there for a good reason, but it would be interesting to find a way around them to build tools for customers.

  • @Chamati_ab
    @Chamati_ab a month ago

    Thank you Reda for sharing the knowledge! Very appreciated!

    • @redamarzouk
      @redamarzouk a month ago

      Really appreciate the kind words, My pleasure 🙏🙏

  • @Yassine-tm2tj
    @Yassine-tm2tj a month ago

    In my experience, function calling is way better at extracting consistent JSON than just prompting. Anyway, God bless a son of my country.

    • @Chillingworth
      @Chillingworth a month ago

      Good idea

    • @redamarzouk
      @redamarzouk a month ago

      You're on point with this; using function calling is always better for JSON consistency. I actually used it when I was creating my original code. The issue is that I have a parameter "Fields" that can change depending on the type of website being scraped. So to account for that in my code I either need to make the schema inside the function call generic (not so great) or make it dynamic (I really didn't want to go there; it would make the tutorial much more complicated). I also tried using pydantic models, since Firecrawl has its own LLM extractor that can use them, but it didn't perform as well. But yeah, you're right, function calling is always better. May God protect you, man.
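A hedged sketch of what that dynamic function-calling schema could look like with the openai package (v1+); the function name, field list and prompt are illustrative, not the video's code.

```python
# Force structured output via a tool whose parameters are built from a
# runtime field list, so the JSON keys stay consistent across calls.
from openai import OpenAI

client = OpenAI()
fields = ["Address", "Price", "Beds", "Baths"]  # changes per website

tool = {
    "type": "function",
    "function": {
        "name": "record_listings",
        "description": "Record listings extracted from scraped page content.",
        "parameters": {
            "type": "object",
            "properties": {
                "listings": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {f: {"type": "string"} for f in fields},
                        "required": fields,
                    },
                }
            },
            "required": ["listings"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the listings from: ..."}],
    tools=[tool],
    tool_choice={"type": "function", "function": {"name": "record_listings"}},
)
print(response.choices[0].message.tool_calls[0].function.arguments)
```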

    • @Yassine-tm2tj
      @Yassine-tm2tj a month ago

      @@redamarzouk You have a knack for this, bro. Keep up the good work. May God grant you success.

  • @Chillingworth
    @Chillingworth a month ago

    You could just ask GPT-4 one time to generate the extraction code or the tags to look for, per website, so that it doesn't always need to use AI for scraping, and you might get better results; then, if that code fails, you fall back to regenerating it and cache it again.

    • @redamarzouk
      @redamarzouk a month ago

      Creating a dedicated script for a website is the best way to get the exact data you want, you're right in that sense, and you can always fix it with GPT-4 as well. But let's say you're actively scraping 10 competitor websites where you only want to get their pricing updates and their new offerings; does it make sense to maintain 10 different scripts rather than have 1 script that can do the job and needs very minimal intervention? It depends on the use case, but there are times when customized scraping code isn't the best approach.

    • @Chillingworth
      @Chillingworth a month ago

      @@redamarzouk I didn't mean like that. I meant you would basically do the same thing as your technique, but you could just use the AI one for each domain, asking it what the CSS selectors are for the elements you're interested in. That way when you're looking for updates you don't even need to do any calls to the LLM unless it fails because the structure is different. You don't even have to maintain multiple scripts, just make a Dictionary with the domain name and the CSS paths and there you go. Of course a lot of different pages may have different structure but you could probably just feed in the HTML from a few different pages of the site and use a prompt telling GPT-4 the URLs and the markup and tell it to figure out the URL pattern that will match the specific stuff to look for. You could even still do this with GPT-3.5-Turbo. Basically the only idea I'm throwing out there is to ask the AI to tell you the tag names and have your code simply extract the info using BeautifulSoup or something else that can grab info out of tags based on CSS query selectors. That way, you can cache that info and then scrape faster after you get that info the initial time. Would only be a little more work but might be a lot better for some use cases. Just thought it was a cool idea
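A minimal sketch of the selector-caching idea from this thread, assuming beautifulsoup4; the domain, selectors and fields are hypothetical and would normally come from a one-off GPT-4 call per domain, regenerated only when extraction fails.

```python
# Extract cached fields with CSS selectors so the LLM is only consulted
# once per domain (or again when a selector stops matching).
from bs4 import BeautifulSoup

SELECTOR_CACHE = {
    "example-listings.com": {"address": "h2.address", "price": "span.price"},
}


def extract(domain: str, page_html: str) -> dict:
    soup = BeautifulSoup(page_html, "html.parser")
    out = {}
    for field, selector in SELECTOR_CACHE[domain].items():
        el = soup.select_one(selector)
        out[field] = el.get_text(strip=True) if el else None  # None -> refresh selectors
    return out
```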

  • @d.d.z.
    @d.d.z. a month ago

    Thank you. I have a use case: can I use the tool to make queries to a database, save the results as your tutorial shows, and also print the result of every query to PDF?

    • @redamarzouk
      @redamarzouk a month ago

      If you already have a database you want to make queries against, you don't need any scraping (unless you need to scrape website to create that database). But yeah it sounds like you can do that without the need for any AI in the loop.

  • @ika9
    @ika9 a month ago

    While it is a practical solution, it still requires access to an OpenAI API or other LLM API, which can incur costs. BeautifulSoup and Selenium remain free alternatives. However, using LM Studio locally can provide the advantage of utilizing your own LLM, offering greater flexibility and control.

  • @user-xq4yj6ni8v
    @user-xq4yj6ni8v a month ago

    Nice idea. Now wake me up when there are no credits involved (completely free).

    • @redamarzouk
      @redamarzouk a month ago

      It's open source; this is how you can run it locally and contribute to the project: github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md. But honestly, as IT folks we've got to stop going at each other for wanting to charge for an app we've created. Granted, I'm not recommending this to my clients yet and $50/month is high, but if that's what they want to charge, it's really up to them.

  • @bastabey2652
    @bastabey2652 a month ago

    Just use an LLM: pass it the source code of the page and generate the scraping function à la carte. The LLM is the secret sauce.

  • @ridabrahim7604
    @ridabrahim7604 a month ago

    I just want to understand what Firecrawl's role is in all of this? It seems to me there's nothing special about it at all!!!

  • @roblesrt
    @roblesrt a month ago

    Awesome! thanks for sharing.

  • @ginocote
    @ginocote a month ago

    It's easy to do with a free Python library: read the HTML, convert it to markdown, and even convert it to vectors for free with a transformer, etc.
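For the free HTML-to-markdown step this comment alludes to, one possible sketch uses the html2text and requests packages; the URL is a placeholder.

```python
# Fetch a page and convert its HTML to markdown with free libraries only.
import html2text
import requests

raw_html = requests.get("https://www.example.com/listings", timeout=30).text

converter = html2text.HTML2Text()
converter.ignore_images = True  # keep the markdown lean for any later LLM step
markdown = converter.handle(raw_html)
print(markdown[:500])
```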

    • @actorjohanmatsfredkarlsson2293
      @actorjohanmatsfredkarlsson2293 a month ago

      Exactly, I didn't really understand the point of Firecrawl in this solution!? Does Firecrawl do anything better than a free Python library? Any suggestions on Python libraries, btw?

    • @morespinach9832
      @morespinach9832 a month ago

      Have you used it on complex websites with many ads, or logins, or progressive JS-based loads, or infinite scrolls? Clearly not.

    • @redamarzouk
      @redamarzouk a month ago

      Firecrawl has 5K stars on GitHub, Jina AI has 4K and ScrapeGraph has 9K. Saying that you can just implement these tools easily is frankly disrespectful to the developers who have created these libraries and made them open source for the rest of us. In the example I covered, I didn't show the capability of filtering the markdown to only keep the main content of a page, nor did I show how to scrape using a search query. I've done scraping professionally for 7+ years now, and the number of problems you can encounter is immense, from websites blocking you, to websites with table-looking elements that are in fact just a chaos of divs, to iframes... About vectorizing your markdown: I once did that on my machine in a "chat with PDF" project, and just with 1024 dimensions and 20 pages of PDF I had to wait long minutes to generate the vector store, which then has to be searched for every request, also locally (not everyone has the hardware for it).

    • @KarriemMuhammad-wq4lx
      @KarriemMuhammad-wq4lx 13 days ago

      @@redamarzouk FireCrawl doesn't offer much value when there are free Python resources and paid tools that let you scrape websites without needing your own API key. You still have to input your OpenAI API key with FireCrawl, making it less appealing. Why pay for something when there are free or cheaper options that are easier to use? Thanks for sharing, but I'll stick with the alternatives.

  • @stanpittner313
    @stanpittner313 a month ago

    $50 monthly fee 🎉😂😅

    • @redamarzouk
      @redamarzouk a month ago

      I actually filmed an hour and wanted to go through the financials of this method and whether it makes sense, but I edited that part out so the video is less than 30 minutes. But I agree $50 is high, and the markdown should be of good quality so that there are fewer tokens and therefore a cheaper LLM cost. Btw, I'm not sponsored in any way by Firecrawl; I was going to talk about Jina AI or ScrapeGraph-AI, which do the same thing, before deciding on Firecrawl.

  • @simonren4890
    @simonren4890 a month ago

    firecrawl is not open-sourced!!!

    • @paulocacella
      @paulocacella a month ago

      You too nailed it. We need to refuse these false open source codes that are in reality commercial endeavours. I use only FREE and OPEN codes.

    • @redamarzouk
      @redamarzouk a month ago

      Except it is. Refer to its repo, it shows how to run it locally github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md

    • @paulocacella
      @paulocacella a month ago

      @@redamarzouk I'll take a look. Thanks.

    • @javosch
      @javosch a month ago

      But you are not using the open-source version, you are using their API... perhaps next time you could run it locally.

    • @everbliss7955
      @everbliss7955 25 days ago

      ​@@redamarzouk the open source repo is still not ready for self hosting.