Web Scraping AI AGENT, that absolutely works 😍

  • Published May 8, 2024
  • ScrapeGraphAI is a web scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites, documents and XML files. Just say which information you want to extract and the library will do it for you! (A minimal usage sketch follows the links below.)
    🔗 Links 🔗
    Scrape Graph AI
    github.com/VinciGit00/Scrapeg...
    Code used in the video - github.com/amrrs/scrapegraph-...
    ❤️ If you want to support the channel ❤️
    Support here:
    Patreon - / 1littlecoder
    Ko-Fi - ko-fi.com/1littlecoder
    🧭 Follow me on 🧭
    Twitter - / 1littlecoder
    Linkedin - / amrrs
  • Science & Technology
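
For orientation, here is a minimal usage sketch assembled from the code shown in the video and the snippet pasted in the comments below. The config keys beyond "model" vary by library version, so treat this as an assumption rather than the exact notebook code:

    # Minimal ScrapeGraphAI sketch (assumed, based on the code used in the video)
    from scrapegraphai.graphs import SmartScraperGraph

    # LLM configuration; "ollama/mistral" matches the example pasted in the comments.
    # A local Ollama setup may also need extra keys (e.g. an embeddings section),
    # depending on the library version.
    graph_config = {
        "llm": {
            "model": "ollama/mistral",
        },
    }

    # Instantiate the graph with a natural-language prompt and a source URL.
    # The "http" prefix makes the library treat the source as a URL rather than a local directory.
    smart_scraper_graph = SmartScraperGraph(
        prompt="List me all the articles",
        source="https://news.ycombinator.com",
        config=graph_config,
    )

    # Run the pipeline and print the extracted result
    result = smart_scraper_graph.run()
    print(result)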

Comments • 85

  • @bastabey2652
    @bastabey2652 1 day ago

    this ScrapegraphAI tool is the most interesting scraping tool I've tested so far

  • @unclemike2008
    @unclemike2008 27 days ago +4

    "poor" Love you brother! Right there with you. Great video. Been trying and failing to get a scraper with java support. Cheers!

  • @marcoaerlic2576
    @marcoaerlic2576 11 days ago +1

    Really great video, thank you. I would be interested in seeing more videos about ScrapeGraphAI.

  • @ayyanarjayabalan
    @ayyanarjayabalan 28 days ago

    Awesome, we need more practical sessions with code like this.

  • @Balajik7-qh1pq
    @Balajik7-qh1pq 28 days ago

    I like all your videos, keep rocking bro

  • @user-ew8ld1cy4d
    @user-ew8ld1cy4d 8 days ago

    Great video! Thank you!

  • @alx8439
    @alx8439 28 days ago +10

    Next time it will also need a visual model to solve captchas, because website administrators will be protecting their precious content from scraping :)

  • @liamlarsen9286
    @liamlarsen9286 27 days ago

    Thanks for the heads up at 6:00.
    It worked only when using that version.

  • @HeberLopez
    @HeberLopez 27 days ago +1

    I find this live example pretty useful for general purposes; I can think of multiple ways I could use this for one-off PoCs.

  • @alqods80
    @alqods80 27 days ago +1

    There is a Playwright function that skips the irrelevant resources so the scraping becomes faster.
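
    The comment doesn't name the function, so here is a hedged sketch of the kind of Playwright resource blocking it likely refers to (page.route plus route.abort for images, fonts, media and stylesheets); the blocked types and the target URL are illustrative assumptions:

        from playwright.sync_api import sync_playwright

        # Resource types that usually aren't needed for text extraction (assumption)
        BLOCKED_TYPES = {"image", "media", "font", "stylesheet"}

        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            # Abort requests for blocked resource types, let everything else through
            page.route(
                "**/*",
                lambda route: route.abort()
                if route.request.resource_type in BLOCKED_TYPES
                else route.continue_(),
            )
            page.goto("https://news.ycombinator.com")
            html = page.content()  # trimmed-down page HTML to hand to the scraper
            browser.close()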

  • @patrickwasp
    @patrickwasp 28 days ago +7

    It’s a spider, not an octopus. Spiders crawl on webs.

    • @opusdei1151
      @opusdei1151 27 days ago

      What is an octopus, then? Something that crawls APIs or does data mining?

  • @Raphy_Afk
    @Raphy_Afk 28 days ago +2

    Amazing! If my PSU wasn't dead I wouldn't be sleeping for days

  • @manojy1015
    @manojy1015 28 days ago

    We need more tutorials with practical live examples of LLMs, especially RAG and fine-tuning.

  • @madhudson1
    @madhudson1 10 days ago

    It depends on the LLM used and the questions you pose. It can often fail to generate JSON, and the library isn't best suited for iterating through a collection of sites.

  • @jbo8540
    @jbo8540 28 days ago +3

    If your LLM gives you an article you can't find, my first assumption is that it made it up. While this is an interesting use case, it's likely going to take very precise prompt engineering to avoid hallucinated outputs.

    • @1littlecoder
      @1littlecoder  28 days ago +2

      No, it's my bad. After the video I reviewed the web page; in fact, I added the screenshot in the video. It was inside the carousel.

  • @kalilinux8682
    @kalilinux8682 28 days ago +1

    Could you please do more videos on this? For example, trying it on more educational content with equations rendered using MathJax and KaTeX.

  • @honneon
    @honneon 28 days ago

    i luv it❤

  • @jmirodg7094
    @jmirodg7094 24 days ago

    thanks! 👍

  • @inplainview1
    @inplainview1 28 days ago +3

    Watching this before youtube gets upset again. 😉

    • @1littlecoder
      @1littlecoder  28 days ago +2

      Honestly, I was actually scared before uploading this, but let's see!

    • @inplainview1
      @inplainview1 28 days ago +1

      @1littlecoder Hopefully all is well.

  • @EobardUchihaThawne
    @EobardUchihaThawne 28 days ago

    OK, now that's a good use of an AI model.

  • @ngoduyvu
    @ngoduyvu 25 days ago

    Thanks for the tutorial. Please make more tutorials on ScrapeGraphAI. Can you make one for scraping websites that have anti-bot protection or require credentials (login)?

  • @tauquirahmed1879
    @tauquirahmed1879 27 days ago +1

    great video....

  • @jarad4621
    @jarad4621 24 days ago

    Is the LLM there to convert the raw HTML to structured data? Then it saves to a RAG store and you can query the data with another LLM to analyse it? I need to scrape homepages from 10k sites into structured data in a RAG DB so I can ask the sites questions. Can it be set up to run over many sites like an automated agent, or can it be used as a tool or function call in an agent framework like CrewAI? That video would be cool.

  • @monuaimat5228
    @monuaimat5228 27 days ago +1

    RAG: Ritual Augmented Generation 😂

    • @J3R3MI6
      @J3R3MI6 27 days ago +1

      🕯️🕷️🕯️

  • @Macorelppa
    @Macorelppa 28 days ago +1

    🥇

  • @BiXmaTube
    @BiXmaTube 27 days ago

    We need a proper PDF-parsing AI that I can run on a cloud server without a GPU: extracting text, tables and images, and arranging them in a DB based on a prompt that puts each piece of data in the right table. It would be amazing if you could find something like that.

  • @user-zt2lp6hq7l
    @user-zt2lp6hq7l 2 days ago

    Reddit being called the front page of the internet is like... no, please.

  • @IdPreferNot1
    @IdPreferNot1 27 days ago

    What am I missing... I get an error running the async cell?

  • @user-nm2wc1tt9u
    @user-nm2wc1tt9u 18 days ago

    Does it work on Google Colab?

  • @shobhanaayodya7024
    @shobhanaayodya7024 26 days ago

    That logo is a spider 🕸️🕷️

  • @morease
    @morease 25 days ago

    I fail to see why RAG is needed when the LLM can simply be asked to identify the HTML path/element that contains the content, and the HTML can then be extracted from that with cheerio.
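
    A rough Python sketch of the approach this comment describes (the comment mentions cheerio, which is JavaScript; BeautifulSoup is used here as a stand-in, and get_selector_from_llm is a hypothetical placeholder for whatever LLM call would return the selector):

        import requests
        from bs4 import BeautifulSoup

        html = requests.get("https://news.ycombinator.com").text

        # Hypothetical helper: ask an LLM which CSS selector holds the wanted content.
        # Hard-coded here to a plausible answer for Hacker News article titles.
        def get_selector_from_llm(page_html: str, question: str) -> str:
            return "span.titleline > a"

        selector = get_selector_from_llm(html, "Which element contains the article titles?")
        soup = BeautifulSoup(html, "html.parser")

        # Extract the content directly from the matched elements, with no RAG step involved
        articles = [a.get_text(strip=True) for a in soup.select(selector)]
        print(articles)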

  • @NaveenChouhan-mm5gz
    @NaveenChouhan-mm5gz 15 days ago +1

    I tried to install scrapegraphai but I'm getting stuck on the yahoo-search dependency, which breaks the execution and returns an AttributeError.

    • @Ashort12345
      @Ashort12345 12 days ago

      Is it the same error or not? Here is mine. I'm at a very beginner level, so if someone knows how to fix it, please leave a comment:
      ---------------------------------------------------------------------------
      AttributeError                            Traceback (most recent call last)
      Cell In[25], line 17
            3 graph_config = {
            4     "llm": {
            5         "model": "ollama/mistral",
         (...)
           13     }
           14 }
           16 # Instantiate the SmartScraperGraph class
      ---> 17 smart_scraper_graph = SmartScraperGraph(
           18     prompt="List me all the articles",
           19     source="news.ycombinator.com",
           20     config=graph_config
           21 )
           23 # Run the smart scraper graph
           24 result = smart_scraper_graph.run()

      File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\scrapegraphai\graphs\smart_scraper_graph.py:47, in SmartScraperGraph.__init__(self, prompt, source, config)
           46 def __init__(self, prompt: str, source: str, config: dict):
      ---> 47     super().__init__(prompt, config, source)
           49     self.input_key = "url" if source.startswith("http") else "local_dir"

      File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\scrapegraphai\graphs\abstract_graph.py:49, in AbstractGraph.__init__(self, prompt, config, source)
           47 self.config = config
          ...
      --> 227 params = self.llm_model._lc_kwargs
          228 # remove streaming and temperature
          229 params.pop("streaming", None)

      AttributeError: 'Ollama' object has no attribute '_lc_kwargs'
      Output is truncated.

  • @oliverli9630
    @oliverli9630 28 days ago

    Wondering when somebody will integrate `undetected-chrome` into it.

  • @AI-Wire
    @AI-Wire 27 days ago

    So, is this impossible to run in Colab? I like to automate many of my tasks using GitHub Actions.

    • @1littlecoder
      @1littlecoder  27 days ago

      You can run it on Colab, but you'd need OpenAI keys.

  • @planplay5921
    @planplay5921 28 days ago

    It still has the risk of being blocked; it's just a way of parsing.

  • @yashsrivastava677
    @yashsrivastava677 27 days ago

    Will it work to scrape LinkedIn jobs?

  • @adriangpuiu
    @adriangpuiu 27 days ago

    Another question: what if we only want to scrape and not embed anything?

    • @1littlecoder
      @1littlecoder  27 days ago

      I think in those cases you can probably use a conventional library, but that's a good question. There are different classes within this library that might let you do that.

    • @adriangpuiu
      @adriangpuiu 27 days ago

      @1littlecoder
      from scrapegraphai.graphs import BaseGraph
      from scrapegraphai.nodes import FetchNode, ParseNode, GenerateAnswerNode

      # fetch_node, parse_node and generate_answer_node would still need to be
      # instantiated from the classes above before building the graph
      graph = BaseGraph(
          nodes=[
              fetch_node,
              parse_node,
              generate_answer_node,
          ],
          edges=[
              (fetch_node, parse_node),
              (parse_node, generate_answer_node),
          ],
          entry_point=fetch_node
      )
      # I don't have time to try it now because I'm at work :))

  • @LeeBrenton
    @LeeBrenton 28 days ago

    Scrape Facebook please! I need to do the most boring thing for work. I tried to program a scraper, but FB makes it very hard; I was only partially successful (especially grabbing the post date). This method looks very exciting :)

    • @webhosting7062
      @webhosting7062 27 days ago

      What were your requirements?

    • @LeeBrenton
      @LeeBrenton 27 days ago

      @webhosting7062 I write a daily report based on the new posts in various FB groups, but FB doesn't put posts in the correct order (also, pinned posts at the top will be old posts), so I need to check the date, but FB obfuscates the date like a MF. I wasn't able to figure it out with Selenium.
      So the requirement is: get the latest (less than ~24-hour-old) posts from an FB group.

  • @sandrallancherosg
    @sandrallancherosg 28 days ago +1

    BTW, that's a spider in the logo. It's a spider that lives in the World Wide Web 😅

  • @CM-zl2jw
    @CM-zl2jw 24 days ago

    🤣 I enjoy your sense of humor. Thank you. You are RICH in kindness and intelligence. That's almost as good as money… money only buys limited amounts of happiness.
    Your videos are very helpful and informative. I'll pay you to help me figure out a couple of things. What's your contact?

    • @1littlecoder
      @1littlecoder  24 days ago

      Thank you. 1littlecoder@gmail.com is my email.

  • @user-vm8lr2hr7d
    @user-vm8lr2hr7d 27 days ago

    "Only" is "own-lee",
    not "one-lee".
    BTW, great video.

  • @DM-py7pj
    @DM-py7pj 27 days ago

    The logo looks something like a spider (scrape/crawl) + bone (GET/fetch) + document (parse HTML)???

  • @prasannaprakash892
    @prasannaprakash892 27 days ago

    This is great, thanks for sharing. Can you share your Python version, as I am getting an error when running the same code?

  • @Ari_Alur
    @Ari_Alur 27 days ago +1

    Would it be possible to explain the whole thing to someone who has nothing to do with programming? I was able to install everything, but I can't do anything with the code from GitHub...
    It would be great :) Thanks for the video! Very interesting but unfortunately not feasible for me.
    (I'm on Linux)

    • @1littlecoder
      @1littlecoder  27 days ago +1

      Do you want me to show how to run the code from GitHub? Would that be helpful?

    • @Ari_Alur
      @Ari_Alur 27 days ago

      Yeah! At least in a way that's easier to understand. I don't know anything about code, so I need things to be clear and simple.

    • @Ari_Alur
      @Ari_Alur 27 days ago

      Thanks!:)

  • @viddeshk8020
    @viddeshk8020 27 days ago

    I don't understand why, for web scraping, I have to install so many other dependencies like Ollama etc. I mean, it is just simple web scraping, why make things complex? And for a complex task, a complex prompt still needs to be given.

    • @liamlarsen9286
      @liamlarsen9286 27 days ago

      Ollama is just a framework to run LLMs locally, so it downloads the model instead of using an API and connecting to a server.

    • @madhudson1
      @madhudson1 10 days ago

      If you just want scraping, don't bother with this.
      However, if you want scraping + RAG, with LLM integration, then use this. But it's not without its issues.

  • @adriangpuiu
    @adriangpuiu 27 days ago

    Can it do heavy JavaScript sites? :))

    • @1littlecoder
      @1littlecoder  27 days ago

      I've not tried it! It'd be a good opportunity to try that, especially given it uses Playwright!

    • @adriangpuiu
      @adriangpuiu 27 days ago

      @1littlecoder I'll tell ya, I tried and it fails miserably :)). If you have better luck, let us know, man.

    • @1littlecoder
      @1littlecoder  27 days ago

      @adriangpuiu Ah, that's bad. Which website was it?

    • @adriangpuiu
      @adriangpuiu 27 days ago

      @1littlecoder The user replies are encapsulated in a JS response from what I noticed; maybe they have an API or something. I was just unable to figure it out. YET...

    • @adriangpuiu
      @adriangpuiu 27 days ago

      @1littlecoder It's the Appian discussion forum.

  • @rahuldinesh2840
    @rahuldinesh2840 26 days ago

    I think Chrome extensions are best.

  • @webhosting7062
    @webhosting7062 27 days ago

    What about a site built with jQuery? Does it work for that too?

    • @1littlecoder
      @1littlecoder  27 days ago +1

      I have not tried it. Someone else in the comments said it might not be very good.
