Crawl4AI - Crawl the web in an LLM-friendly Style

  • Published 16 May 2024
  • Welcome to the detailed walkthrough of Crawl4AI v0.2.0! 🚀
    In this video, I'll dive deep into the code base of Crawl4AI, our powerful web crawling tool designed for AI enthusiasts and developers. We'll explore all the new and exciting features that make this release a game-changer:
    - 🕷️ Efficient web crawling to extract valuable data from websites
    - 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
    - 🌍 Supports crawling multiple URLs simultaneously
    - 🌃 Replace media tags with ALT
    - 🆓 Completely free to use and open-source
    - 📜 Execute custom JavaScript before crawling
    - 📚 Chunking strategies: topic-based, regex, sentence, and more
    - 🧠 Extraction strategies: cosine clustering, LLM, and more
    - 🎯 CSS selector support
    - 📝 Pass instructions/keywords to refine extraction
    I explain all these features in detail in the video. No API key, signup, or other boring stuff required! 🌐
    Check out the repo: [Crawl4AI on GitHub](github.com/unclecode/crawl4ai)
    If you find this tool useful, please star the repo and leave a comment! Your feedback helps us improve and support the project.
    Follow me on Twitter (X) for updates on my research on function-calling for LLMs and AI agents: x.com/unclecode
    I appreciate your feedback and thoughts on this project.
    #Crawl4AI #WebCrawling #AI #LLM #Colab #WebScraping #OpenSource #GitHub #OpenSourceAI
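
    As an illustration of the chunking strategies listed above, here is a minimal regex-based chunker. This is a hedged sketch, not Crawl4AI's actual class: `RegexChunker` and its default pattern are hypothetical names chosen for the example.

```python
import re

class RegexChunker:
    """Illustrative regex-based chunking strategy: split text on a pattern.

    The default pattern splits on blank lines, i.e. paragraph boundaries.
    """
    def __init__(self, pattern=r"\n\n+"):
        self.pattern = pattern

    def chunk(self, text: str) -> list[str]:
        # Split, then drop empty fragments and surrounding whitespace.
        return [c.strip() for c in re.split(self.pattern, text) if c.strip()]

text = "First paragraph.\n\nSecond paragraph.\n\n\nThird."
print(RegexChunker().chunk(text))
# ['First paragraph.', 'Second paragraph.', 'Third.']
```

    A sentence-based strategy would follow the same shape with a sentence-boundary pattern instead of the blank-line pattern.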
  • Science & Technology

Comments • 13

  • @po6577
    @po6577 1 month ago +1

    Love how excited you are about your project! Keep it up, man! Great project.

  • @AWSFan
    @AWSFan 4 days ago

    Very useful project, I must admit! Is it a recursive crawler? When I say recursive, I mean it (not restricted to a depth threshold). Also, how different is this from FireCrawl in terms of functionality and other features? I can't wait to get started using this project and give it a shot! Thanks!

  • @MikeLevin
    @MikeLevin 1 month ago

    Looks exciting. Have you considered a nix script?

  • @plumpy8854
    @plumpy8854 16 days ago

    Hey man. I'll be honest, I'm new to data scraping and wanted to ask if Crawl4AI can be used to scrape data from TikTok. They have implemented some harsh measures with request rate limits and login requirements. From what I saw, Crawl4AI has some login feature, but I just wanted to ask if I'm going in the right direction. Otherwise, it looks great.

  • @xinfeng3022
    @xinfeng3022 1 month ago +1

    Possible to put up a prebuilt Docker image, including the models? I had problems downloading the models during the Docker build. Thanks!

    • @unclecode788
      @unclecode788  1 month ago +2

      I will work on that. I'm also trying to have a version without the model dependency.

  • @carlosa.villanuevacampoy931

    Really cool, man! Can I crawl all accessible subpages from a main page, so that I crawl 2 levels in total?

    • @unclecode788
      @unclecode788  1 month ago +2

      You can send multiple links, so first crawl the main page, then get its links and send them again. However, I will soon release the ability to set the depth and get a cool result for that.
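
      The approach in this reply (crawl the main page, collect its links, then crawl those) can be sketched as a breadth-first loop with a depth limit. Everything below is illustrative: `crawl_to_depth` and `fetch_links` are hypothetical helpers, and the link graph is a toy stand-in for real pages, not Crawl4AI's actual API.

```python
from collections import deque

def crawl_to_depth(start_url, fetch_links, max_depth=2):
    """Breadth-first crawl: visit start_url, then its links, up to max_depth hops.

    fetch_links is a stand-in for a real crawler call (fetch a URL and
    extract its <a href> targets).
    """
    seen = {start_url}
    order = []                       # URLs in the order they were visited
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # don't expand links beyond the depth limit
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# Toy link graph standing in for real pages.
site = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/b": [],
    "/a1": ["/deep"],
}
print(crawl_to_depth("/", lambda u: site.get(u, []), max_depth=2))
# ['/', '/a', '/b', '/a1']  -- '/deep' sits beyond depth 2, so it is skipped
```

      The `seen` set prevents re-crawling pages that several parents link to, which matters on real sites where link graphs contain cycles.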

  • @fieldcommandermarshall
    @fieldcommandermarshall 1 month ago

    WHAT HAPPENED TO THE FLUTE UNCLE CODE

  • @bitcoinquickbytes
    @bitcoinquickbytes 2 months ago

    I got a result object. How do I parse it?

    • @unclecode788
      @unclecode788  1 month ago

      Result is an object like this:

      class CrawlResult(BaseModel):
          url: str
          html: str
          success: bool
          cleaned_html: str = None
          markdown: str = None
          extracted_content: str = None
          metadata: dict = None
          error_message: str = None

      So you can access these properties directly (cleaned_html, markdown, extracted_content), or dump the model into a Python dictionary using `result.model_dump()`.
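
      For readers who want to try the two access patterns from this reply without installing anything, here is a runnable stand-in that mimics the pydantic model with a plain dataclass; `dataclasses.asdict` plays the role of pydantic's `model_dump()`.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Plain-dataclass stand-in for the pydantic CrawlResult shown above,
# just to demonstrate attribute access vs. dumping to a dictionary.
@dataclass
class CrawlResult:
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    markdown: Optional[str] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None

result = CrawlResult(url="https://example.com", html="<html>...</html>",
                     success=True, markdown="# Example")

# Attribute access, as described in the reply:
print(result.markdown)        # '# Example'

# Dict dump; with the real pydantic model this would be result.model_dump().
data = asdict(result)
print(data["success"])        # True
```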