Crawl4AI - Crawl the web in an LLM-friendly Style
- Published 16. 05. 2024
- Welcome to the detailed walkthrough of Crawl4AI v0.2.0! 🚀
In this video, I'll dive deep into the code base of Crawl4AI, our powerful web crawling tool designed for AI enthusiasts and developers. We'll explore all the new and exciting features that make this release a game-changer:
- 🕷️ Efficient web crawling to extract valuable data from websites
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 🌍 Supports crawling multiple URLs simultaneously
- 🌃 Replace media tags with ALT
- 🆓 Completely free to use and open-source
- 📜 Execute custom JavaScript before crawling
- 📚 Chunking strategies: topic-based, regex, sentence, and more
- 🧠 Extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support
- 📝 Pass instructions/keywords to refine extraction
I explain all these features in detail in the video. No API key, signup, or other boring stuff required! 🌐
Check out the repo: [Crawl4AI on GitHub](github.com/unclecode/crawl4ai)
If you find this tool useful, please star the repo and leave a comment! Your feedback helps us improve and support the project.
Follow me on Twitter (X) for updates on my research on function-calling for LLMs and AI agents: x.com/unclecode
I appreciate your feedback and thoughts on this project.
#Crawl4AI #WebCrawling #AI #LLM #Colab #WebScraping #OpenSource #GitHub #OpenSourceAI - Science & Technology
Love how excited you are about your project! Keep it up, man! Great project.
Thanks! Will do!
Very useful project, I must admit! Is it a recursive crawler? When I say recursive, I mean it (not restricted to a depth threshold). Also, how different is this from FireCrawl in terms of functionality and other aspects? I can't wait to get started using this project and give it a shot! Thanks!
Looks exciting. Have you considered a nix script?
Hey man. I'll be honest: I'm new to data scraping and wanted to ask if Crawl4AI can be used to scrape data from TikTok. They have implemented some harsh measures with request rate limits and login requirements. From what I saw, Crawl4AI has a login feature, but I just wanted to ask if I'm going in the right direction. Otherwise, it looks great.
Possible to put up a prebuilt Docker image, including the models? I had a problem downloading the models during the Docker build. Thanks!
I will work on that. I'm also trying to provide a version without the model dependency.
Really cool man! Can I crawl all accessible subpages from a main page? So I crawl 2 levels in total?
You can send multiple links, so first crawl the main page, then get its links and send them again. However, I will soon release the ability to set the depth and get a nicely structured result for that.
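That two-step pattern (crawl the main page, collect its links, crawl those) can be sketched with a small standard-library link extractor. The crawl calls themselves are omitted here; in practice the `html` string would come from the crawler's result for the main page:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from <a> tags, resolving them against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# html would normally be the main page's crawl result:
html = '<a href="/docs">Docs</a> <a href="https://example.org/x">X</a>'
print(extract_links(html, "https://example.com"))
# ['https://example.com/docs', 'https://example.org/x']
```

The returned links are what you would pass back into the crawler for the second level.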
WHAT HAPPENED TO THE FLUTE UNCLE CODE
Hahahaha!! Ok, ok, message received
I got a result object. How do I parse it?
Result is an object like this:

```python
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: str = None
    markdown: str = None
    extracted_content: str = None
    metadata: dict = None
    error_message: str = None
```

So you can access these properties directly (`cleaned_html`, `markdown`, `extracted_content`), or dump the model into a Python dictionary using `result.model_dump()`.
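A minimal, self-contained sketch of reading those fields. It uses a local copy of the model above (with `Optional` added for strictness); the sample `CrawlResult` instance is a stand-in for what the crawler would actually return:

```python
from typing import Optional
from pydantic import BaseModel

# Local copy of the CrawlResult model shown above, for illustration only.
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    markdown: Optional[str] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None

# Stand-in for what the crawler would hand back:
result = CrawlResult(
    url="https://example.com",
    html="<html><body><h1>Hi</h1></body></html>",
    success=True,
    markdown="# Hi",
)

if result.success:
    print(result.markdown)    # the LLM-friendly markdown view
data = result.model_dump()    # the whole result as a plain dict
print(data["url"])
```

Fields you didn't populate (like `extracted_content` here) simply come back as `None` in the dumped dictionary.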