System Design distributed web crawler to crawl Billions of web pages | web crawler system design

  • Added 3 Jul 2019
  • Learn web crawler system design and software architecture.
    Design a distributed web crawler that will crawl all the pages on the internet.
    This question is asked in interviews at most top companies, such as Google, Facebook, and Amazon.
    Let's learn how to build a Google-style spiderbot, i.e. a distributed web crawler.
    #crawlersystemdesign
    #systemdesigntips #systemdesign #computerscience #learnsystemdesign #interviewpreperation #amazoninterview #googleinterview #uberinterview #micrsoftinterview
    #crawler #webcrawler

Comments • 196

  • @TheMdaliazhar
    @TheMdaliazhar 19 days ago

    Thanks for this. Most detailed design. No other youtuber explained exactly how the URL Frontier works.

  • @sumonmal009
    @sumonmal009 3 years ago +22

    estimation 5:30
    HLD 6:33
    queue manage 25:30
    update and duplicate handle 33:40
    Sim hash 39:26
    storage 42:00

  • @arvindaaswani1303
    @arvindaaswani1303 4 years ago +50

    Awesome explanation. As an engineer, I know how much hard work goes on behind the scenes. Really appreciate it 👏

  • @prathashukla6596
    @prathashukla6596 4 years ago +1

    Awesome explanation of all the high-level components. Good job.

  • @iitgupta2010
    @iitgupta2010 5 years ago +8

    Finally you've started building the videos in the actual flow. That's really great, and it will really help viewers understand and build real knowledge of SD. Great, bro.

  • @ksenthu
    @ksenthu 3 years ago +13

    The most detailed and clear content on crawler design I've seen. Thanks for doing this. It would be great if you could also clarify how data moves between the various services, such as the Extractor, Duplicate Detection, URL Filter and Loader.

  • @howellPan
    @howellPan 5 years ago +15

    Great content.. appreciate the details and thoroughness!

  • @manojbgm
    @manojbgm 3 years ago +1

    Awesome, knowledgeable. thank you for the video

  • @PeterParker-vn2hv
    @PeterParker-vn2hv 2 years ago +1

    Narendra, thank you for this excellent video. Much appreciated.

  • @pryansh_
    @pryansh_ 2 years ago +2

    very informative, thanks

  • @Sarah-il5dr
    @Sarah-il5dr 4 years ago +24

    Guys, please like the video. As an engineer, I know how much hard work goes into a video like this. This is my go-to system design resource. Great work!

  • @akshaymonga
    @akshaymonga 2 years ago +1

    very nice n detailed video, thank you sir!

  • @venjan21
    @venjan21 3 years ago

    Generally I don't post comments but this is one of the best system design (in detail) I have ever seen. It has re-kindled my thought process on how to think for a System Design question.

  • @sampathkarupakula7647
    @sampathkarupakula7647 5 years ago +3

    making things clear and easier, thanks for your effort. I really appreciate your efforts.

  • @jessica-mx5pw
    @jessica-mx5pw 1 year ago +1

    thank you for the video! this was by far the most helpful system design video walk through I've seen. I've been struggling a lot with system design. Thank you for putting this together!

  • @ajaypremshankar
    @ajaypremshankar 5 years ago +2

    It's not easy to make such in-depth content-rich video. Thank you Narendra :)

  • @ragdoll2324
    @ragdoll2324 4 years ago

    Very detailed discussion. Thanks for making this video.

  • @adrianliu2817
    @adrianliu2817 5 years ago +3

    You are the best! Enjoyed all of your system design videos!

  • @petar55555
    @petar55555 2 years ago +11

    Great, detailed system design. The only part I would probably skip is the heap (each queue is already tied to a thread/worker): it looks more like a bottleneck and serves only as a timer to slow down crawling for politeness, which can be done in different ways.

    • @aarushjuneja6640
      @aarushjuneja6640 1 year ago +1

      I was also thinking on the same lines.

    • @NANDINIGOEL
      @NANDINIGOEL 7 months ago

      Think of it this way: URLs go through the priority-based queues and then the host-based queues, but once they are in the host queues you don't know which host to handle first (the priority is lost). So the PQ is filled with the first element of each back queue, and URLs are then downloaded based on priority while ensuring politeness. "Merge k sorted arrays" is a good pointer to this. There is no point in locking a thread to each queue, if that is the doubt, because then priority is per queue and not across all of them. Say one host has all priority-100 URLs and the others have 1-99: why should that priority-100 host be prioritized? It should not be, unless we implement something similar to a nice call to increase priority and avoid aging.
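
The "merge k sorted arrays" idea in the reply above can be sketched roughly as follows. This is only an illustration, not the design from the video: the function name `drain_by_priority`, the `(priority, url)` tuple layout, and the assumption that each per-host queue is already sorted from highest to lowest priority are all hypothetical.

```python
import heapq
from collections import deque


def drain_by_priority(back_queues):
    """Yield (priority, host, url) across all per-host queues, highest priority first.

    back_queues maps host -> deque of (priority, url). Each deque is assumed
    to be sorted from highest to lowest priority already. The deques are
    consumed (mutated) as items are yielded.
    """
    heap = []
    # Seed the heap with the head of every non-empty back queue.
    for host, queue in back_queues.items():
        if queue:
            priority, url = queue.popleft()
            # heapq is a min-heap, so negate the priority to pop the largest first.
            heapq.heappush(heap, (-priority, host, url))

    while heap:
        neg_priority, host, url = heapq.heappop(heap)
        yield -neg_priority, host, url
        # Refill the heap from the same host's queue, as in a k-way merge.
        queue = back_queues[host]
        if queue:
            priority, next_url = queue.popleft()
            heapq.heappush(heap, (-priority, host, next_url))
```

With only the head of each back queue in the heap, the heap never holds more than one entry per host, so its size is bounded by the number of back queues rather than the number of URLs.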

  • @anastasianaumko923
    @anastasianaumko923 1 year ago +1

    Thank you for this elaborate design, great work!

  • @theFifthMountain123
    @theFifthMountain123 3 months ago

    Had to watch multiple times to understand everything in the video. Thanks for the awesome explanation!

  • @JM_utube
    @JM_utube 4 years ago +1

    thank you so much for posting! i love your videos.
    i just got asked this in a facebook interview and i wish i had seen this video beforehand.

  • @adamhughes9938
    @adamhughes9938 4 years ago +118

    Makes me sad that this dude crams so much amazing content into these videos and gets 42k views but the dumbest 10 second videos get millions of views...
    I wish youtube had a notion of content score and quality.

    • @junjiechen7341
      @junjiechen7341 3 years ago

      ikr! too much going on to be fully appreciated in his vids.

    • @warriorgeneral2735
      @warriorgeneral2735 3 years ago +7

      Hey it totally depends on what people are interested in...

  • @harishaseri
    @harishaseri 4 years ago +1

    Best explained. Thank you so much, Naren.

  • @monikaa8230
    @monikaa8230 3 years ago +5

    I have a suggestion to include two things in your videos which will definitely help:
    1. QPS Calculation
    2. Sharding key when we are planning to shard the DB

  • @heller166
    @heller166 3 years ago

    This is going to be a lot of help for my distributed systems course :). Thanks for all the hard work.

  • @aleeshaali7180
    @aleeshaali7180 2 years ago

    Best channel I've come across for learning about system design. Thank you, and keep it up.
    Kudos to the wonderful work!!!

  • @rhythmPhil
    @rhythmPhil 4 years ago +2

    Thanks for your work. This was really interesting.

  • @t4ruvk107
    @t4ruvk107 4 years ago

    Thanks for your time, effort and content.

  • @vishalmahavratayajula9658

    Awesome video. Can't thank you enough, Narendra.

  • @ShabnamKhan-cj4zc
    @ShabnamKhan-cj4zc 3 years ago

    Thanks a lot for explaining all the modules in a simple manner. Your channel is the place where one can stop and learn everything in an easy way. Thanks a ton, and keep doing this great work.

  • @impossible7434
    @impossible7434 3 years ago

    such an amazing explanation, thank you very much, keep up the good work

  • @ashish0687
    @ashish0687 5 years ago +1

    Thank you Naren. These videos are a great source of learning. I very much appreciate the detail, time and effort you put into building the content and sharing it. If possible, can you also please make a video about geohashing (and the use case around performing geospatial searches)?

  • @w.maximilliandejohnsonbour725

    Nice info...!!!!!.

  • @spyros5528
    @spyros5528 11 months ago +1

    Superb video, very helpful. Thank you.

  • @roooooot9545
    @roooooot9545 4 years ago

    Great work

  • @SkyCityInc
    @SkyCityInc 2 years ago

    This was a really, really excellent overview, thank you for putting this video together!

  • @elachichai
    @elachichai 3 years ago

    Definitely helpful ! Appreciate it Narendra!

  • @PiyushSingh-vx7bx
    @PiyushSingh-vx7bx 4 years ago

    Amazing explanation brother 🔥

  • @sayantanray9595
    @sayantanray9595 4 years ago

    Helpful and detailed!!!

  • @iitgupta2010
    @iitgupta2010 5 years ago +3

    I crawled one word from this video, "basically", and inverted-indexed it... lol (don't have that much time 😝)
    Great video as always

  • @keshavKumar-le4df
    @keshavKumar-le4df 3 months ago

    Nice explanation.

  • @apurvasharma2853
    @apurvasharma2853 3 years ago

    Excellent explanation!

  • @user-hj2lb8mg8o
    @user-hj2lb8mg8o 4 years ago

    Hi, really awesome videos, thanks!

  • @manmohanakash4222
    @manmohanakash4222 3 years ago

    This is the kind of teammate I would like to work with. So much content. Thanks for sharing.

  • @chaitanyareddy9848
    @chaitanyareddy9848 4 years ago +1

    Dude, awesome job.

  • @aashnavaid6918
    @aashnavaid6918 2 years ago

    amazing video thank you so very much sir!!!

  • @theranajayant
    @theranajayant 4 years ago

    Heyy Narendra, Quite interesting topic you have chosen and it's interesting to learn this topic. You are curating really good and valuable content.

  • @AyushRaj-so3zh
    @AyushRaj-so3zh 3 years ago

    This was GOLD !! Amazing content

  • @argstutorial2916
    @argstutorial2916 3 years ago

    Very nice conceptual explanations and use of tools. You have put a lot of energy into R&D. I hope this will help those who are seeking to develop their own systems for data processing or scraping. Great work, keep it up, man.

  • @utsavkapoor6069
    @utsavkapoor6069 6 months ago +1

    Great explanation, man. Loved your videos. Why have you stopped making these? Hope to see you back soon!!

  • @pinkylover911
    @pinkylover911 2 years ago

    A lot of great effort has been put into your videos, thanks

  • @iitgupta2010
    @iitgupta2010 5 years ago +15

    I really, really appreciate your effort, bro. Whoever asks me, I always suggest your name first. There are a few others like gkcs, but if you ask me they are nothing next to your design skills. You really talk about the things that matter. This is something I have not found even in paid courses. This is awesome, in one word.
    You should have a lot of subscribers. You will soon.

  • @Akashkumar-md6rg
    @Akashkumar-md6rg 4 years ago

    Thank you sir, for such great content!
    Your videos are the most practical and interesting way to learn CS.
    You made me your fan sir...
    I really appreciate your hard work. Keep going.🙌🙌

  • @aliaksandrsheliutsin2374

    Just have to say that it's amazing content. Keep it up, Narendra!

  • @hlibpylypets1333
    @hlibpylypets1333 2 years ago

    Very detailed explanation - best ever :)

  • @vedant9173
    @vedant9173 2 years ago

    Sir, thank you so much for these great lessons

  • @alokuttamshukla
    @alokuttamshukla 5 years ago +10

    Thank you so much for these efforts. I mean, a 45-minute video with so much to grasp is no joke.

    • @TechDummiesNarendraL
      @TechDummiesNarendraL 5 years ago

      I tried to make it short, but failed to do so.

    • @alokuttamshukla
      @alokuttamshukla 5 years ago

      @TechDummiesNarendraL No, I am in no way complaining at all. I loved it. I am so thankful to you for this.

    • @TechDummiesNarendraL
      @TechDummiesNarendraL 5 years ago

      @alokuttamshukla thanks

    • @readingsteiner6061
      @readingsteiner6061 4 years ago +4

      @TechDummiesNarendraL In his Lettres Provinciales, the French philosopher and mathematician Blaise Pascal famously wrote: "I would have written a shorter letter, but I did not have the time." :)
      Buddy you're awesome. Keep up the good work. Wish you the best.

    • @JM_utube
      @JM_utube 4 years ago

      after watching a lot of system design videos i really had to understand that this level of detail is NOT EXPECTED in an interview. i really stressed myself out trying to ask so many clarifying questions, and cover every single aspect of a system in a 45 minute block. this is not expected. remember - these videos are edited, shortened, rehearsed, and practiced. trust me when i say set a lower bar for yourself for interviews LOL
      thanks!!!

  • @RealAbhishekSingh
    @RealAbhishekSingh 3 years ago

    wow, such great explanation, thank you :)

  • @helikopter1231
    @helikopter1231 2 years ago

    Wow such detail and explained so well! Thank you so much! You actually made it sound interesting haha - im not a huge fan of web stuff but this actually made me curious.

  • @SimranGupta-pz7nw
    @SimranGupta-pz7nw 2 years ago

    Thank you so much for the beautiful explanation :)

  • @IdoKleinman
    @IdoKleinman 2 years ago

    Good stuff! Thank you. One suggestion, for the next video, keep the information text slides on screen for more than 300ms...

  • @puravshah2342
    @puravshah2342 5 years ago +6

    Hi Naren, thanks for the awesome video. Can you also make a video on designing a distributed scheduling system?

  • @StormcastMarine
    @StormcastMarine 1 year ago

    Thanks a lot for the video mate, really useful

  • @wellingtonrafaelbarrosamor4260

    Awesome didactics

  • @gouravkhanijoe1059
    @gouravkhanijoe1059 2 years ago

    Nice

  • @meetpatel5054
    @meetpatel5054 1 month ago

    Instead of coupling back queues with threads, I would say have more threads for priority URLs and fewer for the others.
    For this to work, we can handle politeness at the front queues, where we put subsequent URLs in low-priority queues.

  • @renon3359
    @renon3359 3 years ago

    Great video, man. You deserve many more subscribers.

  • @Imkflow
    @Imkflow 2 years ago

    Thanks for the work on this, very helpful. Quick note: I think if every processor needs to receive the same message, what you need is a topic instead of a queue.

  • @CODFactory
    @CODFactory 2 years ago

    a) Why not use a graph DB instead of Bigtable or anything else?
    b) Why do those back-of-the-envelope calculations, like 6 PB, when we never used them and never proved that the design will handle that amount of data?
    c) We definitely should talk about how to make it distributed, since one crawler cannot crawl everything. How are we going to make sure that multiple crawlers are not crawling the same things?
    d) How are we going to store these documents in different DBs, and what kind of sharding are we going to use?
    I think those are some important things to talk about, especially when giving interviews.

  • @nazmavazid9141
    @nazmavazid9141 2 years ago

    Very very nice sir

  • @samirhere4341
    @samirhere4341 4 years ago +1

    Great video. Keep up the good work. Can you do a system design video on amazon fresh/getbojo/blue apron/plated/embrace box/trytheworld? The concept of how subscriptions and a continuous recurring delivery system work. Thank you.

  • @augustoclaro
    @augustoclaro 2 years ago

    I have watched this video so many times in the past year that I'm almost quoting every word you say

  • @shreyade5000
    @shreyade5000 1 year ago +1

    Nice content but long pause at 40:31, it distracts you if you are listening with concentration. Please edit it.

  • @dharmendrabhojwani
    @dharmendrabhojwani 5 years ago

    awesome

  • @shreyasns1
    @shreyasns1 2 years ago +1

    @Narendra, Thanks for the video and detailed explanation. Could you also add the links to white papers you mentioned in the video description? This would help us to dive deep further to understand the concepts. Thanks again

  • @nikhilagrawal8888
    @nikhilagrawal8888 4 years ago

    amazing

  • @kartik-agarwal
    @kartik-agarwal 2 years ago

    Kudos

  • @DebasisUntouchable
    @DebasisUntouchable 4 years ago +1

    Great video! Thanks for sharing! Can you please refer me to a book where I can get such great examples of system design?

  • @rishabhnitc
    @rishabhnitc 2 years ago

    As always, excellent. Just remove the music at the 46-second mark :)

  • @neoli8110
    @neoli8110 3 years ago +9

    Why do you need a heap? It sounds like a bottleneck right there. Why can't the back queue selector use load balancing, like round robin, to select a queue and remove an item from it?

  • @stalera
    @stalera 3 years ago

    Thanks a lot for taking the effort to build this video. This was amazing. I learnt a lot from it. Just one question: why would you want to store the file content in compressed form? Is it used anywhere later? I couldn't find any mention of it.

  • @iitgupta2010
    @iitgupta2010 5 years ago

    I think we should decouple the priority-based crawler from the normal crawler; otherwise, because of the back queue router, all low-priority crawls will starve and never get a chance to run.
    We can have two or more systems which are responsible for crawling every minute or less (like the share market), every 5 minutes or 1 hour ... 1 day or week, up to 1 month.
    This way we can scale them very easily and manage them better. This also helps us build in politeness.

  • @Wei-up2jn
    @Wei-up2jn 3 years ago +4

    Great content! One question I have in mind is why we want to use one queue per host. Is it because the HTTP connection overhead is high if you connect to different hosts back and forth? But in reality the URLs coming from the front queues might be a mix of different hosts, e.g. a.com/a, b.com, a.com/c; in that case we still have to connect back and forth (assuming we only have one back queue), unless we could guarantee that all URLs from the same host arrive together at the back queue router.

  • @mtsmithtube
    @mtsmithtube 2 years ago

    @16:38 "make it a standard convention of converting it to a lowercase" - careful because URLs are case sensitive. Maybe your duplicate detector should do a case insensitive compare but you don't want to lose the original case when saving urls.
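
A hedged sketch of the caution above: the scheme and host parts of a URL are case-insensitive, while the path and query can be case-sensitive, so a normalizer can lowercase only the former. The function name `normalize_url` is hypothetical, and the sketch assumes URLs without a `user:password@` part (lowercasing the whole network location would alter case-sensitive credentials).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase only the scheme and host; keep path, query and fragment as-is.

    Assumes the URL carries no userinfo (user:password@host), since this
    sketch lowercases the whole network-location component.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    return urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, `normalize_url("HTTP://Example.COM/Some/Path?Q=1")` gives `http://example.com/Some/Path?Q=1`, so the original path case survives for storage while host-case variants still compare equal.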

  • @ramesh4joylife
    @ramesh4joylife 1 year ago

    It would have helped much better if you had gone through this entire thing with an example crawl from a scaled site

  • @pengli7213
    @pengli7213 3 years ago +2

    What is the implementation of back queue? I don't think it's a Kafka queue, right? Or there might be too many topics. I guess it can be a key-value data structure, such as [domain_name, url, fetched(boolean)] ? Each time when we want to get a url from the "back queue", we just query the key-value and get a url which is not fetched ?

  • @puneetpatwari
    @puneetpatwari 5 years ago +2

    Nice video. I have 1 question. In the URL frontier, there is a heap. I want to know if the heap is stored at only 1 place and is thread-safe?

  • @parupatimadhukarreddy6972

    Hi Narendra,
    I am basically a software developer who mainly deals with JavaScript technologies. I saw these videos on distributed systems on your channel, and it is interesting to learn about the architectural side of the web space; even newbies are able to understand the conceptual part of the subject. I appreciate your efforts. What technologies or tools do I need to learn or start with to get to know more about distributed systems? Thank you.

  • @jamess5330
    @jamess5330 1 year ago

    Narendra, awesome video for system design! Would you like to host mock interview sessions at Meetapro?

  • @FracturedRealityWorld
    @FracturedRealityWorld 5 years ago

    Please make a video on a leaderboard system for coding contests or games.

  • @jpnr8
    @jpnr8 3 years ago

    For the back queues we can use Kafka topics. Kafka maintains order, and the number of consumers can be mapped to the topic count, so we can eliminate the heap.

  • @RAJESH2010able
    @RAJESH2010able 4 years ago

    Hi Narendra, will it be possible for you to do a video on 'Design Online food ordering service like Uber eats/doordash and explain how to integrate it with existing (Uber) ride-sharing service'?

  • @forte9910
    @forte9910 4 years ago

    one question on front queues: if a site is newly added (perhaps as the result of being linked from a site previously crawled), will you leave it in the front queues forever and periodically crawl it?

  • @Tony-cy2yr
    @Tony-cy2yr 4 years ago

    Is someone knocking on the other side of the wall at 40:46? I saw you waiting for them to finish. :)
    BTW, a question: at 13:56, when will the obsolete persistent storage on the bottom right be cleared out?

  • @chickentikkasauce1301
    @chickentikkasauce1301 4 years ago +2

    Heap is an implementation detail. I'm being nitpicky (this is a great video), but just some thoughts: why does timestamp-based priority even matter in this system? You didn't mention that. It could be because you don't want certain queues to get starved. A simpler approach might be to process each queue round-robin and only mention the priority queue to your interviewer if they nudge you in that direction, or if you want to slowly build up to it to discuss trade-offs. If each back queue has a priority, then just call out that we want a priority queue. You could say back queues have the same priority, but maybe other back queues dedicated to URLs that we expect to be updated at a faster rate have higher priority. But then you need a solution to the problem of other lower-priority queues getting starved.

    • @psn999100
      @psn999100 4 years ago +1

      Great explanation. Yes.
      What I gather is that the "URL Frontier" essentially implements:
      1. Priority selection -> Front Queue
      2. Politeness guarantee -> Back Queue
      The main issue we are looking at is how to pick the next URL from the "URL Frontier" microservice to be sent to a thread for processing.
      As you said, we could use a round-robin method where all back queues get picked from equally, or a kind of "weighted" method, i.e. a priority-queue-based solution, to make sure the hottest websites get crawled at shorter intervals.
      I think it's always better to give the simplest approach first (i.e. just draw a black box tagged "Queue Selection") and deep dive later if the interviewer wishes to. There is a saying in the system design world: "KISS" (Keep It Simple, Stupid). It's unlikely that you will run your interviewer out of questions, so it's better to nudge the interviewer toward your line of thinking by giving the slightest of hints, so that they start asking the questions you already have answers to.
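
One possible reading of the timestamp-ordered heap discussed in this thread, sketched under assumptions: each heap entry is `(next_allowed_time, host)`, a host is handed out only once its time has arrived, and it is then pushed back with a fixed delay. The class name `PolitenessScheduler`, the fixed `delay_seconds`, and passing `now` explicitly are choices made for this illustration, not details confirmed by the video.

```python
import heapq


class PolitenessScheduler:
    """Min-heap of (next_allowed_time, host) used as a per-host crawl timer."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.heap = []  # entries are (next_allowed_time, host)

    def add_host(self, host, now):
        # A newly seen host may be fetched immediately.
        heapq.heappush(self.heap, (now, host))

    def next_host(self, now):
        """Return a host that may be fetched at `now`, or None if all must wait."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        # Re-schedule the host so it is not contacted again before the delay passes.
        heapq.heappush(self.heap, (now + self.delay, host))
        return host
```

A plain round-robin over back queues would visit hosts in a fixed order regardless of when each was last contacted; ordering by next allowed time is what lets the selector skip hosts that are still inside their politeness window.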

  • @ambermani1667
    @ambermani1667 4 years ago +4

    19:06 Why did we jump directly to the conclusion that we should use a Bloom filter? Why wouldn't a distributed hash table work for checking whether a site has already been crawled? It's not O(n). We can hash the URLs and shard them based on the hash, then look the URL up in that specific shard's hash table.
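
The hash-and-shard lookup proposed above could look roughly like this in a single process. It is a toy stand-in for a distributed hash table: the class name `ShardedSeenSet` and the use of in-memory sets as shards are assumptions, and a real deployment would place each shard on a separate node.

```python
import hashlib


class ShardedSeenSet:
    """Exact 'already crawled?' check, with URLs partitioned by hash."""

    def __init__(self, num_shards):
        # Each in-memory set stands in for one shard of a distributed table.
        self.shards = [set() for _ in range(num_shards)]

    def _shard_for(self, url):
        # Use a stable hash so the same URL always maps to the same shard.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def add_if_new(self, url):
        """Record the URL and return True if unseen, else return False."""
        shard = self._shard_for(url)
        if url in shard:
            return False
        shard.add(url)
        return True
```

Lookups are expected O(1) as the comment says; the trade-off against a Bloom filter is memory, since every URL is stored in full.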

  • @vishalraut20
    @vishalraut20 4 years ago +2

    What is the purpose of Redis? if we are pushing the entries in the queue, what is the need of cache?

  • @RahulSathe.07
    @RahulSathe.07 4 years ago +3

    Hey Naren, awesome video. What would be a good (& scalable) way to keep track of duplicate URLs?

    • @NANDINIGOEL
      @NANDINIGOEL 7 months ago

      Bloom filter / count min sketch
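
A minimal Bloom filter sketch for the duplicate-URL check suggested above. The class name, the bit-array size, and deriving the hash functions by salting SHA-256 with an index are all assumptions made for illustration; production filters usually use faster non-cryptographic hashes.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, but false positives are possible."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

Because a positive answer can be wrong, a crawler using this would occasionally skip a URL it has not actually visited; the memory saved is the reason the trade-off is often accepted at this scale.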

  • @ShailySaini-j6b
    @ShailySaini-j6b 11 days ago

    I have a question: at 25:30 it was mentioned that the number of back queues is the same as the number of worker threads, so is there a one-to-one mapping between back queues and worker threads as well? If so, what is the use of the heap here? Whenever a worker thread needs a new job, it will get it from its assigned back queue.