I Made a Search Engine

  • Published on 16 Oct 2022
  • I Made a Search Engine
    Dylhack: dylhack.dev
    Project Repository: github.com/conaticus/search-e...
    Discord: / discord
    Twitter: / conaticus
  • Entertainment

Comments • 566

  • @Fireship
    @Fireship 1 year ago +3922

    Always bet on JavaScript

    • @dylhack
      @dylhack 1 year ago +39

      That NodeJS bug was a memory issue. When the string was modified, it would have side effects and cause the string to empty out. I wonder whether making it immutable would have prevented it. (dylhack)

    • @smylemusicproductions8897
      @smylemusicproductions8897 1 year ago +11

      @@dylhack The URLs just have a carriage return appended to them, caused by splitting the URL list on every newline (\n), while newlines on Windows actually look like \r\n.

    • @dylhack
      @dylhack 1 year ago +7

      @@smylemusicproductions8897 No, we stripped them using a regex. It was an issue to begin with, though.

    • @creative2158
      @creative2158 1 year ago +5

      too bad conaticus wrote bug :(

    • @CoderGautam
      @CoderGautam 1 year ago +9

      Javascript for the win

  • @liondadev
    @liondadev 1 year ago +571

    The most unrealistic part of this was that someone answered your question on StackOverflow, without just downvoting you and calling you an idiot.

  • @4GENS
    @4GENS 1 year ago +146

    Title: "I made a Search Engine"
    Video: "Javascript is bad, I didn't finish my Search engine"

    • @JelleDeLoecker
      @JelleDeLoecker 1 year ago +15

      He didn't even manage to parse a list of URLs 😄 Baby steps, people!

    • @aminebadri5118
      @aminebadri5118 1 year ago +1

      defeated

    • @shrinkhal179
      @shrinkhal179 1 year ago +1

      Bro, just asking because I need it: do you know somewhere I could actually learn how to create a real-time web search engine? Urgently needed.

    • @Stroos
      @Stroos 1 year ago

      @@JelleDeLoecker he is bad

  • @ExediceWhyNot
    @ExediceWhyNot 1 year ago +433

    Next video: I made a search engine in assembly

    • @hanna_GG2
      @hanna_GG2 1 year ago +40

      @x5up0s seconds*

    • @qaraciyer
      @qaraciyer 1 year ago +4

      Is this a search engine? 😳 This slow garbage written in bloated JavaScript

    • @ExediceWhyNot
      @ExediceWhyNot 1 year ago +2

      @@qaraciyer why did you reply to me tho?
      You could just comment under the video itself

    • @TimkaSR
      @TimkaSR 1 year ago +1

      "I made a search engine in morse code"

    • @qaraciyer
      @qaraciyer 1 year ago

      @@ExediceWhyNot shat up

  • @Piklets
    @Piklets 1 year ago +444

    One thing to help with performance: have a separate server/program which just acts as the queue (i.e. not a part of the main program). That way, as the scraper crawls through sites and finds links, it just sends them over to the server, and when it's ready to scrape more, it can just request the next link (rather than trying to keep millions of links in memory, the queue server can properly store them to disk/wherever). Then it's also much easier to run multiple instances of the scraper/indexer, which all pull from the queue and then put the results into some other kind of keyword/score/indexed database. Aaand finally, your search servers will then just send the query to that processed database (and I mean that's still incredibly basic for the mammoth task).

    • @randomdamian
      @randomdamian 1 year ago +1

      Waste of resources, everything can be a scraper and just get SQL database and you're good to go. No need for a main server to do requests and fetching nodes. It would make the whole thing more stable yes... but with the right code all of the nodes could fetch...

    • @kamilb2322
      @kamilb2322 1 year ago

      In my crawler I just limited the queue size to something around 10000 links. Not the best solution, but it works. If you were to store all links, you would eventually run out of space no matter what you do.

    • @pukaputyouon3233
      @pukaputyouon3233 1 year ago

      @@randomdamian hey I will pay you to help me build my search engine with open ai codex? U in? 🙌✅

    • @parthmadan671
      @parthmadan671 1 year ago

      @@pukaputyouon3233 I can help if you want.

    • @ayomidediekola2505
      @ayomidediekola2505 1 year ago

      @@pukaputyouon3233 I can help you out if you're still interested
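
For anyone who wants to try the separate-queue idea from the top of this thread, here is a minimal sketch in Python rather than the Node.js used in the video. It assumes a single process and uses the standard-library `sqlite3` module as the on-disk store; the `LinkQueue` class, its method names, and the `queue.db` path are made up for illustration. Several scrapers sharing one queue would additionally need a network front end and locking, which this does not attempt.

```python
import sqlite3
from typing import Optional


class LinkQueue:
    """FIFO queue of URLs stored in SQLite; each URL is accepted only once."""

    def __init__(self, path: str = "queue.db") -> None:
        # Hypothetical file name; ":memory:" keeps everything in RAM for testing.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS links ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT UNIQUE NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )

    def push(self, url: str) -> bool:
        """Store a URL; return False if it had already been queued."""
        cur = self.db.execute(
            "INSERT OR IGNORE INTO links (url) VALUES (?)", (url,)
        )
        self.db.commit()
        return cur.rowcount == 1

    def pop(self) -> Optional[str]:
        """Return the oldest pending URL and mark it done, or None if empty."""
        row = self.db.execute(
            "SELECT id, url FROM links WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute("UPDATE links SET done = 1 WHERE id = ?", (row[0],))
        self.db.commit()
        return row[1]


queue = LinkQueue(":memory:")
queue.push("https://example.com/")
queue.push("https://example.com/")   # duplicate, silently ignored
queue.push("https://example.org/")
print(queue.pop())
print(queue.pop())
print(queue.pop())
```

Keeping finished URLs in the table (instead of deleting them) is what lets `push` reject links that were already crawled, at the cost of the table growing forever, which is the trade-off the reply about capping the queue size points at.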

  • @HalfAsleepSam
    @HalfAsleepSam 1 year ago +28

    "I can't believe I just spent 30 minutes trying to fix something when I had given the wrong input" is the description of my life

  • @IMJamby
    @IMJamby 1 year ago +138

    Someone else already pointed this out, but the "bug" with the URL is simply caused by manually splitting lines on just '\n' instead of '\r\n', which is most probably the line separator used by the file, given the Windows environment. Simply splitting on the proper separator, or trimming/ignoring whitespace, will solve it, but luckily it's not a bug in Node 😜 I'm surprised axios handles it without complaint..

    • @0sliter0
      @0sliter0 1 year ago +12

      yeah, that was pain to watch. average python developer actually trying to do code struggling so hard...

    • @pukaputyouon3233
      @pukaputyouon3233 1 year ago +2

      @IMJamby want to get paid to help me build a search engine with open ai codex? I need help 😅 I’ll pay
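
The line-separator explanation at the top of this thread is easy to reproduce. The sketch below is in Python rather than the Node.js from the video, and the two-domain string is a made-up stand-in for the real domain list; it only shows the difference between splitting on '\n' and splitting on proper line boundaries.

```python
# Hypothetical file contents: a Windows-style list with CRLF line endings.
raw = "example.com\r\nexample.org\r\n"

# Splitting on "\n" alone leaves a trailing "\r" on every entry
# (plus an empty string after the final newline).
naive = raw.split("\n")

# str.splitlines() treats "\r\n" as one line boundary; strip() additionally
# guards against stray whitespace, and empty lines are dropped.
clean = [line.strip() for line in raw.splitlines() if line.strip()]

print(naive)
print(clean)
```

In the first list every domain still carries the carriage return, which is the state several commenters suspect the URLs were in.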

  • @ITGirlll
    @ITGirlll 1 year ago +109

    The editing here was great! Enjoyed going on this kinda unnecessary journey with you 😂

  • @MyAmazingUsername
    @MyAmazingUsername 1 year ago +13

    The destination was worth the journey. That ending. Amazing.

  • @ashishpandagre6805
    @ashishpandagre6805 1 year ago +6

    The editing was pretty great. Enjoyed every single bit of it.

  • @SuperMasterDesaster
    @SuperMasterDesaster 1 year ago +33

    My guess is that the bug with the weird console output you encountered is caused by trailing carriage return characters in your domain names. The text file containing the top 1m domains probably uses \r\n as a line separator, but you only split on each \n. Thus every parsed line ends with a carriage return character (\r). This will reset the console's cursor to the start of the line when you print the domain name to stdout. Everything printed after that will overwrite the existing text on the current line.

    • @dealloc
      @dealloc 1 year ago +5

      Yep, definitely parsed the file with domains incorrectly.
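
The cursor-reset behaviour described at the top of this thread can be imitated without a real terminal. This is only a rough Python model (the `render_line` helper and the log message are invented for illustration, and real terminals handle many more control characters), but it shows why text printed after a carriage return seems to eat the start of the line.

```python
def render_line(text: str) -> str:
    """Simulate how a terminal displays one line that contains carriage returns."""
    cells: list[str] = []
    cursor = 0
    for ch in text:
        if ch == "\r":
            cursor = 0          # carriage return: jump back to column 0
            continue
        if cursor < len(cells):
            cells[cursor] = ch  # overwrite what is already on screen
        else:
            cells.append(ch)
        cursor += 1
    return "".join(cells)


domain = "example.com\r"        # hypothetical entry with a trailing CR
print(repr(render_line("Fetching " + domain + " ...")))
# prints ' ...hing example.com'
```

Nothing is lost from the string itself; only the display is overwritten, which matches the point made elsewhere in the comments that writing the value to a file would reveal the stray character.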

  • @justingolden21
    @justingolden21 1 year ago +240

    You've got plurals but what about tense such as past or present tense? "ran" vs "run" and what about "running" or "runner?" How do you know that "river bank" and "city bank" are unrelated? Even from a lexical point of view, grouping words and phrases together is complicated.
    That being said, incredibly insightful (and fun) video!

    • @muradbashirov6435
      @muradbashirov6435 1 year ago +15

      There's actually a solution for this!! Look up stemming (a very popular one is the Snowball Stemmer, and you can use the nltk library for Python). Even though it's not as accurate, you can just compare the stems of two words, and if they match, bam, you've got your answer.

    • @DominikGuzowski
      @DominikGuzowski 1 year ago +7

      I'll mention one more thing that was overlooked... All the languages of the world. 😅 Good luck with that.

    • @LuisFelipeZaguini
      @LuisFelipeZaguini 1 year ago

      @@muradbashirov6435 Nice!

    • @ko-Daegu
      @ko-Daegu 1 year ago

      @@muradbashirov6435 no shot there are a solution I guess google and other search forms work somehow thanks god

    • @nekocat34
      @nekocat34 1 year ago +4

      You could use a Semantic Similarity database, as plurals of words are semantically similar to their non-plural counterparts. It would also give you other results that are semantically close without any extra logic. One problem is that opposite words are very semantically close, so you might get the opposite of what you're looking for in the first results.

  • @luisoosiscool
    @luisoosiscool 1 year ago +5

    Hey conaticus,
    I just want to say that I love your videos and I really enjoy watching them; thanks for your content :)

  • @pritalbamnodkar2620
    @pritalbamnodkar2620 1 year ago +33

    After making my own search engine in Python, I learnt that making a search engine is mostly about doing string processing.

    • @Bluentomion
      @Bluentomion 1 year ago +1

      How? I'm looking everywhere for tutorials but there is nothing. Where and what should I search for to do this, and what should I learn?

    • @CraftingTableMC
      @CraftingTableMC 1 year ago

      @@Bluentomion it sounds doable, I bet you can do it yourself! Good luck!

    • @Bluentomion
      @Bluentomion 1 year ago

      @@CraftingTableMC Thank you. From what I know, I need to make a good classing system, but I can't find anything; for example, I don't know what tools to do it with, except for VS Code.

    • @raik1766
      @raik1766 1 year ago

      @@Bluentomion Well, it's mostly just converting stuff like "ran" to "run", etc., and obviously scraping domains / domain extensions

  • @Valyrie97
    @Valyrie97 1 year ago +5

    I'm confused. This "bug": just print it to a file, and you'll see the carriage return without a newline that is in your string. The log is doing the correct thing, by first printing a string, going to the beginning of the line, then printing another shorter string, before ending with a newline. None of these actions will erase the rest of the line after the second half of the string, after the carriage return... it's not a bug.

  • @squidtito8501
    @squidtito8501 1 year ago +10

    Awesome video! Keep up the awesome work 👍

  • @catriverr
    @catriverr 1 year ago +2

    The 'bug' in nodejs is caused by a '\r' operator most in the 'url' object btw

  • @YehonatanTavor
    @YehonatanTavor 1 year ago +20

    This is severely underrated! Keep up the great work!

  • @iamverybigsad
    @iamverybigsad 1 year ago +9

    Thought this was going to be yet another clickbaity video where you either use search results from a bunch of actual search engines or something like that. Was pleasantly surprised. Great video!

    • @boiimcfacto2364
      @boiimcfacto2364 1 year ago +2

      This is even worse though, because he literally didn't make anything close to a search engine.

  • @yumeyuki1944
    @yumeyuki1944 1 year ago +20

    5:50 Since you are doing some multithreading, to me it seems more like a race condition than a bug tbh. Maybe something like this happens: console log in one thread tries to write to the console, sees that line 5 is empty, and writes; at the same time, console log in another thread tries to write to the console. Line 5 is empty (the previous thread didn't actually do anything yet), so it writes to line 5. The result is a mix of the two.

    • @begga9682
      @begga9682 1 year ago +7

      The only way to multi-thread in JS is with workers, which he didn't use. And console.log is thread safe in that case

    • @boem231
      @boem231 1 year ago

      @@begga9682 you can use child_process

    • @dylhack
      @dylhack 1 year ago +1

      We had attempted to directly use the process's STDIO to test and it would have the same effect. The string would empty out upon modification. (dylhack)

    • @smylemusicproductions8897
      @smylemusicproductions8897 1 year ago +2

      C'mon guys, the URLs just have a carriage return appended to them, caused by splitting the URL list on every newline (\n), while newlines on Windows actually look like \r\n.

    • @dylhack
      @dylhack 1 year ago +1

      @@smylemusicproductions8897 you are right that the original file is using CRLF EOL, but we stripped them during processing.

  • @Namynnuz
    @Namynnuz 1 year ago +4

    The way to aggregate different endings into one word is called tokenisation. You would also want to use an asynchronous language with multithreading in mind. Web crawlers themselves are also quite a big topic.

  • @ja100o
    @ja100o 1 year ago +3

    For the plurals: use a fuzzy word matcher library
    Will make it waaaaaaay more resilient

  • @kaby3190
    @kaby3190 1 year ago +1

    One way you can get the singular from plural forms of words is to get the lemma, usually it's called a lemmatizer or stemmer, to get the root form of the word, with no plural and no verb conjugation.
    So "am, are, is" -> "be" and "cars" -> "car" and for irregular plurals "men" -> "man".
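
To make the mapping in this comment concrete, here is a deliberately naive Python sketch: a tiny hand-written table for irregular forms plus two suffix rules. The `IRREGULAR` table and `naive_lemma` function are invented for illustration and are not how a real lemmatizer or the Snowball stemmer works; the suffix rules will over-strip plenty of ordinary words (for example "this"), so a proper library is the realistic choice.

```python
# Tiny lookup for irregular forms (illustrative only, far from complete).
IRREGULAR = {"am": "be", "are": "be", "is": "be", "men": "man", "ran": "run"}


def naive_lemma(word: str) -> str:
    """Map a word to a rough base form using a lookup plus suffix rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # "cities" -> "city"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                # "cars" -> "car"
    return w


for word in ["am", "are", "is", "cars", "men", "cities", "class"]:
    print(word, "->", naive_lemma(word))
```

Indexing pages and queries through the same function means "cars" and "car" end up under one key, which is the effect the hand-maintained plurals file in the video was after.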

  • @thatanimeweirdo
    @thatanimeweirdo 1 year ago +9

    Instead of manually adding plurals, you could've also used projects like meilisearch, which return non-exact results (fuzzy searching) with a probability score.

  • @FiReLScar
    @FiReLScar 1 year ago +3

    I actually made a crawler in node. I used puppeteer so that it can run the JS on the page, because some pages use JS to set the description and title (which I learned after making a whole crawler with just fetch). It's actually pretty fast; it just won't ever catch up to a site like YouTube where people post their own content.

  • @Crybyte
    @Crybyte 1 year ago +2

    2:34 wasn’t expecting the spanish autotune rap in the background 😂

  • @tony2shoes982
    @tony2shoes982 1 year ago +1

    Great video. It was kinda hard to hear you with the music at some points. Would love to see more of this type of content. Very interesting.

  • @Omikronik
    @Omikronik 1 year ago +3

    what is the song at 4:48?
    Edit: found it myself, the song is called
    OTE - Orange Marmalade

  • @wellingtonalmeida2662
    @wellingtonalmeida2662 1 year ago +15

    Loved the "I have no f*cking Idea of what I'm doing so I'll just say the technology is trash" vibes 💀

  • @Povilaz
    @Povilaz 1 year ago

    Very interesting concept!

  • @mnesicles.
    @mnesicles. 1 year ago

    The reggaeton threw me off, I wasn't expecting it in this kind of video, hahaha

  • @siddhantgupta1300
    @siddhantgupta1300 1 year ago

    Great video bro especially the intro❤️‍🔥 from where u got that song 🎵

  • @Marco-vn8tc
    @Marco-vn8tc 1 year ago +2

    What a great lesson it is !!

  • @kracdev6223
    @kracdev6223 1 year ago +3

    Which extension are you using for inline errors? It looks really useful!

  • @lasslos1490
    @lasslos1490 1 year ago +27

    Now this is a conclusion that shows development towards becoming a real man.

  • @WolvericCatkin
    @WolvericCatkin 1 year ago +7

    You managed to go thirty minutes before questioning that decision??? 😹

  • @RacksDay
    @RacksDay 1 year ago

    earned a sub bro keep the work up👍👍

  • @jomy10-games
    @jomy10-games 1 year ago +5

    If you’re looking for a new backend language, Go is nice.

  • @dylnn4363
    @dylnn4363 1 year ago +3

    Does anyone know the song that begins at 5:25 ?

  • @mezohx
    @mezohx 1 year ago +13

    I applaud him. Whenever I get a bug, I'd rather start from scratch than fix it

  • @oskard3516
    @oskard3516 1 year ago +1

    parrot and programming is da best combo ig

  • @soulspirit8687
    @soulspirit8687 1 year ago +1

    spankdang added successfully LMAO

  • @ThrillDaWill
    @ThrillDaWill 1 year ago

    Pain, misery, and JavaScript

  • @LasWegas
    @LasWegas 1 year ago +1

    Great video!
    A few years ago, I tried creating a search engine as well - failed because of similar problems in much smaller scale 😂😂

  • @khanasfireza9515
    @khanasfireza9515 1 year ago +2

    Wow this is great, can you make something like elasticsearch for demonstration purpose? Would love to see that ❤

  • @gresse170
    @gresse170 1 year ago +12

    I think you could also solve the problem with the plurals using the Levenshtein distance, which is a measure of the similarity of two strings. So you can set the maximum number of differing characters to 2 or 3 to cover plurals and typos
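
As a rough illustration of this suggestion, the Python sketch below computes the edit distance with the standard two-row dynamic programme and wraps it in a threshold check. The `fuzzy_match` helper and its default threshold of 2 are assumptions chosen to match the comment, not anything from the video's code.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query: str, keyword: str, max_distance: int = 2) -> bool:
    """Treat two words as the same keyword if they are within max_distance edits."""
    return levenshtein(query.lower(), keyword.lower()) <= max_distance


print(levenshtein("car", "cars"))        # 1
print(levenshtein("kitten", "sitting"))  # 3
print(fuzzy_match("Serach", "search"))   # True (two substitutions)
```

A fixed threshold is generous for short words ("cat" and "car" are one edit apart), so scaling it with word length is a common refinement.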

  • @shakz2077
    @shakz2077 1 year ago

    great video mate, i had a great laugh!

  • @johanrong
    @johanrong 1 year ago

    What microphone do you use and do you use any audio effect if so can you list?

  • @NagatoriUnlimitedDomain

    Why didn't you use the heap view of the debugger? It shows you what's allocating a shitton of memory and making it crash.

  • @Curstantine
    @Curstantine 1 year ago +2

    >npm being slow
    Well, yeah, use pnpm or yarn; they are much faster and use much saner package management methods.

  • @rustwithoutrust
    @rustwithoutrust 1 year ago +1

    Positive alien strangeness

  • @elstonko343
    @elstonko343 1 year ago +2

    Getting click baited was “inconvenient and useless”

  • @AByteofCode
    @AByteofCode 1 year ago +5

    Next, make a hide engine

  • @pieterspruijt2
    @pieterspruijt2 1 year ago +2

    There's a way around cloudflare, for example tls clients

  • @xtremeblaze777
    @xtremeblaze777 1 year ago

    Great video overall! I think the background audio is a little too high. When I edit videos, I always try to keep background audio a couple decibels below the interviewee/speaker.

  • @j_r28
    @j_r28 1 year ago +1

    Why didn't you use NLP models to find the relation between the query and the title heads of websites?

  • @ryan-heath
    @ryan-heath 1 year ago +1

    “I’ll never touch javascript again”
    video: “let me end this nonsense now!”

  • @chadyways8750
    @chadyways8750 1 year ago +2

    "I'm never going to touch javascript again."
    Hmm yes, from experience, does not pan out.

  • @colonelmoustache
    @colonelmoustache 1 year ago +2

    You could also mimic a stack and write your function iteratively. This way you can have an "unlimited stack"
    But it's a pain in the ass to do
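
A sketch of what "mimicking a stack" could look like, in Python for brevity. The `links` dictionary is a made-up in-memory stand-in for "fetch the page and extract its links", so this shows only the control flow: an explicit list replaces the call stack, and its size is limited by memory rather than by the interpreter's recursion limit.

```python
def crawl_iteratively(start: str, links: dict[str, list[str]]) -> list[str]:
    """Depth-first visit of every page reachable from start, without recursion."""
    visited: list[str] = []
    seen = {start}
    stack = [start]                      # explicit stack replaces the call stack
    while stack:
        page = stack.pop()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:       # skip pages already scheduled
                seen.add(target)
                stack.append(target)
    return visited


# Hypothetical link graph: a -> b, c; b -> a, d.
site = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": []}
print(crawl_iteratively("a", site))      # ['a', 'c', 'b', 'd']

# A chain far deeper than Python's default recursion limit still works.
chain = {str(i): [str(i + 1)] for i in range(50_000)}
print(len(crawl_iteratively("0", chain)))
```

Swapping `stack.pop()` for a queue's pop-from-the-front turns the same loop into a breadth-first crawl.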

  • @viclan7832
    @viclan7832 1 year ago

    And so he learned Go and Rust; he made the right choice

  • @hominusprogramming
    @hominusprogramming 1 year ago

    Quick question: did you use the Levenshtein algorithm? If not, try looking it up; with a little bit of modification you can handle plurals and avoid using an external file for them

  • @ryloriz
    @ryloriz 1 year ago +1

    "I'm never gonna touch JavaScript again" yeah right

  • @jasper6788
    @jasper6788 1 year ago +1

    0:28 whats that last website there?

  • @JorgePicco
    @JorgePicco 1 year ago

    Superb conclusion

  • @HDArtzy
    @HDArtzy 1 year ago +1

    sweet bro! making my first web crawler now

  • @ThymeCypher
    @ThymeCypher 1 year ago +1

    “And find exactly what they’re looking for” - not Google in 2022, the latest algorithm is absolutely horrendous.

  • @kipchickensout
    @kipchickensout 1 year ago +1

    dude was amused by dutch

  • @LambdaCreates
    @LambdaCreates 1 year ago

    Next video: I built another search engine in Fortran.

  • @hovac.
    @hovac. 1 year ago +1

    wait till he finds out google has a "Feeling lucky" button

  • @kaithompson8115
    @kaithompson8115 1 year ago

    Most brute force search engine in history

  • @idkjuststop
    @idkjuststop 1 year ago

    how did u get windows terminal on windows 10

  • @kiwischool
    @kiwischool 1 year ago +1

    Hellooo I'm a new subscriber to your channel. What's the font you are using in vscode?

    • @tjgdddfcn
      @tjgdddfcn 1 year ago +1

      looks like jetbrains mono

  • @foqsi_
    @foqsi_ 1 year ago

    I need more conaticus...

  • @AntonioNoack
    @AntonioNoack 1 year ago +3

    I already saw a comment about stemming. Please use the next best thing that was developed after it: word embeddings :)
    and then you can calculate site importance using HITS or PageRank 😁.
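
For readers who have not met PageRank, here is a small power-iteration sketch in Python. It is only an illustration of the general idea: the damping factor of 0.85, the fixed 50 iterations, the even redistribution for pages without outgoing links, and the three-page example graph are all assumptions, and HITS and word embeddings are not covered.

```python
def pagerank(
    links: dict[str, list[str]],
    damping: float = 0.85,
    iterations: int = 50,
) -> dict[str, float]:
    """Approximate PageRank scores for a link graph by power iteration."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start from a uniform score
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share     # pass score along each link
            else:
                # Dangling page: spread its score evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: a and b both link to c, and c links back to a.
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph "c" ranks highest because two pages link to it, "a" follows because the top page links to it, and "b" gets only the baseline share since nothing links to it.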

  • @sky22_
    @sky22_ 1 year ago

    After watching this video, I feel challenged to make a search engine, but using C#. Any tips for me? Thanks :D

  • @neofox2526
    @neofox2526 1 year ago

    that funny word was amusing

  • @omdxp
    @omdxp 1 year ago +1

    I always wondered how algolia search works

  • @Carlos-do2vh
    @Carlos-do2vh 1 year ago

    Wait, was that a bird on his shoulder? BROH!

  • @ananayarora
    @ananayarora 1 year ago +1

    Use yarn / pnpm to speed up that npm install

  • @c00lkitty
    @c00lkitty 1 year ago +1

    He created a search engine??? i created a programming language but that’s still AMAZING that he built a search engine!

  • @ugandanknuckles3429
    @ugandanknuckles3429 1 year ago +1

    "I'm never going to touch JavaScript again"
    (Every developer I know)

  • @PeterfoxUwU
    @PeterfoxUwU 1 year ago +1

    I love the Message from this Video!

  • @Slashscreen
    @Slashscreen 1 year ago +1

    aaand this is why we use Go for backend

  • @goonshield9858
    @goonshield9858 1 year ago +2

    Just wait until Javascript update that you can code C in it

  • @Drewno1
    @Drewno1 1 year ago +1

    Pls give link to pop filter

  • @fairphoneuser9009
    @fairphoneuser9009 1 year ago

    I love your conclusion! 😁

  • @FirstNameLastName-gh9iw
    @FirstNameLastName-gh9iw 1 year ago +1

    Stackoverflow has been running on my computer constantly like 20 times over for the last month as i crunch for my coding classes’ exam in like a month…and because I wasn’t procrastinating I’m like done with a month to spare while I know a guy with an outline

  • @vexg2981
    @vexg2981 1 year ago +1

    I WAS NOT EXPECTING THE ENDING 💀💀💀💀

  • @c.j.hatton
    @c.j.hatton 1 year ago +20

    Just wait until you try to make a search engine in French. For every verb, there are 8 different ways to say it (with different spellings) for each tense.
    Edit: I forgot about irregular verbs. They don't follow any pattern / rule about how they are conjugated.

    • @ko-Daegu
      @ko-Daegu 1 year ago +1

      Hence NLP is a thing

    • @placeholder4988
      @placeholder4988 1 year ago

      Or having german words that are fused together. For example „Search engine“ would be „Suchmaschine“ because it uses the word „Suche“ and the word „Maschine“. This leads to things like „Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz“ or in english „beef labeling surveillance duties transfer law“

    • @1aboPLZ
      @1aboPLZ 1 year ago

      @@placeholder4988 Then it needs to cut these words. It first has to cut the syllables and then search through the Duden for words

    • @owzok7087
      @owzok7087 1 year ago

      Dude, this guy doesn't even know what he is doing. First of all, search engines don't even work like that, and second, documents always need to be preprocessed: what gets stored are the stems of every word, and stopwords, which also include verbs, are removed.

  • @kamal-hassan
    @kamal-hassan 1 year ago

    Which theme are you using?

  • @avithedev
    @avithedev 1 year ago +2

    Tears 😭😭😭😭😭
    10/10 video

  • @porroapp
    @porroapp 1 year ago

    If my memory serves me correctly, there are a few Cloudflare bypass plugins for Puppeteer.

  • @LyrelGaming
    @LyrelGaming 1 year ago +1

    Underrated

  • @theabbie3249
    @theabbie3249 1 year ago

    For cloudflare issue, you could use cached webpages from Google or internet archive, you would be using Google anyways but that's Better than quitting.

  • @sl554
    @sl554 1 year ago +1

    Better tools for scraping/crawling websites: crawlee and apify ! Simplifies request queues and proxies

  • @ipetrovbg
    @ipetrovbg 1 year ago +1

    I hope you didn't kill your parrot hitting your desk 😅

  • @bills1967
    @bills1967 1 year ago

    I’m trying to build a Pokémon team builder but I’m not sure how to get the search bar to give you the Pokémon as you type. You just have to perfectly type the name of the Pokémon.

    • @bills1967
      @bills1967 1 year ago +1

      @@DFPercush Thank you so much! I appreciate it!

  • @Kaazmaz3447
    @Kaazmaz3447 1 year ago

    So we're not gonna talk about the cockatiel parakeet sitting on his shoulder?

  • @Daxanius
    @Daxanius 1 year ago +3

    I think Rust is a good language to pick for a project like this.

  • @tanmaybarvi567
    @tanmaybarvi567 1 year ago

    Next video : YT recommendations to get more such videos