Always bet on JavaScript
That NodeJS bug was a memory issue. When the string was modified it would have side effects and cause the string to empty out. I wonder if it were immutable then would've prevented it. (dylhack)
@@dylhack The URLs just have a carriage return appended to them, caused by splitting the url list on every newline ('\n'), while newlines on Windows actually look like '\r\n'
@@smylemusicproductions8897 no we stripped them using regex. But it was an issue to begin with though.
too bad conaticus wrote bug :(
Javascript for the win
The most unrealistic part of this was that someone answered your question on StackOverflow, without just downvoting you and calling you a idiot.
lmao
XD
an*
xD
Title: "I made a Search Engine"
Video: "Javascript is bad, I didn't finish my Search engine"
He didn't even manage to parse a list of urls 😄 Babysteps people!
defeated
bro just asking cause i need it - do you know some place i could actually learn how to create a real-time web search engine? urgently needed
@@JelleDeLoecker he is bad
Next video: I made a search engine in assembly
@x5up0s seconds*
Is this a search engine? 😳 This slow garbage written in bloated javascript
@@qaraciyer why did you reply to me tho?
You could just comment under the video itself
"I made a search engine in morse code"
@@ExediceWhyNot shut up
One thing to help with performance: have a separate server/program which just acts as the queue (i.e. not a part of the main program). That way as the scraper crawls through sites and finds links, it just sends them over to the server - and when it's ready to scrape more, it can just request the next link (rather than trying to keep millions of links in memory, the queue server can properly store them to disk/wherever). Then it's also much easier to run multiple instances of the scraper/indexer - which all pull from the queue - and then put the results into some other kind of keyword/score/indexed database. Aaand finally - your search servers will then just send the query to that processed database (and I mean that's still incredibly basic for the mammoth task).
Waste of resources - every node can be a scraper; just get a SQL database and you're good to go. No need for a main server to do requests and fetch for the nodes. It would make the whole thing more stable, yes... but with the right code all of the nodes could fetch...
In my crawler I just limited the queue size to something around 10000 links. Not the best solution, but it works. If you were to store all links, you would eventually run out of space no matter what you do.
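A minimal sketch of the disk-backed queue idea the thread is describing - Python here for illustration, the table name and class are made up:

```python
import sqlite3

# A tiny disk-backed URL queue: links live in sqlite, not in memory,
# so multiple scraper processes could pull from the same database file.
class UrlQueue:
    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT UNIQUE)"
        )

    def push(self, url):
        # the UNIQUE constraint doubles as a "seen this link already" check
        self.db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
        self.db.commit()

    def pop(self):
        # FIFO: lowest id was pushed first
        row = self.db.execute(
            "SELECT id, url FROM queue ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None
        self.db.execute("DELETE FROM queue WHERE id = ?", (row[0],))
        self.db.commit()
        return row[1]

q = UrlQueue(":memory:")  # a real deployment would use a file path
q.push("https://example.com")
q.push("https://example.org")
q.push("https://example.com")  # duplicate, silently ignored
print(q.pop())  # → https://example.com
```

A real queue service would sit behind HTTP or a socket so that scraper instances on other machines can push/pop, but the storage idea is the same.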
@@randomdamian hey I will pay you to help me build my search engine with open ai codex? U in? 🙌✅
@@pukaputyouon3233 I can help if you want.
@@pukaputyouon3233 I can help you out if you're still interested
"I can't believe I just spent 30 minutes trying to fix something when I had given the wrong input" is the description of my life
Someone else already pointed out, but the "bug" of the url is simply caused by the manual split of lines with just '\n' instead of '\r\n', which is most probably the line separator used by the file, given the Windows environment. Simply splitting with the proper separator, or trimming/ignoring whitespace, will solve it - but luckily it's not a bug of node 😜 I'm surprised axios handles it without complaint..
yeah, that was painful to watch. average python developer actually trying to write code, struggling so hard...
@IMJamby want to get paid to help me build a search engine with open ai codex? I need help 😅 I’ll pay
The editing here was great! Enjoyed going on this kinda unnecessary journey with you 😂
The destination was worth the journey. That ending. Amazing.
The editing was pretty great. Enjoyed every single bit of it.
My guess is that the bug with the weird console output you encountered is caused by trailing carriage return characters in your domain names. The text file containing the top 1m domains probably uses '\r\n' as a line separator, but you only split on each '\n'. Thus every parsed line ends with a carriage return character ('\r'). This will reset the console's cursor to the start of the line when you print the domain name to stdout. Everything printed after that will overwrite the existing text on the current line.
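The parse bug described above can be reproduced in a few lines (Python here, but Node string splitting behaves the same way; the domain list is made up):

```python
# Simulate reading a Windows (CRLF) file but splitting only on "\n":
data = "google.com\r\nyoutube.com\r\nfacebook.com"
lines = data.split("\n")
print(repr(lines[0]))  # 'google.com\r' - note the stray carriage return

# On a real terminal, printing that string followed by more text moves
# the cursor back to column 0 at the '\r', so later output overwrites
# the domain name - the "haunted console" effect.

# Either fix works:
fixed = data.splitlines()                    # handles \n, \r\n and \r
also_fixed = [l.strip() for l in data.split("\n")]
print(fixed[0])  # google.com
```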
Yep, definitely parsed the file with domains incorrectly.
You've got plurals but what about tense such as past or present tense? "ran" vs "run" and what about "running" or "runner?" How do you know that "river bank" and "city bank" are unrelated? Even from a lexical point of view, grouping words and phrases together is complicated.
That being said, incredibly insightful (and fun) video!
There's actually a solution for this!! Look up stemming the words (a very popular one is the Snowball stemmer, and you can use the nltk library for python). Even though it's not as accurate, you can just compare the stems of two words and if they match, bam, you've got your answer.
I'll mention one more thing that was overlooked... All the languages of the world. 😅 Good luck with that.
@@muradbashirov6435 Nice!
@@muradbashirov6435 no shot there's a solution - I guess google and the other search engines work somehow, thank god
You could use a Semantic Similarity database, as plurals of words are semantically similar to their non-plural counterparts. It would also give you other results that are semantically close without any extra logic needed. One problem is that opposite words are also very semantically close, so you might get the opposite of what you're looking for in the first results.
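Semantic similarity usually boils down to cosine similarity between word vectors. The vectors below are tiny made-up toys - real embeddings have hundreds of dimensions and come from a trained model:

```python
import math

def cosine(a, b):
    # cosine similarity: dot product over the product of the magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Invented 3-d vectors, purely for illustration.
vectors = {
    "car":  [0.90, 0.10, 0.00],
    "cars": [0.85, 0.15, 0.05],
    "fish": [0.10, 0.20, 0.90],
}

print(cosine(vectors["car"], vectors["cars"]))  # close to 1: near-synonyms
print(cosine(vectors["car"], vectors["fish"]))  # much lower: unrelated
```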
Hey conaticus,
I just want to say that I love your videos and I really enjoy watching them; thanks for your content :)
thanks for consuming it!
After making my own search engine in Python, I learnt that making a search engine is mostly about doing string processing.
how? i'm looking everywhere for tutorials but there is nothing. where and what should i search to do this, and what to learn?
@@Bluentomion it sounds doable, I bet you can do it yourself! Good luck!
@@CraftingTableMC thank you, from what i know i need to do a good classing system, but i can't find anything like it: idk with what tools i'd do it, except for vs code.
@@Bluentomion Well it's mostly just converting stuff like ran to run, etc etc, and obviously scraping domains / domain extensions
i'm confused. this "bug" - just print it to a file and you'll see the carriage return (without a newline) that is in your string. the log is doing the correct thing: first printing a string, going to the beginning of the line, then printing another shorter string, before ending with a newline. none of these actions will erase the line after the second half of the string, after the carriage return... it's not a bug.
Awesome video! Keep the awesome work 👍
The 'bug' in nodejs is caused by a '\r' character in the 'url' string btw
This is severely underrated! Keep up the great work!
Thought this was going to be yet another clickbaity video where you either use search results from a bunch of actual search engines or something like that. Was pleasantly surprised. Great video!
This is even worse though, because he literally didn't make anything close to a search engine.
5:50 Since you are doing some multithreading, to me it seems more like a race condition than a bug tbh. Maybe something like this happens: console.log in one thread tries to write to the console, sees that line 5 is empty, and writes; at the same time, console.log in another thread tries to write to the console. Line 5 is empty (the previous thread didn't actually do anything yet), so it writes to line 5. The result is a mix of the two.
The only way to multi-thread in JS is with workers, which he didn't use. And console.log is thread safe in that case.
@@begga9682 you can use child_process
We had attempted to directly use the process's STDIO to test and it would have the same effect. The string would empty out upon modification. (dylhack)
C'mon guys, the URLs just have a carriage return appended to them, caused by splitting the url list on every newline ('\n'), while newlines on Windows actually look like '\r\n'
@@smylemusicproductions8897 you are right that the original file is using CRLF EOL, but we stripped them during processing.
The way to aggregate different endings into one word is called stemming (or lemmatisation). You'd also want to use an asynchronous language with multithreading in mind. Web crawlers themselves are also quite a big topic.
For the plurals: use a fuzzy word matcher library
Will make it waaaaaaay more resilient
One way you can get the singular from plural forms of words is to get the lemma, usually it's called a lemmatizer or stemmer, to get the root form of the word, with no plural and no verb conjugation.
So "am, are, is" -> "be" and "cars" -> "car" and for irregular plurals "men" -> "man".
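A lemmatizer is basically a lookup table of irregular forms plus some suffix rules. The exception table below is hand-made for illustration - real lemmatizers (e.g. WordNet-based ones) ship with full morphology data:

```python
# Tiny hand-rolled exception table; a real lemmatizer has thousands of entries.
IRREGULAR = {"am": "be", "are": "be", "is": "be", "men": "man", "feet": "foot"}

def lemma(word):
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]          # irregular forms: direct lookup
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # cities -> city
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # cars -> car, but leave "glass" alone
    return word

print(lemma("men"))   # man
print(lemma("cars"))  # car
```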
Instead of manually adding plurals, you could've also used projects like meilisearch, which return non-exact results (fuzzy searching) with a probability score.
I actually made a crawler in node. I used puppeteer so that it can run the js on the page, because some pages use js to set the description and title (which I learned after making a whole crawler with just fetch). It's actually pretty fast, it just won't ever catch up to a site like youtube where people post their own content.
2:34 wasn’t expecting the spanish autotune rap in the background 😂
Great video. It was kinda hard to hear you with the music at some points. Would love to see more of this type of content. Very interesting.
what is the song at 4:48?
Edit: found it myself, the song is called
OTE - Orange Marmalade
Loved the "I have no f*cking Idea of what I'm doing so I'll just say the technology is trash" vibes 💀
Very interesting concept!
The reggaeton caught me off guard, I wasn't expecting it in this kind of video, hahaha
Great video bro especially the intro❤️🔥 from where u got that song 🎵
What a great lesson it is !!
which extention are you using for inline errors? It looks really useful!
looks to be "Error Lens"
Now this is a conclusion that shows development towards becoming a real man.
You managed to go *thirty minutes* before questioning that decision??? 😹
earned a sub bro keep the work up👍👍
If you’re looking for a new backend language, Go is nice.
Does anyone know the song that begins at 5:25 ?
I applaud him, whenever I got a bug, I rather start from scratch rather than fixing it
parrot and programming is da best combo ig
spankdang added successfully LMAO
Pain, misery, and JavaScript
Great video!
A few years ago, I tried creating a search engine as well - failed because of similar problems in much smaller scale 😂😂
Wow this is great, can you make something like elasticsearch for demonstration purpose? Would love to see that ❤
I think you could also solve the problem with the plurals using the Levenshtein distance, which is a measure of the similarity of 2 strings. You can set the maximum number of differing chars between the strings to 2 or 3 to cover plurals and typos.
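For reference, here's the classic dynamic-programming edit distance, plus the thresholding idea from the comment above (the max_dist of 2 is just the suggested value, not a magic number):

```python
def levenshtein(a, b):
    # Classic DP edit distance, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def close_enough(query, word, max_dist=2):
    # Treat words within max_dist edits as a match: covers plurals & typos.
    return levenshtein(query, word) <= max_dist

print(levenshtein("car", "cars"))       # 1
print(close_enough("serch", "search"))  # True
```

The catch with this approach is that short unrelated words (e.g. "cat"/"car") are also within 1 edit, so it's usually combined with other signals.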
great video mate, i had a great laugh!
What microphone do you use, and do you use any audio effects? If so, can you list them?
Why didn't you use the Heap View of the debugger? It shows you what's allocating a shitton of memory and making it crash.
>npm being slow
Well, yeah, use pnpm or yarn - they are much faster and use much saner package management methods.
Positive alien strangeness
Getting click baited was “inconvenient and useless”
Next, make a hide engine
There's a way around cloudflare, for example tls clients
Great video overall! I think the background audio is a little too high. When I edit videos, I always try to keep background audio a couple decibels below the interviewee/speaker.
Why didn't you use NLP models to find relations between the query and the titles of websites?
“I’ll never touch javascript again”
video: “let me end this nonsense now!”
"I'm never going to touch javascript again."
Hmm yes, from experience, does not pan out.
i keep relapsing
You could also mimic a stack and implement your function iteratively. This way you can have an "unlimited stack".
But it's a pain in the ass to do
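It's less painful than it sounds - the recursion becomes a loop over an explicit stack. The link graph below is invented just to show the shape:

```python
# Made-up link graph for illustration; a real crawler discovers these
# edges by fetching each page and parsing its <a href> tags.
LINKS = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": [],
    "d.com": ["a.com"],  # cycle - the visited set handles it
}

def crawl(start):
    stack = [start]      # the explicit "call stack": pages still to visit
    visited = []
    while stack:
        page = stack.pop()
        if page in visited:
            continue
        visited.append(page)
        # push outgoing links instead of recursing into them,
        # so depth is bounded by memory, not by the call-stack limit
        stack.extend(LINKS.get(page, []))
    return visited

print(crawl("a.com"))  # visits all four domains, each exactly once
```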
And so he learned go and rust, he did the right choice
Quick question: did you use the Levenshtein algorithm? If not, try searching for it - with a little bit of modification you can include plurals and avoid using an external file for them
"I'm never gonna touch JavaScript again" yeah right
0:28 whats that last website there?
Superb conclusion
sweet bro! making my first web crawler now
and hella jeff.
“And find exactly what they’re looking for” - not Google in 2022, the latest algorithm is absolutely horrendous.
dude was amused by dutch
and afrikaans
Next video: I built another search engine in Fortran.
wait till he finds out google has a "Feeling lucky" button
Most brute force search engine in history
how did u get windows terminal on windows 10
Hellooo I'm a new subscriber to your channel. What's the font you are using in vscode?
looks like jetbrains mono
I need more conaticus...
I already saw a comment about stemming. Please use the next best thing, that was developed after it: word embeddings :)
and then you can calculate site importance using Hits or PageRank 😁.
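PageRank itself fits in a few lines of power iteration. The three-page link graph here is invented, and the damping factor 0.85 is the conventional choice from the original paper; this sketch also assumes every page has at least one outgoing link:

```python
def pagerank(links, damping=0.85, iters=50):
    # Power iteration: repeatedly redistribute each page's rank
    # across its outgoing links until the values settle.
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p in pages:
            out = links[p]
            for q in out:
                new[q] += damping * rank[p] / len(out)
        rank = new
    return rank

# Toy graph: both a and b link to c, so c should come out on top.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # c
```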
After watching this video, I feel challenged to make a search engine, but using C#. Any tips for me? Thanks :D
that funny word was amusing
I always wondered how algolia search works
Wait, was that a bird on his shoulder? BROH!
Use yarn / pnpm to speed up that npm install
He created a search engine??? i created a programming language but that’s still AMAZING that he built a search engine!
"I'm never going to touch JavaScript again"
(Every developer I know)
I love the Message from this Video!
aaand this is why we use Go for backend
Just wait until the JavaScript update that lets you code C in it
Pls give link to pop filter
I love your conclusion! 😁
Stackoverflow has been running on my computer constantly like 20 times over for the last month as i crunch for my coding classes’ exam in like a month…and because I wasn’t procrastinating I’m like done with a month to spare while I know a guy with an outline
I WAS NOT EXPECTING THE ENDING 💀💀💀💀
Just wait until you try to make a search engine in French. For every verb, there are 8 different ways to say it (with different spellings) for each tense.
Edit: I forgot about irregular verbs. They don't follow any pattern / rule about how they are conjugated.
Hence NLP is a thing
Or having german words that are fused together. For example „Search engine“ would be „Suchmaschine“ because it uses the word „Suche“ and the word „Maschine“. This leads to things like „Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz“ or in english „beef labeling surveillance duties transfer law“
@@placeholder4988 then it needs to split these words. It first has to cut the syllables and then search through the Duden for words
dude, this guy doesn't even know what he is doing. first of all, search engines don't even work like that, and second, documents always need to be preprocessed: what gets stored are the stems of every word, and stopwords are removed, which also include verbs.
Which theme are you using?
Tears 😭😭😭😭😭
10/10 video
If my memory serves me correctly, there are a few cloudflare bypass plugins for Puppeteer.
Underrated
For the cloudflare issue, you could use cached webpages from Google or the Internet Archive. You would be using Google anyway, but that's better than quitting.
Better tools for scraping/crawling websites: crawlee and apify! They simplify request queues and proxies.
I hope you didn't kill your parrot hitting your desk 😅
I’m trying to build a Pokémon team builder but I’m not sure how to get the search bar to give you the Pokémon as you type. You just have to perfectly type the name of the Pokémon.
@@DFPercush Thank you so much! I appreciate it!
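For what it's worth, the simplest version of search-as-you-type is a case-insensitive prefix filter re-run on every keystroke. The name list here is just a small sample for illustration:

```python
POKEMON = ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander",
           "Charmeleon", "Charizard", "Pikachu", "Pidgey"]

def suggest(prefix, names=POKEMON, limit=5):
    # Case-insensitive prefix match; call this on every input event
    # and render the returned list under the search bar.
    p = prefix.lower()
    return [n for n in names if n.lower().startswith(p)][:limit]

print(suggest("char"))  # ['Charmander', 'Charmeleon', 'Charizard']
print(suggest("pi"))    # ['Pikachu', 'Pidgey']
```

For fuzzier matching (typos, mid-word matches), substring search or a trie are the usual next steps.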
So we're not gonna talk about the cockatiel sitting on his shoulder?
I think Rust is a good language to pick for a project like this.
Next video : YT recommendations to get more such videos