Scraping multiple websites with one Python script

John Watson Rooney

24,858 views

Added on 5 September 2024
Writing a simple web scraping script to do some basic price comparison
github.com/jhn...
Scraper API www.scrapingbe...
Patreon: / johnwatsonrooney
Donations: www.paypal.com...
Proxies: iproyal.club/J...
Hosting: Digital Ocean: m.do.co/c/c7c9...
Gear I use: www.amazon.co....
Disclaimer: These are affiliate links and as an Amazon Associate I earn from qualifying purchases
Science & Technology

Comments • 55

@sheikh4awais 1 year ago ⁺¹⁰
Could you also make one tutorial for your code editor setup and the terminal? It looks really cool.
@JohnWatsonRooney 1 year ago ⁺⁹
Yes, working on a setup video and neovim video!
@sheikh4awais 1 year ago ⁺²
@JohnWatsonRooney thanks so much.
@bakerssebandeke6764 1 year ago ⁺²
@JohnWatsonRooney how's that video coming along?? 😊
@JohnWatsonRooney 1 year ago ⁺⁴
@bakerssebandeke6764 haha yeah... soon :)
@silkogelman 1 year ago ⁺³
Thank you John! 🙏
Informative and it got me a couple of new ideas I want to try now! 💡😀
@JohnWatsonRooney 1 year ago ⁺²
thanks, that's great!
@zhengdiao3494 1 year ago ⁺²
Learned a lot from your video. I hope you come out with a neovim editor tutorial. Thank you, sir!
@JohnWatsonRooney 1 year ago ⁺¹
I am working on a neovim video, and thanks for watching
@zhengdiao3494 1 year ago
@JohnWatsonRooney Thanks, have a nice life!
@dennistanui7085 1 year ago ⁺³
Thanks a lot, always informative. How would you then run the two scrapers concurrently? And how would you pattern match when scraping a lot of products (i.e. scrape all products on both sites, and then create a product_dataframe, for example, with a price comparison)?
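
One possible way to run two scrapers at the same time is a thread pool from the standard library, sketched below. This is not taken from the video: `scrape_site_a`, `scrape_site_b`, and the prices they return are hypothetical placeholders standing in for real functions that would fetch and parse a page.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_site_a(query: str) -> dict:
    # Hypothetical placeholder: a real scraper would request and parse a page here.
    return {"site": "site-a", "query": query, "price": 199.0}


def scrape_site_b(query: str) -> dict:
    # Hypothetical placeholder with made-up data, as above.
    return {"site": "site-b", "query": query, "price": 189.0}


def compare_prices(query: str) -> list[dict]:
    """Run every scraper in its own thread and return results, cheapest first."""
    scrapers = [scrape_site_a, scrape_site_b]
    with ThreadPoolExecutor(max_workers=len(scrapers)) as pool:
        # map() keeps the input order; each call runs in a worker thread.
        results = list(pool.map(lambda scrape: scrape(query), scrapers))
    return sorted(results, key=lambda row: row["price"])


results = compare_prices("guitar")
```

Threads suit this kind of work because each scraper spends most of its time waiting on the network. The list of dicts could then be passed to a dataframe library for the comparison table.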
@bainsk8 4 months ago
Great video John, thank you. Very informative.
@yawarvoice 1 year ago ⁺²
Hi,
I've asked this question on another video of yours as well, but I'm asking here again in case you missed the other one:
@John I've been following you for a long time and watching all your scraping videos with Python. I have started to create a scraper, but the website is not allowing me access because it treats my script as a bot. I have changed the user-agent to the latest Chrome, but the website still recognizes me as a bot. My question is: which combo should I use for scraping slightly complex JS/AJAX/bot-aware websites? People say that Selenium is good for that purpose, but you say that Selenium is not a good option nowadays as it is slow. So what do you suggest? Which combo should I use that can fit many scenarios, if not all?
Looking forward!
Thanks.
@JohnWatsonRooney 1 year ago ⁺²
Hi - it depends on the site, but generally I suggest trying: a) adding more headers as well as the user agent, b) trying playwright/selenium with the undetectable driver, c) using proxies, d) a combination of all three. Beating some anti-bot protection can be tricky; it takes time to figure out what you need to do to comply.
@yawarvoice 1 year ago ⁺¹
@JohnWatsonRooney Normally Cloudflare is the only hindrance. Where can I find detailed documentation for selectolax? I'm right now writing a scraper using cloudscraper (found it in a comment, answered by you) and it has bypassed Cloudflare. But I'm having trouble with selectolax right now, unable to find proper documentation. Is there any other fast alternative to selectolax that has a bigger community?
@JohnWatsonRooney 1 year ago ⁺²
@yawarvoice selectolax is just an HTML parser - the main one in the Python community is BeautifulSoup, you could give that a go
@yawarvoice 1 year ago
@JohnWatsonRooney Got it. One last thing: which one would you prefer: 1) SE+BS, 2) Playwright + BS, or 3) Cloudscraper + BS?
@mmemahmoud7274 1 year ago ⁺¹
Nice work as always. Can you please make a video about how to scrape email addresses from a domain?
@pypypy4228 1 year ago ⁺²
Great video as always! But how do you manage not to get banned by Amazon? I tried scraping a couple of years ago - it always detected my script as a robot and didn't give data.
@JohnWatsonRooney 1 year ago ⁺²
Thanks! I’ve never had an issue with Amazon - I found that I usually just need a user agent and occasionally the language header and I’m good
@pypypy4228 1 year ago ⁺¹
@JohnWatsonRooney thank you! I gotta give it a try!
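
A minimal sketch of the header advice in this thread: send a browser-like user agent plus a language header with each request. It uses only the standard library's `urllib.request`. The header values and the `example.com` URL are illustrative assumptions, not the ones used in the video, and no request is actually sent here.

```python
import urllib.request
from typing import Optional

# Illustrative, assumed values; copy real ones from your own browser's dev tools.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}


def build_request(url: str, extra_headers: Optional[dict] = None) -> urllib.request.Request:
    """Create a request that carries browser-like headers."""
    headers = {**BROWSER_HEADERS, **(extra_headers or {})}
    return urllib.request.Request(url, headers=headers)


# Placeholder URL; urllib.request.urlopen(req) would perform the actual fetch.
req = build_request("https://example.com/product/123")
```

The same dictionary can be passed as the `headers` argument of other HTTP clients. Whether these headers are enough depends entirely on the site.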
@samoylov1973 1 year ago ⁺¹
Thank you for this video! Works wonderfully with a particular item. But what if I want to get multiple items, say, news stories from a website? html.css_first(selector).text().strip() - css_first gets only the latest one. css_all doesn't work, and just html.css(selector) won't work either. Please help.
@JohnWatsonRooney 1 year ago ⁺¹
Thanks. html.css(selector) will return a list of all matching elements for the given selector, so we can loop through this and call .text() on each iteration to get the data
@samoylov1973 1 year ago
@JohnWatsonRooney Thank you! Waiting for more videos! Take care!
@ericxls93 1 year ago ⁺¹
Very good video as usual! Thank you! When is the ChatGPT video coming 🤔?
@JohnWatsonRooney 1 year ago ⁺¹
thanks! hmm not a fan of ChatGPT, not sure I'll cover it
@PankajThakur-jq1td 1 year ago ⁺¹
Hey John, how can we scrape a page which requires a zip code to open the actual data to scrape, and various navigation steps to get to the data?
@JohnWatsonRooney 1 year ago
Yes, you will need to see how the website works. Sometimes it’s an Ajax request when you enter the zip code, which you can copy; other times it might need browser automation.
@LHCB6 1 year ago ⁺¹
Thanks for the zz shortcut
@JohnWatsonRooney 1 year ago
It’s a good one I didn’t even know about until recently
@shehbanpatel 1 year ago
Hello, I tried this but keep getting the AttributeError: 'NoneType' object has no attribute 'text'. I printed the text this resp receives and it doesn't have the tag which shows up when inspecting the page.
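
That error typically appears when `css_first()` finds no match and returns `None`, for example because the tag is added by JavaScript and is missing from the raw response. A small guard, sketched below, avoids the crash. `safe_text` is a hypothetical helper name; it works with any parsed tree that exposes selectolax-style `css_first()` and `text()` methods.

```python
from typing import Optional


def safe_text(tree, selector: str) -> Optional[str]:
    """Return the stripped text of the first match, or None when nothing matches."""
    node = tree.css_first(selector)  # None when the selector matches nothing
    if node is None:
        return None
    return node.text(strip=True)
```

With selectolax this might be called as `safe_text(HTMLParser(resp.text), "span.price")`, where `span.price` is a placeholder selector. A `None` result is a signal to check whether the element exists in the downloaded HTML at all.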
@gh-sb1dy 1 year ago
Your vids are great.
When getting info from a site using Python, is the IP address the same as my computer's, or does Python have its own different IP address? And the same with Scrapy: if I use Scrapy, is the IP address the same as this computer's?
Because some sites have blocks set up to prevent things like this, and I don't want to get banned forever by my IP.
Is there any way to bypass this so you don't get banned?
@Wassilvideos 1 year ago
Hi John, I have a question: can you guide me on how to scroll down a scrollable ul list in a section of the HTML with Playwright?
@herrpez 11 months ago
Oops. Misspelled Thomann; better remake the video! 😉
@JohnWatsonRooney 11 months ago
Haha yeah - I have actually done that before!
@lasangagamers 1 year ago
I have written the code but it will not print any results
@garyjo3229 1 year ago ⁺¹
One question: what is your IDE?
@JohnWatsonRooney 1 year ago
Neovim - it’s a slightly modified version of chrisatmachine’s basic IDE if you google it
@gh-sb1dy 1 year ago ⁺¹
Can you please post the code from your videos in a link below or on GitHub, etc.? It would be so helpful.
@JohnWatsonRooney 1 year ago
github.com/jhnwr/youtube - I am reorganizing my GitHub but here it is
@bakasenpaidesu 1 year ago ⁺²
The comment section be like
Video: "How I survived from dying"
Comments: the shirt looks good.
What I mean is everyone is asking for the IDE 😂
@JohnWatsonRooney 1 year ago ⁺¹
Haha yeah, I didn’t think people would be that interested in it
@void-qy4ov 1 year ago
Hey man, 10x for your tuts.
I'm doing a lot of scraping. Lately I needed to get the logos of 20k e-commerce stores.
Imho, it was an interesting task. Unfortunately only about 1/3 could be automated - I went with finding divs, classnames, and image sources having 'logo' in them.
Maybe you did something like that before and have an interesting strategy?
@JohnWatsonRooney 1 year ago
Hey, thanks. Interesting task, as you say. I would probably save the HTML for each store into a document database like Mongo, and then test different patterns against each - that saves having to make loads of requests over and over. This way you could try different approaches and see which works, updating the database with the logo as you go. It's a theoretical approach, though, and would probably need revising as you go.
@void-qy4ov 1 year ago
@JohnWatsonRooney yep, I skipped the db part and just used saved pages (played with filenames to get a correlation to the store identifier). Picking a strategy is the tricky part: every site chooses its own way to keep the logo, even on platforms like Shopify or WP :)
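
The "save the HTML once, then try patterns offline" idea from this thread could look roughly like the sketch below. To stay self-contained it swaps the suggested document database for the standard library's `sqlite3`. The table layout, the store IDs, the sample HTML, and the two regex patterns are all made-up assumptions, and regexes are only a crude stand-in for a real HTML parser.

```python
import re
import sqlite3
from typing import Optional

# Assumed patterns, tried in order: a "logo" class name first, then a "logo" file name.
LOGO_PATTERNS = [
    re.compile(r'<img[^>]*class="[^"]*logo[^"]*"[^>]*src="([^"]+)"', re.I),
    re.compile(r'<img[^>]*src="([^"]*logo[^"]*)"', re.I),
]


def save_page(conn: sqlite3.Connection, store_id: str, html: str) -> None:
    """Store a page once so patterns can be tested without new requests."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (store_id, html) VALUES (?, ?)",
        (store_id, html),
    )


def find_logo(html: str) -> Optional[str]:
    """Return the first logo URL any pattern finds, else None."""
    for pattern in LOGO_PATTERNS:
        match = pattern.search(html)
        if match:
            return match.group(1)
    return None


def update_logos(conn: sqlite3.Connection) -> None:
    """Fill in logos for stores that do not have one yet."""
    rows = conn.execute("SELECT store_id, html FROM pages WHERE logo IS NULL").fetchall()
    for store_id, html in rows:
        logo = find_logo(html)
        if logo:
            conn.execute("UPDATE pages SET logo = ? WHERE store_id = ?", (logo, store_id))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (store_id TEXT PRIMARY KEY, html TEXT, logo TEXT)")
save_page(conn, "shop-a", '<header><img class="site-logo" src="/img/brand.png"></header>')
save_page(conn, "shop-b", '<div><img src="/assets/logo-small.svg" alt=""></div>')
save_page(conn, "shop-c", "<p>No images here</p>")
update_logos(conn)
results = dict(conn.execute("SELECT store_id, logo FROM pages"))
```

Stores left with a `NULL` logo are the ones to target with the next pattern, and rerunning `update_logos` after adding a pattern touches only those rows.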
@bensikes1640 1 year ago
I’m trying to scrape addresses (zip code, city, state, etc.) from thousands of websites. How would you recommend I do this? I’m trying regular expression stuff, but even then it pulls in other info.
@ChristopherBrown-bj4zl 1 year ago ⁺¹
7:05 Yeah but, show me the ugly as sin CSS selectors/HTML. Those are the ones that give me the hardest time. Great vids! Thanks!
@JohnWatsonRooney 1 year ago
Haha, yeah, I understand. I'll include some more wonky stuff going forward
@jigneshprajapati6974 1 year ago
How to automate the captcha in Python?
1 year ago ⁺¹
so coooool
@JohnWatsonRooney 1 year ago
Thanks!
@c__0ne 1 year ago
Nice! Is this neovim? Can you tell me how to get this editor with syntax highlighting, tabs, etc.? Thank you!
@JohnWatsonRooney 1 year ago ⁺²
Yes it is! I am going to do a video on it, but if you google "chrisatmachine basic IDE neovim" it's basically that
@c__0ne 1 year ago ⁺¹
@JohnWatsonRooney thx!
