Scraping multiple websites with one Python script

  • Added 5 Sep 2024
  • Writing a simple web scraping script to do some basic price comparison
    github.com/jhn...
    Scraper API www.scrapingbe...
    Patreon: / johnwatsonrooney
    Donations: www.paypal.com...
    Proxies: iproyal.club/J...
    Hosting: Digital Ocean: m.do.co/c/c7c9...
    Gear I use: www.amazon.co....
    Disclaimer: These are affiliate links and as an Amazon Associate I earn from qualifying purchases
  • Science & Technology

Comments • 55

  • @sheikh4awais 1 year ago +10

    Could you also make one tutorial for your code editor setup and the terminal? It looks really cool.

  • @silkogelman 1 year ago +3

    Thank you John! 🙏
    Informative and it got me a couple of new ideas I want to try now! 💡😀

  • @zhengdiao3494 1 year ago +2

    I learned a lot from your video and hope you come out with a Neovim editor tutorial. Thank you, sir!

    • @JohnWatsonRooney 1 year ago +1

      I am working on a neovim video and thanks for watching

    • @zhengdiao3494 1 year ago

      @JohnWatsonRooney Thanks, have a nice life!

  • @dennistanui7085 1 year ago +3

    Thanks a lot, always informative. How would you then run the two scrapers concurrently? And how would you pattern match when scraping a lot of products (i.e. scrape all products on both sites, and then create, for example, a product_dataframe with a price comparison)?

  • @bainsk8 4 months ago

    Great video John, thank you. Very informative.

  • @yawarvoice 1 year ago +2

    Hi,
    I've asked this question on another video of yours as well, but I'm asking here again in case you missed the other one:
    @John I've been following you for a long time and watching all your Python scraping videos. I have started to write a scraper, but the website won't let me in because it treats my script as a bot. I have changed the user-agent to the latest Chrome, but the website still recognizes me as a bot. My question is: which combo should I use for scraping slightly complex JS/AJAX/bot-aware websites? People say that Selenium is good for that purpose, but you say that Selenium is not a good option nowadays as it is slow. So what do you suggest? Which combo should I use that can fit many scenarios, if not all?
    Looking forward!
    Thanks.

    • @JohnWatsonRooney 1 year ago +2

      Hi - it depends on the site, but generally I suggest trying: a) adding more headers as well as the user agent, b) trying Playwright/Selenium with the undetectable driver, c) using proxies, d) a combination of all three. Beating some anti-bot protection can be tricky; it takes time to figure out what you need to do to comply.
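
A minimal sketch of options (a) and (c) from the reply above, making no assumptions about any particular site. It uses the standard library's urllib.request so it runs without extra packages; the same headers dict could be passed to requests or httpx instead. The header values, the example.com URL, and the build_request helper are illustrative assumptions, not taken from the video.

```python
import urllib.request

# Assumed values: in practice, copy the headers your own browser sends.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}


def build_request(url, proxy=None):
    """Return an (opener, request) pair with browser-like headers.

    `proxy` is an optional proxy URL (hypothetical, e.g. "http://host:port").
    """
    req = urllib.request.Request(url, headers=BROWSER_HEADERS)
    handlers = []
    if proxy:
        # Route both plain and TLS traffic through the same proxy.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    return opener, req


opener, req = build_request("https://example.com/product/123")
# opener.open(req, timeout=10) would send the request; it is left out here
# so the sketch runs offline.
```

Copying the full header set from a real browser session (the Network tab in the developer tools) is usually a better starting point than guessing values.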

    • @yawarvoice 1 year ago +1

      @JohnWatsonRooney Normally Cloudflare is the only hindrance. Where can I find detailed documentation for selectolax? I'm right now writing a scraper using cloudscraper (found it in a comment answered by you) and it has bypassed Cloudflare. But I'm having trouble with selectolax and can't find proper documentation. Is there any other fast alternative to selectolax that has a bigger community?

    • @JohnWatsonRooney 1 year ago +2

      @yawarvoice selectolax is just an HTML parser - the main one in the Python community is BeautifulSoup, you could give that a go

    • @yawarvoice 1 year ago

      @JohnWatsonRooney Got it. One last thing: which one would you prefer: 1) SE+BS, 2) Playwright + BS, or 3) Cloudscraper + BS?

  • @mmemahmoud7274 1 year ago +1

    Nice work as always. Can you please make a video about how to scrape email addresses from a domain?

  • @pypypy4228 1 year ago +2

    Great video as always! But how do you happen not to be banned by Amazon? I tried scraping a couple of years ago - it always detected my script as robot and didn't give data.

    • @JohnWatsonRooney 1 year ago +2

      Thanks! I’ve never had an issue with Amazon - I found that I usually just need a user agent and occasionally the language header and I’m good

    • @pypypy4228 1 year ago +1

      @JohnWatsonRooney thank you! I gotta give it a try!

  • @samoylov1973 1 year ago +1

    Thank you for this video! It works wonderfully for a single item. But what if I want to get multiple items, say, news stories from a website? html.css_first(selector).text().strip() gets only the first match. css_all doesn't work, and plain html.css(selector) doesn't work either. Please help.

    • @JohnWatsonRooney 1 year ago +1

      Thanks. html.css(selector) will return a list of all matching elements for the given selector so we can loop through this and call .text() on each iteration to get the data

    • @samoylov1973 1 year ago

      @JohnWatsonRooney Thank you! Waiting for more videos! Take care!

  • @ericxls93 1 year ago +1

    Very good video as usual! Thank you! When is the ChatGPT video coming 🤔?

    • @JohnWatsonRooney 1 year ago +1

      thanks! hmm not a fan of chatgpt, not sure i'll cover it

  • @PankajThakur-jq1td 1 year ago +1

    Hey John, how can we scrape a page that requires a zip code before it shows the actual data, plus several navigation steps to get to that data?

    • @JohnWatsonRooney 1 year ago

      Yes, you will need to see how the website works. Sometimes it’s an Ajax request when you enter the zip code, which you can copy; other times it might need browser automation

  • @LHCB6 1 year ago +1

    Thanks for the zz shortcut

    • @JohnWatsonRooney 1 year ago

      It’s a good one I didn’t even know about until recently

  • @shehbanpatel 1 year ago

    Hello, I tried this but keep getting the AttributeError "'NoneType' object has no attribute 'text'". I printed the text that resp receives and it doesn't contain the tag that shows up when inspecting the page.

  • @gh-sb1dy 1 year ago

    Your vids are great.
    When getting info from a site using Python, is the IP the same as my computer's, or does the script get its own different IP address? Same with Scrapy: if I use Scrapy, is the IP address the same as this computer's?
    I ask because some sites have blocks set up to prevent this type of thing, and I don't want my IP to get banned forever.
    Is there any way around this so you don't get banned?

  • @Wassilvideos 1 year ago

    Hi John, I have a question: can you show me how to scroll down a scrollable ul list in a section of the HTML with Playwright?

  • @herrpez 11 months ago

    Oops. Misspelled Thomann; better remake the video! 😉

  • @lasangagamers 1 year ago

    I have written the code but it does not print any results

  • @garyjo3229 1 year ago +1

    One question: what is your IDE?

    • @JohnWatsonRooney 1 year ago

      Neovim - it’s a slightly modified version of chrisatmachine’s basic ide if you google it

  • @gh-sb1dy 1 year ago +1

    Can you please post the code from your videos at a link below, on GitHub, etc.? It would be so helpful

    • @JohnWatsonRooney 1 year ago

      github.com/jhnwr/youtube - I am reorganizing my github but here it is

  • @bakasenpaidesu 1 year ago +2

    The comment section be like
    Video: "How I survived dying"
    Comments: the shirt looks good.
    What I mean is everyone is asking about the IDE 😂

    • @JohnWatsonRooney 1 year ago +1

      Haha yeah, I didn’t think people would be that interested in it

  • @void-qy4ov 1 year ago

    Hey man, 10x for your tuts.
    I'm doing a lot of scraping. Lately I needed to get the logos of 20k e-commerce stores.
    Imho, it was an interesting task. Unfortunately only about 1/3 could be automated - I went with finding divs, class names, and image sources that have 'logo' in them.
    Maybe you did something like that before and have an interesting strategy?

    • @JohnWatsonRooney 1 year ago

      Hey, thanks. Interesting task, as you say. I would probably save the HTML for each store into a document database like Mongo, and then test different patterns against each one - it saves having to make loads of requests over and over. This way you could try different approaches and see which works, updating the database with the logo as you go. It's a theoretical approach, so it would probably need revising as you go.
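
A rough sketch of that workflow, under stated assumptions: an in-memory SQLite table stands in for the document database (Mongo is not used here), the three saved pages are invented, and the two regex patterns are only examples of strategies one might try. A real run would more likely use an HTML parser than regexes.

```python
import re
import sqlite3

# Stand-in for a document store such as MongoDB: one row per saved page.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (store TEXT PRIMARY KEY, html TEXT, logo TEXT)")

# Hypothetical saved pages; in practice these come from a one-time crawl.
saved = {
    "shop-a": '<header><img class="site-logo" src="/img/a.png"></header>',
    "shop-b": '<a href="/"><img src="/assets/logo-b.svg" alt="B"></a>',
    "shop-c": '<div class="brand"><img src="/c.png"></div>',
}
db.executemany("INSERT INTO pages (store, html) VALUES (?, ?)", saved.items())

# Candidate strategies, tried in order; each captures the image URL.
PATTERNS = [
    re.compile(r'<img[^>]*class="[^"]*logo[^"]*"[^>]*src="([^"]+)"', re.I),
    re.compile(r'<img[^>]*src="([^"]*logo[^"]*)"', re.I),
]


def find_logo(html):
    """Return the first logo URL any pattern finds, or None."""
    for pattern in PATTERNS:
        match = pattern.search(html)
        if match:
            return match.group(1)
    return None


# Only revisit pages that no earlier strategy has resolved.
rows = db.execute("SELECT store, html FROM pages WHERE logo IS NULL").fetchall()
for store, html in rows:
    logo = find_logo(html)
    if logo:
        db.execute("UPDATE pages SET logo = ? WHERE store = ?", (logo, store))

unresolved = [
    row[0]
    for row in db.execute("SELECT store FROM pages WHERE logo IS NULL ORDER BY store")
]
print(unresolved)
```

Because the pages are stored locally, adding a new pattern to PATTERNS and rerunning the loop costs no extra requests, and the unresolved list shows which stores still need a different strategy.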

    • @void-qy4ov 1 year ago

      @JohnWatsonRooney Yep, I skipped the db part and just used saved pages (played with filenames to keep a mapping to the store identifier). Picking a strategy is the tricky part: every site chooses its own way to keep the logo, even on platforms like Shopify or WP :)

    • @bensikes1640 1 year ago

      I’m trying to scrape addresses (zip code, city, state, etc.) from thousands of websites. How would you recommend I do this? I’m trying regular expressions, but even then they pull in other info.

  • @ChristopherBrown-bj4zl 1 year ago +1

    7:05 Yeah but, show me the ugly as sin CSS selectors/HTML. Those are the ones that give me the hardest time. Great vids! Thanks!

    • @JohnWatsonRooney 1 year ago

      haha, yeah i understand. I'll include some more wonky stuff going forward

  • @jigneshprajapati6974 1 year ago

    how to automate the captcha in python

  • 1 year ago +1

    so coooool

  • @c__0ne 1 year ago

    Nice! Is this neovim? Can you write to me how to get this editor with syntax highlighting tabs etc? Thank you!

    • @JohnWatsonRooney 1 year ago +2

      Yes it is! I am going to do a video on it, but if you google "chrisatmachine basic IDE neovim" it's basically that

    • @c__0ne 1 year ago +1

      @JohnWatsonRooney thx!