Python and Scrapy - Scraping Dynamic Site (Populated with JavaScript)

  • Published on 13. 09. 2024

Comments • 201

  • @codeRECODE  3 years ago +8

    Hi everyone, I need your support to get this channel running. *Please SUBSCRIBE and Like!*
    Leave a comment with your questions, suggestions, or a word of appreciation :-)
    I would love your suggestions for new videos.

  • @harshnambiar  4 years ago +24

    You did this without even using Docker or Splash. That is pretty cool. 🌸

  • @julian.borisov  3 years ago +19

    "Without Selenium" caught my attention!

    • @klarnorbert  3 years ago +1

      I mean, Selenium is not for web scraping (it's mostly used for automating web app testing). If you can reverse engineer the API, like in this video, Scrapy is more than enough.
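
      A minimal sketch of that reverse-engineered-API approach (the endpoint URL and JSON keys are made up for illustration; run it with scrapy runspider):

      import json
      import scrapy

      class SchoolsApiSpider(scrapy.Spider):
          name = "schools_api"
          # hypothetical JSON endpoint spotted in the browser's XHR tab
          start_urls = ["https://example.com/api/getAllSchools"]

          def parse(self, response):
              data = json.loads(response.text)  # the endpoint returns JSON, not HTML
              for school in data:
                  yield {"name": school.get("schoolName")}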

    • @k.m.jiaulislamjibon1443  3 years ago

      @@klarnorbert But sometimes you have no way other than Selenium. Some web app developers are clever enough to encapsulate the function calls so that the page doesn't show the XHR requests. I had to use Selenium to parse data in a web app.

  • @osmarribeiro  4 years ago +6

    OMG! Amazing video. I'm learning Scrapy now, and this video helped me a lot.

  • @igorwarzee  3 years ago +2

    It really helped me a lot. Thank you and congrats. Cheers from Brazil!

  • @kenrosenberg8835  3 years ago +2

    Wow! You are a very smart programmer. I never thought of making REST API calls directly and then parsing the response. Very nice. There is a lot to learn in your videos, more than just scraping.

  • @lambissol7423  3 years ago +3

    Excellent!! I feel like you doubled my knowledge of web scraping!

  • @gamelin1234  3 years ago +3

    Just used this technique to scrape a huge dataset after struggling for a couple of hours with requests+BS. Thank you so much for the great content!

  • @sebleaf8433  3 years ago +4

    Wow!! This is awesome! Thank you so much for teaching us new things with scrapy :)

    • @codeRECODE  3 years ago

      Thank you :-)

    • @mohamedbhasith90  9 months ago

      @@codeRECODE Hi sir, I'm trying to scrape a website with hidden APIs like you did in this video, but the data comes from a POST request, not a GET request like in the video. I'm really stuck here. Can you make a video on scraping a hidden API with a POST request? I hope you find this comment.

  • @RonZuidema  4 years ago +3

    Great video, thanks for the simple but precise instruction!

  • @cueva_mc  3 years ago +2

    This is amazing, thank you!

  • @Chris-vx6eb  4 years ago +6

    This took me 2 days to figure out. If you're having trouble with json.loads(), I found out that the JSON data I scraped was actually a byte string, so I had to decode it BEFORE using json.loads(). So where he had (9:47)
    *raw_data = response.body*
    replace it with: *raw_data = response.body.decode("utf-8")*
    then continue with: *data = json.loads(raw_data)*
    TO CHECK IF YOU NEED TO DO THIS, RUN THIS TEST:
    *raw_data = repr(response.body)* # repr() is a built-in function that (1) turns Python objects into printable objects, so you can see what you're dealing with, and (2) in my case shows whether you have a byte string, because the printed value will start with a 'b' prefix.
    *print(raw_data)*
    output>>> b'{ {data:...}, otherdata: [{...},{...}] }'
    If you see this b, use the method described above. Hope I saved someone time; Stack Overflow doesn't have a question for this yet (:

    • @codeRECODE  4 years ago +2

      @chris - Good catch!
      Short answer: replace response.body.decode("utf-8") with response.text
      Detailed answer:
      Let's understand text and body.
      response.body contains the raw response without any decoding.
      response.text contains the decoded response as a string.
      In this video, response.body worked because no special decoding was required.
      Your method is correct. An even better approach is to use response.text, since the response is actually a TextResponse, which is an encoding-aware object.
      Bonus tip: install ipython and you will have a much better Python console.
      Good luck!
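
      To make the difference concrete, a short sketch (variable names are illustrative, inside a Scrapy callback):

      import json

      raw_bytes = response.body   # bytes, exactly as received
      text = response.text        # str, decoded using the response encoding
      data = json.loads(text)     # always safe to pass a str

      On Python 3.6+, json.loads() also accepts bytes, so the explicit decode mainly matters on older interpreters or with unusual encodings.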

    • @Chris-vx6eb  4 years ago +1

      @@codeRECODE awesome, thanks!

    • @tokoindependen7458  3 years ago +1

      Bro, paste this as an article on a website so many people can find it easily.

    • @pythonically  2 years ago

      raise TypeError(f'the JSON object must be str, bytes or bytearray, '
      TypeError: the JSON object must be str, bytes or bytearray, not tuple
      Is this the same error?

  • @navdeeprana8477  3 months ago

    The video is really good. I am trying to learn Scrapy and thought it would be far too difficult for me to understand, but you made it simple.

  • @carryminatifan9928  3 years ago +2

    BeautifulSoup and Selenium are not for large-scale data scraping.
    Scrapy is best 👍

  • @helloworld-sk1hr  4 years ago +2

    Before watching this video I was doing this with Selenium; while watching your video I was laughing at myself for what I had been doing.
    This video has saved my day.
    Your videos are amazing 🔥

  • @stealthseeker18  3 years ago +3

    Can you do web scraping if the website is behind Cloudflare (version 2)?

  • @daddyofalltrades  3 years ago +2

    Sir, thanks a lot!! This series will definitely help me ❤️

  • @EnglishRain  1 year ago

    FANTASTIC explanation!!

  • @lorderiksson3377  5 months ago +1

    This technique is fantastic, and thanks a lot for the great content on your YouTube page. Keep up the great job.
    But how do you implement pagination? Bit of a shame it wasn't shown here.
    Let's say the schools are in a list of 25 items per page, 10 pages in total. How would you do it then?

    • @codeRECODE  3 months ago

      Shame is a strong word, no?
      I try to cover one single topic per video. Pagination is a topic in itself; I have a video on that too.
      If you'd rather learn in a structured manner, you can try my course for a week.

  • @ruksharalam173  1 year ago

    Wow, learning something new about Scrapy every day

  • @yusufrifqi5006  2 years ago

    All of your tutorials are very helpful. Big thanks to you, and I will wait for more Scrapy content.

  • @AmitKumar-qv2or  3 years ago +1

    Thank you so much, sir....

  • @joaocarlosariedifilho4934

    Excellent. Sometimes there is no reason to use Splash; we only need to understand what requests the JS is making and how. Thank you!

    • @codeRECODE  4 years ago +2

      Exactly! It's much faster, and the web server doesn't have to send all those CSS, JS, and image files. Everyone is happier :-)

    • @shashikiranneelakantaiah6237  4 years ago

      @@codeRECODE Hi there, I am facing an issue with a website. I can hit the first page, but any request I make after that redirects back to the first page. It would be of great help if you could summarise why this behaviour occurs on some sites. Thanks. Also, if I make the request to the same URL with scrapy-splash, I get a lot of timeout errors.

    • @codeRECODE  4 years ago +1

      @@shashikiranneelakantaiah6237 - double-check that you are passing all the request headers, except cookies and content-length.
      Cookies will be handled by Scrapy.
      Content-length will vary and will break things instead of fixing them.
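
      A hedged sketch of that advice (the header values and URL are placeholders):

      import scrapy

      class ApiHeadersSpider(scrapy.Spider):
          name = "api_headers"

          def start_requests(self):
              headers = {
                  # copied from the browser's Network tab, minus cookies
                  # and content-length (Scrapy manages both itself)
                  "accept": "application/json, text/plain, */*",
                  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
                  "referer": "https://example.com/",
              }
              yield scrapy.Request("https://example.com/api/data",
                                   headers=headers, callback=self.parse)

          def parse(self, response):
              yield {"status": response.status}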

    • @shashikiranneelakantaiah6237  4 years ago +1

      @@codeRECODE Thank you for replying; I will give it a try. Please do more videos on Scrapy; your way of explaining topics is excellent. Once again, thank you.

  • @BreakItGaming  4 years ago +2

    Sir, please continue this series up to the advanced level. I have looked at many YouTube channels, but I didn't find any series that is complete.
    So it is my kind request.
    Anyway, thanks for starting such an initiative.

    • @codeRECODE  4 years ago

      Glad that you liked it. I will add more videos in the future for sure :-)

  • @RahulT-oy1br  4 years ago +3

    You just earned ₹7000 in 30 mins. Wowza

    • @codeRECODE  4 years ago +5

      Thank you, but let's be honest: this is NOT a get-rich-quick scheme. There is work involved in learning, analyzing the site, and finally finding someone who will pay YOU for this task. It involves hard work :-)
      That being said, this is one of the fastest paths to actually earning money as a freelancer.

    • @RahulT-oy1br  4 years ago +1

      @@codeRECODE Any particular freelancing or online short-term internship sites you'd recommend?

    • @codeRECODE  4 years ago +3

      @@RahulT-oy1br Any of the freelancing sites is fine. Practice with jobs that are already closed. Once you are confident, start applying for new jobs.

    • @fabiof.deaquino4731  4 years ago

      @@codeRECODE great recommendations. Really appreciate all the work that you have been doing! Thanks a lot.

    • @zangruver132  4 years ago

      @@codeRECODE Well, I have never done freelancing, nor do I have any idea about it. Can you still suggest at least one or two sites for me to start web scraping freelancing in India? Also, do I need any prior experience?

  • @andycruz3893  1 month ago

    Thanks man

  • @nadyamoscow2461  3 years ago

    Many thanks! I've learned a lot and it all works fine.

  • @tunoajohnson256  4 years ago +1

    This is a great tutorial. You taught me a lot and my app runs way faster than using Selenium now. Many Thanks, I hope to encourage you to keep teaching!

  • @jagdish1o1  3 years ago +1

    It's an awesome tutorial; I've learned a lot, thanks. I have a question: I want to set a default value when a field has no value.
    I've tried item.setdefault('field', 'value') in process_item in a pipeline, but it's not working.

    • @codeRECODE  3 years ago

      def process_item(self, item, spider):
          for field in item.fields:
              if item.get(field) is None:  # any other checks you need
                  item[field] = "-1"
          return item
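
      Note that a pipeline only runs once it is enabled in settings.py; a sketch, assuming a hypothetical project module and class name:

      ITEM_PIPELINES = {
          "myproject.pipelines.DefaultValuePipeline": 300,
      }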

  • @gracyfg  4 months ago

    Can you extend this and show us how to scrape all the next pages and all the product details, and make it a production-quality product? Or share some pointers on making this production-quality code, with exception handling, etc.

    • @codeRECODE  3 months ago

      All these topics need a lot of detail; most of them are covered across many videos.
      You can also try my course and ask for a refund within a week if you don't like it.
      Happy learning!

  • @emmanuelowino4291  2 years ago +1

    Thanks for this, it really helped. But what if, instead of a JSON file, it returns an XHR response?

    • @codeRECODE  2 years ago

      Nothing changes. JSON and XHR are just the browser's way of logically grouping information in this case.

  • @dashkandhar  4 years ago +1

    Very knowledgeable and clear content, kudos!
    And what if an API takes a long time to return response data? How do we handle that?

    • @codeRECODE  4 years ago

      Thanks!
      If it is taking time, change the DOWNLOAD_TIMEOUT in settings. Add this to your spider class:
      custom_settings = {
          'DOWNLOAD_TIMEOUT': 360,  # in seconds; the default is 180
      }

  • @charisthawhite2793  3 years ago

    Your video is very helpful; you deserve the subscription.

  • @UmmairRadi  1 year ago

    Thank you, this is awesome. What about a website that gets data using GraphQL?

  • @azwan1992  2 years ago

    Nice!

  • @hayathbasha4519  3 years ago

    Hi,
    Please advise me on how to improve / speed up the Scrapy process.

    • @codeRECODE  3 years ago

      You can increase CONCURRENT_REQUESTS from the default of 16 to a higher number.
      In most cases, you will need proxies if you want to scrape faster.
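
      As a sketch, the override can sit directly on the spider class (the numbers are only examples):

      custom_settings = {
          "CONCURRENT_REQUESTS": 64,              # default is 16
          "CONCURRENT_REQUESTS_PER_DOMAIN": 32,   # default is 8
      }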

  • @gsudhanshu  4 years ago +1

    I am trying to copy what you did in the video, but with the same code I am getting an error fetching the first API, i.e. getAllSchools: 2020-08-23 18:57:38 [scrapy.core.scraper] ERROR: Spider error processing (referer: directory.ntschools.net/)
    Traceback (most recent call last):
      File "/home/sudhanshu/.local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
        yield next(it)
      File "/home/sudhanshu/.local/lib/python3.6/site-packages/scrapy/utils/python.py", line 347, in __next__
        return next(self.data)

  • @Ankush_1991  3 years ago

    Hi Sir, the video is great because of its simplicity and clarity. I am a beginner at web scraping and have been stuck on one point for a very long time; can you help me? How do we contact you with our doubts? Please mention something in your video descriptions.

    • @codeRECODE  3 years ago +1

      You can post your doubts here or in the comments section of my website. It is not always possible to reply to every question due to the sheer volume, though. I am planning to start a Facebook group where everyone can help everyone else. Let me know how that sounds.

  • @AndresPerez-qd8pn  4 years ago +1

    Hey, I love your videos.
    I'm a little stuck with some code; could you help me? That would be very nice (some tutoring).

  • @felinetech9215  4 years ago +1

    I followed along with all your videos to be able to scrape a JavaScript-generated webpage, but the data I want to scrape isn't in the XHR tab. Any suggestions, sir?

    • @codeRECODE  4 years ago

      Check the source of the main document

    • @felinetech9215  4 years ago

      @@codeRECODE Any info on how to do that, sir?

  • @HoustonKhanyile  3 years ago

    Could you please make a video on scraping a music streaming service like SoundCloud?

  • @sowson4347  4 years ago +1

    Thank you for the easy-to-follow videos done in a calm, unhurried manner. I notice you used VS Code for part of the work and CMD for running Scrapy. I found it extremely difficult to load Scrapy into VS Code, even with a virtual environment. I could not run it in the VS Code terminal. How did you do it?

    • @codeRECODE  4 years ago +2

      I work with Scrapy a lot, so I have it installed at the system level ("pip install scrapy" at cmd with admin rights). It just saves me a few steps. When I have to distribute the code, I always create a virtual environment and use Scrapy inside it.
      If I want to use the VS Code terminal, I just click the bottom-left area where the Python environment in use is listed and change it to the current virtual environment.

    • @sowson4347  4 years ago +1

      @@codeRECODE Thank you for responding so quickly. I was under the impression that Scrapy could run in VS Code just like BS. I solved the issue after watching your video many times over and reading numerous other sites. What I had failed to comprehend was that Scrapy has to be run in the Anaconda cmd environment, not within a VS Code notebook; VS Code is just an editor being used to create the spider file. Your use of the ntschools.py file in C:\Users\Work also confused me. I have now created my first Scrapy spider and can follow your videos better. Thanks, keep up the good work.
      Scrapy refused to install at the system level; I had to use Anaconda.

    • @codeRECODE  4 years ago

      Good that the issue is resolved. I have never had a problem installing Scrapy with an elevated cmd (run as administrator) or sudo pip3 install, so I don't know why you faced one.
      BTW, "Work" was just my user id.

    • @sowson4347  4 years ago

      @@codeRECODE User Error 101 - RTFM

  • @l0remipsum991  3 years ago

    Thank you so much. 1437! You literally saved my a$$. Subbed!

  • @Pablo-wh4vl  4 years ago +1

    How would you go about it if, instead of the XHR tab, the content is loaded by calls shown in the JS tab? Is it still possible with requests?

    • @codeRECODE  4 years ago +1

      Tabs are only for logical grouping. You can extract info from any request; just note that the code will change based on how the data is organized.

  • @cueva_mc  3 years ago

    Is it possible to parse the "base_url" instead of copying it?

    • @cueva_mc  3 years ago +1

      Or is it possible to parse the XHR urls from python?

    • @codeRECODE  3 years ago

      I am not sure what you want to ask, can you expand your question?

  • @kamaralam914  1 year ago

    Sir, in my case I am using this for IndiaMART and not getting any data in the response tab!

  • @orlandespiritu2961  2 years ago

    Hi, can you help me write code that grabs hotel data from Agoda using this? I've been stuck and am running out of time for an exercise. I just started learning Python 3 weeks ago.

  • @codingfun915  3 years ago

    How can I get the information if I have all the links to the schools and want to extract data from those links? Where should I keep all the links? In start_urls, or somewhere else? Please help ASAP.

  • @TheCherryCM  4 years ago

    Hi,
    Could you help me solve a similar kind of problem? I tried this header but still don't get any data.

  • @157sk8er  3 years ago

    I am trying to scrape information from a weather site, but the request is not showing up in the XHR tab; it shows up in the JS tab. How do I scrape data from that tab?

    • @codeRECODE  3 years ago +1

      Nothing changes! JS, XHR, and the rest are just Chrome's way of organizing URLs. You will find everything under the All tab as well. Just use the same technique.

  • @bibashacharya2637  2 years ago

    Hello sir, my question is: can we do exactly the same thing with Docker and Splash? Please reply.

    • @codeRECODE  2 years ago

      Yes -- See this czcams.com/video/RgdaP54RvUM/video.html

  • @shamblini_6170  2 years ago

    What happens when you encounter a 400 code with the API link address? I can't seem to get past the API, as response.text shows "No API key found in request."

    • @codeRECODE  2 years ago

      Find the API key and add it to headers

  • @himanshuranjan7456  4 years ago

    Just one question: does Scrapy have async support? Libraries like requests or requests-html have async support, so the time consumed during scraping is much lower.

    • @codeRECODE  4 years ago

      Yes, and better!
      It is based on Twisted; the whole framework is built around the idea of async. You would have to use it to appreciate how fast it is.

  • @niteeshmishra2790  1 year ago

    Hi, I am wondering how to scrape multiple fields. Suppose I search for mobiles on Amazon; now I want to get each mobile's brand name, description, link, and complete details, along with the next page.

    • @codeRECODE  1 year ago

      See this czcams.com/video/LfSsbJtby-M/video.html

  • @FBR2169  2 years ago

    Hello Sir, a quick question: what if the request method of the website is POST instead of GET? Will this still work? If not, what should I do?

    • @codeRECODE  2 years ago

      Yes it will.
      See my many videos on POST requests - czcams.com/users/CodeRECODEsearch?query=post

  • @ThangHuynhTu  2 years ago

    (7:00): How can you copy and paste the headers like that? When I copy as you do, I have to add the quotes myself. Is there any way to copy them as fast as you do?

    • @codeRECODE  2 years ago +1

      Oh, I understand the confusion. I removed that part to keep the video short. Anyway, you can make it quick and easy by following these steps.
      pip install scraper-helper
      This library contains some useful functions that I created for my personal use and later made open source.
      Once you have it installed, you can use the headers that you copied directly, without formatting. Simply use the function get_dict() and pass the headers as a triple-quoted string:
      headers = scraper_helper.get_dict('''
          accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-
          accept-encoding: gzip, deflate, br
          accept-language: en-GB,en;q=0.9
      ''')
      It will also take care of cleaning up unwanted headers like cookies, content-length, etc. Good luck!
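
      The resulting dict can then be passed straight into a request, e.g. yield scrapy.Request(url, headers=headers).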

    • @ThangHuynhTu  2 years ago

      @@codeRECODE Really nice. Thanks for clarifying!

  • @muhammedjaabir2609  4 years ago

    Why am I getting this error?
    "raise JSONDecodeError("Expecting value", s, err.value) from None"

  • @WDMatt02  3 years ago +1

    i love u indian buddy, thanks to ur rook sacrifice

    • @codeRECODE  3 years ago

      Glad that my videos are helpful :-)

  • @amarchinta4463  3 years ago

    Hi sir, I have a question not about this tutorial. I want to fetch multiple different domains that have the same page structure with a single spider. How can I achieve this? Please help.

    • @codeRECODE  3 years ago

      If same structure means same selectors for all those domains, just add them to start_urls or create a crawl spider.

  • @beebeeoii5461  3 years ago

    Hi, great video, but sadly this will not work if the site does some hashing/encryption of its API. For example, a token has to be attached as a header, and that token can only be obtained through some computation done by the webpage.

    • @codeRECODE  3 years ago +2

      If your browser can handle the encryption and hashing, you can do it with Scrapy too. Most of the time, they will just send some unique key which you have to send in the next request.
      If you don't have time to examine how it works, you can use Splash/Selenium or something similar and save time. It will be faster to code but slower in execution.
      If you do figure out the APIs, the scrapes are going to be very fast, especially when you want to get millions of items every day.
      Finally, just think of it as another tool in your arsenal. Use the one that suits the problem at hand :-)
      Good luck!

  • @arunk6435  2 years ago

    Hello, Mr Upendra. Every time I start to scrape, my data usage reaches its limit too fast. What is your data plan? I mean, how many GBs are you allowed to use per day?

    • @codeRECODE  2 years ago

      It's really hard to calculate how many GBs your project is going to consume. If possible, run your project on one of the cloud services.
      For any serious work, I would suggest getting a broadband connection with no data cap.

    • @arunk6435  2 years ago

      @@codeRECODE Thank you, Mr Upendra. I would like to know what data plan you use. What is your daily data limit?

  • @chapidi99  3 years ago

    Hello, is there an example of how to scrape when there is paging?

    • @codeRECODE  3 years ago

      I have covered pagination in many videos. I am planning to create one video that covers all kinds of pagination.
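
      In the meantime, a minimal pagination sketch against a hypothetical page-numbered JSON API (the URL, parameter, and keys are made up; response.json() needs Scrapy 2.2+):

      import scrapy

      class PagedApiSpider(scrapy.Spider):
          name = "paged_api"
          base_url = "https://example.com/api/schools?page={}"

          def start_requests(self):
              yield scrapy.Request(self.base_url.format(1), cb_kwargs={"page": 1})

          def parse(self, response, page):
              items = response.json().get("items", [])
              yield from items
              if items:  # stop when a page comes back empty
                  yield scrapy.Request(self.base_url.format(page + 1),
                                       cb_kwargs={"page": page + 1})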

  • @yashnenwani9261  3 years ago

    Sir, I want to use the search bar to search for a particular thing and then extract the related data.
    Please help!

    • @codeRECODE  3 years ago

      Open dev tools and check the Network tab. See what happens when you click search.
      If you can't figure it out, use Selenium.

  • @the_akpathi  2 years ago

    Is it legally OK to send headers from a script like this? Especially headers like user-agent?

    • @codeRECODE  2 years ago

      This is an educational video aiming to teach how things work. For legal issues, you would need to talk to your lawyer.

  • @abukaium2106  4 years ago

    Hello sir, I made a spider following your code, but it shows twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost. What can I do to solve it? Please reply. Thanks.

    • @codeRECODE  4 years ago

      Some connectivity issue. See if you can connect using scrapy shell.

  • @adityapandit7344  3 years ago

    Hi Sir,
    How can we scrape JSON data from a website using Scrapy?

    • @codeRECODE  3 years ago

      Create a regular Scrapy request for the URL that contains the JSON data. In the callback method (for example, parse), you can access the JSON directly using response.json() in newer versions.
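
      A minimal sketch of such a callback (placeholder URL; response.json() needs Scrapy 2.2+):

      import scrapy

      class JsonSpider(scrapy.Spider):
          name = "json_demo"
          start_urls = ["https://example.com/api/data.json"]

          def parse(self, response):
              data = response.json()  # parses the response body as JSON
              for row in data:
                  yield row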

    • @adityapandit7344  3 years ago

      @@codeRECODE Hi sir, have you posted any video on it?

  • @shubhamsaxena3220  2 years ago

    Can we scrape any dynamic website using this method?

    • @codeRECODE  2 years ago +1

      Short answer - No. There are multiple techniques to scrape dynamic websites. Every site is different and would need a different technique.

  • @harshgupta-ds2cw  4 years ago

    I have been trying to find a web scraper that will work on OTT platforms. Your method didn't give me any results. I need help.

    • @codeRECODE  4 years ago

      Scraping OTT is almost impossible, for technical reasons (they have multiple layers of defenses to stop piracy) AND for legal reasons. I am not going to attempt it, for sure :-)

  • @maysgumir3972  4 years ago

    Hi,
    I need your help. I am trying to scrape details from the e-commerce site www.banggood.com. The price is AJAX-loaded and I cannot retrieve it with Scrapy, so I tried to find the AJAX request manually as you teach in the video, but I cannot find the exact path for the request. Could you please make a video on this particular website (finding the AJAX request manually)? Your help will be much appreciated; you can choose any category for scraping the details.
    @Code / RECODE

  • @stalluri11  3 years ago

    Is there a way to scrape webpages in Python when the URL does not change with the page number?

    • @codeRECODE  3 years ago +1

      Yes, I have covered this in many videos. I am planning to do a dedicated video on pagination.

    • @stalluri11  3 years ago

      @@codeRECODE Looking forward to it. I can't find a video on this.

  • @naijalaff6946  4 years ago +1

    Great video. Thank you so much.

  • @chakrabmonoj  3 years ago

    In fact, I followed your steps into the XHR tab and 1. it does not show accept: json (but the site is run by JS, which I checked with the hack you showed here); 2. it also says 'eval' is not allowed on the site (not sure what that means), and it shows no file being generated as you showed for this site.
    What could be happening here?
    I am trying to sort all my connections by the total number of reactions their posts have got.
    Can you help with a suggestion for coding this?
    Thanks

    • @codeRECODE  3 years ago +1

      I am attaching the link to the code. I just tried it and it works. Make sure that you run this with *scrapy runspider ntschools.py*, not like a python script.
      Source: gist.github.com/eupendra/7900849c56872925635d0c6c6b8f78f5

    • @chakrabmonoj  3 years ago

      @@codeRECODE Thanks for the quick reply. What I forgot to mention is that I was trying to use your code on LinkedIn. Does it have excessive privacy protections, because of which it is not showing any JSON file being generated? Any help appreciated.

  • @taimoor722  4 years ago

    I need help with how to approach clients for web scraping projects.

    • @codeRECODE  4 years ago

      I would be including some tips in my upcoming courses and videos

  • @oktayozkan2256  2 years ago

    This is API scraping. Some websites use a CSRF token and sessions in their APIs, which makes them nearly impossible to scrape via the API.

    • @codeRECODE  2 years ago

      While CSRF tokens and sessions can be handled, I do agree that this technique does not work everywhere.
      However, it should be the first thing we try. Rendering with Selenium/Playwright should be the last resort.
      Even after that, many websites will not work, and there will be no workaround. 🙂
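
      One common token pattern, sketched with made-up URLs, selector, and header name (sites vary widely):

      import scrapy

      class CsrfApiSpider(scrapy.Spider):
          name = "csrf_api"
          start_urls = ["https://example.com/"]

          def parse(self, response):
              # hypothetical: the token is embedded in a meta tag on the HTML page
              token = response.xpath('//meta[@name="csrf-token"]/@content').get()
              if token:
                  yield scrapy.Request("https://example.com/api/data",
                                       headers={"X-CSRF-Token": token},
                                       callback=self.parse_api)

          def parse_api(self, response):
              yield response.json()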

  • @harshnambiar  4 years ago

    Also, can you scrape bseindia this way?

    • @codeRECODE  4 years ago

      Haven't tried bse. Have a look at my blog to see how I did it for NSE.
      coderecode.com/scrapy-json-simple-spider/

  • @adityapandit7344  3 years ago

    Hi sir

    • @codeRECODE  3 years ago

      Please watch the XPath video I posted. That will help you. It will be something like this:
      //script[@type="application/ld+json"]

    • @adityapandit7344  3 years ago

      @@codeRECODE Yes, but it's the second script tag on this page. How can we select the second one?

    • @codeRECODE  3 years ago

      Just add [2].

    • @adityapandit7344  3 years ago

      @@codeRECODE Where do I add the 2? Can you tell me?
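
      For reference, the positional index wraps the whole expression; with the attribute from the reply above, it would look something like:

      response.xpath('(//script[@type="application/ld+json"])[2]/text()').get()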

  • @adityapandit7344  3 years ago

    Hi sir, when I load the JSON data I get a JSON decode error: "Expecting value: line 1". What is the solution?

    • @codeRECODE  3 years ago

      It means that the string you are trying to load as JSON is not in valid JSON format. It may need some cleanup.

    • @adityapandit7344  3 years ago

      @@codeRECODE Yes sir, the error has been resolved. Now can you give me an idea of how I can link Scrapy with Django? It would be very helpful. Sorry, I am asking too many questions, but I am doing this practically, and that's why I am facing these problems.

  • @engineerbaaniya4846  4 years ago

    Where can I get the detailed tutorial?

    • @codeRECODE  4 years ago

      courses.coderecode.com/p/mastering-web-scraping-with-python

  • @monika6800  4 years ago

    Hi,
    Could you please help me scrape a dynamic site?

    • @codeRECODE  4 years ago +1

      Which site is that? What is the problem you are facing?

  • @zaferbagdu5001  4 years ago

    Hi, I tried to write the code, but the query response returns 'Failed to load response data'. As a result there are jQuery links; should I use them?

    • @codeRECODE  4 years ago +1

      Share your code in pastebin or something similar. I will try to find the problem.

    • @zaferbagdu5001  4 years ago

      @@codeRECODE Code here: pastebin.pl/view/ee0b7d3d
      The real page is www.tjk.org/TR/YarisSever/Info/Page/GunlukYarisSonuclari
      I want to scrape the tables on this page.
      Thanks for everything

  • @udayposia5069  3 years ago

    I want to send a null value for one of the formdata fields using FormRequest.from_response. How should I pass the null value? It's not accepting '' or None.

    • @codeRECODE  3 years ago

      Share your code. Usually blank strings work.
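
      A short sketch of the blank-string approach (the form URL and field name are placeholders):

      import scrapy

      class FormSpider(scrapy.Spider):
          name = "form_demo"
          start_urls = ["https://example.com/form"]

          def parse(self, response):
              yield scrapy.FormRequest.from_response(
                  response,
                  # formdata values are sent as strings; use "" for an
                  # empty field rather than None
                  formdata={"optional_field": ""},
                  callback=self.after_submit,
              )

          def after_submit(self, response):
              self.logger.info("Submitted, status %s", response.status)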

  • @sunilghimire6990  4 years ago

    scrapy crawl generates an error like:
    DEBUG: Rule at line 1702 without any user agent to enforce it on.
    Please help.

    • @codeRECODE  4 years ago

      What exactly are you trying to achieve? Are you going through the same exercise as I showed in the video?

    • @sunilghimire6990  4 years ago

      I am following your tutorials and tried to scrape a website:
      Title = response.css('title::text').extract()
      yield Title
      I got the title but also got the unusual error mentioned above.

    • @codeRECODE  4 years ago +1

      @@sunilghimire6990
      It looks like you are either not passing the headers in the request, OR something is wrong with the user-agent part of the header dictionary, OR the header dictionary itself is not correctly formatted.
      Here are a few other things I can suggest:
      1. You are using extract(), which is the same as getall(). This is confusing, and that's why it is outdated now.
      2. You are probably using "scrapy crawl" to run the spider. What I have created here is a standalone spider, which needs to be run using "scrapy runspider".
      3. Take up my free course to get the basics clear. I am sure it will help you. Here is the link: coderecode.com/scrapy-crash-course
      4. Once you register for the free course, you will find the complete source code that you can run. If you face any problem, you can attach a screenshot and code in the comments of my course and I will surely help in detail.

    • @sunilghimire6990  4 years ago

      @@codeRECODE Thank you, sir.

  • @ashish23555  3 years ago +1

    Scrapy really is the best, but it takes time to become a pro.

  • @nimoDiary  4 years ago

    Can you please teach how to scrape badminton players' data from the PBL site?

    • @codeRECODE  4 years ago

      What's the site URL? What have you tried, and what problem are you facing?

    • @nimoDiary  4 years ago

      www.pbl-india.com/
      I am trying to extract the data for the squads of all teams, with all their details, including names, country, world rank, etc.

    • @codeRECODE  4 years ago

      @@naijalaff6946 Thank you for the mention in readme. Feels good :-)

  • @ashish23555  3 years ago

    Why do we need Scrapy or Selenium, as these are not helpful with AJAX?

    • @codeRECODE  3 years ago

      I am not sure I understand your question. Can you elaborate?

    • @ashish23555  3 years ago

      @@codeRECODE How do I scrape pages from a website protected by reCAPTCHA?

    • @codeRECODE  3 years ago +1

      @@ashish23555 Use a service like 2captcha.com

  • @Ahmad-sn9kh  1 month ago

    I want to scrape data from TikTok. Can you help me?

  • @zangruver132  4 years ago +1

    Hey, I wanted to scrape the number of comments for each game at the following link (fitgirl-repacks.site/all-my-repacks-a-z/), but I can't find it anywhere in the network tab. Yes, the HTML without JS provides a number of comments, but it is an outdated one.

    • @codeRECODE  4 years ago +1

      It's there! Here is how to find it. Open the site, press F12, go to the Network tab, and open any listing. At the top, you will see something like 238 comments. Now, make sure that your focus is on the Network tab and press CTRL+F. Search for this number, 238. You will see quite a few results, and one of them will be a .js file that has this data.
      You will note that this comes from a third-party commenting system.
      Reminder: getting this data using web scraping may not be legal. I do not give advice on what is legal and what is not. What I explained is only for learning how websites work. Good luck!

  • @kaifscarbrow  1 year ago

    Cool price. I've been doing ~500k records for $100 🥲

  • @KartikSir_  2 years ago

    Getting error:
    [scrapy.core.engine] DEBUG: Crawled (403)

  • @shannoncole6425  3 years ago

    Nice!