Comments •

  • @dqnnny
    @dqnnny 2 years ago +1

    Super useful, subbed!

  • @jessicabrock3220
    @jessicabrock3220 2 years ago +1

    Very helpful

  • @suwenhao9864
    @suwenhao9864 7 months ago +1

    cool video!

  • @kevindonovan3911
    @kevindonovan3911 1 year ago +1

    Help... I'd love some guidance.
    I'm using a few lines to scrape data (tables) from web pages, but starting this month I now get NaN NaN and not the numbers.
    import pandas as pd
    from helium import start_firefox  # start_firefox comes from helium, not selenium
    URL = "https://example.com/stock-data"  # placeholder for the page being scraped
    browser = start_firefox(URL, headless=True)
    html = browser.page_source     # HTML as rendered by the headless browser
    arrays = pd.read_html(html)    # parse every <table> on the page into a DataFrame
    for i in arrays:
        print(i)
    I love the simplicity of this and use it on several web pages to get stock data, but now I'm only getting the column headings and no data.
    Any advice would be greatly appreciated.
    Kevin

    • @ReuvenLerner
      @ReuvenLerner 1 year ago

      Sorry, I don't really know much about Selenium.

  • @carlosfranchy878
    @carlosfranchy878 1 year ago +1

    Hello! First of all, nice video! I'm working on a project where pd.read_html is very useful. The problem I have is that some of the data are PNG images, like the flag in your example. Is there any way to convert these PNGs into arrays? Thanks!

    • @ReuvenLerner
      @ReuvenLerner 1 year ago

      Not in Pandas, so far as I know. Sounds like you would need some sort of OCR system to turn the graphic into text, but I don't know much about such things, I'm afraid.

    • @carlosfranchy878
      @carlosfranchy878 1 year ago

      @@ReuvenLerner Okay, thanks for the answer anyway!

  • @1994siddhu
    @1994siddhu 2 years ago +2

    Hello Mr. Lerner, this is Siddharth. I am using read_html in my Python code to read HTML files, and it seems quite powerful. For HTML files smaller than about 2 MB the command runs in a few seconds, but for larger files of 5 MB or more it takes about half an hour. Could you please suggest how to read large HTML files more quickly?

    • @ReuvenLerner
      @ReuvenLerner 2 years ago +1

      You'll probably want to retrieve the HTML in the background using "requests", and then have pandas read the data without going through the network. My gut feeling is that this will cut down on the time -- but it could be that the HTML is complex, and that it'll take time and memory no matter what.
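
      A minimal sketch of that suggestion (the URL below is just a placeholder; any page containing <table> elements works):

      import pandas as pd
      import requests
      from io import StringIO
      url = "https://example.com/data.html"  # placeholder URL
      response = requests.get(url)
      response.raise_for_status()            # fail loudly if the download went wrong
      # pandas parses the already-downloaded HTML; no network access happens here
      tables = pd.read_html(StringIO(response.text))
      print(len(tables), "tables found")
      print(tables[0].head())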

    • @1994siddhu
      @1994siddhu 2 years ago +1

      @@ReuvenLerner Okay, thanks for the suggestion. I shall try that and let you know what I get.

    • @1994siddhu
      @1994siddhu 2 years ago +1

      @@ReuvenLerner Hello Mr. Lerner, I tried using requests but it didn't work. I think that's because I am not really "web" scraping an HTML file; I am reading an HTML file that is stored on my system. In the read_html command I tried the match option to get only the tables I want and filter out the rest, but that took a similar amount of time, maybe because it still has to read through the whole HTML file to find the tables that match my pattern. Is there a way to give multiple match options in one read_html command? And yes, I agree: my HTML file does not have the same kind of columns throughout, which is probably why it takes so long. As a last resort, do you think I could run my code across multiple processes, where each process handles one match pattern, and then join all the data from all the processes at the end? Could that work? Thank you, Siddharth
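
      (For reference, read_html's match argument accepts a string or a regular expression, so several patterns can be combined into a single call with regex alternation; the file name and table captions below are made up:)

      import pandas as pd
      # One regex with "|" covers several table captions in a single read_html call.
      tables = pd.read_html("report.html", match="Quarterly results|Annual summary")
      print(len(tables), "matching tables")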

    • @ReuvenLerner
      @ReuvenLerner 2 years ago

      @@1994siddhu read_html isn't meant to do all of the complex things you're asking of it. If you have that much data, or such an unpredictable HTML layout, then you might have to use something like Beautiful Soup to download and parse the HTML, and then hand pandas a more traditional data structure. read_html is really meant for relatively simple pages with clear and obvious table layouts; the moment you have a different number of columns, you're kind of sunk.
      As for multiprocessing, I am not aware of a way to use it here, except if you parse multiple sites or files in parallel.
      Sorry I can't be of more help!
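
      A rough sketch of that Beautiful Soup approach, assuming (purely for illustration) a placeholder URL whose first <table> keeps its headers in <th> cells on the first row:

      import pandas as pd
      import requests
      from bs4 import BeautifulSoup
      url = "https://example.com/report.html"            # placeholder URL
      soup = BeautifulSoup(requests.get(url).text, "html.parser")
      table = soup.find("table")                         # first table on the page
      headers = [th.get_text(strip=True) for th in table.find_all("th")]
      rows = []
      for tr in table.find_all("tr")[1:]:                # skip the header row
          cells = [td.get_text(strip=True) for td in tr.find_all("td")]
          if cells:                                      # ignore rows with no data cells
              rows.append(dict(zip(headers, cells)))
      df = pd.DataFrame(rows)                            # hand pandas a plain list of dicts
      print(df.head())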

  • @pramishprakash
    @pramishprakash 1 year ago +1

    cool

  • @ye-ym5jo
    @ye-ym5jo 2 years ago

    Thanks a lot, sir. When I tried it earlier with the predetermined link from my online course it always raised a KeyError, but when I tried another URL it worked. How could this happen?

  • @mandarraut9565
    @mandarraut9565 2 years ago +1

    Hi Reuven, this was helpful, thank you.
    But I need some more help. For example, I have a set of links from the same website and I am trying to get the HTML tables (specification tables). The issue is that I am saving an HTML table for each product, so if I have 20 links I end up saving 20 different Excel files.
    What I want is to save all of the HTML tables into one Excel file. Since these are specification tables, they will mostly have the same headers with different values, so whenever we scrape a table its values should be appended below the matching headers, and if we find a new header it should be added as a new column with its value underneath.
    Please help me with this; I haven't been able to do it.

    • @ReuvenLerner
      @ReuvenLerner 2 years ago +1

      According to the documentation for the to_excel method (pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html), you can save to multiple sheets in an Excel document. I've never done it myself, but there's an example toward the end of the documentation that shows you how to do that. I hope this helps!
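
      A hedged sketch of both routes, assuming (hypothetically) that each product URL exposes its specification table as the first <table> on the page; writing .xlsx files also needs openpyxl installed:

      import pandas as pd
      # Placeholder product pages; each is assumed to contain one specification table.
      urls = ["https://example.com/product1", "https://example.com/product2"]
      frames = [pd.read_html(url)[0] for url in urls]
      # Option 1: one sheet per product in a single Excel file, via ExcelWriter.
      with pd.ExcelWriter("specifications.xlsx") as writer:
          for i, df in enumerate(frames):
              df.to_excel(writer, sheet_name=f"product_{i}", index=False)
      # Option 2: one combined sheet; concat aligns columns by header and
      # fills NaN wherever a table lacks a column that another table has.
      combined = pd.concat(frames, ignore_index=True)
      combined.to_excel("specifications_combined.xlsx", index=False)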

    • @mandarraut9565
      @mandarraut9565 2 years ago +1

      @@ReuvenLerner Sure, I will try. Thanks for the help!