Turning multiple CSV files into a single pandas data frame

  • Published 4 Sep 2024
  • Data sets often come as multiple CSV files. How can you read those into a single pandas data frame? In this video, I show you how to use a Python "for" loop, or even a list comprehension, along with the "glob" module and the pd.concat function, to accomplish this.
    The Jupyter notebook for this video is at github.com/reu..., and my free, weekly newsletter about Python and software engineering is at BetterDevelope....
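The technique described above can be sketched roughly as follows (the glob pattern `"data/*.csv"` is a hypothetical placeholder, not a path from the video):

```python
import glob

import pandas as pd


def combine_csvs(pattern):
    """Read every CSV file matching pattern, and return one combined data frame."""
    all_dfs = [pd.read_csv(one_filename)              # one data frame per file
               for one_filename in sorted(glob.glob(pattern))]
    return pd.concat(all_dfs, ignore_index=True)      # stack them vertically


# Example (hypothetical path): df = combine_csvs("data/*.csv")
```

Passing `ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating each file's own index.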

Comments • 58

  • @sloughpacman
    @sloughpacman 2 years ago +3

    Your material has the habit of provoking thought, which often leads me off on exploratory tangents for a couple of hours. Thanks for wasting my day, Reuven 😀

  • @ManjeeriG
    @ManjeeriG 4 months ago +1

    Thank you so much for this tutorial. I have gone through multiple tutorials, but this one is the easiest and (of course) smartest way to combine multiple CSVs into one. 😊😊 You got a new subscriber.

  • @learner7273
    @learner7273 a year ago

    I never knew you could have a list of dataframes. Thank you!

  • @mikej62
    @mikej62 a year ago +1

    Nice!!! - I've seen several ways to solve this problem, and this is the most efficient I've personally come across!
    Question for you: in cell In [13], there will be times I want to use this approach because I want to verify the number of rows in each file... how would you do it? TIA!

    • @ReuvenLerner
      @ReuvenLerner a year ago +1

      Glad to hear it helped! If you want to verify the number of rows in a data frame, the fastest/easiest way is to ask for len(df.index). You can have an "if" statement in your "for" loop checking how many rows there are in the data frame, and thus know how many you retrieved.
      I don't believe that there's a way to check the number of rows in a file without reading it into a data frame first, at least with read_csv.
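The per-file row check described in this reply could be sketched like this (the pattern is a placeholder; the `print` is the "if statement" idea, simplified to a report):

```python
import glob

import pandas as pd


def combine_and_count(pattern):
    """Read matching CSVs, printing each file's row count along the way."""
    all_dfs = []
    for one_filename in sorted(glob.glob(pattern)):
        new_df = pd.read_csv(one_filename)
        print(f"{one_filename}: {len(new_df.index)} rows")   # per-file count
        all_dfs.append(new_df)
    return pd.concat(all_dfs, ignore_index=True)
```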

    • @mikej62
      @mikej62 a year ago +1

      @@ReuvenLerner Gotcha!
      What I did was add another print statement, print(len(new_df.index)), before "all_dfs.append(new_df)". I wanted to see how you would approach it.
      All for the purpose of documenting what I started with.
      Cheers!!

  • @guocity
    @guocity 5 months ago +1

    That's really helpful. What about reading multiple CSV files, ignoring rows in a new file that repeat earlier ones?

    • @ReuvenLerner
      @ReuvenLerner 5 months ago

      To read multiple CSV files, you need to run read_csv multiple times. And I think that if you want to ignore repeated rows, that's something you have to do after creating the data frame, not during the loading process.
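The "after creating the data frame" step mentioned here is typically `drop_duplicates`; a small sketch with made-up data:

```python
import pandas as pd

# Two hypothetical data frames; the row (2, 20) appears in both
df1 = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
df2 = pd.DataFrame({"id": [2, 3], "value": [20, 30]})

combined = pd.concat([df1, df2], ignore_index=True)   # 4 rows, one duplicated
deduped = combined.drop_duplicates()                  # 3 unique rows remain
```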

  • @Merqos
    @Merqos 2 years ago +2

    Thank you so much for this example! This made me save so much time!

    • @ReuvenLerner
      @ReuvenLerner 2 years ago +1

      I'm so happy to hear it!

    • @Merqos
      @Merqos 2 years ago +1

      @@ReuvenLerner Yeah, really! I was told to extract data from a vessel where all the data was given in 10-minute-interval Excel documents. That meant that for a single day I had to open 144 files and collect them into one sheet. I love you and the internet, man - thanks again!

  • @varunsharma3485
    @varunsharma3485 2 years ago +1

    Worked for me. Earlier I was using Excel Power Query to join multiple CSVs and then importing them into pandas, but it had a limitation of max 10 million rows in Excel. This tutorial is very helpful. Thanks, and subscribed. One issue I am facing with this code: I want to add each individual CSV's file name as a column in the concatenated data frame, which is missing here. Any tips?

    • @ReuvenLerner
      @ReuvenLerner 2 years ago

      I'm delighted to know that the video helped!
      BTW, the maximum number of rows in Excel is (in)famous. It even led to the loss of a bunch of COVID data in the UK, when a whole batch of inputs was silently dropped.
      Thanks for joining me here, and I'll see you around...
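The filename-as-a-column question above goes unanswered in this thread; one common approach (my own sketch, not from the video) is `DataFrame.assign`, which here adds a constant column per file:

```python
import glob

import pandas as pd


def combine_with_source(pattern):
    """Read matching CSVs, tagging each row with its source file name."""
    all_dfs = [pd.read_csv(f).assign(filename=f)   # constant column per file
               for f in sorted(glob.glob(pattern))]
    return pd.concat(all_dfs, ignore_index=True)
```

The column name `filename` is arbitrary; any keyword passed to `assign` becomes a column.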

  • @awaisraza2285
    @awaisraza2285 a year ago +1

    I have a main folder, and in that main folder there are 35 subfolders, named S01, S02, ... S35. Each folder has a dataset with the same structure. How can I concatenate that data into one data frame?

    • @ReuvenLerner
      @ReuvenLerner a year ago

      You can! Look up the "recursive" parameter in glob.glob, and you'll see that you can get all of the files that match patterns across multiple subdirectories.
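A sketch of the recursive-glob approach from this reply (the directory layout is the S01...S35 example from the question):

```python
import glob

import pandas as pd


def combine_subfolders(top_dir):
    """Read CSVs from top_dir and every subdirectory below it."""
    pattern = f"{top_dir}/**/*.csv"   # '**' matches any depth of subfolders...
    filenames = sorted(glob.glob(pattern, recursive=True))  # ...with recursive=True
    return pd.concat([pd.read_csv(f) for f in filenames],
                     ignore_index=True)
```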

  • @matthewhsu7299
    @matthewhsu7299 a year ago +1

    Learn something new today. Thank you so much!

  • @regal7548
    @regal7548 4 months ago +1

    What if the datasets don't have anything in common - like one is geological data, one is survey data, one is market analysis - and each of them has a massive number of null values? Also, the unique IDs are different: for example, one table has SLT20284903 and some others just numbers. What do we do then?

    • @ReuvenLerner
      @ReuvenLerner 4 months ago

      Then you shouldn't be combining them, in this way or any other way! My assumption for this video was that you have a data set broken up across a number of CSV files, each with the same column names and dtypes. You want to take these multiple files and turn them into a single data frame. Pandas provides us with the "pd.concat" method, which is good for such things, but the problem is how you read them into Pandas quickly and easily.
      If you have geological data, survey data, and market analysis, then *perhaps* they have some factors in common. But you don't want them in the same data frame. Rather, read each into its own data frame, and use "join" or "merge" to combine them.
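The join/merge alternative from this reply, as a small sketch (the table names, the `site_id` key, and all values are made up for illustration):

```python
import pandas as pd

# Two unrelated tables that share only a site identifier
surveys = pd.DataFrame({"site_id": ["SLT1", "SLT2"], "score": [7, 9]})
geology = pd.DataFrame({"site_id": ["SLT1", "SLT2"],
                        "rock": ["basalt", "shale"]})

# Combine on the shared key, rather than stacking rows with concat
combined = surveys.merge(geology, on="site_id", how="outer")
```

An "outer" merge keeps rows whose key appears in only one table, filling the missing side with NaN.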

    • @regal7548
      @regal7548 4 months ago +1

      @@ReuvenLerner ok.. thank you

  • @nvduk3
    @nvduk3 a year ago +1

    If one of the CSV files in between is blank, it breaks the for loop. How can I avoid that?

    • @ReuvenLerner
      @ReuvenLerner a year ago +1

      My best suggestion is to skip a data frame with zero rows from the output list. There might well be better solutions, though!

    • @nvduk3
      @nvduk3 a year ago +1

      @@ReuvenLerner Yes, the video was of immense help. I got the concept, but I was facing an issue: I am trying to run it over 2300 .csv files, and many of them are blank, so the loop can't find the column names defined in it and stops there.
      Manually deleting them is time-consuming & kinda stupid lol. I'm sorry, as my Python is at an intermediate level, but I will try to skip the ones where it can't find a matching column name, as you mentioned. Thanks a lot again!

    • @ReuvenLerner
      @ReuvenLerner a year ago +1

      @@nvduk3 Oh, right - if the file is completely blank, then you can't select columns. That is a problem! Unfortunately, that's a tough one to solve. Maybe you could write a function that loads the first 5 lines, and checks to see if there are any columns. If such a function returns True/False, then you can use it in the "if" of your comprehension, and only truly process those that have the columns you want. Yes, you'll end up reading (at least part of) each file twice, but that might still be best. I'm not sure.
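A sketch of the helper function proposed in this reply: peek at the header only, and skip files that are blank or lack the wanted columns. (Reading with `nrows=0` fetches just the header, rather than the "first 5 lines" suggested above; the exception handling is my addition.)

```python
import glob

import pandas as pd


def has_columns(filename, wanted):
    """True if the file's header contains every column in wanted."""
    try:
        header = pd.read_csv(filename, nrows=0)   # header row only
    except pd.errors.EmptyDataError:              # completely blank file
        return False
    return set(wanted) <= set(header.columns)


def combine_nonblank(pattern, wanted):
    """Combine matching CSVs, silently skipping blank/mismatched files."""
    all_dfs = [pd.read_csv(f)[wanted]
               for f in sorted(glob.glob(pattern))
               if has_columns(f, wanted)]
    return pd.concat(all_dfs, ignore_index=True)
```

As the reply notes, each kept file is opened twice (once for the header check, once for the real read), which may still beat crashing mid-loop.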

    • @nvduk3
      @nvduk3 a year ago +1

      @@ReuvenLerner Skipping by column names didn't work, but skipping the blank sheets entirely worked after a few rounds of trial and error. Thanks a lot, I really appreciate it 👍🏽

  • @silvajonatan
    @silvajonatan 7 months ago +1

    Hi, fantastic information. Very didactic. Thanks a lot.

  • @33samogo
    @33samogo a year ago +1

    Excellent, worked fine for me also, thank you!

  • @koosschutter1675
    @koosschutter1675 10 months ago +1

    Where can I download some of these large files for testing? I want to split or combine some CSV files, but my basic laptop can't load everything into memory, and I need all the columns. ChatGPT and I came up with appending files, which is slow. I'm not a data scientist; I just want to split and combine CSV files and test my and my PC's capabilities.

    • @ReuvenLerner
      @ReuvenLerner 9 months ago +1

      You can download the (large!) files from Pandas Workout from here: files.lerner.co.il/pandas-workout-data.zip

    • @koosschutter1675
      @koosschutter1675 9 months ago +1

      @@ReuvenLerner Thank you very much.

  • @tomtaetz20
    @tomtaetz20 2 years ago +2

    How would you go about this when the CSV files don't all contain the same column names (some different, some the same)?

    • @ReuvenLerner
      @ReuvenLerner 2 years ago +2

      Ooh, that makes it more interesting!
      If all of the input data frames are guaranteed to contain the columns you want, then you can just select columns with double square brackets + a list of column names in the top line of the comprehension. That is, you can say pd.read_csv(one_filename)[['email', 'phone']].
      But if they have different columns, and/or you have to do transformations, then this technique gets much messier. Perhaps you could call a function in the top line of the comprehension. And the function could find the columns it wants and output them. But it won't be quite as elegant.
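The column-selection trick from this reply, in context (the column names `email` and `phone` come from the reply's own example; the pattern is a placeholder):

```python
import glob

import pandas as pd


def combine_columns(pattern, columns):
    """Select only the named columns from each file before concatenating."""
    all_dfs = [pd.read_csv(f)[columns]   # the list argument is the 'double
               for f in sorted(glob.glob(pattern))]  # square brackets' above
    return pd.concat(all_dfs, ignore_index=True)


# e.g. combine_columns("data/*.csv", ["email", "phone"])
```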

    • @tomtaetz20
      @tomtaetz20 2 years ago +1

      Awesome! Thank you, I will give it a try

    • @anushkajoshi8427
      @anushkajoshi8427 5 months ago

      I was literally looking for this exact question in the comment box before posting mine.

  • @gabbyf2906
    @gabbyf2906 a year ago +1

    Great video, helped a lot!

  • @israaaljowder9751
    @israaaljowder9751 a year ago +1

    Very informative video, thanks!

  • @traal
    @traal 2 years ago +1

    Would this also work with a generator expression passed into pd.concat? That would look nicer and probably save some memory. 😊

    • @ReuvenLerner
      @ReuvenLerner 2 years ago +1

      It would definitely work. But I'm not sure how much memory it would save, because at the end of the day, we still get all of the data from the three smaller data frames. And the way that behind-the-scenes memory is managed, I'm going to guess (and might be wrong about this) that the usage would be the same. But it would probably be wise to experiment with this; I'll try to find some time to do that, and if I have interesting results, I'll report them on this channel.
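The generator-expression variant from this exchange would look like this (pattern is a placeholder; note that pd.concat accepts any iterable of data frames):

```python
import glob

import pandas as pd


def combine_lazily(pattern):
    """Pass a generator expression to pd.concat instead of building a list."""
    return pd.concat(
        (pd.read_csv(f) for f in sorted(glob.glob(pattern))),
        ignore_index=True,
    )
```

As the reply suspects, the memory savings are limited: pd.concat consumes the whole iterable before combining, so all of the per-file data frames exist at once either way.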

  • @gam3rman85
    @gam3rman85 8 months ago +1

    Helpful, thanks!

  • @FletchersHeating
    @FletchersHeating a year ago

    Thanks for this video!
    Is there a tutorial or information on how to do the same but keep multiple data frames? I.e., one CSV = one data frame.
    Many thanks

    • @ReuvenLerner
      @ReuvenLerner a year ago

      Glad it helped! The same code would basically work for what you want, if you don't then concatenate the data frames together at the end. You'll end up with a list of data frames, each from a separate CSV file, with which you can do whatever you want.
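In other words, the reply's suggestion is just the loop without the final concat (pattern again a placeholder):

```python
import glob

import pandas as pd


def read_each(pattern):
    """Return one data frame per CSV file, without concatenating them."""
    return [pd.read_csv(f) for f in sorted(glob.glob(pattern))]
```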

  • @tedfitzgerald4202
    @tedfitzgerald4202 a year ago +1

    Really super video.

  • @Tshediso1
    @Tshediso1 2 years ago +1

    Thanks very much

  • @pankajgoikar4158
    @pankajgoikar4158 2 years ago +1

    Just awesome.... :)

  • @KingofDiamonds117
    @KingofDiamonds117 2 years ago +1

    I can't get this to work for me. I need help.

    • @ReuvenLerner
      @ReuvenLerner 2 years ago

      What are you doing, and what error do you get?

    • @KingofDiamonds117
      @KingofDiamonds117 2 years ago

      @@ReuvenLerner I tried doing this:
      import glob
      data = glob.glob['channel1/*.csv']
      and got:
      TypeError Traceback (most recent call last)
      ~\AppData\Local\Temp/ipykernel_13700/502398532.py in
      1 import glob
      ----> 2 data = glob.glob['channel1/*.csv']
      TypeError: 'function' object is not subscriptable

    • @ReuvenLerner
      @ReuvenLerner 2 years ago

      @@KingofDiamonds117 "function object is not subscriptable" means: you're using [] on a function, and you really should be using (). Try that!
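The corrected call from this exchange, with parentheses instead of square brackets:

```python
import glob

# glob.glob is a function, so it takes parentheses, not square brackets:
data = glob.glob('channel1/*.csv')   # a (possibly empty) list of matching paths
```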

    • @KingofDiamonds117
      @KingofDiamonds117 2 years ago +1

      @@ReuvenLerner I'm not getting any errors this time, thanks. I have poor eyesight so it's difficult for me to see properly.