I'm not exaggerating when I say this video changed my life.
I went from a guy who did everything upstream in SQL and grudgingly used Pandas to a guy who uses Pandas for everything.
The approach Matt demonstrates also translates generally to PySpark.
I'm now considered the go-to guy for Pandas and PySpark code in my department. There's so much bad code around, often written by people with advanced degrees and MATLAB experience, it seems. I could make a full-time job out of cleaning up bad code.
Dot chain FTW!
Thanks! Glad to help.
Heh, MATLAB and bad coding practices - the two are never far from one another it seems.
90 minutes of pure gold.
Thanks Matt!
Thanks David. 👍🙏 Make sure you check out my book, Effective Pandas, if you appreciated this.
AGREE COMPLETELY! FANTASTIC PRESENTATION! Learned more here than in the past two years.
This is easily the best pandas guide I have ever watched so far.
Thank you!
This is gold! Matt did an amazing job showing best practices when using pandas and a lot of intuition about how pandas functions run under the hood.
This was a ridiculously useful video. I feel like I've watched a lot of Python videos, but I think this might be the most practically useful one for people who are not brand new to pandas -- who use it all the time.
This man is a living data legend.
Mass respect.
Really interesting talk; I was doubtful about chaining at first, but you have converted me :) A very informative talk, thanks.
Thanks for coming around Nick. 😉 Hope you find these techniques useful to you.
Great presentation. As others said, pure gold. If there were a button called "pure gold", I would have clicked it. A simple like is not enough. It also changed my view of code organization. Thanks for sharing.
Matt, big thank you for chaining idea!
By far the best pandas video I have ever seen
This tutorial had so many gems! Thanks Matt
This is mind blowing... Thank you very much!
I can’t wait for you to give another talk on polars!
Excellent Pandas best practices video. I was already a big user of chaining but for some reason hadn't used append much. This is a cleaner way to do things and I will be using it. My next notebook is going to be much easier to maintain and much easier to build. Thanks Matt!
Awesome. Thanks
Thanks for sharing ❤
Thanks Matt, this was an incredible presentation. Came here from the Real Python podcast, just bought the book too!
Thanks for your support
Really interesting, many thanks to Matt and PyData :)
I was looking into how to speed up my pandas operations, since I read that Python itself is faster than R and that pandas should be faster than plain Python; I am happy I came here.
Excellent tips that I am going to experiment with, hopefully achieving quicker output times.
Excellent session nevertheless.
I really love this session and it’s completely changed the way I process data going forward.
Thanks a lot !
Here from your HN comment. Super informative.
Thanks for the wonderful pandas insights, Matt and PyData!
1:18:00 For the specific question being asked (finding duplicates in a primary key) there is a much simpler solution than what Matt Harrison suggested: df.duplicated("primary_key", keep=False). Used as a boolean mask, it selects all rows with non-unique values in the "primary_key" column, i.e., all the rows that are duplicated.
Matt solves the more general problem of "find all rows for which the element in primary_key occurs at least N times". A more concise (though perhaps less readable) solution to this would be something like
(df
 [df.primary_key.map(df.primary_key.value_counts()) > N]
)
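For anyone who wants to run both variants, here is a minimal self-contained sketch on made-up data (the column name and N are just placeholders):

import pandas as pd

df = pd.DataFrame({"primary_key": [1, 2, 2, 3, 3, 3], "val": list("abcdef")})

# keep=False flags every occurrence of a repeated key, not just the later ones
df[df.duplicated("primary_key", keep=False)]   # rows with keys 2, 2, 3, 3, 3

# generalised version: keep rows whose key occurs more than N times
N = 2
df[df.primary_key.map(df.primary_key.value_counts()) > N]   # the three rows with key 3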
An alternative to your approach is to use .transform() with .groupby(), to act effectively like a SQL window function that counts the primary keys, but whose result is the same length as the original data (rather than being collapsed due to aggregation).
Something like:
num_dups = df.groupby('key')['key'].transform('size') # has same index as df
df.loc[num_dups > N]
Really interesting and informative talk.
Thanks
Thank you for this! This is super helpful. I learned so much!
Thanks for your 'rant', Matt - I have your recent books and still realised something that I should be doing with my data. 👌
Thanks Tyrone! Good luck with your Pandas. 😉🐼
Thank you
Just a tip:
At 48:30, when commenting line by line upwards, you can point the mouse at the desired line, then press and hold (I think) ALT; the pointer should switch to a thin-lined cross, and you can then drag up or down across the lines and insert #.
It's like doing a block comment...
Still looking for a way to do that without the mouse, but not sure whether to use something like a Vim extension, if there is one...
Awesome!
I have a problem with aggregations: sometimes when you aggregate two columns and one column has a cell with a NaN, .groupby will ignore it. I know you can keep those NaNs, but I would like to see a use case for when it is a good idea to keep NaNs while using .groupby and when it is not.
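A minimal sketch of both behaviours on made-up data, in case it helps frame the question (column names are placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame({"region": ["east", "west", np.nan, "east"], "sales": [10, 20, 30, 40]})

df.groupby("region").sales.sum()                 # NaN key is dropped (the default)
df.groupby("region", dropna=False).sales.sum()   # NaN kept as its own group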
Sir, the apply method gave me an error such as "unhashable Series".
How do I fix that?
Great video, I need this data set.
Where can I find it?
Can someone identify the font he uses in JupyterLab? :D
'Lato' I guess
@@JimmieChoi93 I just tried and I don't think it is Lato.
@@pmiron damn. Here's an idea, screenshot it to ChatGPT and ask
@@JimmieChoi93 haha, I actually did try with some screenshots. It recognizes that it's a notebook with a monospace font, but then suggests it might be the default JupyterLab font or Consolas, Menlo, etc. Also tried WhatTheFont and FontSquirrel with no luck.
ty for the video, Matt, this is awesome
Can you explain how you got those numbers @ 57:30 --
6_220 / 125
Thank you!
235.215 is the conversion constant between mpg and L/100 km (L/100 km = 235.215 / mpg). It's a constant the presenter looked up on a search engine ahead of time.
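If it helps, a quick sketch of that conversion (assuming a hypothetical 'mpg' column; the dataset in the talk may name it differently):

235.215 / 25   # 25 mpg is roughly 9.41 L/100 km

df.assign(l_per_100km=lambda df_: 235.215 / df_.mpg)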