Exploratory Data Analysis in Pandas | Python Pandas Tutorials

Alex The Analyst

107,759 views

Added 5 June 2024
Take my Full Python Course Here: www.analystbuilder.com/course...
In this series we will be walking through everything you need to know to get started in Pandas! In this video, we learn about Exploratory Data Analysis in Pandas.
Dataset in GitHub:
github.com/AlexTheAnalyst/Pan...
Code in GitHub: github.com/AlexTheAnalyst/Pan...
Favorite Pandas Course:
Data Analysis with Pandas and Python - bit.ly/3KHMLlu
____________________________________________
SUBSCRIBE!
Do you want to become a Data Analyst? That's what this channel is all about! My goal is to help you learn everything you need in order to start your career or even switch your career into Data Analytics. Be sure to subscribe to not miss out on any content!
____________________________________________
RESOURCES:
Coursera Courses:
📖Google Data Analyst Certification: coursera.pxf.io/5bBd62
📖Data Analysis with Python - coursera.pxf.io/BXY3Wy
📖IBM Data Analysis Specialization - coursera.pxf.io/AoYOdR
📖Tableau Data Visualization - coursera.pxf.io/MXYqaN
Udemy Courses:
📖Python for Data Analysis and Visualization- bit.ly/3hhX4LX
📖Statistics for Data Science - bit.ly/37jqDbq
📖SQL for Data Analysts (SSMS) - bit.ly/3fkqEij
📖Tableau A-Z - bit.ly/385lYvN
Please note I may earn a small commission for any purchase through these links - Thanks for supporting the channel!
____________________________________________
BECOME A MEMBER -
Want to support the channel? Consider becoming a member! I do Monthly Livestreams and you get some awesome Emoji's to use in chat and comments!
/ @alextheanalyst
____________________________________________
Websites:
💻Website: AlexTheAnalyst.com
💾GitHub: github.com/AlexTheAnalyst
📱Instagram: @Alex_The_Analyst
____________________________________________
0:00 Intro
1:51 First Look at Data
3:45 Info()
4:40 Describe()
5:47 Counting all Null Values
7:09 Count of Unique Values
8:15 Sorting on Values
10:40 Correlation between Columns
11:53 Heatmap using Seaborn
14:43 Grouping Data
25:02 Visualizing Grouped Data
26:17 Boxplots for Outliers
29:07 Data Types of Columns
30:41 Outro
All opinions or statements in this video are my own and do not reflect the opinion of the company I work for or have ever worked for

Comments • 160

@santiagofajardo4949 1 year ago ⁺⁸⁵
Hello,
at minute 24:24, I managed to reverse the range of column names using [5:13][::-1]. The expression [::-1] is used to reverse ranges and it is very useful:
df2 = df.groupby('Continent')[df.columns[5:13][::-1]].mean(numeric_only=True).sort_values(by='2022 Population', ascending=False)
df2
Thank you very much, Mr. Alex, for these tutorials.
@user-zq6cp7lh3s 5 months ago ⁺¹
Thank You!
@renanz21 5 months ago ⁺³
Alternatively, count the columns from the end:
df2 = df.groupby("Continent")[df.columns[-5:-13:-1]].mean().sort_values(by='2022 Population', ascending=False)
df2
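The slices suggested in this thread and elsewhere in the comments can be compared side by side. The sketch below is only an illustration: it assumes a 17-column layout like the tutorial dataset, with the eight population columns at positions 5 through 12, and the shortened column names are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical 17-column layout: the eight population columns
# are assumed to sit at positions 5 through 12.
years = (2022, 2020, 2015, 2010, 2000, 1990, 1980, 1970)
columns = (
    ["Rank", "CCA3", "Country", "Capital", "Continent"]
    + [f"{year} Population" for year in years]
    + ["Area", "Density", "Growth Rate", "World Population Percentage"]
)
df = pd.DataFrame(columns=columns)

forward = list(df.columns[5:13])

# Three ways to get the same columns in reverse order.
reversed_a = list(df.columns[5:13][::-1])   # slice, then reverse
reversed_b = list(df.columns[12:4:-1])      # one slice with a negative step
reversed_c = list(df.columns[-5:-13:-1])    # counted from the end; depends on the total column count

print(reversed_a)
```

The negative-index form only lines up because the frame has exactly 17 columns; the first two forms do not depend on how many columns follow the population block.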
@pbp7 Před 9 měsíci ⁺¹⁷
Man, “Oceania” was so funny 😂, tks for the class!
@JW-pu1uk Před rokem ⁺²⁸
This is absolutely top tier content. I can't stress this enough to people new, or going into the DA/DS field: you WILL be exploring and cleaning data sets much more than you will be visualizing and building models.
Thanks for this, Alex!
@satrapech6107 9 months ago ⁺²³
The fix for df.corr() is:
numeric_columns = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_columns.corr()
correlation_matrix
@pradiptanugraha6841 7 months ago
Thanks, it works. Why is df.corr() not working for me?
@rajkumarjadi7061 Před 7 měsíci
thanks man.
@francescab1413 Před 6 měsíci ⁺³⁰
df.corr(numeric_only = True)
worked for me
@arrofifahmi7708 Před 5 měsíci ⁺²
@@francescab1413 me too mate! Thanks a lot!
@SDMNKhan 4 months ago
name 'np' not defined?
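On the two questions in this thread: df.corr() fails when the frame contains string columns such as 'AFG' codes, and the select_dtypes version raises a NameError unless numpy has been imported as np. A minimal sketch of both working variants, using a made-up toy frame (the column names and values are assumptions, not the tutorial data):

```python
import numpy as np   # without this import, np.number raises "name 'np' is not defined"
import pandas as pd

# Toy frame with one string column and two numeric columns.
df = pd.DataFrame({
    "CCA3": ["AAA", "BBB", "CCC", "DDD"],
    "Population": [10.0, 20.0, 30.0, 45.0],
    "Area": [1.0, 2.5, 2.0, 4.0],
})

# Variant 1: let corr() skip the non-numeric columns itself.
corr_a = df.corr(numeric_only=True)

# Variant 2: select the numeric columns first, then correlate.
corr_b = df.select_dtypes(include=[np.number]).corr()

print(corr_a)
```

Both variants should give the same matrix; the first needs no numpy import at all.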
@rafaelmarques5623 5 months ago ⁺⁷
Oceania is one of the 7 continents (North America, South America, Europe, Asia, Africa, Oceania, Antarctica). It's basically Australia and the countries (islands) around it.
Hope that helps!
@DuckingDuck-th2lt 4 months ago ⁺¹⁰
Hello, Alex!
Once again, thanks a lot for all your hard work!
At 13:10 I got the error "ValueError: 'box_aspect' and 'fig_aspect' must be positive".
I solved it by putting the plt.rcParams line BEFORE sns.heatmap.
The other problem was that some functions didn't work until I added the parameter numeric_only=True, e.g. df.corr(numeric_only=True) or .mean(numeric_only=True).
Hope it can help someone!
@yanpaucon1043 24 days ago
Thank you, You are the Best!
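Why the order matters can be sketched with matplotlib alone: the figure.figsize setting is read when a figure is created, and a plotting call such as sns.heatmap creates a figure at that moment if none exists. The sizes below are arbitrary, and the sketch does not try to reproduce the ValueError itself.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (4, 3)
fig_before = plt.figure()        # created while the small size is in effect

plt.rcParams["figure.figsize"] = (20, 7)
fig_after = plt.figure()         # created after the setting was changed

# The existing figure keeps its size; only the new one is 20 x 7 inches.
size_before = [float(v) for v in fig_before.get_size_inches()]
size_after = [float(v) for v in fig_after.get_size_inches()]
print(size_before, size_after)

plt.close("all")
```

Passing the size directly, for example plt.figure(figsize=(20, 7)) before the plotting call, avoids changing the global setting.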
@kartikgupta370 6 months ago ⁺⁵
We can also write this to avoid typing all the column names in the list:
df2 = df.groupby('Continent')[df.columns[12:4:-1]].mean(numeric_only=True).sort_values(by='2022 Population', ascending=False)
@DEDE-ix9lg Před 10 měsíci ⁺¹
I always enjoy a video from Alex. Making one of the best videos , while some other channels just can be a real headache
@pradiptisimkhada292 Před rokem ⁺⁴
I just finished all the videos in you bootcamp playlist few hours ago and I'm excited to do this again..
@sj1795 Před 5 měsíci ⁺¹
EXCELLENT SUPERB video!! I can't believe it--I'm 6/7 videos away from the end of your FANTASTIC bootcamp series! Wahoo! I learned a lot in this video. :) As for "ending on a low note", hardly Alex lol All your content is uplifting and rewarding! As always, THANK YOU!
@Inc0gnit030 Před 10 měsíci ⁺¹
I really enjoyed this introduction to Pandas! Keep up the good work!
@AlastorGarcia Před rokem ⁺⁸
Thanks Alex! Right now i'm applying to my first DA Job and you have no idea how useful your videos have been for me!!
@ermano5586 10 months ago ⁺¹
Hey, how is it going? Did you succeed in getting the job you wanted?
@quotesdiary310 Před rokem ⁺²
Hi Alex
Thank you so much for your support for freshers in the field of data analytics.
@MaximKazartsev 10 months ago ⁺⁴
Alex, thank you for this great video and everything you do!
To avoid ordering the population years by hand, you can slice df.columns and wrap the result in reversed(). The whole construction looks like this:
df2 = df.groupby('Continent')[list(reversed(df.columns[5:13]))].mean().sort_values(by='2022 Population', ascending=False)
And it works )
@languagewanderlust Před 2 měsíci
thank you!
@frenamakenson9844 3 months ago ⁺⁴
Hello,
100000000 thanks for sharing
For the correlation part at 11 min:
df.corr(numeric_only=True) # pass the numeric_only parameter to avoid the error
@nadarioferguson6276 Před měsícem
Thank you so much for this. I really enjoyed it and learned a lot of what I had forgotten a few years ago.
@Charlay_Charlay Před 4 měsíci
Thank you for the Pandas class!
@toygar8699 6 months ago ⁺¹⁹
For those who get an error in the heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
numeric_columns = df.select_dtypes(include=['float'])  # keeps float columns only
plt.rcParams['figure.figsize'] = (20, 7)  # set the size before drawing
sns.heatmap(numeric_columns.corr(), annot=True)
plt.show()
@asmitaupadhyay4656 Před 2 měsíci
thank you
@nointernetnarwhal7615 Před 2 měsíci
THANK YOU!!!!!! I almost quit for good.
@nassrmohamed278 1 month ago
I got this error in corr: "could not convert string to float: 'AFG'"
Do you know how to solve this?
@user-vy8kk9ob3s Před měsícem
thanks a lot toygar
@yanpaucon1043 Před 23 dny
@@nassrmohamed278 df.corr(numeric_only=True)
@staquatica1607 8 months ago ⁺³⁷
I got some errors (using PyCharm) that I solved by using "numeric_only=True". For instance: df.corr(numeric_only=True) and df.groupby("Continent").mean(numeric_only=True)
@mohammedshadaabkhan3228 Před 7 měsíci ⁺⁵
Hey use this code instead
numeric_df = df.select_dtypes(include='number') # Select only numeric columns
plt.figure(figsize=(20, 7)) # Set the figure size
sns.heatmap(numeric_df.corr(), annot=True) # Create the heatmap with annotations
plt.show()
@DevanshAsawa Před 5 měsíci ⁺¹
helped a ton thanks
@haley2486 Před 4 měsíci ⁺¹
Thanks for posting! I had to do SHIFT+TAB on the corr() function to find out how to get only numeric values.
@nassrmohamed278 Před měsícem ⁺¹
thaaaaaaaaaaaaaaank youuuuuuuuuuuuuuuuu
@jjsansano Před 27 dny
This is great! Thank you!
@neildelacruz6059 Před 8 měsíci ⁺¹
Thank you Alex this is very helpful.
@ngwamalfred8151 Před 8 měsíci
Where would l have been without this video .
@user-dx2hx2rd4g Před rokem ⁺¹
Thank you for the useful information!
@abhishekchaudhary7913 4 months ago ⁺²
At 26:11, where Alex sorts the years manually, you can sort them directly with this command:
df4 = df3.sort_index(ascending=True)
df4
@moniquebrasilbaptista1989 Před 7 měsíci
I am sure I am going to use some of these tips. Thank you!😍❤
@kogureyoeh 11 months ago ⁺⁶
At 24:00
you can simply add ".sort_index()" to "df3 = df2.transpose()", so that we don't have to rearrange the columns manually.
df3 = df2.transpose().sort_index() worked on my end, hopefully on yours too.
@abisolalumous5505 4 months ago
thank you
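A small sketch of the sort_index() suggestion, using a made-up grouped frame (the continent labels and values are placeholders, not the tutorial data). After the transpose, the year labels become the row index, so sorting the index puts them in chronological order:

```python
import pandas as pd

# Toy stand-in for the grouped frame: continents as rows, years as columns,
# with the newest year first as in the video.
df2 = pd.DataFrame(
    {
        "2022 Population": [30.0, 12.0],
        "2020 Population": [28.0, 11.0],
        "1970 Population": [10.0, 5.0],
    },
    index=pd.Index(["Asia", "Africa"], name="Continent"),
)

# Transpose so the years become the index, then sort that index.
df3 = df2.transpose().sort_index()
print(df3)
```

The sort is alphabetical on the label text, which matches chronological order here only because every label starts with a four-digit year.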
@vitorribeirosa Před rokem ⁺¹
Neat...
Thanks for sharing this content.
Cheers
@aayushitrivedi3481 Před rokem ⁺²
love your videos alexx ;)
@aishwaryapattnaik3082 Před rokem ⁺²
Thanks a lot for this clear cut explanation. Can you make something similar for NLP projects end to end ?
@tranguyen4462 Před 2 měsíci
omg I laughed out loud at the "Oceania" part ;)))) Alex is so funny and brutally honest about things he didn't know ;)))
@user-yp1ej5ou6b 6 months ago ⁺¹
Hey, just a quick note here: when we're plotting the populations, the chart only shows the absolute numbers next to the largest populations. In fact, Oceania's population (for example) increased by around 2.5 times.
Anyway, thanks for the content, it's amazing
@elfridhasman4181 Před rokem ⁺¹
Thank you Alex💯🔥
@quotesdiary310 Před rokem ⁺¹
Thank you so much alex
@user-fx9eq7zm2v Před 10 měsíci
Again, thank you were much!
@TheRobinCreations Před 8 měsíci
Thank you so much it was very informative.
@youssefbekk4453 Před rokem
high level , thanks
@abdulsami6117 Před 10 měsíci
Love from Pakistan Alex, Really Helpful and Enjoyable.
I also like the OOPS sound you make 😂😂
@haithammontaser7769 Před 11 měsíci
Hello Alex. Thanks for the video and content. Is there any video for data per-processing?
@diegomartins7214 Před 6 měsíci
Thank you!
@lukekulak7165 Před rokem ⁺³
Lets goo!
@keluargaindo-timordiuk 10 months ago ⁺⁴
For the grouping step I do:
df2 = df.drop(columns=['CCA3', 'Country', 'Capital'])
df3 = df2.groupby('Continent').mean(numeric_only=True).sort_values(by="2022 Population", ascending=False)
df3
to get the same output as seen in the video.
@danielmariobuchberger 8 months ago
Me too. This should be explained, because strings can't simply be averaged... being too long is mostly the problem!
@bolajiawofuwa8116 5 months ago
THANK YOU!!!!!!
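The point about string columns can be sketched on a toy frame (all names and values below are made up). With numeric_only=True, the text column is left out of the result instead of stopping the calculation:

```python
import pandas as pd

# Toy frame: one text column, one grouping column, two numeric columns.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Continent": ["X", "X", "Y", "Y"],
    "2022 Population": [10.0, 30.0, 5.0, 15.0],
    "Area": [1.0, 3.0, 2.0, 4.0],
})

# Average only the numeric columns per continent, largest population first.
df3 = (
    df.groupby("Continent")
    .mean(numeric_only=True)
    .sort_values(by="2022 Population", ascending=False)
)
print(df3)
```

Dropping the text columns first, as in the comment above, should reach the same result; numeric_only=True just saves listing them.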
@LaMeeFitness Před rokem ⁺⁴
Thanks for all you do. I’m loving the bootcamp. Just finished excel project. However, please can you make a video on story telling?
@ayoubchouket Před 3 měsíci
thank you
@sarayusemesta6132 4 days ago
26:00
You can just add this to invert the columns:
df2 = df.groupby('Continent')[df.columns[5:13]].mean(numeric_only=True).sort_values('2022 Population', ascending=False)
df2_inverted = df2.iloc[:, ::-1]
df2_inverted
@l7932 Před 3 dny
thanks sir
@akademy_performance_digital Před 5 měsíci
great
@HarshKumar-ws3wv Před 3 měsíci
Sir, in your opinion : Jupyter vs Pycharm? Which is better for Exploratory Data Analysis ?
@enix492 Před rokem ⁺²
Hello Alex. I read a few reviews on your recommended course on Udemy. People are saying that it is a bit outdated especially the last section. Do you think I should still go for it and the non updated part doesn't matter? Love your content and thanks for everything you do here.
@AlexTheAnalyst Před rokem ⁺²
I haven't taken it in a while - worth listening to more recent comments. Could be outdated?
@innocentnduaguba Před 5 měsíci ⁺²
Thank you so much Alex, truly great content you put out there. I have a question please; when I run df.groupby('Continent').mean() and df.corr() I get errors, please what could be the cause and what can I do to remedy it.
@sabithsaqlain1367 Před 5 měsíci ⁺¹
use df.corr(numeric_only = True)
@sj1795 Před 5 měsíci ⁺¹
@@sabithsaqlain1367 THANK YOU for this!! This was driving me a little nutty. Really appreciate you sharing this. :)
@SDMNKhan Před 4 měsíci
I could not fix the mean() issue.
@chriscurtis95 Před měsícem
df.groupby('Continent').mean(numeric_only=True)
@Zenitsu-mq7fq 2 months ago
24:50
df2 = df.groupby('Continent').mean(numeric_only=True).iloc[:, -5:-13:-1].sort_values(by = '1970 Population', ascending = False)
df2 = df2.transpose()
df2.plot()
This way we avoid copy-pasting and reordering the columns by hand and just use reversed indexes)
@minasghazaryan9344 Před rokem ⁺⁶
Hi, Alex. First of all thanks for a great video and explanations in it.
If you could help out with the issue I get running your exact code I would be more than grateful.
Running the df.corr() line gives me the following error: ValueError: could not convert string to float: 'AFG' .
Same comes for the heatmap,etc. What could it be here?
Thanks a lot in advance.
@ReneePieschke Před rokem
Getting the same errors.
@11zaad Před 11 měsíci ⁺²
try this ==> df.corr(numeric_only=True)
@dustin3320 Před 11 měsíci ⁺¹³
Best to use df.corr(numeric_only=True) to get around this
@Batira583 Před 6 měsíci
you saved my life thanks so much @@dustin3320
@fede77 Před 5 měsíci ⁺²
df.corr(numeric_only = True)
@Marcusram Před 11 měsíci
we can do df3=df3.iloc[::-1] to solve the problem with the date order
@adminravi 8 months ago ⁺¹
Is it OK if I use:
pd.set_option('display.float_format', '{:.2f}'.format) instead of
pd.set_option('display.float_format', lambda x: '%.2f' % x)
@rohallav 8 months ago
Or, even better, you can use lambda x: f"{x:.2f}"
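The three formatter spellings from this thread can be checked against each other. In this sketch the sample value and the dictionary keys are arbitrary illustrations; display.float_format accepts any callable that takes a float and returns a string.

```python
import pandas as pd

# The three spellings discussed above, each a callable: float -> str.
formatters = {
    "percent-style": lambda x: "%.2f" % x,
    "str.format": "{:.2f}".format,
    "f-string": lambda x: f"{x:.2f}",
}

value = 1234.5678
results = {name: fmt(value) for name, fmt in formatters.items()}
print(results)

# Any of them can be handed to pandas; the str.format variant is used here.
pd.set_option("display.float_format", formatters["str.format"])
rendered = repr(pd.Series([value]))
pd.reset_option("display.float_format")  # restore the default afterwards
print(rendered)
```

Since all three produce the same text, the choice between them is a matter of style.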
@donvious Před 2 měsíci
hi, where is the link for the csv format document?
@iqraasif3783 Před 7 měsíci ⁺¹
Hi, can someone help. When I plot figures that have been grouped, it doesn't show the figure, just says .
@user-tm7uw4os1n 3 months ago
21:09 I just figured it out. Simply add another line after the plot, like this:
df2.plot()
plt.show()
@gauravpunera3256 Před rokem ⁺¹
Alex please make video on how to get international remote data analyst job
@user-re4ip5ms9w 13 days ago
My heatmap is broken: it's not showing all the values even though I wrote annot=True. Does anyone have a fix? I tried almost everything I found when I hit Shift+Tab.
@philiprhome3824 Před rokem ⁺¹
as R user, the syntax of pandas is just weird in compare to tidyverse (dplyr and tidyr)
@meredithleonor5035 Před rokem
why use anaconda instead of google collab, just curious looking forward in visual tutorial at python and statistics thanks i really need this type of tutorial i am studying cohort analysis and RFM analysis
@peaceandlove8862 Před 7 měsíci
Oceania is the continent that includes Australian and New Zealand.
@arpitmaheshwari122 Před 5 měsíci ⁺¹
hey, can anyone tell if the correlation command is working in vs code?
I'm getting a value error in this part.
please share the solution if you have one
thanks :)
@Shashankkundena Před 2 měsíci
Hey, just use numeric_only = True
@OkallTheAnalyst 3 months ago ⁺¹
In case you are running into an error at minute 11:12, add numeric_only=True to corr, i.e. df.corr(numeric_only=True).
@mananagrawal4114 Před 2 měsíci
thanks man !
@karanvaghela4668 9 months ago
Hey Alex, why should we use Python instead of SQL? SQL is easy.
@truthgaming2296 Před 5 měsíci
its spells 'O-Ce-A-Nia' btw
btw thank for this guidance SIr Alex :)
@rjk537 Před 11 měsíci ⁺¹
I'm a law graduate without any experience or qualifications in data analysis whatsoever but i want to get into data analysis. Will i be able to get a job in this field? and if yes then what possible skills and certifications will help me to achieve the same? please give me some tips and insights it would be really helpful!
@ermano5586 10 months ago
Yes, you can. As for skills, I would mostly recommend analytical thinking, plus learning probability, statistics, and other higher math.
As for certifications, Mr. Alex said that the Amazon and Tableau certifications, among others, will help. In any case, if it's a certificate from a long-term course, I think it is OK to have it on your CV. But what really makes you stand out is the projects you have done, and I mean not only portfolio projects but also other ones that show your uniqueness.
@ermano5586 Před 10 měsíci
I have one problem, which is that the table does not display columns starting from "area (km^2)" when we call "df" to view the table, I mean there is no scrollbar for horizontal data, can anyone help for this, please?
@ruchirmittal9207 5 months ago ⁺¹
Try another browser. Some browsers don't support that feature.
@osiomogieasekome8799 Před 11 měsíci
I couldn't get seaborn to import... I tried online solutions about installation but it didn't work
@rnjesus9950 Před 4 měsíci
This worked for me where df.corr() did not:
# Select numeric columns (excluding any non-numeric columns)
numeric_columns = df.select_dtypes(include=['float64', 'int64'])
# Calculate the correlation matrix
correlation_matrix = numeric_columns.corr()
correlation_matrix
@OazadOMER 8 months ago
Thank you very much, Alex. I'm shifting from Ph to Data Analyst with your bootcamp. I had an issue with plt.show(): AttributeError: module 'matplotlib' has no attribute 'show'. It seems to be deprecated and I couldn't find anything similar. Also, my chart is not showing the numbers (14:10).
Best regards
@dishanbhandari 1 month ago
Hi there, did u find the solution to your problem of not showing numbers? I ran into the same problem too.
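One likely cause of that AttributeError (an assumption, since the commenter's import line is not shown) is importing the top-level package under the name plt, as in "import matplotlib as plt". The show function lives in the matplotlib.pyplot submodule, which this sketch checks directly:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt   # the usual alias points at the pyplot submodule

# The top-level package has no show(); the pyplot submodule does.
top_level_has_show = hasattr(matplotlib, "show")
pyplot_has_show = hasattr(plt, "show")
print(top_level_has_show, pyplot_has_show)
```

If the check on the top-level package is False and the one on pyplot is True, switching the import to "import matplotlib.pyplot as plt" should make plt.show() available.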
@octaverius762 Před 11 měsíci ⁺³
Alex which continent do you think Australia is in 😮
@AlexTheAnalyst Před 11 měsíci
:D
@chefernandez563 Před 11 měsíci
Australia is also a continent tho😂 sometimes ppl will also refere to NZ ans Aus as the "Australias" but Oceania includes the other surrounding islands
@octaverius762 Před 11 měsíci ⁺¹
@@chefernandez563 Oceania is a continent, Australia is a country. How people often speak is not relevant
@dragoneer121 Před 11 měsíci
@@octaverius762 Actually it is relevant. Though different countries do have different models and its entirely up to convention. Australia the continent is usually considered the 3 islands of mainland Australia, Tasmania and Papua New Guinea
@dishanbhandari 1 month ago
My heatmap doesn't contain the data values as it does at 14:18; instead it shows them only in the topmost band. I have written the code just as shown, also with df.corr(numeric_only=True) and with 'annot', but there are still no data values. Please, can anyone help?
@NyeinHtutSwe 22 days ago
I also ran into the same problem :). I still can't find the solution.
@jDub997D 7 days ago
Upgrade your seaborn package:
pip install seaborn --upgrade
Then restart your kernel and rerun all the cells.
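Before and after upgrading, it can help to see which versions are actually installed, since annotations appearing only in the first row has been reported when an older seaborn runs against a newer matplotlib (which exact versions are affected is not stated in this thread and is left open here). A minimal sketch using only the standard library; the helper name installed_versions is made up for this illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("seaborn", "matplotlib", "pandas")):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions()
print(versions)
```

Running this in the same notebook kernel shows whether the upgrade reached the environment the notebook is actually using.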
@srijanrawat4014 Před 10 měsíci
i am having problem in downloading the file , can anyone help me out
@orlumbuseuw5646 Před rokem ⁺¹⁹
Was there here an adult ignorant of what Oceania is or is this some inner joke in the channel?
@octaverius762 Před 11 měsíci ⁺²
I can't believe this
@litoavila. Před 11 měsíci ⁺¹
Also FYI America is just one continent, in case you doubt it
@MatthewBreithaupt Před 7 měsíci ⁺²
OceanEeeA
@MatthewBreithaupt Před 7 měsíci ⁺¹
FYI Australia is not a *small* island. Oceania doesn't "mean" anything, it's the name of a continent containing the countries listed right in front of you since you already filtered the data 😂😂
@chefernandez563 Před 11 měsíci ⁺¹
Am I the only one who knew Oceania was Australia, New Zealand, Samoa and those places😂😂
@roshandhumal1193 Před rokem
Sir Alex.
I am Roshan Dattaram Dhumal
I live in India from Mumbai.
I want to start my career in data analysis but I don't know how to start and I want to know what steps you have to take to become Data analytics.
I would like to request you to please explain to us and give us some steps. Please sir I will definitely do hard work.
@hammadahmed7192 9 months ago
Try passing the numeric_only argument. In recent versions, the default value of this argument has changed to False, so it tries to correlate string values as well.
df.corr(numeric_only = True)
@dragoneer121 11 months ago ⁺¹
Continents are mostly a social convention. The English-speaking countries tend to use 7, while Spanish-speaking countries have a 6-continent model that uses Oceania and combines North and South America.
Australia is the continent, but Oceania is a geopolitical convenience. If it were not included, most of the Pacific island countries would not be associated with a continent. North and South America are another convenience, and Central America is only a region by American standards.
As an example of how ridiculous it is as a continent, Hawaii would be included if it were independent.
@naagarhive6581 Před 4 měsíci
OOPs
@Ben-qe8ju Před 11 měsíci
O-she-ana
@taroge5464 Před 9 měsíci ⁺¹
no explanation.................pd.set_option('display.float_format',lambda x : '%.2f' % x)
@FailingProject185 Před 3 měsíci ⁺²
It's funny american don't know the continent of australia.
@csaracho2009 Před 9 měsíci
(Minute 9:30)... So, in the Continent America there are 'two" Continents, "NorthAmerica"and "SouthAmerica"/ Ha Ha Ha, Americans...
@gogor8017 Před 5 měsíci
You said 'Oceania' so many times, now it sounds like meaningless word.
@aayushitrivedi3481 Před rokem ⁺²
first
pin me
@ermano5586 Před 10 měsíci
pin
@alikoohi8265 1 year ago
Informative video, thanks. I just found an easier way to reverse the order of the rows:
df3 = df2.transpose().loc[::-1] 😉
@RaihanRisad 3 months ago
I wasn't able to run df.corr() because it said some columns are not numeric, so in that case I had to build numeric_df first:
numeric_df = df[['2022 Population', '2020 Population', '2015 Population', '2010 Population', '2000 Population', '1990 Population', '1980 Population', 'Area (km²)', 'Density (per km²)', 'Growth Rate', 'World Population Percentage']]
numeric_df.corr()
@dwbrow3 Před 2 měsíci ⁺¹
Try df.corr(numeric_only=True)
@marypazcuessy3004 16 days ago
Can anyone help me? My heatmap won't load all the numbers, just the Rank row starting at 1.
I used:
df.corr(numeric_only = True)
sns.heatmap(df.corr(numeric_only = True), annot = True)
plt.show()
