Real-World Dataset Cleaning with Python Pandas! (Olympic Athletes Dataset)

  • Published 20 Jul 2024
  • I'm prepping a dataset for an upcoming tutorial and I figured walking through the process of cleaning it would work well for a livestream! We use various Python Pandas functions to accomplish our data cleaning goals.
    We'll be working off of this repo:
    github.com/KeithGalli/Olympic...
    Some topics that we cover (the cleaning steps are sketched in code right after this list):
    - How you can use web scraping to collect data like this (Python BeautifulSoup)
    - Splitting strings into separate columns
    - Using regular expressions (regexes) to extract specific details from columns
    - Converting columns to datetime & numeric types
    - Grabbing only a subset of our columns
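    A rough sketch of those cleaning steps (file and column names such as bios.csv, "Used name", "Measurements", and "Born" are assumptions based on the topics above, not necessarily the exact names in the repo):
    import pandas as pd
    # Load the scraped athlete bios (file name assumed)
    bios = pd.read_csv("bios.csv")
    # Replace the bullet separator in the "Used name" column with a space
    bios["Used name"] = bios["Used name"].str.replace("•", " ", regex=False)
    # Split "Measurements" (e.g. "180 cm / 75 kg") into numeric height/weight columns
    bios["height_cm"] = pd.to_numeric(
        bios["Measurements"].str.extract(r"(\d+)\s*cm", expand=False), errors="coerce")
    bios["weight_kg"] = pd.to_numeric(
        bios["Measurements"].str.extract(r"(\d+)\s*kg", expand=False), errors="coerce")
    # Pull a "day Month year" date out of the "Born" column and convert it to datetime
    bios["born_date"] = pd.to_datetime(
        bios["Born"].str.extract(r"(\d{1,2} \w+ \d{4})", expand=False), errors="coerce")
    # Keep only the subset of columns we care about
    bios = bios[["Used name", "born_date", "height_cm", "weight_kg"]]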
    Sorry that this was a bit last minute scheduling-wise, will try to give more advance notice in the future!
    Video timeline!
    0:00 - Livestream Overview
    4:00 - About the Olympics dataset (source website and how it was scraped)
    9:50 - Cleaning the dataset (getting started with code & data)
    19:26 - What aspects of our data should be cleaned?
    29:08 - Get rid of bullet points in Used name column
    34:08 - How to split Measurements into two separate height/weight numeric columns.
    1:05:00 - Parse out dates from Born & Died columns
    1:25:43 - Parse out city, region, and country from Born column (working with regular expressions)
    1:41:15 - Get rid of the extra columns
    1:46:08 - Next steps (how would we clean the results.csv)
    1:49:41 - Questions & Answers
    -------------------------
    Follow me on social media!
    Instagram | / keithgalli
    Twitter | / keithgalli
    TikTok | / keithgalli
    -------------------------
    Practice your Python Pandas data science skills with problems on StrataScratch!
    stratascratch.com/?via=keith
    Join the Python Army to get access to perks!
    YouTube - / @keithgalli
    Patreon - / keithgalli
    *I use affiliate links for the products that I recommend. I may earn a commission on purchases or a referral bonus when these links are used.

Comments • 36

  • @KeithGalli  3 months ago +16

    Thank you everyone who tuned in today!!

  • @rrrprogram8667 7 days ago +2

    I really thank God that I found your channel. Thanks for sharing your knowledge, and keep uploading!

  • @beauforda.stenberg1280 3 months ago +2

    I missed the live stream, but I am watching this video at the moment. This is the second upload of yours I have watched. I am a subscriber and wish to thank you very much for your uploads. Please, keep them coming. I am very new to Python. I am learning Python primarily to build a knowledge-graph 'index' for computational shells and shell scripting, in the widest possible purview, for a web app/website version of a dedicated work on computational shells and shell scripting that I have spent the last six months writing. I need to extract all the data from an archive of Markdown files (the book I have written), which involves cleaning the data while preserving its relationships to inform the generation of an ontology of the computational shells and shell scripting domain through natural language processing; establish a dataset; export the dataset into a directed graph; and visualise it with NetworkX. I don't yet know how to do any of this. If you could cover some of the processes involved in building a knowledge graph from Markdown files, that would be brilliant! Thanks again for your uploads.

  • @danprovost8232 3 months ago +1

    Great stream this was very helpful! Keep up the good work!

  • @AndyJagroom-ur7xh 10 days ago +1

    Can you do an update of the NumPy video? Thank you so much for these videos, they helped me a lot ❤

  • @aishwaryapattnaik3082 2 months ago +1

    Such a great tutorial, Keith. Please keep uploading such high-quality videos on Pandas and more.

  • @marcinjagusz2481 3 months ago +1

    Thanks Keith! I know it takes some time to prepare and record stuff like this, but please upload more Python coding!

    • @KeithGalli  2 months ago +3

      will try to keep them coming!

  • @Hamsters_Rage 2 months ago +2

    29:26 - he starts writing some code

  • @chenjackson6001 2 months ago +1

    Thank you for your hard work.

  • @AndyJagroom-ur7xh 10 days ago +1

    What's your laptop? Cool videos BTW

  • @SangNguyen-bu8xd 1 month ago

    Amazing thank u sir

  • @067-ashish7 3 months ago +2

    Please upload more videos related to data cleaning.

  • @Kidpambi 3 months ago

    Thanks a lot man

  • @AnasM24 2 months ago

    Thank you man

  • @Kira-vs4np 2 months ago

    Just a note: at 1:19:21, format = "mixed" isn't really working for me, and it fills the date_born column with NaT values. So I tried format = "%d %B %Y" and it works.
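
    A minimal sketch comparing the two calls above (the sample strings are made up, and format="mixed" requires pandas 2.0+; an explicit format also tends to be faster when every value shares one layout):
    import pandas as pd
    born = pd.Series(["24 May 1891", "17 June 1903", None])
    # Let pandas infer each element's format individually (pandas >= 2.0)
    parsed_mixed = pd.to_datetime(born, format="mixed", errors="coerce")
    # Parse with one explicit format; values that don't match become NaT
    parsed_explicit = pd.to_datetime(born, format="%d %B %Y", errors="coerce")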

  • @rrcr4769 4 days ago

    Hi Keith,
    This code handles the issue well:

    # Split the 'Measurements' column into numeric height_cm and weight_kgs columns

    dfCpy['height_cm'] = None # add a blank column to store height
    dfCpy['weight_kgs'] = None # add a blank column to store weight

    # Extract height and weight information
    dfCpy['height_cm'] = dfCpy['Measurements'].str.extract(r'(\d+) cm', expand=False).astype(float)
    dfCpy['weight_kgs'] = dfCpy['Measurements'].str.extract(r'(\d+) kg', expand=False).astype(float)
    dfCpy

  • @chillydoog 3 months ago +1

    Hawaiian shirt and Twisted Tea! My man

    • @KeithGalli  2 months ago +1

      Hawaiian shirt, yes, but sorry to disappoint, it's just a standard sparkling water I'm drinking haha

    • @chillydoog 2 months ago +1

      @@KeithGalli 😉

  • @vg5675 2 months ago

    Should I always drop the rows containing null values and then perform further analysis?

    • @rohitsinha1092 2 months ago +1

      Not necessarily, it depends. When doing the same kind of cleaning for machine learning, dropping an entire column can cause loss of data that might have helped the algorithm recognize patterns, so you would use other methods to handle missing values in that case. I think it's better to handle them separately rather than just drop an entire column, even though that is a possible approach for smaller datasets. So it's a case-by-case basis, but as I'm analysing this dataset now, I see a few columns with excessively large amounts of null values, so I think it's okay to drop them. Cheers
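
      A minimal sketch of that case-by-case approach (file and column names like bios.csv, "height_cm", and "Used name" are illustrative assumptions):
      import pandas as pd
      df = pd.read_csv("bios.csv")  # file name assumed
      # Drop only the columns that are almost entirely null (here, more than 90% missing)
      df = df.loc[:, df.isna().mean() <= 0.9]
      # For the rest, keep the rows and fill gaps instead of dropping whole columns
      df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())
      # Drop a row only when a key field is missing
      df = df.dropna(subset=["Used name"])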

  • @ramarisonandry8571 3 months ago

    From Madagascar

  • @alphonsinebyukusenge3071

    Where can we find the dataset?

  • @sebastianalvarez1537 3 months ago

    holy fuq

  • @SAGAR-ox6ks 2 months ago

    I gave ChatGPT the questions that you framed and it showed the same solutions. I could have easily used ChatGPT rather than watching this video: just download the dataset, paste some rows of it into ChatGPT along with all the framed questions, and the answers are the same as in this two-hour video. It took ChatGPT 5 minutes.

    • @mohammadsamir2713 2 months ago +4

      If you're not going to support people's efforts, at least don't disappoint them.

    • @Opoliades 2 months ago

      Yeah, but what are you going to do when ChatGPT can't save you? You didn't "easily" do the task at hand… you made someone/something else do it. Maybe data analysis isn't your thing. Perhaps consider being an LLM expert instead 😊