K-Means Clustering Algorithm with Python Tutorial

  • Added 25 June 2024
  • K-Means clustering is a popular unsupervised machine learning algorithm that is commonly used in the exploratory data analysis phase of a project. It groups data together into clusters based on similarities within the data. In this tutorial, we will go through the basics of running a k-means algorithm on well log data.
    This video is based on my Medium article. Check it out, as it contains more examples and extra plots.
    towardsdatascience.com/how-to...
    Timestamps:
    0:00 Introduction
    0:53 K-Means Clustering Theory
    2:56 Jupyter Notebook Loading Data & Importing Libraries
    5:53 Applying a Standard Scaler
    8:27 Identifying Optimum Number of Clusters - Elbow Plot
    11:20 Applying K-Means Clustering Algorithm
    12:55 Plotting K-Means Clustering Results on a Scatter Plot
    14:25 Comparing Results from Multiple K Values
    18:40 Other Clustering Methods & Outro
    DOWNLOAD NOTEBOOK & DATA
    Data and notebooks for my entire CZcams series can now be found here:
    github.com/andymcdgeo/Andys_Y...
    REFERENCES & LIBRARIES
    Force 2020 Competition Github: Bormann P., Aursand P., Dilib F., Dischington P., Manral S. 2020. FORCE Machine Learning Competition. github.com/bolgebrygg/Force-2...
    Competition Results: www.npd.no/en/force/Previous-...
    Books I Recommend:
    As an Amazon Associate I earn from qualifying purchases. By buying through any of the links below I will earn commission at no extra cost to you.
    PYTHON FOR DATA ANALYSIS: Data Wrangling with Pandas, NumPy, and IPython
    UK: amzn.to/3HNycJ9
    US: amzn.to/3DL7qPv
    FUNDAMENTALS OF PETROPHYSICS
    UK: amzn.to/3l1PgSf
    PETROPHYSICS: Theory and Practice of Measuring Reservoir Rock and Fluid Transport Properties
    UK: amzn.to/30UNWZS
    US: amzn.to/3DNqBbd
    WELL LOGGING FOR EARTH SCIENTISTS
    UK: amzn.to/3FHsbfn
    US: amzn.to/3CILAuE
    GEOLOGICAL INTERPRETATION OF WELL LOGS
    UK: amzn.to/3l2v2HV
    US: amzn.to/30UOTkU
    If you haven't already, make sure you subscribe to the channel: / @andymcdonald42
    -----
    Thanks for watching, if you want to connect you can find me at the links below:
    / andymcdonaldgeo
    / geoandymcd
    / andymcdonaldgeo
    www.andymcdonald.scot/
    Be sure to sign up for my newsletter to be kept updated when I post and share new content on CZcams and Medium.
    fabulous-founder-2965.ck.page...
    #petrophysics #python #MachineLearning #unsupervised-learning
  • Science & Technology

Comments • 91

  • @JeanLouisKali
    @JeanLouisKali 3 days ago

    Great presentation. The clearest I've seen on CZcams, to date. 👍

  • @moaiedbetamour6078
    @moaiedbetamour6078 1 year ago +1

    Very nice, simple, clear and to the point. Thank you for sharing.

  • @SouthwestStet
    @SouthwestStet 10 months ago

    This was such a fantastic tutorial, thank you for putting quality content out there.

  • @beyzamutlu7379
    @beyzamutlu7379 8 months ago +1

    That was the best explanation I have watched for K-means clustering, thank you 😊

  • @allansalles8895
    @allansalles8895 1 year ago

    Thanks again for the content, Andy! You're a great teacher!

    • @AndyMcDonald42
      @AndyMcDonald42 1 year ago

      Thanks Allan. Glad to hear you are enjoying the content.

  • @user-en5mi5zc1s
    @user-en5mi5zc1s 1 year ago +1

    Thank you! The example script is a huge help

  • @bb3132
    @bb3132 2 years ago +1

    Andy - Your videos are very helpful and informative! Thank you!

  • @AaromGuillaume-er8pe
    @AaromGuillaume-er8pe 1 year ago

    Explained this better than my professor. Big W

  • @MultiDrag90
    @MultiDrag90 7 months ago +1

    Excellent tutorial! Thank you very much for your time

  • @abdoulazizmahamadouhamidou2244

    Thanks! I am a geoscientist just starting my data science journey, and I find your videos very helpful.

    • @user-bv7dy1pn7w
      @user-bv7dy1pn7w 1 month ago

      Please can you help me? I want to know more about applying data science in geosciences.

  • @letsjoinhands
    @letsjoinhands 1 year ago +1

    Your fluency and skill, simply superb! Keep it up!

  • @olaal-najjar7391
    @olaal-najjar7391 2 years ago

    Absolutely useful. Thank you Andy

  • @tylerpargiter642
    @tylerpargiter642 1 year ago

    Very useful, thank you! I'm midway through a data analysis apprenticeship and this helped me a lot!

    • @AndyMcDonald42
      @AndyMcDonald42 1 year ago

      You're very welcome! I am glad to hear it has been helpful.

  • @calfredie0170
    @calfredie0170 1 year ago +2

    Amazing video you have put together here. I enjoyed how clear you were, as well as the pace you took to go through the steps and explain everything. I am new to this kind of thing, so does anyone have resources on where I can learn how to interpret cluster graphs?

  • @youkendoit123
    @youkendoit123 1 year ago

    Amazing video, thank you Sir

  • @robikurniawan8507
    @robikurniawan8507 2 years ago

    Thank you Andy for sharing 🙏🙏

  • @mafaldanunes774
    @mafaldanunes774 5 months ago

    THANKS YOUUUUU AHHHHH SO HAPPY I DID IT

  • @shahzaibkhan7215
    @shahzaibkhan7215 11 months ago

    Precise and clear 👍👍 Please explain naive Bayes, support vector machines and decision trees as well.

  • @thirteen174
    @thirteen174 2 years ago

    Thank you so much !!

  • @katieweir4166
    @katieweir4166 1 year ago

    Yasss! A fellow Scot!!!

  • @mohammadkeshtkar9655
    @mohammadkeshtkar9655 2 years ago

    Hi Andy, I think you have started the machine learning topic, and it's my favorite topic. Thank you 🙏🙏

    • @AndyMcDonald42
      @AndyMcDonald42 2 years ago

      I will be jumping between some Python topics and machine learning topics over the future episodes. Are there any particular algorithms you would like to see covered?

  • @mominabdlhamed2098
    @mominabdlhamed2098 1 year ago

    What a great tutorial, thanks a lot🥰🥰

  • @josedavidbastoaguirre2099

    Pretty cool. I have used K-means and DBSCAN to identify electrofacies, but I am still working on a way to optimize this task.
    It would be great to see the well plots (depth vs. logs) with each point identified by its own cluster.

    • @AndyMcDonald42
      @AndyMcDonald42 2 years ago

      Thanks Jose. I did have a section of code for displaying the facies data on a log plot but I did not include it in the video. The full plotting code can be found here: towardsdatascience.com/how-to-use-unsupervised-learning-to-cluster-well-log-data-using-python-a552713748b5

  • @FLEXTRAILERSandTEASERS-lw3ds

    i liked it, had to hit that belllll

  • @Kittys_life0
    @Kittys_life0 2 years ago

    Thanks a lot for your helpful videos.

  • @eyo3303
    @eyo3303 1 year ago

    great content

  • @alopix5468
    @alopix5468 1 year ago

    Hey! great video, only one question. What if I want to set my own centroids?

  • @kkamalpha
    @kkamalpha 2 years ago +1

    Thanks! I have been doing this on resistivity and seismic values on different profiles in a catchment. However, every time I get the same trend, but the clusters change places. I would like to know about this issue...

  • @stephenmackenzie9016
    @stephenmackenzie9016 2 years ago

    Excellent thanks

  • @craigsmith941
    @craigsmith941 1 month ago

    Hi Andy, this was a great tutorial as it's something I would like to try on a csv file with various metrics in the design of a pharmaceutical. I have one question though: I will be wanting to use 5-7 columns on the csv file for clustering - how do you go about visually representing this? I can't think of a good way to do it. Thanks!

  • @dayansaynes6691
    @dayansaynes6691 1 year ago

    Thanks a lot!

  • @caothuydung
    @caothuydung 8 months ago

    thanks a lot

  • @pixelkeckleon1171
    @pixelkeckleon1171 1 year ago

    Too good

  • @pattylu8568
    @pattylu8568 1 year ago

    Thank you so much, Andy! I really find your video helpful. I am just wondering whether it would be possible for us to draw the scatter plot in multiple dimensions? Cuz I followed all of your steps but could not continue past the elbow plot when using my 500-column dataframe.

    • @AndyMcDonald42
      @AndyMcDonald42 1 year ago +1

      Thanks Patty.
      You would only be able to draw the scatter plot up to 3 dimensions (X, Y and Z). However, you could look at using Seaborn's Pairplot to view 2d scatter plots of each of the variables versus the others: czcams.com/video/D5DPZyge31g/video.html
      I would be wary though of using 500 features with this plot as it will become unwieldy.
      I would be asking myself the following in your situation:
      - Do I require all 500 columns?
      - Are all of the columns relevant?
      - Can I reduce them manually, or look at algorithms such as PCA to reduce the dimensionality of the dataset?
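
The PCA suggestion above could look roughly like the sketch below. It is only an illustration under stated assumptions: the wide table is synthetic random data, the column names (`feature_0` and so on) are hypothetical, and the 90% variance threshold and three clusters are arbitrary choices rather than anything recommended in the video.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide table: 200 rows, 50 columns (hypothetical data).
rng = np.random.default_rng(42)
wide_df = pd.DataFrame(
    rng.normal(size=(200, 50)),
    columns=[f"feature_{i}" for i in range(50)],
)

# Scale first so no single column dominates the principal components.
scaled = StandardScaler().fit_transform(wide_df.to_numpy())

# Keep as many components as are needed to explain about 90% of the variance.
pca = PCA(n_components=0.90, svd_solver="full")
reduced = pca.fit_transform(scaled)

# Cluster in the reduced space instead of on all original columns.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(reduced)

# The first two components give a 2D view that can be scatter-plotted.
plot_df = pd.DataFrame(reduced[:, :2], columns=["PC1", "PC2"])
plot_df["cluster"] = labels
print(reduced.shape)
```

Plotting `PC1` against `PC2`, coloured by `cluster`, then gives a single scatter plot even when the original table is far too wide for a pair plot.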

  • @laveshagrawal4241
    @laveshagrawal4241 4 months ago

    Excellent presentation and explanation.
    Is there a place where I can see the code you have written for this? That would help me in learning. Thanks

  • @guanyilu5498
    @guanyilu5498 1 year ago

    Hi, thanks for the video. Could you please tell me which file in your GitHub is the Jupyter notebook for this video? I could not find it. Thanks

  • @vitorcastro42
    @vitorcastro42 7 months ago

    Solid video :)
    Btw, where is your accent from?

  • @syifasyuhaidahazman2384

    Very helpful. If you could use an example that is easily understandable for the non-science community, that would be extra helpful!!!

  • @jialicai6096
    @jialicai6096 1 year ago

    Thank you Andy, great video! What if I want to cluster more than 2 variables?

    • @AndyMcDonald42
      @AndyMcDonald42 1 year ago

      In the .fit() call at 12:00 you would pass in more variables. I have just used 2 for this example to illustrate what the output is like.
      Hope that helps :)
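
As a rough sketch of passing more variables into `.fit()`: the example below clusters on three columns instead of two. The data is synthetic, and `GR` as the third curve, the value ranges and the `kmeans_3` column name are assumptions made for illustration, not the notebook's actual contents.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical well-log-style table; GR is added as a third curve.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "NPHI": rng.uniform(0.05, 0.45, 300),
    "RHOB": rng.uniform(1.95, 2.95, 300),
    "GR": rng.uniform(10, 150, 300),
})

# List every column that should take part in the clustering.
features = ["NPHI", "RHOB", "GR"]
scaled = StandardScaler().fit_transform(df[features].to_numpy())

# The call is the same as in the two-variable case; only the input is wider.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(scaled)
df["kmeans_3"] = kmeans.labels_

# One centroid per cluster, one coordinate per feature.
print(kmeans.cluster_centers_.shape)
```

Adding or removing names in `features` is the only change needed to cluster on a different set of curves.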

  • @timothysham6409
    @timothysham6409 1 year ago +1

    Andy, thanks for sharing. I can’t find the notebook for this specific exercise. I am trying to follow along with a different dataset but I am getting an error “name ‘means’ is not defined” when trying to determine the number of clusters.

    • @AndyMcDonald42
      @AndyMcDonald42 1 year ago

      Hi Timothy, did you manage to resolve this?
      If not, I would go back and check you have run all of the cells before trying to determine the number of clusters.
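
For context, a `NameError` like this appears whenever a name is used before the code that creates it has run. The sketch below is an assumption about what an elbow-plot helper might look like, not the notebook's actual code: `inertia_per_k` is a hypothetical function, and the data is synthetic. The point is that `means` is created inside the helper before anything is appended to it.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_per_k(data, max_k):
    """Return the k values tried and the inertia for each (hypothetical helper)."""
    means = []      # created here, so it exists before it is appended to
    inertias = []
    for k in range(1, max_k + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        means.append(k)
        inertias.append(model.inertia_)
    return means, inertias


# Three well-separated synthetic blobs, 50 points each.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0, 3, 6)])

means, inertias = inertia_per_k(data, 6)
print(means)
```

Plotting `inertias` against `means` gives the elbow curve; if the cell defining the helper or its lists has not been executed, the same error is raised.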

  • @user-sd2cd2vj1f
    @user-sd2cd2vj1f 3 months ago

    Could you please share the link to get the dataset?

  • @yesicamagnoli651
    @yesicamagnoli651 1 year ago

    Thank you Andy! I just want to ask you where can I find this notebook to download and work with it? Thanks again!

    • @AndyMcDonald42
      @AndyMcDonald42 1 year ago +1

      Sorry for the late reply. I realised I hadn't uploaded the file to the repo. You can find it here: github.com/andymcdgeo/Petrophysics-Python-Series
      It is Notebook 18.

    • @yesicamagnoli651
      @yesicamagnoli651 1 year ago

      @AndyMcDonald42 thank you!! Please, keep on doing videos like this, I've been learning a lot!

  • @ajithkhan7314
    @ajithkhan7314 1 year ago

    Okay. So how do you draw conclusions from these clusters? I mean, what are your insights from this model?

  • @mostafakhalid8332
    @mostafakhalid8332 1 year ago

    An error is raised after writing (kmeans_3) while plotting (NPHI vs. RHOB)

  • @cocoshih2948
    @cocoshih2948 1 year ago

    I have trouble using kmeans.labels_ at the end. It keeps showing this error: 'numpy.ndarray' object has no attribute 'labels_'. Can someone help me with this? Thank you!

  • @luisnazareth9193
    @luisnazareth9193 2 years ago

    Andy, I get some NaN values in the dataset, and then when I try to run "df.dropna(inplace = True)", the whole dataset becomes empty (zero rows). How do I handle this? Thank you

    • @AndyMcDonald42
      @AndyMcDonald42 2 years ago

      I would check if one or more columns are entirely NaN.
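
A minimal sketch of that check, using a small made-up table (the `DTS` column and all values are hypothetical): a single column that is empty throughout is enough for `dropna()` to discard every row, so such columns are dropped first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "NPHI": [0.21, 0.25, np.nan, 0.30],
    "RHOB": [2.45, 2.40, 2.38, 2.31],
    "DTS": [np.nan, np.nan, np.nan, np.nan],  # hypothetical, entirely empty curve
})

# Columns in which every value is missing.
all_nan_cols = df.columns[df.isna().all()].tolist()

# Row count if dropna() is applied directly: the empty column removes every row.
rows_left_naive = len(df.dropna())

# Drop the empty columns first, then drop the rows that still contain gaps.
cleaned = df.drop(columns=all_nan_cols).dropna()

print(all_nan_cols, rows_left_naive, len(cleaned))
```

`df.dropna(axis=1, how="all")` is an equivalent way to remove the fully empty columns in one call.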

  • @tsarm___
    @tsarm___ 1 year ago

    I have a problem when trying to calculate this using Excel: the result is different from the code. What can I do to fix it?

  • @hieunguyenminh1558
    @hieunguyenminh1558 7 months ago

    How do I create input and output lines? Please help.

  • @abdolkarimmehrparvar6583
    @abdolkarimmehrparvar6583 5 months ago

    I cannot find the notebook file for this video in your GitHub repository.

  • @chottomtaki
    @chottomtaki 2 years ago

    Hello Andy, thanks for the well-explained session. On the final part, can you explain which features or measures differentiate one cluster from another? Thanks again

    • @AndyMcDonald42
      @AndyMcDonald42 2 years ago +1

      Thanks Dominic.
      One way would be to use a facet grid plot from seaborn and split by the clusters. You could then view the data by histograms, scatter plots and other plot types. That way you can see how the data features vary per cluster
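
Alongside the facet-grid plots, a per-cluster summary table gives the same comparison in numbers. The sketch below swaps the seaborn plot for a plain pandas `groupby`, which is a substitution made here for brevity; the two synthetic rock types and the `cluster` column name are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two synthetic groups: low porosity/high density and high porosity/low density.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "NPHI": np.concatenate([rng.normal(0.10, 0.02, 100), rng.normal(0.35, 0.03, 100)]),
    "RHOB": np.concatenate([rng.normal(2.65, 0.03, 100), rng.normal(2.25, 0.04, 100)]),
})

scaled = StandardScaler().fit_transform(df[["NPHI", "RHOB"]].to_numpy())
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=7).fit_predict(scaled)

# Mean, spread and range of each feature, split by cluster.
summary = df.groupby("cluster")[["NPHI", "RHOB"]].agg(["mean", "std", "min", "max"])
print(summary.round(3))
```

Features whose means differ strongly between rows of `summary` are the ones that separate the clusters; the same `cluster` column can be passed to a seaborn facet grid to see the full distributions.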

    • @chottomtaki
      @chottomtaki 2 years ago

      @AndyMcDonald42 Thanks Andy, this is useful. I really appreciate it.

  • @TeeFat
    @TeeFat 9 months ago

    Thank you so much for this video. I downloaded the data you used and found a negative relationship between RHOB and NPHI. Can you tell me how your scatterplot shows a positive relationship between them? Thank you.

    • @AndyMcDonald42
      @AndyMcDonald42 9 months ago

      No problem. You are correct that NPHI and RHOB are usually anti-correlated. In petrophysics, we normally display RHOB on an inverted scale, often on the Y-axis. As RHOB values get lower, we likely have a higher porosity, and the values will plot higher up on the y-axis. For higher NPHI (neutron porosity) values, the points will plot further to the right. If we have a case where NPHI is high and RHOB is low, the points will then plot in the top right. It's a nice and easy way to visualise and identify potential reservoir intervals.
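
The effect of the inverted scale can be reproduced with a short matplotlib sketch. The data here is synthetic: the linear density-porosity relation and its constants (2.65 and 1.65) are assumptions chosen only to produce an anti-correlated cloud, not values taken from the video's dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic anti-correlated curves: density falls as neutron porosity rises.
rng = np.random.default_rng(3)
nphi = rng.uniform(0.05, 0.45, 200)
rhob = 2.65 - 1.65 * nphi + rng.normal(0, 0.03, 200)

# The correlation coefficient of the raw data is negative.
corr = np.corrcoef(nphi, rhob)[0, 1]

fig, ax = plt.subplots()
ax.scatter(nphi, rhob, s=8)
ax.set_xlabel("NPHI")
ax.set_ylabel("RHOB")

# Flip the y-axis so low density (high porosity) plots towards the top.
ax.invert_yaxis()
print(round(corr, 2))
```

With the axis flipped, the cloud rises from bottom left to top right on screen even though the underlying correlation is still negative.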

    • @TeeFat
      @TeeFat 9 months ago

      @AndyMcDonald42 Thank you so much. I am using it to cluster customer data, but I wanted to make sure I could replicate yours before trying. Thank you again for the explanation and such an awesome tutorial.

  • @aboodfal4780
    @aboodfal4780 1 year ago +1

    I’ve searched for this file in the github repository and I didn’t find this tutorial’s code file

  • @ahmetatasever8315
    @ahmetatasever8315 1 year ago +1

    Hi, I have one question about the scatter plot at 13:21. Why were 'NPHI' and 'RHOB' written in 'plt.scatter()' when all calculations were done on the scaled data (I mean 'NPHI_T' and 'RHOB_T')? I am just trying to learn it. Could you please help me?

    • @AndyMcDonald42
      @AndyMcDonald42 1 year ago +1

      Using the scaled data within certain algorithms can reduce the effect of different data ranges (e.g. feature1 ranges from 0 to 1, and feature2 ranges from 0.1 to 10,000), and scaling can also help speed things up. Some algorithms such as decision trees/random forests don't really need scaling, whereas neural networks and even clustering can benefit from this process.
      Plotting the data using the original curves allows us to see how the calculated clusters align with the original data. If we were using scaled data, then the numbers on the axes wouldn't make too much sense for petrophysical interpretation.
      Hope that helps :)
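
A minimal sketch of that split between scaled inputs and original-unit plotting, on synthetic data. The `NPHI_T`, `RHOB_T` and `kmeans_3` names follow the ones mentioned in this thread; the value ranges and the number of clusters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "NPHI": rng.uniform(0.05, 0.45, 300),
    "RHOB": rng.uniform(1.95, 2.95, 300),
})

# Scaled copies are used only as inputs to the algorithm.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["NPHI", "RHOB"]].to_numpy())
df["NPHI_T"] = scaled[:, 0]
df["RHOB_T"] = scaled[:, 1]

# Fit on the scaled columns; the labels line up row by row with the originals.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=5)
df["kmeans_3"] = kmeans.fit_predict(scaled)

# Centroids live in scaled space; map them back to log units for plotting.
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
print(centroids.round(3))
```

A scatter plot of `df["NPHI"]` against `df["RHOB"]`, coloured by `df["kmeans_3"]`, then shows clusters computed on scaled data with axes that still read in familiar units.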

    • @ahmetatasever8315
      @ahmetatasever8315 1 year ago +1

      @AndyMcDonald42 Yes. It helps. :) Thank you very much. I also have another question. Is there any way to get information about a point in the graph by clicking on it with the mouse, to see which data row it belongs to?

    • @AndyMcDonald42
      @AndyMcDonald42 1 year ago +1

      @ahmetatasever8315 Yes, there certainly is. The plot shown in this video was done with matplotlib, which is used to create a basic and static figure. You could easily swap that out for Plotly, which will have the extra interactivity and give extra info on hover.

    • @ahmetatasever8315
      @ahmetatasever8315 1 year ago

      @AndyMcDonald42 Thank you again :)

  • @lorenzos785
    @lorenzos785 2 years ago

    I'm working on clustering energy consumption profiles of a group of households, how should the starting dataset be structured?
    For each apartment I'm given the annual energy consumption profile (15 minutes frequency for 1 year), the number of appliances and the number of rooms

    • @AndyMcDonald42
      @AndyMcDonald42 2 years ago

      Sounds like an interesting task 🙂
      If I understand correctly, you have a continuous variable for the energy consumption and then fixed variables for the rest?
      Have you considered clustering based on the profiles alone and grouping them into something like high energy users and low energy users or early birds and night owls?
      After that you could then try to use the other properties to gain more insights

    • @AndyMcDonald42
      @AndyMcDonald42 2 years ago

      Maybe have a look at time series clustering techniques for grouping the profiles
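
One possible way to structure such a dataset, sketched on synthetic data: each household's year of 15-minute readings is collapsed into an average day (96 values), and k-means is run on those daily shapes. Everything here is an assumption made for illustration, including the `daily_shape` helper, the two peak hours and the choice of plain k-means rather than a dedicated time series clustering method.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(11)
steps_per_day = 96                      # 24 hours at 15-minute frequency
hours = np.arange(steps_per_day) / 4


def daily_shape(peak_hour):
    """Hypothetical daily load curve with a single peak."""
    return 0.1 + 0.5 * np.exp(-0.5 * ((hours - peak_hour) / 1.5) ** 2)


# 20 synthetic households: 10 peak at 07:00, 10 peak at 21:00, 365 days each.
peaks = [7] * 10 + [21] * 10
year = np.array([
    np.tile(daily_shape(p), 365) + rng.normal(0, 0.02, 365 * steps_per_day)
    for p in peaks
])

# One row per household: the average day, 96 features instead of 35,040.
mean_day = year.reshape(len(peaks), 365, steps_per_day).mean(axis=1)

# Divide by each household's total so the curve shape, not the volume, drives the clusters.
profiles = mean_day / mean_day.sum(axis=1, keepdims=True)

labels = KMeans(n_clusters=2, n_init=10, random_state=11).fit_predict(profiles)
print(labels)
```

The fixed properties (number of rooms, number of appliances) are left out of the clustering on purpose and can be compared across the resulting groups afterwards.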

  • @dragster100
    @dragster100 10 months ago

    Can I say that at the end of the day, the way of interpreting the clusters is kind of subjective especially when the dataset gets more complex? Since the results could vary quite a lot as you apply different clustering algorithms or tuning some of their parameters. So it could be quite subjective, no?

    • @AndyMcDonald42
      @AndyMcDonald42 10 months ago +1

      Yes. That is very true. It is down to you or the person doing the interpretation to understand what the cluster may represent. If another person does their own interpretation, they may have their own understanding of what the clusters represent.

  • @abdullah.montasheri
    @abdullah.montasheri 7 months ago

    Thank you, Andy, I could not find the notebook in your github.

    • @AndyMcDonald42
      @AndyMcDonald42 7 months ago

      I believe this may have been my original notebook. It contains much more detail than what I covered in the video. I hope this helps.
      github.com/andymcdgeo/Petrophysics-Python-Series/blob/master/18%20-%20Unsupervised%20Clustering%20for%20Lithofacies.ipynb

  • @fiqihnurhadi1266
    @fiqihnurhadi1266 5 days ago

    Sir, how do I cluster 2D data with size (512, 512)? Please help me, sir. Thank you.

  • @katieweir4166
    @katieweir4166 1 year ago

    It keeps saying name means not defined :(

  • @luckyramadhan346
    @luckyramadhan346 2 years ago +1

    finally, a non-indian accent speaker