Applied K-Means Clustering in R

  • Published 25 Jul 2024
  • Likes: 888 👍 · Dislikes: 5 👎 · 99.44% · Updated on 01-21-2023 11:57:17 EST
    An easy-to-follow guide to K-Means clustering in R! It covers both the theory and the applications of this classic unsupervised learning technique.
    Thanks for watching! Let me know what you think. Are there any issues? Please let me know in the comments below.
    Please Like and Subscribe! :)
    GitHub link to the R script ➡ github.com/SpencerPao/Data_Sc...
    0:00 - What is Unsupervised Learning ?
    0:41 - Where to use Unsupervised Learning?
    1:23 - Brief Theory of K-Means Clustering
    4:30 - Beginning R Walkthrough on K-Means
    6:25 - Steps for K-Means Clustering
    8:57 - Finding Optimal number of K-Means Clusters
    11:10 - Running K-Means Cluster
    12:32 - Visualizing Clusters
    15:00 - Results
    15:58 - Next Steps?
  • Science & Technology
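
The steps listed in the timestamps (scale the features, pick k with an elbow plot, run k-means, visualize) can be sketched in a few lines of R. This is a minimal sketch assuming the factoextra package and the built-in iris data, not the exact script from the video (that is in the GitHub link above):

```r
# Sketch of the workflow covered in the video, using the built-in iris data.
# Assumes factoextra is installed: install.packages("factoextra")
library(factoextra)

iris_data <- iris[, 1:4]              # drop the Species label for unsupervised use
iris_data_scale <- scale(iris_data)   # standardize so no feature dominates the distance

# Elbow plot: total within-cluster sum of squares vs. number of clusters
fviz_nbclust(iris_data_scale, kmeans, method = "wss")

# Fit k-means with the chosen k (3 here) and visualize the clusters
set.seed(123)                         # k-means starts are random; fix the seed
km.clusters <- kmeans(iris_data_scale, centers = 3, nstart = 25)
fviz_cluster(list(data = iris_data_scale, cluster = km.clusters$cluster))
```

`nstart = 25` reruns the algorithm from 25 random starting configurations and keeps the best fit, which makes the result much less sensitive to the random initialization.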

Comments • 177

  • @mdtowhidurrahman8406 • 3 years ago

    Thanks, Spencer, for the awesome videos. They were a one-shot, accurate fit for my problem statements.

  • @johneagle4384 • 2 years ago +2

    I love videos like yours: clear, succinct, and to the point. Thank you.

  • @sevdasattari7425 • 2 years ago +2

    Thank you so much for your great teaching and video!

  • @miayayayaya • a year ago +2

    I really like how you explain each piece of every line of the code, and how the whole logic works! More videos like this please!

  • @misallen • a year ago

    Thanks for the tutorial; love the presentation!

  • @mentionitfootball • 7 months ago +1

    Awesome videos Spencer. Your videos on Hierarchical Clustering and K-Means are going to help me create some analytics based sports content. They really helped!

  • @Za3DoRzX • 11 months ago +1

    Excellent video Spencer, able to explain the concept like a pro!

  • @noufalzed8513 • 2 years ago

    Amazing explanation; you helped me with the K-means clustering in my master's thesis. Thank you so much, Spencer.

  • @lorenzomarino1884 • 2 years ago

    Well done! Your video was crucial to prepare the Data Analysis exam. Maaany thanks indeed!!

  • @ttr50nicola • 3 years ago

    Best video, thank you! I'll be using this to implement k-means on my project soon :)

  • @siavashbahramian • a month ago

    Great video!

  • @the_sniperderek9163 • 10 months ago

    Thank you for the video, it helped me a lot!!!!

  • @user-ni3mf1ou9r • 2 years ago +1

    Thank you so very much!!! I was struggling with my dataset, and you explained it so well. I was under so much stress because of the errors I was getting, but I'm actually not far off from the code you wrote, and I'm just so glad I actually knew what I was doing and was only hitting some obstacles along the way. I feel so much better now!

  • @haniehsartipi9700 • 2 years ago

    Great as always 🤍

  • @dianabaigarina80 • 3 years ago

    Thanks heaps for such a great video!!!

  • @rishabhraghav5280 • a year ago

    Thank you so much, sir. This tutorial really helped me with my data.

  • @pakwidi531 • a year ago

    Thank you so much.

  • @albertannimae5587 • a year ago

    Thanks for this amazing tutorial; it helped me finish my school project.

  • @Florian-lu2zo • 2 years ago

    Very nice video. I already knew how k-means works, but it's a different thing to use it in R.

  • @mmanolakis1 • a year ago

    Thanks!

  • @divyasharma4812 • 7 months ago

    amazing

  • @conmeonaoca • 2 years ago

    This is a really nice video; you explained everything very clearly. Thank you so so much!
    In the future, can you do a video on hierarchical clustering, pleaseeeee?

    • @SpencerPaoHere • 2 years ago

      Well, you're in luck!
      I did a video on that topic! Check it out here.
      czcams.com/video/MAUs4484TG8/video.html

  • @QuantCake247 • 3 years ago

    loved it

  • @pradeepwijayawickrama507

    Thank you

  • @elifceyhan78 • a year ago

    Hello, thanks for this great video! It is helpful. What do you suggest for a stability test in R? Which function can I use?

  • @jeanhwang18 • a year ago

    Clear and easy to understand, thank you. Will you show how to use silhouette width to compare different K? It'd be much appreciated :)

    • @SpencerPaoHere • a year ago

      At a high level, similar to the elbow plot method, you can visualize the silhouette score vs. the number of clusters on an x-y graph. The peak of silhouette score vs. cluster count tells you the recommended number of clusters to use.
      This article details this way better than my comment will:
      towardsdatascience.com/silhouette-method-better-than-elbow-method-to-find-optimal-clusters-378d62ff6891

  • @nikeshdubey4129 • 3 years ago

    thank you

  • @a.alheraky4018 • 9 months ago

    Thanks a lot for this very insightful instruction. Is there a way to remove the labels and fill/color the points according to species instead?

  • @nshah94 • 2 years ago

    Smashed it

  • @tathagatochakraborty7264

    Thanks, Spencer, for this great video and explanation. I have replicated this with my own dataset, and it worked! However, I am intrigued by the X and Y axes and what Dim1 and Dim2 represent. Could you share some thoughts about it?

    • @SpencerPaoHere • a year ago +1

      In the backend, I believe that fviz_cluster uses PCA, since the iris dataset has more than 2 features. So, the axes used (X and Y) are the components that explain the most variability, hence Dim1 and Dim2. This is used to help visualize the clusters better. (3-dimensional plots may be better.)

  • @bforbeat • 2 years ago

    Could you make a complementary video where you use the identifiers in a master dataset and analyze further?
    Great job btw! Subscribed 100%

    • @SpencerPaoHere • 2 years ago +1

      Thanks for the support!
      I could maybe do that, though it'd be a classification case study, which may be very niche to this specific dataset (flowers).

    • @bforbeat • 2 years ago

      @@SpencerPaoHere Yes that’s exactly what my thesis is about. If you also have any further resources, which can help me with that topic I’d be very grateful!

  • @edwardlianto3547 • 2 years ago

    Hi, Spencer, thank you for the video! You've explained a lot of details in such a short time!
    Regarding the Cluster Plot, may I ask what are the Dim1 and Dim2?
    Thank you!

    • @SpencerPaoHere • 2 years ago

      Hi! The fviz_cluster (behind the scenes) is using PCA to shrink the 4 dimensions (as shown in the video) to 2 dimensions for plotting purposes. In this case, PCA uses its first 2 dimensions. So, Dim1 and Dim2 explain 95% of the variation.
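
The PCA behind Dim1/Dim2 can be reproduced directly with base R's prcomp; this is a sketch of the idea, not the internals of fviz_cluster:

```r
# PCA on the four iris features, standardized (scale. = TRUE).
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)       # "Cumulative Proportion" for PC1+PC2 is roughly 0.96 on iris

# Dim1/Dim2 in the cluster plot correspond to the first two principal
# component scores, i.e. each observation projected onto PC1 and PC2:
head(pca$x[, 1:2])
```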

    • @edwardlianto3547 • 2 years ago

      @@SpencerPaoHere Wow, thank you! Got it, really appreciate your quick response, Spencer. Thanks a lot 😄

  • @qrisp504 • 2 years ago +1

    Hi, at 11:57 you said we can assign these groups to our dataset as additional features. Can you tell me how to do that? Like, what's the command for it? I'm having a hard time.
    Edit: nvm, I got it; I was making a silly mistake. Great tutorial, thanks.

  • @ManaRogers-zf3nn • 2 months ago

    Hello! Is there a way to change the colors such that each point for your species has a specific color and the clusters are colored separately?

  • @spikeydude114 • a year ago

    Great video! What about data you don't have labels for? Unsupervised

    • @SpencerPaoHere • a year ago +1

      That is unsupervised learning; you could try K-means, hierarchical clustering, or PCA. You just need your features to group similar items together. I am certain that there are other unsupervised ML algorithms as well.

  • @andreaomdahl2871 • 2 years ago

    Thank you so much!
    While using this method on my own dataset I get the error message "non-numeric argument to binary operator" when running fviz_cluster(list(dataset_scale, cluster - km.clusters)). The dataset consists only of numeric values (except for the label/first column), so I cannot figure out what's wrong. Do you know a solution?

    • @SpencerPaoHere • 2 years ago

      Hmm. My guess is this: When plotting, try to only use the variables that are of numeric type. You might be passing in the entire dataset. Make sure the features you are passing into the function are only of numeric type.
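
A defensive pattern for this error is to drop the non-numeric columns before scaling and plotting. (Note, too, that the call quoted above contains `cluster - km.clusters`; if that minus sign is not a transcription slip, the `-` alone would raise "non-numeric argument to binary operator" — the list element should be something like `cluster = km.clusters$cluster`.) A base-R sketch, with iris standing in for the commenter's dataset:

```r
df <- iris                                 # stand-in for your own dataset
numeric_df <- df[sapply(df, is.numeric)]   # keep numeric columns only (drops Species)
dataset_scale <- scale(numeric_df)         # now safe to scale and cluster
str(dataset_scale)
```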

  • @4evrinanime • 2 years ago +1

    Hi, Python user here, and we don't have the fviz_cluster function, but it is amazing! From a Google search it seems the function performs PCA on the cluster features in order to plot the points in 2D, and then applies the cluster label to color the points. I tried to do the same in Python but am seeing different shapes for the PCA plots. Do you know if there is more pre-processing happening in fviz_cluster?

    • @SpencerPaoHere • 2 years ago

      I love to use Python myself :)
      I don't know the specifics of the backend of fviz_cluster, but I suspect that the PCA is utilized to create the shapes that outline the clusters. I might be wrong, though. And I am not too familiar with shape creation, unfortunately.

  • @user-cr6mc3ww1l • a year ago

    Thank you, I also watched the Hierarchical Clustering video; amazing!
    I only have one question: in the final plot, what are Dim1 and Dim2?

    • @SpencerPaoHere • a year ago +1

      Dim1 and Dim2 are an amalgamation of all the features (i.e., transformed features) of your dataset. fviz_cluster uses PCA in the backend.

  • @ancamihaelasuteu3184 • 2 years ago

    Thank you for this video! It helped me a lot with my dataset!
    I was wondering: as I work on a species presence and absence matrix, do you think there would be a different function to be used instead of "dist"? I am asking, because my data being qualitative, I know that Jaccard or Dice indices should be used (instead of Euclidean)...

    • @SpencerPaoHere • 2 years ago

      Are you referring to functions that compute different distances? There is in fact a library that does just that. You can check out a notebook here:
      cran.r-project.org/web/packages/philentropy/vignettes/Distances.html

    • @ancamihaelasuteu3184 • 2 years ago

      @@SpencerPaoHere Thank you, it was helpful. But my question/problem still remains and until now I have not found an answer to it. My data is binary, meaning value 1 for presence of a species and 0 for absence. Almost all functions (fviz_nbclust, fviz_dist, get_clust_tendency etc) use Euclidean distance, but I am not sure if it is compatible with binary data. I will continue my research. Thank you again for your video and time.

    • @SpencerPaoHere • 2 years ago +1

      @@ancamihaelasuteu3184 Are your independent variables binary values or some class value? Makes sense. When it comes to finding the distance between categorical variables, try looking into categorical similarity measures such as Eskin, Overlap, IOF, OF, Lin, etc.
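
For the presence/absence case discussed in this thread, base R's dist() already provides a Jaccard-style measure via method = "binary" (rows are treated as sets, and joint absences are ignored). The resulting dissimilarity object can then be passed to methods that accept precomputed distances, such as hclust() or cluster::pam(), rather than k-means, which assumes Euclidean geometry. A toy sketch:

```r
# Toy 5-site presence/absence matrix (1 = species present, 0 = absent)
set.seed(1)
pa <- matrix(rbinom(30, 1, 0.5), nrow = 5)

# "binary" = proportion of mismatched bits among positions where at least one
# row has a 1 -- i.e. the Jaccard distance for presence/absence data
d_jaccard <- dist(pa, method = "binary")
d_jaccard

# This dissimilarity can feed hierarchical clustering directly:
hc <- hclust(d_jaccard)
cutree(hc, k = 2)
```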

  • @tmpcox • 2 years ago

    Thanks! How do I do this with 3 dimensions or more (I know it can make the graph somewhat messy)?

    • @SpencerPaoHere • 2 years ago

      If you are referring to XYZ coordinates, there are 3d spatial packages you can use in R. Though, I'd argue that Matlab is quite useful in this regard.
      This guide should be helpful to you:
      www.sthda.com/english/wiki/impressive-package-for-3d-and-4d-graph-r-software-and-data-visualization

  • @brucefox4954 • a year ago

    Super video so thanks, Spencer. Here’s a question. You used the scaled data in kmeans, but not the data after running the dist function. What, then, was the purpose of running the dist function? Thanks in advance.

    • @SpencerPaoHere • a year ago +1

      Hey! Thanks! Running the scaling and distance functions for k-means is important for calibrating the centroids to your dataset. Both attempt to reduce bias from high/low values; you want your data to be on a level playing field.

  • @deenaadil6153 • 2 years ago

    If I don't want to use a label for my clustering, and I will change k many times, how can I know what attribute R uses to make the clustering?
    Thank you

    • @SpencerPaoHere • 2 years ago

      If you have unlabeled data to begin with, you can use k-means clustering to do the labeling for you. Although, if you are still undecided on the number of clusters to use, the labeling will be off. I'd recommend sticking with one value of K and working from there.

  • @sachikogaming1137 • 2 years ago

    Is it necessary to check correlations among the variables first before proceeding to clustering? Is it important to select only variables that are correlated for the analysis?

    • @SpencerPaoHere • 2 years ago

      You don't necessarily need to have "correlated" features for k-means. However, it is important to identify collinear features for more appropriate results.

  • @gianluigiseccia5909 • 3 years ago

    After I implemented this clustering I would like to get the neighbors for a new observation, given its characteristics. How can I do that?

    • @SpencerPaoHere • 3 years ago

      Hmm. If I understand correctly, you have a new observation and want to find the observations that are closest to it?
      I'd probably put all the observations in a hyperspace, take that specific observation, and run a distance method (up to you which distance metric to use) to pinpoint which points are closest to it (using the scaling of the clusters to get the correct axes).
      What's the use case? Typically, you would just want to find which cluster that new observation belongs to.

  • @msds2930 • 2 years ago +1

    Thank you, Spencer!
    Can we perform k-means clustering with 3 parameters (features) in R? Because all of the demos available online only show 2D. As much as possible I want to use R for this, and ultimately for plotting the results, over Python 😀

    • @SpencerPaoHere • 2 years ago

      Hmm. I don't really follow. If you have 3 or more features, that should be fine. You pass them to kmeans(x, ...), where x is a numeric matrix (m × n); n can be as many features as you'd like. Are you referring to plotting in 3D?

    • @msds2930 • 2 years ago

      I see, I'll rewatch the video, thanks. Yes, that's the resulting plot I was expecting for 3 features, because that's what I saw in posts that used Python/Matlab. Can we also plot that format in R?

    • @SpencerPaoHere • 2 years ago

      @@msds2930 Yes indeed.
      Here is a great guide on how to plot in 3d in R
      www.r-graph-gallery.com/3d.html

  • @koacenk • 2 years ago

    Hi Spencer. Thanks for the video. I would like to know why you use the same name in lines 7 and 12. Will it change the result if the names are different?

    • @SpencerPaoHere • 2 years ago

      Hi!
      The line 7 variable was overwritten by the line 12 variable because they share the same name. In this case, the line 12 variable is the most recent one, with different logic. If you run line 7 once more, the values will be different. Think of it as reassigning the same object.
      Hope that helps.

  • @tansutazegul8297 • a year ago

    Great one. How can we predict the classification of a completely new dataset?

    • @SpencerPaoHere • a year ago +1

      Same as before: plug the completely new dataset into your trained algorithm to obtain the predicted classifications. (Your new dataset must have the same features and feature types.)

    • @tansutazegul8297 • a year ago

      @@SpencerPaoHere very much appreciated!

  • @truongphu7407 • 2 years ago

    I meant: how can we extract the clusters and add them to our initial data to do some descriptive statistics at the cluster level? Thanks

    • @SpencerPaoHere • 2 years ago

      You can try the following:
      Predict the cluster assignment for the incoming data, label the predictions with the cluster assignment (cbind), and write to CSV.
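
Base R's kmeans() has no predict() method, so a common workaround for the steps above is a small nearest-centroid helper. A sketch (the helper name `assign_cluster` is made up here), including the cbind/write.csv step:

```r
# Assign each new row to the nearest fitted centroid (squared Euclidean distance)
assign_cluster <- function(newdata, km_fit) {
  apply(newdata, 1, function(row) {
    which.min(colSums((t(km_fit$centers) - row)^2))
  })
}

set.seed(123)
train_scaled <- scale(iris[, 1:4])
km <- kmeans(train_scaled, centers = 3, nstart = 25)

# New data must be scaled with the *training* center/scale before assignment
new_raw    <- iris[1:5, 1:4]
new_scaled <- scale(new_raw,
                    center = attr(train_scaled, "scaled:center"),
                    scale  = attr(train_scaled, "scaled:scale"))

labeled <- cbind(new_raw, cluster = assign_cluster(new_scaled, km))
# write.csv(labeled, "clustered.csv", row.names = FALSE)
```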

  • @ankarata • a year ago

    Thanks a lot for the clear explanation! I am trying to replicate your code with a dataset with missing values. Could you guide me on the best way to deal with that?

    • @SpencerPaoHere • a year ago

      Sure! How can I help? It seems that maybe you need to clean your dataset a little bit to address the NAs?

    • @ankarata • a year ago

      @@SpencerPaoHere I got around my NAs by removing all rows that contained one or more NA values (with the complete.cases command). However, this significantly reduced the size of my dataset, and I was hoping there is a way to perform the k-means while excluding only the NAs!

    • @SpencerPaoHere • a year ago

      @@ankarata Have you tried imputation? I have a few videos on that concept!

    • @ankarata • a year ago

      @@SpencerPaoHere I would prefer not to estimate missing values. I think what I did is called listwise deletion, and I was hoping to find a method to perform pairwise deletion.

  • @truongphu7407 • 2 years ago

    Dear Pao, after the 3 clusters were identified, can we export a new data table in which each variety name is grouped by cluster (1, 2, and 3)? Thanks for your useful video!

    • @SpencerPaoHere • 2 years ago

      Yes, you absolutely can. Save your dataset as a dataframe (for example),
      and then write the dataframe to a CSV on your local machine: write.csv(dataframe_object, filename).

  • @neemya • 2 years ago

    I was stuck at 14:30; in RStudio it keeps saying that object 'km.clusters' was not found. I don't know where I went wrong.

    • @SpencerPaoHere • 2 years ago

      Have you attempted to run line 24? I assign that variable there, so it's "customized".

  • @lourdesarrueta5930 • 2 years ago

    Great video!! I learned a lot! Thanks, but I have one question. What happens if the first two PCs only explain 60% of the variation? Do you need to include a third PC?

    • @SpencerPaoHere • 2 years ago +1

      It really is up to you how much of the variation you want explained. Typically, folks go for ~80% of variance explained, so adding additional components may be necessary to reach that threshold.

    • @lourdesarrueta5930 • 2 years ago

      @@SpencerPaoHere Thank you for your quick reply! Do you also need to include the third PC in the cluster plot, like a 3D graph, or is it fine to use a 2D graph?

    • @SpencerPaoHere • 2 years ago +1

      @@lourdesarrueta5930 Imho 3D visualizations aren't a great way to visualize data (XYZ axes) -- you could, however, attempt to plot it on a 3D graph. (I think Matlab has some great functionality for that -- RStudio definitely does as well.)
      I personally try to stick with 2D graphs -- you can also look into matrix graphs, where you plot all the combinations of plots that share one axis.

  • @indexcards9414 • a year ago

    I'm working with a different dataset, and I applied na.omit, but I still get the error "NA/NaN/Inf in foreign function call (arg 1)" when I try to use fviz_nbclust(). Does anyone know why this might be happening?

    • @SpencerPaoHere • a year ago

      K-means can't handle NA values in the dataset. You're going to have to remove the null / non-numeric values.

  • @seanwestley1818 • 2 years ago

    @Spencer Pao This is a great video; however, I have a few questions about my datasets. I have an elemental analysis of trees I collected across the state, and I gave them a numerical and/or a categorical ID tag relating to their GPS position. Is there a way to test the compositions? I have 30 elements tested and several soil-regime markers, and I want to build a better understanding of what is clustering and what seems to be off..... Please advise if you do not mind.

    • @SpencerPaoHere • 2 years ago +1

      What do you mean by composition?
      If you want to judge how well your k-means clustering approach does, you can always cluster based on labeled data (just remember to remove the target variable when clustering); do an 80/20 train/test split, append the target variable back, and use some metric to evaluate the results.

    • @seanwestley1818 • 2 years ago

      Ok, thank you. I have one last question. With your labeling method (shared below, written with my data label)
      paste(Field_foliar$Samples, 5:dim(Selected_Field_Foliar1)[2], sep="_")
      I used this to view the data with the individual sample name, it worked great but I also collected these foliar clippings from trees associated by a soil type. The soil type is a repeated label type, is there a way to label within the cluster by repeated data?? So is there a way to label each dot with the soil type that might be M1006 and M1005 M1006 etc etc. I would have roughly 9 soil types.
      Thank you so much for your help and videos.

    • @SpencerPaoHere • 2 years ago +1

      @@seanwestley1818 Within the K Means cluster, you can label the corresponding records with whichever label the records are related to. You can include an additional column with the label(s) you mentioned and provide an overlay as I've done in the video.
      I hope that answered your question?

    • @seanwestley1818 • 2 years ago

      @@SpencerPaoHere Yes, however, when I tried using a different column it told me I could have a repeating label type. Is there a way to give it a label that repeats? The first one I created I followed your code (to the t ) and used my column with the sample title which are all unique however I want to know based on the soil type which some are listed as the same parent soil type but I want to see if these still cluster up together. How would I give it a column with repeating points? Which part of the code is how we were labeling it? If that makes more sense? Thank you so much.

    • @SpencerPaoHere • 2 years ago

      @@seanwestley1818 Hmm. I don't think I follow 100%.
      Are you asking for a sampling approach? i.e reproduce observations in a dataset?
      Or adding an additional column to your dataset with your inquired categorical information?
      Or relabeling observations in a cluster?

  • @gezalt4260 • 2 years ago

    I get "Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)" when running fviz_nbclust.

    • @SpencerPaoHere • 2 years ago

      Your dataset probably has empty/NaN values in it. Try removing those observations or imputing them.

  • @dwiamririzqiakbar6415 • 2 years ago

    Thanks, Spencer, but please make a video that uses Manhattan distance or another distance metric. I really need it to finish my paper.

    • @SpencerPaoHere • 2 years ago +1

      😁
      If you are looking to use k-means with a different distance metric (the default is Euclidean), note that base R's kmeans() only supports Euclidean distance; the amap package's Kmeans() accepts a method parameter instead:
      Kmeans(x, centers, iter.max = ..., nstart = 1, method = "manhattan")
      That function supports a variety of distance measures: "euclidean", "maximum", "manhattan", "canberra", "binary", "pearson", "abspearson", "abscorrelation", "correlation", "spearman", and "kendall".
      I hope that helped!
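
A sketch of the Manhattan-distance variant, assuming the amap package is installed (its Kmeans(), with a capital K, is a separate function from base stats::kmeans()):

```r
# install.packages("amap")  # assumed available
library(amap)

# Kmeans() from amap accepts a method argument, unlike base kmeans()
km_man <- Kmeans(scale(iris[, 1:4]), centers = 3, iter.max = 50,
                 method = "manhattan")
table(km_man$cluster)   # cluster sizes under Manhattan distance
```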

  • @arq.raquelruizportepetit2741

    Hi Spencer, I have a problem using the purrr library: I have version 0.3.5 loaded, but 1.0.1 is required, and with the latest version of R it doesn't work... What do you recommend? I can't see the clustering. What version of R are you working on? Thanks!!

    • @SpencerPaoHere • a year ago

      Weird.
      I am now using the most up-to-date R version, 4.3.0 as of 4/23/2023. It runs fine on my machine.

  • @AhmedAli-hk6zg • 2 years ago

    Good. What would you suggest for Call Detail Record data, where it has a geolocation variable?

    • @SpencerPaoHere • 2 years ago +1

      Hmm. If you have lat. and long. data, you probably wouldn't really need K means clustering applications. But, if your use case does require the algorithm, I'd use the haversine formula as the distance function.
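
The haversine great-circle distance mentioned above is easy to write as a plain R function (a sketch; inputs in decimal degrees, output in kilometers, assuming a 6371 km mean Earth radius):

```r
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  # haversine of the central angle between the two points
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

haversine_km(48.8566, 2.3522, 51.5074, -0.1278)  # Paris -> London, roughly 344 km
```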

    • @AhmedAli-hk6zg • 2 years ago

      @@SpencerPaoHere Yes! My data has 14 variables, including lat, long, datetime, caller, called, IMEI, IMSI, and duration. Basically, I've created a Shiny app to browse, explore, and visualize the data, but I'm not sure which algorithm I should use to extract significant locations and relations among callers.

    • @SpencerPaoHere • 2 years ago

      @@AhmedAli-hk6zg Have you thought about splitting your data into different datasets? You can use individual features to attain your goals (lat vs. long) and other feature datasets for something else. You don't necessarily need to plug all the features into a clustering algorithm. Though, you can try and see what happens.

  • @antonkohler7191 • 2 years ago +1

    First of all: great video, and it helped a lot!
    I have a question about row 12 (the dist function): why do you have to do it? You never use it again in the program.
    And what does the output mean exactly (the 6 rows)?

    • @SpencerPaoHere • 2 years ago

      Good point! I was using the dist function for demonstration purposes. When you actually run k-means, a specified distance metric is computed in the backend.
      I am not sure what you mean by output. Can you give me a timestamp?

    • @antonkohler7191 • 2 years ago

      @@SpencerPaoHere Ok good^^
      You show it at 8:03 😅

    • @SpencerPaoHere • 2 years ago

      @@antonkohler7191 Oh! That is the distance matrix (distances between pairs of observations) -- the diagonal is always 0 (in this case it is not shown, i.e., (1,1), (2,2), etc.).

    • @antonkohler7191 • 2 years ago

      @@SpencerPaoHere Ah, I see. And what are the pairs of objects? Centroids with data points, or the number of centroids with the average distance between centroids and data points, ...?

    • @SpencerPaoHere • 2 years ago

      @@antonkohler7191 Distances between pairs of observations!

  • @ilhembenhenda3416 • 2 years ago

    Can you help me? I need code for KNN with MapReduce in R.

  • @edoardomarchi9195 • a year ago

    Hey Spencer, I can't run fviz_nbclust; it doesn't let me for some reason. Do I have to download a specific package for that?

    • @SpencerPaoHere • a year ago

      Does install.packages("factoextra") not work? fviz_nbclust comes from that package.

    • @genesisandolivera3919 • 4 months ago

      @@SpencerPaoHere It did not work for me. I'm running R 4.3.3 -- does this matter?

  • @callmelabli3076 • 2 years ago

    Hi! Have you tried using the NbClust() function to determine the optimal cluster number?

    • @callmelabli3076 • 2 years ago

      I've been trying to do it, but I keep getting the message "The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated." That is when my index is set to "all". When I use a single index like "hartigan" alone, everything is fine, but I want to use index = "all" to choose the best number of clusters. How do I fix this?

    • @SpencerPaoHere • 2 years ago

      @@callmelabli3076 I am not too familiar with the issue unfortunately. But, this link might help?
      stackoverflow.com/questions/46067602/the-tss-matrix-is-indefinite-there-must-be-too-many-missing-values-the-index

    • @callmelabli3076 • 2 years ago

      @@SpencerPaoHere I see. Thank you so much!
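
For reference, a hedged sketch of an NbClust() call on the iris features (assuming the NbClust package is installed; because index = "all" computes a large battery of indices, missing values in the data are a common cause of the TSS error quoted above):

```r
# install.packages("NbClust")  # assumed available
library(NbClust)

# Vote across many validity indices for the best number of clusters
res <- NbClust(scale(iris[, 1:4]), distance = "euclidean",
               min.nc = 2, max.nc = 8, method = "kmeans", index = "all")
res$Best.nc   # best k suggested by each index
```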

  • @dataanalyst1012 • 2 years ago

    Hi Spencer, can I have more variables than observations in k-means cluster analysis?

    • @SpencerPaoHere • 2 years ago

      You most certainly can! Though, the results may vary...

    • @dataanalyst1012 • 2 years ago

      @@SpencerPaoHere Thank you! Anyway, this video is great. I learned a lot.

  • @thanachasrudomlert8575

    A nice video. Can we use k-means to cluster coordinate data?

    • @SpencerPaoHere • 2 years ago +1

      Yes indeed. :)

    • @thanachasrudomlert8575 • 2 years ago

      @@SpencerPaoHere Appreciate your fast response. How do I use k-means clustering on (x, y, z) coordinate data? For example, the surface roughness of a semiconductor wafer, similar to the image below:
      images.app.goo.gl/VJZfiQaw7i98K2Ea9

    • @SpencerPaoHere • 2 years ago

      @@thanachasrudomlert8575 You would use k-means the same way you do with any other multi-dimensional data. The main purpose of k-means is to find potential groupings of unlabeled observations within the data. So, in this particular use case, I would pass the 3 features (x, y, z) as a dataframe into the algorithm and see what relationships exist.

  • @muhammadgumilangangkasa171

    Is there any maximum amount of data for fviz_nbclust? When I run it, it says "Error: cannot allocate vector of size 78.8 Gb".

    • @SpencerPaoHere • 2 years ago

      There is technically no limit. It's just that your computer does not have enough memory to create the model. You might want to sample from your dataset.

    • @muhammadgumilangangkasa171 • 2 years ago

      @@SpencerPaoHere Is there any technique to pick or cut down the size of the dataset without changing the result?

    • @SpencerPaoHere • 2 years ago

      @@muhammadgumilangangkasa171 You can check out my sampling video here:
      czcams.com/video/8_4Ls7k1wyw/video.html

  • @GijsArkink • 2 years ago

    Great video! I'm only left with 2 questions and I hope you could help me with these:
    1. Is it possible to improve the % within-cluster sum of squares by cluster? (rows 20/21)
    2. I applied this to another dataset with 2 clusters, but my results were not amazing. Now when I plot with 4 clusters I get good results; the only thing I need to do is add clusters 1, 2, and 4 together. Is this possible?

    • @SpencerPaoHere • 2 years ago +1

      Glad you liked it!
      1) Probably. You'd most likely need to transform your dataset and play around with the distances between points to improve the statistic. But, at the end of the day, you would probably hit the "ceiling" of the model's effectiveness.
      2) You could "add" the clusters together, as in: obtain the labeled dependent variables and query them into a new dataset (and perhaps relabel them as one?).

    • @GijsArkink
      @GijsArkink 2 years ago

      @@SpencerPaoHere Okay, this seems logical! Thanks for the fast response.
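Folding several k-means clusters into one label, as discussed above, might be sketched like this (iris, centers = 4, and the group names are illustrative choices):

```r
set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 4, nstart = 25)

# Relabel: treat clusters 1, 2, and 4 as one group, cluster 3 as another
merged <- ifelse(km$cluster %in% c(1, 2, 4), "group_A", "group_B")
table(merged)
```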

  • @zqadri08
    @zqadri08 2 years ago

    Hello, I'm getting an error:
    Error in fviz_nbclust(iris_data_scale, kmeans, method = "wss") :
    could not find function "fviz_nbclust".
    I tried looking up the right package, but no luck. Any suggestions?

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago +1

      Have you tried install.packages("factoextra")?

    • @zqadri08
      @zqadri08 2 years ago +1

      @@SpencerPaoHere Sorry, yes, that fixed it. Thank you!
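Once factoextra is installed, the call from the video can be reproduced roughly like this (using the built-in iris data as a stand-in):

```r
library(factoextra)  # provides fviz_nbclust

iris_scaled <- scale(iris[, 1:4])

# Elbow ("wss") plot for choosing k; "silhouette" and "gap_stat" also work
fviz_nbclust(iris_scaled, kmeans, method = "wss")
```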

  • @username2537
    @username2537 2 years ago

    Is the scaling reversible?
    I've got a dataset with 4 numerical variables and I'm interested in the values of my cluster centroids.

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      Yes! You can unscale. Assuming that you are using the default scale function in R, DMwR has an unscale function you can use.
      Otherwise, you can refer to this post for other examples of how to unscale.
      stackoverflow.com/questions/10287545/backtransform-scale-for-plotting
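If DMwR is unavailable, base R alone can back-transform the centroids, since scale() stores its centering and scaling vectors as attributes (iris here is just a stand-in for a 4-variable dataset):

```r
iris_scaled <- scale(iris[, 1:4])

set.seed(7)
km <- kmeans(iris_scaled, centers = 3, nstart = 25)

# Undo scaling: multiply by the stored sds, then add back the stored means
centers_orig <- sweep(km$centers, 2, attr(iris_scaled, "scaled:scale"), "*")
centers_orig <- sweep(centers_orig, 2, attr(iris_scaled, "scaled:center"), "+")
centers_orig  # centroids in the original units
```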

    • @username2537
      @username2537 2 years ago

      Thanks for the quick answer.
      Unfortunately, when I try to install DMwR using 'install.packages("DMwR")' it says:
      'Warning in install.packages :
      package ‘DMwR’ is not available for this version of R'.
      Do you have an idea what to do?

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      @@username2537 Oh wow. They removed the package from CRAN. Interesting. You might have to get it via a mirror.
      You can also try the updated version, i.e. DMwR2.

    • @username2537
      @username2537 2 years ago

      @@SpencerPaoHere Hmmm...
      I tried the updated DMwR2, but there is no function called unscale...
      What exactly do you mean by getting it via a mirror?

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      @@username2537 As in, download straight from GitHub with the version requested.
      You can try this:
      library(devtools)
      install_github("cran/DMwR")

  • @HarpreetKaur-bx1ej
    @HarpreetKaur-bx1ej 2 years ago

    Hi, I am struggling.
    Is kmeans(iris, centers = 20) the correct way to perform a cluster analysis for 20 randomly selected iris data points?
    Or should nstart be 20?
    Thanks in advance

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      The centers = 20 argument is the number of clusters, so in this case you are asking for 20 clusters in your data. Given the number of observations in your dataset, k-means will most likely consider each observation as its own cluster. You might want to decrease that number to 2-3.

    • @HarpreetKaur-bx1ej
      @HarpreetKaur-bx1ej 2 years ago

      @@SpencerPaoHere Actually, the question is:
      Perform a cluster analysis for 20 randomly selected Swiss bank notes.
      So what is the 20?
      nstart or centers?

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      @@HarpreetKaur-bx1ej 'centers = 20', where 20 is the number of centers kmeans is optimizing for.
      Perhaps this documentation may help further: www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kmeans
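To make the two arguments concrete (iris used here as a stand-in dataset):

```r
x <- scale(iris[, 1:4])

# centers: the number of clusters k to fit
# nstart:  how many random initializations to try; the best run is kept
set.seed(20)
km <- kmeans(x, centers = 3, nstart = 20)

km$tot.withinss  # total within-cluster sum of squares of the best run
```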

  • @amiemeche5223
    @amiemeche5223 9 months ago

    Hi! I am learning R and I have a large data set of 100k observations with 23 variables. dist() gives me an error: cannot allocate vector of size 37.3 Gb. I have not had much luck with Google trying to resolve this issue. Would you happen to be able to help me?

    • @SpencerPaoHere
      @SpencerPaoHere  5 months ago

      Ahh yes. This is because your computer's storage/RAM is too small. You either have to offload it to an EC2 instance or randomly draw from the dataset.

    • @amiemeche5223
      @amiemeche5223 5 months ago

      @@SpencerPaoHere Thank you for your reply. I ended up doing a random sample. It did the trick for the outcome needed.

  • @limzijian98
    @limzijian98 6 months ago

    Sorry, what does nstart actually do?
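For anyone else wondering: in base R's kmeans, nstart is the number of random initial center sets tried; the function keeps the run with the lowest total within-cluster sum of squares. A quick sketch:

```r
x <- scale(iris[, 1:4])

set.seed(5)
single_start <- kmeans(x, centers = 3, nstart = 1)

set.seed(5)
multi_start <- kmeans(x, centers = 3, nstart = 25)

# The best of 25 starts can only match or beat a single start
multi_start$tot.withinss <= single_start$tot.withinss
```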

  • @scathach7639
    @scathach7639 a year ago

    Dude, I want to ask about the cluster plot result: is cluster 1 good or bad for recommending something?

    • @SpencerPaoHere
      @SpencerPaoHere  a year ago

      Is there a timestamp in the video I can reference? I am assuming the green cluster? In terms of accuracy for this group, it seems to be quite high for this dataset.

    • @scathach7639
      @scathach7639 a year ago

      @@SpencerPaoHere Ah, I'm sorry, I mean: is the blue cluster the highest/best cluster?

    • @SpencerPaoHere
      @SpencerPaoHere  a year ago

      @@scathach7639 It seems that the blue cluster does a good job of grouping virginica. (I don't have the literal statistics, but I did post the code on GitHub -- link in the description!) You can run through it and do additional analysis (i.e., run table on the predictions vs. true values); you can do the same analysis for the other clusters.
      Also, from the iris dataset that I was working with, all the clusters seem to be doing a great job of identifying the response variables.
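The "table on the predictions vs. true values" check mentioned above can be sketched as:

```r
set.seed(3)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

# Rows: cluster labels; columns: true species -- a near-diagonal pattern
# (up to label permutation) means the clusters track the species well
table(km$cluster, iris$Species)
```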

    • @scathach7639
      @scathach7639 a year ago

      @@SpencerPaoHere Ah, I see. Okay, thank you so much for explaining 🙏

  • @aaishaf7
    @aaishaf7 a year ago

    Hello, thanks for this video. Is there a way to remove the numbers next to the group names on the plot (e.g. setosa, virginica, etc. instead of setosa_33, virginica_118, etc.)? The plot becomes very squished together if the words are long.

    • @SpencerPaoHere
      @SpencerPaoHere  a year ago

      There are a variety of ways. A few solutions that come to mind: you could increase the plot resolution (maximize the image using the RStudio GUI), or, if you are referring to the points on the plot, you could rename them to something like S_1, S_2, ... V_118 if that helps (edit the dataframe and do some data cleaning).

  • @rumblerumble2276
    @rumblerumble2276 6 months ago

    I'm missing something: why was the data scaled? I thought scaling was done when the measurements use different scales? Aren't all measurements here in centimeters?

    • @dantshisungu395
      @dantshisungu395 6 months ago +1

      Features have different distributions as well. Assume a feature named "male" where the max value is 195 cm and a feature named "female" where the max value is 176 cm. The ranges are different, and the male values will overpower the female ones if they are not scaled.
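A small illustration of why this matters, with two made-up same-unit features of very different spread:

```r
set.seed(9)
x <- data.frame(small = rnorm(50, mean = 5, sd = 0.1),   # tight spread
                large = rnorm(50, mean = 5, sd = 100))   # huge spread
x_scaled <- scale(x)

# After scaling, each column has mean 0 and sd 1, so both features
# contribute comparably to k-means' Euclidean distances
apply(x_scaled, 2, sd)
```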

  • @taz6783
    @taz6783 2 years ago

    czcams.com/video/NKQpVU1LTm8/video.html line 26 doesn't run for me... it says 'can't handle an object of class list'

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      Hmm. Not sure. Did you run all the previous lines up to 26? It runs fine on my end.

    • @ositaonyejekwe
      @ositaonyejekwe a year ago

      @@SpencerPaoHere I have the same error :(

  • @qasemalwadiah9343
    @qasemalwadiah9343 2 years ago

    Thanks a lot for this video, it's very helpful, but I'm facing an issue with installing packages like "factoextra". It keeps giving me the error message below; I hope you can advise me on how to solve this problem:
    "Error in install.packages : cannot open file 'C:/Users/DELL/Documents/R/win-library/4.1/file24c47a606609/rstudioapi/help/figures/logo.png': Permission denied"

    • @SpencerPaoHere
      @SpencerPaoHere  2 years ago

      That seems like it's a computer issue. When installing your packages, you might want to check whether you have admin privileges.

    • @qasemalwadiah9343
      @qasemalwadiah9343 2 years ago

      @@SpencerPaoHere I appreciate your reply; I'll check the installation.

  • @prof.luanmonteiro
    @prof.luanmonteiro a year ago

    Great video!