ML with Python | Text Clustering | K-Means (Movies)

  • Published 5 Aug 2020
  • In this tutorial, I will show you how to perform unsupervised machine learning with Python using text clustering. We will look at how to turn text into numbers using the TF-IDF Vectorizer from sklearn. We will also check the centroid of each cluster. Once we know the centroids, we will know which movies are closest to them, which helps us understand the similarities between these movies.
    I will show you, step by step:
    1. How to load the data into a Google Colab notebook
    2. How to explore the data
    3. How to pre-process the data with the TF-IDF Vectorizer from sklearn
    4. How to perform K-Means clustering using the Scikit-Learn library
    5. How to evaluate the results of the clustering
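    The steps above can be sketched as a minimal pipeline. This is an illustrative sketch, not the video's exact code: the tiny inline DataFrame stands in for the Movies_Dataset.csv used in the tutorial, and the cluster count is an arbitrary choice.

    ```python
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Tiny stand-in for the movies dataset; in the tutorial the data
    # is loaded with pd.read_csv("Movies_Dataset.csv") in Colab.
    df = pd.DataFrame({
        "title": ["Toy Story", "Heat", "Jumanji", "Casino"],
        "overview": [
            "a cowboy doll toy comes to life",
            "a detective hunts a professional thief",
            "a board game unleashes a jungle adventure",
            "a mob gambling empire in las vegas",
        ],
    })

    # Turn each overview into a TF-IDF vector.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(df["overview"])

    # Cluster the vectors; n_clusters is an arbitrary choice here.
    kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)

    # Inspect the terms closest to each centroid.
    terms = vectorizer.get_feature_names_out()
    order = kmeans.cluster_centers_.argsort()[:, ::-1]
    for i in range(2):
        print(f"Cluster {i}:", [terms[j] for j in order[i, :3]])
    ```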
    For full codes and dataset:
    github.com/MarcusChong123/Tex...
    For full article:
    learndatascienceskill.com/ind...
    To learn more about Scikit-Learn Feature Extraction:
    scikit-learn.org/stable/modul...
    My website about Data Science:
    learndatascienceskill.com/
  • Science & Technology

Comments • 39

  • @bhavyagoradia4203
    @bhavyagoradia4203 2 years ago +2

    Marcus, amazing explanation!! Thank you!

  • @jannatulfardous5802
    @jannatulfardous5802 2 years ago +2

    Very useful video for me.....Thank you for sharing Marcus.

  • @ranjinimukhejee2786
    @ranjinimukhejee2786 3 years ago +2

    Thank you Marcus! This was really helpful!

  • @pengchaocai2848
    @pengchaocai2848 2 years ago +2

    You rock Marcus! The video really helped.

  • @TheSassy023
    @TheSassy023 2 years ago +1

    The first video that really helped me, thank you!

  • @m4rrow8
    @m4rrow8 3 years ago +4

    Hey Marcus! Thank you for this!!! I learned a lot with this video, this will be very useful for my capstone project

    • @codewithmarcus2151
      @codewithmarcus2151 3 years ago +1

      Hi Miyel, very happy to hear that! Feel free to explore other videos :)

  • @shahedmahbub9013
    @shahedmahbub9013 3 years ago +5

    Excellent tutorial explaining all the steps. Found this very helpful. Thank you!

  • @Qweasdzxc912
    @Qweasdzxc912 1 year ago

    Thank you for the video! Amazing content, I learned a lot!

  • @lifetube1117
    @lifetube1117 2 years ago +1

    Thanks! I have got a new idea for my project

  • @silentscream2808
    @silentscream2808 2 years ago +1

    Thanks man, saved my day

  • @mujammalahmed1524
    @mujammalahmed1524 2 years ago +1

    Thanks a lot brother, take love

  • @bentraje
    @bentraje 3 years ago +1

    Thanks for sharing the source materials!

    • @codewithmarcus2151
      @codewithmarcus2151 3 years ago

      You are welcome! Feel free to follow my GitHub account github.com/MarcusChong123 for the source code of other tutorials :)

  • @sandyAshraf
    @sandyAshraf 2 years ago +1

    Thank you!

  • @Bhaveshwari21
    @Bhaveshwari21 3 years ago

    Hi Marcus, very well explained, thanks for the video. Can you make something on analysis of categorical data without a response variable?

  • @josuahutagalung6961
    @josuahutagalung6961 2 years ago

    Thank you sir. Sir, how can I visualize this case with a scatter plot?🙏🙏🙏
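    One common approach (a sketch, not from the video): reduce the TF-IDF matrix to two dimensions with TruncatedSVD, which works directly on sparse matrices, and scatter-plot the points coloured by cluster label. The documents below are illustrative.

    ```python
    import matplotlib
    matplotlib.use("Agg")  # headless backend so this runs outside a notebook
    import matplotlib.pyplot as plt
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "space aliens invade earth",
        "alien spaceship lands on earth",
        "a romantic comedy in paris",
        "love and comedy in new york",
    ]

    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

    # Project the sparse TF-IDF matrix down to 2 components for plotting.
    coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

    plt.scatter(coords[:, 0], coords[:, 1], c=labels)
    plt.xlabel("SVD component 1")
    plt.ylabel("SVD component 2")
    plt.savefig("clusters.png")  # or plt.show() in a notebook
    ```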

  • @Jxxxxxxxxxxxxxxxxxxx
    @Jxxxxxxxxxxxxxxxxxxx 1 year ago +1

    bro, how do I calculate an accuracy score for the model, say the silhouette score for example?
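    Without ground-truth labels there is no true "accuracy" for K-Means, but the silhouette score works directly on the TF-IDF matrix and the fitted labels. A minimal sketch with illustrative documents:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    docs = [
        "dog puppy bark", "puppy dog leash",
        "stock market trading", "market stock prices",
    ]
    X = TfidfVectorizer().fit_transform(docs)
    labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

    # Ranges from -1 (poor) to +1 (dense, well-separated clusters).
    score = silhouette_score(X, labels)
    print(score)
    ```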

  • @ayeshaakhtar1482
    @ayeshaakhtar1482 3 years ago

    hi, is there any way to then represent those clusters in the form of a DBSCAN diagram?

  • @saimanohar3363
    @saimanohar3363 1 year ago

    Nice video and great explanation. What is the method used to arrive at the number of clusters? If it is the elbow method, how do we apply it to text data? Thank you.

  • @chandandacchufan3242
    @chandandacchufan3242 1 year ago

    You should plot an elbow curve to find the optimal number of clusters.
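    The elbow curve works the same on text data as on numeric data, provided you fit on the TF-IDF matrix rather than the raw DataFrame. A sketch with illustrative documents: fit K-Means for a range of k, record the inertia (within-cluster sum of squares), and look for the bend in the plot.

    ```python
    import matplotlib
    matplotlib.use("Agg")  # headless backend; use plt.show() in a notebook
    import matplotlib.pyplot as plt
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "action hero explosion", "explosion car chase action",
        "romance love story", "love story wedding",
        "space alien planet", "alien planet spaceship",
    ]
    X = TfidfVectorizer().fit_transform(docs)

    inertias = []
    ks = range(1, 6)
    for k in ks:
        km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
        inertias.append(km.inertia_)  # within-cluster sum of squares

    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("k")
    plt.ylabel("inertia")
    plt.savefig("elbow.png")
    ```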

  • @Saiju.
    @Saiju. 1 year ago +1

    Hi there, I need to classify customer reviews into categories.. can you suggest a method?

  • @zahrasiraj106
    @zahrasiraj106 3 years ago

    hi, can you please cover the topic of hierarchical clustering for text documents, using Python? I need it for something I'm working on.

  • @shaikhkashif9973
    @shaikhkashif9973 9 months ago

    TOp G thanks 😊

  • @bjmaudioservices6134
    @bjmaudioservices6134 1 year ago

    Hi sir, how can I write code that removes duplicate images using clustering in Google Colab after unzipping the dataset zip file?

  • @mirroring_2035
    @mirroring_2035 2 years ago +1

    Didn't you have to clean the "overview" column a bit more before vectorizing it? Like making it all lower case, etc.?
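    Worth noting (an editorial aside, not from the video): `TfidfVectorizer` already lowercases and tokenizes by default (`lowercase=True`), so some of that cleaning happens implicitly; steps like stemming or lemmatization would still need to be done separately. A quick check:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer

    v = TfidfVectorizer()  # lowercase=True is the default
    v.fit(["The Cat", "the cat"])
    # Both documents map to the same lowercase vocabulary terms.
    print(v.get_feature_names_out())  # prints ['cat' 'the']
    ```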

  • @richv7170
    @richv7170 3 years ago

    Great video, thanks. I have a quick question though as I am getting an error that I cannot work out why. In the second code block with the line df = pd.read_csv("Movies_Dataset.csv") I get that show as ParserError and further down the result in another line which says ParserError: Error tokenizing data. C error: Expected 1 fields in line 13, saw 3
    Have you got any advice on what is going wrong here. Thanks

    • @codewithmarcus2151
      @codewithmarcus2151 3 years ago +1

      Hi, are you using Google Colab for this exercise? Did you use the same dataset as I am using (Movies_Dataset.csv)? If you are using Google Colab, make sure you have the file uploaded successfully. If that's true, try df = pd.read_csv(filename,header=None,error_bad_lines=False)

    • @richv7170
      @richv7170 3 years ago

      @@codewithmarcus2151 Thanks for the reply Marcus, really appreciated. Yes I was using Colab, but the issue turned out to be that I had managed to corrupt the CSV 😬 I did a fresh download of the data and got it to work. Quick question: could you direct me to where I could learn about using K-Means in a semi-supervised model? For instance, if I have a whole batch of phrases that I wanted to sort into pre-defined clusters. Or, using the movies dataset as an example, sort them into, say, family, adult, or child-friendly clusters, if you see what I mean. Thanks again for your amazing tutorials

  • @mahiraj8522
    @mahiraj8522 1 year ago

    How do we decide how many clusters will be a good fit for the data? Can you do an elbow plot or silhouette score for this same dataset and explain?

  • @azeemsiddiqui3853
    @azeemsiddiqui3853 1 year ago

    How can I know which id belongs to which cluster?
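    After fitting, `kmeans.labels_` holds one cluster id per row, in the same order as the input, so it can be attached straight back onto the DataFrame. A sketch with illustrative data and column names:

    ```python
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    df = pd.DataFrame({
        "id": [10, 20, 30, 40],
        "overview": ["dog puppy", "puppy leash", "stock market", "market prices"],
    })
    X = TfidfVectorizer().fit_transform(df["overview"])
    kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

    # One label per row, aligned with df's row order.
    df["cluster"] = kmeans.labels_
    print(df[["id", "cluster"]])
    ```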

  • @mohamadjumaa2042
    @mohamadjumaa2042 2 years ago

    I have a question that I hope someone can answer.
    How can I cluster on two fields, let's say the first is "overview" and the second is "title"?

  • @hugoalbert4695
    @hugoalbert4695 2 years ago

    Hi Marcus! Could you explain the following line to me? print(' %s' % terms[j])
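    That line uses Python's old-style `%` string formatting: `' %s'` is a template where `%s` is replaced by the string value of `terms[j]`, the j-th feature name from the vectorizer. A tiny standalone illustration:

    ```python
    terms = ["movie", "love", "war"]  # illustrative feature names
    j = 1
    # '%s' is substituted with terms[j]; the literal leading space is kept.
    print(' %s' % terms[j])  # prints " love"
    ```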

  • @zakiyahzainon5958
    @zakiyahzainon5958 3 years ago

    hi, thanks for the good tutorial. I followed your steps using Jupyter, but got stuck at this line:
    f.write(data.to_csv(index_label='id')) # set index to id
    f.close()
    after run this line, the error like below:
    UnicodeEncodeError Traceback (most recent call last)
    in
    ----> 1 f.write(data.to_csv(index_label='id')) # set index to id
    2 f.close()
    ~\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
    17 class IncrementalEncoder(codecs.IncrementalEncoder):
    18 def encode(self, input, final=False):
    ---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    20
    21 class IncrementalDecoder(codecs.IncrementalDecoder):
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 13647-13649: character maps to
    I skipped that line, but then no dataset is created for each cluster... :(
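    That `UnicodeEncodeError` in `cp1252.py` means the file was opened with Windows' default cp1252 encoding, which can't represent some characters in the movie data. Opening the file with an explicit UTF-8 encoding usually fixes it. A sketch, where the inline DataFrame stands in for the clustered data from the tutorial:

    ```python
    import pandas as pd

    # Stand-in for the per-cluster data; the title contains characters
    # that cp1252 cannot encode, which triggers the error on Windows.
    data = pd.DataFrame({"title": ["千尋", "Léon"]})

    # encoding="utf-8" avoids the cp1252 UnicodeEncodeError.
    with open("cluster_0.csv", "w", encoding="utf-8") as f:
        f.write(data.to_csv(index_label="id"))

    # Read it back to confirm the non-ASCII characters survived.
    text = open("cluster_0.csv", encoding="utf-8").read()
    print(text)
    ```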

  • @kent4239
    @kent4239 1 year ago

    Analysis is not complete without doing a manual review of the clusters at the end. From what you showed, it didn't look too promising.

  • @geraldasamoah98
    @geraldasamoah98 3 years ago +1

    Hi Marcus, great tutorial, thanks! I have one question: when I try to use the elbow method here to determine the optimal k like you did in the video before, I'm getting: ValueError: could not convert string to float: 'Toy Story'. The problem seems to be in this line: kmeanModel.fit(df)
    I would be glad if you could tell me what the code for the elbow method would look like in this specific case :)
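    The `ValueError` above comes from fitting K-Means on the raw DataFrame, which still contains strings like 'Toy Story'; the fix is to fit on the TF-IDF matrix instead. A sketch with illustrative data (the variable name `kmeanModel` follows the question):

    ```python
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    df = pd.DataFrame({
        "title": ["Toy Story", "Heat"],
        "overview": ["toy doll adventure", "detective thief crime"],
    })
    X = TfidfVectorizer().fit_transform(df["overview"])

    # kmeanModel.fit(df) would raise "could not convert string to float";
    # fit on the numeric TF-IDF matrix instead.
    kmeanModel = KMeans(n_clusters=2, random_state=0, n_init=10)
    kmeanModel.fit(X)
    print(kmeanModel.inertia_)
    ```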