ML with Python | Text Clustering | K-Means (Movies)
- Added 5 Aug 2020
- In this tutorial, I will show you how to perform unsupervised machine learning with Python using text clustering. We will look at how to turn text into numbers using the TF-IDF Vectorizer from sklearn. We will also check the centroid of each cluster. Once we know the centroids, we can find the movies that are closest to them, which helps us understand the similarities between these movies.
I will show you, step by step:
1. How to load the data into Google Colab notebook
2. How to explore the data
3. How to pre-process the data with TF-IDF Vectorizer from sklearn
4. How to perform K-Means clustering using the Scikit-Learn library
5. How to evaluate the results of the clustering
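The steps above can be sketched end to end as follows. This is a minimal sketch, not the tutorial's exact code: the toy `docs` list stands in for the movie "overview" column of the real dataset, and k=2 is arbitrary for this small example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-in for the movie "overview" column in the tutorial's dataset
docs = [
    "A toy cowboy is jealous of a new space ranger toy.",
    "A space crew battles an alien aboard their ship.",
    "Two toys go on an adventure to find their owner.",
    "An alien invasion threatens a space station crew.",
]

# Turn text into TF-IDF features, dropping common English stop words
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Cluster the documents; random_state fixes the result for reproducibility
km = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = km.fit_predict(X)
print(labels)  # one cluster id per document
```

KMeans accepts the sparse matrix that TfidfVectorizer produces directly, so no densifying step is needed.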
For full codes and dataset:
github.com/MarcusChong123/Tex...
For full article:
learndatascienceskill.com/ind...
To learn more about Scikit-Learn Feature Extraction:
scikit-learn.org/stable/modul...
My website about Data Science:
learndatascienceskill.com/ - Science & Technology
Marcus, amazing explanation!! Thank you!
Very useful video for me.....Thank you for sharing Marcus.
Thank you Marcus! This was really helpful!
You Rock Marcus! Video is really helping
The first video that really helped me, thank you!
Hey Marcus! Thank you for this!!! I learned a lot with this video, this will be very useful for my capstone project
Hi Miyel, very happy to hear that! Feel free to explore other videos :)
Excellent tutorial explaining all the steps. Found this very helpful. Thank you!
Thank you!
@@codewithmarcus2151 Is there a command to find the silhouette score or inertia from this? Thank you!
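For anyone with the same question: scikit-learn exposes both metrics directly. A minimal sketch, using toy documents in place of the movie dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = ["space alien crew", "toy story fun", "alien ship space", "toy kids fun"]
X = TfidfVectorizer().fit_transform(docs)

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)

# Inertia: within-cluster sum of squared distances (lower = tighter clusters)
print(km.inertia_)

# Silhouette: cohesion vs. separation, ranges over [-1, 1] (higher = better)
print(silhouette_score(X, km.labels_))
```

`inertia_` is an attribute of the fitted KMeans object, while `silhouette_score` is a separate function that takes the feature matrix and the cluster labels.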
Thank you for the video! Amazing content , I did learn a lot !
Thanks ! I have got a new idea for my project
Thanks man, saved my day
Thanks a lot brother, take love
Thanks for sharing the source materials!
You are welcome! Feel free to follow my github account github.com/MarcusChong123 for the source code of other tutorials :)
Thank you!
Hi Marcus, very well explained, thanks for the video. Can you make something on analysis of categorical data without a response variable?
Thank you sir. Sir, how do we visualize this with a scatter plot in this case? 🙏
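One way to do this (an assumption on my part, not the tutorial's code): TF-IDF vectors are high-dimensional, so project them down to 2-D first, e.g. with TruncatedSVD, which works directly on sparse matrices, then scatter-plot the 2-D coordinates colored by cluster.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for scripts and Colab
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

docs = ["space alien crew", "toy story fun", "alien ship space",
        "toy kids fun", "space station alien", "fun toy adventure"]
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

# Project the sparse TF-IDF matrix to 2-D for plotting
coords = TruncatedSVD(n_components=2, random_state=42).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.xlabel("SVD component 1")
plt.ylabel("SVD component 2")
plt.savefig("clusters.png")
```

PCA would need a dense matrix, which is why TruncatedSVD is the usual choice for TF-IDF output.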
bro, how do I calculate an evaluation score for the model, say the silhouette score, for example?
hi, is there any way to then represent those clusters in the form of a dbscan diagram?
Nice video and great explanation. What is the method used to arrive at a number of clusters? If it is the elbow method, how do we arrive at the number of clusters on text data? Thank you.
You should plot an elbow curve to find the optimal number of clusters.
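To expand on that: on text data the elbow method works the same way as on numeric data, as long as you fit on the TF-IDF matrix rather than the raw text. A minimal sketch with toy documents standing in for the dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["space alien crew", "toy story fun", "alien ship space",
        "toy kids fun", "space station alien", "fun toy adventure"]
X = TfidfVectorizer().fit_transform(docs)

# Fit K-Means for each candidate k and record the inertia;
# the "elbow" where the curve flattens suggests a good k.
inertias = []
for k in range(1, 5):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(km.inertia_)
print(inertias)  # inertia shrinks as k grows
```

Plotting `range(1, 5)` against `inertias` then gives the familiar elbow chart.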
Hi there, I need to classify customer reviews into categories. Can you suggest a method?
hi, can you please cover the topic of hierarchical clustering for text documents using Python? I need to use it.
TOp G thanks 😊
Hi sir, how can I write code that removes duplicate images using clustering in Google Colab, after unzipping the dataset zip file?
Didn't you have to clean the "overview" column a bit more before vectorizing it? Like making it all lower case, etc.?
Great video, thanks. I have a quick question though, as I am getting an error that I cannot work out. In the second code block, the line df = pd.read_csv("Movies_Dataset.csv") raises a ParserError, and further down another line says: ParserError: Error tokenizing data. C error: Expected 1 fields in line 13, saw 3
Do you have any advice on what is going wrong here? Thanks.
Hi, are you using Google Colab for this exercise? Did you use the same dataset as I am using (Movies_Dataset.csv)? If you are using Google Colab, make sure you have the file uploaded successfully. If that's true, try df = pd.read_csv(filename,header=None,error_bad_lines=False)
@@codewithmarcus2151 Thanks for the reply Marcus, really appreciated. Yes, I was using Colab, but the issue turned out to be that I had managed to corrupt the CSV 😬 I did a fresh download of the data and got it to work. Quick question: could you direct me to where I could learn about using K-Means in a semi-supervised model? For instance, if I have a whole batch of phrases that I wanted to sort into pre-defined clusters. Or, using the movies dataset as an example, sort them into, say, family, adult, or child-friendly clusters, if you see what I mean. Thanks again for your amazing tutorials.
How do you decide how many clusters will be a good fit for the data? Can you do an elbow plot or silhouette score for this same dataset and explain?
How can I know which id belongs to which cluster?
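For the id-to-cluster question: after fitting, `labels_` is aligned row by row with the input, so it can simply be attached as a new column. A sketch with a hypothetical frame standing in for the movie dataset:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical stand-in for the movie dataset (id + overview columns)
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "overview": ["space alien crew", "toy story fun",
                 "alien ship space", "toy kids fun"],
})
X = TfidfVectorizer().fit_transform(df["overview"])
km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)

# labels_ has one entry per input row, in the same order
df["cluster"] = km.labels_
print(df[["id", "cluster"]])
```

From there, `df[df["cluster"] == 0]` selects all ids in cluster 0, and so on.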
I have a question that I hope someone can answer.
How can I cluster on two fields, let's say the first is "overview" and the second is "title"?
Hi Marcus! Could you explain the following line? print(' %s' % terms[j])
hi, thanks for the good tutorial. I followed your steps using Jupyter, but got stuck at these lines:
f.write(data.to_csv(index_label='id')) # set index to id
f.close()
After running them, I get the error below:
UnicodeEncodeError                        Traceback (most recent call last)
----> 1 f.write(data.to_csv(index_label='id')) # set index to id
      2 f.close()
~\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 13647-13649: character maps to <undefined>
I skipped that line, but then no dataset is created for each cluster... :(
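The cp1252 path in that traceback suggests the file was opened with Windows' default codepage, which cannot represent some characters in the overviews. A sketch of two fixes (the toy DataFrame and file names here are placeholders): open the file with an explicit UTF-8 encoding, or let pandas write the file itself.

```python
import pandas as pd

data = pd.DataFrame({"overview": ["café", "naïve plot", "plain text"]})

# Fix 1: open the file with an explicit UTF-8 encoding
with open("cluster_0.csv", "w", encoding="utf-8") as f:
    f.write(data.to_csv(index_label="id"))

# Fix 2: let pandas handle the file directly (it writes UTF-8 by default)
data.to_csv("cluster_0_alt.csv", index_label="id")
```

Fix 2 is the simpler route, since it avoids managing the file handle at all.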
Analysis is not complete without doing a manual review of the clusters at the end. From what you showed, it didn't look too promising.
Hi Marcus, great tutorial, thanks! I have one question: when I try to use the elbow method here to determine the optimal k like you did in the video before, I'm getting: ValueError: could not convert string to float: 'Toy Story'. The problem seems to be in this line: kmeanModel.fit(df)
I would be glad if you could tell me what the code for the elbow method would look like in this specific case :)
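The `could not convert string to float: 'Toy Story'` error comes from fitting K-Means on the raw DataFrame, which still contains text columns. The elbow loop must fit on the TF-IDF matrix instead. A minimal sketch (toy documents in place of the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["Toy Story", "space alien crew", "toy kids fun", "alien ship space"]
X = TfidfVectorizer().fit_transform(docs)  # numeric features, not raw text

distortions = []
for k in range(1, 4):
    kmeanModel = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeanModel.fit(X)  # fit on X; fitting the DataFrame of strings raises ValueError
    distortions.append(kmeanModel.inertia_)
print(distortions)
```

In other words, replace `kmeanModel.fit(df)` with `kmeanModel.fit(X)`, where `X` is the output of the vectorizer.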