NLP Demystified 9: Automatically Finding Topics in Documents with Latent Dirichlet Allocation
- Added Aug 2, 2024
- Course playlist: • Natural Language Proce...
What do you do when you need to make sense of a pile of documents and have no other information? In this video, we'll learn one approach to this problem using Latent Dirichlet Allocation.
We'll cover how it works, then build a model with spaCy and Gensim to automatically discover topics present in a document and to search for similar documents.
Colab notebook: colab.research.google.com/git...
Timestamps
00:00:00 Topic modelling with LDA
00:00:21 The two assumptions an LDA topic model makes
00:03:15 Building an LDA Machine to generate documents
00:10:16 The Dirichlet distribution
00:14:43 Further enhancements to the LDA machine
00:17:01 LDA as generative model
00:20:15 Training an LDA model using Collapsed Gibbs Sampling
00:28:44 DEMO: Discovering topics in a news corpus and searching for similar documents
00:45:24 Topic model use cases and other models
This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.
Visit www.nlpdemystified.org/ to learn more.
This course is pure gold. Your explanations are very clear and the accompanying notebooks are very helpful. Thanks a lot for creating this content and making it available for free!!!
First of all, thank you so much for this course. I have understood everything up to this point, but I have a couple of questions.
I am still confused about the relationship between Latent Dirichlet Allocation and Collapsed Gibbs Sampling. You talked about Latent Dirichlet Allocation for a long time but then shifted to CGS, and I am having trouble understanding the connection. (I get the part about multiplying the fractions, finding the probabilities, and throwing the dart.) I'd love an explanation.
Thanks
keep up the good work
Nicely done, thanks!
This is really amazing NLP work.
Thanks so much
At around 27:45: "The algorithm will converge to some distribution which hopefully will make sense."
What is the main logic in the algorithm that ensures the words grouped into a topic will make sense (so that the topic could be labeled "food", for example)?
Since we start from a random distribution, what is the main reason a wrong grouping (one that makes no sense) won't be reinforced?
Great NLP lectures. Subscribed to your channel.
Btw, when running the colab in the cloud (I preferred to do it that way) I had two issues that could be easily fixed. Do you accept PRs to correct them?
Yep, I do! I wrote and ran them in the cloud as well so perhaps something (e.g. a dependency) has changed. I would appreciate a PR.
I was going through the notebook file and I can't download the CNN corpus file from Google Drive. It says access denied?
Not getting that issue:
imgur.com/a/yovuuqF
This might be a naive question, but could we not simply apply a clustering technique like k-Means to the bag-of-words or tf-idf matrix for topic discovery? You could then look at the most frequent words in each cluster to assign a topic.
Yep, you can absolutely do that. The difference between something like k-means and LDA is that the former has hard boundaries. So a document belongs to one cluster and that's it. With LDA, each document is assigned a topic *mixture* (e.g. 60% topic A, 25% topic B, 15% topic C). This often leads to (subjectively) better topic distributions and similarity search results.
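To make the hard-vs-soft distinction concrete, here is a toy sketch (the affinity scores are invented for illustration) contrasting a k-means-style hard assignment with an LDA-style topic mixture:

```python
# Toy illustration: k-means assigns each document to exactly one
# cluster, while LDA yields a mixture over all topics.

# Invented affinity of one document to three topics/clusters.
scores = {"A": 6.0, "B": 2.5, "C": 1.5}

# k-means-style hard assignment: the single best cluster, nothing else.
hard = max(scores, key=scores.get)

# LDA-style soft assignment: normalize into a mixture over all topics.
total = sum(scores.values())
mixture = {topic: s / total for topic, s in scores.items()}

print(hard)      # "A"
print(mixture)   # {"A": 0.6, "B": 0.25, "C": 0.15}
```

The mixture (60% A, 25% B, 15% C) mirrors the example in the reply above; the hard assignment throws that nuance away.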
At minute 25:53, we see that the prevalence for each topic is x/7 with alpha set to one. Shouldn't it be 6 for each topic? I count five words in each document. Where does the sixth word come from? Thanks
Refer to 25:42. The alpha value is added to the denominator for every topic. Since there are five words and two topics, the denominator is 5 + 1 + 1 = 7. The notation could be clearer on this.
Ahhh, so the denominator depends on the number of topics too. Alright, thanks!!
So, if T3 wouldn’t be zero we would end up with 8 in the denominator, right?
@@eboi9081 Yep
@@futuremojo I guess we can calculate the denominator as -> (Words in T1 + a) + (Words in T2 + a) + (Words in T3 + a) = (2 + 1) + (2 + 1) + (0 +1) = 7
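The arithmetic in this thread can be written as a small helper. This is just a sketch of the smoothed prevalence estimate (n_t + alpha) / sum_t (n_t + alpha), using the counts from the last reply (2, 2, and 0 words across three topics, alpha = 1):

```python
from fractions import Fraction

def topic_prevalence(counts, alpha=1):
    """Smoothed topic prevalence: (n_t + alpha) / sum_t (n_t + alpha)."""
    denom = sum(n + alpha for n in counts)
    return [Fraction(n + alpha, denom) for n in counts]

# Counts from the thread: two topics with 2 words each, one with 0.
# Denominator: (2 + 1) + (2 + 1) + (0 + 1) = 7.
print(topic_prevalence([2, 2, 0]))  # [Fraction(3, 7), Fraction(3, 7), Fraction(1, 7)]
```

Because alpha is added once per topic, the denominator grows by one for each topic, which is why a non-empty T3 would push it past 7.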
A few more questions from me:
1. At around 27 minutes, is there a reason why this form of additive smoothing is different from the one in Naive Bayes? I.e., how come there isn't a factor of K or V in the denominators?
2. When selecting a new topic for a word (step 5 in Collapsed Gibbs Sampling), do we choose the topic that maximizes the criterion (i.e., the product of the fractions), or do we sample a topic from this distribution? In your example, we choose topic 1, but is there a chance topic 2 or 3 could have also been selected? Again, is there a reason for this?
Thanks, Sam.
1. That's unclear notation from me. The summation in the denominator includes the addition of the hyperparameter.
2. We sample from this distribution. So in our example, topics 2 and 3 had lower probability but still could've been chosen. If instead of sampling, we picked the most probable, we would be optimizing instead and very likely underestimating the uncertainty in the other parameters. With CGS, you're making a probable guess and as you go through more iterations, the guesses get stronger and stronger. That being said, what's the practical outcome of always picking the most probable instead of sampling? I don't know.
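A minimal sketch of the difference (the weights below are invented, not taken from the video's example): step 5 of collapsed Gibbs sampling draws the new topic in proportion to the computed products rather than always taking the maximum:

```python
import random

# Invented unnormalized products of the doc-topic and topic-word
# fractions for three candidate topics.
weights = [0.5, 0.3, 0.2]  # topics 1, 2, 3

# Optimizing: always picks the most probable topic (index 0 here),
# ignoring the uncertainty in the other topics.
argmax_topic = max(range(len(weights)), key=lambda t: weights[t])

# Collapsed Gibbs sampling: draws a topic in proportion to the
# weights, so topics 2 and 3 can still be chosen, just less often.
rng = random.Random(0)
sampled_topic = rng.choices(range(len(weights)), weights=weights, k=1)[0]

print(argmax_topic)   # 0
print(sampled_topic)  # 0, 1, or 2 depending on the draw
```

This is the "throwing the dart" step: the dart lands on topic 1 half the time with these weights, but the other topics keep a real chance of being picked.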
This is the best course for understanding NLP! The author did a great job explaining concepts in detail, but there is a minor terminology issue I want to highlight to avoid confusion. When the course refers to "vocabulary", it does not always mean just the individual words. More accurately, in NLP the "vocabulary" is the set of unique tokens that occur in the text corpus.
To elaborate:
1. By default, the vocabulary contains the unique unigrams (single words) in the corpus.
2. However, it can also include n-grams (sequences of multiple words) if we tokenize the text into n-grams rather than just unigrams.
3. Therefore, it is more precise to refer to "tokens" rather than just "words." The term "tokens" makes it clear we could be talking about either unigrams or n-grams, depending on the tokenization approach used.
4. This vocabulary or token set is what forms the basis for representations like word embeddings, indexing documents, etc.
That being said, "tokens" is a more accurate general term than "words" when we want to refer to both potential unigrams and n-grams. Just wanted to point out this terminology distinction to prevent any confusion down the line.
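As a concrete illustration of this point (toy two-document corpus, standard library only): the vocabulary changes depending on whether we tokenize into unigrams only or also include bigrams:

```python
# The "vocabulary" is the set of unique tokens, and what counts as a
# token depends on the tokenization scheme (unigrams vs. n-grams).
corpus = ["new york is big", "york is old"]

def unigrams(text):
    return text.split()

def bigrams(text):
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

unigram_vocab = sorted({tok for doc in corpus for tok in unigrams(doc)})
bigram_vocab = sorted({tok for doc in corpus for tok in bigrams(doc)})

print(unigram_vocab)  # ['big', 'is', 'new', 'old', 'york']
print(bigram_vocab)   # ['is big', 'is old', 'new york', 'york is']
```

With bigram tokenization, "new york" becomes a single vocabulary entry, which is exactly why "tokens" is the safer term.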