NLP Demystified 9: Automatically Finding Topics in Documents with Latent Dirichlet Allocation

  • Added 2 Aug 2024
  • Course playlist: • Natural Language Proce...
    What do you do when you need to make sense of a pile of documents and have no other information? In this video, we'll learn one approach to this problem using Latent Dirichlet Allocation.
    We'll cover how it works, then build a model with spaCy and Gensim to automatically discover topics present in a document and to search for similar documents.
    Colab notebook: colab.research.google.com/git...
    Timestamps
    00:00:00 Topic modelling with LDA
    00:00:21 The two assumptions an LDA topic model makes
    00:03:15 Building an LDA Machine to generate documents
    00:10:16 The Dirichlet distribution
    00:14:43 Further enhancements to the LDA machine
    00:17:01 LDA as generative model
    00:20:15 Training an LDA model using Collapsed Gibbs Sampling
    00:28:44 DEMO: Discovering topics in a news corpus and searching for similar documents
    00:45:24 Topic model use cases and other models
    This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.
    Visit www.nlpdemystified.org/ to learn more.
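
For readers who want a feel for the pipeline before opening the notebook, here is a minimal sketch of preprocessing with spaCy and topic modelling plus similarity search with Gensim. The tiny corpus, parameter values, and variable names are illustrative assumptions, not the notebook's exact code.

```python
# Minimal sketch: spaCy for preprocessing, Gensim for LDA and similarity search.
# The toy corpus and hyperparameters here are placeholders, not the notebook's.
import spacy
from gensim import corpora, models, similarities

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

raw_docs = [
    "The central bank raised interest rates again this quarter.",
    "The team won the championship after a dramatic overtime goal.",
    "New vaccine trials show promising results against the virus.",
]

def preprocess(text):
    """Return lowercased lemmas, dropping stop words and non-alphabetic tokens."""
    return [t.lemma_.lower() for t in nlp(text) if t.is_alpha and not t.is_stop]

tokenized = [preprocess(doc) for doc in raw_docs]

dictionary = corpora.Dictionary(tokenized)                    # the vocabulary
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]   # bag-of-words vectors

lda = models.LdaModel(bow_corpus, num_topics=3, id2word=dictionary,
                      passes=10, random_state=42)
print(lda.print_topics())

# Similarity search: index documents in topic space, then query with a new document.
index = similarities.MatrixSimilarity(lda[bow_corpus], num_features=lda.num_topics)
query_topics = lda[dictionary.doc2bow(preprocess("Investors worry about rates and inflation."))]
print(sorted(enumerate(index[query_topics]), key=lambda pair: -pair[1]))
```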

Comments • 24

  • @youfripper
    @youfripper 1 year ago +5

    This course is pure gold. Your explanations are very clear and the accompanying notebooks are very helpful. Thanks a lot for creating this content and making it available for free!!!

  • @thecubeguy2087
    @thecubeguy2087 23 days ago +1

    First of all, thank you so much for this course. I have understood everything up to this point. I have a couple of questions, though.
    I am still confused about the relationship between Latent Dirichlet Allocation and Collapsed Gibbs Sampling. You talked about Latent Dirichlet Allocation for a long time but then kind of shifted to CGS, and I am having trouble understanding how the two relate (I get the part about multiplying the fractions, finding the probabilities, and throwing the dart). I'd love an explanation.
    Thanks

  • @mahmoudreda1083
    @mahmoudreda1083 1 year ago +3

    keep up the good work

  • @somerset006
    @somerset006 11 months ago

    Nicely done, thanks!

  • @varshapandey_daily_lifestyle

    It's really amazing NLP work.

  • @kevinoudelet
    @kevinoudelet 5 months ago

    thx so much

  • @marius152
    @marius152 10 months ago +1

    At around 27:45: "The algorithm will converge to some distribution which hopefully will make sense."
    What is the main reason or logic in the algorithm that ensures the words grouped into a topic will make sense (so that the topic could be labeled "food", for example)?
    Since we start from a random distribution, what is the main reason a wrong grouping (one that makes no sense) won't be reinforced?

  • @caiyu538
    @caiyu538 1 year ago

    Great NLP lectures. Subscribed to your channel.

  • @youfripper
    @youfripper 1 year ago +1

    Btw, when running the colab in the cloud (I preferred to do it that way) I had two issues that could be easily fixed. Do you accept PRs to correct them?

    • @futuremojo
      @futuremojo  1 year ago

      Yep, I do! I wrote and ran them in the cloud as well so perhaps something (e.g. a dependency) has changed. I would appreciate a PR.

  • @prithimeenan4214
    @prithimeenan4214 1 year ago +1

    I was going through the notebook file and I can't download the CNN corpus file from Google Drive. It says access denied?

    • @futuremojo
      @futuremojo  1 year ago +1

      Not getting that issue:
      imgur.com/a/yovuuqF

  • @samuelcortinhas4877
    @samuelcortinhas4877 1 year ago +1

    This might be a naive question, but could we not simply apply a clustering technique like k-Means to the bag-of-words or tf-idf matrix for topic discovery? You could then look at the most frequent words in each cluster to assign a topic.

    • @futuremojo
      @futuremojo  1 year ago +5

      Yep, you can absolutely do that. The difference between something like k-means and LDA is that the former has hard boundaries. So a document belongs to one cluster and that's it. With LDA, each document is assigned a topic *mixture* (e.g. 60% topic A, 25% topic B, 15% topic C). This often leads to (subjectively) better topic distributions and similarity search results.
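
To make the hard-versus-soft distinction above concrete, here is a toy sketch (made-up token lists, not from the course notebook) contrasting a k-means cluster label with an LDA topic mixture.

```python
# Toy contrast: k-means assigns one hard cluster label per document,
# while LDA assigns a distribution (mixture) over topics.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from gensim import corpora, models

docs = [["stock", "market", "bank"], ["game", "team", "score"], ["bank", "loan", "rate"]]

# k-means on a TF-IDF matrix: exactly one label per document.
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)  # documents are pre-tokenized
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf.fit_transform(docs))
print(labels)  # e.g. [0 1 0]: one hard assignment each

# LDA: a topic mixture per document.
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
print(lda.get_document_topics(bow[0], minimum_probability=0.0))
# e.g. [(0, 0.7...), (1, 0.2...)]: proportions over topics, not a single label
```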

  • @eboi9081
    @eboi9081 1 year ago +3

    At minute 25:53, we see that the prevalence for each topic has x/7 with alpha set to one. Shouldn't it be 6 for each topic? I count five words in each document. Where does the sixth word come from? Thanks

    • @futuremojo
      @futuremojo  1 year ago

      Refer to 25:42. The alpha value is added to the denominator for every topic. Since there are five words and two topics, the denominator is 5 + 1 + 1 = 7. The notation could be clearer on this.

    • @eboi9081
      @eboi9081 1 year ago +1

      Ahhh, so the alpha contribution depends on the number of topics too. Alright, thanks!!

    • @eboi9081
      @eboi9081 1 year ago +1

      So, if T3 weren't zero, we would end up with 8 in the denominator, right?

    • @futuremojo
      @futuremojo  1 year ago

      @@eboi9081 Yep

    • @prabhdeepsingh8726
      @prabhdeepsingh8726 5 months ago

      @@futuremojo I guess we can calculate the denominator as (words in T1 + alpha) + (words in T2 + alpha) + (words in T3 + alpha) = (2 + 1) + (2 + 1) + (0 + 1) = 7
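
A tiny numeric sketch of the breakdown in the reply above (the counts and alpha = 1 are taken from this thread; only the document-topic factor of the full collapsed Gibbs sampling expression is shown):

```python
# Reproduce the denominator arithmetic quoted above:
# (words in T1 + alpha) + (words in T2 + alpha) + (words in T3 + alpha) = 7
alpha = 1
words_per_topic = {"T1": 2, "T2": 2, "T3": 0}  # counts as given in the thread

denominator = sum(count + alpha for count in words_per_topic.values())
print(denominator)  # (2 + 1) + (2 + 1) + (0 + 1) = 7

# The resulting per-topic fractions, e.g. 3/7, 3/7, 1/7:
for topic, count in words_per_topic.items():
    print(topic, (count + alpha) / denominator)
```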

  • @samuelcortinhas4877
    @samuelcortinhas4877 1 year ago

    A few more questions from me:
    1. At around 27 mins, is there a reason why this form of additive smoothing is different from the one in Naive Bayes? I.e., how come there isn't a factor of K or V in the denominators?
    2. When selecting a new topic for a word (step 5 in Collapsed Gibbs Sampling), do we choose the topic that maximises the criterion (i.e. the product of the fractions) or do we sample a topic from this distribution? In your example, we choose topic 1, but is there a chance topic 2 or 3 could have also been selected? Again, is there a reason for this?
    Thanks, Sam.

    • @futuremojo
      @futuremojo  1 year ago +1

      1. That's unclear notation from me. The summation in the denominator includes the addition of the hyperparameter.
      2. We sample from this distribution. So in our example, topics 2 and 3 had lower probability but still could've been chosen. If, instead of sampling, we picked the most probable topic, we would be optimizing and very likely underestimating the uncertainty in the other parameters. With CGS, you're making a probable guess, and as you go through more iterations, the guesses get stronger and stronger. That being said, what's the practical outcome of always picking the most probable topic instead of sampling? I don't know.
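
A small sketch of the sampling-versus-argmax point above, using made-up topic weights rather than values from the video:

```python
# Given unnormalized per-topic weights from a CGS step (illustrative numbers),
# sampling can still pick a lower-weight topic, whereas argmax always picks the same one.
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.6, 0.3, 0.1])        # made-up products of the two fractions
probs = weights / weights.sum()            # normalize into a distribution

sampled_topic = rng.choice(len(probs), p=probs)  # what collapsed Gibbs sampling does
greedy_topic = int(np.argmax(probs))             # the "always pick the most probable" alternative
print(sampled_topic, greedy_topic)
```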

  • @vipulmaheshwari2321
    @vipulmaheshwari2321 11 months ago

    This is the best course for understanding NLP! The author did a great job explaining concepts in detail, but there is a minor terminology issue I want to highlight to avoid confusion. When the course refers to "vocabulary", it does not always mean just the individual words. More accurately, in NLP the "vocabulary" is the set of unique tokens that occur in the text corpus.
    To elaborate:
    1. By default, the vocabulary contains the unique unigrams (single words) in the corpus.
    2. However, it can also include n-grams (sequences of multiple words) if we tokenize the text into n-grams rather than just unigrams.
    3. Therefore, it is more precise to refer to the "tokens" rather than just the "vocabulary." The term "tokens" makes it clear we could be talking about either unigrams or n-grams, depending on the tokenization approach used.
    4. This vocabulary or token set is what forms the basis for representations like word embeddings, indexing documents, etc.
    That being said, using "tokens" is a more accurate general term than "vocabulary" when we want to refer to both potential unigrams and n-grams. Just wanted to point out this terminology distinction to prevent any confusion down the line.
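
As a quick illustration of this terminology point, here is a toy example (assumed sentences, using scikit-learn's CountVectorizer rather than anything from the course) showing how the vocabulary changes when tokenization produces bigrams as well as unigrams:

```python
# The "vocabulary" is the set of unique tokens, and what counts as a token
# depends on the tokenization: unigrams only vs. unigrams plus bigrams.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["new york is big", "york is old"]

unigram_vocab = CountVectorizer(ngram_range=(1, 1)).fit(texts).get_feature_names_out()
mixed_vocab = CountVectorizer(ngram_range=(1, 2)).fit(texts).get_feature_names_out()

print(unigram_vocab)  # ['big' 'is' 'new' 'old' 'york']
print(mixed_vocab)    # also includes bigram tokens such as 'new york' and 'york is'
```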