NLP Demystified 5: Basic Bag-of-Words and Measuring Document Similarity

  • Added 2 Aug 2024
  • Course playlist: Natural Language Proce...
    After preprocessing our text, we take our first step in turning text into numbers so our machines can start working with them. We'll explore:
    - a simple "bag-of-words" (BoW) approach,
    - how to use cosine similarity to measure document similarity, and
    - the shortcomings of this BoW approach.
    In the demo, we'll use a combination of spaCy and scikit-learn to build BoW representations and perform a simple document similarity search; a minimal sketch of this appears after the description.
    Colab notebook: colab.research.google.com/git...
    Timestamps:
    00:00:00 Basic bag-of-words (BoW)
    00:00:22 The need for vectors
    00:00:53 Selecting and extracting features from our data
    00:04:04 Idea: similar documents share similar vocabulary
    00:04:46 Turning a corpus into a BoW matrix
    00:07:10 What vectorization helps us accomplish
    00:08:20 Measuring document similarity
    00:11:09 Shortcomings of basic BoW
    00:12:37 Capturing a bit of context with n-grams
    00:14:10 DEMO: creating basic BoW with scikit-learn and spaCy
    00:17:47 DEMO: measuring document similarity
    00:18:40 DEMO: creating n-grams with scikit-learn
    00:19:35 Basic BoW recap
    This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.
    Visit www.nlpdemystified.org/ to learn more.
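
    For anyone who wants to experiment before opening the notebook, here is a minimal sketch of the demo's approach using scikit-learn's CountVectorizer and SciPy's cosine distance. The corpus and variable names below are illustrative, not taken from the notebook.

        from sklearn.feature_extraction.text import CountVectorizer
        from scipy.spatial.distance import cosine

        # A toy corpus; the notebook uses its own documents.
        corpus = [
            "the cat sat on the mat",
            "a cat sat on a mat",
            "neural networks learn representations",
        ]

        # Bag-of-words matrix: one row per document, one column per
        # vocabulary term, entries are raw token counts.
        vectorizer = CountVectorizer()
        bow = vectorizer.fit_transform(corpus)

        # Cosine similarity = 1 - cosine distance. SciPy's cosine()
        # expects 1-D vectors, hence toarray() before indexing.
        dense = bow.toarray()
        print(1 - cosine(dense[0], dense[1]))  # high: shared vocabulary
        print(1 - cosine(dense[0], dense[2]))  # 0.0: no shared terms

        # Capturing a bit of context with n-grams (unigrams + bigrams):
        bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
        bow_ngrams = bigram_vectorizer.fit_transform(corpus)
        print(bigram_vectorizer.get_feature_names_out()[:5])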

Comments • 12

  • @vipulmaheshwari2321 · 1 year ago +2

    I am truly amazed by the excellence of this course. It is undoubtedly the finest NLP course I have come across, and the teaching and explanations provided are unparalleled. I have the utmost respect and admiration for it. Kudos to you, and thank you for such a remarkable learning experience! BOWING DOWN IN RESPECT!

  • @NAEXTRO · 1 year ago +3

    Thanks for this awesome course. :)

  • @user-nm5jl8gy1u · 1 year ago

    Great lectures; I learned a lot of NLP concepts.

  • @aneshsrivastav8092 · 1 year ago

    You are the best!! This course is so, so helpful, man!!

  • @frankrobert9199 · 1 year ago

    Great lectures.

  • @techaztech2335 · 6 months ago

    I'm a bit confused about the cosine similarity metric. I thought the cosine similarity range is from -1 to 1, not 0 to 1. I've seen the 0-to-1 range used elsewhere as well, but I notice that the more popular embedding models generate negative vector elements, and naturally the normalized versions produce values ranging from -1 to 1. Can you please clarify this? I've been struggling to wrap my head around it.
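
    The short answer to the question above: cosine similarity ranges over [-1, 1] in general, but basic bag-of-words vectors contain only non-negative counts, so the angle between any two BoW vectors is at most 90 degrees and the similarity lands in [0, 1]. Embedding models can produce negative vector components, which is why their similarities span the full [-1, 1] range. A small sketch illustrating both cases (the vectors are made up for illustration):

        import numpy as np

        def cosine_sim(a, b):
            # Standard cosine similarity: dot product over product of norms.
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

        # BoW count vectors are non-negative, so similarity stays in [0, 1].
        print(cosine_sim(np.array([6, 6]), np.array([4, 2])))  # ~0.949

        # Embedding-style vectors can have negative components,
        # so similarity can drop all the way to -1.
        print(cosine_sim(np.array([1.0, -0.5]), np.array([-1.0, 0.5])))  # -1.0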

  • @metavore7790 · 10 months ago

    If anybody is getting "ValueError: Input vector should be 1-D" in the Cosine Similarity section, the fix is simple: move the index from the sparse matrix to the dense array. For example, replace
        bow[0].toarray()
    with
        bow.toarray()[0]
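
    The reason the fix works, assuming bow is the sparse matrix returned by CountVectorizer's fit_transform (as in the notebook): indexing the sparse matrix first preserves a 2-D shape, while indexing the dense array yields the 1-D vector SciPy expects. A quick check:

        from sklearn.feature_extraction.text import CountVectorizer

        bow = CountVectorizer().fit_transform(["one doc", "another doc"])
        print(bow[0].toarray().shape)   # (1, 3): still 2-D, trips SciPy's check
        print(bow.toarray()[0].shape)   # (3,): 1-D, what cosine() expects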

  • @zhuchenwang4747 · 11 months ago

    Hi sir, the dot product calculation at 9:05 in the video may be wrong. It should be (6 x 4) + (6 x 2) = 36. By the way, your videos are very helpful for a beginner. Thank you very much for your effort. Looking forward to seeing more good videos on your channel.

    • @futuremojo · 11 months ago

      Thank you for the correction!
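
    For the record, the corrected arithmetic checks out (assuming, as the comment implies, the two vectors at that point in the video are [6, 6] and [4, 2]):

        import numpy as np
        print(np.dot([6, 6], [4, 2]))  # (6*4) + (6*2) = 24 + 12 = 36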