NLP Demystified 5: Basic Bag-of-Words and Measuring Document Similarity
- Added on Aug 2, 2024
- Course playlist: • Natural Language Proce...
After preprocessing our text, we take our first step toward turning text into numbers so our machines can start working with it. We'll explore:
- a simple "bag-of-words" (BoW) approach.
- how to use cosine similarity to measure document similarity.
- the shortcomings of this BoW approach.
In the demo, we'll use a combination of spaCy and scikit-learn to build BoW representations and perform simple document similarity search.
Colab notebook: colab.research.google.com/git...
Timestamps:
00:00:00 Basic bag-of-words (BoW)
00:00:22 The need for vectors
00:00:53 Selecting and extracting features from our data
00:04:04 Idea: similar documents share similar vocabulary
00:04:46 Turning a corpus into a BoW matrix
00:07:10 What vectorization helps us accomplish
00:08:20 Measuring document similarity
00:11:09 Shortcomings of basic BoW
00:12:37 Capturing a bit of context with n-grams
00:14:10 DEMO: creating basic BoW with scikit-learn and spaCy
00:17:47 DEMO: measuring document similarity
00:18:40 DEMO: creating n-grams with scikit-learn
00:19:35 Basic BoW recap
This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.
Visit www.nlpdemystified.org/ to learn more.
I am truly amazed by the excellence of this course. It is undoubtedly the finest NLP course I have come across, and the teaching and explanations provided are unparalleled. I have the utmost respect and admiration for it. Kudos to you, and thank you for such a remarkable learning experience! BOWING DOWN IN RESPECT!
Thank you so much!
Thanks for this awesome course. :)
great lectures, I learned a lot of NLP concepts.
You are the best!! This course is soo soo helpful man!!
great lectures.
I'm a bit confused about the cosine similarity metric. I thought cosine similarity ranges from -1 to 1, not 0 to 1. I've seen the 0-to-1 range used elsewhere as well, but I notice that popular embedding models generate negative vector elements, so their normalized versions naturally produce values from -1 to 1. Can you please clarify? I've been struggling to wrap my head around this.
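One way to see why both ranges show up (a sketch, not an official answer from the course): cosine similarity is bounded by [-1, 1] in general, but BoW count vectors are non-negative, so their dot product, and hence their cosine similarity, can never drop below 0. Embedding vectors can have negative components, so the full [-1, 1] range applies there:

```python
import numpy as np

def cos_sim(a, b):
    # Standard cosine similarity: dot product over the product of norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Non-negative count vectors (like BoW rows): similarity stays in [0, 1].
print(cos_sim(np.array([2, 0, 1]), np.array([0, 3, 1])))

# Vectors with negative components (like embeddings) can reach -1.
print(cos_sim(np.array([1.0, -1.0]), np.array([-1.0, 1.0])))
```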
If anybody is getting "ValueError: Input vector should be 1-D" in the cosine similarity section, the fix is simple: move the index outside toarray(). For example, replace bow[0].toarray() with bow.toarray()[0].
Thank you! Code updated.
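For context, a self-contained sketch of the fix described above (assuming the error comes from SciPy's cosine distance, which requires 1-D input; the corpus here is made up):

```python
# Sketch of the "Input vector should be 1-D" fix (assumed setup: a sparse
# BoW matrix from CountVectorizer and SciPy's cosine distance).
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import cosine

corpus = ["the cat sat", "the cat slept"]
bow = CountVectorizer().fit_transform(corpus)

# bow[0].toarray() has shape (1, n) -- 2-D, which SciPy rejects.
# Indexing after toarray() yields the 1-D row vector SciPy expects.
v0 = bow.toarray()[0]
v1 = bow.toarray()[1]

similarity = 1 - cosine(v0, v1)  # cosine() returns a distance
print(similarity)
```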
Hi, I think the dot product calculation at 9:05 in the video may be wrong. It should be (6×4)+(6×2)=36. By the way, your videos are very helpful for a beginner. Thank you very much for your effort. Looking forward to seeing more good videos on your channel.
Thank you for the correction!
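Assuming the two vectors at 9:05 are (6, 6) and (4, 2), as inferred from the comment above (not re-checked against the video), the corrected calculation works out as:

```python
import numpy as np

# Assumed vectors from the 9:05 example, inferred from the comment above.
a = np.array([6, 6])
b = np.array([4, 2])

dot = np.dot(a, b)  # (6*4) + (6*2) = 24 + 12 = 36
similarity = dot / (np.linalg.norm(a) * np.linalg.norm(b))
print(dot, round(similarity, 3))
```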