NLP Demystified 5: Basic Bag-of-Words and Measuring Document Similarity

  • Added 2 Aug 2024
  • Course playlist: Natural Language Proce...
    After preprocessing our text, we take our first step in turning text into numbers so our machines can start working with them. We'll explore:
    - a simple "bag-of-words" (BoW) approach,
    - how to use cosine similarity to measure document similarity, and
    - the shortcomings of this BoW approach.
    In the demo, we'll use a combination of spaCy and scikit-learn to build BoW representations and perform a simple document similarity search; a minimal sketch of this appears after the description.
    Colab notebook: colab.research.google.com/git...
    Timestamps:
    00:00:00 Basic bag-of-words (BoW)
    00:00:22 The need for vectors
    00:00:53 Selecting and extracting features from our data
    00:04:04 Idea: similar documents share similar vocabulary
    00:04:46 Turning a corpus into a BoW matrix
    00:07:10 What vectorization helps us accomplish
    00:08:20 Measuring document similarity
    00:11:09 Shortcomings of basic BoW
    00:12:37 Capturing a bit of context with n-grams
    00:14:10 DEMO: creating basic BoW with scikit-learn and spaCy
    00:17:47 DEMO: measuring document similarity
    00:18:40 DEMO: creating n-grams with scikit-learn
    00:19:35 Basic BoW recap
    This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.
    Visit www.nlpdemystified.org/ to learn more.
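
    For anyone who wants to experiment before opening the notebook, here is a minimal sketch of the demo's approach using scikit-learn's CountVectorizer and SciPy's cosine distance. The corpus and variable names below are illustrative, not taken from the notebook.

        from sklearn.feature_extraction.text import CountVectorizer
        from scipy.spatial.distance import cosine

        # A toy corpus; the notebook uses its own documents.
        corpus = [
            "the cat sat on the mat",
            "a cat sat on a mat",
            "neural networks learn representations",
        ]

        # Bag-of-words matrix: one row per document, one column per
        # vocabulary term, entries are raw token counts.
        vectorizer = CountVectorizer()
        bow = vectorizer.fit_transform(corpus)

        # Cosine similarity = 1 - cosine distance. SciPy's cosine()
        # expects 1-D vectors, hence toarray() before indexing.
        dense = bow.toarray()
        print(1 - cosine(dense[0], dense[1]))  # high: shared vocabulary
        print(1 - cosine(dense[0], dense[2]))  # 0.0: no shared terms

        # Capturing a bit of context with n-grams (unigrams + bigrams):
        bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
        bow_ngrams = bigram_vectorizer.fit_transform(corpus)
        print(bigram_vectorizer.get_feature_names_out()[:5])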

Comments • 12

  • @vipulmaheshwari2321 · 1 year ago +2

    I am truly amazed by the excellence of this course. It is undoubtedly the finest NLP course I have come across, and the teaching and explanations provided are unparalleled. I have the utmost respect and admiration for it. Kudos to you, and thank you for such a remarkable learning experience! BOWING DOWN IN RESPECT!

  • @NAEXTRO · 1 year ago +3

    Thanks for this awesome course. :)

  • @user-nm5jl8gy1u · 1 year ago

    Great lectures; I learned a lot of NLP concepts.

  • @aneshsrivastav8092 · 1 year ago

    You are the best!! This course is so, so helpful, man!!

  • @frankrobert9199 · 1 year ago

    Great lectures.

  • @techaztech2335 · 6 months ago

    I'm a bit confused about the cosine similarity metric. I thought the cosine similarity range is from -1 to 1, not 0 to 1. I've seen the 0-to-1 range used elsewhere as well, but I notice that the more popular embedding models generate negative vector elements, and naturally the normalized versions produce values ranging from -1 to 1. Can you please clarify this? I've been struggling to wrap my head around it.
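
    The short answer to the question above: cosine similarity ranges over [-1, 1] in general, but basic bag-of-words vectors contain only non-negative counts, so the angle between any two BoW vectors is at most 90 degrees and the similarity lands in [0, 1]. Embedding models can produce negative vector components, which is why their similarities span the full [-1, 1] range. A small sketch illustrating both cases (the vectors are made up for illustration):

        import numpy as np

        def cosine_sim(a, b):
            # Standard cosine similarity: dot product over product of norms.
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

        # BoW count vectors are non-negative, so similarity stays in [0, 1].
        print(cosine_sim(np.array([6, 6]), np.array([4, 2])))  # ~0.949

        # Embedding-style vectors can have negative components,
        # so similarity can drop all the way to -1.
        print(cosine_sim(np.array([1.0, -0.5]), np.array([-1.0, 0.5])))  # -1.0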

  • @metavore7790 · 10 months ago

    If anybody is getting "ValueError: Input vector should be 1-D" in the Cosine Similarity section, the fix is simple: move the index from the sparse matrix to the dense array. For example, replace
        bow[0].toarray()
    with
        bow.toarray()[0]
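
    The reason the fix works, assuming bow is the sparse matrix returned by CountVectorizer's fit_transform (as in the notebook): indexing the sparse matrix first preserves a 2-D shape, while indexing the dense array yields the 1-D vector SciPy expects. A quick check:

        from sklearn.feature_extraction.text import CountVectorizer

        bow = CountVectorizer().fit_transform(["one doc", "another doc"])
        print(bow[0].toarray().shape)   # (1, 3): still 2-D, trips SciPy's check
        print(bow.toarray()[0].shape)   # (3,): 1-D, what cosine() expects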

  • @zhuchenwang4747 · 11 months ago

    Hi sir, the dot product calculation at 9:05 in the video may be wrong. It should be (6 x 4) + (6 x 2) = 36. By the way, your videos are very helpful for a beginner. Thank you very much for your effort. Looking forward to seeing more good videos on your channel.

    • @futuremojo · 11 months ago

      Thank you for the correction!
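
    For the record, the corrected arithmetic checks out (assuming, as the comment implies, the two vectors at that point in the video are [6, 6] and [4, 2]):

        import numpy as np
        print(np.dot([6, 6], [4, 2]))  # (6*4) + (6*2) = 24 + 12 = 36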