NLP Demystified 2: Text Tokenization

  • Published 2 Aug 2024
  • Course playlist: • Natural Language Proce...
    The usual first step in NLP is to chop our documents into smaller pieces in a process called Tokenization. We'll look at the challenges involved and how to get it done.
    Colab notebook: colab.research.google.com/git...
    Timestamps:
    00:00 Tokenization
    00:12 Text as unstructured data
    00:39 What is tokenization?
    01:09 The challenges of tokenization
    03:09 DEMO: tokenizing text with spaCy
    07:55 Preprocessing as a pipeline
    This video is part of Natural Language Processing Demystified, a free, accessible course on NLP.
    Visit www.nlpdemystified.org/ to learn more.
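The chopping-into-tokens step the description mentions can be sketched in a few lines. This is a minimal sketch, not the video's exact demo: it uses spaCy's blank English pipeline (rule-based tokenizer only, no model download), whereas the notebook loads a full trained model.

```python
# Minimal tokenization sketch with spaCy's rule-based tokenizer.
# Assumes `pip install spacy`; spacy.blank("en") needs no model download.
import spacy

nlp = spacy.blank("en")  # blank pipeline: tokenizer only
doc = nlp("Let's chop this sentence into tokens, shall we?")
tokens = [token.text for token in doc]
print(tokens)
```

Note how the contraction "Let's" is split into two tokens and the punctuation becomes tokens of its own, which is exactly the kind of edge case the video discusses.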

Comments • 27

  • @pictzone
    @pictzone 1 year ago +10

    This guy posted a mind-blowing series and then left. Thank you, you're a legend!

  • @anissahli-gl9ud
    @anissahli-gl9ud 8 months ago +2

    I'm from France and I just came across this superb playlist, which for me is the most complete on YouTube! Thank you, a big thank you! It's hard to find training of such quality.

  • @MrNeelthehulk
    @MrNeelthehulk 1 year ago +1

    Thanks for posting this series buddy!!

  • @caiyu538
    @caiyu538 1 year ago +2

    Great lectures, thumbs up.

  • @somerset006
    @somerset006 11 months ago +1

    Nicely done, thanks!

  • @user-nm5jl8gy1u
    @user-nm5jl8gy1u 1 year ago +1

    Great to know more about NLP concepts. Some of these concepts aren't mentioned in the Hugging Face tutorials. I guess they may be a little outdated in the era of transformers.

  • @alp1234alp1234
    @alp1234alp1234 1 year ago +1

    Thank you so much for offering such high quality content 🎉

  • @BadEnoughDudeRescues
    @BadEnoughDudeRescues 1 year ago +1

    Hi, fantastic course! Wondering if by any chance there are solutions available for the exercises in the notebooks? I checked the GitHub and Colab but was unable to find any.

    • @futuremojo
      @futuremojo 1 year ago +1

      Thank you!
      I didn't publish solutions for the exercises but if you're stuck, email me and I'll help you out.

  • @SatyaRao-fh4ny
    @SatyaRao-fh4ny 7 months ago

    These are very helpful videos, thank you! There are still a few concepts that are unclear.

    You mentioned that documents are segmented into a list of sentences, and each sentence into a list of tokens. This implies that the list of tokens is empty to begin with, and after tokenization we end up with a list of tokens (a token vocabulary?) specific to the corpus we provide. But later, when you start tokenizing with spaCy, you load some db. What is that doing? Shouldn't spaCy just be a program/tool with some "advanced rules" that tokenizes the document we provide and builds a new token vocabulary from scratch, rather than starting from its own db/list built from some unknown corpus?

    And finally, why tokenize one sentence at a time: because a document can be large? Could it instead read in a fixed number of words at a time, say 100, and tokenize those? A "sentence" should have no meaning for the tokenizer, right? Actually, how does a tokenizer even "know" where a sentence starts and ends? Thanks for any clarifications!

    • @nebvoice
      @nebvoice 2 months ago

      The db you're referring to is the statistical model that was trained on some annotated data (I forget the name). That's what processes the given document or sentences. spaCy is just a module that helps us tokenize our data according to that statistical model. ... All this, I think. Just a beginner...
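On the sentence-boundary part of the question above: spaCy can mark sentence boundaries either with a statistical component from a trained model (the "db" being discussed) or with a simple rule-based `sentencizer`. A minimal sketch of the rule-based route, which needs no model download:

```python
# Rule-based sentence segmentation sketch: the "sentencizer" component
# splits on end punctuation. The downloadable trained pipelines use
# statistical components learned from annotated text instead.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # punctuation-based boundary detection
doc = nlp("Tokenization is tricky. Sentence boundaries are too!")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

So the tokenizer itself doesn't "know" about sentences; sentence boundaries come from a separate pipeline component that runs after tokenization.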

  • @CC-nz2oc
    @CC-nz2oc 9 months ago

    Hello. Thank you for such a detailed course. I have a question about using pre-trained language models. My language (Azerbaijani) is not yet available in the library. Do you cover this topic later, or is it not worth spending time on learning without this model?

  • @user-nm5jl8gy1u
    @user-nm5jl8gy1u 1 year ago

    Since Hugging Face and OpenAI provide APIs for use, could we skip spaCy and NLTK, these relatively older libraries?

    • @futuremojo
      @futuremojo 1 year ago

      spaCy uses transformers under the hood.
      That being said, I would use the HF libraries if you're looking to do more fine-grained work than calling out to an LLM.

    • @user-nm5jl8gy1u
      @user-nm5jl8gy1u 1 year ago

      @@futuremojo Thank you so much. I learned a lot of NLP concepts from your great lectures.

  • @oluOnline
    @oluOnline 1 year ago

    Is it still possible to connect to a local runtime? I can't see an obvious connect button. May delete this if I solve it, thanks for any help!

    • @futuremojo
      @futuremojo 1 year ago +1

      Hey Olu: yes, it's possible. These instructions worked for me:
      research.google.com/colaboratory/local-runtimes.html

    • @oluOnline
      @oluOnline 1 year ago

      @@futuremojo Thanks so much; very fast reply also!
      Do you happen to know if Colab has any quirks with the zsh shell? Googling turns up nothing, but the first pip install in the notebook returns: `zsh:1: no matches found: spacy==3.*`
      edit: seems to work without the ==3! Now I'm trying to work out why it doesn't recognise it as a module...

    • @futuremojo
      @futuremojo 1 year ago +1

      @@oluOnline I just tried on zsh and got the same error.
      Googled it and found this:
      stackoverflow.com/questions/30539798/zsh-no-matches-found-requestssecurity
      When I use quotes like this:
      pip install -U 'spacy==3.*'
      It works!

    • @oluOnline
      @oluOnline 1 year ago

      @@futuremojo Final question: import says module not found? Sorry for all these setup questions; I'm unsure what's zsh, what's Colab, and what's Python! (It would be nice to have an extra page before lesson 0 with an intro to the tools used, e.g. Jupyter notebooks, Colab, and I assume a load of other stuff.)

    • @futuremojo
      @futuremojo 1 year ago

      @@oluOnline Is the problem happening when you import spaCy? If so, I'm not getting that.
      Here's a video I shot of me starting in an empty pipenv shell and installing spacy.
      www.loom.com/share/252f86aab1394b3580840ea2f55cba54
      My guess is that there's an environment issue where pip is installing it in one Python environment, but you're trying to import it in *another* Python environment. Are you using a tool like virtualenv to isolate environments?
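A quick way to check for the environment mismatch described above is to ask the running interpreter where it lives; this is a small diagnostic sketch, not part of the course notebook:

```python
# Sketch: find out which interpreter is actually running your code.
# If this path differs from the Python that pip installed into
# (compare with `python -m pip --version` in your shell), the import
# will fail even though the install "succeeded".
import sys

print(sys.executable)  # path of the running interpreter
print(sys.prefix)      # root of the active (virtual) environment
```

Running `python -m pip install -U 'spacy==3.*'` instead of bare `pip` guarantees the package lands in the same environment as that `python`.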

  • @gnorts_mr_alien
    @gnorts_mr_alien 1 year ago +2

    you have a radio voice.

  • @michaelcharlesthearchangel

    Interesting to see AI developers reword phraseology concepts and language morphemes into "token" corporate keywords.
    English majors and language doctorates are laughing 😆🤣, and asking why 🤔?