Machine Learning with Text in scikit-learn (PyCon 2016)

  • Added Sep 6, 2024
  • Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we'll build and evaluate predictive models from real-world text using scikit-learn. (Presented at PyCon on May 28, 2016.)
    GitHub repository: github.com/jus...
    Enroll in my online course: courses.datasc...
    == OTHER RESOURCES ==
    My scikit-learn video series: • Machine learning in Py...
    My pandas video series: • Data analysis in Pytho...
    == LET'S CONNECT! ==
    Newsletter: www.dataschool...
    Twitter: / justmarkham
    Facebook: / datascienceschool
    LinkedIn: / justmarkham
    YouTube: www.youtube.co...
    JOIN the "Data School Insiders" community and receive exclusive rewards:
    / dataschool

Comments • 244

  • @jayanthkumar7964
    @jayanthkumar7964 6 years ago +17

    Your videos just feel so friendly and inclusive, while being really educational. Your way of teaching is great. I thank you sincerely!

    • @dataschool
      @dataschool  6 years ago

      Thanks very much for your kind words! You are very welcome!

  • @RaynerGS
    @RaynerGS 4 years ago +1

    The method he uses to explain every concept is truly didactic. Some teachers explain terms with more terms, and in the end you understand nothing; Kevin Markham, however, explains each term precisely without leaning on other terms. I admire the way he teaches. Way to go, and greetings from Brazil!

    • @dataschool
      @dataschool  3 years ago +1

      Thanks very much for your kind words! 🙏

  • @akshitdayal2689
    @akshitdayal2689 3 years ago +3

    I've followed this series, and it has given me great insight into machine learning as I've just started learning it. Thank you so much!

  • @mmpcse
    @mmpcse 4 years ago +1

    I'm an SAP ABAP engineer trying to integrate Python with ABAP. I've seen a few videos on Python ML, but listening to Kevin's video reminds me of a Steve Jobs keynote: clear, concise, calm, and rich with embedded knowledge. I will be watching this video multiple times because it has rich practical content, and more importantly, Kevin's way of speaking commands one's attention 🙂. Keep guiding us 🙏.

  • @jgajul2
    @jgajul2 8 years ago +32

    The best tutorial I have ever watched! Kevin, you have mastered both the art of machine learning and the art of teaching :)

    • @dataschool
      @dataschool  8 years ago +2

      Wow! What a kind compliment... thanks so much!

    • @nureyna629
      @nureyna629 5 years ago

      This guy is gifted.

  • @thebanjoranger
    @thebanjoranger 4 years ago +3

    I could listen to this voice all day.

  • @syedasad3047
    @syedasad3047 1 year ago +1

    Your way of teaching is absolutely the best. Thanks a lot for your time and effort. May God Bless you.

    • @dataschool
      @dataschool  1 year ago

      Thanks very much for your kind words!

  • @lalithdupathi5174
    @lalithdupathi5174 7 years ago

    I am an electronics student, but your vigor and teaching skill in ML have really drawn me toward it.
    Thank you for the great head start you've given!

    • @dataschool
      @dataschool  7 years ago

      You're very welcome! Good luck in your machine learning education.

  • @keepfeatherinitbrothaaaa
    @keepfeatherinitbrothaaaa 7 years ago +1

    Holy crap, he can talk at a normal speed! Anyway, this series was great. I can find my way around with Python but I'm a complete beginner to data science and machine learning and I've learned a ton. I will definitely be re-watching this entire series to really grasp the material. Thanks again, keep up the good work.

    • @dataschool
      @dataschool  7 years ago

      HA! Yes, that's my normal talking speed :)
      Glad you liked the series - I appreciate your comment!

  • @tseringpaljor8679
    @tseringpaljor8679 8 years ago +4

    Hands down the best machine learning presentation I've seen thus far. Definitely looking forward to enrolling in your course once I'm done with your other free intro material. I think what sold me is how you've focused ~3 hours on a specific ML approach (supervised learning) to a common domain (text analysis). Other ML intros try to fit classification/regression/clustering all into 3 hours, which becomes too superficial a treatment. Anyway, bravo and keep up the great work!

    • @dataschool
      @dataschool  8 years ago

      Wow, thank you so much! What you're describing was exactly my goal with the tutorial, so I'm glad it met your needs!
      For others who are interested, here's a link to my online course: www.dataschool.io/learn/

  • @debanitadasgupta790
    @debanitadasgupta790 5 years ago +3

    The BEST ML tutorials I have come across... Thanks a lot... God bless you...

    • @dataschool
      @dataschool  5 years ago

      Thanks so much for your kind words!

  • @payalbhatia5244
    @payalbhatia5244 5 years ago +1

    @Data School, again and again, you are the best, Kevin. I was scared of text analytics and web scraping, but you teach in such an intuitive and lucid way. Thanks a ton!

    • @dataschool
      @dataschool  5 years ago

      Thanks very much for your kind words!

  • @okao08
    @okao08 7 years ago +1

    I couldn't find any relevant video on YouTube for doing text analysis with machine learning... wow, this was a great video and an eye-opener for machine learning. Thank you so much, Kevin!

    • @dataschool
      @dataschool  7 years ago +1

      You're very welcome! Glad it was helpful to you!

    • @okao08
      @okao08 7 years ago

      Hi Kevin... I have several tokenized text files. I want to compare each of these text files with another text file and check the similarities or differences.
      How am I able to do that using scikit-learn or NLTK?

  • @tissues2441
    @tissues2441 6 years ago +1

    I can't wait until I have watched enough of your content to start on your courses.

  • @taotaotan5671
    @taotaotan5671 5 years ago

    Boy, you made the best tutorial. Talking slowly is magical!

  • @nehagupta7904
    @nehagupta7904 8 years ago

    You are indeed a "GURU" who can train and share knowledge in the true sense.
    I'm a non-technical person, but I'm learning Python and scikit-learn for my research, and this video has taken my understanding to a higher level in just 3 hours... THANK YOU VERY MUCH Kevin!!! Can you please recommend some links where I can learn more about short-text sentiment analysis using machine learning in Python, especially the feature engineering aspect, like using POS tags and word embeddings as features? Thanks again...

    • @dataschool
      @dataschool  8 years ago +1

      You are very welcome! Regarding recommended links, I think this notebook might be helpful to you: nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb
      Good luck!

  • @zankbennett8340
    @zankbennett8340 8 years ago +19

    Great video. The problem with the audio is that the two channels are the inverse of each other, so on mono devices where the L and R channels are summed together, they completely cancel the output signal. I don't know of a workaround except to listen on a 2-channel system.

    • @dataschool
      @dataschool  8 years ago +1

      Wow! Thanks for the explanation. How did you figure that out? I spent probably an hour with the A/V people at the conference as they tried to figure out the problem, and they never came up with any clear explanation.

    • @tompara3
      @tompara3 5 years ago +1

      If you don't need or care about the stereo effect (which is obvious here, because there's only a monologue in this video), "jack normalling" via an audio mixer is the solution.
      Input: plug either the L or R channel (say L, for example) into the "jack normalling" port of an audio mixer. The output of the mixer will then be L x L, because the L signal (note: the "signal") is copied to the R channel on the fly. Vice versa if you use the R channel as input, which becomes R x R on output. Thus, on playback on either a mono or stereo device, the L and R channels will have the same phase and always sound the same.
      PS: it's strange that the L and R channels are inverses of each other. The only explanation is that the A/V people somehow reversed the polarities of their L and R jacks (assuming professional XLR jacks in this case).
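
      If you'd rather repair a downloaded copy of the talk, here is a minimal Python sketch of the channel-copy fix described above, assuming the audio has been extracted to a WAV file ('talk.wav' is a hypothetical filename):

      # copy the left channel over the right so both channels share the same phase
      from scipy.io import wavfile
      rate, audio = wavfile.read('talk.wav')   # stereo audio has shape (n_samples, 2)
      audio = audio.copy()                     # make sure the array is writable
      audio[:, 1] = audio[:, 0]                # overwrite the right channel with the left
      wavfile.write('talk_fixed.wav', rate, audio)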

  • @ibtsamgujjar8697
    @ibtsamgujjar8697 7 years ago

    Just want to thank you for the awesome series. I am new to machine learning, and you are one of my first and favorite teachers on this journey :)

    • @dataschool
      @dataschool  7 years ago

      You are very welcome! Good luck on your journey! :)

  • @lingobol
    @lingobol 8 years ago

    Wonderful set of videos. I have started my ML journey with these videos. Now gonna go deeper and practise more and more.
    Thanks Kevin for the best possible head start.
    Your Fan,
    A beginner Data Scientist.

    • @dataschool
      @dataschool  7 years ago

      You're very welcome! That's excellent to hear... good luck!

  • @gcm4312
    @gcm4312 8 years ago +13

    This is a great resource. Thank you for sharing

    • @dataschool
      @dataschool  8 years ago +1

      You're very welcome!

    • @7justfun
      @7justfun 7 years ago

      Data School, can you point me to a demo/material for hierarchical clustering (agglomerative, preferably)? Would count vectorization work for such a scenario before we apply kNN or mean shift?

    • @anakwesleyan
      @anakwesleyan 7 years ago

      A great resource indeed. What I find extremely helpful is that it explains the small but critical aspects of the library, e.g. that CountVectorizer only takes 1-D input, what sparse data in SciPy looks like, etc.
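
      For anyone skimming, a tiny sketch of those two points (1-D input, sparse output):

      from sklearn.feature_extraction.text import CountVectorizer
      vect = CountVectorizer()
      dtm = vect.fit_transform(['call me tonight', 'Call me a cab'])  # a 1-D list of strings
      print(type(dtm))      # a SciPy sparse matrix, not a dense NumPy array
      print(dtm.toarray())  # densify only to inspect small examples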

  • @KurzedMetal
    @KurzedMetal 6 years ago +2

    Using the 1.5x speed YT feature is perfect for this video :)
    I'm halfway through the video so far and I'm enjoying it a lot. Kudos to the presenter.

    • @dataschool
      @dataschool  6 years ago

      Glad you are enjoying it! :)

    • @nureyna629
      @nureyna629 5 years ago

      I did the same from video 1. I have just spent 3 days practicing everything, and I really enjoyed the show :)

  • @AnkitSharma-hk8yq
    @AnkitSharma-hk8yq 7 years ago

    I am doing a college project on machine learning. It was very helpful. Thank you

  • @gtalpc59
    @gtalpc59 6 years ago

    I have gone through a ton of videos and materials on machine learning, but this is the best: properly paced, easy to follow, and it really takes you inside machine learning. I am keen to know whether you will start on deep learning and TensorFlow soon? It would be really helpful for those who are confused by the overwhelming amount of material. Thanks a lot!!

    • @dataschool
      @dataschool  6 years ago

      So glad to hear that my videos have been helpful to you! As far as deep learning, I don't have any upcoming videos or courses planned, but it is certainly under consideration.

  • @Torakashi
    @Torakashi 5 years ago

    I really enjoy your structured approach to teaching these classes :)

    • @dataschool
      @dataschool  5 years ago

      Thanks! You should check out my online course: www.dataschool.io/learn/

  • @socialist_king
    @socialist_king 7 years ago

    THIS is some great stuff... really helpful. I am working on my final year project on the classification of cattle and wanted to use machine learning (for facial recognition of both pets and livestock).

    • @dataschool
      @dataschool  7 years ago

      Very cool project! So glad to hear that the video was helpful to you!

  • @sibinh
    @sibinh 7 years ago +1

    Thanks Kevin for your great presentation, as always. I think it would be great if the presentation included feature selection, e.g. the chi-squared test...

    • @dataschool
      @dataschool  7 years ago

      Thanks for the suggestion! I'll consider that for future videos.

  • @jasonxoc
    @jasonxoc 7 years ago +4

    Anyone having audio issues: the right channel is completely out of phase with the left channel. So use something like Audio Hijack Pro and insert an audio unit between your browser (Safari/Chrome/Firefox) and the output speakers to either duplicate the left channel or flip the right channel. Or use headphones, as your brain will sum it just fine; it only sounds left-heavy because of the Haas effect. Using speakers is a sure way to make yourself uncomfortable, and if you don't hear anything at all, it's because your device is mono and summing the signals leaves very little wave. (To the venue engineer: don't record in stereo unless you know how to record in phase.)

    • @dataschool
      @dataschool  7 years ago +1

      Thanks for the suggestions and the technical explanation! I talked with the audio engineers at the conference numerous times, and they were never able to explain the source of the problem!

    • @jasonxoc
      @jasonxoc 7 years ago

      Right on, hopefully it helps someone else. It took me a while to figure out how to flip the channel. By the way, your videos are great, man. Thanks so much for them!

    • @dataschool
      @dataschool  7 years ago

      You're very welcome! Thanks for your kind comments, and I'm glad you have enjoyed the videos!

    • @FULLCOUNSEL
      @FULLCOUNSEL 6 years ago

      Sad, the audio doesn't work... I'm stranded too.

  • @omparghale
    @omparghale 1 year ago +1

    Hey Kevin, firstly, thanks for all the pandas content you've put on your channel; it helped greatly!!
    I wanted to know whether this sklearn PyCon tutorial is still applicable in 2023, or is the syntax today wildly different from what it was back in 2016?

    • @dataschool
      @dataschool  1 year ago

      Glad to hear the pandas videos have been helpful! Yes, this tutorial is absolutely still relevant; in fact, very little of the scikit-learn syntax used in the video has changed.

  • @bennineo6372
    @bennineo6372 5 years ago

    This is a great, great tutorial and in depth explanation on many related topics! Thanks so much!

  • @lprevost69
    @lprevost69 7 years ago

    Very nice work, Kevin. I suspect I did what a lot of people do: jump into ML without a lot of fundamentals. After doing one of the "hello world" ML tutorials (the iris dataset), I immediately "wired up" my features, which were of course full of text, and crashed my model with string errors. After that crash, your video was my "back to the drawing board" trek to get some fundamentals in place, and I'm now refreshed and ready to try it again!
    Question: my real-world problem is trouble tickets (documents) with a variety of "features", including some long text fields (i.e. problem description or action taken, which carry sentiment) and some category fields that resolve to maybe 8 categories. I'm ultimately trying to categorize these "tickets" into about 5-6 categories (a multi-class classification problem). So, using your ham/spam email example, I have 2-3 long text fields that will need to be vectorized into DTMs (probably each with separate vocabularies), plus some categorical feature inputs to the model. And rather than ham/spam, the model needs to predict multiple classes (i.e. 5-6 categories of tickets). I'm running into problems where the pandas frame has all this, but some of it is in Object columns, which don't directly produce NumPy arrays.
    Can you make any suggestions on how to approach the work? After spending my Saturday and Sunday with your exercise, I think this is how I should approach it:
    1) Read the data into a pandas dataframe.
    2) Count-vectorize the two long text columns into separate DTMs. Do I then need to join the arrays?
    3) You mentioned that scikit-learn is not clear on whether category features have to be binarized or not. I'll figure that out. Same with the prediction classes.
    4) Train the model on that.
    Also, I recall that in your course you mentioned some concepts called "feature unions" and "transformers" in response to a question I could not hear. You gave some recommendations on using ensemble methods and "transformer features next to one another." This sounds like a clue to my problem. Any recommendations on how to go deeper into that area?
    Of course, one of my very next steps is to sign up for your course!!

    • @dataschool
      @dataschool  7 years ago

      Thanks for the detailed question! I think that for step 2, my default approach would be to combine the text fields together for each ticket before vectorizing, which would result in a single document-term matrix (DTM). In other words, you avoid combining multiple DTMs, which may not provide any additional value over a single DTM.
      Regarding feature unions, here are some public resources that might be helpful to you:
      zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
      scikit-learn.org/stable/auto_examples/hetero_feature_union.html
      Regarding my course, I think you'd get a lot of value out of it given your goals. More information is here: www.dataschool.io/learn/
      Hope that helps, and good luck!

    • @lprevost69
      @lprevost69 7 years ago

      Wow! That is a good point, Kevin. One DTM makes a lot of sense. Would you agree even for the categorical features?
      In other words, would you just mix the two fields (the messy free-form request text and the category field) into the same DTM and let the vectorization do its thing on two columns rather than one? I could see how that would "look" the same to the estimator, since a category is just an extension of the DTM.
      Yes, I have also since found Zac Stewart's good work on feature unions and pipelines, and have even talked to him a bit about the approach. It seems like he has moved his methods on to things like the sklearn-pandas library (github.com/paulgb/sklearn-pandas/tree/feature_union_pipe, the PR that uses feature unions and pipelines in the code), which better supports pandas and dataframes.
      In contemplating your elegantly simple approach of combining, I'm now thinking I over-engineered this. I did get it working by building parallel pipelines of features from pandas columns with multiple transformers (CountVectorizer, TfidfTransformer, and LabelBinarizer) and then feature-joining these before inputting to the estimator. That method does simplify the learning and transforming process, but the tradeoff is that it complicates discerning which features drove the decision logic (i.e. it's hard to get features out of the complex pipeline of steps).
      Your approach of combining into one DTM may give me the best of both worlds. Thanks for your help, and I would appreciate confirmation on putting categorical features into the single DTM.

    • @dataschool
      @dataschool  7 years ago

      Thanks for the follow-up! Yes, I would agree that adding the categorical features to the DTM makes sense. However, you may want to append some text to the category names before adding them to the column of free-form text. For example, if the category is color, and possible values are "red" and "blue", you may want to add them to the free-form text as "colorred" and "colorblue". Why do this? Well, it's possible that seeing the word "red" in the text is a good predictor of ticket type A, and seeing the category "red" is a good predictor of ticket type B, and you want the model to be able to learn that information separately. Does that make sense?
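
      A short sketch of the two suggestions combined (the column names here are hypothetical, not from the actual ticket data):

      import pandas as pd
      from sklearn.feature_extraction.text import CountVectorizer
      df = pd.DataFrame({'description': ['screen is red and flickering'],
                         'action': ['replaced the cable'],
                         'color': ['red']})
      # combine the free-form text fields, then append the prefixed category so the
      # token 'colorred' stays distinct from the word 'red' in the free-form text
      combined = df['description'] + ' ' + df['action'] + ' color' + df['color']
      dtm = CountVectorizer().fit_transform(combined)  # a single document-term matrix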

  • @royxss
    @royxss 7 years ago

    This channel is so helpful. It actually helped me a lot during my semesters. Thank you so much (y)

  • @rahulsripuram8174
    @rahulsripuram8174 7 years ago

    Awesome, I really liked it. I will do a POC. Please suggest a few datasets other than spam/ham.

    • @dataschool
      @dataschool  7 years ago

      There are lots of great datasets here:
      archive.ics.uci.edu/ml/
      www.kaggle.com/datasets
      Hope that helps!

  • @juiguram7177
    @juiguram7177 7 years ago

    I just love your videos. They are a great help, especially for a non-programmer like me trying to learn data science. They have helped me understand all the concepts clearly in a short time, rather than reading through material. Your videos are my go-to resource for my college work.
    I want to see some content on grid search and pipelines. Also, could you please share your email? I have some more doubts.

    • @dataschool
      @dataschool  7 years ago

      Thanks for your kind words! I'm glad they have been helpful to you!
      Regarding grid search, I cover it in video 8 of my scikit-learn series: czcams.com/video/Gol_qOgRqfA/video.html
      Regarding pipeline, I cover it in modules 4 and 5 of my online course: www.dataschool.io/learn/ (You can also find my email address on this page.)
      Hope that helps!

    • @juiguram7177
      @juiguram7177 7 years ago

      The contact information part doesn't load on my system. Can you please post your email here?

    • @dataschool
      @dataschool  7 years ago

      kevin@dataschool.io

  • @ujwalsah2304
    @ujwalsah2304 4 years ago

    You are awesome Kevin

  • @anjangurung2538
    @anjangurung2538 6 years ago

    Thank you so much for this video. It cleared up all the doubts I had. Thank you again!

  • @rahulbhatia5657
    @rahulbhatia5657 5 years ago +8

    Is it still relevant in 2019? Thanks for letting me know

    • @dataschool
      @dataschool  5 years ago +2

      Absolutely still relevant! However, there are some changes to the scikit-learn API that are useful to know about: www.dataschool.io/how-to-update-your-scikit-learn-code-for-2018/

  • @benben341
    @benben341 7 years ago

    Thank you very much; I've just viewed your whole online course. I'm not really that super-duper with machine learning, but your courses certainly got me thinking and able to get scikit-learn to work, at least.
    One thing I will have to research is, if your initial dataset uses classes like good/bad instead of numbers such as 1/0, how to actually feed that into (I think it's) the "label.map" step from this video.
    This video shows how to do it briefly, but your "Machine learning in Python with scikit-learn" series does not cover it at all (unless I missed it somewhere).
    Also, near the end of your "Machine learning in Python with scikit-learn" series, the videos become longer, which means I have to stop them more often. So maybe more breaks could help.
    As I said, it's amazing what you have provided, and I'm just trying to offer some feedback instead of just being all take.

    • @dataschool
      @dataschool  7 years ago

      Thanks for your feedback! Regarding your question about numeric labels, I think this video might be helpful to you: czcams.com/video/P_q0tkYqvSk/video.html

  • @stepheniezzi34
    @stepheniezzi34 7 years ago

    To fix the audio issue on iPhone: use headphones and, in Settings, turn off mono audio (General >> Accessibility, then scroll down to Hearing).

    • @dataschool
      @dataschool  7 years ago

      Thanks for sharing that solution!

  • @donbasti
    @donbasti 7 years ago

    Great video, and the information was very clearly presented. Good work!

  • @donovankeating8577
    @donovankeating8577 6 years ago

    Really good talk. Very easy to follow. Thank you for sharing! :)

  • @sudhiirreddy7868
    @sudhiirreddy7868 7 years ago

    Thanks a lot for this resource... hoping to see more videos like this!

    • @dataschool
      @dataschool  7 years ago

      You're welcome! Glad it was helpful to you.

  • @moutaincold2218
    @moutaincold2218 7 years ago

    I like you and your videos very much. I hope you'll develop a more detailed course on scikit-learn and deep learning (TensorFlow).

    • @dataschool
      @dataschool  7 years ago

      Thanks for the suggestion! I'll definitely consider it for the future!
      Subscribing to my newsletter is a great way to hear when I release new courses: www.dataschool.io/subscribe/

  • @AshokPatel-qc1hz
    @AshokPatel-qc1hz 4 years ago

    To scale features, should we prefer standardization or normalization, and why? And when should each be used?

    • @dataschool
      @dataschool  4 years ago

      It depends on what you mean by those terms, because they are often used interchangeably.

  • @galymzhankenesbekov2924

    You make wonderful videos and courses; however, they are very expensive for international students like me.

  • @prakharsahu9498
    @prakharsahu9498 5 years ago

    Great video. I would like to know if you will be doing videos on tokenizing, stemming, lemmatizing, and other core NLP techniques.

    • @dataschool
      @dataschool  5 years ago

      You might be interested in my course, Machine Learning with Text in Python: www.dataschool.io/learn/

  • @saurabhsingh826
    @saurabhsingh826 6 years ago

    Excellent video. Thank you so much, Kevin sir; it really helped me a lot.

    • @dataschool
      @dataschool  6 years ago

      You're welcome!

    • @saurabhsingh826
      @saurabhsingh826 6 years ago

      Data School sir, I sent you an email a few days back from the ID saurabhs9913@gmail.com. Could you please go through it and let me know?

  • @charlinhos0824
    @charlinhos0824 8 years ago

    Thanks for sharing, Kevin. Apart from the obvious, I'm also curious about how you use Evernote for your daily lecture tasks; maybe that could be another great video to follow up with...

    • @dataschool
      @dataschool  8 years ago

      My Evernote usage is pretty simple... just storing and organizing task lists and links! :)

  • @deepanshnagaria4579
    @deepanshnagaria4579 6 years ago +1

    Sir, the video series was a great learning experience.
    Sir, can you suggest algorithms, in descending order of accuracy, for a model to detect emotions in text data?

    • @dataschool
      @dataschool  6 years ago

      It is impossible to know what algorithm will work best in advance of trying it out!

    • @dataschool
      @dataschool  6 years ago

      I don't have any resources to recommend, I'm sorry!

  • @cartoonjerk
    @cartoonjerk 7 years ago

    Never mind my previous comment; problem solved. But now I have a new one, and I'd be very happy if you could help me answer it! When I calculate my ham and spam frequencies, my ham counts are completely different from yours. They read: 1.373624e-09 for very, 4.226535e-11 for nasty, 2.113267e-11 for villa, 4.226535e-11 for beloved, and 2.113267e-11 for textoperator. Any way to fix this, or has the data changed since then?

    • @dataschool
      @dataschool  7 years ago

      The dataset hasn't changed. Are you sure all the code you wrote was identical to my code? You can check your code here: github.com/justmarkham/pycon-2016-tutorial/blob/master/tutorial_with_output.ipynb

  • @mrfarhadahmadzadegan
    @mrfarhadahmadzadegan 3 years ago

    Your video was great! I learned a lot. I just have one question: how does our model count the number of columns and rows of the sparse matrix?

  • @karthikudupa5475
    @karthikudupa5475 5 years ago

    Thanks a lot Kevin

  • @generalzeedot
    @generalzeedot 7 years ago

    Kev, has anyone ever told you that you remind them of Sheldon Cooper?
    Keep up the great work btw

    • @dataschool
      @dataschool  7 years ago +1

      Ha! I have heard that a few times recently :)
      Glad you like the videos!

  • @christopherteoh3094
    @christopherteoh3094 4 years ago

    Hi Kevin, great video content! I just have a question. At 33:23, where you mention the 5 interesting things that were observed, stop words are dropped and not included in the tokens list.
    However, during vect.fit(simple_train), the stop_words argument is set to None.
    Can I presume that there is a standardized set of stop words that CountVectorizer drops, and that the stop_words argument takes in user-specified stop words?

    • @christopherteoh3094
      @christopherteoh3094 4 years ago

      I got the answer towards the end of the video, i.e. the word was removed because of the token pattern, which excludes strings with fewer than 2 characters. Thanks!

  • @eddbiddle6604
    @eddbiddle6604 4 years ago

    Another fantastic video - thanks Kevin

  • @mohammadali8800
    @mohammadali8800 1 year ago

    Good job; I wish you more success!

  • @gauravmitra3683
    @gauravmitra3683 8 years ago

    Another of your fantastic videos.

  • @puneetja
    @puneetja 7 years ago

    Hi Kevin,
    Thanks for the wonderful tutorial. I just have a very basic question: we did image classification in the past using a neural network, where we used a few convolutional layers and an activation function. However, I see that here you did not use any convolutional layers or an activation function. Is this because you are using a Naive Bayes classifier rather than a neural network?
    Thanks in advance.

    • @dataschool
      @dataschool  7 years ago +2

      That's correct! Naive Bayes does not involve any layers or an activation function.

  • @_rsk_
    @_rsk_ 6 years ago

    Hello Kevin,
    I have progressively watched your videos, from pandas to scikit-learn to this video on ML with text. All have been brilliant and very nicely paced.
    Kudos on that, and I hope you continue with more videos (shout out for Jupyter Notebooks ;-) ).
    I have one question specific to the topic of this video.
    For text analytics, the recommendation is to create a vocabulary and document-term matrix from the training data using a vectorizer (i.e. instantiate a CountVectorizer and use fit_transform).
    Then use the fitted vocabulary to build a document-term matrix from the testing data (i.e. with the vectorizer fitted during training, perform a transform).
    If I use TfidfVectorizer and then TruncatedSVD as shown below, is the commented step 3 the right way?
    # Step 1: perform train/test split.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    # Step 2: create a TF-IDF matrix and perform SVD on it.
    tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
    tfidf_train = tfidf_vectorizer.fit_transform(X_train)
    svd = TruncatedSVD(n_components=200, random_state=42)
    X_train_svd = svd.fit_transform(tfidf_train)
    # Step 3: transforming the testing data ??
    # Is this the right way:
    # tfidf_test = tfidf_vectorizer.transform(X_test)
    # X_test_svd = svd.transform(tfidf_test)
    Thanks in advance.

    • @dataschool
      @dataschool  6 years ago

      Thanks for your very kind comments, I appreciate it!
      Regarding your question, I'm not really familiar with TruncatedSVD, so I'm not able to say. Good luck!

  • @SlesaAdhikari
    @SlesaAdhikari 6 years ago

    So very helpful. Thanks Kev!

  • @anujasilampur9211
    @anujasilampur9211 5 years ago

    In my case, the shapes of X_train and X_train_dtm are different, and I'm getting "ValueError: Found input variables with inconsistent numbers of samples: [25, 153]"
    at fit... please help.

    • @dataschool
      @dataschool  5 years ago

      It's hard for me to say what is going wrong... good luck!

  • @yuanxiang1369
    @yuanxiang1369 7 years ago

    That's a great tutorial. Just a quick question: if I were to apply SVM, random forests, or latent Dirichlet allocation instead of Naive Bayes, would the input data still be in document-term matrix form?

    • @dataschool
      @dataschool  7 years ago

      I'm not sure for LDA, but for SVM and Random Forests, yes, the input format would be the same.

  • @ShriSuperman
    @ShriSuperman 5 years ago

    This is an amazing video... you really are a great teacher. Can I get the whole course videos, please?

    • @dataschool
      @dataschool  5 years ago

      Thanks! The course is available here: www.dataschool.io/learn/

  • @im18already
    @im18already 6 years ago

    Hi. It was mentioned at 1:06 that X should be 1-dimensional. What if I have 2 sets/columns of text? The 2 columns have a certain relationship, so merging them into a single column is probably not the best way.

    • @dataschool
      @dataschool  6 years ago

      Great question! Sometimes, merging the text columns into the same column is the best solution. Other times, you should build separate feature matrices and merge them, either using FeatureUnion or SciPy.
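
      A minimal sketch of the second option, building separate matrices and merging them with SciPy (the dataframe and column names are hypothetical):

      import pandas as pd
      import scipy.sparse as sp
      from sklearn.feature_extraction.text import CountVectorizer
      df = pd.DataFrame({'col1': ['free-form request text'],
                         'col2': ['follow-up comment text']})
      vect1, vect2 = CountVectorizer(), CountVectorizer()
      dtm = sp.hstack([vect1.fit_transform(df['col1']),
                       vect2.fit_transform(df['col2'])], format='csr')  # one combined matrix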

  • @NBAchampionshouston
    @NBAchampionshouston 8 years ago

    Hi, thanks for the video! Do you know if it's possible to supply each article to CountVectorizer as a list of features already created (for example, noun phrases or verb-noun combinations), rather than the raw article from which CountVectorizer would usually extract n-grams? Thanks!

    • @dataschool
      @dataschool  8 years ago +1

      From the CountVectorizer documentation, it looks like you can define the vocabulary used by overriding the 'vocabulary' argument: scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
      However, it's not clear to me if that will work when using a vocabulary containing phrases rather than single words.
      Try it out, and let me know if you are able to get it to work!
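
      For what it's worth, a hedged sketch of how this might work: when the fixed vocabulary contains phrases, ngram_range has to be wide enough for the analyzer to generate them as candidate tokens.

      from sklearn.feature_extraction.text import CountVectorizer
      vect = CountVectorizer(vocabulary=['machine learning', 'text'], ngram_range=(1, 2))
      dtm = vect.fit_transform(['machine learning with text'])
      print(dtm.toarray())  # counts only the two supplied features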

  • @wowwwwwwwwwwwwwwwize
    @wowwwwwwwwwwwwwwwize 7 years ago

    Hi Kevin, that is a great video. I have one question: when dealing with a dataframe that has a large number of rows, each containing long texts, which vectorizer will be better: TfidfVectorizer, CountVectorizer, or HashingVectorizer? I applied TF-IDF, but it generates so many feature vectors that it later becomes difficult to append them to the original dataframe because of the large array size.

    • @dataschool
      @dataschool  7 years ago

      It's impossible to know in advance which vectorizer will work best, sometimes you just have to experiment!
      Once you have generated a document-term matrix, you should not put it back in pandas. It should remain a sparse array.
      Hope that helps!

  • @didierleprince6106
    @didierleprince6106 4 years ago +1

    Thank you 😊

  • @amosmunezero9958
    @amosmunezero9958 7 years ago

    Hi, does anyone know how we can extract and store the words that are thrown out during the transformation? Is there an easier way (a built-in function), other than writing Python regular expressions or string manipulation, to compare the words against the feature names?
    Thanks.

    • @dataschool
      @dataschool  7 years ago

      Great question! I don't know of a simple way to do this, but perhaps someone else here knows...

  • @rainerwahnsinn3262
    @rainerwahnsinn3262 7 years ago

    I'd like to jump in on the questions around 55:00 and ask:
    why don't we keep track of the order of the words in a document? The meaning of two documents containing the same words could be really different, for example "Call me 'Tom'." and "Tom, call me!". Right now those two documents look exactly the same to us when vectorized as in the lecture. I thought maybe we could create a higher-dimensional matrix, represent those word combinations as vectors in space, and then fit a model on that. Would this work?

    • @dataschool
      @dataschool  7 years ago

      Great question! We don't keep track of word order in order to simplify the problem, and because we don't believe that word order is useful enough to justify including it. (That would add more "noise" than "signal" to our model, reducing predictive accuracy.) That being said, you can include n-grams in the model, which preserves some amount of word order and can sometimes be helpful.
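
      For example, a quick sketch: with bigrams enabled, the two sentences from the question above no longer vectorize identically.

      from sklearn.feature_extraction.text import CountVectorizer
      vect = CountVectorizer(ngram_range=(1, 2))
      dtm = vect.fit_transform(['Call me Tom', 'Tom, call me!'])
      # the features now include bigrams like 'call me' and 'me tom' alongside the unigrams
      print(vect.get_feature_names_out())  # get_feature_names() on older scikit-learn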

  • @vishwasgarg9186
    @vishwasgarg9186 8 years ago

    Great videos, man... I have become your fan!

  • @md2704
    @md2704 4 years ago

    Thank you for all your helpful videos. I have a question related to vectorization:
    at 1:07:36, if we used the words from the test set to fit our model, we could obtain a document-term matrix where some terms have only zero entries. Would that have negative effects on our classifier?

    • @dataschool
      @dataschool  4 years ago

      Glad you like the videos! As for your question, I don't completely follow, sorry! I would just say that there is a right way to do it (fit_transform on training set and transform on testing set), and that will give you the most reliable prediction of how your model will perform on out of sample data. Hope that helps!

  • @laurafernandezbecerra8978

    Is there any tutorial on analysing system logs with ML? Thanks in advance!

  • @aykutcayir64
    @aykutcayir64 8 years ago

    This video is excellent; thanks for it! But there is a problem with the mobile version of the video: after the opening of the talk, I cannot hear the voice. Did you notice that before?

    • @dataschool
      @dataschool  8 years ago

      Glad you liked it! Yes, that audio problem affects some devices and browsers, especially mobile devices. It's caused by the audio encoding of the original recording. I tried to fix it, but didn't come up with any solutions. I'm sorry!

  • @itsbuzzz
    @itsbuzzz 7 years ago

    Hi Kevin! Thanks for that valuable presentation!
    Just a question...
    Is the following the right way to apply k-fold cross-validation to text data?
    X_train_dtm = vect.fit_transform(X_train)
    scores = cross_val_score(, X_train_dtm, y, cv=5)
    I am not totally sure if X_train_dtm and y are correct in the cross_val_score call above...
    Thanks again!

    • @itsbuzzz
      @itsbuzzz 7 years ago

      I just saw Andrew's comment... bit.ly/2mXdwZ9

    • @dataschool
      @dataschool  7 years ago

      Glad you liked the tutorial! Regarding your question, I actually cover this in detail in my online course: www.dataschool.io/learn/

  • @deepikadavuluri8474
    @deepikadavuluri8474 6 years ago

    Hi Kevin,
    It is a great lecture. Even though I am new to machine learning, I understood the basics of machine learning and logistic regression. I have a doubt: can we classify into more than two groups (ham, spam, and some other)?
    Thank you.

    • @dataschool
      @dataschool  6 years ago

      Great to hear! Regarding your question: yes, you can classify into more than two categories; it's called multi-class classification, and scikit-learn supports it. Hope that helps!

  • @vivekathilkar5873
    @vivekathilkar5873 7 years ago

    A great learning experience!

  • @sonalivv
    @sonalivv 7 years ago

    Can we use Naive Bayes to classify text into more than just 2 or 3 categories (potentially 10+ categories)?

    • @dataschool
      @dataschool  7 years ago

      Great question! The scikit-learn documentation says that "All scikit-learn classifiers are capable of multiclass classification": scikit-learn.org/stable/modules/multiclass.html
      So yes, that should work!

  • @gianglt2008
    @gianglt2008 8 years ago

    Thank you for the resource.
    I have a question:
    in real life, instantiating the CountVectorizer class can fail if the volume of input text is BIG (e.g. when I want to encode a large number of text files). Has that happened to you?

    • @dataschool
      @dataschool  8 years ago +1

      I haven't had that happen, but if it did, it should happen during the 'fit' stage rather than during the instantiation of the class. In any case, HashingVectorizer is designed to deal with very large vocabularies: scikit-learn.org/stable/modules/feature_extraction.html#vectorizing-a-large-text-corpus-with-the-hashing-trick
      Hope that helps!

    • @gianglt2008
      @gianglt2008 8 years ago

      Thank you very much. You are correct: the problem happens during the fitting stage. I will try HashingVectorizer.
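
      For reference, a minimal sketch of HashingVectorizer, which keeps memory bounded by hashing tokens into a fixed number of columns instead of storing a vocabulary:

      from sklearn.feature_extraction.text import HashingVectorizer
      vect = HashingVectorizer(n_features=2 ** 18)  # fixed output width, no vocabulary kept
      dtm = vect.transform(['first document', 'second document'])  # stateless: no fit needed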

  • @rock_feller
    @rock_feller 7 years ago

    Hi Kevin, I didn't quite catch why we should do the train-test split before vectorization. Could you help? Rockefeller from Cameroon.

    • @dataschool
      @dataschool  7 years ago +1

      It's a tricky concept! Basically, you want to simulate the real world, in which words will be seen during the testing phase that were not seen during the training phase. By splitting before vectorization, you accomplish this. Hope that helps!
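
      A condensed sketch of that pattern, using the current sklearn.model_selection module rather than the older sklearn.cross_validation shown in the video (the corpus here is a toy stand-in for the SMS data):

      from sklearn.model_selection import train_test_split
      from sklearn.feature_extraction.text import CountVectorizer
      X = ['free prize now', 'lunch tomorrow?', 'win cash fast', 'see you soon']
      y = ['spam', 'ham', 'spam', 'ham']
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
      vect = CountVectorizer()
      X_train_dtm = vect.fit_transform(X_train)  # learn the vocabulary from training data only
      X_test_dtm = vect.transform(X_test)        # test-only words are ignored, as in the real world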

  • @skinheadworkingclass
    @skinheadworkingclass 7 years ago

    Hi Kevin, excellent presentation!
    I would like to ask you a question: how can "tokens_ratio" improve the accuracy score of the Naive Bayes model?

    • @dataschool
      @dataschool  7 years ago +1

      Glad you liked it! tokens_ratio was just a way to understand the model - it won't actually help the model to become better.

  • @priyankap8627
    @priyankap8627 5 years ago

    Hey, you used 2 classes for classification, right? What if I need more than 2 classes, e.g. contempt, depression, anger, joy, and other such emotions? Do I need to change any of the code here, or is providing a dataset with multiple classes enough?
    And I have one more doubt: once the model is built, how can I actually find out which class a new text document supplied as input belongs to? E.g. whether the new document is ham or spam?

    • @dataschool
      @dataschool  5 years ago

      1. Most of the time, you don't need to modify your scikit-learn code for multi-class classification.
      2. Use the predict method.
      Hope that helps! You might be interested in my course: www.dataschool.io/learn/
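
      A quick sketch of point 2, with a toy corpus standing in for the real training data:

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.naive_bayes import MultinomialNB
      train_text = ['free prize winner', 'are we meeting today']
      train_labels = ['spam', 'ham']
      vect = CountVectorizer()
      nb = MultinomialNB().fit(vect.fit_transform(train_text), train_labels)
      print(nb.predict(vect.transform(['WINNER! Claim your free prize'])))  # -> ['spam']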

    • @priyankap8627
      @priyankap8627 5 years ago

      @@dataschool Thanks a lot. This lecture was very helpful for me. I love the way you teach. Great teacher :)

  • @rayuduyarlagadda3473
    @rayuduyarlagadda3473 6 years ago

    Awesome video! Would you please make videos on performance metrics, featurization, and feature engineering?

    • @dataschool
      @dataschool  6 years ago

      Thanks for your suggestions!

    • @dataschool
      @dataschool  5 years ago

      I wrote a blog post about feature engineering: www.dataschool.io/introduction-to-feature-engineering/

  • @FedericaLuciaVinella
    @FedericaLuciaVinella 6 years ago

    Watching this at 1.5x speed, and it's still understandable.

  • @chrisdemchalk3491
    @chrisdemchalk3491 7 years ago

    Any recommendation for a multi-label classification example where there will be a high number (>200) of potential classes?

    • @dataschool
      @dataschool  7 years ago

      I recommend reducing the complexity of the problem by reducing the number of classes.

  • @23232323rdurian
    @23232323rdurian 7 years ago

    Thanks for the great tutorial. However, several times I can't see the rightmost part of an instruction, so I can't type it, execute it, and follow the Python action. Very frustrating!
    For example, at 1:06:25: from sklearn.cross_validation import train_test_split,
    but then I can't see the rest of the instruction, so I can't follow the next several minutes of your tutorial in Python.
    Anyhow, I appreciate your tutorial... thank you!

    • @dataschool
      @dataschool  7 years ago

      Sorry to hear! However, all of the code is available in the GitHub repository: github.com/justmarkham/pycon-2016-tutorial
      Hope that helps!

  • @cartoonjerk
    @cartoonjerk 7 years ago

    Once again, thanks a lot for the video; I've been learning a lot from this. Quick question though: can you give the full URL for the one you provided around 1:00:00? I tried both methods and neither worked! Thanks!

    • @dataschool
      @dataschool  7 years ago

      Here's the URL for the SMS dataset: raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv
      And you can find all of the code shown in the video here: github.com/justmarkham/pycon-2016-tutorial
      Hope that helps!

  • @musiclover21187
    @musiclover21187 7 years ago

    I wish you had a text mining course in Python :(

    • @dataschool
      @dataschool  7 years ago +1

      I offer an online course called "Machine Learning with Text in Python" - check it out! www.dataschool.io/learn/

  • @VijayaragavanS
    @VijayaragavanS 6 years ago

    Thanks for the detailed information. Is it possible to use multidimensional data?

    • @dataschool
      @dataschool  6 years ago

      I'm sorry, I don't understand your question. Could you clarify? Thanks!

  • @ghanemimehdi1063
    @ghanemimehdi1063 8 years ago

    Hi,
    Thanks for sharing; it's very useful!
    I have a little question: for labelization, I use "preprocessing.LabelEncoder()". Is that OK?

    • @dataschool
      @dataschool  8 years ago +2

      Sure, LabelEncoder is useful as long as you are encoding labels (also known as "response values" or "target values") or binary categorical features. If you are using it to encode categorical features with more than 2 levels, you'll want to think carefully about whether it's an appropriate encoding strategy.
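
      A one-line illustration of the response-vector case:

      from sklearn.preprocessing import LabelEncoder
      le = LabelEncoder()
      y = le.fit_transform(['ham', 'spam', 'ham'])  # array([0, 1, 0])
      print(le.classes_)                            # ['ham' 'spam']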

  • @naveenv3097
    @naveenv3097 7 years ago

    You said 3 documents as an explanation for the 3x6 sparse matrix (around 35:10)... where did we give the 3 documents?

    • @dataschool
      @dataschool  7 years ago

      The 3 documents are the 3 elements of the 'simple_train' list, which we passed to the vectorizer during the 'fit' step.
      Hope that helps!

    • @naveenv3097
      @naveenv3097 7 years ago

      Thank you

  • @_overide
    @_overide 7 years ago

    Great tutorial; I really enjoyed it and loved the way you explain things :)
    I have a little question. I'm working with product reviews, so using CountVectorizer I have created a binary DTM sparse matrix for each of my reviews and created a feature vector. I have approx 200k+ reviews and have to store the same for each of them. I have read about the "feature vector hashing" technique. How do I use that in Python, so that I can keep only a hash of the DTM rather than the actual DTM? I have no idea how to do that or how it actually works. It would be great if you could help or suggest a good tutorial.
    Thanks again for this wonderful tutorial!

    • @dataschool
      @dataschool  7 years ago +2

      Thanks for your kind words! This section from the scikit-learn documentation on feature hashing might be helpful to you: scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing
      Good luck!

  • @mohinik4473
    @mohinik4473 5 years ago

    I need to test a Pega system build along with Python for machine learning. I am an automation tester but need to do AI testing. Can you please guide me on how to go about it?

    • @dataschool
      @dataschool  5 years ago

      I won't be able to help, I'm sorry!

  • @andrewhintermeier9675
    @andrewhintermeier9675 8 years ago

    Is it possible to use k-fold cross-validation instead of train/test split with this method?

    • @dataschool
      @dataschool  8 years ago

      Yes, you could use cross-validation instead. However, to do cross-validation properly, you have to also use a pipeline so that the vectorization takes place during cross-validation, rather than before it. Hope that helps!

    • @andrewhintermeier9675
      @andrewhintermeier9675 7 years ago

      Thanks! I've never used pipelines before, but I've seen them used in some example code; I'll have to look into it.

    • @dataschool
      @dataschool  7 years ago +1

      Here's a nice example that includes a pipeline: radimrehurek.com/data_science_python/
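
      In the same spirit, a minimal sketch of the pipeline approach, with toy data standing in for the SMS corpus:

      from sklearn.pipeline import make_pipeline
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.model_selection import cross_val_score
      X = ['win a free prize', 'meeting at noon', 'free cash now',
           'lunch tomorrow', 'claim your prize', 'see you tonight']
      y = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']
      pipe = make_pipeline(CountVectorizer(), MultinomialNB())
      # the vectorizer is re-fit on each training fold, so no test-fold words leak in
      print(cross_val_score(pipe, X, y, cv=3, scoring='accuracy'))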

    • @andrewhintermeier9675
      @andrewhintermeier9675 7 years ago

      Thank you so much. Your series is honestly the best I've found for learning ML, it's been so helpful for me :D

    • @dataschool
      @dataschool  7 years ago

      You're very welcome, and thanks for your kind words! :)

  • @eugenydolgy1060
    @eugenydolgy1060 5 years ago

    Great video!

  • @macpc4612
    @macpc4612 7 years ago

    Is it possible to calculate spamminess and hamminess irrespective of the classifier used?

    • @dataschool
      @dataschool  7 years ago +1

      Great question! You could use a similar approach with other classification models, though the code would be a bit more complicated because you wouldn't have access to the feature_count_ and class_count_ attributes of the Naive Bayes model.
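
      For the Naive Bayes case, a hedged sketch of that calculation along the lines of the tutorial (a toy corpus stands in for the fitted objects from the video):

      import pandas as pd
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.naive_bayes import MultinomialNB
      X = ['free prize now', 'lunch at noon', 'free cash prize', 'see you at noon']
      y = [1, 0, 1, 0]  # 1 = spam, 0 = ham
      vect = CountVectorizer()
      nb = MultinomialNB().fit(vect.fit_transform(X), y)
      tokens = pd.DataFrame({'token': vect.get_feature_names_out(),
                             'ham': nb.feature_count_[0, :],
                             'spam': nb.feature_count_[1, :]})
      tokens['ham'] = (tokens['ham'] + 1) / nb.class_count_[0]   # smooth, then normalize
      tokens['spam'] = (tokens['spam'] + 1) / nb.class_count_[1]
      tokens['spam_ratio'] = tokens['spam'] / tokens['ham']      # 'spamminess' of each token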

  • @ash_engineering
    @ash_engineering 5 years ago

    Hey Kevin, could you please make a video on machine learning pipelines?

    • @dataschool
      @dataschool  4 years ago

      I cover pipeline in this video: czcams.com/video/irHhDMbw3xo/video.html

  • @pankajnayak8388
    @pankajnayak8388 7 years ago

    X_train_dtm = vect.fit_transform(X_train)
    X_train_dtm.shape
    AttributeError: 'numpy.int64' object has no attribute 'lower'

    • @dataschool
      @dataschool  7 years ago

      I'm not able to evaluate the cause of this error without knowing what steps took place before this line of code, and with what dataset. Good luck!

  • @navkirankaur671
    @navkirankaur671 7 years ago

    "ValueError: multiclass format is not supported"
    I am getting this error when I run the AUC score.

    • @dataschool
      @dataschool  7 years ago

      Are you using the same dataset as me, or your own dataset?

  • @jjunior1283
    @jjunior1283 6 years ago

    Thanks a lot for the course; very powerful indeed. Is there a way to create a dataframe with, say, the top 20 features? Thanks again!

    • @dataschool
      @dataschool  6 years ago

      Glad you liked it! Regarding your question, is this what you are looking for?
      df = tokens.head(20).copy()

    • @jjunior1283
      @jjunior1283 6 years ago

      Thanks for the suggestion. I figured that if I explained the problem better, I'd get better help. I'm trying to predict whether an item will fail or not. I have a dataset with over 30 variables, one of which I'm trying to vectorize. Doing so blows that one variable up to over 7,000 features, and because of this I run out of memory when merging them with the dataset containing the 30 other variables. Also, due to the dataset being unbalanced, the models don't train well on the two datasets independently (similar results, both about as good as random). I recently created an account on AWS and bought a powerful instance; I was able to merge the two, and it still didn't train well. My goal is to use, say, the top 20 features and merge them with the 30 other variables for training. I used dtm = fit_transform() for that one variable. Is there a way to limit the number of features to an arbitrary number, say 20, i.e. the ones with the highest TF-IDF scores? Or can I get them manually? Sorry for the length, and thanks for the help.

    • @dataschool
      @dataschool  6 years ago

      The vectorization is creating a sparse matrix, which is quite memory efficient. It sounds like the problem is that you are merging a sparse matrix with a dense matrix, which forces the sparse matrix to become dense, which would definitely create memory problems.
      One solution is to train models on the datasets separately and then ensemble them. It sounds like you might be doing this already, but aren't getting good results? If so, I don't think it's because of class imbalance.
      I think that using the max_features parameter of CountVectorizer will accomplish what you are trying to do, though I don't think it's necessarily a good strategy. You will lose too much valuable data.
      My recommended strategy is not super simple, so I can't describe it briefly, but it's covered in module 5 of my online course: www.dataschool.io/learn/
      Hope that helps!

    • @jjunior1283
      @jjunior1283 6 years ago

      Data School, thanks a lot. I will definitely check out that recommended material and keep playing with it.

  • @jundou7858
    @jundou7858 6 years ago

    Two questions about the bag of words have obsessed me for a while. First: my source file has 2 columns; one is the email content (in text format), and the other is the country name (3 different countries) from which the email was sent. I want to label whether the email is spam or not, and the assumption is that the country an email is sent from also matters. So, besides the bag of words, I want to add a country feature; is there a way to implement that in sklearn? The other question: besides the bag of words, what if I also want to consider the position of the words? For instance, if a word appears in the first sentence I want to lower its weight, and if it appears in the last sentence I want to increase its weight. Is there a way to implement that in sklearn? Thanks.

    • @dataschool
      @dataschool  6 years ago

      Great questions! 1. Use FeatureUnion, or combine the two columns together and use CountVectorizer on the combined column. 2. You would write custom code to do this.

  • @ichtube
    @ichtube 7 years ago

    This is just a minor point, but how come y is 150 by nothing when it's a vector?

    • @dataschool
      @dataschool  7 years ago

      When I say that it's "150 by nothing", that really just means that it's a one-dimensional object in which the magnitude of the first dimension is 150, and there is no second dimension. That is distinct from a two-dimensional object of 150 by 1. Does that help?
      If I misunderstood your question, please let me know!
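
      A tiny illustration of the distinction:

      import numpy as np
      y = np.zeros(150)
      print(y.shape)                 # (150,)   one-dimensional: '150 by nothing'
      print(y.reshape(-1, 1).shape)  # (150, 1) two-dimensional column vector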