Machine Learning with Text in scikit-learn (PyCon 2016)

Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we’ll build and evaluate predictive models from real-world text using scikit-learn. (Presented at PyCon on May 28, 2016.)

  1. ElementHTTP says:

    How did you develop such amazing methodological/analytical skills? (A great
    teacher, philosopher, book, or something else?)
    I apologize if these questions are a bit personal.

  2. Carlos Arturo Pimentel Trujillo says:

    Thanks for sharing, Kevin. Apart from the obvious, I am also curious about
    how you use Evernote in your daily lecture tasks. Maybe that could be
    another great video to follow up on.

  3. Dmitrii Beliakov says:

    Great material! I’ve been working on my own machine learning model before I
    learned about sklearn and now I have discovered feature extraction from
    text, which I coded myself 🙂
    Happy and sad situation. Happy to find a useful tool. Sad that I spent so
    much time on coding my own feature extractor 🙂

    Thanks for the video! Great job!

  4. Ghanemi mehdi says:

    Thanks for sharing, it’s very useful!

    I have a quick question: for encoding the labels I use
    “preprocessing.LabelEncoder()”. Is that OK?

  5. dualphase says:

    Video Request: Random Forests, Gradient Boosting etc: I see they are very
    popular in Kaggle. Also Introduction to Neural Networks/Deep Learning in
    Python. Thank you so much :)

  6. Aykut Çayır says:

    This video is excellent. Thanks for the video, but there is a problem with
    the mobile version. After the opening talk of the video, I cannot hear the
    audio. Did you notice that before?

  7. Tsering Paljor says:

    Hands down the best machine learning presentation I’ve seen thus far.
    Definitely looking forward to enrolling in your course once I’m done with
    your other free intro material. I think what sold me is how you’ve focused
    ~3 hours on a specific ML approach (supervised learning) to a common domain
    (text analysis). Other ML intros try to fit
    classification/regression/clustering all into 3 hours, which becomes too
    superficial a treatment. Anyway, bravo and keep up the great work!

  8. Jagadeesh Gajula says:

    The best tutorial I have ever watched! Kevin, you have mastered both the
    art of machine learning and teaching :)

  9. Casey Lickfold says:

    Hi, thanks for the video! Do you know if it’s possible to supply each
    article to CountVectorizer as a list of features already created (for
    example noun phrases or verb-noun combinations) rather than the raw article
    which CountVectorizer would usually then extract n-grams from? Thanks!

  10. Zank Bennett says:

    Great video. The problem with the audio is that the channels are the
    inverse of each other, so on mono devices where the L and R channels are
    summed together, they completely nullify the output signal. I don’t know of
    a work-around except to listen using a 2-channel system.

  11. Nathan French says:

    One small suggestion if you ever do one of these lectures again –
    because your audience is not miked, I can’t hear their
    questions/comments. As a result your responses are sometimes difficult to
    follow without knowing what they said. If in the future you do a
    lecture that you know will be posted on YouTube, you might consider
    starting your answers with a brief summary of what they just asked. But
    otherwise this is a great lecture, and a great addition to your existing
    body of work on this channel. I come here weekly for professional
    development and frequently recommend this channel to colleagues.

  12. Giang Lam Tung says:

    Thank you for the resource.
    I have a question.
    In real life, the initiation of class CountVectorizer can fail if the
    volume of input text is BIG (e.g. I want to encode a big number of text
    files). Has this happened to you?

  13. Neha Gupta says:

    You are indeed a “GURU” who can train and share knowledge in the true
    sense. I’m a non-technical person but learning Python and scikit-learn for
    my research, and this video has taken my understanding to a higher level in
    just 3 hours. THANK YOU VERY MUCH Kevin!!! Can you please recommend some
    links where I can learn more about short-text sentiment analysis using
    machine learning in Python, especially the feature engineering aspect, like
    using POS tags and word embeddings as features? Thanks again.

  14. Andrew Hintermeier says:

    Is it possible to use K-fold cross-validation instead of train/test split
    with this method?

  15. Lingobol :) says:

    Wonderful set of videos. I have started my ML journey with these videos.
    Now gonna go deeper and practise more and more.
    Thanks Kevin for the best possible head start.

    Your Fan,
    A beginner Data Scientist.
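
Two of the questions above (comments 9 and 14) can be illustrated with one short sketch. It is not taken from the tutorial itself: the toy documents, labels, and the `identity` helper are hypothetical, and `MultinomialNB` is just one reasonable classifier choice. It assumes a recent scikit-learn where `CountVectorizer` accepts a callable `analyzer`. Passing a callable makes the vectorizer count whatever features each document already contains, and wrapping the vectorizer and model in a pipeline lets `cross_val_score` refit the vocabulary on each training fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def identity(features):
    """Hypothetical helper: return the pre-built feature list unchanged."""
    return features


# Hypothetical toy data: each document is already a list of features
# (for example noun phrases), not a raw string.
docs = [
    ["free prize", "call now"],
    ["win cash", "call now"],
    ["free prize", "win cash"],
    ["claim prize", "call now"],
    ["lunch today", "see you"],
    ["meeting notes", "see you"],
    ["lunch today", "meeting notes"],
    ["project update", "see you"],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# A callable analyzer replaces the built-in preprocessing and tokenizing,
# so every list element becomes one vocabulary entry.
vect = CountVectorizer(analyzer=identity)
dtm = vect.fit_transform(docs)
print(sorted(vect.vocabulary_))

# Cross-validation in place of a single train/test split. The pipeline
# ensures the vocabulary is learned only from each fold's training portion.
pipe = make_pipeline(CountVectorizer(analyzer=identity), MultinomialNB())
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
scores = cross_val_score(pipe, docs, labels, cv=cv, scoring="accuracy")
print(scores.mean())
```

With only eight toy documents the accuracy itself means little; the point is the shape of the workflow.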
