Mean Shift with Titanic Dataset – Practical Machine Learning Tutorial with Python p.40

We continue the topic of clustering and unsupervised machine learning with Mean Shift, this time applying it to our Titanic dataset.

There is some degree of randomness here, so your results may not match mine exactly. If they don't, re-running the program will usually give you something comparable.

We're going to take a look at the Titanic dataset via clustering with Mean Shift. What we're interested to know is whether Mean Shift will automatically separate passengers into groups. If so, it will be interesting to inspect the groups it creates. The first obvious curiosity is the survival rate of each group, but we will also poke into the attributes of these groups to see if we can understand why the Mean Shift algorithm decided on those particular groups.
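
To make that concrete, here is a minimal sketch of the workflow (not the exact code from the video): load titanic.xls, convert the text columns to numbers with a simplified stand-in for the preprocessing helper used earlier in the series, cluster with Mean Shift, copy the labels back onto an untouched copy of the dataframe, and print the survival rate of each group. The file path and the handle_non_numerical_data helper here are assumptions for illustration.

    import numpy as np
    import pandas as pd
    from sklearn import preprocessing
    from sklearn.cluster import MeanShift

    # Load the data (path assumed; point this at wherever you saved titanic.xls)
    df = pd.read_excel('titanic.xls')
    original_df = pd.DataFrame.copy(df)

    # Drop columns that are pure identifiers or leak the outcome
    df.drop(['body', 'name'], axis=1, inplace=True)
    df.fillna(0, inplace=True)

    def handle_non_numerical_data(df):
        # Simplified stand-in for the helper built earlier in the series:
        # map every unique value in a text column to an integer id.
        for column in df.columns:
            if df[column].dtype not in (np.int64, np.float64):
                ids = {val: i for i, val in enumerate(df[column].unique())}
                df[column] = df[column].map(ids)
        return df

    df = handle_non_numerical_data(df)

    X = preprocessing.scale(np.array(df.drop(['survived'], axis=1).astype(float)))

    clf = MeanShift()
    clf.fit(X)
    labels = clf.labels_

    # Attach each passenger's cluster label to the original (readable) dataframe
    original_df['cluster_group'] = labels

    # Survival rate per discovered group
    survival_rates = {}
    for i in range(len(np.unique(labels))):
        temp_df = original_df[original_df['cluster_group'] == i]
        survival_rates[i] = temp_df['survived'].mean()

    print(survival_rates)

Because Mean Shift estimates the number of clusters itself, the number of groups it finds (and therefore these survival rates) can vary from run to run, which is the randomness mentioned above.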

7 comments

  1. jonatan isakas says:

    My man, you're a true legend. Been watching your videos nonstop for the
    last couple of weeks, all your machine learning and Quantopian episodes.
    You've got yourself another subscriber on your website. You deserve all the
    credit you can get! Keep up the awesome job!

  2. Gian Carlo Martinelli says:

    Instead of the pattern "for i in range(len(x))" you could use the
    enumerate() function. In this case:

    for i, label in enumerate(labels):
        original_df['cluster_group'].iloc[i] = label

  3. André Stephano says:

    If you could upload a simple gist after you record the video, that would be
    really nice.

  4. Jonathan Fraine says:

    Hello Sentdex,

    Excellent work thus far. I look forward to the NNs coming up!

    On this example, I am either awesome or broken, and I lean towards the
    latter.

    I re-ran the example a dozen times and consistently got 4 clusters (every
    time!). I started with your code from

    and added only the "original_df" and "survival_rate(s)" sections, nothing
    else. One of my classes has a 100% survival rate; I need to learn that
    trick for my next boat ride.

    My clusters (cluster id – pclass labels – survival rate) are as follows:

    0 – [1 2 3] – 36.9%
    1 – [1] – 100%
    2 – [1] – 73.9%
    3 – [3] – 10.0%

    Have you or others experienced this? Note that the mean of my pclass labels
    is 2.29 for every cluster.

    I'm worried that I screwed something up in the pre-processing. I copied the
    titanic.xls file from the link provided in the code, which I took from the
    host website:

    Thank you!

    Edit: I took the code straight off of

    (all the way to the end) and changed nothing, but got 4 clusters.
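
    (For reference, a table like the one above can be reproduced with something
    along these lines; this is only a sketch, and it assumes original_df from
    the tutorial already carries the MeanShift labels in a 'cluster_group'
    column:)

    import numpy as np

    # Assumes original_df with a 'cluster_group' column holding MeanShift labels
    for i in range(len(original_df['cluster_group'].unique())):
        temp_df = original_df[original_df['cluster_group'] == i]
        pclasses = np.sort(temp_df['pclass'].unique())   # which classes ended up in this group
        rate = temp_df['survived'].mean()                # fraction of the group that survived
        print(i, pclasses, '{:.1%}'.format(rate))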

  5. loderunnr says:

    Is there a way for K-Means or MeanShift to find which features matter the
    most in the clustering? Could we find, using a clustering algorithm and
    without randomly digging as we did in this tutorial, whether "class", "sex"
    or "fare" matters more than the other features?
