Lecture 16 | Machine Learning (Stanford)

Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng discusses the topic of reinforcement learning, focusing particularly on MDPs, value functions, and policy and value iteration.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

12 comments

  1. stupidGooglePlus says:

    In the computation you start at 57, to determine why moving west is better
    than moving north from the (3,1) state, it seems that you disregarded or
    forgot the discount factor without mentioning it. I do think that in this
    case it suffices to look at undiscounted values to determine the optimal
    action, because there are no intermediate rewards. I find this just a bit
    misleading, but I also wanted to share my thoughts. Great lecture (so
    far)! (A small value iteration sketch illustrating this point appears
    after the comments.)

  2. Sheldon Kalman says:

    I have a question on this, and nowhere to ask it. If I change the charge
    from -0.2 to -0.1 or -0.9, would that change my result?

  3. GriefGrumble TheMauler says:

    I suppose that, generally, yes, since it will change the expected values
    and therefore the chosen actions. Sorry if this is too late.

  4. GriefGrumble TheMauler says:

    The avalanche of questions on MDPs at the beginning of the next lecture
    was a relief; I thought it was just me. Recursive algorithms should never
    be skipped over, as they are in this lecture (and numerous others). An
    extensive search for “value iteration example” did eventually help me, and
    it might help others as well.

  5. cozos says:

    Isn’t there a prohibitive number of states in the helicopter learning
    algorithm? I imagine there is a very large number of states, since
    positions and orientations are basically continuous, so how did they
    reconcile this? (A sketch of one way to handle continuous states appears
    after the comments.)

Comments are closed.
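
Regarding the discount factor question in the first comment: below is a minimal value iteration sketch for the 4x3 gridworld used in the lecture. The layout (+1 terminal at (4,3), -1 terminal at (4,2), wall at (2,2)), the small negative per-step reward, and the 0.8/0.1/0.1 transition noise are assumptions about the example rather than values taken from the slides; varying GAMMA and STEP_REWARD shows how the discount factor and the per-step cost affect which action wins at state (3,1).

    # Minimal value iteration sketch for a 4x3 gridworld (assumed layout:
    # +1 terminal at (4,3), -1 terminal at (4,2), wall at (2,2); states are
    # indexed (column, row) with (1,1) at the bottom-left).
    GAMMA = 0.99          # discount factor; set to 1.0 for the undiscounted case
    STEP_REWARD = -0.02   # per-step reward for non-terminal states (assumed value)

    ROWS, COLS = 3, 4
    WALL = (2, 2)
    TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
    ACTIONS = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
    # With probability 0.8 the intended move happens; with probability 0.1 each,
    # the agent slips to one of the two perpendicular directions.
    PERP = {'N': ('E', 'W'), 'S': ('E', 'W'), 'E': ('N', 'S'), 'W': ('N', 'S')}

    states = [(c, r) for c in range(1, COLS + 1) for r in range(1, ROWS + 1)
              if (c, r) != WALL]

    def move(s, a):
        """Deterministic successor of action a from state s (bumping into a wall or edge stays put)."""
        c, r = s
        dc, dr = ACTIONS[a]
        nxt = (c + dc, r + dr)
        if nxt == WALL or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
            return s
        return nxt

    def q_value(V, s, a):
        """Expected discounted return of taking action a in non-terminal state s."""
        outcomes = [(0.8, move(s, a))] + [(0.1, move(s, p)) for p in PERP[a]]
        return sum(p * (STEP_REWARD + GAMMA * V[s2]) for p, s2 in outcomes)

    # Value iteration: repeatedly apply the Bellman optimality backup.
    V = {s: TERMINALS.get(s, 0.0) for s in states}
    for _ in range(200):
        V = {s: V[s] if s in TERMINALS
             else max(q_value(V, s, a) for a in ACTIONS) for s in states}

    # Compare the actions at (3,1); in the lecture, west (the long way around)
    # comes out ahead of north, which passes next to the -1 square.
    for a in sorted(ACTIONS):
        print(f"Q((3,1), {a}) = {q_value(V, (3,1), a):+.4f}")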
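
On the last comment's question about continuous states: one standard remedy, discussed when the course turns to continuous-state MDPs, is to discretize each continuous dimension into a finite number of bins so that tabular value iteration still applies; fitted value iteration with a function approximator is the alternative that scales to high-dimensional states. The sketch below is illustrative only: the ranges, bin counts, and the (x, y, heading) state are hypothetical, not taken from the helicopter project.

    import numpy as np

    # Illustrative only: map a continuous (x, y, heading) state onto a finite
    # grid index so that a tabular value function can be used. The ranges and
    # bin counts below are hypothetical, not from the helicopter project.
    X_BINS, Y_BINS, H_BINS = 20, 20, 16
    X_RANGE, Y_RANGE = (-10.0, 10.0), (-10.0, 10.0)   # assumed position limits

    def discretize(x, y, heading):
        """Return a single integer index for the grid cell containing (x, y, heading)."""
        xi = min(int((x - X_RANGE[0]) / (X_RANGE[1] - X_RANGE[0]) * X_BINS), X_BINS - 1)
        yi = min(int((y - Y_RANGE[0]) / (Y_RANGE[1] - Y_RANGE[0]) * Y_BINS), Y_BINS - 1)
        hi = int((heading % (2 * np.pi)) / (2 * np.pi) * H_BINS) % H_BINS
        xi, yi = max(xi, 0), max(yi, 0)               # clamp points outside the range
        return (xi * Y_BINS + yi) * H_BINS + hi

    # A tabular value function over the discretized states.
    V = np.zeros(X_BINS * Y_BINS * H_BINS)
    print(V.size, discretize(1.3, -4.2, 0.7))

Even this coarse three-dimensional grid already has 6,400 cells, and the table grows exponentially with each added dimension, which is why a full state such as a helicopter's positions, orientations, and velocities is handled with function approximation rather than a full discretization.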