Lecture 2 | Machine Learning (Stanford)

Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng lectures on linear regression, gradient descent, and normal equations and discusses how they relate to machine learning.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

Complete Playlist for the Course:

CS 229 Course Website:

Stanford University:

Stanford University Channel on YouTube:


  1. Tùng Nguyễn-Trọng says:

    I took a nap during the lecture, just like a little pupil on his first day
    at school
    gonna call my long lost friends, thanks!

  2. 田野 says:

    Question: at 34:23, for a single training example, we have
    adjustment to the jth Theta = -alpha * (estimation error) * Xj

    For example, say we have only one Theta and one X, where Theta = unit
    price per sq ft and X = the number of sq ft.

    I don’t understand why a larger Xj should lead to a larger Theta adjustment.

    For example, take 2 cases where the estimation error is 10,000 dollars in
    both. In the first case Xj = 500 sq ft; in the second case Xj = 5000 sq
    ft. Then the second case feeds back a 10x larger adjustment to the unit
    price. But why?

    In the first case, you tell the machine: hey, you missed by 10,000 dollars,
    and given that the apartment has 500 sq ft, next time reduce by 20 dollars
    per sq ft. This makes sense.

    Then in the second case, you tell the machine: hey, you missed by 10,000
    dollars, and given that the house has 5000 sq ft, next time reduce by 200
    dollars per sq ft. That’s weird.

    Thanks folks
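
A minimal sketch of the single-feature case in the question above, assuming the hypothesis h(x) = theta * x and the per-example cost 0.5 * (h(x) - y)^2. The starting price of 220 dollars per sq ft and the step sizes are illustrative assumptions, not values from the lecture. It suggests one way to read the rule: (error * x) is the slope of the cost with respect to theta, not the correction itself. A larger x makes the cost steeper in theta, so the raw slope is 10x larger, while the change in theta that would exactly cancel the miss is error / x, which is smaller for the bigger house.

```python
def cost_slope(theta, x, y):
    """d/dtheta of 0.5 * (theta * x - y) ** 2 for one training example."""
    return (theta * x - y) * x


def lms_step(theta, x, y, alpha):
    """One LMS update: theta := theta - alpha * (h(x) - y) * x."""
    return theta - alpha * cost_slope(theta, x, y)


# Hypothetical numbers: the current unit price overestimates both homes
# by exactly 10,000 dollars.
theta = 220.0
small = (500.0, 220.0 * 500.0 - 10000.0)    # (sq ft, true price)
large = (5000.0, 220.0 * 5000.0 - 10000.0)

for x, y in (small, large):
    slope = cost_slope(theta, x, y)
    exact_fix = (theta * x - y) / x  # change in theta that zeroes the error
    # With alpha = 1 / x**2 the LMS step equals the exact fix; this choice
    # of alpha is only for illustration.
    stepped = lms_step(theta, x, y, alpha=1.0 / x**2)
    print(f"x={x:.0f}  slope={slope:.0f}  exact fix={exact_fix:.0f}  "
          f"theta after step={stepped:.2f}")
```

In practice alpha is one small constant shared by all examples, so features with large raw values (such as square footage) produce large updates. That is one reason features are often rescaled before running gradient descent.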

  3. Jordan Shackelford says:

    I can’t keep up. I want to learn this but I have no experience with the
    math he’s using. Calculus, right?

  4. Stephen Douglas says:

    To any data scientists, how important is it to be able to derive learning
    algorithms? Is it enough to be able to use Matplotlib without knowing the

  5. 박종빈 says:

    Would anybody explain to me why ∇tr(ABA’C) = CAB + C’AB’?
    Unhelpful Note:
    ?= ∇tr(A * (BA’C)) + ∇tr(A’ * (CAB))
    ?= (BA’C)’ + (CAB)’’
    = CAB + C’AB’
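
One way to see it, sketched here with the product rule: A appears twice in tr(ABA^T C), so differentiate with respect to each occurrence while holding the other fixed, then add the two results. The names M and N are introduced only for this sketch. The facts used are the cyclic property tr(ABA^T C) = tr(A^T CAB), plus ∇_A tr(AM) = M^T and ∇_A tr(A^T N) = N for matrices M and N that do not depend on A.

```latex
\begin{aligned}
\nabla_A \operatorname{tr}(A B A^{T} C)
  &= \underbrace{\nabla_A \operatorname{tr}(A M)}_{M = B A^{T} C \text{ held fixed}}
   + \underbrace{\nabla_A \operatorname{tr}(A^{T} N)}_{N = C A B \text{ held fixed}} \\
  &= M^{T} + N \\
  &= (B A^{T} C)^{T} + C A B \\
  &= C^{T} A B^{T} + C A B .
\end{aligned}
```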

  6. blah deBlah says:

    The algorithm for identifying the speaker as a college-educated millennial
    involves maximizing the occurrence of:
    x3=“It turns out that…”

  7. Find 'N' Frag says:

    i don’t know if it’s me or the teacher, but i can’t stop hearing “Chinese
    example” instead of “training example”

  8. Samuel Ferrer says:

    If the rest of the lectures are based on these operators … then I will
    hang out till the very end … elegant!!

  9. Andrey Kholkin says:

    I hate that this guy is just speed running through all the math equations.
    Where are the real life examples, the industry experience, the fun in
    This is why everyone just sleeps in class, no vision, no inspiration, just
    pointless mathematical equations.
    I mean, is this supposed to be boring? I thought a better ranked university
    would have a higher teaching standard, maybe I was wrong about that.

  10. Unique and Hilarious Username says:

    I’m a high school junior and I didn’t know what a partial derivative was, so
    I walked into my AP Calc class today, asked the teacher, and was told to
    never speak of it again. Apparently my teacher has repressed nightmares of
    it in college haha. I looked it up. Seems pretty straightforward, I think I
    get it now.

  11. Chris Walsh says:

    Just wondering if you could encode the landscape using Fourier transforms
    and then use that multi-level representation with a slightly modified
    algorithm to get a faster / more accurate result?

  12. Kng Schltz says:

    Am I right to assume that the gradient is actually oriented in the
    direction of steepest ASCENT?
    Wikipedia says so too, so I assume we should use the gradient’s direction
    multiplied by -1 for the stated example, contrary to what is mentioned
    in the video.
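
The reading in this comment matches the standard definition: the gradient points toward steepest ascent, and a descent update subtracts it. A minimal sketch with a made-up one-parameter cost J(theta) = (theta - 3)^2 follows; the cost, the starting point, and alpha are assumptions chosen only for illustration.

```python
def J(theta):
    """Made-up cost with its minimum at theta = 3."""
    return (theta - 3.0) ** 2


def grad_J(theta):
    """Derivative of J with respect to theta."""
    return 2.0 * (theta - 3.0)


theta = 0.0
alpha = 0.1

descent = theta - alpha * grad_J(theta)  # step against the gradient
ascent = theta + alpha * grad_J(theta)   # step along the gradient

print("J at start:        ", J(theta))
print("J after descent:   ", J(descent))
print("J after ascent:    ", J(ascent))
```

The step against the gradient lowers the cost and the step along it raises the cost, which is why the update rule carries a minus sign.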

  13. newbielives says:

    Am I the only one impressed by the chalk board that wipes itself clean when
    he lifts it up and pulls it back down?

  14. curcicm says:

    The normal equations fall out immediately from the perpendicularity
    criterion for shortest distance, X^T (X * theta – y) = 0, and you don’t
    have to get into trace computations.
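
A small numeric check of the orthogonality view, as a sketch under stated assumptions: a made-up data set with an intercept column and one feature, so X^T X is 2x2 and can be solved by hand. After solving X^T X theta = X^T y, the residual X theta - y should be orthogonal to both columns of X.

```python
# Hypothetical data; the design matrix X has columns [1, x].
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 2.5, 4.5, 5.0]
m = len(xs)

# Entries of X^T X and X^T y for the two-column design matrix.
s_x = sum(xs)
s_xx = sum(x * x for x in xs)
s_y = sum(ys)
s_xy = sum(x * y for x, y in zip(xs, ys))

# Solve [[m, s_x], [s_x, s_xx]] theta = [s_y, s_xy] by Cramer's rule.
det = m * s_xx - s_x * s_x
theta0 = (s_y * s_xx - s_x * s_xy) / det
theta1 = (m * s_xy - s_x * s_y) / det

residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]

# X^T (X theta - y): one dot product per column of X.
dot_with_ones = sum(residuals)
dot_with_x = sum(r * x for r, x in zip(residuals, xs))
print(dot_with_ones, dot_with_x)  # both should be zero up to rounding
```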

  15. Akshat says:

    NOTE: A^(T) represents the transpose of matrix A.
    At 59:56 it should only be C^(T)AB^(T) and not C^(T)AB^(T) + CAB.
    According to one of the equations above, the gradient of tr(AB) w.r.t. A
    is B^(T), so the gradient of tr(ABA^(T)C) should be (BA^(T)C)^(T), and
    that is equal to C^(T)AB^(T). Please help me sort this out.
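
The rule that the gradient of tr(AB) with respect to A is B^T assumes B does not depend on A. In tr(ABA^T C), A appears twice, so each occurrence contributes its own term. A rough numerical check follows, using random 3x3 matrices and central finite differences; the helper names are made up for this sketch.

```python
import random


def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def transpose(P):
    return [list(col) for col in zip(*P)]


def f(A, B, C):
    """tr(A B A^T C)."""
    M = matmul(matmul(matmul(A, B), transpose(A)), C)
    return sum(M[i][i] for i in range(len(M)))


def numerical_gradient(A, B, C, eps=1e-4):
    """Central differences: entry (i, j) approximates d f / d A[i][j]."""
    grad = [[0.0] * len(A[0]) for _ in A]
    for i in range(len(A)):
        for j in range(len(A[0])):
            plus = [row[:] for row in A]
            minus = [row[:] for row in A]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (f(plus, B, C) - f(minus, B, C)) / (2 * eps)
    return grad


def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))


random.seed(0)
n = 3
A, B, C = ([[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
           for _ in range(3))

one_term = matmul(matmul(transpose(C), A), transpose(B))  # C^T A B^T
other_term = matmul(matmul(C, A), B)                       # C A B
both_terms = [[p + q for p, q in zip(rp, rq)]
              for rp, rq in zip(one_term, other_term)]

numeric = numerical_gradient(A, B, C)
print("gap vs C^T A B^T only:    ", max_abs_diff(numeric, one_term))
print("gap vs C^T A B^T + C A B: ", max_abs_diff(numeric, both_terms))
```

The finite-difference gradient should agree with the two-term expression up to rounding, while the single-term candidate leaves a visible gap equal to the missing CAB term.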

  16. James Khan says:

    At 1:01:52 the design matrix X is m by n. Then he multiplies by theta and
    it looks like we’re just left with an m x 1 vector. Is each x in the
    resulting vector assumed to be n-dimensional, or am I missing something?
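
Each entry of X * theta is a single number, the prediction for one training example: row i of X is the n-dimensional input x^(i) written as a row, and its dot product with theta collapses it to the scalar h(x^(i)). A minimal sketch with made-up numbers (m = 3 examples, n = 2 columns):

```python
# Hypothetical design matrix: one row per training example,
# columns are [intercept term, living area in sq ft].
X = [
    [1.0, 2104.0],
    [1.0, 1600.0],
    [1.0, 2400.0],
]
theta = [50.0, 0.5]  # made-up parameters, one per column of X

# Row i of X dotted with theta gives the scalar prediction for example i.
predictions = [sum(x_j * t_j for x_j, t_j in zip(row, theta)) for row in X]

print(len(X), "examples ->", len(predictions), "predictions")
print(predictions)
```

So the n-dimensional inputs live in the rows of X, and the m x 1 result holds one scalar prediction per row.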

  17. Gliesea says:

    No, he’s not answering the question at the end. Yes, if you use dependent
    feature vectors, you’ll have to use the pseudoinverse and yes this doesn’t
    matter because your feature vector is 1D anyway and you’ll also pick
    independent features. But the student (who does not know how to pronounce
    pseudo-) is merely pointing out that (X^T*X)^-1*X^T is also X^+ or the
    pseudoinverse of X. pinv(X) in MATLAB. Hope this helps anyone who was
