_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
       
       
       COMMENT PAGE FOR:
   URI   Important machine learning equations
       
       
        TrackerFF wrote 1 hour 39 min ago:
        Kind of weird not to see β̂ = (XᵀX)⁻¹Xᵀy
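         
         (A minimal numpy sketch of that closed-form least-squares solution;
         the data and names here are made up for illustration:)
         
             import numpy as np
             
             rng = np.random.default_rng(0)
             X = rng.normal(size=(100, 3))                   # design matrix
             y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
             
             # Normal equations: beta_hat = (X^T X)^(-1) X^T y
             beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
             
             # Numerically more stable equivalent
             beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)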
       
        roadside_picnic wrote 6 hours 5 min ago:
        While this very much looks like AI slop, it does remind me of a
        wonderful little book (which has many more equations): Formulas Useful
        for Linear Regression Analysis and Related Matrix Theory - It's Only
        Formulas But We Like Them [0]
        
        That book is pretty much what it says on the cover, but can be useful
         as a reference given its pretty thorough coverage. Though, in all
        honesty, I mostly purchased it due to the outrageous title.
        
        0.
        
   URI  [1]: https://link.springer.com/book/10.1007/978-3-642-32931-9
       
          nxobject wrote 3 hours 45 min ago:
          Finally, a handy reference to more matrix decompositions and
          normal/canonical forms than I ever realized I wanted to know!
       
        dkislyuk wrote 7 hours 43 min ago:
        Presenting information theory as a series of independent equations like
        this does a disservice to the learning process. Cross-entropy and
        KL-divergence are directly derived from information entropy, where
        InformationEntropy(P) represents the baseline number of bits needed to
        encode events from the true distribution P, CrossEntropy(P, Q)
        represents the (average) number of bits needed for encoding P with a
        suboptimal distribution Q, and KL-Divergence (better referred to as
        relative entropy) is the difference between these two values (how many
        more bits are needed to encode P with Q, i.e. quantifying the
        inefficiency):
        
        relative_entropy(p, q) = cross_entropy(p, q) - entropy(p)
        
        Information theory is some of the most accessible and approachable math
        for ML practitioners, and it shows up everywhere. In my experience,
        it's worthwhile to dig into the foundations as opposed to just
        memorizing the formulas.
        
        (bits assume base 2 here)
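         
         A quick numpy sketch of that identity (illustrative distributions,
         log base 2):
         
             import numpy as np
             
             p = np.array([0.5, 0.25, 0.25])   # true distribution P
             q = np.array([0.4, 0.4, 0.2])     # model distribution Q
             
             entropy = -np.sum(p * np.log2(p))           # H(P)
             cross_entropy = -np.sum(p * np.log2(q))     # H(P, Q)
             kl = np.sum(p * np.log2(p / q))             # D_KL(P || Q)
             
             # relative_entropy(p, q) == cross_entropy(p, q) - entropy(p)
             assert np.isclose(kl, cross_entropy - entropy)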
       
          golddust-gecko wrote 4 hours 53 min ago:
           Agree 100% with this. It gives the illusion of understanding, like
           when a precocious 6 year old learns the word "precocious" and feels
           smart because they can say it. Or any movie with tech or science
           with .
       
          morleytj wrote 5 hours 48 min ago:
          I 100% agree.
          
          I think Shannon's Mathematical Theory of Communication is so
          incredibly well written and accessible that anyone interested in
          information theory should just start with the real foundational work
          rather than lists of equations, it really is worth the time to dig
          into it.
       
        0wis wrote 8 hours 9 min ago:
         Currently improving my foundations in data preparation for ML; this
         short, to-the-point article is a gem.
       
        dawnofdusk wrote 8 hours 58 min ago:
        I have some minor complaints but overall I think this is great! My
        background is in physics, and I remember finally understanding every
        equation on the formula sheet given to us for exams... that really felt
        like I finally understood a lot of physics. There's great value in
        being comprehensive so that a learner can choose themselves to dive
        deeper, and for those with more experience to check their own
        knowledge.
        
        Having said that, let me raise some objections:
        
        1. Omitting the multi-layer perceptron is a major oversight. We have
        backpropagation here, but not forward propagation, so to speak.
        
        2. Omitting kernel machines is a moderate oversight. I know they're not
        "hot" anymore but they are very mathematically important to the field.
        
        3. The equation for forward diffusion is really boring... it's not that
        important that you can take structured data and add noise incrementally
        until it's all noise. What's important is that in some sense you can
        (conditionally) reverse it. In other words, you should put the reverse
        diffusion equation which of course is considerably more sophisticated.
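         
         (For reference, the reverse step in the standard DDPM formulation,
         with ε_θ denoting the learned noise predictor, looks roughly like:)
         
             p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), σ_t² I)
             μ_θ(x_t, t) = (1/√α_t) · (x_t − ((1 − α_t)/√(1 − ᾱ_t)) · ε_θ(x_t, t))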
       
        cgadski wrote 9 hours 43 min ago:
        > This blog post has explored the most critical equations in machine
        learning, from foundational probability and linear algebra to advanced
        concepts like diffusion and attention. With theoretical explanations,
        practical implementations, and visualizations, you now have a
        comprehensive resource to understand and apply ML math. Point anyone
        asking about core ML math here—they’ll learn 95% of what they need
        in one place!
        
        It makes me sad to see LLM slop on the front page.
       
          maerch wrote 9 hours 36 min ago:
           Apart from the “—”, what else gives it away? Just asking from a
          non-native perspective.
       
            nxobject wrote 3 hours 53 min ago:
            Three things come to mind:
            
             - bold-face item headers (e.g. “Practical Significance:”)
             
             - lists of complex descriptors in non-technical parts of the
             writing (“With theoretical explanations, practical
             implementations, and visualizations”)
             
             - the cheery, optimistic note that underlines a goal plausibly
             derived from a prompt (e.g. “Let’s dive into the equations that
             power this fascinating field!”)
       
            Romario77 wrote 9 hours 10 min ago:
             It's just too bombastic for what it is - listing some equations
             with brief explanations and implementations.
             
             If you don't know these things on some level already, the post
             doesn't give you much (far from 95%); it's a brief reference to
             some of the formulas used in machine learning/AI.
       
              random3 wrote 4 hours 32 min ago:
              Slop brings back memories of literature teachers red-marking my
              "bombastic" terms in primary school essays
       
            cgadski wrote 9 hours 16 min ago:
            It's not really about the language. If someone doesn't speak
            English well and wants to use a model to translate it, that's cool.
            What I'm picking up on is the dishonesty and vapidness. The article
            _doesn't_ explore linear algebra, it _doesn't_ have visualizations,
            it's _not_ a comprehensive resource, and reading this won't teach
            you anything beyond keywords and formulas.
            
            What makes me angry about LLM slop is imagining how this looks to a
            student learning this stuff. Putting a post like this on your
             personal blog is implicitly saying: as long as you know some
            "equations" and remember the keywords, a language model can do the
            rest of the thinking for you! It's encouraging people to forgo
            learning.
       
            TFortunato wrote 9 hours 19 min ago:
             This is probably not going to be a very helpful answer, but I sort
             of think of it this way: you probably have favorite authors or
             artists (or maybe some you really dislike!), where you could
             take a look at a piece of their work, even if it's new to you, and
             immediately recognize their voice & style.
            
            A lot of LLM chat models have a very particular voice and style
            they use by default, especially in these longer form "Sure, I can
            help you write a blog article about X!" type responses. Some pieces
            of writing just scream "ChatGPT wrote this", even if they don't
            include em-dashes, hah!
       
              TFortunato wrote 9 hours 9 min ago:
               OK, on reflection, there are a few things.
               
               Kace's response is absolutely right that the summaries tend to be
               a place where there is a big giveaway.
               
               There is also something about the way they use "you" and the
               article itself... E.g. the "you now have a comprehensive resource
               to understand and apply ML math. Point anyone asking about core
               ML math here..." bit. This isn't something you would really
               expect to read in a human-written article. It's a ChatBot
               presenting its work to "you", the single user it's conversing
               with, not an author addressing their readers. Even if you ask the
               bot to write you an article for a blog, a lot of the time its
               response tends to mix in these chatty bits that address the user
               or directly reference the user's questions / prompts in some
               way, which can be really jarring when transferred to a different
               medium w/o some editing.
       
            kace91 wrote 9 hours 33 min ago:
            Not op, but it is very clearly the final summary telling the user
            that the post they asked the AI to write is now created.
       
        bob1029 wrote 10 hours 12 min ago:
        MSE remains my favorite distance measure by a long shot. Its quadratic
        nature still helps even in non-linear problem spaces where convexity is
        no longer guaranteed. When working with generic/raw binary data where
        hamming distance would be theoretically more ideal, I still prefer MSE
        over byte-level values because of this property.
        
        Other fitness measures take much longer to converge or are very
        unreliable in the way in which they bootstrap. MSE can start from a
        dead cold nothing on threading the needle through 20 hidden layers and
        still give you a workable gradient in a short period of time.
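         
         A toy numpy illustration of the MSE-vs-Hamming point (byte values
         made up for the example):
         
             import numpy as np
             
             target = np.array([0x00, 0x0F, 0xF0], dtype=np.uint8)
             guess  = np.array([0x01, 0x0F, 0xEF], dtype=np.uint8)
             
             # Hamming distance: counts flipped bits, steps in integer jumps
             hamming = np.unpackbits(target ^ guess).sum()   # 6 bits differ
             
             # MSE on the byte values: small numeric error, smooth penalty
             mse = np.mean((target.astype(float) - guess.astype(float)) ** 2)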
       
        bee_rider wrote 10 hours 13 min ago:
        Are eigenvalues or singular values used much in the popular recent
        stuff, like LLMs?
       
          calebkaiser wrote 9 hours 48 min ago:
           LoRA uses singular value decomposition to get the low-rank matrices.
          In different optimizers, you'll also see eigendecomposition or some
          approximation used (I think Shampoo does something like this, but
          it's been a while).
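           
           (As a sketch of the low-rank idea, not LoRA's exact recipe: a
           truncated SVD gives the best rank-r approximation of a weight
           matrix.)
           
               import numpy as np
               
               W = np.random.randn(512, 512)       # stand-in weight matrix
               U, S, Vt = np.linalg.svd(W, full_matrices=False)
               
               r = 8                               # target rank
               A = U[:, :r] * S[:r]                # shape (512, r)
               B = Vt[:r, :]                       # shape (r, 512)
               W_approx = A @ B                    # best rank-r approximation of W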
       
        cl3misch wrote 10 hours 38 min ago:
        In the entropy implementation:
        
            return -np.sum(p * np.log(p, where=p > 0))
        
        Using `where` in ufuncs like log results in the output being
        uninitialized (undefined) at the locations where the condition is not
        met. Summing over that array will return incorrect results for sure.
        
        Better would be e.g.
        
            return -np.sum((p * np.log(p))[p > 0])
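         
         To see the difference concretely (a quick sketch; scipy's xlogy is
         shown as one more robust alternative, not something from the post):
         
             import numpy as np
             from scipy.special import xlogy
             
             p = np.array([0.5, 0.5, 0.0])
             
             # Risky: entries where p == 0 are left uninitialized by the ufunc,
             # so the sum can pick up arbitrary values
             risky = -np.sum(p * np.log(p, where=p > 0))
             
             # The fix above: compute, then drop the zero-probability terms
             safe = -np.sum((p * np.log(p))[p > 0])
             
             # Another option: xlogy applies the 0 * log(0) = 0 convention
             also_safe = -np.sum(xlogy(p, p))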
        
        Also, the cross entropy code doesn't match the equation. And, as
        explained in the comment below the post, Ax+b is not a linear operation
        but affine (because of the +b).
        
        Overall it seems like an imprecise post to me. Not bad, but not
        stringent enough to serve as a reference.
       
          jpcompartir wrote 10 hours 21 min ago:
           I would echo some caution if using this as a reference, as in
           another blog the writer states:
           
           "Backpropagation, often referred to as “backward propagation of
           errors,” is the cornerstone of training deep neural networks. It is
           a supervised learning algorithm that optimizes the weights and
           biases of a neural network to minimize the error between predicted
           and actual outputs." [1]
           
           Backpropagation is a supervised machine learning algorithm, pardon?
          
   URI    [1]: https://chizkidd.github.io/2025/05/30/backpropagation/
       
            cl3misch wrote 10 hours 7 min ago:
            I actually see this a lot: confusing backpropagation with gradient
            descent (or any optimizer). Backprop is just a way to compute the
            gradients of the weights with respect to the cost function, not an
            algorithm to minimize the cost function wrt. the weights.
            
             I guess giving the (mathematically) simple principle of computing
             a gradient with the chain rule the fancy name "backpropagation"
             comes from the early days of AI, when computers were much less
             powerful and this seemed less obvious?
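             
             A tiny sketch of the gradients-vs-optimizer separation (one linear
             neuron, made-up numbers):
             
                 import numpy as np
                 
                 x, y_true = np.array([1.0, 2.0]), 1.0
                 w, b = np.array([0.1, -0.2]), 0.0
                 
                 # Forward pass
                 y_pred = w @ x + b
                 loss = (y_pred - y_true) ** 2
                 
                 # "Backprop": the chain rule yields gradients, nothing more
                 dloss_dy = 2 * (y_pred - y_true)
                 dw, db = dloss_dy * x, dloss_dy
                 
                 # The optimizer (plain gradient descent here) does the minimizing
                 lr = 0.1
                 w, b = w - lr * dw, b - lr * db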
       
              imtringued wrote 9 hours 6 min ago:
              The German Wikipedia article makes the same mistake and it is
              quite infuriating.
       
              cubefox wrote 9 hours 7 min ago:
              What does this comment have to do with the previous comment,
              which talked about supervised learning?
       
                imtringued wrote 8 hours 45 min ago:
                Reread the comment
                
                "Backprop is just a way to compute the gradients of the weights
                with respect to the cost function, not an algorithm to minimize
                the cost function wrt. the weights."
                
                What does the word supervised mean? It's when you define a cost
                function to be the difference between the training data and the
                model output.
                
                Aka something like (f(x)-y)^2 which is simply the quadratic
                difference between the result of the model given an input x
                from the training data and the corresponding label y.
                
                 A learning algorithm is an algorithm that produces a model
                 given a cost function; in the case of supervised learning,
                 the cost function is parameterized with the training data.
                
                The most common way to learn a model is to use an optimization
                algorithm.
                There are many optimization algorithms that can be used for
                this. One of the simplest algorithms for the optimization of
                unconstrained non-linear functions is stochastic gradient
                descent.
                
                 It's popular because it is a first-order method. First-order
                 methods only use the first partial derivatives, collected in
                 the gradient, whose size is equal to the number of parameters.
                 Second-order methods converge faster, but they need the
                 Hessian, whose size scales with the square of the number of
                 parameters being optimized.
                
                How do you calculate the gradient? Either you calculate each
                partial derivative individually, or you use the chain rule and
                work backwards to calculate the complete gradient.
                
                 I hope this made it clear that your question is exactly
                 backwards. The referenced blog is about backpropagation and
                 unnecessarily mentions supervised learning when it shouldn't
                 have, and you're the one now sticking with supervised
                 learning even though the comment you're responding to told you
                 exactly why it is inappropriate to call backpropagation a
                 supervised learning algorithm.
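                 
                 As a sketch of the cost-function framing above (a one-parameter
                 model on toy data):
                 
                     import numpy as np
                     
                     # The training data parameterizes the cost (f(x) - y)^2
                     xs = np.array([1.0, 2.0, 3.0])
                     ys = np.array([2.1, 3.9, 6.2])    # roughly y = 2x
                     
                     w = 0.0                           # model: f(x) = w * x
                     lr = 0.05
                     for _ in range(100):
                         i = np.random.randint(len(xs))   # pick one sample
                         grad = 2 * (w * xs[i] - ys[i]) * xs[i]
                         w -= lr * grad                   # first-order update
                     # w ends up close to 2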
       
                cl3misch wrote 8 hours 54 min ago:
                The previous comment highlights an example where backprop is
                confused with "a supervised learning algorithm".
                
                My comment was about "confusing backpropagation with gradient
                descent (or any optimizer)."
                
                For me the connection is pretty clear? The core issue is
                confusing backprop with minimization. The cited article
                mentioning supervised learning specifically doesn't take away
                from that.
       
       