_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
       
       
       COMMENT PAGE FOR:
   URI   Highly efficient matrix transpose in Mojo
       
       
        graycat wrote 5 hours 11 min ago:
         Fast matrix transpose? Agreed: for a transposed matrix, just change
         the indexing arithmetic that converts row i and column j to an
         offset into the matrix's storage, and then remember that this is a
         transposed matrix. Some software object semantics could make this
         easy for other software to use.
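         
         [Illustrative sketch, not from the thread: a C++ "transposed view"
         that swaps i and j in the index arithmetic instead of moving any
         data, along the lines graycat describes. The struct name is made up
         for the example.]
         
             // View of a rows x cols row-major matrix as its transpose.
             struct TransposedView {
                 const float* data;  // row-major storage of the original
                 int rows, cols;     // dimensions of the original matrix
                 // Element (i, j) of the transpose is element (j, i) of the
                 // original, so only the index arithmetic changes.
                 float operator()(int i, int j) const {
                     return data[j * cols + i];
                 }
             };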
       
          jjtheblunt wrote 1 hour 36 min ago:
          i think the problem with changing the indexing arithmetic is that you
          could end up with arithmetic incompatible with vector instructions in
          hardware that you're hoping to use for parallelism.
       
        totalperspectiv wrote 5 hours 23 min ago:
           In the coarse-graining code, you use an @parameter for. Doesn’t
           that lead to some pretty large code size from unrolling it? Or is
           that less of an issue on GPU?
        
        Great write up! I learned a lot!
       
          simon_vtr wrote 4 hours 50 min ago:
           It doesn’t. The batch size is just 8. This is a very good trick and
           often needed to achieve peak performance in memory-bound kernels.
           You can check out the equivalent code in CUDA as well :)
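           
           [Illustrative sketch, not from the thread: roughly what an
           unrolled compile-time batch looks like in CUDA. The kernel name
           and BATCH parameter are made up; this is not the post's kernel.]
           
               // Each thread handles BATCH consecutive elements; the loop is
               // fully unrolled, so the batch adds no runtime loop overhead.
               template <int BATCH>
               __global__ void copy_coarsened(const float* __restrict__ in,
                                              float* __restrict__ out,
                                              int n) {
                   int base = (blockIdx.x * blockDim.x + threadIdx.x) * BATCH;
               #pragma unroll
                   for (int k = 0; k < BATCH; ++k) {
                       int idx = base + k;
                       if (idx < n) out[idx] = in[idx];
                   }
               }
               // e.g. copy_coarsened<8><<<grid, block>>>(in, out, n);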
       
        thunkingdeep wrote 6 hours 22 min ago:
        Is the word archive used in place of achieve? I’m not sure if there
        is a terminology issue that I don’t understand in this post…
       
        daft_pink wrote 7 hours 7 min ago:
         I think Mojo not being a truly open product, and existing to drive
         profits at Modular, has really held it back.
        
        It’s just really impractical to use a licensed programming language
        in 2025.
       
          totalperspectiv wrote 5 hours 24 min ago:
          My impression is that this is on purpose on their part. They’ve
          repeatedly stated that by 2026 they will open source the compiler,
          and I think they’ve wanted a slow adoption ramp in order to spend
          some more time getting it right first.
          
          Possibly rose-tinted glasses on my part, but I’m optimistic for
          2026. Chris Lattner has a pretty strong track record of getting these
          things right.
       
            veidr wrote 5 hours 2 min ago:
            Yeah, and he's clearly trying to avoid what happened to Swift[1].
             Although the risk of "corporate owner priorities dictate
             releasing half-baked/awful changes" is still there, Lattner
             himself has more influence within Modular (obviously, as
             co-founder and CEO) than he did at Apple, so it may work out
             better this time.
            
            [1] 
            
   URI      [1]: https://news.ycombinator.com/item?id=30416070
       
              daft_pink wrote 40 min ago:
               It was just very difficult, primarily because the license
               limitations and install steps made it hard to drop it into the
               existing Python tooling ecosystem.
              
              I haven’t tried it in a long time, but as it’s a Python
              superset, I tried to drop it into my jupyter notebook docker
              container and you had to agree to license terms and register your
              email and install a modular package that contained a bunch of
              extra things.
              
              If you want to get widespread adoption for a python superset, you
              would probably want to get it included in the official jupyter
              docker images as people who do this sort of programming like to
              use a jupyter repl, but they just made it so difficult.
              
              I’m no open source zealot and I’m happy to pay for software,
              but I think the underlying language needs to be a lot more open
              to be practical.
       
              GeekyBear wrote 4 hours 15 min ago:
              > he's clearly trying to avoid what happened to Swift
              
              Also to MLIR while Lattner was at Google:
              
              > MLIR was born—a modular, extensible compiler infrastructure
              designed to bring order to the chaos. It brought forth a
              foundation that could scale across hardware platforms, software
              frameworks, and the rapidly evolving needs of machine learning.
              It aimed to unify these systems, and provide a technology
              platform that could harmonize compute from many different
              hardware makers.
              
              But unification is hard. What started as a technical project
              quickly turned into a battleground: open-source governance,
              corporate rivalries, and competing visions all collided. What
              could have been a straightforward engineering win became
              something much more complicated.
              
   URI        [1]: https://www.modular.com/blog/democratizing-ai-compute-pa...
       
              melodyogonna wrote 4 hours 25 min ago:
              Yeah, Mojo's development has been pretty transparent. Chris
              publishes technical documents for most features and takes
              community feedback into account. A recent example is here: [1]
               Btw, Mojo's development is a masterclass in language
               development and community building; it's been fun watching
               Chris go back to fix technical debt in existing features
               rather than just adding new features.
              
   URI        [1]: https://forum.modular.com/t/variable-bindings-proposal-d...
       
        iandanforth wrote 9 hours 26 min ago:
        I'm probably just ignorant but shouldn't the graphic of the tiled
        transpose have the green vector column-oriented in the final matrix?
       
          somethingsome wrote 6 hours 54 min ago:
           The colors are read/write operations ;)
          
          You have global memory and shared memory, the global is slower.
          
          You read in rows in the global memory (faster than reading columns)
          
          You write in columns in the shared memory (slower than in rows, but
          the shared memory is fast, this is the transpose operation)
          
          You read in rows in the shared memory (very fast)
          
          You write in rows in the global memory (faster than writing in
          columns)
          
          The idea behind that tiling is to hide the slow part in a memory that
          is faster.
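           
           [Illustrative sketch, not from the thread: the classic CUDA
           shared-memory tiled transpose, along the lines of the description
           above. The kernel name and TILE size are made up for the example;
           the strided, column-like access happens only in fast shared
           memory, while both global accesses stay row-wise and coalesced.]
           
               #define TILE 32
               
               __global__ void transpose_tiled(const float* __restrict__ in,
                                               float* __restrict__ out,
                                               int rows, int cols) {
                   // +1 column of padding avoids shared-memory bank conflicts
                   __shared__ float tile[TILE][TILE + 1];
               
                   int x = blockIdx.x * TILE + threadIdx.x;  // input column
                   int y = blockIdx.y * TILE + threadIdx.y;  // input row
                   if (x < cols && y < rows)
                       tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
               
                   __syncthreads();
               
                   x = blockIdx.y * TILE + threadIdx.x;      // output column
                   y = blockIdx.x * TILE + threadIdx.y;      // output row
                   if (x < rows && y < cols)
                       out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
               }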
       
        saagarjha wrote 10 hours 52 min ago:
        > This kernel archives a bandwidth of 1056.08 GB/s which is faster than
        the 875.46 GB/s we archived using CUDA. I believe that to be the reason
        because we use the PTX api for TMA transfers in Mojo.
        
        I can't say for sure because I couldn't find the CUDA kernel but I kind
        of doubt this is true. You can hit memory bandwidth on Hopper without
        using TMA at all, which is mostly designed for accelerating
        asynchronous copies and reducing memory pressure. If all you are doing
        is a transpose you don't need any of this to go fast (though it might
        simplify your indexing code…?)
       
          simon_vtr wrote 7 hours 12 min ago:
           The CUDA kernels I mention use the equivalent logic to the Mojo
           kernels. You can find them on my GitHub: [1] You are welcome to
           provide a faster kernel on H100 via PR and I will merge it after
           checking that it’s faster.
          
   URI    [1]: https://github.com/simveit/effective_transpose
       
        melodyogonna wrote 13 hours 48 min ago:
        I wonder if there is a reason for not using the high level abstractions
        provided by Modular
       
          totalperspectiv wrote 5 hours 21 min ago:
          I’d also add that Mojo is new, and people are still feeling it out
          by trying to 1:1 things with Cuda.
       
          saagarjha wrote 10 hours 58 min ago:
          Most interesting algorithms (e.g. with dynamic shapes, mixed
          computation) are typically better scheduled by hand.
       
            Q6T46nT668w6i3m wrote 5 hours 6 min ago:
             Sure, but Modular’s mission was to provide abstractions that
             minimize the need for these types of hand optimizations.
       
        almostgotcaught wrote 20 hours 29 min ago:
        As someone said below - you'd never write just a transpose kernel -
        it'll be fused into something else.
       
          saagarjha wrote 11 hours 6 min ago:
          Look the frontier AI companies need something other than reversing
          binary trees to give interview candidates
       
            almostgotcaught wrote 5 hours 41 min ago:
            No one is going to ask this on an interview. Used to be matmul. 
            These days it's FA.
       
        londons_explore wrote 21 hours 26 min ago:
        Why do we ever need to transpose a matrix?
        
        Isn't it better to simply combine the transposition with whatever next
        operation one wishes to do with the matrix?
       
          fulafel wrote 14 hours 45 min ago:
           This could make Mojo look even better, as it would be more compute
           heavy and the last-step thread reduction would be less relevant.
       
          hogepodge wrote 20 hours 34 min ago:
          You're right that a good graph compiler will do this for you. There
          still may be times, like if you're interfacing with another library,
           where you'll need to switch a matrix between row-major and
           column-major layouts.
       
            meindnoch wrote 8 hours 26 min ago:
             Serious linear algebra libraries expect a flag that tells
             whether elements are column-major or row-major.
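             
             [Illustrative sketch, not from the thread: a BLAS-style call
             where the layout and transpose flags stand in for an explicit
             transpose. The function name and shapes are made up for the
             example.]
             
                 #include <cblas.h>
                 
                 /* C (m x n) = A^T * B with all buffers row-major. A is
                    stored as k x m, so no transposed copy of A is ever
                    materialized; the CblasTrans flag does the job. */
                 void gemm_at_b(const float* A, const float* B, float* C,
                                int m, int n, int k) {
                     cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                                 m, n, k,
                                 1.0f, A, m,   /* lda = m: A stored k x m */
                                       B, n,   /* ldb = n: B stored k x n */
                                 0.0f, C, n);  /* ldc = n: C stored m x n */
                 }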
       
          throwawayabcdef wrote 20 hours 39 min ago:
          The next operation might need the data in column major order to read
             it fast. So you might have to transpose first. And these may be
             concurrent stages of a processing pipeline.
       
            viraptor wrote 19 hours 32 min ago:
            Now I'm curious, how many times do you have to fully read the
            matrix in GPU for the total impact of reading columns to be higher
            than one-off actual transpose and then sequential row reads? I know
            it depends on lots of things, I'm after a rough estimate.
       
              saagarjha wrote 11 hours 0 min ago:
              It's quite rare. Usually problems are tiled anyway and you can
              amortize the cost of having data in the "wrong" layout by loading
              coalesced in whatever is the best layout for your data and then
              transposing inside your tile, which gives you access to much
              faster memory.
       
                stephencanon wrote 5 hours 9 min ago:
                The one pure transpose case that does come up occasionally is
                an in-place non-square transpose, where there is a rich
                literature of very fussy algorithms. If someone managed to make
                any headway with compiler optimization there, I'd be
                interested.
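                 
                 [Illustrative sketch, not from the thread: the basic
                 cycle-following idea for an in-place non-square transpose
                 in C++. The fussy part in the literature is avoiding the
                 O(n) visited bitmap used here.]
                 
                     #include <vector>
                     
                     // In-place: a row-major rows x cols matrix becomes
                     // a row-major cols x rows matrix in the same buffer.
                     void transpose_inplace(float* a, int rows, int cols) {
                         const long n = static_cast<long>(rows) * cols;
                         std::vector<char> visited(n, 0);
                         for (long start = 0; start < n; ++start) {
                             if (visited[start]) continue;
                             long cur = start;
                             float tmp = a[start];
                             while (true) {
                                 // Value that belongs at 'cur' in the transposed
                                 // layout sits at 'src' in the original layout.
                                 long src = (cur % rows) * cols + (cur / rows);
                                 visited[cur] = 1;
                                 if (src == start) { a[cur] = tmp; break; }
                                 a[cur] = a[src];
                                 cur = src;
                             }
                         }
                     }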
       
        noracists wrote 1 day ago:
        slop
       
        jsnell wrote 1 day ago:
        The "Switching to Mojo gave a 14% improvement over CUDA" title is
        editorialized, the original is "Highly efficient matrix transpose in
        Mojo".
        
         Also, the improvement is 0.14%, not 14%, making the editorialized
         linkbait particularly egregious.
       
          timmyd wrote 21 hours 39 min ago:
           [op here] To be clear: Yes, there are 3 kernels - you can see
           those in the GitHub repo linked at the end of the article. These
           are:
          
          transpose_naive - Basic implementation with TMA transfers
          
          transpose_swizzle - Adds swizzling optimization for better memory
          access patterns
          
          transpose_swizzle_batched - Adds thread coarsening (batch processing)
          on top of swizzling
          
          Performance comparison with CUDA: The Mojo implementations achieve
          bandwidths of:
          
          transpose_naive: 1056.08 GB/s (32.0025% of max)
          
          transpose_swizzle: 1437.55 GB/s (43.5622% of max)
          
          transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)
          
          via the GitHub - simveit/efficient_transpose_mojo
          
          Comparing to the CUDA implementations mentioned in the article:
          
          Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s
          
          Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76 GB/s
          
          Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's 2771.35
          GB/s
          
           So there is a highly efficient matrix transpose in Mojo.
          
          All three Mojo kernels outperform their CUDA counterparts, with the
          naive and swizzle kernels showing significant improvements (20.6% and
          14.8% faster respectively), while the final optimized kernel achieves
          essentially identical performance (slightly better by 4.14 GB/s).
          
          The "flag" here seemed innapropriate given that its true this
          implementation is indeed faster, and certainly the final iteration
          could be improved on further. It wasn't wrong to say 14% or even 20%.
       
            jsnell wrote 20 hours 44 min ago:
            Users of the site only have one control available: the flag.
            There's no way to object only to the title but not to the post, and
            despite what you say that title hit the trifecta: not the original
            title, factually incorrect, and clickbait. So I'm not that
            surprised it got flagged (even if I did not flag it myself).
            
            Email the mods at hn@ycombinator.com. There's a chance they'll
            remove the flag and re-up the post.
       
              timmyd wrote 20 hours 36 min ago:
               Thanks jsnell - I did, and they appreciated the comment above
               and unflagged it. I appreciate it!
       
          atomicapple wrote 1 day ago:
          I think the OP based the title off of "This kernel archives 1437.55
          GB/s compared to the 1251.76 GB/s we get in CUDA" (14.8%) and not the
          final kernels for whatever reason
       
          jebarker wrote 1 day ago:
          Yeah, it seems like the blog post is just meant to be an example of
          how to do something in Mojo and not a dunk on CUDA.
       
            timmyd wrote 21 hours 14 min ago:
             FWIW I didn't take the blog as a dunk on CUDA, just as an
             impressive outcome from the blog writer in Mojo. It's awesome to
             see this on Hopper - if it makes it go faster that's great.
       
          baal80spam wrote 1 day ago:
          0.14% is within the limits of statistical error. So this is a
          nothing-"article".
       
            jsnell wrote 1 day ago:
            I don't think that's fair. The article promised a highly efficient
            kernel and seems to have delivered exactly that, which isn't
            "nothing". My beef is entirely with the submitted title.
       
        arjvik wrote 1 day ago:
        Where's the 14%? Looks like their final kernels show a 0.14%
        improvement of Mojo over the equivalent CUDA kernel?
       
          77pt77 wrote 1 day ago:
          It looks because it does.
          
          >(2771.35/2775.49 - 1) * 100 = -.14916285052369131300
          
          Flagged.
       
            timmyd wrote 1 day ago:
            Updated the title to the original. I did base the numbers on
            
            "This kernel archives 1437.55 GB/s compared to the 1251.76 GB/s we
            get in CUDA" (14.8%) which is still impressive
       
        voronar wrote 1 day ago:
        Mr. Mojo Risin'
       
        vlan121 wrote 1 day ago:
         Mojo's compiler is closed source. That's a big no-no.
       
          dgurchenkov wrote 21 hours 37 min ago:
          I work on Mojo. The whole compiler, runtime etc. will get open
          sourced, most likely within a year. It is just a matter of time and
          us getting all the required work done.
          
   URI    [1]: https://docs.modular.com/mojo/faq/#open-source
       
            xiphias2 wrote 8 hours 11 min ago:
             "Will get open sourced" means it is closed source today, which
             is what the parent wrote.
       
              GeekyBear wrote 5 hours 3 min ago:
              Chris Lattner (the CEO of Modular) was previously the technical
              lead behind the creation of LLVM, Clang and Swift, all of which
              were open sourced.
              
              He has a bit of a track record already.
       
                xiphias2 wrote 3 hours 6 min ago:
                 Sure, but at that time he was employed by Apple, for example.
                 
                 Now he's building a for-profit company, and there's already
                 the MAX and MAX Enterprise stuff, so it's hard to trust that
                 the open-source part will be competitive with the already
                 great inference frameworks out there.
       
            almostgotcaught wrote 20 hours 28 min ago:
            > runtime
            
            Are you talking about your libc equivalent or MAX?
       
              dgurchenkov wrote 4 hours 18 min ago:
              Both.
              
               The Mojo standard library is already open source. Mojo at the moment
              does not need a runtime (but if it ever needs one it'd get open
              sourced). My point was, Mojo as a whole, as a programming
              language & a reference implementation, will definitely get open
              sourced.
              
              MAX itself is a bigger beast to work with, and I am out of my
              depth to talk about it. I think it'll get open sourced as well,
              just the timeline might be different (shorter or longer, IDK).
       
        htrp wrote 1 day ago:
        Left unsaid, the 14% improvement in performance came at the cost of
        increasing dev time by 35%
       
          bravetraveler wrote 1 day ago:
          Reminds me of this, lol:
          
          > "From the moment I understood the weakness of my flesh, it
          disgusted me. I craved the strength and certainty of steel."
          
          14% all the time vs 35% some of the time
          
          edit: Closing numbers are far less impressive than those buried in
          the middle of the post. Confusing; bye everyone
       
        colesantiago wrote 1 day ago:
        Does anyone use Mojo in production at all or are even hiring for Mojo?
       
          melodyogonna wrote 11 hours 29 min ago:
          Modular (the company behind Mojo) uses it in production. I imagine
          that if they have any clients then those also use Mojo in production
          - albeit indirectly - since all the GPU kernels used by Modular are
          written in Mojo.
       
        sestep wrote 1 day ago:
        I'm not an expert in this space, but is this meaningful? I'd assume
        that it's more common to fuse together transposition with an operation
        that precedes or follows it (e.g. matmul), which should be far more
        efficient than materializing the entire transposition in memory if it's
        just an intermediate value.
       
          musebox35 wrote 14 hours 35 min ago:
           Matrix transpose is a canonical example of a memory-bound
           operation and is often used to showcase optimization in a
           particular programming language or library. See for example the
           CUTLASS matrix transpose tutorial from Jay Shah of the Flash
           Attention 3 paper:
          
   URI    [1]: https://research.colfax-intl.com/tutorial-matrix-transpose-i...
       
            saagarjha wrote 11 hours 1 min ago:
            Unfortunately the issue (alluded to in the blog post you linked) is
            that transposes do absolutely no work but memory loads. Sure, they
            test that you can swizzle your accesses, but modern accelerators
            are all about pipelining and feeding matrix multiply units, which
            is considerably harder than loading from memory as fast as
            possible. Actually, even the Mojo post barely beats CUDA for most
            of its kernels, because you can hit memory bandwidth for transpose
            on the latest hardware using techniques from 5-10 years ago. This
            is definitely not true for more interesting operations.
       
              musebox35 wrote 8 hours 20 min ago:
               I totally agree that the resulting kernel will rarely be useful.
              I just wanted to highlight that it is a commonly used educational
              exercise to showcase how to optimize for memory throughput. If
              the post showed how to fuse a transpose + rmsnorm epilogue to a
              gemm then the kernel would be more functional but the blog post
              would be much harder to follow for newcomers.
              
              Jay Shah’s later articles contain examples that involve
              epilogue fusion. IMHO, understanding how to write an efficient
              transpose helps with following the more involved ones.
       
                simon_vtr wrote 7 hours 8 min ago:
                 That was exactly my reason to write this blog post and
                 optimise the transpose. It is a simple, educational, yet
                 non-trivial example for learning the basics.
       
       