_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   Tensor Product Attention Is All You Need
       
       
        thunkingdeep wrote 9 min ago:
        If you don’t pay to read papers, you don’t get to complain about
        the titles, imo.
        
        I hate ads, but I’m not paying for YouTube Premium either. That’s
        how it goes. I get ads.
       
        joshdavham wrote 1 hour 44 min ago:
        I'm sorry but can people please stop naming their papers "X is all you
        need"? It's super annoying.
       
          recursive wrote 1 hour 26 min ago:
          Are you saying... you consider it harmful?
       
        cute_boi wrote 2 hours 19 min ago:
        > a novel attention mechanism
        
         Why does every paper have to mention the word "novel"? And these titles
         are getting crazier day by day.
       
          NitpickLawyer wrote 32 min ago:
           If your paper is scored / gated on a "novelty factor" by admission
           committees, then applicants will overuse that term.
       
          patrick451 wrote 1 hour 56 min ago:
           Because to publish in a real journal, you typically need both novelty
           and for your work to be "interesting". The job of the abstract and
           introduction of a paper (where the word "novel" normally lives) is to
           sell the reviewer on publishing the paper and to sell you on reading
           and citing it.
       
          verdverm wrote 1 hour 58 min ago:
           There are a number of papers that aim to improve the attention
           mechanism of these models, all riffing on the original
           "Attention Is All You Need" paper, so a pattern of "'blank' Attention
           Is All You Need" titles has emerged.
       
        esafak wrote 2 hours 24 min ago:
        Tensor decomposition has traditionally suffered from high computational
        complexity. Is it an issue here?
       
          dartos wrote 1 hour 50 min ago:
           As a sniff test, it would make sense.
           
           You're trading computational complexity for space.
       
          absolutelastone wrote 1 hour 58 min ago:
          Looks like it's just a matrix decomposition in the paper. I'm
          guessing anyway. These attention papers are always a painful mix of
          mathematical, quasi-mathematical, and information retrieval jargon.
          
           There is something in the GitHub repo about higher-order
           decompositions, but I don't see where the method for factoring is given.
       
            verdverm wrote 1 hour 55 min ago:
            I chuckled when I read, in S-3.1
            
            > Specifically, for each token t, with a small abuse of notation,
            we define:
       
          verdverm wrote 2 hours 0 min ago:
          My math is rusty, but it looks to have a higher complexity than the
          original attention. I cannot say if it is an issue. Generally it
          seems we are willing to spend more computation at training time if it
          produces better results at inference time. In this case they are
          reducing the resources needed at inference time (an order of
          magnitude for the KV cache) or enabling longer sequences given the
          same resources.
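           
           For intuition, here's a back-of-the-envelope sketch in Python. The
           shapes and rank are made-up numbers, and the factorization is only my
           reading of the paper (cache small per-token factors and rebuild K/V
           as sums of outer products), so treat it as illustrative rather than
           the paper's exact method:
           
             import torch
             
             num_heads, head_dim, rank, seq_len = 32, 128, 2, 8192  # made-up sizes
             
             # Standard cache: full K and V, each (num_heads x head_dim), per token.
             standard_floats = seq_len * 2 * num_heads * head_dim
             
             # Factored cache: per token, keep `rank` head-side factors and
             # `rank` dim-side factors for K and for V, rebuilding on the fly.
             factored_floats = seq_len * 2 * rank * (num_heads + head_dim)
             
             print(standard_floats / factored_floats)  # ~13x, roughly an order of magnitude
             
             # Reconstructing K for one token is a sum of outer products --
             # extra compute traded for the saved memory (any 1/rank
             # normalization is omitted here):
             a = torch.randn(rank, num_heads)        # head-side factors for token t
             b = torch.randn(rank, head_dim)         # dim-side factors for token t
             K_t = torch.einsum('rh,rd->hd', a, b)   # (num_heads, head_dim)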
          
          There's another paper I saw yesterday, "Element-wise Attention is All
          You Need" which looks like an early preprint, written by a solo
          author with a solo A800, and tested on some smaller problems. If the
           results hold up on language benchmarks, it could reduce resource
           requirements during training as well. It also looks to have lower
           complexity when scaling.
          
   URI    [1]: https://arxiv.org/abs/2501.05730
       
        carbocation wrote 3 hours 17 min ago:
        My kingdom for renaming this paper to something like "Tensor Product
        Attention is a Memory-Efficient Approach for Long-Sequence Language
        Modeling"
       
          Zacharias030 wrote 1 hour 42 min ago:
          If you don’t like the title, wait till you see this acronym:
          „… we introduce the
          Tensor ProducT ATTenTion Transformer (T6), a new model
          architecture…“
       
            imjonse wrote 1 hour 5 min ago:
             There is a famous transformer model named T5 from Google, and also
             S4, S5 and S6 (Mamba) in the LLM space, so this naming is not
             unusual.
       
        whymauri wrote 3 hours 18 min ago:
        I really can't with these paper titles anymore, man.
       
          byyoung3 wrote 50 min ago:
          haha same
       
          anigbrowl wrote 2 hours 23 min ago:
           By 2038 all scientific papers will be titled 'Bruh.' While this might
           at first seem a recipe for confusion, the fundamental
           interconnectedness of all things as demonstrated by Ollama (Googol 13)
           highlights the fact that pretty much any insight is as good as any
           other, and that all are descriptions of the same underlying phenomenon.
           Freed from constraints like survival or the necessity to engage in
           economic activity, humanity of the 2030s will mainly devote itself to
           contemplating amusing but fundamentally interchangeable perspectives
           within increasingly comfy pleasure cubes.
       
            01HNNWZ0MV43FF wrote 57 min ago:
            As foretold by Joseph Campbell
       
            smlacy wrote 59 min ago:
            Bruh is all you need
       
          wisty wrote 2 hours 37 min ago:
          Clickbait paper titles considered harmful?
       
            gbnwl wrote 2 hours 22 min ago:
            OK I'll admit I chuckled
       
          magicalhippo wrote 2 hours 57 min ago:
          There's an Ask HN thread going[1] asking about what people have done
          with small LLMs. This seems like a possible application. I asked
          Granite 3.1 MOE 3B to generate a title based on the abstract and it
          came up with:
          
          Tensor Product Attention: A Memory-Efficient Solution for Longer
          Input Sequences in Language Models
          
           Maybe a Greasemonkey script to pass arXiv abstracts to a local Ollama
           could be something... (rough sketch after the link below)
          
          [1] 
          
   URI    [1]: https://news.ycombinator.com/item?id=42784365
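        
           Here's a rough sketch of the non-Greasemonkey half in Python, talking
           to Ollama's local REST endpoint. The model tag and the prompt are my
           assumptions, and the abstract is pasted in by hand rather than
           scraped from arXiv:
           
             import json
             import urllib.request
             
             abstract = """...paste the arXiv abstract here..."""
             
             payload = json.dumps({
                 "model": "granite3.1-moe:3b",   # assumed tag; any local model works
                 "prompt": "Write a plain, descriptive paper title for this "
                           "abstract, no hype, one line only:\n\n" + abstract,
                 "stream": False,                # ask for a single JSON response
             }).encode()
             
             req = urllib.request.Request(
                 "http://localhost:11434/api/generate",   # Ollama's default API
                 data=payload,
                 headers={"Content-Type": "application/json"},
             )
             with urllib.request.urlopen(req) as resp:
                 print(json.loads(resp.read())["response"].strip())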
       
          ilove196884 wrote 2 hours 59 min ago:
           I hate how paper titles are worded like SEO techniques.
       
            verdverm wrote 2 hours 22 min ago:
             This is a riff on the original "Attention Is All You Need" paper;
             there have been a few of these lately.
       
            spiritplumber wrote 2 hours 45 min ago:
             Turn something into a metric and it will be misused. It was ever
             thus.
       
       
   DIR <- back to front page