COMMENT PAGE FOR: Tensor Product Attention Is All You Need

thunkingdeep wrote 9 min ago:
If you don't pay to read papers, you don't get to complain about the titles, imo. I hate ads, but I'm not paying for YouTube Premium either. That's how it goes. I get ads.

joshdavham wrote 1 hour 44 min ago:
I'm sorry, but can people please stop naming their papers "X Is All You Need"? It's super annoying.

recursive wrote 1 hour 26 min ago:
Are you saying... you consider it harmful?

cute_boi wrote 2 hours 19 min ago:
> a novel attention mechanism
Why does every paper have to mention the word "novel"? And these titles are getting crazier by the day.

NitpickLawyer wrote 32 min ago:
If your paper is scored / gated on its "novelty" factor by admission committees, then applicants will overuse that term.

patrick451 wrote 1 hour 56 min ago:
Because to publish in a real journal, you typically need both novelty and for your work to be "interesting". The job of the abstract and introduction of a paper (where the word "novel" normally lives) is to sell the reviewer on publishing the paper, and to sell you on reading and citing it.

verdverm wrote 1 hour 58 min ago:
There are a number of papers that aim to improve the attention mechanism of models, all of them derivations of the original "Attention Is All You Need" paper. A pattern of "'blank' Attention Is All You Need" has emerged.

esafak wrote 2 hours 24 min ago:
Tensor decomposition has traditionally suffered from high computational complexity. Is it an issue here?

dartos wrote 1 hour 50 min ago:
At a sniff test it would make sense: trading computational complexity for space.

absolutelastone wrote 1 hour 58 min ago:
Looks like it's just a matrix decomposition in the paper. I'm guessing, anyway. These attention papers are always a painful mix of mathematical, quasi-mathematical, and information-retrieval jargon. There is something in the GitHub repo about higher-order decompositions. I can't find where the method for factoring is given.

verdverm wrote 1 hour 55 min ago:
I chuckled when I read, in Section 3.1:
> Specifically, for each token t, with a small abuse of notation, we define:

verdverm wrote 2 hours 0 min ago:
My math is rusty, but it looks to have a higher complexity than the original attention. I cannot say whether that is an issue. Generally it seems we are willing to spend more computation at training time if it produces better results at inference time. In this case they are reducing the resources needed at inference time (an order of magnitude for the KV cache), or enabling longer sequences given the same resources.
There's another paper I saw yesterday, "Element-wise Attention is All You Need" [1], which looks like an early preprint, written by a solo author with a single A800 and tested on some smaller problems. If the results hold up on language benchmarks, it could reduce resource requirements during training as well. It looks to have a lower complexity when scaling.

[1]: https://arxiv.org/abs/2501.05730
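To make the KV-cache saving discussed above concrete, here is a minimal NumPy sketch of the rank-R factorization idea as described in the paper's abstract. The sequence length, head count, head dimension, and rank are illustrative values, not numbers from the paper, and the reconstruction is one reading of the method rather than the authors' code.

  import numpy as np

  # Illustrative sizes; not taken from the paper.
  seq_len, n_heads, head_dim, rank = 2048, 32, 128, 2
  rng = np.random.default_rng(0)

  # Standard attention caches the full per-token keys: (seq, heads, head_dim).
  k_cache_full = rng.standard_normal((seq_len, n_heads, head_dim))

  # A tensor-product factorization instead caches two small factors per token,
  # a_t of shape (rank, heads) and b_t of shape (rank, head_dim), and rebuilds
  # K_t = (1/rank) * sum_r outer(a_t[r], b_t[r]) on the fly.
  a_factors = rng.standard_normal((seq_len, rank, n_heads))
  b_factors = rng.standard_normal((seq_len, rank, head_dim))

  def reconstruct_keys(a, b, r):
      # Summing over the rank axis yields a (seq, heads, head_dim) tensor.
      return np.einsum("srh,srd->shd", a, b) / r

  k_cache_tpa = reconstruct_keys(a_factors, b_factors, rank)

  full_entries = k_cache_full.size                # seq * heads * head_dim
  tpa_entries = a_factors.size + b_factors.size   # seq * rank * (heads + head_dim)
  print(f"full K cache entries:     {full_entries:,}")
  print(f"factored K cache entries: {tpa_entries:,} (~{full_entries / tpa_entries:.0f}x smaller)")

With these toy numbers the factored cache is roughly 13x smaller, which is the kind of order-of-magnitude saving the comment mentions; values can be stored the same way, at the cost of reconstructing full keys and values during attention.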
carbocation wrote 3 hours 17 min ago:
My kingdom for renaming this paper to something like "Tensor Product Attention Is a Memory-Efficient Approach for Long-Sequence Language Modeling".

Zacharias030 wrote 1 hour 42 min ago:
If you don't like the title, wait till you see this acronym: "… we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture…"

imjonse wrote 1 hour 5 min ago:
There is a famous transformer model named T5 from Google, and also S4, S5, and S6 (Mamba) in the LLM space, so it is not unusual naming.

whymauri wrote 3 hours 18 min ago:
I really can't with these paper titles anymore, man.

byyoung3 wrote 50 min ago:
haha same

anigbrowl wrote 2 hours 23 min ago:
By 2038 all scientific papers will be titled "Bruh." While this might at first seem a recipe for confusion, the fundamental interconnectedness of all things, as demonstrated by Ollama (Googol 13), highlights the fact that pretty much any insight is as good as any other, all of them descriptions of the same underlying phenomenon. Freed from constraints like survival or the necessity to engage in economic activity, humanity in the 2030s will mainly devote itself to contemplating amusing but fundamentally interchangeable perspectives within increasingly comfy pleasure cubes.

01HNNWZ0MV43FF wrote 57 min ago:
As foretold by Joseph Campbell

smlacy wrote 59 min ago:
Bruh is all you need

wisty wrote 2 hours 37 min ago:
Clickbait paper titles considered harmful?

gbnwl wrote 2 hours 22 min ago:
OK, I'll admit I chuckled

magicalhippo wrote 2 hours 57 min ago:
There's an Ask HN thread going [1] asking about what people have done with small LLMs. This seems like a possible application. I asked Granite 3.1 MoE 3B to generate a title based on the abstract, and it came up with: "Tensor Product Attention: A Memory-Efficient Solution for Longer Input Sequences in Language Models". Maybe a Greasemonkey script to pass arXiv abstracts to a local Ollama could be something... (a sketch of the Ollama call is at the end of this page).

[1]: https://news.ycombinator.com/item?id=42784365

ilove196884 wrote 2 hours 59 min ago:
I hate how paper titles are worded like SEO techniques.

verdverm wrote 2 hours 22 min ago:
This is a riff on the original "Attention Is All You Need" paper; there have been a few of these lately.

spiritplumber wrote 2 hours 45 min ago:
Turn something into a metric and it will be misused. Ever always was.
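Following up on magicalhippo's Greasemonkey idea above, here is a rough sketch of the local half of it: posting an abstract to Ollama's HTTP API and asking for a plain descriptive title. It assumes Ollama is running on its default port and that a Granite MoE model has already been pulled; the model tag, prompt, and abstract.txt file are placeholders to adjust.

  import json
  import urllib.request

  OLLAMA_URL = "http://localhost:11434/api/generate"

  def descriptive_title(abstract, model="granite3.1-moe"):
      # Ask the local model for a single plain title; "stream": False returns
      # the whole completion as one JSON object with a "response" field.
      prompt = ("Write one plain, descriptive title for a paper with the "
                "following abstract. Reply with the title only.\n\n" + abstract)
      payload = json.dumps({"model": model,
                            "prompt": prompt,
                            "stream": False}).encode("utf-8")
      req = urllib.request.Request(OLLAMA_URL, data=payload,
                                   headers={"Content-Type": "application/json"})
      with urllib.request.urlopen(req) as resp:
          return json.loads(resp.read())["response"].strip()

  if __name__ == "__main__":
      # e.g. an abstract pasted from the arXiv page into a local file
      abstract = open("abstract.txt").read()
      print(descriptive_title(abstract))

A Greasemonkey version would do the same POST from the abstract page and swap the rewritten title into the DOM; the hard part is prompt wording, not plumbing.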