_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
       
       
       COMMENT PAGE FOR:
   URI   Qwen VLo: From "Understanding" the World to "Depicting" It
       
       
        afro88 wrote 5 hours 30 min ago:
         Strangely, the image change examples (edits, style transfer, etc.)
         have that slight yellow tint that GPT Image 1 (ChatGPT 4o's latest
         image model) has. Why is that? Flux Kontext doesn't seem to do
         that.
       
        godelski wrote 7 hours 35 min ago:
         As an ML researcher and a degree-holding physicist, I'm really
         hesitant to use the words "understanding" and "describing" (much
         less hesitant about the latter) around these models. I don't find
         the language helpful and think it's mostly harmful tbh.
        
         The reason we use math in physics is its specificity. The same
         reason coding is so hard [0,1]. I think people aren't giving
         themselves enough credit here for how much they (you) understand
         about things. It is the nuances that really matter. There's so
         much detail here, and we often forget how important it is because
         it is just normal to us. It's like forgetting about the ground you
         walk upon.
        
         I think something everyone should read is Asimov's "The Relativity
         of Wrong"[2]. This is what we want to see in these systems if we
         want to start claiming they understand things. We want to see them
         do deduction and abduction. To be able to refine concepts and
         ideas. To be able to discover things that are more than just a
         combination of things they've ingested. What's really difficult
         here is that we train these things on all human knowledge, and
         just reciting that knowledge back doesn't demonstrate
         intelligence. It's very unlikely that they losslessly compress
         that knowledge into these model sizes, but without very deep
         investigation into that data and probing of this knowledge, it is
         very hard to tell what a model knows and what it memorizes.
         Really, this is a very poor way to go about trying to make
         intelligence[3], or at least to make intelligence and end up
         knowing it is intelligent.
        
         To really "understand" things we need to be able to propose
         counterfactuals[4]. Every physics statement is a counterfactual
         statement. Take F=ma as a trivial example. We can modify the mass
         or the acceleration to our heart's content and still determine the
         force. We can observe a specific mass moving at a specific
         acceleration and then ask the counterfactual "what if it was twice
         as heavy?" (twice the mass). *We can answer that!* In fact, your
         mental model of the world does this too! You may not be describing
         it with math (maybe you are ;) but you are able to propose
         counterfactuals and do a pretty good job a lot of the time. That
         doesn't mean you're always right, though. But the way our heads
         work is through these types of systems. You daydream these things,
         you imagine them while you play, and all sorts of things. This, I
         can say with high confidence, is not something modern ML (AI)
         systems do.
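         
         To make this concrete, here is a toy sketch in Python (my own
         example, nothing from the article): we observe one event, then
         answer a counterfactual that was never observed.
         
           # Newton's second law as a tiny causal model: F = m * a
           def force(mass_kg, accel_ms2):
               return mass_kg * accel_ms2
           
           m_obs, a_obs = 2.0, 3.0              # the observed event
           f_actual = force(m_obs, a_obs)       # 6.0 N
           # counterfactual: "what if it was twice as heavy?"
           f_heavier = force(2 * m_obs, a_obs)  # 12.0 N, never observed
           print(f_actual, f_heavier)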
        
          == Edit ==
        
         A good example of lack of understanding is the image OP uses. Not
         only does the right hand have the wrong number of fingers, but
         look at the keys on the keyboard. It does not take much
         understanding to recognize that you shouldn't have repeated
         keys... the configuration is all wonky too, like one of those
         dreams you can immediately tell is a dream[5]. I'd also be willing
         to bet that the number of keys doesn't align with the number of
         markers, and the sizing definitely looks off. The more you look at
         it the worse it gets, and that's really common among these
         systems: nice at a quick glance, but DEEP in the uncanny valley at
         more than a glance, and deeper the more you look.
        
         [0] [1] Code is math. There's an isomorphism between
         Turing-complete languages and computable mathematics. You can look
         more into my namesake, Church, and Turing if you want to get more
         formal, or wait for the comment that corrects a nuanced mistake
         here (yes, it exists). Also, note that physics and math are not
         the same thing, but mathematics is unreasonably effective (yes,
         this is a reference).
         
         [2]
         
         [3] This is a very different statement than "making something
         useful." Without a doubt these systems are useful. Do not conflate
         the two.
         
         [4]
         
         [5] Yes, you can read in dreams. I do it frequently. Though on
         occasion I have lucid dreamed because I read something and noticed
         that it changed when I looked away and looked back.
        
   URI  [1]: https://youtube.com/watch?v=cDA3_5982h8
   URI  [2]: https://hermiene.net/essays-trans/relativity_of_wrong.html
   URI  [3]: https://en.wikipedia.org/wiki/Counterfactual_thinking
       
          BoorishBears wrote 5 hours 30 min ago:
          As a person who builds stuff, I'm tired of these strawmen.
          
          It is helpful that they chose words that are widely understood to
          represent input vs output.
          
           They even used scare quotes to signal they're not making some
           overly grand claim about the long-tail implications of the
           terms.
          
          -
          
           A person reading the release would learn that previously Qwen
           had a VLM that could
           understand/see/perceive/whateverwordyouwanttouse, and now it can
           generate images, which could be
           depicting/drawing/portraying/whateverotherwordyouwanttouse.
          
          We don't have to invent a crisis past that.
       
            godelski wrote 3 hours 17 min ago:
            > As a person who builds stuff, I'm tired of these strawmen.
            
            Who says I don't build stuff?[0]
            
             Allow me to quote Knuth. I think we can agree he built a lot
             of stuff:
            
              | If you find that you're spending almost all your time on
            theory, start turning some attention to practical things; it will
            improve your theories. If you find that you're spending almost all
            your time on practice, start turning some attention to theoretical
            things; it will improve your practice.
            
            This is important. I don't know you and your beliefs, but some
            people truly believe theory is useless. But it's the foundation of
            everything we do.
            
              > We don't have to invent a crisis past that.
            
             You're right. But I'm not inventing one. Qwen isn't the only
             one here in the larger conversation. Look around the comments
             and see who can't tell the difference. Look at the
             announcements companies make. PhD-level intelligence? lol. So
             I suggest taking your own advice. I've made no strawman...
            
             [0] In my undergrad I did experimental physics, not theory. I
             then worked as an aerospace engineer for years. I built a
             literal rocket engine. I built advanced radiation shielding
             that NASA uses. Then I came back to school, and my PhD is in
             CS. I build things. Don't assume that my wanting to understand
             things interferes with that. Truth is, I'm good at building
             things because I spend time with theory. See Knuth.
       
              BoorishBears wrote 23 min ago:
              I didn't say you don't build stuff: that diatribe is just very
              clearly someone speaking as an academic.
              
              You're presumably intelligent enough to realize the writer here
              wasn't trying to define "understanding" from first principles.
              
              And from a more practical mindset you'd hopefully realize it's
              not a useful expenditure of energy for them or the reader to
              enter the tarpit in the first place.
              
              -
              
              So far, if I extract the one practice-minded point you've touched
              on, it's much narrower: how the lack of generalization intersects
              with parties making claims about "PhD levels of intelligence"
              based on narrow benchmarks.
              
               That's the conversation that can be had without resorting to
               strawmen or declaring an impasse on the language used to
               describe these systems until we've found the terms that
               satisfy all other disciplines in addition to this one.
              
              Maybe you've spent your life absorbing Knuth's essence and know
              better than me, but he strikes me as pragmatic enough to not fall
              for that trap either.
              
              He even refers to LLMs as X% intelligent machines after he
              decided having someone else use ChatGPT on his behalf was the
              best way to evaluate it, right?
       
        veltas wrote 9 hours 7 min ago:
         Rather, I think machine learning has made a lot more progress
         'depicting' the world than 'understanding' it.
       
          ivape wrote 8 hours 45 min ago:
           Why do you think humans understand the world any better? We have
           emotions about the world, but emotions do not grant you
           understanding, where “understanding” is something you would
           still need to define.
           
           “I get it” is actually just some arbitrary personal benchmark.
       
        b0a04gl wrote 10 hours 5 min ago:
         image gets compressed into 256 tokens before the language model
         sees it. ask it to add a hat and it redraws the whole face,
         because objects aren't stored as separate things. there's no
         persistent bear in memory; it all lives inside one fused latent
         soup, and edits are fresh samples under new constraints. every
         prompt tweak rebalances the whole embedding. that's why even small
         changes ripple across the image. it behaves like single-shot scene
         synthesis, which is good for different use cases.
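         
         a rough sketch of that bottleneck (shapes and modules are my
         guesses, not Qwen's actual architecture): a fixed set of query
         vectors cross-attends to the image patches, so every image comes
         out as the same 256-token summary, with no per-object slots to
         edit.
         
           import torch
           import torch.nn as nn
           
           class ImageTokenizer(nn.Module):
               def __init__(self, patch=14, dim=512, n_tokens=256):
                   super().__init__()
                   # cut the image into patches, then pool into n_tokens
                   self.patchify = nn.Conv2d(3, dim, patch, stride=patch)
                   self.queries = nn.Parameter(torch.randn(n_tokens, dim))
                   self.attn = nn.MultiheadAttention(dim, 8,
                                                     batch_first=True)
           
               def forward(self, img):               # (B, 3, 224, 224)
                   x = self.patchify(img)            # (B, dim, 16, 16)
                   x = x.flatten(2).transpose(1, 2)  # (B, 256, dim)
                   q = self.queries.expand(img.size(0), -1, -1)
                   tokens, _ = self.attn(q, x, x)    # (B, 256, dim)
                   return tokens  # one fused summary of the whole scene
           
           toks = ImageTokenizer()(torch.randn(1, 3, 224, 224))
           print(toks.shape)  # torch.Size([1, 256, 512])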
       
          leodriesch wrote 9 hours 34 min ago:
           That's what I really like about Flux Kontext: it has similar
           editing capabilities to the multimodal models, but doesn't mess
           up the details. The editing with gpt-image-1 only really works
           for complete style changes like "make this Ghibli", but not for
           adding glasses to a photorealistic image and having it retain
           all the details.
       
            vunderba wrote 6 hours 58 min ago:
             Agreed. Kontext's ability to basically do the equivalent of
             img2img inpainting is hugely impressive.
             
             Even when used to add new details, it sticks very strongly to
             the existing image's overall aesthetic.
            
   URI      [1]: https://specularrealms.com/ai-transcripts/experiments-with...
       
        djaychela wrote 10 hours 18 min ago:
         How do you stop the auto reading out? Why can't websites just sit
         there and wait until I ask them to do something? It full-screen
         auto-played a video on watch and then just started reading.
         
         Firefox on iOS, FTR.
       
          the_pwner224 wrote 20 min ago:
          Settings => Site Settings => Auto play: Block audio and video
          
          That's on FF Android, not sure if the iOS version has the same
          capability. Desktop also has it. You can also completely block
          websites from asking to send you notifications there.
       
        hexmiles wrote 10 hours 23 min ago:
         While looking at the examples of editing the bear image, I noticed
         that the model seemed to change more things than it was strictly
         asked to.
         
         As an example, when asked to change the background, it also
         completely changed the bear (it has the same shirt, but the fur
         and face are clearly different). Also, when it turned the bear
         into a balloon, it changed the background (removing the pavement)
         and lost the left seed in the watermelon.
         
         Is this something that can be fixed with better prompting, or is
         it a limitation of the model/architecture?
       
          godelski wrote 5 hours 34 min ago:
           > Is this something that can be fixed with better prompting, or
           is it a limitation of the model/architecture?
          
          Both. You can get better results through better prompting but the
          root cause of this is a limitation of the architecture and training
          methods (which are coupled).
       
        skybrian wrote 10 hours 26 min ago:
         I tried the obligatory pelican riding a bicycle (as an image, not
         SVG) and some accordion images. It has a bit of trouble with
         fingers and with getting the black keys right. It’s fairly fast.
        
   URI  [1]: https://chat.qwen.ai/s/0f9d558c-2108-4350-98fb-6ee87065d587?fe...
       
        rickydroll wrote 10 hours 37 min ago:
        To my eyes, all these images hit the uncanny valley. All the colors and
        the shadows are just off.
       
          poly2it wrote 6 hours 17 min ago:
          They are all really sloppy. I don't really see the use case for this
          sort of output outside of research.
       
        frotaur wrote 10 hours 47 min ago:
         Does anybody know if there is a technical report for this, or for
         other models that generate images in a similar way? I'd really
         like to understand the architecture behind 4o-like image gen.
       
        aredox wrote 10 hours 50 min ago:
         I don't think these words mean what they think they do...
       
        rushingcreek wrote 11 hours 8 min ago:
        It doesn't seem to have open weights, which is unfortunate. One of
        Qwen's strengths historically has been their open-weights strategy, and
        it would have been great to have a true open-weights competitor to 4o's
        autoregressive image gen. There are so many interesting research
        directions that are only possible if we can get access to the weights.
        
        If Qwen is concerned about recouping its development costs, I suggest
        looking at BFL's Flux Kontext Dev release from the other day as a
        model: let researchers and individuals get the weights for free and let
        startups pay for a reasonably-priced license for commercial use.
       
          dheera wrote 8 hours 35 min ago:
          > One of Qwen's strengths historically has been their open-weights
          strategy
          
          > let researchers and individuals get the weights for free and let
          startups pay for a reasonably-priced license for commercial use
          
          I'm personally doubtful companies can recoup tens of millions of
          dollars in investment, GPU hours, and engineering salaries from image
          generation fees.
       
          echelon wrote 9 hours 52 min ago:
           The era of open weights from China appears to be over, for some
           reason. It happened all of a sudden and seems to be coordinated.
          
          Alibaba just shut off the Qwen releases
          
          Tencent just shut off the Hunyuan releases
          
          Bytedance just released Seedream, but it's closed
          
           It seems like it's over.
          
          They're still clearly training on Western outputs, though.
          
          I still suspect that the strategic thing to do would be to become
          100% open and sell infra/service.
       
            amelius wrote 1 hour 57 min ago:
            era -> fluke
       
            jacooper wrote 6 hours 19 min ago:
             Deepseek R1 0528, the flagship Chinese model, is open source.
             Qwen3 is open source.
             HiDream models are also open source.
       
            logicchains wrote 9 hours 34 min ago:
             What do you mean, Tencent just shut off the Hunyuan releases?
             There was another open-weights release just today: [1]. And
             the latest Qwen and DeepSeek open-weight releases were under 2
             months ago; there hasn't been enough time for them to finish a
             new version since then.
            
   URI      [1]: https://huggingface.co/tencent/Hunyuan-A13B-Instruct
       
              echelon wrote 6 hours 59 min ago:
              Hunyuan Image 2.0 and Hunyuan 3D 2.5 are not being released.
              They're being put into a closed source web-based offering.
       
            natrys wrote 9 hours 34 min ago:
            > Alibaba just shut off the Qwen releases
            
             Alibaba has from the beginning had some series of models that
             are always closed-weights (*-max, *-plus, *-turbo, etc., but
             also QvQ). It's not a new development, nor does it prevent
             their open models. And the VL models are opened up 2-3 months
             after GA in the API.
            
            > Tencent just shut off the Hunyuan releases
            
            Literally released one today:
            
   URI      [1]: https://huggingface.co/tencent/Hunyuan-A13B-Instruct
       
              echelon wrote 7 hours 0 min ago:
              Hunyuan Image 2.0, which is of Flux quality but has ~20
              milliseconds of inference time, is being withheld.
              
              Hunyuan 3D 2.5, which is an order of magnitude better than
              Hunyuan 3D 2.1, is also being withheld.
              
              I suspect that now that they feel these models are superior to
              Western releases in several categories, they no longer have a
              need to release these weights.
       
                natrys wrote 5 hours 57 min ago:
                > I suspect that now that they feel these models are superior
                to Western releases in several categories, they no longer have
                a need to release these weights.
                
                 Yes, that I can totally believe. Standard corporate
                 behaviour (Chinese or otherwise).
                 
                 I do think DeepSeek would be an exception to this, though.
                 But they lack diversity in focus (not even multimodal
                 yet).
       
            pxc wrote 9 hours 49 min ago:
            Why? And can we really say that already? Wasn't the Qwen3 release
            still very recent?
       
          diggan wrote 9 hours 56 min ago:
          > One of Qwen's strengths historically has been their open-weights
          strategy [...] let researchers and individuals get the weights for
          free and let startups pay for a reasonably-priced license for
          commercial use.
          
          But if you're suggesting they should do open weights, doesn't that
          mean people should be able to use it freely?
          
          You're effectively suggesting "trial-weights", "shareware-weights",
          "academic-weights" or something like that rather than "open weights",
          which to me would make it seem like you can use them for whatever you
          want, just like with "open source" software. But if it misses a large
          part of what makes "open source" open source, like "use it for
          whatever you want", then it kind of gives the wrong idea.
       
            rushingcreek wrote 9 hours 52 min ago:
            I am personally in favor of true open source (e.g. Apache 2
             license), but the reality is that these models are expensive
             to develop, and many developers are choosing not to release
             their model weights at all.
            
            I think that releasing the weights openly but with this type of
            dual-license (hence open weights, but not true open source) is an
            acceptable tradeoff to get more model developers to release models
            openly.
       
              diggan wrote 9 hours 9 min ago:
              > but the reality is that these model are expensive to develop
              and many developers are choosing not to release their model
              weights at all.
              
              But isn't that true for software too? Software is expensive to
              develop, and lots of developers/companies are choosing not to
              make their code public for free. Does that mean you also feel
              like it would be OK to call software "open source" although it
              doesn't allow usage for any purpose? That would then lead to more
              "open source" software being released, at least for individuals
              and researchers?
       
                hmottestad wrote 5 hours 42 min ago:
                I wouldn't equate model weights with source code. You can run
                software on your own machine without source code, but you can't
                run an LLM on your own machine without model weights.
                
                 Though you could still sell the model weights for local
                 use. I'm not sure we're at the point yet where I myself
                 could buy model weights, but of course if you are a very
                 big company or a very big country, then I guess most AI
                 companies would consider selling you their model weights
                 so you can run them on your own machine.
       
                rushingcreek wrote 8 hours 35 min ago:
                 Yes, I think the same analogy applies. Given a binary
                 choice between a developer not releasing any code at all
                 and releasing code under this type of dual "open-code"
                 license, I'd always take the latter.
       
                  diggan wrote 8 hours 22 min ago:
                  > Given a binary choice of a developer not releasing any code
                  at all
                  
                  I mean it wasn't binary earlier, it was "to get more model
                  developers to release", so not a binary choice, but a
                  gradient I suppose. Would you still make the same call for
                  software as you do for ML models and weights?
       
          Jackson__ wrote 9 hours 58 min ago:
           It's also very clearly trained on OAI outputs, which you can
           tell from the orange tint of the images[0]. Did they even
           attempt to come up with their own data?
           
           So it is trained off OAI, as closed off as OAI, and most
           importantly: worse than OAI. What a bizarre strategy to
           gate-keep this behind an API.
          
          [0] [1]
          
   URI    [1]: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VLo/cas...
   URI    [2]: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VLo/cas...
   URI    [3]: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VLo/cas...
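           
           If you want to check the tint claim numerically instead of by
           eye, here's a quick sketch (the channel-balance heuristic is
           mine; point it at any saved output):
           
             from PIL import Image
             import numpy as np
             
             def channel_balance(path):
                 rgb = np.asarray(Image.open(path).convert("RGB"), float)
                 means = rgb.reshape(-1, 3).mean(axis=0)  # mean R, G, B
                 return means / means.sum()               # normalized
             
             r, g, b = channel_balance("generated.png")
             # a warm/orange cast shows up as R well above B
             print(f"R={r:.3f} G={g:.3f} B={b:.3f} bias={r - b:+.3f}")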
       
            roenxi wrote 1 hour 38 min ago:
             There seem to be a lot of AI images on the web these days, and
             AI output might have become the single most dominant style,
             given that AI has created more images than any individual
             human artist. So they might have trained on AI images
             implicitly rather than in a deliberately synthetic way.
             
             Although theory is not practice: if I were an AI company, I'd
             try to leverage other AI companies' APIs.
       
            VladVladikoff wrote 8 hours 19 min ago:
            What would be the approximate cost of doing this? How many million
            API requests must be made? How many tokens in total?
       
              refulgentis wrote 6 hours 42 min ago:
               The most pedantically correct answer is "mu", because the
               answers are both derivable quantitatively from "How many
               images do you want to train on?", which is answered by a
               qualitative question that doesn't admit numbers ("How high
               quality do you want it to be?").
               
               Let's say it's 100 images because you're doing a quick LoRA.
               That'd be about $5.00 at medium quality (~$0.05/image) or $1
               at low (~$0.01/image).
              
               Let's say you're training a standalone image model. The OOM
               of input images is ~1B, so $10M at low and $50M at medium.
               
               250 tokens/image for low, ~1000 for medium, which gets us
               to:
               
               Fastest LoRA? $1-$5. 25,000 - 100,000 tokens output.
               All the training data for a new image model? $10M-$50M,
               250B - 1T tokens out.
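               
               Or, the same envelope math as a script (the per-image prices
               are the assumptions above, not official numbers):
               
                 PRICE = {"low": 0.01, "medium": 0.05}  # $/image, assumed
                 TOKENS = {"low": 250, "medium": 1000}  # tokens/image
                 
                 def cost(n_images, quality):
                     return (n_images * PRICE[quality],
                             n_images * TOKENS[quality])
                 
                 for n, job in [(100, "quick LoRA"),
                                (10**9, "image model")]:
                     for q in ("low", "medium"):
                         dollars, toks = cost(n, q)
                         print(f"{job} @ {q}: ${dollars:,.0f}, "
                               f"{toks:,} tokens")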
       
            vachina wrote 9 hours 34 min ago:
             Huh, so orange tint = OpenAI output? Maybe their training
             process ended up causing the model to prefer that color
             balance.
       
              Jackson__ wrote 8 hours 39 min ago:
               Here's an extreme example that shows how it continually adds
               more orange: [1] It's really too close to be anything but a
               model trained on these outputs; the whole vibe just screams
               OAI.
              
   URI        [1]: https://old.reddit.com/r/ChatGPT/comments/1kawcng/i_went...
       
                acheong08 wrote 5 hours 59 min ago:
                 That form of collapse might just be inherent to the
                 methodology. Releasing the weights would be nice so people
                 could figure out why.
       
            echelon wrote 9 hours 47 min ago:
            The way they win is to be open. I don't get why China is shutting
            down open source. It was a knife at the jugular of US tech
            dominance.
            
            Both Alibaba and Tencent championed open source (Qwen family of
            models, Hunyuan family of models), but now they've shut off the
            releases.
            
             There's totally a play where models become a loss leader for
             SaaS/PaaS/IaaS and where they extinguish your closed
             competition.
             
             Imagine spreading your model so widely and then making the
             terms: "do not use in conjunction with closed-source models".
       
              rfv6723 wrote 17 min ago:
               If you have worked or lived in China, you will know that the
               Chinese open-source software industry is a total shitshow.
               
               The law in China offers little protection for open-source
               software. Lots of companies use open-source code in
               production without a proper license, and there are no
               consequences.
               
               Western internet influencers hype up the Chinese open-source
               software industry for clicks while Chinese open-source
               developers are struggling.
               
               These open-weight model series were planned as free trials
               from the start; there is no commitment to open source.
       
              yorwba wrote 7 hours 37 min ago:
              The problem with giving away weights for free while also offering
              a hosted API is that once the weights are out there, anyone else
              can also offer it as a hosted API with similar operating costs,
              but only the releasing company had the initial capital outlay of
              training the model. So everyone else is more profitable! That's
              not a good business strategy.
              
              New entrants may keep releasing weights as a marketing strategy
              to gain name recognition, but once they have established
              themselves (and investors start getting antsy about ROI) making
              subsequent releases closed is the logical next step.
       
                roenxi wrote 1 hour 34 min ago:
                That is also how open source works in other contexts. Initially
                closed source is dominant, then over time other market entrants
                use OSS solutions to break down the incumbent advantage.
                
                 In this case I'm expecting people with huge pools of
                 capital (the big cloud providers) to push out open models,
                 because the weights are a commodity, and then people will
                 rent their servers to multiply them together.
       
              diggan wrote 9 hours 19 min ago:
              > I don't get why China is shutting down open source [...] now
              they've shut off the releases
              
              What are you talking about? Feels like a very strong claim
              considering there are ongoing weight releases, wasn't there one
              just today or yesterday from a Chinese company?
       
       