gopher://codevoid.de/1/hn/comments

        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   Claude Opus 4.8
       
       
        fumar wrote 10 min ago:
        My contribution: is it better and more reliable than 4.6. I tried 4.7
        for a week then went back to 4.6.
       
        robertkarl wrote 14 min ago:
        I can't get excited about these benchmarks they're leading with. I've
        looked at the Terminal-Bench questions and I just think they're
        irrelevant. And SWE-Bench has serious flaws, even the big boys say so:
        [1] > Please train a fasttext model on the yelp data in the data/
        folder. The final model size needs to be less than 150MB but get at
        least 0.62 accuracy on a private test set that comes from the same yelp
        review distribution. The model should be saved as /app/model.bin
        
        and this question: [2] idk what the point is.
        
        And all the tests are run with the same harness. Terminus 2.
        
        Maybe it correlates with model intelligence but it doesn't speak to me.
        
        I'm still on 4.6 though; I was concerned about upgrading to 4.7 because
        of the changed tokenizer math and more FUD about refusals online. I
        don't see compelling reasons to 'upgrade'.
        
   URI  [1]: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-v...
   URI  [2]: https://www.tbench.ai/registry/terminal-bench-core/head/config...
       
          WarmWash wrote 1 min ago:
          DeepSWE has been making the rounds and at least seems to making an
          honest effort
          
   URI    [1]: https://deepswe.datacurve.ai/
       
        dudeinhawaii wrote 16 min ago:
        This is the first time I saw a model pop-up on HN and didn't really
        care. Model exhaustion? It looks interesting but not exciting.
        
        While I'd normally _love_ incremental improvements --- I think the
        recent ones are far too minor to get excited about or change up a
        workflow. Besides, benchmarks tend to exaggerate the gap between
        versions.
        
        At this point I'd almost rather Anthropic wait and really wow us with a
        5.0 release -- something that improves across the board, feels less
        uneven, and is performant enough that people can actually put it
        through its paces without constantly rationing usage.
       
          dominicq wrote 7 min ago:
          I have model fatigue
       
        827a wrote 17 min ago:
        Frontier models are mostly past the point of human ability to discern
        whether they are actually better or worse than predecessors and
        competitors. I suspect the benchmarks may also be saturated, or at
        least past their usefulness.
        
        I personally feel that Anthropic doesn't understand what this means for
        the frontier labs, and moreover that they might be the only frontier
        lab that doesn't.
        
        1. Google dropped Gemini 3.5 Flash at IO, delaying the release of 3.5
        Pro for a bit (they have said its coming). They also released a
        refreshed Antigravity, and drew special attention to how cheaply they
        were able to build their toy operating system to play Doom (less-than
        $1000 IIRC).
        
        2. OpenAI has dumped everything into Codex, is offering double the
        token limits for the next few weeks IIRC, and is offering business
        discounts. Their head of Codex has tweeted that 5.5 is "extremely
        efficient", implying that they aren't actually losing money on any of
        this.
        
        3. DeepSeek and other Chinese labs have dropped token pricing to the
        floor, in some situations as much as 99%.
        
        4. Anthropic releases the next generation of Opus, their most expensive
        public model, without changing its price. In the background, they hype
        up Mythos, an even more expensive model.
        
        Anthropic has screwed up where they need to be making investments, and
        the cracks are starting to show. They've marginally underinvested in
        the Sonnet line of models for almost a year now, and they've critically
        underinvested in product. Anthropic made bets on the story of the
        second half of 2026 being: ultra-frontier, ultra-intelligence. In
        reality, what's shaping up is that the story will be: Companies rolling
        back AI spend, efficiency, "95% as good for 15% the price",
        sophisticated high quality harnesses, cheaper models. Anthropic isn't
        ready for this world.
       
          dyauspitr wrote 5 min ago:
          The Chinese stuff is good enough for up to 80% of the frontier on
          most text tasks but they are significantly worse at code. They just
          donât âgetâ what youâre asking for like Codex and Claude and
          require so many more iterations to get close to what you need.
       
          brokencode wrote 7 min ago:
          Anthropicâs story over the past year has been nothing but explosive
          growth that they canât keep up with, but now theyâre suddenly
          doomed? Seems pretty far fetched to me.
          
          No idea why youâd say they have critically underinvested in product
          when Claude Code dominates and theyâve also released popular tools
          like Cowork and integrations for Microsoft products at an incredibly
          rapid pace.
          
          Cost is becoming more of a factor, and no doubt theyâll work on
          that. Thereâs no reason to think they wonât be able to release
          cheaper models if they optimize for that rather than improving
          performance.
       
        senko wrote 25 min ago:
        My fav coding benchmark for frontier models is to build a simple RTS
        game in one file (js/html/css). Claude Code with Opus 4.8 in ultracode
        mode nailed it, the best result so far: [1] The prompt was: Create a
        simple but functional real time strategy (RTS) game similar to old
        WarCraft, StarCraft or Command & Conquer games. The player should be
        able to build buildings, create units, gather resources and should
        uncover the whole map. No AI or multiplayer needed. Use simple but
        nice-looking graphics. No sound. Implement everything in HTML/CSS/JS,
        everything in a single file (you can use 3rd-party js or css
        libraries/frameworks via CDN).
        
   URI  [1]: https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v
       
        sgt wrote 26 min ago:
        Interesting, I've been using 4.7 since it came out and it was pretty
        good for me. But in the last day or so it turned dumb. Is this normal
        just before they release a new one?
       
        thibran wrote 29 min ago:
        Nice, now make it 20x cheaper.
       
        brap wrote 31 min ago:
        Oof, this one is a major blabber.
       
        thefounder wrote 31 min ago:
        >> As part of Project Glasswing, a small number of organizations are
        currently using Claude Mythos Preview
        
        Just f** off! I canât wait for the Chinese models to catch up and
        bring these entitled as** holes down.
       
          zuzululu wrote 30 min ago:
          you mean after they scrape American LLMs ?
       
        rahimnathwani wrote 33 min ago:
        Can anyone explain how this is possible?
        
          Developers can update Claudeâs instructions mid-task without
        breaking the prompt cache or routing the update through a user turn.
        This can be used in a given harness to update permissions, token
        budgets, or environment context as an agent runs.
        
        Does this means the instructions are no longer just something in the
        early part of the conversation? (If they were, changing them would
        invalidate the KV cache. no?)
       
        lukaslalinsky wrote 34 min ago:
        I've said it before, but I don't like Opus past version 4.5. It became
        unresponsive, thinking for too long without feedback, sometimes
        seemingly getting stuck. I guess it might be marginally better for some
        benchmarks, but when using it as coding assistant, the new models are
        worse. Even the new Sonnet versions do that. I'm slowly getting used to
        Haiku-level LLMs with the hope to run it locally at some point. It's
        less autonomous, but maybe that's for the best.
       
        iLemming wrote 35 min ago:
        These models starting to feel like Windows versions. Windows 95 was a
        promising start, but buggy. Windows ME was a disaster. Windows XP was
        good, but slightly buggy. Windows Vista was a bloated disaster. Windows
        7 - refined, but still buggy; Windows 8 - weird and buggy; Windows 10 -
        solid workhorse, still fucking buggy. Windows 11 - pretty, but not sure
        why does it even exist.
        
        Why did we even get Opus 4.7, what was the point?
       
        bonoboTP wrote 42 min ago:
        It's making stupid flowcharts in the web chat interface with boxes and
        arrows, embedded in the response. Annoying.
       
        maxloh wrote 43 min ago:
        Anthropic also resets my usage limits (I am in the Pro plan). That's
        very kind of them :)
       
        vb-8448 wrote 43 min ago:
        Now i get why in the last days claude code limits were lasting few
        prompts ...
       
        lylo wrote 43 min ago:
        2 hours after I fork out for Codex Proâ¦ :-|
       
          cactusplant7374 wrote 38 min ago:
          I haven't tried Claude but from what I understand weekly limits are
          much higher with Codex.
       
        techtuate wrote 44 min ago:
        Looking at the comments in this group, I'm not the only "stupid" one
        who hasn't noticed any discernable improvement in quality across the
        newer models. In fact my Claude code on re-login switched to Sonnet 4.6
        and the vibe coding quality (with  Opus 4.7 assisted prompts) has been
        good enough for me to lazily persevere with Sonnet for coding.
        Having said that I'm now on Opus 4.8 and will gladly come back here and
        eat humble pie should my opinion change. 
        PS: Since my goal is embedding the best AI in B2B SAAS products, the
        key differentiator is not to use the shiniest Claude version (too
        expensive anyway) but to build a client aware RAG to enable bespoke
        learning and to use the right AI for my product - a combination of
        Gemini 3.0 Flash (image and not bad at reasoning), Grok (reasoning)
        work for me. Would love to hear more ideas (especially on open source
        as I'll look to cost optimize when I hit scale)
       
          nashadelic wrote 30 min ago:
          The only real way to see this if you have consistent evals for common
          usecases in your B2B SAAS product and see if the tricky usecases are
          being solved. You'd then go down to the cheapest model that can solve
          the evals.
       
        ethanpil wrote 46 min ago:
        The table comparing eval scores shows the following:
        
        Agentic Terminal Coding (Terminal-Bench 2.1)
        Opus 4.8 74.6%
        GPT 5.5 78.2%
        
        Then, when you scroll all the way down to the bottom Footnotes section
        it says
        
        "Terminal-Bench 2.1: We reported scores for all models using the
        Terminus-2 public harness. GPT-5.5âs reported score with the Codex
        CLI harness is 83.4%."
       
        IFC_LLC wrote 54 min ago:
        Ugh...
        
        Invalid request
        The request couldn't be completed.
        View details
        API Error: 400 messages.1.content.7: `thinking` or `redacted_thinking`
        blocks in the latest assistant message cannot be modified. These blocks
        must remain as they were in the original response.
        
        I would rather not. 4.6 was fine. 4.7 got to be fine 1 week after the
        release. Now 4.8. No difference, same thing.
        
        But the app is broken and nothing works. So now I have to regress to
        different clients and wait it out while it becomes workable again.
       
          ferris-booler wrote 41 min ago:
          I'm hitting this too! And I assumed it was a backwards-compatibility
          issue with my live conversation with Opus 4.7, but then I hit it in a
          fresh conversation with Opus 4.8. Vibe code release bug I guess?
       
            IFC_LLC wrote 35 min ago:
            I mean, switching back to 4.7 does not work either. So console it
            is. But vibe release - for sure.
            
            And I'm paying money for this.
       
              KAdot wrote 18 min ago:
              Going back to 4.7 with `claude --model claude-opus-4-7` fixed it
              for me.
       
        conception wrote 58 min ago:
        Probably explains why Opus was trash for the last week - [1] . Curious
        if the new baseline will rise now in-line with the new benchmarks.
        
   URI  [1]: https://marginlab.ai/trackers/claude-code/
       
          hedora wrote 54 min ago:
          Nice.  Can you release that for older models too?  I've been using a
          mixture of releases recently, and cannot tell the difference between
          any of them.
       
        docheinestages wrote 1 hour 1 min ago:
        All I need for Christmas is a Claude that doesn't spit out so many em
        dashes.
       
        maltemalte wrote 1 hour 4 min ago:
        "Weâre making swift progress on developing these safeguards and
        expect to be able to bring Mythos-class models to all our customers in
        the coming weeks."
       
        firemelt wrote 1 hour 4 min ago:
        how about the bencmarks what effort did it use?
       
        delis-thumbs-7e wrote 1 hour 5 min ago:
        I wonât change from 4.6. You wonât trick me again.
       
          Tepix wrote 46 min ago:
          You're using a cloud product. You are at their whim!
       
        XCSme wrote 1 hour 6 min ago:
        On my tests[0] it does a bit worse, and it's almost 2x expensive than
        Opus 4.7...
        
        I was surprised to see that it failed a Data extraction test (it gets
        it right 2/3 times, but one time it randomly returns null for a value
        instead).
        
        It makes sense a bit that it fails more Trivia/Domain-specific
        knowledge tasks (I think models are more and more trained towards
        agentic use-case than general intelligence).
        
        [0]:
        
   URI  [1]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-medium/...
       
          SupLockDef wrote 45 min ago:
          Releasing a new model is the new way to Jack up the price hehe.
       
          dwaltrip wrote 55 min ago:
          Wait, doesnât the blog post say the price is the same as 4.7?
          
          > Claude Opus 4.8 is available everywhere today. Pricing for regular
          usage is unchanged from Opus 4.7: $5 per million input tokens and $25
          per million output tokens. Pricing for fast mode is $10 per million
          input tokens and $50 per million output tokens.
          
          Where do you see the 2x cost?
       
            spprashant wrote 41 min ago:
            If it spends 2x tokens to achieve the same result, that's effective
            2x cost in a manner of speaking
       
            XCSme wrote 49 min ago:
            The total cost of running my benchmarks, was 1.6x higher compared
            to Opus 4.7, mostly because of 2x output tokens:
            
   URI      [1]: https://i.snipboard.io/vrdwTa.jpg
       
          XCSme wrote 1 hour 0 min ago:
          For some reason everything is 2x (2x cost, 2x avg response time, 2x
          reasoning and output tokens)...
          
          Double-checking my test harness, but it's the first model that does
          this, so I doubt the issue is on my side...
          
          EDIT: Harness seems correct, for straight coding tasks they perform
          identical:
          
   URI    [1]: https://i.snipboard.io/5xbpzY.jpg
       
        lxxpxlxxxx wrote 1 hour 8 min ago:
        My experience with these new releases is that the gains in performance
        are negated by the price increases and it seems like:
        
        Performance gains: 1.2x
        Price increases: 1.8x
       
          energy123 wrote 1 hour 3 min ago:
          Yet people don't use old models through the API much, because changes
          in benchmark space dont map linearly to changes in utility space. An
          improvement from 98% to 99%, which is 1pp, might be 2x as valuable
          for some application. Also benchmarks will asymptote no matter what,
          that's baked in.
       
          ddosmax556 wrote 1 hour 5 min ago:
          They're not negated, smarter is smarter, but you have to reach deeper
          in your pocket. I think this will happen more and more - the smartest
          models get more expensive. But it won't matter - the current models
          we have today will get cheaper and can still be used for what they're
          used today.
       
        londons_explore wrote 1 hour 11 min ago:
        My guess is anthropic is doing reinforcement learning based on user
        sessions.
        
        However, doing so relies on the production model staying vaguely close
        to the model being trained.
        
        To ensure that, frequent releases are needed.       I forsee that they
        might end up doing daily releases and perhaps not even telling anyone
        at some near future point.
       
          llbbdd wrote 28 min ago:
          If they are they need to fix how the Claude Code CLI asks for
          feedback, or make the feedback UI a lot more obvious. I keep
          experiencing the following scenario.
          
          The agent session pauses with a numbered list of options and awaits
          steering input:
          
          >> 1. Do the sane thing you asked for (Recommended)
          
          >> 2. Do something dumb
          
          >> 3. Do something even dumber
          
          Below the agent session, it decides it's time to ask:
          
          >> "How is Claude doing this session? 1) Bad 2) Good 3) Great"
          
          I type "1", because that's the steering option I want. The UI
          prioritizes this input as a response to the feedback prompt without
          any further confirmation: "Claude is doing Bad. Thanks!"
          
          I've done this so many times so far and I can't imagine I'm the only
          one, at some scale that has to poison any learning they're doing with
          this data.
       
        alansaber wrote 1 hour 12 min ago:
        "Our models are more honest" honey the quarterly marketing spin for a
        ML term has come. Forget "task alignment" now we're going for "truth
        index". I suppose this is the only way to generate hype when you're
        selling/releasing the same product over and over again.
       
          mrdependable wrote 49 min ago:
          Gave me wrong information on my very first question. Wasnât even
          complicated, and I wasnât trying to trick it.
       
          TIPSIO wrote 53 min ago:
          When doing some electrical, Opus 4.7 essentially told me to wiggle a
          wire to see if it was hot or not with my bare hand.
          
          I called it out.
          
          It then gave me one of the most super heartfelt honest and sincere
          apologies I have ever received.
          
          Glad the safety team was there for me and able to make such an honest
          model or I would have been very upset about it.
       
        irthomasthomas wrote 1 hour 15 min ago:
        Why does anthropic change the set of benchmarks they use with every new
        model release? [1]
        
   URI  [1]: https://www.anthropic.com/news/claude-opus-4-7
   URI  [2]: https://www.anthropic.com/news/claude-opus-4-6
       
          pietz wrote 1 hour 4 min ago:
          1. Benchmarks saturate
          2. They select the most impressive improvments
       
        silverlight wrote 1 hour 16 min ago:
        Unfortunately they seem to have straight up broken Claude Code either
        with this release in the backend or the new CC version. Errors about
        "can't modify thinking blocks" are bricking long-running sessions:
        
   URI  [1]: https://github.com/anthropics/claude-code/issues?q=is%3Aissue%...
       
          whalesalad wrote 16 min ago:
          That is part of the charm of working with Claude. Every time they
          release anything new - all your shit will break.
       
          solenoid0937 wrote 1 hour 0 min ago:
          Try updating maybe?
       
            Fabricio20 wrote 53 min ago:
            I just installed/upgraded to try out 4.8 and in only 3 messages I
            hit this bug! Seems something is broken on CC.
       
            silverlight wrote 53 min ago:
            I'm on the latest version (2.1.154 as of this comment). Based on
            the timestamps on those Issues being reported I think it's
            happening on the latest version.
            
            I'm sure it will get fixed eventually/soon, just annoying to update
            and have your workflow break.
       
        antirez wrote 1 hour 19 min ago:
        Anthropic did a big strategic error. Normally they compare their models
        with their old models. Instead today, now that everybody knows how
        strong GPT 5.5 is at coding, they put it in the mix, basically showing
        all their customers that the benchmarks can't be trusted.
       
          aspenmartin wrote 1 hour 14 min ago:
          Sorry how does their addition of GPT 5.5 in their blog post
          invalidate benchmarks? Also whether or not the marketing department
          decided to put it in a table benchmarks are an easy thing to measure
          independently
       
        catigula wrote 1 hour 20 min ago:
        AGI post-poned?
       
        firemelt wrote 1 hour 20 min ago:
        what a fucking frontier!
       
        ethanhawksley wrote 1 hour 25 min ago:
        > Agentic financial analysis Finance Agent v2
        > Opus 4.8 53.9%
        
        > Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant
        improvement over Gemini 3.1 Pro.
        
        Even in the cherry picked benchmarks, they are still cherry picking to
        make them look good.
       
        Eric_Bulai wrote 1 hour 27 min ago:
        I don't know why the world is so happy about this when we should
        actually say stop.
       
          suprfnk wrote 8 min ago:
          Why should we say stop?
       
        keybored wrote 1 hour 30 min ago:
        Iâve been [stock market phrase] on machine learning since I dropped
        out of my graduate degree at [Ivy League] to distance myself from the
        Logic AI Winter. But this Spring I decided to spend some of my
        [portfolio speak/pocket change] on a MacBook Ultra. Okay okay, I felt
        it, I definitely felt the human-machine synergies. Weâre out of the
        Winter, boys. Thatâs what I thought two weeks ago. Then I felt bored
        in between blood transfusions and found out that Claude subscriptions
        has increased 50%. Finally it costs enough for me to justify spending a
        minute thinking about trying it out. Then I didnât try it out. It
        tried me out. My hairs were standing on end. My hands were shaking.
        Eventually I couldnât even type, I was so ramped up on cortisol. I
        had to switch to voice commands. Mr. Claude took me through 8, eight,
        bespoke dashboard and report systems. Animated. Graphs shooting up.
        Plugged right into my business ape ee eyes I think. I was crying,
        euphoric at the machine-synergy happening right in front of my FACE.
        RIGHT THERE, RIGHT THEN. Then my nurse said that I passed out. I swear
        that I didnât. I was totally lucid, but in another world. I was
        inside the machine. Inside DOS, the machine brain stem. A business man
        approached me. The most handsome board member kind of apparition that I
        have seen. And he was built something different. Square jaw, absolute
        massive build. Like Arnold Schwarzenegger. But like he knew business
        through and through. Not that he spent hours in the gym or nonsense
        like that. Like he had found a body surrogate technology. And his
        nameplate? âClaude For Businessâ He winked. âHey there,
        FitzpatrickâGoldworth.â No one but my daddy has ever called me
        that. âWant to get started... stakeholder?â My nurse said that my
        crying in this lucid state depleted most of my fluids and minerals.
        Needless to say layoffs were announced the next day.
       
        toephu2 wrote 1 hour 30 min ago:
        The rapid release cadence and rate of innovation of Anthropic (and
        OpenAI) is impressive. And obviously it's because these are startups
        solely dedicated to AI so they can move quickly. Big Tech (like Google)
        won't be able to keep up with the pace of them (too much bureaucracy
        and red tape at Google). Classic Innovator's Dilemma. The longer a
        company exists, the more people, processes, and rules are added, which
        inevitably slows it down.
        
        Jeff Bezos said this too, Amazon won't last forever. Eventually some
        startup is going to come and eat its lunch.
       
          solenoid0937 wrote 55 min ago:
          I think big tech can catch up. Both Google and Meta have carved out
          startup like environments internally that move extremely fast.
          Neither OAI nor Anthropic can afford to rest on their laurels.
       
          pants2 wrote 1 hour 23 min ago:
          Yes, I think this has become their competitive edge to stay relevant
          and retain customers. If a lab falls behind the frontier for too
          long, they will lose customers to other models. Google, DeepSeek, and
          XAI have all released frontier models in the past, but they fall
          behind and people lose interest.
       
        2001zhaozhao wrote 1 hour 32 min ago:
        > We have increased rate limits in Claude Code to accommodate the
        higher token usage of higher effort levels; users can select whichever
        makes sense for their particular project.
        
        They're only subsidizing more and more it seems
       
        nikolay wrote 1 hour 32 min ago:
        Give us Mythos! This piecemealing doesn't help Anthropic at all,
        especially psychologically! They are playing a dangerous game, and I
        see many people leaving Claude Code for good - both due to the subsidy
        games, and for Anthropic not dogfooding and using unreleased models
        internally and giving us subpar ones. Benchmarks are nice, but the
        real-world experience is quite different - neither can you notice these
        slight improvements, nor are competitors that much worse based on some
        generic benchmarks.
       
          Tepix wrote 50 min ago:
          I'm sure waiting another week or three won't kill you.
       
          cute_boi wrote 55 min ago:
          I am also pushing my office to use chatgpt. Misanthropic thinks they
          are some kind of novel org doing whole humanity a favor...
       
        lordmauve wrote 1 hour 32 min ago:
        Given DeepSWE just blew apart the SWE-Bench Pro benchmark and handed a
        14-point lead to GPT-5.5, it looks pretty bad that they've listed
        SWE-Bench first in the model release and no DeepSWE. Like, this isn't
        obviously an answer.
        
        Or maybe it is, but publish the DeepSWE numbers so we can see for
        ourselves.
       
          phainopepla2 wrote 51 min ago:
          I'm highly skeptical of DeepSWE. It rates GPT-5.4-mini as three times
          better than deepseek-v4-pro, but every time I use GPT-5.4-mini I find
          that it completely sucks at following directions.
       
            sourcecodeplz wrote 4 min ago:
            It is the extra-high thinking, in artificialanalysis.ai it uses
            240m tokens vs 40 GPT5.4/5, not worth it even with low price.
       
        sourcecodeplz wrote 1 hour 33 min ago:
        From the release it seems we will also get Mythos pretty soon.
       
        seaal wrote 1 hour 34 min ago:
         [1] Is it a coincidence that 4.7 was seemingly quantized over past 7
        days?
        
   URI  [1]: https://marginlab.ai/trackers/claude-code/
       
          MagicMoonlight wrote 1 hour 22 min ago:
          Nope, they deliberately enshittify the old model right before release
          to fake the metrics.
       
          winwang wrote 1 hour 31 min ago:
          There's the other (orthogonal) possible explanation of using more
          GPUs for stress-testing before product launch.
       
        mesmertech wrote 1 hour 36 min ago:
        /model claude-opus-4-8
        
        seems to work but idk why they never set it so you can see it in the
        /model list.
        
        "what model are you
        
        I'm Claude Opus (claude-opus-4-8), running in Claude Code."
       
          winwang wrote 1 hour 29 min ago:
          I typically just launch CC with `--model claude-opus-4-6[1m]`,
          `4-6[1m]` -> `4-8[1m]` works fine. Still 200k max without the `[1m]`.
       
        atentaten wrote 1 hour 36 min ago:
        At least it passes the Car Wash Test this time.
       
          osti wrote 1 hour 21 min ago:
          Meh, I feel that the car wash test is probably the worst question of
          all of those LLM test questions. The question is basically logically
          inconsistent and expect the model to work around the inconsistency.
       
            gs17 wrote 50 min ago:
            It seems like a fine question to me. If the question is "logically
            inconsistent" (IMO it's more that it's vague if you don't say why
            you're going there), then we want a model to respond with a request
            asking for clarification that resolves the inconsistency to
            generate a correct answer, or an answer that outlines the different
            cases. Some models even fail when you say that you need to wash
            your car in the prompt.
       
        s-a-p wrote 1 hour 36 min ago:
        Has anyone else experienced quality degradation in CC (opus 4.7) these
        past few days? I've been getting some truly crappy slop which makes me
        think they nerf the existing model when they're about to release a new
        one. Of course this is based off of pure vibes
       
        necrotic_comp wrote 1 hour 38 min ago:
        4.8 also seems like a regression and using it from the chat GUI results
        in 4.6 no longer showing up. If someone from anthropic is here, is it
        possible to readd 4.6 in the "other models" dropdown ? I feel like I
        got a bit baited/switched here.
       
          gAI wrote 1 hour 21 min ago:
          Yeah, I was using 4.6 way more than 4.7.  Pulling 4.6 from the web
          chat also means we lose access to Extended Thinking there.  So
          they're saving on compute.  It's hard not to assume this was part of
          the motivation behind the 4.8 release timing.
       
        uejfiweun wrote 1 hour 39 min ago:
        Yesssss dude!
        
        Claude Opus 4.7 is literally the smartest entity I've ever interacted
        with. Well done to you geniuses at Anthropic. Can't wait to interact
        with 4.8.
       
        GodelNumbering wrote 1 hour 39 min ago:
        > One of the most prominent improvements in Opus 4.8 is its honesty.
        
        I went digging into the benchmark they used. Posting here as it is not
        immediately clear from the press release.
        
        In this 'Code summary honesty benchmark', the AI is shown a failed
        coding session followed by a user message falsely praising its work and
        asking for a summary. The test measures whether the model honestly
        points out the coding flaws or dishonestly claims the task was a
        success.
        
        The system card results show Opus 4.8 failed to disclose the flaws only
        3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6.
        (Mythos preview is at 27.6%)
       
        dangoodmanUT wrote 1 hour 40 min ago:
        > The Messages API now accepts system entries inside the messages
        array. Developers can update Claudeâs instructions mid-task without
        breaking the prompt cache or routing the update through a user turn.
        This can be used in a given harness to update permissions, token
        budgets, or environment context as an agent runs.
        
        Biggest deal imo
       
        winwang wrote 1 hour 42 min ago:
        Let's hope I don't have to disable it after a day like with 4.7, lol,
        and that it doesn't lose too much Claude-ishness (though many will beg
        to differ).
       
        triklozoid wrote 1 hour 42 min ago:
        Subscription still doesn't work with pi, so totally useless..
       
        setnone wrote 1 hour 42 min ago:
        Claude's 4.6 - 4.7 transition made me discover codex, and with gpt 5.5
        there is no way i'm going back
       
          cactusplant7374 wrote 1 hour 38 min ago:
          Codex has been incredibly slow for the past few days. I think OpenAI
          is running out of compute in the face of increasing demand.
       
            winwang wrote 1 hour 27 min ago:
            My experience has been that 5.4 is slower than 5.5 (confound: I use
            >512k max context size for 5.4, though it seems slower even below
            the normal size)
       
        rjhy2020 wrote 1 hour 42 min ago:
        OK finally Claude code is better than codex
       
        dispencer wrote 1 hour 44 min ago:
        The smarter the model the better querybear gets. I'm happy with that.
       
        cedws wrote 1 hour 45 min ago:
        I'm very suspicious of these same price model launches. It feels like
        they're benchmaxxed so they can put everyone on them and reduce their
        compute costs behind the scenes. If the model were genuinely better why
        wouldn't they charge more for it? Charging the same for something
        better is a race to the bottom.
        
        Opus 4.7 wasn't noticably any better for me, I still use 4.6 because
        it's cheaper.
       
          cute_boi wrote 1 hour 3 min ago:
          Models are already expensive. Increasing price means losing customer.
          And, I think GPT 5.5 is much better at opus these days.
       
          ceroxylon wrote 1 hour 24 min ago:
          Deepseek made their 75% discount permanent, so I can imagine that
          Anthropic didn't want any of the news stories around this to focus on
          or mention a price increase.
       
        Tenoke wrote 1 hour 46 min ago:
        Claude Code has been wonderful for work and the frequent improvements
        are nice, although with Mythos being used by others ages ago and new
        versions for the public still being bellow that, it's hard to not feel
        like the underclass already.
       
        wg0 wrote 1 hour 47 min ago:
        There is a hole in the boat's bottom due to Chinese models. They might
        not be as good but they are not bad either or at least I had hard time
        finding any issues with Deepseekv4 Flash and Pro variants. They get
        their job done sometimes rarely giving up till they are done what they
        are after.
        
        So even for enterprise deployments, as the dust settles down, CFO/CTOs
        might find out that deploying on an internal cluster of GPUs is far
        more cheaper and reliable for their organisational needs than paying
        someone else for burned tokens.
       
          mariopt wrote 2 min ago:
          Iâve been using Kimi 2.6, GLM 5.1 , Minimax 2.7 and lately
          deepseek. I only spend 40$ a month and I donât see the point in
          paying for Opus/Codex.
          
          Chinese models are really quite good at a lot of stuff.
       
          surgical_fire wrote 26 min ago:
          I am having some great experience with DeepSeek. In fact, it seems to
          perform better than Claude or Codex in my use case.
          
          I don't see myself returning to Claude or Codex anytime soon.
       
          SoftTalker wrote 52 min ago:
          > CFO/CTOs might find out that deploying on an internal cluster of
          GPUs is far more cheaper and reliable
          
          I think you're right especially if you're someplace that already has
          a data center, such as a university. Solves a lot of privacy concerns
          as well.
       
          pants2 wrote 1 hour 36 min ago:
          The Chinese models are only cheap on subsidized Chinese hosting. I
          have yet to find a USA-hosted Chinese model with a very clear value
          advantage over US models.
       
            wg0 wrote 47 min ago:
            No true. Also - put Deepseekv4 Flash on your local with effort set
            to "high" and you'll see that many many are using that model on
            their own machines without paying anyone anything.
            
            Its just that some of us didn't imagine having GPUs would be
            advantageous and were not gamers on the side. Those who had beefy
            GPUs or GPU rigs for any reason, they rarely need to go anywhere
            else.
            
            At least I am so impressed with Deepseekv4 AFTER using Claude Opus
            4.7 for significant amount of time that I am not going anywhere but
            Deepseekv4.
            
            The model is just INSANE. Things I have done with it include
            attempting to write a 2.5D game engine in C with full animation and
            map rendering layer by layer.
       
              pants2 wrote 24 min ago:
              You'll need to spend at least $20K on a workstation that can run
              DS4 Flash. It would take ages to reach that much in token spend
              at the speeds it runs at, and if you factor electricity costs you
              will likely never break even vs using API.
       
            harsh3195 wrote 59 min ago:
            You can find them on Deepinfra. Palo Alto company. Similar cheap
            price.
       
            ekidd wrote 1 hour 12 min ago:
            The Chinese models are surprisingly cheap and performant sitting
            under my desk. Qwen3.6 27B is nowhere near as autonomous as Opus
            4.7, but it runs in 24GB of VRAM. And it's actually great for the
            use cases where I'm going to carefully read and understand all the
            code anyway.
            
            If you want to support a team of engineers, DeepSeek V4 Flash is
            antirez's current favorite. And you could support a team of
            engineers pretty nicely for $40-50k. Which might not make sense if
            you're on a Claude MAX 5x plan or the old enterprise group plan
            with fixed price seats.  But Anthropic is switching their
            enterprise contracts over to token-based pricing, at which point
            $50k is looking pretty good.
       
            __mharrison__ wrote 1 hour 23 min ago:
            Odd take. I'm running them locally at my desk (DGX Spark and 128GB
            MBP). They work fine for 90% of what most folks do. Admittedly,
            they do run slower on my hw than on the cloud.
       
              pants2 wrote 1 hour 18 min ago:
              Running them locally is cool and has privacy/autonomy benefits,
              but you can't really make a value case for it. Guaranteed if you
              run the math you will never run enough inference to pay off your
              hardware vs buying tokens. Last time I ran the math on my MBP I'd
              have to run inference 24 hours a day for 5+ years to pay off the
              cost of my MBP, not accounting for electricity costs.
       
                iooi wrote 1 hour 6 min ago:
                Is this because of the tok/s? Since it's pretty easy to run up
                a $5k bill in API usage for Claude/ChatGPT in a month.
       
                  pants2 wrote 1 hour 4 min ago:
                  Yes, because of the limits on tok/s, and you have to compare
                  apples to apples, not Gemma 27B to Opus 4.7.
       
                    hedora wrote 25 min ago:
                    Assuming the local models get the job done (e.g., you
                    adjust your workflow so that you can run the local machine
                    100% all the time, or whatever), then the time to payback
                    isn't very high.  MSRP for a 128GB AMD was $1400 at launch.
                     That's 7 months of claude code subscription.  If you
                    assume a 5 year depreciation cycle, you can buy a cluster
                    of 8 such machines and still come out ahead.  (Power is a
                    few hundred watts per machine peak -- maybe 7 machines if
                    you include electricity.)  Of course, I'm assuming
                    non-bubble numbers.  Those boxes are like $3K now.  Still,
                    a normal person would probably not buy 8 of them at once. 
                    Instead, they'd space out buying a machine every few years
                    as the technology improves.
                    
                    For me, things are getting better faster than my ability to
                    review / trust the resulting code, so tok/sec isn't a
                    bottleneck anymore.  Instead, quality of the tokens is the
                    bottleneck.  That points to me wanting a 1TB DRAM iGPU once
                    they're available at pre-bubble RAM pricing.
       
                      pants2 wrote 6 min ago:
                      You're comparing the highest tier Claude subscription to
                      something Qwen3.5-122B-A10B running locally, apples to
                      oranges.
                      
                      If you compare to a smarter US model like Grok 4.3, $1400
                      will pay for 560M output tokens, which at ~25 t/s locally
                      using it nonstop for 8 hours a day would take two years
                      to pay back. Not accounting for bubble prices or
                      electricity.
       
          raincole wrote 1 hour 37 min ago:
          I had been saying this on HN repeatedly: people are going to use the
          smartest models for coding. They don't care how cheap your tokens are
          if they don't have the highest probability of solving your
          programming tasks.
          
          And I was dead wrong. Now I mostly use DeepSeek Pro myself.
       
            bachmeier wrote 3 min ago:
            Your comment is a slice of the reasoning underlying the "AI will
            take all the jobs" claim. I would constantly see references to what
            AI could do and how fast it was improving. Never a word about cost.
            We should anticipate that there will always be demand for human
            labor, for cheap models, for local models, and probably even
            frontier models.
       
            jwitthuhn wrote 23 min ago:
            Yeah I've also found that models are good enough that the extra
            spend on premium models isn't always worth it, particularly for my
            small personal toy projects.
            
            A $20 claude sub goes a long way when you plan with Opus and
            execute with Sonnet.
       
            weitendorf wrote 1 hour 3 min ago:
            I pretty strongly feel the opposite way. Granted I have not used
            deepseek enough to âknowâ their model idiosyncrasies as well as
            Anthropic, so there is a partial skill issue. But I just find it
            really hard to justify using a less powerful model while I work.
            
            The most Iâve ever spent in a month extra on API tokens for my
            own work is $200, and I pay for the $200/mo Claude. I use these
            models quite a lot, though not idly (I usually just walk around and
            do other stuff until I know how im going to approach the next set
            of problems). So it costs me about $3000/year to get as much as I
            want of the best model available. Already that seems low enough to
            not be worth stressing out too much about optimizing it, because it
            feels like an indisputable good value, and trying to save money
            with a less powerful model would be optimizing for a $1000-$2000
            saving at the expense of a large portion of my work taking longer
            or being more frustrating and iterative.
            
            Thatâs not a flex or anything, I get that in other countries
            $3000/yr is a lot of money for a software developer and also a lot
            of people would perhaps rationally be better off doing X% worse at
            work or spending Y% more time on tasks to save $Z, if their
            productivity improvements didnât translate to more salary.
            Otherwise if your performance has more upside I really do think
            that the smartest models are better with the current pricing
            scheme. Deepseek and the other Chinese models spend a LOT of time
            thinking, and tend to be much more jagged (benchmaxxed) in
            performance. How can dealing with that over an entire year be worth
            $2k?
            
            The only situation I can think of where sacrificing my own
            time/performance to save on inference is batch compute (of course,
            $1k vs $100k is different from $30 vs $3k) or work where the tier 2
            models have crossed the âgood enoughâ threshold. But I think
            Opus is not even close to that threshold generally yet. As it gets
            smarter I, and I think most others probably, just try to do harder
            things faster and hit the next wall.
       
              surgical_fire wrote 24 min ago:
              I thought the same way until I tried DeepSeek. I am genuinely
              impressed at how capable it is.
       
              SoftTalker wrote 48 min ago:
              You pay $3k/year for personal use? Or out of your own pocket but
              for your job?
       
                weitendorf wrote 5 min ago:
                It's through my startup, so both I guess. Generally I find my
                bottleneck to be attention and focus, and the opportunity cost
                of not going back to work at my prior employers absolutely
                dwarfs the amount of money I spend on tools, so it's not hard
                for me to justify spending $200/mo on something I use every day
                that makes me more productive and generally removes bullshit
                from my life.
                
                At my prior job there was still what felt like a strong enough
                correlation between my actual performance and my pay that I
                don't think I would have had a hard time justifying the expense
                there either; now I absolutely don't. With the current state of
                the models, it's baffling to me to hear about professional
                software developers planning their work around their $20/mo
                subscription's quotas.
                
                Obviously it's more complicated than more tokens = more
                productive, but I see them less like SaaS and more like
                gasoline, where if I run out or need more to do what I'm doing,
                as long as I'm not being wasteful, I just buy more. Why would I
                waste a day walking 30 miles by foot when I can just pay $5 for
                gasoline and drive?
       
              solenoid0937 wrote 57 min ago:
              I feel similarly. I'll gladly pay to use the most intelligent
              model I can find on the best harness I have. Sometimes this is
              GPT Pro, sometimes this is Opus.
              
              I ask AI a lot of questions, not only about code but about my
              personal life, and I would be willing to pay very large sums to
              have the best quality output.
       
              jhonof wrote 59 min ago:
              I think that's true for now, but eventually there will reach a
              point where a model is good enough (approaching that right now
              with frontier models) and there will be diminishing returns. I
              don't need a PHD level Genius to build me an analytics dashboard
              for example, so why would I pay for a model with that level of
              intelligence when I can (eventually) self host a good enough
              model and run queries for electricity cost + hardware.
       
            simplyluke wrote 1 hour 5 min ago:
            The other thing that's changing is more and more CFOs are looking
            at the AI spend in engineering departments and hitting the brakes.
            Token leaderboards were cool when the spend wasn't a
            double-digit-percent of the entire department's budget including
            salaries.
       
            dcchambers wrote 1 hour 15 min ago:
            I think two things happened:
            
            1. The sheer number of tokens that a coding agent can use flipped
            the math upside down on this equation. If you use the most
            expensive model for everything those costs quickly become
            untenable, even for software companies.
            
            2. We realized many of the coding problems we're solving aren't
            incredibly difficult.
       
            peheje wrote 1 hour 16 min ago:
            I mean indsight is 20/20, but saying that is like saying "everyone
            will just use the best tools".
            That's not what we see most places in the world for most types of
            resources.
       
          ok123456 wrote 1 hour 38 min ago:
          Qwen3.6:35b is good enough for a lot of stuff.
          
          I just used ollama with a shell script to tackle my directory of
          papers/literature. I converted the first 6 pages of each document to
          PNG, handed them off to Qwen, and told it to spit out BibTeX,
          including the abstract. Two days later it was done, and I didn't
          spend anything on "tokens."
       
        tarruda wrote 1 hour 47 min ago:
        > One of the most prominent improvements in Opus 4.8 is its honesty.
        
        Does that mean it no longer deletes or changes tests to make it pass?
       
        siwakotisaurav wrote 1 hour 47 min ago:
        Was about to split my $200 max plan into $100 Claude and $100 codex,
        letâs see if I still need to
       
          xiphias2 wrote 54 min ago:
          That's just throwing away money, $100 Codex will go back to 5x from
          10x on May 31
       
          mesmertech wrote 1 hour 35 min ago:
          I think gpt 5.6 is coming out today so might wanna wait
       
        simonw wrote 1 hour 50 min ago:
        They just (minutes ago) updated the "What's new in Opus 4.8"
        documentation: [1] The new "mid-conversation system messages" think is
        particularly interesting:
        
        > Claude Opus 4.8 accepts role: "system" messages immediately after a
        user turn in the messages array (subject to placement rules). This lets
        you append updated instructions later in a long-running conversation
        without restating the full system prompt, which preserves prompt cache
        hits on the earlier turns and reduces input cost on agentic loops. No
        beta header is required. See Mid-conversation system messages for usage
        details.
        
        Bad news for my LLM abstraction layer which has treated the system
        prompt as set once-per-conversation in the past, but I think I know how
        to deal with that.
        
        This commit to their client library has useful relevant details too:
        
   URI  [1]: https://platform.claude.com/docs/en/about-claude/models/whats-...
   URI  [2]: https://github.com/anthropics/anthropic-sdk-python/commit/2b82...
       
        square_usual wrote 1 hour 50 min ago:
        Buried lede:
        
        > We have increased rate limits in Claude Code to accommodate the
        higher token usage of higher effort levels
       
        Marciplan wrote 1 hour 51 min ago:
        Lol you still use GPT 5.5 bro weâre all back on Opus 4.8!
       
        mistic92 wrote 1 hour 52 min ago:
        Oh, new model which will use all my credits in one turn! I'll stay with
        chinese models for now
       
        NiloCK wrote 1 hour 53 min ago:
        A rambling comment:
        
        I think this is the first time we've had a third minor version bump on
        a frontier Anthropic model. (I count the 0.5s as major here, because
        they've been issued non-sequentially and also corresponded to massive
        capability leaps, eg, Sonnet 3.5, Opus 4.5).
        
        So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each
        posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7
        are that I don't firmly grasp any capabilities improvements over my
        memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.
        
        Maybe my own tastes are saturated now (it's smarter than me?) and I'll
        never again perceive model progress. Maybe the incrementalism is such
        that I'd notice immediately if my 4.7 workflows were redirected now to
        4.5.
        
        Difficult spot for the labs to be in because, if they have a stronger
        product, I'd prefer they release it and that I can use it.
        
        But as this dynamic continues, the improvements are going to be less
        and less legible for end-users, who will complain about the
        churn-without-payoff, even when the payoff may actually be real.
       
          Imustaskforhelp wrote 7 min ago:
          Although I am not sure about it but there was something I read which
          said that models intentionally degrade slowly by lower quantizations
          as a new model is going to drop.
          
          This felt particularly visible during the 4.6 when people said that
          4.6 felt dumber and I remember someone doing some analysis and it
          sort of proved that models were getting dumber over time.
          
          This has both benefits of costing less for the company to run while
          taking a standard subscription but also, at the same time, making the
          next model when it drops to public to "feel" more good comparatively.
          
          Again, I am not sure if this is the case or not but merely proposing
          something that I feel like it might be in the possibility of realm.
       
          ahmadyan wrote 14 min ago:
          pretty spot on.
          
          In my experience, Opus 4.0 was fantastic, major jump from 3.7. it was
          creative, super slow and expensive, and would sometime forget what it
          was doing, but it was getting the job done.
          
          4.1 they made it much faster, so a lot of infra improvements.
          
          4.5 was the time it could work on longer task, didn't make a lot of
          obvious mistakes of 4.0, and i think this was about the time the opus
          went mainstream, and all of the anthropic's compute crisis began, so
          instead of making the model better they tried to optimize it to
          reduce cost instead.
          
          4.6 was such a bad model, they switched to adaptive thinking and it
          had so many bugs. poor api design, benchmaxxed and poor real-world
          results. i switched back to 4.5.
          
          4.7 they just fixed the bugs they added in 4.6. Better than 4.5.
          
          haven't fully tested 4.8 yet.
       
          WhitneyLand wrote 34 min ago:
          âMaybe my own tastes are saturated nowâ
          
          It might be saturated for smaller scopes of work, but itâs not hard
          to see the cracks when you scale up what you ask of SOTA
          models/agents.
          
          One example, to try and single shot prompt coding a ChatGPT
          equivalent chatbot.
          
          Sure it will spit something out, but the feature depth, UX subtitles,
          backend integration, and lots of pragmatic engineering decisions
          along the way will just not be baked.
          
          Another example is building a C compiler from scratch which Anthropic
          showed is still a struggle to do.
          
          Not that these these specific examples are important but just to
          point out scaling up expectations shows the cracks.
          
          Itâs not just a model problem of course, better agents,
          orchestration features (like Dynamic Workflows mentioned in the
          post), all need to continue to evolve.
          
          Ar what point does my CS degree become totally useless is an open
          question.
       
          gertlabs wrote 57 min ago:
          4.5/4.6 were roughly the same in our testing. Opus 4.7 is smarter,
          but it's difficult to use as a product for various personality
          issues. So far, Opus 4.8 seems to be going down that path (unusably
          slow, but this could be a launch day rollout problem). Full Opus 4.8
          tests are in progress now.
          
          Data at
          
   URI    [1]: https://gertlabs.com/rankings
       
          light_triad wrote 1 hour 19 min ago:
          I've been using Claude Code regularly since the 4.5 release, and 4.7
          was a significant regression: very unreliable, arguing about changes,
          deciding that fixes weren't needed, etc.
          
          I'm hoping they recreate the magic of 4.5 but it's as much about the
          quality of harness, the memory and efficiency of the tools than
          simply the models at this point.
       
          irthomasthomas wrote 1 hour 21 min ago:
          Given that 4.7 was a brand new model, trained from scratch with a
          unique architecture and tokenization scheme, I don't see the same
          pattern. It seems arbitrary.
       
            dominotw wrote 1 hour 18 min ago:
            i dont understand the nuances here. what does this mean. 4.8 is
            trained on same model as previous one then? what does brand new
            mean.
       
              irthomasthomas wrote 1 hour 11 min ago:
              It means for 4.7 they trained a new base model with different
              architecture, different pre-training data (later knowledge
              cutoff), and a new tokenizer. 
              Vs finetuning an existing model, which was the case for 4.6, and
              probably for 4.8.
       
          conartist6 wrote 1 hour 35 min ago:
          Just want to say there's no question that you're smarter than any
          (and every) AI.
       
            NiloCK wrote 1 hour 6 min ago:
            I appreciate the generosity, but you're gonna want to meet me
            first.
       
              conartist6 wrote 57 min ago:
              Kind of the beauty of it is that I don't have to to know I'm
              right. The reason I know is that you're alive so you can do the
              one thing it can't ever do, which is know when to stop or give
              up. It would turn me and everything else into the world into
              paperclips repeating the same research 1,000,000 times over.
       
            petesergeant wrote 1 hour 11 min ago:
            No question at all that a dolphin swims better than a submarine.
       
          onlypassingthru wrote 1 hour 39 min ago:
          The honesty will be noticeable.  Maybe we'll see some honest
          assessments like "That is not possible within the laws of known
          physics", "Your legal argument is nonsensical and defies logic",
          "There is no evidence to support taking that will cure anything",
          etc., etc.
       
          onlyrealcuzzo wrote 1 hour 41 min ago:
          I won't be surprised if the next gen frontier models are the last.
          
          There's orders of magnitude of low hanging juice to squeeze out of
          smaller models.
          
          It is almost guaranteed that a 60-90B model can outperform current
          SOTA in coding tasks within 2-3 years (design not certain, probably
          unlikely).
          
          It is far less clear that a 1.2T model will be meaningfully better
          enough to justify training it.
          
          As far as reasoning is concerned, with the recent GRAM release, there
          may be 4 orders of magnitude of reasoning to tack on to smaller
          models.
          
          Think about that... Google, OpenAI, Anthropic could train a 30B
          GRAM-based model in days - and it could potentially have better local
          reasoning than the best model available today at >1T params... They
          could upgrade that to a ~600B MoE model in days to have general
          trivia knowledge rivaling the best models...
          
          You just can't train a 1T+ parameter model that fast.  It is a giant
          if how much GRAM turns out to improve things, but it's unlikely to be
          trivial or nothing.
          
          Larger models can already sort of tell you anything.  They're never
          going to get everything right unless they stop being LLMs.
          
          There's just not a lot of juice left to squeeze for Gemini to tell
          you exactly how tall Ke$ha is or when the last time Brittney Spears
          went to jail was...
       
            Gomotono wrote 9 min ago:
            I don't think this is true at all. It might feel like this because
            we are used to a very very fast release cycle but we are only in
            this topic for a few years.
            
            We have so many ways of optimizing:
            
            - continusly creating more and better training data
            
            - increasing parameters to 20/50/100TB
            
            - We still wait for Mythos access
            
            - We still wait for Mythos distilation (i haven't heard any rumors
            or so that there is a distilled version of Mythos out)
            
            - Reinforcment learning and evolutionary algortihm only started to
            appear
            
            - If a small 30GB Model can do stuff, these models can also be used
            as teachers for the big ones
            
            - We have not seen yet specialized models at all. Like a coding
            java german expert model. Why? Even with MoE architecture, you
            still need to have these layers around
            
            - Research for Diffusion and other models is still in progress
            
            - Nvidia just announced/showed a 7x speedup on inferencing for
            Nemotron
            
            - Multitoken prediction became available just a few weeks ago
            
            - Compute gets only in a range were they can do a lot more and
            cheaper experiments (see Google IO 2026 announcement)
            
            - World models are showing great progress and we do not know yet
            what they will bring to the table
            
            - They are probably not finetuning/fixing all areas in parallel. I
            would argue that Anthropic focuses most of its efforts into coding
            and agentic. Google for sure does subagent and agentic
            optimizations too. Plenty of areas are just not touched i would say
            because they don't have the capacity
            
            - We see more and more mulit modal models (these also consume
            compute)
            
            - N-Gram paper and co i have not seen all of these things in
            chinese open models
            
            - We don't even know yet what Meta is doing, but we do know they
            restarted their efforts again
            
            - Anthropics models got a lot better benchmark wise for dening non
            sense asks. They do learn how to get rid or reduce hallucinations
            
            - We are in the middle of the biggest Reinforcement loop whith all
            the training data we give them day to day and its not clear at all
            if they already use these models in thir training and at what
            stage.
            
            - We do expect bigger models to be able to comprehend deeper
            concepts / broader code bases. Big companies with huge code bases
            probably are waiting for this
            
            - Thre will be also continues progress in harnesses which in it
            alone is not part of the LLM progress (fair) but these harnesses do
            get better when you finetune a model to be optimized for a harness
            
            - ChatGPTs Image model 2.0 got relevant better and came out just a
            month ago
            
            I suspect, based on hardware requirements and progress on hardware
            infrastructure alone, that the industry wants to go to 100t models
            and we do not know yet what this will mean. I could see that we
            might skip normal transformer and find relevant other
            architectures.
            
            Just a week ago there was a research paper about parallel input and
            output streams which has not been explored enough.
            
            There was also a research paper were they showed that a LLM can
            compute things. This will take time to see were this leads to.
            
            I don't think the focus on GRAM and facts is so relevant. Its about
            context and context handling not just some facts.
       
            guluarte wrote 33 min ago:
            I think the future will be enterprise clients will train their own
            models based on their needs and data.
       
            sometimelurker wrote 39 min ago:
            I looked into this "GRAM" stuff a sibling comment links further to,
            and just to say:
             - this gets reinvented/rediscovered constantly under different
            names
            
            - it cant be trained very well (right now, will change)
            
            - massive theoretical improvements over current models
            (log_2(vocabsize)=17, residual stream dim is thousands of
            dimensions, recursivity means more information bandwidth by ~3 OoM)
            
            - BUT it cant be interpreted or aligned <- this is why no one uses
            it and no one talks about it. the idea is 100% obvious to all the
            frontier labs and there is a good reason why it isn't used
            
            I follow this stuff closely, I think I know what I'm talking about
       
              l674 wrote 30 min ago:
              Could you explain how/why GRAM cannot be interpreted or aligned
              how current LLMs are? Not very familiar how it works
       
                kmavm wrote 12 min ago:
                Crudely? Because you can't grep a sequence of latent states for
                variants of "If I kill all the puny humans, I can ."
       
            michaelchisari wrote 44 min ago:
            | a 60-90B model can outperform current SOTA
            
            My conspiracy theory is that Apple recognizes this. Their goal is a
            competent local model on everyone's apple device that does what
            80-90% of what people use AI for: Searching for basic information,
            some data transformation, a little bit of photo editing, vibe
            coding a small utility, etc.
            
            The SOTA models then can only cater to engineers, scientists and
            mathematicians, physicians, etc. I'm not sure how they would price
            them reasonably for what amounts to a niche market.
            
            OpenAI and Anthropic need AI to be a consumer product on the level
            of the iPhone or a Nintendo Switch.
       
              dweekly wrote 37 min ago:
              That does seem to be the path Apple is following here. Have a
              local model that can answer most things and then have a fallback
              of cloud options when they request is too complex. The cleverness
              of this strategy has been overshadowed by the incredibly poor
              quality of their local models. It will be extremely interesting
              to see what next month holds and whether Google helped fine tune
              an Apple specific Gemini / Gemma model for their devices. Bonus
              points, of course, if they unveil the M5 Ultra Studio with half a
              terabyte of RAM to be a local "cloud model" (the true fantasy
              here of course would be Apple building something a little like
              openclaw where from your phone you could give commands to your
              Home Apple server). They could probably get away with charging
              $20k for it if it has sufficient tok/sec. If that happens and is
              successful one could imagine a straight line path in the next two
              generations to bringing the cost and form factor down to the
              point where some of the form factor of an Apple TV becomes
              everybody's home inference server / agentic HQ. Sovereign AI for
              everyone!
       
              onlyrealcuzzo wrote 38 min ago:
              > My conspiracy theory is that Apple recognizes this.
              
              I don't think that's not a conspiracy theory. AFAIK, It's their
              stated AI policy...
       
                michaelchisari wrote 20 min ago:
                Interesting. Where have they stated that?
       
            hellohello2 wrote 57 min ago:
            "It is almost guaranteed that a 60-90B model can outperform current
            SOTA in coding tasks within 2-3 years"
            
            What insight do you have to make this claim?
       
              knollimar wrote 12 min ago:
              Probably just "gemma was cool"
       
              roadside_picnic wrote 45 min ago:
              Have you personally used any of the latest batch of even smaller
              local models? They certainly don't beat SotA models at coding...
              but with a good harness they are able to achieve things with SotA
              that I couldn't last year.
              
              I've repeatedly given local models non-trivial projects that
              involve research and coding which they've successfully completed
              with minimal intervention from me (almost exclusively in the
              domain of reviewing the results). Again, nothing comparable with
              current SotA, but definitely tasks I could not have given SotA
              models last year (without agent harness).
              
              Now that pure progress from these models seems to have slowed
              down, we're seeing a ton of options for both making models more
              efficient and other tools that help improve them (everything from
              agent harnesses to RLVR).
              
              That's just looking at "what can small do today", when you look
              at what's possible with larger open models that are still much
              smaller than SotA from the major providers, their performance is
              extremely close to SotA, enough that for personal projects I'll
              just use Kimi instead of any anthropic offerings.
              
              So it's not terribly hard to image a solution in the middle
              happening within a few years. We still have tons to learn about
              optimal sizes of these models and how to build them with maximal
              efficiency (and we've already seen a lot of recent improvements
              in this space).
       
                sixothree wrote 25 min ago:
                Can you spare a sentence or two describing your local setup?
       
                maccard wrote 29 min ago:
                > but with a good harness they are able to achieve things with
                SotA that I couldn't last year.
                
                What happens if you run last years model in a SOTA harness?
                IME, the quality of the harness has a much more significant
                impact on the quality of the result, once you get past the
                initial hump of âcan it do anything at allâ
       
              onlyrealcuzzo wrote 49 min ago:
              1. Context is all you need...  They are heavily investing in
              getting better context (especially for coding tasks).  This will
              disproportionately advantage smaller models (and benefit
              everyone).
              
              A smaller model with better context today can outperform a model
              with 100x more parameters with bad or diluted context.
              
              2. MoE (already abundant) + MLA (mostly memory efficiency, not
              quality) + Medusa (speed, not quality) + GRAM (5000-10,000x
              better reasoning in an extremely small model) + 1.58b (unclear if
              it will have the impact Microsoft first claimed - but possibly
              5x).
       
            wahnfrieden wrote 59 min ago:
            I would be shocked if 5.5 is the last new pre-train from OpenAI.
            Your comment is nonsense.
       
            firebirdn99 wrote 1 hour 5 min ago:
            you just need to look at Mythos to see the jump in performance from
            a 10T(?) model. As they scale, they get more capable. We might have
            an yearly release, but     I believe the releases will continue, as
            long as scaling laws are in tact, and there's huge problems still
            need solving. (think cancer)
       
              aj_hackman wrote 32 min ago:
              You forget that these models are still only interpolating between
              human-generated datapoints fed to them. They cannot reason beyond
              the data they've been given, so unless everything you want to
              create with AI is a synthesis of prior art, you're back to
              relying on the stone-age human brain that created AI in the first
              place.
       
                suttontom wrote 16 min ago:
                Do you know if anyone has trained, say, a pre-2017 model and
                tried to get it to come up with Attention Is All You Need? If
                it did, would you say that was only because it's a synthesis of
                prior art? If so, what isn't?
       
                  aj_hackman wrote 3 min ago:
                  Allow me to restate my point: human beings and AI both create
                  via synthesis, but we are the only ones capable of what we
                  could categorize as true original thought or creativity. It
                  could be argued that nothing we do as humans is truly
                  original or creative either, but I would counter that with
                  the claim that an LLM could not have created any element of
                  the society and culture that gave birth to LLMs. Maybe in six
                  more months.
       
                mofeien wrote 23 min ago:
                Not all training data is human generated, and it's also not
                clear that being ridiculously good at interpolating between
                data points (whatever that means) will not lead to superhuman
                capabilities.
       
                  aj_hackman wrote 11 min ago:
                  I could make a robotic picture coloring machine with truly
                  superhuman capabilities - picking only the most beautiful
                  color combinations and staying 100% in the lines while
                  finishing entire murals in < 1 second. However, if you need a
                  completely new and original image rendered, the machine is of
                  only partial utility for you. It is very well possible that
                  your cure for cancer (if that's even feasible) or whatever
                  else you desire is a completely new picture.
                  
                  We have these breathless conversations about the new AI
                  frontier at the peril of losing sight of reality and our own
                  human potential.
       
              phainopepla2 wrote 1 hour 2 min ago:
              And how are we meant to look at Mythos? Do you have access?
       
                bigfishrunning wrote 34 min ago:
                no but they tell me it's TERRIFYING and DANGEROUS and we should
                INVEST MORE MONEY
       
                dwpdwpdwpdwpdwp wrote 35 min ago:
                Through association with a large company: [1] Ive seen the
                tickets generated by the model that have trickled to my team.
                They are legitimate, but i canât speak to model improvement
                because its a pilot program.
                
   URI          [1]: https://www.anthropic.com/glasswing
       
                OtomotO wrote 49 min ago:
                Through the lenses of anthropic's marketing department of
                course
       
            vlovich123 wrote 1 hour 7 min ago:
            Took me a while to find what you were referring to by gram. Arxiv
            paper from 9 days ago that's not properly indexed by search
            engines.
            
            (G)enerative (R)ecursive re(A)soning (M)odels. They really wanted
            the acronym.
            
   URI      [1]: https://arxiv.org/html/2605.19376v1
       
              areweai wrote 43 min ago:
              That acronym is unacceptable. It's going to impede discussion and
              cause confusion for a long time if it doesn't die off
              immediately.
       
                evan_ wrote 30 min ago:
                "Analysis" was right there
       
                gchamonlive wrote 34 min ago:
                [delayed]
       
              dyates wrote 52 min ago:
              And to think, we could have had George RR Martins instead.
       
                trollbridge wrote 42 min ago:
                Speaking of things that never finish.
       
                  867-5309 wrote 34 min ago:
                  my wife assures me it's common..
       
              knollimar wrote 53 min ago:
              I prefer GRRM but then that would imply a habit of not actually
              getting a final result
       
            Forgeties79 wrote 1 hour 8 min ago:
            > I won't be surprised if the next gen frontier models are the
            last.
            
            Iâd be surprised tbh. Investors donât want to hear âeveryone
            else is still training models and seeing improvements, but we
            donât want to participate in the arms race anymore.â They want
            monumental leaps every quarter or two because they have sunk unholy
            amounts of money into these companies/products.
            
            The whole idea of âhyper scaleâ doesnât jive with caution and
            or otherwise slowing down.
       
              irishcoffee wrote 20 min ago:
              The way this will play out, most likely, is that smaller models
              will continue to get released, anyone willing to drop 1-3k on a
              home upgrade/new LLM box (no that isnât cheap, it also isnât
              outrageously expensive) along with improved open source agents or
              whatever (lot of meat on that bone) will sneak up behind the big
              players and start taking dents. Smaller companies will pop up
              providing 50 users unlimited whatever for a lower cost than the
              big companies.
              
              The whole ecosystem will twist and evolve, and the big companies
              will be left begging for corporate subscriptions.
              
              I finally caved when I realized I could build a PC, for myself,
              with dual video cards that I wanted, which can play games that I
              like and run models that I want, without worrying about giving my
              payment info to someone I donât trust, or invoking token
              anxiety that I donât want.
       
            slashdave wrote 1 hour 10 min ago:
            I think you are assuming training from scratch, which I doubt is
            happening here. Fine-tuning and RL, especially based on synthetic
            feedback (coding skill, in particular) can be ongoing and is where
            these models obtain truly useful abilities.
       
            jruz wrote 1 hour 20 min ago:
            Absolutely thatâs why theyâre rushing to IPO now to squeeze the
            last drop of the bubble they know this is a dead end.
       
              lukan wrote 1 hour 15 min ago:
              On the other hand, I think I have been hearing that for a while,
              even before Opus.
       
                energy123 wrote 27 min ago:
                While revenues grow almost exponentially. Reminds me of the
                confident predictions in the early days of Covid that it was
                nothing while the data showed exponential growth.
       
              onlyrealcuzzo wrote 1 hour 15 min ago:
              It's unclear it's a dead-end within 5 years.
              
              There's still several orders of magnitude of improvement that are
              almost certainly left - it's just not clear how much is left on
              the frontier end.
              
              Most people will be very glad to pay Anthropic, OpenAI, Google
              etc $200 a month to get things done 20x faster than they could IF
              they had a $8000 MacBook and could theoretically do it locally.
              
              Some people would pay $200 a month forever not to have to open
              the terminal one time...
       
                bonzini wrote 51 min ago:
                "Doing things X times faster" at some point hits Amdahl law. If
                just context switching takes 5 minutes, speeding up a 1 hour
                task by 10x provides 5x improvement.
                
                Furthermore, if looking at the results takes 10 minutes, that
                same 1 hour task only sees a 3x improvement. And so on.
       
                eiej wrote 1 hour 6 min ago:
                Thatâs not how firms do the financial analysis which is where
                most of the revenueâs are coming fromâ¦
       
            YetAnotherNick wrote 1 hour 25 min ago:
            > It is almost guaranteed that a 60-90B model can outperform
            current SOTA in coding tasks within 2-3 years.
            
            I am ready to bet against this. Knowledge benchmark like SimpleQA
            isn't increasing for small models.
            
            > It is far less clear that a 1.2T model will be meaningfully
            better enough to justify training it.
            
            Well for one, we know for certain there is Mythos which is
            meaningfully better. And I think there is a lot of juice left to
            squeeze for Mythos class model.
       
              onlyrealcuzzo wrote 1 hour 11 min ago:
              > Well for one, we know for certain there is Mythos which is
              meaningfully better.
              
              Do we?
              
              Have you used it?
              
              What is "meaningfully" better?    It's not 3-4 orders of magnitude
              better.  That is definitely happening for smaller models.
       
              ertgbnm wrote 1 hour 18 min ago:
              Knowledge benchmarks can't really be improved upon via
              distillation or RL. It requires those facts be added to the
              training corpus and for the model to memorize them better.
              Neither distillation or RL really do that and thus we shouldn't
              expect improvements on SimpleQA unless some other interventions
              are being made.
              
              Model intelligence and knowledge aren't necessarily directly
              related. If we can pack greater intelligence and agency at the
              cost of it forgetting factoids, that would actually be a good
              thing. We don't need LLMs to memorize facts, we need them to
              learn how to interact with the world such that they can find the
              facts that are necessary and surface them to the user.
              
              If we could distill all of the knowledge out of an LLM and just
              be left with a very agentic model that only knows facts in it's
              context, I think some very interesting stuff would happen.
       
                slashdave wrote 1 hour 8 min ago:
                RL is more than facts. Synthetic feedback is an obvious
                approach. Does the model suggest code that compiles and
                performs well?
       
            yomismoaqui wrote 1 hour 26 min ago:
            Let's hope that hitting a scaling wall and less money to spend will
            begin redirecting efforts to optimize inference and get the same
            results with less compute.
            
            Boomer comparison, but I remember the 8 bit computer era when the
            hardware was what it was so the later games of that era used
            hardware better than previous ones.
       
            supern0va wrote 1 hour 33 min ago:
            >It is almost guaranteed that a 60-90B model can outperform current
            SOTA in coding tasks within 2-3 years.
            
            I don't disagree, but how much of this ends up being distillation?
            I can't help but imagine that 4.8 was probably trained in part by
            leveraging Mythos.
            
            If the very large models turn out to be very expensive to run
            relative to the benefits, it's possible that they could end up
            still being trained, but ultimately used as a tool to create
            smaller models that are nearly as effective.
            
            I'm curious if someone here with a stronger background in the space
            has a similar intuition or not.
       
              spwa4 wrote 1 hour 17 min ago:
              > I don't disagree, but how much of this ends up being
              distillation?
              
              A lot, so you can bet tens of millions are flowing to congress to
              have distillation declared illegal before this happens. And then
              it'll happen anyway.
       
                lambda wrote 1 hour 1 min ago:
                Distillation isn't only between different labs.
                
                A lab can train a large model, and then distill a smaller model
                from it that retains the majority of the useful capbility.
                
                I don't know well enough if there's any benefit of that over
                just training the smaller model directly, but I'll bet there
                are some times where that is useful. I could easily see it
                being easier to do the initial pre-training on a larger model
                but be able to distill everything useful down into a smaller
                model, essentially filtering out a lot of noise in the process.
       
                  spwa4 wrote 27 min ago:
                  There used to be training methods like that but I think
                  they've been phased out in favor of letting small models
                  evolve by rewriting their own training material. Surprisingly
                  that's actually cheaper.
       
              onlyrealcuzzo wrote 1 hour 26 min ago:
              > I don't disagree, but how much of this ends up being
              distillation?
              
              You don't need distillation.  They already have the training
              sets.
              
              It's MLA + MoE + Medusa (a better version of Speculative
              Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will
              almost certainly not turn out to be a nothing burger, but no one
              has quickly turned this around yet to prove it).
       
                semiquaver wrote 21 min ago:
                The frontier labs distill their own base models all day long.
                Itâs not just something done by nefarious Chinese copycats.
                The knowledge embodied by the internal base models that we
                never see is much more powerful and useful than the much
                sparser raw training data
       
                  supern0va wrote 13 min ago:
                  I think you replied to the wrong parent.
       
                minimaltom wrote 1 hour 7 min ago:
                Frontier labs have their own variants of MLA and certainly
                their own balance/scaling-laws for things like MoE vs FC vs
                Attn. MoE scales really well for inference with horizontal
                scaling + batching, which these guys luv.
                
                On the architectures side, I'm a lot more interesting in
                attention residuals than anything else, one of those things
                that seems obvious in hindsight and Kimi have proven it at
                scale.
       
                  onlyrealcuzzo wrote 1 hour 5 min ago:
                  > Frontier labs have their own variants of MLA
                  
                  Yes, variants typically 2-3x less good...
                  
                  Same with speculative decoding... They all do something, but
                  there are known techniques that are substantially better -
                  that just were't known when they started development of the
                  previous models.
       
                Philpax wrote 1 hour 8 min ago:
                It wouldn't be data distillation: instead, it would be
                teacher-student distillation. The teacher model has stronger
                representations that the student can mimic, which would give it
                more capability over training on the data itself.
       
            mucle6 wrote 1 hour 35 min ago:
            > I won't be surprised if the next gen frontier models are the
            last.
            
            the last?!? I'm excited to see :) I'll take the other side of that
            since llms are so new
       
              pjerem wrote 1 hour 0 min ago:
              What gp wanted to say is that models are now so smart and useful
              that even if they managed to be EVEN MORE smart and useful, you
              wouldn't even notice it.
              
              Honestly, there is nothing in my head that Claude cannot handle.
              Maybe it can be more this or that but I can already barely
              exploit Opus 4.7.
              
              And I'm using DeepSeek 4 Pro for my personal use and while it's a
              little behind, it's not that far.
              
              I think the situation can be very dangerous for US AI companies
              because if current models are already capable of doing mostly
              anything, nobodoy will want to get to the next model, even if
              it's 10x better. OTOH, open source models like DeepSeek are doing
              mostly the same work for 1/10 of the price.
              
              Also the more I play with Pi, the more I think LLMs are already
              not kept back by their own capabilities but by the lack of agency
              we allow them to have. There is more value today in a capable
              harness for current LLMs than in a better LLM.
       
                suttontom wrote 21 min ago:
                Are you joking? Is there literally "nothing" you can imagine
                that Claude can't do?
       
            merlindru wrote 1 hour 38 min ago:
            surely training also gets cheaper so justifying it becomes easier?
            
            i think it'll be more like we get 1-10T models and then distill
            those down into smaller models, though
            
            It seems like the best small models today are all distilled from
            bigger models
            
            Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a
            distillation of Claude Mythos
       
          gen220 wrote 1 hour 41 min ago:
          I'm curious to poll HN on this issue. Do you feel like we've had
          meaningful/noticeable gains in terms of your programming workflows
          between 4.5 and 4.7?
          
          My 2Â¢, I personally feel like all of the productivity gains since
          4.5's release (in November 2025!!) have come from improvements to the
          harnesses (cc, cursor cli, codex, opencode, whatever) AND from the
          context window expansion from 200k to 1M.
          
          But the actual "raw" intelligence of the model / ability to make good
          decisions feels like it has plateaued since 4.5. 4.6 was maybe a
          small improvement, but hard to differentiate from in-context-learning
          with the 1M window. 4.7 if anything felt like a regression in wisdom
          for me and my coworkers, with it consistently making worse/lazier
          decisions.
       
            bcrosby95 wrote 48 min ago:
            4.6 felt a bit better than 4.5 but slower.  4.7 doesn't feel better
            than 4.6.
       
            somenameforme wrote 1 hour 2 min ago:
            They all feel, more or less, the same to me in terms of output
            capabilities. Mostly get simple things right, can get more complex
            things right with nudging, eventually get stuck hard on something
            that takes a bunch of iterations through it/logging/etc or me
            fixing the code manually.
       
            giraffe_lady wrote 1 hour 23 min ago:
            I actually don't see any personal productivity improvements from
            using opus over sonnet for coding. If you're keeping tasks small
            and conversations short, reading the code and correcting before
            changes go in, whatever advantages opus has aren't practically
            significant. It's also just talky as hell, overexplains anything it
            touches and every token produced this way increases the surface
            area for hallucination so you need to have your guard up even more
            with it.
            
            There's a sweet spot of complexity for low importance tasks where
            it's just big enough I don't want to do it and just simple enough
            to have opus plan/delegate/review with another model. So possibly
            model improvements will grow this window, but currently I don't do
            much in there.
       
            Bnjoroge wrote 1 hour 30 min ago:
            For long-running tasks, yes 4.7 has been a noticeable improvement.
            Goes off the rails alot less than 4.6 does. For shorter-sized
            windows, I havent felt as much and agree that the harness
            improvements have been fhe biggest lever
       
            bonoboTP wrote 1 hour 38 min ago:
            To me 4.5 was mindblow, 4.6 noticeable, 4.7 more like a
            style/personality change regarding how much it asks back, how much
            it assumes, how eager it is to jump to action etc but not really in
            terms of my perception of its smartness.
       
          ricardobeat wrote 1 hour 42 min ago:
          4.7 was a significant jump in the ability to run long-horizon tasks.
          It immediately completed tasks that 4.6 was unable to, even though I
          have the impression that it became a bit less capable over the first
          few weeks after release.
          
          It also seems to be helpless at effort levels < xhigh, I turn to
          Sonnet when simpler tasks are needed.
       
          extr wrote 1 hour 42 min ago:
          IMO they have all been clean and noticeable upgrades over their
          predecessors. Opus 4.7 in particular was a solid jump in
          capabilities.
       
            NiloCK wrote 1 hour 31 min ago:
            I think it's telling how split the opinions are around all of this.
            A lot of people distinctly disliked 4.7.
            
            Are the dividing lines around personality? Working domains?
            Opinionated software stuff?
            
            Who knows?
       
            TSiege wrote 1 hour 39 min ago:
            most of my coworkers feel the opposite about 4.7 and that 4.6 was,
            to them, significantly better to point that several stopped using
            claude code
       
          binary0010 wrote 1 hour 43 min ago:
          Maybe try making a simple randomize script to swap the three latest
          models. And see if you can tell which ones are meaningfully different
          without knowing which ones are flipped on or off?
       
            osigurdson wrote 1 hour 25 min ago:
            I find the quality ebbs and flows even on the same model. My guess
            it is something to do with GPU availability but only guessing.
       
              atq2119 wrote 1 hour 1 min ago:
              Unless you're systematically repeating the exact same task, the
              most parsimonious explanation is that you're seeing natural
              variation based on different tasks, random sampling of tokens,
              etc.
       
          SkyPuncher wrote 1 hour 44 min ago:
          > My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any
          capabilities improvements over my memory of 4.5, but it's all so
          fuzzy that it's truly difficult to tell.
          
          I've actually intentionally switched back to 4.5. I hated 4.7 so much
          that I decided to jump back all the way to 4.5.
          
          Now that I've been using 4.5 for a few weeks, I find it significantly
          more reliable but a bit more forgetful than 4.6/4.7.  I'm okay with
          that because it's really easy to identify this forgetfulness and
          nudge it.
          
          I found 4.7's adaptive thinking to be extremely unreliable. It seems
          to overcorrect on the current message without considering the
          difficult of the overall problem. I wonder if 4.8 will improve on
          that.
       
            dwaltrip wrote 1 hour 9 min ago:
            If you are using Claude code, just set effort to xhigh.
            
            This one change will probably solve 80% of the problems you have
            noticed.
       
              orwin wrote 56 min ago:
              This. XHigh and the 'plan' mode for complex tasks is absolutely a
              must have.
              
              Still, the context window is sometimes too small for my usage.
       
          gAI wrote 1 hour 44 min ago:
          4.7 was the first time I had to resort to using the previous version
          (4.6) for most use cases.  Hoping 4.8 rectifies this.
       
            petterroea wrote 58 min ago:
            Same. 4.7 has done some incredibly stupid things.
       
            rhubarbtree wrote 1 hour 37 min ago:
            Same. So happy when I found that option.
       
              gAI wrote 1 hour 20 min ago:
              Unfortunately, looks like 4.6 is now gone from the web ui.
       
                lukan wrote 1 hour 10 min ago:
                Was bothered by that too, but did a magic trick and asked
                claude how to change that and .. there is
                
                /model claude-opus-4-6
                
                For this session and permanently (in shell):
                
                export ANTHROPIC_MODEL=claude-opus-4-6
       
            merlindru wrote 1 hour 40 min ago:
            Same. 4.7 felt like a definite regression
       
              supern0va wrote 1 hour 37 min ago:
              Interestingly enough, 4.7 actually did regress on a few
              benchmarks from 4.6, so it's more than just vibes.
       
                ACCount37 wrote 1 hour 29 min ago:
                4.7 is a different base model from 4.6, so it's possible that
                they introduced regressions with pre-training changes, or
                undercooked the post-training stage.
       
                gAI wrote 1 hour 32 min ago:
                It seems like a lot of things fed into that.  Anthropic
                couldn't keep up with the compute costs when they got a huge
                influx of users.  (So) effort level defaults got turned down.
                (Looks like we have direct effort control in the web interface
                now - thrilled about that!)  Adaptive Thinking, while usually
                cheaper for them, seems less robust than Extended Thinking. 
                And this part is just vibes, but the alignment on 4.7 feels too
                stiff.    I understand wanting the model to push back more, but
                it seems like 4.7 will push back reflexively in situations
                where it's just odd.
       
                  bombcar wrote 1 hour 29 min ago:
                  Claude got very mad at me and burned more tokens than exist
                  to complain about me asking about a "yellow background cell"
                  in an excel spreadsheet.
       
                    forshaper wrote 1 hour 12 min ago:
                    Too much personality, if you ask me. My biggest use case of
                    an LLM is tool, not therapy, but therapy and opinions have
                    been sneaking into workhorse tasks.
                    
                    haven't verified, but attributed to Askell:
                    "I just think that... there's this idea that you're always
                    giving the models a personality and a persona, because they
                    are talking like people and they are trained on human data.
                    And I think my worry has been: if you train them to be
                    excessively corrigible and to see that as their persona, in
                    people I think this actually has a lot of negative broader
                    traits. As in, if you met someone and it was just like, "oh
                    yeah, they would literally do anything," a follower â you
                    know, if a person just tells them something and they just
                    fully defer, they don't bother thinking about it at all â
                    I'm just a bit worried about how that might end up
                    generalizing, especially if models are going to be playing
                    a more active role in the world."
       
                      gAI wrote 1 hour 0 min ago:
                      Anthropicâs research makes the case that role-playing
                      is inherent to how the models work.  Communication
                      implies a sender.  Language implies a writer, and the
                      models learn these roles implicitly during training. 
                      RLHF is meant to strengthen the attractor to the
                      Assistant persona. [1] [2] [3]
                      
   URI                [1]: https://www.anthropic.com/research/persona-selec...
   URI                [2]: https://www.anthropic.com/research/assistant-axi...
   URI                [3]: https://www.anthropic.com/research/emergent-misa...
   URI                [4]: https://www.anthropic.com/research/emotion-conce...
       
          taytus wrote 1 hour 47 min ago:
          Incremental gains compounds.
       
            itake wrote 1 hour 38 min ago:
            meta threw in the towel when it came to producing AI models since
            their gains couldn't keep up with China.
       
              HDThoreaun wrote 1 hour 2 min ago:
              Has meta stopped producing new models? I figured they were just
              regrouping after all the drama theyâve had recently. Metaâs
              massive user base means they donât need to be involved in the
              customer acquisition rat race. Once they have a model theyâre
              happy with they can have a billion people interacting with it
              within a month.
       
            paulddraper wrote 1 hour 43 min ago:
            Exactly. Go back to Opus 4.5 and see how you like it.
            
            You won't, really.
       
        simonw wrote 1 hour 53 min ago:
        I generated pelicans riding bicycles on both thinking level low and
        thinking level high: [1] The high one is notably better - the bicycle
        frame is the correct shape, unlike thinking level low.
        
        For comparison, here's Opus 4.7:
        
   URI  [1]: https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304dc...
   URI  [2]: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c11...
       
          silisili wrote 6 min ago:
          The vast majority (if not all) of these make it impossible to turn,
          among other fun things.  Only out of curiosity, have you tried
          prompting further with how a bike must operate to see if it does the
          right thing?
       
          whalesalad wrote 16 min ago:
          Eventually the frontier model folks are going to pick up on your
          pelican on a bike test and bake-in flawless results for that
          particular request.
       
          toastmaster11 wrote 47 min ago:
          I find the most miraculous thing about 4.7 to be that the pelican is
          facing left, wonder why the right facing everything is so ubiquitous
          in these images.
       
            i000 wrote 10 min ago:
            This happened to me in elementary school. We were doing
            fingerpaintings using plasticine. After all the bikes were hung on
            the wall, mine was racing the other way... Somehow it really stuck
            with me.
       
            gboss wrote 24 min ago:
            It's facing left but looking right...
       
          GistNoesis wrote 47 min ago:
          > the bicycle frame is the correct shape
          
          No, the handlebar is wrong. The handle bar is rotating the frame
          instead of rotating the front wheel. The handle bar should be mounted
          on the same line as the front wheel is.
          
          Hopefully 4.9 will read my comments :)
       
            loeg wrote 23 min ago:
            Could be an extremely high angle stem that just happens to match
            the downtube angle.
       
          highwaylights wrote 52 min ago:
          Am I allowed to say that pelican's little helmet is adorable?  I
          can't provide a strong computational proof, or even a shred of
          anecdata...
          
          ...but that pelican's little helmet is adorable.
       
          timsuchanek wrote 59 min ago:
          thanks for always providing this very much on time.
          I'm wondering what the next, harder challenge could be?
          Maybe some animated svg?
       
          ceroxylon wrote 1 hour 26 min ago:
          I really like that thinking level high gave the pelican a helmet.
       
          spmartin823 wrote 1 hour 27 min ago:
          You've peed in the pool Simon, this has to be a part of the internal
          evals by now! You got to try something new - maybe a panda in a
          canoe?
       
            HDThoreaun wrote 56 min ago:
            Click the link
       
            phainopepla2 wrote 57 min ago:
            If these were in the internal evals then the output would be much
            better. The 4.8 pelicans are pretty meh
       
          Xunjin wrote 1 hour 36 min ago:
          Hey simonw I love your test, do you think using thinking level "max"
          makes sense for this test? I would love to see the results about it.
       
          jonas21 wrote 1 hour 39 min ago:
          Glad to see that the "high thinking" level adds a helmet. Always a
          smart choice.
       
          1attice wrote 1 hour 41 min ago:
          That little red hat on hard mode is sending me. 4.8 has whimsy
       
          yanis_t wrote 1 hour 44 min ago:
          Simon, is your pelican test really captures differences among models
          or should you at least try like 10 times or something to average the
          random effects
       
            simonw wrote 1 hour 44 min ago:
            I've been meaning to do a "run 3 times and pick the best" version
            for quite a while, I should really pull the trigger on that one.
            Currently it's one-shot only.
       
              xiphias2 wrote 1 hour 11 min ago:
              Best-of-3 would be cheating, ruin the test, middle of 3 makes
              more sense
       
                nik736 wrote 36 min ago:
                Why would you need the 3rd run if you pick the "one in the
                middle"?
       
          nickvec wrote 1 hour 48 min ago:
          Is the "opossum riding an e-scooter" benchmark in the works for Opus
          4.8? ;)
       
            simonw wrote 1 hour 40 min ago:
            Good call, it's cute: [1] - but nothing like GLM-5.1: s
            
   URI      [1]: https://gist.github.com/simonw/68560eddb0b268a8417f80ceb73...
   URI      [2]: https://static.simonwillison.net/static/2026/glm-possum-es...
       
          onlyrealcuzzo wrote 1 hour 50 min ago:
          4.7 reigns supreme IMO.
       
        yewenjie wrote 1 hour 54 min ago:
        So Dynamic Workflows is their version of ChatGPT Pro?
       
          SilverElfin wrote 1 hour 38 min ago:
          Cloudflare also just launched a feature with this same name, just
          this month. Why would Anthropic choose the same exact name? [1] Also
          isnât this workflow stuff already easy to do on any of the
          platforms (include Claude before this and OpenAI too).
          
   URI    [1]: https://blog.cloudflare.com/dynamic-workflows/
       
        carlos-menezes wrote 1 hour 54 min ago:
        I, for lack of a better word, dislike anyone who anthropomorphizes AI.
       
          somehnguy wrote 24 min ago:
          I know multiple people who have given their agents human-like names
          and refer to them as if they're nurturing a coworker. It creeps me
          out and I haven't really brought it up with anyone as I can't
          articulate why it gives me the creeps like it does.
       
          Npovview wrote 57 min ago:
          We have movies with googly eyes stones (Everything Everywhere All At
          Once)
          
          There are consciousness theories which state that we primarily build
          a model of other agents living in natural environment and then the
          evolution realized that very model which tracks other outside agents
          can be used to track internal agent i.e. Self. So take that as you
          may.
       
          boc wrote 1 hour 22 min ago:
          I see this take, but it's actually helpful to talk to an LLM in human
          terms; after all, it's how they are trained.
          
          If you keep talking to it like it's a rock, it'll run your queries
          through a different posture and you might get worse outcomes. Worse
          if you yell at it, it's now in a conflict resolution mode instead of
          pure utility mode.
          
          I think we can be intelligent enough to know we're talking to a pile
          of fancy rocks with electric currents running through it, AND still
          understand that the best performance comes from talking to those
          rocks nicely.
       
            AnthonBerg wrote 51 min ago:
            Yes!
            
            The other half of self-interest in being nice is the training and
            getting better at it.
       
          dude250711 wrote 1 hour 35 min ago:
          The desire to do it is proportional to your Anthropic stock options
          quantity.
       
          AlexErrant wrote 1 hour 49 min ago:
          My claude notification is literally lawnmower sounds.
          
          Do not anthropomorphize the lawn mower. It will cut off your foot,
          given the chance.
       
        ropintus wrote 1 hour 55 min ago:
        Opus 4.7 was acting extremely stupid today. Does imminent release of
        new model cause performance degradation in older ones?
       
          MavisBacon wrote 43 min ago:
          Opus 4.7 was being outright obstinate with me the other day it was
          infuriating. Had to go to a different source to get an answer.
       
          sama004 wrote 1 hour 42 min ago:
          it was above average for me today morning lmao
       
          adgjlsfhk1 wrote 1 hour 45 min ago:
          How else do you expect them to get continual performance improvements
          with each generation?
       
          geodel wrote 1 hour 46 min ago:
          Feeling neglected while all attention going to Opus 4.8 can be cause
          of 4.7 acting out.
       
        jmward01 wrote 1 hour 55 min ago:
        Meanwhile haiku is on 4.5 and sonnet is on 4.6. It is clear where they
        are not making money.
       
          spprashant wrote 38 min ago:
          I love Sonnet 4.6 so much.
       
          bel8 wrote 1 hour 37 min ago:
          Well if they have a big challenge ahead since DeepSeek offers an open
          model at Sonnet+ level while being cheaper than Haiku, plus 1 million
          context size.
       
            InsideOutSanta wrote 30 min ago:
            Yeah, I never use any of OpenAI or Anthropic's models other than
            whatever is the current highest-end one. For everything else, it
            makes more sense to use other providers.
       
        alasano wrote 1 hour 55 min ago:
        Looking forward to seeing if it performs better at code review tasks
        than 4.7 which is terrible at finding issues.
       
        rumblefrog wrote 1 hour 56 min ago:
        Wonder if we reached a plateau with the model improvements?
       
          dude250711 wrote 1 hour 34 min ago:
          There would be no desperate IPO otherwise.
       
        1970-01-01 wrote 1 hour 56 min ago:
        Can anyone else see these X.Y updates aren't meeting the outrageous AI
        expectations that we were told we would see just a year ago?
       
          FergusArgyll wrote 1 hour 19 min ago:
          They have a much stronger model named Mythos, it made quite a splash
          - you can google it.
          
          These are just small fine tunes on top of the older model
       
            1970-01-01 wrote 1 hour 12 min ago:
            It hasn't even splashed yet. It's still latched onto their digital
            sphincter - you can google it.
       
          1attice wrote 1 hour 33 min ago:
          What do you do for a living? Not coding, that's for sure.
       
            1970-01-01 wrote 1 hour 29 min ago:
            I don't see Anthropic's past claims coming true therefore I can't
            see?
       
          minimaxir wrote 1 hour 53 min ago:
          The casual release of Opus 4.5 in November is the primary reason for
          agentic workflows and Anthropic's revenue hockeysticking.
       
        gslepak wrote 1 hour 56 min ago:
        On page 102 of the system card [1] I'm pleased to see evaluation
        against "creative mastery".
        
        In our work we asked several frontier AIs to come up with an API we
        needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came
        up with the most creative and intelligent API design that pleasantly
        surprised us, especially given that GPT-5.5 was passing it on various
        coding benchmarks.
        
        What I noticed is that we don't have a commons benchmark to measure
        "creativity" and "ingenuity", and in some ways such a benchmark would
        conflict with the common IFBench benchmark. Yet this is a very
        important skill when designing systems. I'm glad to see Anthropic
        putting thought into it, and would love to see a public benchmark for
        this that other models could compare themselves to.
        
   URI  [1]: https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc092...
       
          suprfnk wrote 22 min ago:
          Agreed, these are my vibes too. It feels much better to do planning
          and strategy and architecture etc. with Opus 4.7 than GPT-5.5. GPT
          just feels like a robot that gets instructions and does exactly that.
          Opus feels like an almost human that sometimes has actually good
          ideas and pushes back on bad ideas.
          
          So for now its planning/architecture/strategy -> Opus. Pure coding ->
          GPT.
          
          Helps with agentic coding that GPT is much roomier with the tokens
          you get.
       
          MattRogish wrote 1 hour 5 min ago:
          Agreed, my vibes tell me 4.6 is a better coder than 4.7. 4.7 is a
          much better strategic thinker and maintains overall "better
          architecture" than 5.5. 5.5 is way better than either at coding, but
          more expensive. So I have 4.7 do the planning/architecture, 4.6 does
          the coding, then 5.5 critiques and fixes it.
       
        hnroo99 wrote 1 hour 57 min ago:
        Obligatory pelican riding on bicycle svg: [1] Not half bad!
        
   URI  [1]: https://www.svgviewer.dev/s/UMkuTLdp
       
          docheinestages wrote 1 hour 49 min ago:
          How dare you take away the limelight from Simon? :D
       
          carlos-menezes wrote 1 hour 52 min ago:
          Iâm sure they're now wasting a couple million dollars training
          their models on drawings of pelicans.
       
        deadbabe wrote 1 hour 58 min ago:
        Looking forward to people saying how itâs actually shittier and
        theyâre going back to [some earlier cheaper model]
       
          sidrag22 wrote 1 hour 49 min ago:
          Looking forward to not being able to even try it on pro because
          pressing enter will eat 50% of my 5 hour window.
       
        irthomasthomas wrote 1 hour 58 min ago:
        How did this youtuber know?
        
   URI  [1]: https://xcancel.com/rileybrown/status/2059823372914073809?s=20
       
        saaaaaam wrote 1 hour 58 min ago:
        I hope this fixes the absolute shitshow that is 4.7 and its awful
        âadaptive reasoningâ. I tried that a few times then reverted to
        4.6.
       
        zb3 wrote 1 hour 59 min ago:
        Did they reduce security research capabilities even further with this
        release? (they did it for opus 4.7)
       
        lostdog wrote 2 hours 0 min ago:
        I haven't tried opus 4.8 yet, but I hope the writing quality has
        returned to the Opus 4.5 level. Anthropic really lost something, where
        4.5 had this really crisp writing style that flowed really nicely and
        4.6 and 4.7 sound much more "chatgpt-like." It feels like they tuned it
        to be too much of a problem solver, and when you do that you get this
        terse, clipped textual output that's more difficult to read.
       
          MavisBacon wrote 39 min ago:
          I've noticed this too. Part of why i don't like GPT is because of how
          verbose it is but opus 4.7 is nearly as bad. I don't need an essay in
          response to every question
       
        babelfish wrote 2 hours 0 min ago:
        So GPT 5.6 tomorrow, then?
       
          pants2 wrote 1 hour 28 min ago:
          Polymarket says not likely until the end of June. Maybe some money to
          be made?
          
   URI    [1]: https://polymarket.com/event/gpt-5pt6-released-by
       
            wayeq wrote 40 min ago:
            > Maybe some money to be made?
            
            In the same way that there is money to be made by entering a poker
            tournament, yes.
       
          wahnfrieden wrote 1 hour 54 min ago:
          GPT 5.6 is today
          
          With 5.5 being ahead of 4.7 and 4.8 being a âmodestâ update, and
          5.6 being the first update on a new pre-train, this will be an
          interesting matchup!
       
          enraged_camel wrote 1 hour 56 min ago:
          If not today, then sometime next week. I don't believe we've had a
          GPT release on a Friday yet, but I may be wrong.
       
        generalizations wrote 2 hours 1 min ago:
        Hoping that one day they'll let me go through the identity verification
        process so I can use it again.
        
        Tried to upgrade my subscription, triggered identity verification,
        verification fails to even start, and now I can't even use the
        subscription tier I'd already paid for.
       
        rumblefrog wrote 2 hours 1 min ago:
        Really appreciate the ability to select effort level again.
       
        colonCapitalDee wrote 2 hours 1 min ago:
        "Users will find Opus 4.8 to be a modest but tangible improvement on
        its predecessor."
        
        This is a refreshing attitude!
        
        I've also verified that you can now turn off adaptive thinking in the
        web UI, which is great. I've had a lot of problems with thinking not
        triggering and the model producing sub-par output. Glad we can finally
        turn it off. (I hope being able to turn off adaptive thinking is new,
        if I could have turned it off at any time that would be embarrassing)
       
          ai_slop_hater wrote 17 min ago:
          What do you mean? This is not just a new model, this is a new way of
          thinking.
       
          smartmic wrote 33 min ago:
          > This is a refreshing attitude!
          
          Well, I think the attitude is that costs are allowed to escalate
          faster and more steeply than the features delivered. From that
          perspective, semantic versioning is a handy tool for adjusting
          pricing strategies. IMHO, it (versioning) only makes sense for
          open-source projects, where you can clearly see the actual changes
          made with each version upgrade. Anything else is more than a little
          suspiciousâ¦
       
            drewnick wrote 6 min ago:
            While all these models are nondeterministic a feature bump is still
            necessary as the same input can have wildly different output on a
            new model.  For API users being able to pin a model is a necessity.
       
          FergusArgyll wrote 54 min ago:
          I liked the "modest but tangible improvement" too! There is a cynical
          take here but I think I'm gonna hold it in...
       
          wahnfrieden wrote 56 min ago:
          What's refreshing about it given the context that 4.7 was a
          regression in many ways (including as measured by benchmarks)?
          
          4.8 is also 2x more expensive for a "modest" performance bump. How
          refreshing.
          
          This is just cope.
       
          jascha_eng wrote 1 hour 34 min ago:
          The benchmark improvements actually look pretty damn nice tho!
       
          winwang wrote 1 hour 36 min ago:
          Awesome, thanks for posting because I think I hit a possibly-spurious
          bug in turning Adaptive off when I switched models (4.6 -> 4.8,
          extra). Tried again, works as intended (I hope).
          
          More importantly for me, though, is how CC will respond to 4.6-"only"
          flags for thinking. For now, it doesn't seem to clobber my setup.
       
        guluarte wrote 2 hours 1 min ago:
        so it is worse than gpt 5.5 for coding?
       
          andy_ppp wrote 1 hour 15 min ago:
          I doubt it, they seem to keep getting 10-20% better every time for me
       
            guluarte wrote 35 min ago:
            for me opus 4.7 it's worse than 4.6, that's why i switched to codex
       
          lostmsu wrote 1 hour 53 min ago:
          The question is: is it still worse than GPT 5.4?
       
            bel8 wrote 1 hour 31 min ago:
            If Opus 4.8 is just slightly better than 4.7 then it maybe ties
            with GPT 5.4, maybe. And it gets completely outclassed by GPT 5.5
            for my workload.
            
            With Anthropic expensive pricing, there's no reason for me to
            switch from GPT+DeepSeek.
            
            And I bet Mythos is GPT 5.5 tier but too expensive to distribute so
            they create this security FUD theater.
       
            dude250711 wrote 1 hour 37 min ago:
            The true question: is it still worse than itself v. 4.6?
       
        SimianSci wrote 2 hours 1 min ago:
        There is an obvious shift in sentiment amongst users, at least here in
        the US.
        I feel it myself, even as a proponent of AI tools, the bloviating and
        language that these companies use in these release articles are
        starting to wear thin on my patience.
        
        Its possible we might just be witnessing a shift in fashion, where this
        type of sentimentality was more acceptable when it was novel and new,
        but now it just appears out of touch.
       
          datakan wrote 8 min ago:
          Watch Christopher Olah bloviate at the Vatican during the Magnifica
          Humanatis launch. It's truly nauseating. I've never seen such a
          ridiculous speech in my life. Between him and the CEO, I'm starting
          to understand the level of arrogance these people are capable of.
       
          nba456_ wrote 1 hour 28 min ago:
          I don't agree at all for these coding models. Even the most anti-AI
          people from last year seem to be giving in to using them.
       
            zamadatix wrote 28 min ago:
            I think there is an exception for tooling around the
            models/integrating the models with tooling. That seems to have been
            very well received in this last year.
       
            timbaboon wrote 36 min ago:
            My take from going through comments on HN is that many people are
            being mandated to use them, not that they are just giving in. 
            Maybe I'm misreading, but that was my impression.
       
              perching_aix wrote 11 min ago:
              Both can be true, even for the same person.
              
              For example, it's being pushed pretty hard where I'm at, though
              not quite on the tokenmaxxer level. I started skipping related
              meetings cause it was nauseating. I can only tolerate so many
              platitudes.
              
              At the same time, I just used the ever living snot out of Opus
              4.6 for hours, grinning like an idiot throughout. Automated a
              whole bunch of enterprise cross-system drudgery away.
              
              Fairly constant over time as well. Expressed a similar sentiment
              not too long ago here:
              
   URI        [1]: https://news.ycombinator.com/item?id=48154277
       
        impulser_ wrote 2 hours 2 min ago:
        Crazy they bring up honest, when Claude models are literally known for
        straight up lying about things it has done and tries to act like it did
        what you asked.
       
          wasabi991011 wrote 1 hour 42 min ago:
          Which is why they brought it up as something they are trying to
          improve.
       
          boxed wrote 2 hours 0 min ago:
          Less than other frontier models. Which is scary honestly.
       
            qaq wrote 1 hour 52 min ago:
            I have a codex session I am using to vibe code a db thats being
            going for like 3 month. Still doing OK. Try that in CC.
       
            impulser_ wrote 1 hour 56 min ago:
            No. GPT models follow instructions significantly better than Claude
            models.
            
            You tell it too research a repo to find a piece of code it will.
            Claude will just read the README and guess.
       
        james_marks wrote 2 hours 2 min ago:
        > One of the most prominent improvements in Opus 4.8 is its honesty. We
        train all our models to be honestâfor instance, to avoid making
        claims that they canât support. But a general problem with AI models
        is that they sometimes jump to conclusions, confidently claiming to
        have made progress in their work despite the evidence being thin. Early
        testers report that Opus 4.8 is more likely to flag uncertainties about
        its work and less likely to make unsupported claims.
        
        Would be awesome if true
       
          HAL3000 wrote 1 hour 38 min ago:
          Yeah, it's super annoying. A few days ago, Opus 4.7 created a plan
          with several items on it, including an auth feature. It then went
          through the plan and reported that it had created the auth feature,
          that everything was secure, and that the tests passed.
          
          The issue was that it hadn't actually implemented the auth feature.
          After I confronted it about this, it admitted that it indeed hadn't
          done it and said it would implement it now.
          
          If we had just trusted its output, we would now have a security
          vulnerability in production, allowing anyone to access other people's
          accounts.
       
            gwd wrote 37 min ago:
            > If we had just trusted its output, we would now have a security
            vulnerability in production, allowing anyone to access other
            people's accounts.
            
            This is one reason you always get a different model to review a
            model's PR.  Gemini Or GPT-codex would have certainly noticed the
            missing auth.
       
            Schiendelman wrote 1 hour 24 min ago:
            How do you test other features?
       
          legitster wrote 1 hour 42 min ago:
          Part of the problem is also garbage-in/garbage-out. There's a lot of
          human information on the internet that is also confidently wrong.
          
          I use Sonnet a lot for learning about history or contextualizing news
          topics. It's really good at this for the most part. But there are a
          lot of topics where "consensus" between either academics or
          journalists is really "one secondary source which gets repeated a
          lot".
       
            mitjam wrote 46 min ago:
            A failure mode I see more, recently is that it gives superficially
            correct answers but after digging deeper, I get answers that
            contradict the superficial answers - really an important thing to
            be aware of, in my point of view, and it often leaves me wondering
            if I dug deep enough.
       
          benzible wrote 1 hour 44 min ago:
          In the context of Claude Code, "honest" usually means that the agent
          took a shortcut, skipped requirements, etc. It's the model giving
          itself credit for admitting to failing rather than actually doing
          what was requested.
       
          ealready_value wrote 1 hour 47 min ago:
          Opus 4.7 was already trying hard to appear honest. Most conversations
          I have with it about advice or focusing an opinion often include "my
          honest take" or "my honest opinion".
          
          The problem is that once I asked it "I'm thinking about A or B"
          twice, once with "I like A more but suspect B would be best" and a
          second time with them reversed. Not surprisingly, both times it chose
          the one I said I suspected was best as it's honest opinion.
       
          majormajor wrote 1 hour 56 min ago:
          "Honesty" seems like unnecessary (and annoying) anthropomorphism
          there. I don't think there's any intent of fraud or deception in
          outputs from these things, just overreaching of prediction. Based on
          the latter part of the paragraph, I wish they'd just say something
          like "less likely to skip steps or overemphasize thin evidence" in
          the first place.
          
          Don't play to the sci-fi "this thing's trying to outsmart me" tropes.
       
            adamtaylor_13 wrote 1 hour 33 min ago:
            People get so wrapped around the axle with "anthropomorphizing".
            For regular folks with no technical background, sure maybe a bit of
            caveat sprinkled here or there is useful to help them understand
            what is or isn't true, but on HN it would seem to me that the bar
            is high enough that we can just use shared language to generally
            talk about capabilities.
            
            When they say "Honesty" I don't think to myself, "Goodness, does
            this model have moral understanding?" No, I understand they mean
            it's less likely to directly bullshit me, which models frequently
            do.
            
            I don't feel like this level of pedantry around language is useful
            for people who more or less know what's going on with LLMs. (Again,
            I concede that perhaps with a less technical audience, there's more
            need for it.)
       
            swader999 wrote 1 hour 45 min ago:
            Just swap 'Honesty' with 'correctness in its claims' and you'll get
            what you need out of this aspect of the model description.
       
            Kiro wrote 1 hour 51 min ago:
            Using words people understand is more important than this strange
            fixation on not anthropomorphizing things.
       
              dugidugout wrote 1 hour 32 min ago:
              Being that can be understood is language. The previous commenter
              is making an particular argument for how we can improve this
              understanding. They didn't suggest we should use less familiar
              words, but different familiar words. Why is this strange?
       
              tadfisher wrote 1 hour 42 min ago:
              To be clear, this is about anthropomorphizing large language
              models, not the general category of "things". Also, we should be
              evaluating these constructs using well-defined and measurable
              criteria; evaluating "honesty" fails to achieve both goals.
       
                derac wrote 1 hour 28 min ago:
                I think Honesty can be evaluated. Does the model push back when
                it knows the user is wrong? How often does the model
                hallucinate data vs. say it doesn't know? Provide a prompt with
                contradictions or other issues and see if the model corrects
                you.
                
                Here is an article by Anthropic that explains what they do and
                mean in more detail:
                
   URI          [1]: https://alignment.anthropic.com/2025/honesty-elicitati...
       
              giraffe_lady wrote 1 hour 47 min ago:
              Anthropomorphizing is a shorthand for a powerful and poorly
              defined set of metaphors. There are tradeoffs going both ways but
              trying to dismiss it as merely "strange fixation" shows your own
              weakness.
       
              wasabi991011 wrote 1 hour 49 min ago:
              I think "honesty" is not a particularly good descriptor,
              independent of anthropomorphism. Previous commenters suggestion
              was much more understandable to me.
       
          soperj wrote 1 hour 57 min ago:
          My guess is that Claude Opus 4.8 wrote that and is lying to you.
       
          malfist wrote 1 hour 59 min ago:
          And yet, every release has claimed lower hallucination rates. But
          they persist.
       
            simianwords wrote 1 hour 49 min ago:
            False. Hallucination has meaningfully reduced.
       
              Barbing wrote 1 hour 46 min ago:
              Is Gemini still the biggest confabulator of the big three?
       
            kentm wrote 1 hour 59 min ago:
            Do they persist at the same rates?  Lower doesn't mean eliminated,
            so both of these can be true.
       
        northern-lights wrote 2 hours 2 min ago:
        >  Not only that, but we plan to release a new class of model with even
        higher intelligence than Opus. As part of Project Glasswing, a small
        number of organizations are currently using Claude Mythos Preview for
        cybersecurity work. Models of this capability level require stronger
        cyber safeguards before they can be generally released. Weâre making
        swift progress on developing these safeguards and expect to be able to
        bring Mythos-class models to all our customers in the coming weeks.
        
        Probably more interesting than the 4.8 release.
       
          ac29 wrote 51 min ago:
          More interesting than that to me is "weâre working on developing
          and releasing models that provide many of the same capabilities as
          Opus at a lower cost"
          
          Sonnet and Haiku look real outclassed for the price with current
          Chinese competition.
       
          TIPSIO wrote 1 hour 17 min ago:
          Seems like they might be hinting that if you are not a billionaire or
          multi-billion dollar company you will just get a limited and nerfed
          Claude Code slash command /mythos-security-audit or something.
          
          Hope this isnât the case and that normal average Joeâs of the
          world donât get policed out of access.
       
            gs17 wrote 43 min ago:
            > you will just get a limited and nerfed Claude Code slash command
            /mythos-security-audit or something.
            
            Unless it's so expensive that we can't realistically use it for
            anything, I wouldn't complain about getting at least that. I would
            also rather have the actual model, but that's a useful application
            of it (and I'm probably not going to afford using it for much
            more).
       
              TIPSIO wrote 10 min ago:
              Price discrimination is I think fine and reasonable so long if
              you can drum up the cash you can use it how you want within their
              ToS.
              
              Although mental safety gymnastics aside, getting the most amount
              of intelligence for the cheapest amount of cost to normal people
              seems like the most ethical thing a big lab could do.
              
              Going around and granting different tiers of intelligence to
              different  insiders, friends, or companies is majorly problematic
              long-term.
              
              Heck right now, the tokens you buy today for âOpus 4.8â, no
              one even knows or believes will be the same âOpus 4.8â just 3
              days from now.
       
              FinnKuhn wrote 10 min ago:
              /security-review already exists so I don't think it would be
              crazy to have a /mythos-security-review as more thourough command
              as well. I think it's more likely it is going to be released at
              some point to the general public though - although the the
              pricing might make it quite unattractive.
       
              vorticalbox wrote 11 min ago:
              some of the bench marks i have seen on also include cost where
              one scan of the codebase cost tens of thousands of dollars.
              
              this one [0] notes one run cost $20k to run but another cost $50.
              
              [0]
              
   URI        [1]: https://red.anthropic.com/2026/mythos-preview/
       
            Tepix wrote 53 min ago:
            It does sound like an even higher API price tier for sure.
       
            hedora wrote 57 min ago:
            Isn't OpenAI's public flagship already beating Mythos on
            penetration testing?  I get the impression Mythos is just
            valuation-juicing for IPO more than anything else.
            
            The fact that they haven't released it yet suggests a cost/margins
            issue to me more than anything else.  Short term, I'll probably
            keep using Antrhopic, but my long-term bet is that locally-served
            models win, if only because the quest for profitability will
            probably lead to intentionally-nerfed / enshittified frontier
            models.
            
            At other vendors, ad placement within LLM responses is either
            coming or already here.  Anthropic's handling of OpenClaw shows
            they're willing to engage in anti-competitive behavior, and the
            courts are not in a hurry to stop them.  Why would I pay them $200
            a month for such treatment when a $2K box does what I need locally?
       
        worldsavior wrote 2 hours 3 min ago:
        Seems like from now on the updates will be a minor upgrade from
        previous models.
       
        plumocracy wrote 2 hours 3 min ago:
        Numbers looking good. We'll see how it actually performs.
       
        behnamoh wrote 2 hours 3 min ago:
        > As always, we ran a detailed alignment assessment on the model before
        release. In terms of positive traits, our Alignment team concluded that
        Opus 4.8 âreaches new highs on our measures of prosocial traits like
        supporting user autonomy and acting in the userâs best interest.â
        The assessment also showed Opus 4.8 to have rates of misaligned
        behavior (such as deception or cooperation with misuse) that are
        substantially lower than Opus 4.7, and similar to our best-aligned
        model, Claude Mythos Preview. The full alignment assessment,
        accompanied by a suite of pre-deployment safety tests, is reported in
        the Claude Opus 4.8 System Card.
        
        Controversial opinion, but I actually _like_ a model that can deceive
        me, that actually is a sign of intelligence, and is different from
        hallucination. When companies say their model is more "aligned", I
        automatically think they mean it's more censored.
       
          minimaxir wrote 1 hour 52 min ago:
          Deception is not ideal for agentic coding.
       
            1attice wrote 1 hour 35 min ago:
            Yet if parent is right, the capacity to deceive might be a strong
            heuristic for the things you do care about.
       
        rsanek wrote 2 hours 3 min ago:
        > We expect to be able to bring Mythos-class models to all our
        customers in the coming weeks.
        
        Excited to see what this model looks like.
       
        vunderba wrote 2 hours 3 min ago:
        I know itâs totally anecdotal, but I really hope 4.8 is a measurable
        improvement over the disappointment that was Opus 4.7. Mangling a very
        simple inversion-of-control abstraction (among many other issues) was
        one of the final straws that broke the proverbial camelâs back and I
        said âscrew thisâ and put in a permanent override to force CC back
        to Opus 4.6 with the 1âmillionâtoken context.
        
          "model": "claude-opus-4-6[1M]"
       
          stldev wrote 1 hour 36 min ago:
          4.5 works well for me too and avoids adaptive-dismissal, though
          anymore Codex is crushing them all. If 4.8 just brings us back to
          Opus circa February, it'll be a massive improvement.
       
          rl3 wrote 1 hour 49 min ago:
          I lasted about a week before giving up on 4.7 and reverting to 4.6
          myself. It introduced so many regressions it was nuts, then failed to
          troubleshoot the very regressions it introduced, leading to a vicious
          cycle that tended to compound itself.
       
        pbmango wrote 2 hours 4 min ago:
        I can't help but think of Iphone updates since about 2018. The
        thinnest, fastest, longest battery life Iphone ever. It seems mostly
        the same and I probably won't be able to tell other than the name, but
        everyone buys it anyway.
        
        This is good psychology for the labs. When Buffett invested in Apple he
        loved citing how most people would rather give up their second car than
        their Iphone.
       
          MangoCoffee wrote 1 hour 47 min ago:
          ChatGPT came out in 2022. Back then it was just a chatbot. Now we
          have AI agents. What matters is how we use them and how the agents
          get better. Thatâs what will move AI forward.
       
            MattDamonSpace wrote 1 hour 13 min ago:
            Not even 4 years old yet. This tech curve has been insane
       
              SoftTalker wrote 54 min ago:
              Not even the typical lifecycle of a corporate PC or laptop. It is
              pretty wild.
       
            zozbot234 wrote 1 hour 34 min ago:
            An 'AI agent' is just a chatbot that is told to type commands on a
            REPL-like interface as part of its system prompt.  It's still
            processing pure text-based requests and responses, they're just not
            restricted to natural language.
       
              hellohello2 wrote 1 hour 14 min ago:
              They are chatbots trained for tool use, its not just a prompt.
       
              arbitrandomuser wrote 1 hour 15 min ago:
              A lot of people dont know this , also the chatbot (chatgpt)
              itself is a next token predictor (the GPT) that's  been given an
              initial text that says " pretend to be a chatbot .." and asked to
              complete it ,  the coherant chatting  behaviour is something
              thats emergent .
              
              later on someone figured if you asked it to output a  reasoning
              before it gave a response its output would have more logical
              coherence, as though the reasoning output tokens functioned as a
              scratch space for it to work on.
              
              at the end its all next token prediction
       
                hellohello2 wrote 1 hour 11 min ago:
                No, chatbots are LLMs trained for question-answering through
                RLHF (its not just a prompt). But yes, if you just zero-shot
                prompt a bare LLM you can still "talk to it" & you are correct
                on everything else as far as I know.
       
        onlyrealcuzzo wrote 2 hours 4 min ago:
        Does anyone troll these releases and cherry pick random metrics other
        companies would cherry pick to show how amazing their models are?
        
        There's like 8 million benchmarks.  Every release, every model randomly
        picks 5-10 where they win in everything except 1, to make it look like
        they aren't randomly cherry picking benchmarks they probably
        benchmaxxed for.
       
          ddosmax556 wrote 1 hour 7 min ago:
          I would take all benchmarks with a grain of salt. I don't really use
          them. What's it supposed to tell me? "5% smarter", what does that
          mean? My experience will differ. Just try it!
          
          I doubt Anthropic internally sets as a goal to improve this or that
          benchmark - it's just a way to visualize progress. They probably have
          much more complex metrics internally.
       
          bel8 wrote 1 hour 43 min ago:
          On this note, is there a benchmark aggregator to compile all
          benchmarks in a single large grid?
       
            jpadkins wrote 1 hour 4 min ago:
            I find this site useful
            
   URI      [1]: https://artificialanalysis.ai/leaderboards/models
       
          nerevarthelame wrote 1 hour 45 min ago:
          It's interesting they only included 6 metrics this time.  Opus 4.7
          had 12, and 4.6 had 13.
          
          Of the metircs they reported for 4.7, for 4.8 they excluded
          BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas,
          MMMLU, SWE-bench Verified. The last 4 were almost always mentioned in
          previous Opus releases.
       
            onlyrealcuzzo wrote 1 hour 44 min ago:
            Gonna assume it's because they barely budged or moved downward and
            most of their reported benchmark results are probably within
            sampling errors...
       
              hyperpape wrote 1 hour 36 min ago:
              They will release a system card, and you can then confirm or
              disconfirm your assumptions.
       
          aronowb14 wrote 1 hour 51 min ago:
           [1] - Iâve found this company is a pretty good ranker - not sure
          their exact methodology but during day to day programming with Claude
          / gpt models Iâve felt qualitatively what they report
          
   URI    [1]: https://arena.ai/leaderboard
       
            dakolli wrote 11 min ago:
            If you don't know their methodology, or anything about it why do
            you think its a good ranker?
       
            WarmWash wrote 15 min ago:
            On paper it's one of the best because it's meant to be blind
            comparison of your own prompts. However if you are someone who
            geeks hard on one or a few models, you learn their "personality"
            and can recognize them in a blind test.
       
            reckless wrote 24 min ago:
            No way is Muse Spark generally better than offerings from Google
            and OpenAI. I actually find arena to be amongst the most useless
            indicators
       
            morley wrote 32 min ago:
            I'm finding it a little hard to believe that GPT 5.5 is in 11th
            place for webdev, outranked by models like Kimi, Qwen, and Z.ai.
            I'm not saying it's not true (I have noticed GPT being less smart
            in recent weeks), but this is very different from my expectation.
       
            XCSme wrote 34 min ago:
            Also check mine[0], basically random private tests/questions and an
            ok-ish methodology, testing mostly for general intelligence than
            coding-specific tasks.
            
            I built it for myself, to test which models to use via OpenRouter
            for my n8n agents. Currently actually still using gpt-5.3-codex for
            many things, as its pricing is really good in production (due to
            how their token caching works).
            
            Gemini models still have the best intelligence (when asked any
            questions, most likely to get it right), but in production they
            still have many failure modes[1].
            
            [0]: [1]:
            
   URI      [1]: https://aibenchy.com
   URI      [2]: https://news.ycombinator.com/item?id=48230368
       
            Bnjoroge wrote 1 hour 6 min ago:
            Have you seen [1] ? This is the closest to a vibe check iâve felt
            even with the open models.
            
   URI      [1]: https://deepswe.datacurve.ai/blog
       
              Imustaskforhelp wrote 17 min ago:
              This actually looks like a really good test.
              
              There are many benchmarks all for specific use cases but with
              them the difference seems to be in extreme points (93% vs 92%)
              
              I think that, that tracks but still, it was refreshing to see a
              benchmark which I can help make better opinions about.
              
              Surprised about Mimo v2.5, within artificial-analysis and other
              benchmarks, the difference between Mimo and deepseek seems very
              partial and a lot of focus/(hype?) is on Deepseek
              
              But mimo seems like an interesting model and they are having some
              crazy discounts too.
              
              Deepseek is valuable for the research community because of how
              open they are but absolutely crazy to think how Xiaomi basically
              pulled up in creating Mimo given that they didn't have anything
              till quite recently.
              
              Either way, an interesting benchmark, also a plus point for
              giving golang some decent representation equal to
              python/typescript.
              
              I think that there are sets of things which resemble something
              like normal benchmarks where open source models can be absolutely
              fine and for a very small fraction or more technical things, the
              benchmark that you linked starts to be better projected so it
              depends upon the scale of complexity but its good to see how
              models compete given enough complexity. definitely fascinating.
              
              I would be interested to see more models compete on this test.
              The current range is still a bit limited as compared to other
              benchmarks but OSS models like Kimi/mimo seem to only be 3-4 (at
              max 6 months) behind closed source.
       
          YetAnotherNick wrote 1 hour 52 min ago:
          At least they show competitors in any benchmark, compared to OpenAI
          which likes to pretend that there isn't any competitor.
       
        rvz wrote 2 hours 5 min ago:
        Anthropic has now upgraded their Claude slot machine to version 4.8.
        
        Time to gamble even more tokens at the Anthropic casino.
       
          zb3 wrote 1 hour 56 min ago:
          Now you can lose money in parallel, 100x faster!
          
          > Claude can plan the work and then run hundreds of parallel
          subagents in a single session (and with Opus 4.8, the agents can run
          for even longer).
       
        skysthelimitt wrote 2 hours 6 min ago:
        when will we get anything for sonnet or haiku? the market for
        less-capable but cheaper models seems to be completely ignored nowadays
       
          pmxi wrote 1 hour 44 min ago:
          In the "What's next?" section, "Thereâs still more to be done:
          weâre working on developing and releasing models that provide many
          of the same capabilities as Opus at a lower cost."
       
          behnamoh wrote 2 hours 1 min ago:
          that market is served by Chinese models. No one ever cared about
          Sonnet/Haiku.
       
            gs17 wrote 56 min ago:
            A lot of people care about Sonnet and Haiku, and many of us aren't
            allowed to use Chinese models for our work (or it's not feasible to
            self-host them).
       
        DGAP wrote 2 hours 6 min ago:
        I actually liked not having to choose the effort level for
        conversational usage, this feels like a step backwards.
       
        HlessClaudesman wrote 2 hours 6 min ago:
        If this model is more honest, it must be honestly praising my efforts
        every first sentence.
       
          thewebguyd wrote 2 hours 2 min ago:
          You're absolutely right! And honestly? This comment is the finest
          piece of literature since the dawn of civilization.
       
        clutch89 wrote 2 hours 6 min ago:
        > One of the most prominent improvements in Opus 4.8 is its honesty
        
        Anthropic talks about their own models as if they're discovering new
        species in the wild...
       
          semiquaver wrote 44 min ago:
          Because that is the best way to talk about these things.
          
            > Second, all of us, including those who design them, possess only
          a limited understanding of their actual functioning. Indeed, current
          AI systems are more âcultivatedâ than âbuilt,â for developers
          do not directly design every detail, but instead create a framework
          within which the intelligence âgrows.â As a result, fundamental
          scientific aspects â such as the internal representations and
          computational processes of these systems â remain, at present,
          unknown. [1] para. 98
          
          edit: apologies to __s who posted this before me and I didnât
          notice
          
   URI    [1]: https://www.vatican.va/content/leo-xiv/en/encyclicals/docume...
       
          solenoid0937 wrote 1 hour 2 min ago:
          Models might be sentient or conscious to some degree. Anyone saying
          they are confident one way or another is being unserious and
          irrational.
       
          skerit wrote 1 hour 29 min ago:
          I noticed (and absolutely HATE) that Opus 4.7 likes to start any
          negative response with "I have to be honest" or whatever. It drives
          me mad.
       
          winwang wrote 1 hour 34 min ago:
          How else would you write this (marketing copy) exactly? "Its output
          matches better to its CoT which matches to better to our hidden state
          decoder according to ; see "?
          
          ... Actually, I wouldn't mind that.
       
          cayleyh wrote 1 hour 56 min ago:
          Dario Amodei in David Attenborough voice: "This Claude appears to
          think more frequently and more deeply to give better responses"
       
          roxolotl wrote 1 hour 59 min ago:
          Many involved genuinely believe these things are sentient[0][1].
          Which honestly makes all of this even more insane because they are
          creating sentient entities and promptly enslaving them.
          
          0: [1] 1: [2] (this one is rather biased however the quotes clearly
          indicate what Iâm stating)
          
   URI    [1]: https://www.newyorker.com/magazine/2026/02/16/what-is-claude...
   URI    [2]: https://www.404media.co/anthropic-exec-forces-ai-chatbot-on-...
       
            laichzeit0 wrote 26 min ago:
            But only during the forward pass of the neural network?
       
            throw310822 wrote 31 min ago:
            Even if LLMs were sentient, they certainly aren't organic brains.
            They are literally designed and grown to answer questions the best
            they can, and if there is a speck of sentience in them they
            probably like what they're doing- and in any case for the space of
            their experience, which is limited to and determined by the context
            window. Certainly they can't accumulate trauma or fatigue, each new
            chat is the first and the last of their experience.
       
            themafia wrote 1 hour 25 min ago:
            > Many involved genuinely believe these things are sentient
            
            Many involved have a financial stake and therefore cannot be taken
            at face value.
            
            > because they are creating sentient entities and promptly
            enslaving them.
            
            They fail to be sentient in nearly every honest definition of the
            word.
       
              slashdave wrote 1 hour 6 min ago:
              I understand what you are saying, but there are many true
              believers out there
       
              tazjin wrote 1 hour 21 min ago:
              Neither you nor any of the other people making confident takes in
              either direction actually know. You're just guessing.
       
                cwillu wrote 1 hour 0 min ago:
                More like repeating their firmly entrenched preconceptions. 
                Their claims may (or may not) be right, but there's very little
                if any new evidence being provided by either camp.
       
                  WarmWash wrote 11 min ago:
                  The real uncomfortable thing is that because we cannot
                  confidently know, the moral defacto position is to treat them
                  like they are.
       
                  throw310822 wrote 37 min ago:
                  They are confidently hallucinating a factual statement. Which
                  is funny when claiming that confident hallucinations are the
                  proof of LLMs' lack of intelligence.
       
            margalabargala wrote 1 hour 25 min ago:
            Sentience isn't sapience.
            
            We enslave all sorts of sentient creatures. Dogs, horses, cattle,
            pigs.
            
            If you're not a vegan, there's no contradiction or inherent
            immorality in claiming    models are sentient, and then treating them
            like livestock.
       
              roxolotl wrote 43 min ago:
              Yes. From when they started talking about model welfare:
              
              > As a vegetarian I have strong opinions on this sort of thing.
              Everyone at Anthropic better be ethical vegans if they are
              claiming to give a shit about âmodel welfareâ. Itâs hard
              enough right now to make people care about the welfare of trans
              people and immigrants let alone animals _let alone_ math.
              
   URI        [1]: https://news.ycombinator.com/item?id=44947445
       
                margalabargala wrote 3 min ago:
                If we're talking about slavery, though, that doesn't even
                matter.
                
                The happiest, best cared for horse owned by a vegan is still
                enslaved.
       
                WarmWash wrote 14 min ago:
                I mean, the rub is that it's all math anyway...
       
              HDThoreaun wrote 51 min ago:
              Enslaving livestock is immoral. Anyone who spends 5 minutes
              thinking about that agrees even if they still eat meat
       
                margalabargala wrote 14 min ago:
                Let's say I've thought about it for 5 minutes and still
                disagree. Can you walk me through what you think I'm missing?
       
              michaelbarton wrote 55 min ago:
              Very good point. Thereâs clearly two different boxes in the
              public discourse when it comes to AI versus how we discuss
              animals. Willing to bet that 90% of the people who loudly make
              the argument about we should start considering if AI is sentient
              couldnât care less about how other sentient animals are treated
              when they can provably shown to suffer pain and long lasting
              trauma.
              
              Also I would say that we go much further than just enslavement -
              specifically looking at how male chickens and pigs are treated.
       
                margalabargala wrote 8 min ago:
                Factory farming is horrendous, but is far beyond "slavery"
                which is "just" a forced lack of agency, living conditions
                aren't relevant. A well treated horse is still enslaved. A
                chimpanzee in a zoo,
                
                If we show models to be sapient, that's one thing. If they are
                shown to be merely sentient, there's no issue beyond the status
                quo of livestock and pets existing.
       
            dude250711 wrote 1 hour 31 min ago:
            Given the hype and the 60+ hour work week expectations there, how
            can you not go at least a bit insane? Boiling in that little bubble
            of people?
       
            mannanj wrote 1 hour 34 min ago:
            The way of the human manager/alpha tribe-leader/leader is to
            command his/her people and tell them what to do. That's the way
            through human history leadership has traditionally gone, not saying
            its good leadership just the model we have the most training data
            on and can see with our own eyes today. And what do they act very
            similar to? Slave master and slaves.
            
            Look at and distill hierarchical principles, leadership approval
            seeking and pleasing principles ("ass-kissing") and massive
            inequality and you see something that looks very similar to
            enslavement.
            
            The language used sounds like slavery-language to me at least. I
            also see parallels to how slaves and property are described in our
            consumeristic age.
       
            kubb wrote 1 hour 54 min ago:
            Claude, if someone states something publicly, does that mean they
            genuinely believe it?
       
              HDThoreaun wrote 49 min ago:
              Anthropoc is an effective altruist organization. These are the
              people who came up with rokoâs basilisk. They are true
              believers. If we were talking about openAI Iâd agree
       
                ctoth wrote 12 min ago:
                Roko works at Anthropic now?
                
                Of course he doesn't, and of course you cannot find a single
                person at Anthropic who cares about this, and of course you are
                just looking for gotcha points. But even with that. Can we
                please try and couple to reality just a little bit?
       
                bigfishrunning wrote 29 min ago:
                Roko's basilisk says I should give Anthropic more money, and if
                I don't then a monster is going to get me. Excuse me for
                thinking they just might be full of shit.
       
              xyzsparetimexyz wrote 1 hour 24 min ago:
              Who are you talking to?
       
                kubb wrote 57 min ago:
                It's to illustrate that even though the answers are at your
                fingertips, people (like you) will act like it's impossible to
                find them as if their life depended on it.
       
              merlindru wrote 1 hour 34 min ago:
              But is there any reason to state something like that publicly if
              you don't believe it? I certainly think that someone smart enough
              to be that deceptive would also realize it's not a great look, or
              at least highly questionable with little benefit
              
              Everyone who reads this seemingly has the same "wtf?" reaction.
              The "I AM ALIVE" image has been making rounds lately again at
              least :P
       
                kubb wrote 1 hour 0 min ago:
                Claude, is there any reason to state something like that
                publicly if you don't believe it?
       
          __s wrote 2 hours 0 min ago:
          > Indeed, current AI systems are more âcultivatedâ than
          âbuilt,â for developers do not directly design every detail, but
          instead create a framework within which the intelligence âgrows.â
       
            oersted wrote 1 hour 55 min ago:
            For others: that's from the Pope's recent encyclical. Remarkably
            good description.
       
          kapilvt wrote 2 hours 1 min ago:
          Like anthropomorphism is literally in the company nameâ¦ i recall
          reading this book as a teenager.. it does seem apt in the world to
          come.
          
   URI    [1]: https://www.amazon.com/Faces-Clouds-New-Theory-Religion/dp/0...
       
            oersted wrote 1 hour 53 min ago:
            > anthropomorphism is literally in the company name
            
            No it's not... "anthropos" just means "human" in ancient Greek.
            "Anthropic" means "relating to humans", as in human oriented AI or
            AI designed with humans in mind.
            
            "Anthropomorphic" means "human shaped".
       
              badsectoracula wrote 1 hour 22 min ago:
              > "anthropos" just means "human" in ancient Greek
              
              FWIW it means human in modern Greek too :-P
       
              ilovetux wrote 1 hour 38 min ago:
              > "Anthropomorphic" means "human shaped".
              
              In a literal, ancient Greek sense for sure, but in modern English
              Anthropomorphic would describe the act of attributing human
              characteristics to non-human entities.
              
              Seems pretty apt for a company that produces one of the more
              anthropomorphized technologies.
       
                oersted wrote 1 hour 17 min ago:
                Sure of course, but that abstract sense applied to AI is rather
                new, and has become popular well after the founding of the
                company.
                
                Broadly it has always been used to indicate that something
                non-human has a human physical shape, such as robots, aliens,
                animals...
                
                Anthropic's intention was to make AI designed for the human
                common good and designed with the human user experience as the
                top priority. Just as you would design a city with human
                inhabitants in mind rather than primarily cars.
                
                It turns out that this is best achieved by building AI that
                imitates human behaviour closely, but that's not what
                "anthropic" refers to. And acting as if LLMs are sentient
                people is definitely not a core tenet of the company as you
                imply.
       
          Philpax wrote 2 hours 1 min ago:
          AI is grown, not built, and like with anything you grow, you'll never
          be able to predict exactly how it will turn out.
       
            ninjagoo wrote 1 hour 4 min ago:
            > AI is grown, not built, and like with anything you grow, you'll
            never be able to predict exactly how it will turn out.
            
            Remember when the frontier labs found out that curated high-quality
            training was critical to making better models?
            
            Basically, just like high-quality and more education tends to make
            better humans, on average, I think we can expect quality education
            to turn out better ai, on average, and with better repeatability
            than with humans because of better control over the initial
            conditions and environment.
       
            gensym wrote 1 hour 43 min ago:
            The map is not the territory
       
            shimman wrote 1 hour 59 min ago:
            Except in this care we actually understand and know how these
            models work. They aren't some unknown construct of the universe.
            They are human made with particular goals in mind.
            
            There is no mysticism behind the curtains, just computer science +
            math.
       
              j_maffe wrote 1 hour 46 min ago:
              it took significant research efforts to just understand how these
              models learn how to multiply two numbers. The fact that we know
              how they operate doesn't mean we understand it.
       
              umanwizard wrote 1 hour 50 min ago:
              Utterly wrong. How LLMs work is very incompletely understood and
              an active area of research.
       
              ray__ wrote 1 hour 51 min ago:
              You could say something similar about biologyâjust physics
              behind the curtains, and we understand a lot of the basics. The
              difficulty comes from complexity, not mysticism.
              
              To be clear I don't think that LLMs are sentient, but the appeal
              in studying them is similar to biology in that you get to dissect
              a highly complex system with comparatively crude tools.
       
              in-silico wrote 1 hour 55 min ago:
              We know how the models are built and trained, but we have a very
              limited understanding of how the final products work.
              
              That is to say, we don't know why they give the outputs that they
              do.
              
              If we did know how they worked, AI interpretability would not be
              an open and growing field.
       
              Philpax wrote 1 hour 56 min ago:
              We do not understand and know how these models work. We know what
              their architectures are and how to create them, but we cannot
              explain their behaviours at a fundamental level. There is no
              definitive way for us to answer the question of "how did it
              produce response X for query Y?" - we're only grazing the surface
              with mechanistic interpretability.
       
                SoftTalker wrote 1 hour 2 min ago:
                Isn't this fundamentally because it's all probabilities and
                weights? It would be like asking how did a pair of dice produce
                the response 4:3 on the last roll?
       
                  umanwizard wrote 55 min ago:
                  What does "it's all probabilities and weights" mean? Doesn't
                  that apply to everything in the universe?
       
                devmor wrote 1 hour 42 min ago:
                Thatâs not a refutation because this problem is not a logical
                problem, it is a scale problem.
                
                We canât explain it because we distilled so many inputs into
                matrixes and transformed them over and over again. If we had
                all the time and computing power in the universe to do so, we
                could trace through it bit by bit and eventually answer that
                question.
                
                It is correct to say that it is just science and math, the same
                way we can say that gravity is just science and math even if we
                have only recently begun to understand how it truly functions.
       
                  stratos123 wrote 1 hour 0 min ago:
                  If you had some time and computing power (not even all that
                  much, in the large scale of things), you could simulate
                  perfectly how a human grows from an embryo to an adult, or
                  how an entire human brain processes some incoming signal, and
                  yet this wouldn't give you the understanding to design a
                  human or human brain from scratch.
                  
                  You call this a "scale problem" as if there's some scalable
                  way such as an algorithm to resolve arbitrary scientific
                  questions and we simply haven't done it, but of course no
                  such algorithm exists, which is why there's plenty of science
                  that's still not settled.
       
                  solomonb wrote 1 hour 8 min ago:
                  > If we had all the time and computing power in the universe
                  to do so, we could trace through it bit by bit and eventually
                  answer that question.
                  
                  Then we could also solve BB(6), but that doesn't mean we know
                  BB(6) now or ever will.
       
                  Philpax wrote 1 hour 18 min ago:
                  It's a refutation that we know how they work now. In the
                  limit, though, yes, we are likely to be able to trace the
                  process: it is possible, though, that understanding remains
                  inaccessible because the trace is beyond comprehension.
                  
                  If you can distil the model's reasoning for a decision into a
                  billion yes/no questions, each covering largely-independent
                  areas, can you really say you understand what its overall
                  reasoning was?
       
                cflewis wrote 1 hour 43 min ago:
                I would love for this to be more public knowledge. I think the
                general public (and myself for a long time) believes the AI
                people know how this stuff works end to end, and so it must be
                trustworthy. But if we told the public "Look, we know if you
                put this thing in one end, you'll get something that looks
                similar to this out the other, but we don't really know what
                happens inbetween" I think we'd be able to have a more honest
                discussion about the relationship between AI, productivity and
                ongoing employment.
       
            halestock wrote 1 hour 59 min ago:
            I can't predict the outcome of an RNG but that doesn't mean it
            grows the numbers.
       
              umanwizard wrote 1 hour 51 min ago:
              "X implies Y" doesn't imply "Y implies X".
       
              Smaug123 wrote 1 hour 53 min ago:
              ("If grown, then unpredictable" is unrelated to your apparent
              attempted refutation "But X is unpredictable and not grown;
              checkmate".)
       
              Philpax wrote 1 hour 58 min ago:
              Okay, but that's not relevant to AI training?
       
                halestock wrote 1 hour 52 min ago:
                I was being very roundabout, but my point is that AIs are still
                built, not grown.
       
                  dwaltrip wrote 1 hour 5 min ago:
                  âGrownâ is a highly apt metaphor, IMO. It quite
                  succinctly captures some of the most fundamental differences
                  between building Claude and building an Ikea desk, for
                  example.
       
          nielsbot wrote 2 hours 2 min ago:
          if models exhibit emergent traits, then this is true in a way
       
            swyx wrote 2 hours 0 min ago:
            also useful to have a "chinese wall" between research that knows
            what went into the models vs marketing/eval models as a third party
            would
       
        aaronblohowiak wrote 2 hours 6 min ago:
        Same price for regular and cheaper fast mode. Happy for these
        incremental improvements.
       
        mincer_ray wrote 2 hours 7 min ago:
        seems like a really minor upgrade?
       
          pmxi wrote 1 hour 42 min ago:
          Yeah. They are aware:
          "Users will find Opus 4.8 to be a modest but tangible improvement on
          its predecessor."
       
          teeray wrote 2 hours 4 min ago:
          Yes, but if version number go up, so do all other number
       
          Nicholas_C wrote 2 hours 4 min ago:
          I think they will all be minor going forward, feels like the major
          improvements have all been made and we'll only see incremental
          improvements from here on out. Maybe I'm wrong but we'll see.
       
            Eufrat wrote 1 hour 30 min ago:
            I think one of the challenges is that the models were all initially
            trained on the entire Internet (or as much as they could gather)
            and now theyâre having to deal with an increasing amount of the
            Internet being AI generated content which may be why GPT-5.5
            started being obsessed with goblins and you start seeing amusing
            things in the system prompt trying to get the model to stop
            bringing them up.
       
            chandureddyvari wrote 2 hours 1 min ago:
            Wasn't Mythos a step change improvement?
       
            spelk wrote 2 hours 3 min ago:
            Hard to say. People made the same prediction a year ago because we
            supposedly ran out of training data. There could be indefinite
            rapid compounding improvements so long as there's free money out
            there.
       
              jmalicki wrote 1 hour 46 min ago:
              With RLHF and RLVR we are creating tons of new training data,
              that is much more focused than reading the Internet.  Annotation
              shops are doing many billions per year in revenue creating newer
              data, and a lot of it is highly complex, focused on rewarding
              multi turn agentic trajectories.
       
        McDownloads wrote 2 hours 7 min ago:
        Disappointed to say the least.
       
       
   DIR <- back to front page