_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   A postmortem of three recent issues
       
       
        Razengan wrote 14 min ago:
        Anthropic seems to have a "We're special" syndrome:
        
        * No Sign in with Apple on the website, so tough luck if you signed up
        on iOS via that
        
        * Hard to get support after buying a subscription
        
        * Can't remove your payment method from your account
        
         * Asking Claude questions about Claude itself, such as about privacy,
         just gives you links to their website
         
         What even is the benefit of paying for Claude over ChatGPT or Grok,
         which are better in overall UX?
       
        wbnns wrote 1 hour 55 min ago:
        You're absolutely right!
       
        endymion-light wrote 2 hours 12 min ago:
         I don't know if this is related, but I've noticed a massive issue
         recently within web app design where Claude will just create random
         streams of text that display in the DOM. I think it's something
         specifically related to attempting to use Svelte, but definitely a
         massive degradation that I didn't notice prior to this.
       
        jaiyam wrote 3 hours 59 min ago:
        Do we know why Google’s Gemini didn’t get affected by the XLA bug?
        Don't they use approx top-k or mixed precision? Also, TIL that Claude
        is served in prod at bf16
       
        itsgrimetime wrote 6 hours 32 min ago:
        Wish they would have included what the actual failure mode was. I’ve
        been having issues where Claude Code will just hang after running some
        tool call, was that caused by one of these bugs?
       
        mike_hearn wrote 7 hours 9 min ago:
         The most interesting thing about this is the apparent absence of unit
         tests. The test for the XLA compiler bug just prints the outputs; it's
         more like a repro case than a unit test in the sense of something that
         would be run by a test harness with its coverage tracked. And the
         action items are simply to lean more aggressively into evals.
        
        Although unit testing an entire LLM is not really feasible right now,
        all these bugs were in small deterministic parts of the system. Load
        balancing, top-k probability calculations and so on are all engineered
        parts no different to other software, and should in principle all be
        unit testable. At most you need an injectable PRNG. Yes,
        non-deterministic optimization bugs are awful but I've personally found
        compiler and database bugs in the past using just regular app test
        suites. With CI you get a lot of runs so rare events can still surface
        as long as you investigate flakes. One of my current projects runs
        every unit test in the same process in parallel, which has proven an
        excellent and cheap strategy for flushing out rare thread safety issues
        and database deadlocks.
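         
         To make the injectable-PRNG point concrete, a deterministic unit test
         for a toy top-k sampler could look something like this (a sketch,
         nothing to do with Anthropic's actual code):
         
           import random
         
           def sample_top_k(probs, k, rng):
               # keep the k most likely tokens, renormalize, sample one
               order = sorted(range(len(probs)), key=probs.__getitem__)
               top = order[-k:]
               r = rng.random() * sum(probs[i] for i in top)
               for i in top:
                   r -= probs[i]
                   if r <= 0:
                       return i
               return top[-1]
         
           def test_top_k_never_leaves_top_k():
               rng = random.Random(42)  # injected, seeded PRNG
               probs = [0.5, 0.3, 0.1, 0.05, 0.05]
               for _ in range(1000):
                   assert sample_top_k(probs, 2, rng) in (0, 1)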
        
        A few days ago I commented on a thread about the Java launch that
        people often feel Java is "enterprisey" compared to Python because Java
        code is typically written to be heavily unit testable. A lot of
        abstraction is driven by the desire for dependency injection, for
        example. I contrasted that to scripting language culture where I've
        found testing is often either missing or kinda surface level (e.g.
        mostly just asserting on types).
        
         When I was learning PyTorch a few years ago, I noticed the same
        thing. The tutorials took you from simple to complex stuff without
        talking much about how you test or best structure the code. This makes
        sense for ML research where you don't have a clear goal and success
        boils down to maxing a score in some kind of eval, but it doesn't make
        sense for production deployment at scale.
        
        I wonder if the AI labs could use more people with SRE and HA SWE
        background to focus on things like this. I'm kinda skeptical that more
        aggressive rolling evals-in-prod are the best way to avoid bugs like
        these happening again.
       
          vintagedave wrote 6 hours 31 min ago:
          I've had to write some detailed prompts and examples to have AI
          generate the kind of unit tests I want in Python. I've seen the
          assertions on types alone too. I want assertions on values and more.
          
          Even more than that, AI tends to mock _everything_. Mocking is
          useful, but the more real code a unit test invokes, the better,
          because the risk is not only the code itself but its interactions,
          the interface. Yet AI in Python will mock so heavily it barely tests
          even the code itself, with tautological statements.
          
           I've prompted with heavy warnings against mocking and pointed
           directly at examples of thorough tests. FWIW, Python does have
           excellent tools for injection and lets you write really nicely
           structured code.
       
            andoando wrote 1 hour 42 min ago:
            Mocked tests also make refactoring a pain in the ass.
            
            This is why I heavily prefer integration tests
       
            redman25 wrote 3 hours 56 min ago:
            I wish I had 100 upvotes to give you. Weak, heavily mocked tests
             are my biggest pet peeve. Test “quality” is important and not
            something a lot of devs pay attention to.
            
            I’ve found myself preferring integration tests or unit tests with
            a “real” database set up because the tests are much more
            effective. If you design them right, they don’t even need to be
            slower.
       
            mike_hearn wrote 6 hours 1 min ago:
            I'm curious how you structure your Python to be well testable. I
            have to admit, my own use of Python has been limited to scripts and
            (a long time ago) a game engine, not large codebases. So unit
            testing for those hardly came up.
            
            It seems there's a couple of dependency injection frameworks but
            they're clones of what's found in Java, right down to the type
            names. One of them even calls injectable objects beans! (Rhazes)
       
              lordmathis wrote 3 hours 16 min ago:
               I learned to write well-testable code when I learned Go. It
               pushes you to pass interfaces instead of direct implementations.
               There's also no inheritance, just composition. While there's no
               1-to-1 translation to Python, the concepts are still useful. It
               can be easier in Python thanks to duck typing.
       
              vintagedave wrote 3 hours 50 min ago:
               Most of my Python is web. Individual components, same as always:
               approach with a set API and not too many dependencies, and
               allow injection via some route if needed. I also test web
               endpoints. One
              thing I really like is isolating tests that require data --
              rather than mocking the database, for example, I'll create an
              in-memory SQLite DB used while running tests. That way I can test
              the full stack: a web API, see its results, and check what was
              changed in the database at the same time, all isolated from the
              'real' stack.
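               
               As a rough sketch of that pattern (hypothetical handler and
               schema, stdlib only; in practice the connection comes from a
               test fixture and the call goes through the framework's test
               client):
               
                 import sqlite3
               
                 def create_user(conn, name):
                     # the "endpoint" under test: writes and returns a result
                     cur = conn.execute(
                         "INSERT INTO users (name) VALUES (?)", (name,))
                     conn.commit()
                     return {"id": cur.lastrowid, "name": name}
               
                 def test_create_user():
                     conn = sqlite3.connect(":memory:")  # throwaway DB
                     conn.execute(
                         "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
                     resp = create_user(conn, "ada")
                     assert resp == {"id": 1, "name": "ada"}
                     # and check what actually landed in the database
                     rows = conn.execute("SELECT id, name FROM users").fetchall()
                     assert rows == [(1, "ada")]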
       
              Balinares wrote 5 hours 38 min ago:
              Same as you do it in any language: you compose instead of
              inheriting, you avoid shared state, you generally think about how
              this thing you're implementing can be tested even as you are
              implementing it. Test-driven development tends to constrain your
              interfaces too early but you can get a lot of the same benefits
              with, let's call it, test-mindful development. That works in any
              language.
       
        am17an wrote 8 hours 31 min ago:
        They must really be having a bad time if Anthropic of all labs is
         willing to share their infra details. On the actual precision bug, it
         is quite unfortunate on the FMA side; numerical issues are often deeply
         bewildering and no AI can solve them (yet). It also goes to show: if you
        are in a super crunch situation like this one (competitor literally
        eating your lunch every day), you need humans to understand what went
        wrong and even then it can take weeks to rectify.
       
        mulmboy wrote 13 hours 14 min ago:
        Big missing piece - what was the impact of the degraded quality?
        
        Was it 1% worse / unnoticeable? Did it become useless?
        The engineering is interesting but I'd like to see it tied to actual
        impact
       
          cpursley wrote 6 hours 17 min ago:
          Significant, check any Claude related thread here over the last month
          or the Claude Code subreddit. Anecdotally, the degradation has been
           so bad that I had to downgrade to a month-old version, which has
           helped a lot. I think part of the problem lies there as well (in
           Claude Code itself).
       
        Omnipresent wrote 14 hours 3 min ago:
        Which LLM can generate that timeline event graphic from text?
       
        yomismoaqui wrote 14 hours 28 min ago:
        This reminds me of the story [1] about Facebook intentionally breaking
        parts of its Android app for some users (including crashing or
        disabling functionality), to see how far it could degrade before users
        stopped using Facebook.
        
        According to reports, users did not stop coming back even when the app
        was broken for hours.
        
        A similar thing happened to me when playing some initial version of The
        Binding of Isaac on Linux, when it was made with Flash. Its performance
        wasn't the best but I couldn't stop playing.
        
         So if people still return, maybe Anthropic has something great going on
        with Claude Code.
        
        [1] 
        
   URI  [1]: https://www.theguardian.com/technology/2016/jan/05/facebook-de...
       
        mvdtnz wrote 14 hours 31 min ago:
        I don't believe for one second that response quality dropped because of
        an infrastructural change and remained degraded, unnoticed, for weeks.
        This simply does not pass the sniff test.
       
          blackqueeriroh wrote 12 hours 57 min ago:
          Can you provide any proof of what you’re saying? Any examples that
          would bear out what you’re asserting? Anything at all?
          
          “I refuse to believe what the people who would know the best said,
          for no real reason except that it doesn’t feel right” isn’t
          exactly the level of considered response we’re hoping for here on
          HN. :)
       
            mccoyb wrote 12 hours 42 min ago:
            Have you used these tools at all? It's incredibly obvious. It was
            obvious for weeks during August, where several people posted about
            degradation on r/ClaudeAI ...
            
            There's a thousand and one reasons why a company valued in the
            billions, with the eyes of the world watching, would not be
            completely honest in their public response.
       
              breakingcups wrote 6 hours 35 min ago:
              Google Gemini had an ongoing issue for at least 3 weeks with
              empty responses and the thinking context filled with absolute
              nonsense. It was even acknowledged on social media by their head
              of product but no official communication at all. No status page
              updates, nada. Apparently it was even a known issue for much
              longer but they only started fixing it after a config change made
              the problem much more prevalent.
       
        dantodor wrote 15 hours 2 min ago:
        That is a very good start in sharing some level of information with
        their users, and kudos to the Anthropic team for doing that. However, I
        don't see any mention of the longstanding issue in CC of API timeout
        errors. And, at least for me, it's the most frustrating one.
       
          lukasb wrote 14 hours 56 min ago:
           I almost never see these. Maybe the issue is your network?
       
        stephen_cagle wrote 15 hours 17 min ago:
        I do wonder what a random dip in quality causes in a long running
        conversation? Does the conversation recover at a later point, or does
        the introduction of temporary idiocy permanently affect the rest of the
        conversation?
        
         Statistically, it's likely that the dip occurred at a point that
        wasn't too important? But what happens if the idiot comes out at a
        critical point?
        
        Kind of reminds me of the two alternate ways that time travel works in
        sci-fi. Does the small change to the past explode like a fission
        reaction, or does history heal itself?
        
        Anywho, if errors do accumulate, I can see being very pissed off even
        with temporary idiocy from the model, as it means it poisons the
        context for the entire rest of the conversation.
       
          unsupp0rted wrote 1 hour 33 min ago:
          Depends how good your competitors are at capitalizing on it.
          
          Guess what Sam Altman is good at.
       
        woah wrote 15 hours 45 min ago:
        Vibe coding gone wrong?
       
          nojs wrote 14 hours 47 min ago:
          Hey now, CC is only 80% vibe coded!
          
   URI    [1]: https://www.reddit.com/r/singularity/comments/1khxwjh/claude...
       
            bdangubic wrote 14 hours 18 min ago:
            80% of Atlassian employees use Jira :)
       
        data-ottawa wrote 15 hours 52 min ago:
        With all due respect to the Anthropic team, I think the Claude status
        page[1] warrants an internal code red for quality. There were 50
        incidents in July, 40 incidents in August, and 21 so far in September.
        I have worked in places where we started approaching half these numbers
        and they always resulted in a hard pivot to focusing on uptime and
        quality.
        
        Despite this I'm still a paying customer because Claude is a fantastic
        product and I get a lot of value from it. After trying the API it
        became a no brainer to buy a 20x Max membership. The amount of stuff
        I've gotten done with Claude has been awesome.
        
        The last several weeks have strongly made me question my subscription.
        I appreciate the openness of this post, but as a customer I'm not
        happy.
        
        I don't trust that these issues are all discovered and resolved yet,
        especially the load balancing ones. At least anecdotally I notice that
        around 12 ET (9AM pacific) my Claude Code sessions noticeably drop in
        quality. Again, I hope the team is able to continue finding and fixing
        these issues. Even running local models on my own machine at home I run
        into complicated bugs all the time — I won't pretend these are easy
        problems, they are difficult to find and fix.
        
   URI  [1]: https://status.anthropic.com/history
       
          renewiltord wrote 12 hours 2 min ago:
           This is why you should always put as few incidents on the status
           page as possible. People's opinion will drop and then the negative
           effect will fade over time. But if you have a status page then it's
           incontrovertible proof. Better to lie. They'll forget.
           
           e.g. S3 has encountered increased error rates many times but doesn't
           report them. No one says anything about S3.
          
          People will say many things, but their behaviour is to reward the
          lie. Every growth hack startup guy knows this already.
       
            pnutjam wrote 3 hours 21 min ago:
            Yup, these guys aren't the customers anyway. The investors are the
            only ones they care about because the customers don't come close to
            paying the actual costs.
       
          martinald wrote 14 hours 21 min ago:
          What makes it even worse is the status page doesn't capture all
          smaller incidents. This is the same for all providers. If they
          actually provided real time graphs of token latency, failed requests,
          token/s etc I think they'd be pretty horrific.
          
          If you trust this OpenRouter data the uptime record of these APIs
          is... not good to say the least: [1] It's clear to me that every
          provider is having enormous scale challenges. Claude Code often slows
          to a crawl and I have to interrupt it and tell it to try again.
          
          This is especially pronounced around 4-6pm UK time (when we have
          Europe, Eastern US and West Coast US all hammering it).
          
          Even today I was getting 503 errors from Gemini AI studio with model
          overloaded at that time, nothing on status page.
          
          I really wonder if it would be worth Claude et al offering a cheaper
          off peak plan, to try and level out demand. Perhaps the optics of
          that don't look good though.
          
           Edit to add: I think another potential dimension to this is that
           GB200s have been a lot slower to come on stream than the industry
           probably expected. There have been a lot of defects with various
           hardware and software components, and I suspect the liquid cooling
           has been difficult to get right (with far more catastrophic failure
           states!).
          
   URI    [1]: https://openrouter.ai/openai/gpt-5/uptime
       
            l1n wrote 4 hours 49 min ago:
            > Claude et al offering a cheaper off peak plan
            We do offer Batch Processing today -
            
   URI      [1]: https://docs.claude.com/en/docs/build-with-claude/batch-pr...
       
              martinald wrote 4 hours 20 min ago:
              I mean for Claude Code.
       
            Maxious wrote 11 hours 35 min ago:
             Artificial Analysis also monitors LLM provider APIs independently,
             "based on 8 measurements each day at different times"; you can see
             the degradation as Opus 4.1 came online
            
   URI      [1]: https://artificialanalysis.ai/providers/anthropic#end-to-e...
       
          willsmith72 wrote 15 hours 11 min ago:
          > Despite this I'm still a paying customer because Claude is a
          fantastic product and I get a lot of value from it.
          
          Doesn't that say it all? At this point the quality of the AI trumps
          reliability for the customer (you and me), so even though of course
          they should (and I'm sure will) focus on it, why would they
          prioritise reliability over model quality right now?
       
            edoceo wrote 14 hours 28 min ago:
             The up-thread complaint is that quality drops, and draws a line to
             reliability. They (Anthropic) have two hard problems to solve.
       
          ruszki wrote 15 hours 18 min ago:
           I don’t know whether they are better or worse than others. One thing
           is for sure: a lot of companies lie on their status pages. I
           encounter outages frequently which are not reported on their status
           pages. Nowadays, I’m more surprised when they self-report some
           problems. Personally, I haven’t had serious problems with Claude so
           far, but it’s possible that I was just lucky. From my perspective,
           it just seems that they are reporting outages in a more faithful
           way. But that can be completely coincidental.
       
          lumost wrote 15 hours 23 min ago:
          I've become extremely nervous about these sudden declines in quality.
          Thankfully I don't have a production product using AI (yet), but in
          my own development experience - the model becoming dramatically
          dumber suddenly is very difficult to work around.
          
          At this point, I'd be surprised if the different vendors on
          openrouter weren't abusing their trust by silently dropping
          context/changing quantization levels/reducing experts - or other
          mischievous means of delivering the same model at lower compute.
       
            martinald wrote 14 hours 35 min ago:
            Openrouter is aware this is happening and flags it now on the UI.
            It's a real problem.
       
        vlovich123 wrote 16 hours 7 min ago:
         Figuring out how to make their LLM serving deterministic might help
         them track this down. There was a recent write-up about how the
         received wisdom that kept attributing it to floating-point
         associativity actually overlooked the real reasons for
         non-determinism [1]
        
   URI  [1]: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-l...
       
          ants_everywhere wrote 13 hours 4 min ago:
          network traffic and machine load aren't deterministic. I think for
          the near term, getting full determinism (e.g. for auditing) is going
          to only be feasible for batch jobs that are not cost sensitive.
          
          A google search isn't deterministic. Neither is loading upvote count
          on social media.
          
          It's common advice in distributed systems to have a graceful
          degradation state instead of becoming unavailable. That wouldn't be
          possible in a system that's completely deterministic.
       
            vlovich123 wrote 10 hours 46 min ago:
            Network traffic and machine load don’t usually impact the output
            of a pure (in the CS sense of purity) math function (which is what
            an LLM is) unless you’ve written your system to be sensitive to
            that.
            
            > to have a graceful degradation state instead of becoming
            unavailable. That wouldn't be possible in a system that's
            completely deterministic.
            
            What does this even mean? I see no incompatibility between
            determinism and your ability to perform the same function more
            slowly. Determinism just means that the output of the system is
            solely dependent on the inputs - feed the same inputs get the same
            outputs. If by degraded state you’re intentionally choosing to
            change your inputs, that doesn’t change the determinism of your
            system.
            
             When it is said that LLMs aren’t deterministic, it’s because
             the output token is dependent on the input context and all the
             other contexts processed in the same batch, because the kernels
             are written non-deterministically. If the kernels were written
             deterministically (so that the output only depended on your input
             context), then there wouldn’t be a problem, and it also wouldn’t
             change the ability for the system to degrade; it would be
             deterministic because capturing the input context and random seed
             would be sufficient. As it stands, you’d have to capture the
             interim states of the other inputs being processed in the same
             batch, and that interim state problem is what makes it
             non-deterministic.
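             
             A toy illustration of why the batching/kernel strategy matters:
             the same numbers reduced in a different grouping need not give
             the same float result, so a kernel that picks its grouping based
             on batch shape inherits that sensitivity.
             
               # contrived values so the effect is visible; the principle
               # (floating-point addition is not associative) is general
               vals = [0.1] * 10 + [1e16, -1e16]
             
               one_by_one = 0.0        # like reducing a request on its own
               for v in vals:
                   one_by_one += v
             
               in_blocks = 0.0         # like a kernel reducing in tiles of 4
               for i in range(0, len(vals), 4):
                   in_blocks += sum(vals[i:i + 4])
             
               print(one_by_one, in_blocks)  # 0.0 vs 0.8 on CPython: same
                                             # inputs, different grouping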
            
            As for Google search, it’s not clear to me it’s
            non-deterministic. When you Google the exact same thing twice you
            get exactly the same page of results and selected snippets. That
            suggests there’s more determinism in the system than you’re
            giving it credit for.
       
          mmaunder wrote 16 hours 1 min ago:
           Enforcing determinism has a big impact on performance. Which leaves
           using another model to essentially IQ-test their models, with
           reporting and alerting.
       
        HoyaSaxa wrote 16 hours 13 min ago:
        I’m pretty surprised that Anthropic can directly impact the infra for
         AWS Bedrock as this article suggests. That goes against AWS's
         commitments. I’m sure the same is true for Google Vertex but I
         haven’t dug in there from a compliance perspective before.
        
        > Our own privacy practices also created challenges in investigating
        reports. Our internal privacy and security controls limit how and when
        engineers can access user interactions with Claude, in particular when
        those interactions are not reported to us as feedback.
        
        Ok makes sense and glad to hear
        
        > It remains particularly helpful for users to continue to send us
        their feedback directly. You can use the /bug command in Claude Code
        
         Ok, makes sense, and I’d expect that a human can then see the context
         in that case, although I hope it is still very explicit to the end
         user (I’m not a Claude Code user so I cannot comment)
        
        > or you can use the "thumbs down" button in the Claude apps to do so
        
        This is pretty concerning. I can’t imagine the average person equates
        hitting this button with forfeiting their privacy.
       
          l1n wrote 16 hours 6 min ago:
          (Anthropic employee, speaking in a personal capacity)
          
          > I’m pretty surprised that Anthropic can directly impact the infra
          for AWS Bedrock as this article suggests.
          
          We don't directly manage AWS Bedrock deployments today, those are
          managed by AWS.
          
          > I can’t imagine the average person equates hitting this button
          with forfeiting their privacy.
          
          We specify
          
          > Submitting this report will send the entire current conversation to
          Anthropic for future improvements to our models.
          
          in the thumbs down modal. Is there a straightforward way to improve
          this copy?
       
            HoyaSaxa wrote 12 hours 32 min ago:
            > We don't directly manage AWS Bedrock deployments today, those are
            managed by AWS.
            
            That was my understanding before this article. But the article is
            pretty clear that these were "infrastructure bugs" and the one
            related to AWS Bedrock specifically says it was because "requests
            were misrouted to servers". If Anthropic doesn't manage the AWS
            Bedrock deployments, how could it be impacting the load balancer?
       
              l1n wrote 4 hours 51 min ago:
              The load balancer container is provided to Bedrock, since it's
              part of our overall LLM serving system, but they run it.
       
            crazygringo wrote 15 hours 38 min ago:
            Sounds fine to me. I'm assuming it wasn't obvious to readers that
            there was a confirmation message that appears when thumbs down is
            clicked.
       
              HoyaSaxa wrote 12 hours 36 min ago:
              Yes, I don't use Claude so I wasn't aware. I'm glad to hear it
              sounds like it is conspicuous.
       
            pluto_modadic wrote 15 hours 47 min ago:
            "have a human take a look at this conversation (from {time} to
            {time})"
       
          _da_ wrote 16 hours 9 min ago:
          > This is pretty concerning. I can’t imagine the average person
          equates hitting this button with forfeiting their privacy.
          
          When you click "thumbs down" you get the message "Submitting this
          report will send the entire current conversation to Anthropic for
          future improvements to our models." before you submit the report, I'd
          consider that pretty explicit.
       
            HoyaSaxa wrote 12 hours 36 min ago:
            Great to hear. I'm not a Claude user and the article did not make
            it seem that way.
       
        zer00eyz wrote 16 hours 25 min ago:
         If you are going to run a non-deterministic system on three very
         different hardware platforms, doesn't it behoove you to tell your
         users where their experience is coming from?
        
        Calling the platforms A, B and C might help provide us the insight
        we're missing to spot incongruous behaviors faster than trying to
        aggregate more generalized feedback.
       
        behnamoh wrote 16 hours 39 min ago:
        Reminder that Anthropic is the only AI company that has never released
        any open-source/weight models.
       
          _zoltan_ wrote 8 hours 5 min ago:
          and?
       
          arduanika wrote 16 hours 35 min ago:
          Sure, but don't you feel safer that way?
       
            behnamoh wrote 16 hours 26 min ago:
            of course, who wants an open-source Sonnet 3... /s
       
        cyanf wrote 16 hours 42 min ago:
        > On August 29, a routine load balancing change unintentionally
        increased the number of short-context requests routed to the 1M context
        servers. At the worst impacted hour on August 31, 16% of Sonnet 4
        requests were affected.
        
         Interesting, this implies that the 1M context servers perform worse
         at low context.
        Perhaps this is due to some KV cache compression, eviction or sparse
        attention scheme being applied on these 1M context servers?
       
          kiratp wrote 16 hours 14 min ago:
          This is due to RoPE scaling.
          
          > All the notable open-source frameworks implement static YaRN, which
          means the scaling factor remains constant regardless of input length,
          potentially impacting performance on shorter texts. We advise adding
          the rope_scaling configuration only when processing long contexts is
          required. It is also recommended to modify the factor as needed. For
          example, if the typical context length for your application is
          524,288 tokens, it would be better to set factor as 2.0.
          
   URI    [1]: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking
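           
           If I read that right, the static factor is just the target context
           over the native training window; in HF-style config terms something
           like this (illustrative numbers and assumed field names, check the
           model card):
           
             # illustrative only: static YaRN fixes one scaling factor up front
             native_window = 262_144       # native context window
             typical_prompt = 524_288      # typical application prompt
             factor = typical_prompt / native_window  # -> 2.0, per the quote
           
             rope_scaling = {
                 "rope_type": "yarn",      # assumed field names
                 "factor": factor,
                 "original_max_position_embeddings": native_window,
             }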
       
        flutas wrote 17 hours 1 min ago:
        And yet no offers of credits to make things right for the users, for
        what was essentially degraded performance of what you paid for.
        
        I know I'll probably get push back on this, but it left a sour taste in
        my mouth when I paid for a $200 sub that felt like it was less useful
        than ChatGPT Plus ($20) at times.
        
        Or to summarize: [south park "we're sorry" gif]
       
          kashunstva wrote 8 hours 58 min ago:
          I would not count on any compensation. I was a Pro subscriber until
          the third week of May when I was no longer able to login due to some
          auto ban. No idea what triggered it but no one from support has
          bothered to respond to my appeals form submissions. A lost cause.
       
          topaz0 wrote 11 hours 19 min ago:
          Hey, they need that $200 to postpone their inevitable bankruptcy
       
          blackqueeriroh wrote 12 hours 59 min ago:
          I’m pretty certain if you check the ToS that Anthropic doesn’t
          guarantee a level of response quality, and explicitly even says there
          is zero guarantee, even for paid plans.
          
          So to be fair, you are getting exactly what you paid for - a
          non-deterministic set of generated responses of varying quality and
          accuracy.
       
        Wowfunhappy wrote 17 hours 2 min ago:
        > On August 25, we deployed a misconfiguration to the Claude API TPU
        servers that caused an error during token generation. An issue caused
        by a runtime performance optimization occasionally assigned a high
        probability to tokens that should rarely be produced given the context,
        for example producing Thai or Chinese characters in response to English
        prompts, or producing obvious syntax errors in code. A small subset of
        users that asked a question in English might have seen
        "สวัสดี" in the middle of the response, for example.
        
        Can anyone explain to a layperson how this sort of thing is even
        possible for an LLM?
        
        For normal code, of course stupid bugs happen all the time. You
        accidentally introduce an off-by-one error in a conditional, for
        example, or add an extra `goto fail`.
        
        But LLMs aren't written by humans! Models are trained by automated
        programs over a period of many months across unfathomably massive data
        centers.
        
        How would a human introduce a bug like the one described in TFA?
       
          blackqueeriroh wrote 13 hours 2 min ago:
          Simple answer: there are two separate processes here, training and
          inference.
          
          As you discuss, training happens over a long period of time in a
          (mostly) hands-off fashion once it starts.
          
          But inference? That’s a separate process which uses the trained
          model to generate responses, and it’s a runtime process - send a
          prompt, inference runs, response comes back. That’s a whole
          separate software stack, and one that is constantly being updated to
          improve performance.
          
          It’s in the inference process where these issues were produced.
       
          jldugger wrote 15 hours 6 min ago:
          The AI kernels are floating point, so it's possible to do some
          unintuitive math that ends up negative even though it wouldn't be in
          the Real domain. I wouldn't be surprised if checking for overflow
          state is disabled for perf reasons and the negative simply becomes
          really big -- like asking for the -1st item in an array and getting
          the last.
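           
           In Python terms, the kind of silent wraparound I mean (just an
           analogy, the real kernels obviously aren't Python):
           
             vocab = ["the", "cat", "sat", "on", "สวัสดี"]  # junk at the end
           
             idx = 2 - 3         # an underflow a checked build would catch
             print(idx)          # -1
             print(vocab[idx])   # "สวัสดี": negative index wraps to the end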
       
          Centigonal wrote 16 hours 52 min ago:
          LLMs produce a probability distribution for what the next token might
          be. How you pick the actual word that is printed next from that
          probability distribution is by using a sampling approach[1]. If your
          sampling approach is "select the next word randomly from among the
          top 4 possibilities" and you flip a > sign, you could end up with the
          behavior described in the OP. [1] Here is an example of two common
          approaches:
          
   URI    [1]: https://www.reddit.com/r/AIDungeon/comments/1eppgyq/can_some...
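           
           To make the flipped-sign idea concrete (a toy sketch, not anyone's
           actual sampler):
           
             import random
           
             def next_token(probs, k=4, flipped=False):
                 # rank token ids by probability; flipping the sort
                 # direction turns "top k" into "bottom k"
                 ranked = sorted(range(len(probs)),
                                 key=probs.__getitem__,
                                 reverse=not flipped)
                 return random.choice(ranked[:k])
           
             probs = [0.90, 0.05, 0.03, 0.01] + [0.0001] * 200  # junk tail
             print(next_token(probs))                # a plausible token id
             print(next_token(probs, flipped=True))  # junk from the tail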
       
            jjmarr wrote 16 hours 40 min ago:
            The next word can also be selected with weighted randomization and
            "temperature" is used to control how much weight lower probability
            tokens get.
            
            I've honestly received the best results in creative writing by
            ignoring top_k/top_p and simply tuning temperature. Restricting my
            output to only common words causes everything to feel generic. But
            Deepseek constantly breaks into Chinese/gibberish/ZALGO! when I go
            to 1.14.
            
            This isn't related to the "recent issues" but I feel like it's
            useful advice for anyone trying out AI story creation.
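             
             For reference, temperature is just a rescaling of the logits
             before the softmax; roughly (a sketch; T -> 0 collapses to greedy
             and higher T flattens the distribution toward the tail):
             
               import math, random
             
               def sample_with_temperature(logits, T=1.0):
                   if T == 0:          # greedy: always the argmax
                       return max(range(len(logits)), key=logits.__getitem__)
                   scaled = [l / T for l in logits]
                   m = max(scaled)     # subtract max for numeric stability
                   weights = [math.exp(s - m) for s in scaled]
                   return random.choices(range(len(logits)), weights=weights)[0]
             
               logits = [5.0, 3.0, 1.0, -2.0]
               print(sample_with_temperature(logits, T=0))     # always 0
               print(sample_with_temperature(logits, T=1.14))  # tail appears more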
       
          Voloskaya wrote 16 hours 56 min ago:
           LLMs are still executed by code written by humans.
           In this case, the model ultimately gives you a probability
           distribution over each of the (~200k) tokens in the vocabulary. It's
           then up to you to decide how you want to sample the next token: you
           could, for example, just always sample the most likely one, or, to
           make the output more creative, you can sample randomly from the
           top-k tokens. To make it efficient, this top-k sampling is written
           in XLA and compiled to run directly as a kernel. There was a bug in
           that kernel, which presumably led to tokens outside of the top-k
           window being selected from time to time.
       
          ashdksnndck wrote 17 hours 0 min ago:
          There are many layers of human-written code in between you and the
          weights.
       
        OGEnthusiast wrote 17 hours 9 min ago:
        Seems like Claude is using TPUs a lot more than I thought. For some
        reason I thought 90%+ of their capacity was from AWS.
       
        bravetraveler wrote 17 hours 10 min ago:
        > We don't typically share this level of technical detail about our
        infrastructure, but the scope and complexity of these issues justified
        a more comprehensive explanation.
        
        Layered in aggrandizing. You host a service, people give you money.
       
          levocardia wrote 16 hours 57 min ago:
          No, what that statement means is "we know that if we just say 'we
          weren't downgrading performance to save money', you won't believe us,
          so here is a deep dive on the actual reason it happened"
       
            pluto_modadic wrote 15 hours 46 min ago:
            they're big, and we expect proper behavior out of them when they
            mess up. that includes public details.
       
            bravetraveler wrote 16 hours 49 min ago:
            They can still do the deep dive, that is absolutely convincing.
            They likely did: distracted before I could finish [work,
            unfortunately - incident of our own]
            
            My criticism is it's 'puffy'. The 'scope and complexity' for a
            public postmortem is 'customer-facing'. Otherwise it's a
            tree/forest scenario.
            
            One might say 'the lady doth protest too much'; this should be
            routine. It is, elsewhere: see Cloud, Web Hosting, PBX. Pick your
            decade.
       
        extr wrote 17 hours 11 min ago:
        > Incorrect routing affected less than 0.0004% of requests on Google
        Cloud's Vertex AI between August 27 and September 16.
        
        Matches my experience. I use CC through our enterprise Vertex AI
        account and never noticed any degradation.
        
        In general it seems like these bugs, while serious, were substantially
        less prevalent than anecdotal online reports would have you believe. We
        are really talking about a ~1-2 week window here where most issues were
        concentrated, a relatively small percentage of total requests and total
        users impacted.
       
          thousand_nights wrote 16 hours 43 min ago:
          i don't trust companies anymore because every time there's a
          worldwide outage they use softspeak like "we're observing elevated
          amounts of errors for a small subset of users", hours after some CTO
          approves to change the status page
          
          imho there's a big market gap for companies that are truly honest
          with customers instead of corporate gaslighting
       
            edoceo wrote 14 hours 23 min ago:
            I'm with you that a market gap for honesty exists - especially on
            status pages. Making a better product and being honest I'd class as
            very-very-hard.
            
            I do think an independent service status monitor might be an easier
             stop-gap and could serve to improve honesty. It's not trivial.
       
          ispeaknumbers wrote 17 hours 4 min ago:
          I'm not sure if you can claim these were "less prevalent than
          anecdotal online reports". From their article:
          
          > Approximately 30% of Claude Code users had at least one message
          routed to the wrong server type, resulting in degraded responses.
          
          > However, some users were affected more severely, as our routing is
          "sticky". This meant that once a request was served by the incorrect
          server, subsequent follow-ups were likely to be served by the same
          incorrect server.
          
          30% of Claude Code users getting a degraded response is a huge bug.
       
            extr wrote 16 hours 55 min ago:
            I don't know about you but my feed is filled with people claiming
             that they are surely quantizing the model, that Anthropic is
            purposefully degrading things to save money, etc etc. 70% of users
            were not impacted. 30% had at least one message degraded. One
            message is basically nothing.
            
            I would have appreciated if they had released the full distribution
            of impact though.
       
              mirekrusin wrote 11 hours 0 min ago:
               The routing bug was sticky, so "one message is basically
               nothing" is not what was happening: if you were affected, you
               were likely to be affected even more.
       
              lmm wrote 13 hours 15 min ago:
              > 30% had at least one message degraded. One message is basically
              nothing.
              
              They don't give an upper bound though. 30% had at least one
              message degraded. Some proportion of that 30% (maybe most of
              them?) had some larger proportion of their messages (maybe most
              of them?) degraded. That matters, and presumably the reason we're
              not given those numbers is that they're bad.
       
              dytyruio wrote 15 hours 52 min ago:
              > Anthropic is purposefully degrading things to save money
              
              Regardless of whether it’s to save money, it’s purposefully
              inaccurate:
              
              “When Claude generates text, it calculates probabilities for
              each possible next word, then randomly chooses a sample from this
              probability distribution.”
              
              I think the reason for this is that if you were to always choose
               the most probable next word, you may actually always end up
              with the wrong answer and/or get stuck in a loop.
              
              They could sandbag their quality or rate limit, and I know they
              will rate limit because I’ve seen it. But, this is a race.
              It’s not like Microsoft being able to take in the money for
              years because people will keep buying Windows. AI companies can
              try to offer cheap service to government and college students,
              but brand loyalty is less important than selecting the smarter AI
              to help you.
       
                efskap wrote 10 hours 8 min ago:
                >or get stuck in a loop
                
                You are absolutely right! Greedy decoding does exactly that for
                longer seqs: [1] Interestingly DeepSeek recommends a
                temperature of 0 for math/coding, effectively greedy.
                
   URI          [1]: https://huggingface.co/docs/transformers/generation_st...
       
                andy99 wrote 14 hours 15 min ago:
                > I think the reason for this is that if you were to always
                choose the highest probable next word, you may actually always
                end up with the wrong answer and/or get stuck in a loop.
                
                No, it's just the definition of sampling at non-zero
                temperature. You can set T=0 to always get the most likely
                 token. Temperature trades off consistency for variety. You can
                set T to zero in the API, I assume the defaults for Claude code
                and their chat are nonzero.
       
              flutas wrote 16 hours 45 min ago:
              That 30% is of ALL users, not users who made a request, important
              to note the weasel wording there.
              
              How many users forget they have a sub? How many get a sub through
              work and don't use it often?
              
              I'd bet a large number tbh based on other subscription services.
       
                extr wrote 16 hours 37 min ago:
                That's a pretty cynical read. My personal impression is that
                Anthropic has a high level of integrity as an organization.
                Believe what you want, I'm inclined to give them the benefit of
                the doubt here and move on.
       
                  kashunstva wrote 8 hours 54 min ago:
                  > My personal impression is that Anthropic has a high level
                  of integrity as an organization.
                  
                  Unless you consider service responsiveness as a factor of
                  integrity. Still waiting on a service message reply from
                  third week of May. I’m sure it’s right around the corner
                  though.
       
                smca wrote 16 hours 39 min ago:
                (I work at Anthropic) It's 30% of all CC users that made a
                request during that period. We've updated the post to be
                clearer.
       
                  flutas wrote 16 hours 5 min ago:
                  Thanks for the correction and updating the post.
                  
                  I typically read corporate posts as cynically as possible,
                  since it's so common to word things in any way to make the
                  company look better.
                  
                  Glad to see an outlier!
       
        deepdarkforest wrote 17 hours 13 min ago:
         Wow. Sneaky. They do not even state the rate of impact for the XLA
         bug afaik, which affected everyone, not just Claude Code users. Very
         vague. Interesting.
         
         Claude Code has made almost half a billion so far [1] (>$500M in ARR
         and it's like 9 months old), and 30% of all users have been impacted
         at least once, just from the first routing bug. Scary stuff.
         
         Their postmortem is basically "evaluations are hard, we relied on
         vibe checking, now we are going to have even more frequent vibe
         checking". I believe it was indeed unintentional, but in a future
         where investors' money won't come down from the skies, serving
         distilled models will be very tempting. And you cannot be held liable
         to any SLA currently; it's just vibes. I wonder how enterprise
         vendors are going to deal with this going forward: you cannot just
         degrade quality without the client or vendor even being able to
         really prove it.
        
         [1]
        
   URI  [1]: https://www.anthropic.com/news/anthropic-raises-series-f-at-us...
       
          VirusNewbie wrote 14 hours 23 min ago:
          They likely don't want to say how much of their inference comes from
          GCP vs. AWS.
       
          extr wrote 17 hours 3 min ago:
          Is your contention that paying for a service entitles you to zero
          bugs, ever?
       
            gabriel666smith wrote 10 hours 40 min ago:
            We already kind of have a solution for this with SLAs. Humans,
            being (probably) non-deterministic, also fuck up. An expectation of
            a level of service is, I think, reasonable. It's not "zero mistakes
            ever", just as it can't be "zero bugs ever".
            
            We're firmly in the realms of 'this thing is kind of smarter /
             faster at a task compared to me or my employees, so I am contracting
            it to do that task'.
            
            That doesn't mean 'if it fails, no payment'.
            
            But I think it's too analogous to non-tech-products to hide behind
            a 'no refunds' policy. It's that good - there are consequences for
            it, I think.
       
            flutas wrote 16 hours 47 min ago:
            If you paid for a streaming service and the HD option only worked
            for a random subset of users, and not you, would you complain?
            
            It's a material difference in the product, not just "a bug."
       
              dylan604 wrote 16 hours 37 min ago:
              I'd honestly blame my ISP for traffic shaping my connection as a
              first assumption, and not immediately blame the streaming
              platform.
       
            deepdarkforest wrote 16 hours 50 min ago:
            Of course not! But usually, you can quantify metrics for quality,
            like uptime, lost transactions, response time, throughput etc. Then
            you can have accountability, and remediate. Even for other bugs,
            you can often reproduce and show clearly the impact. But in this
            case, other than internal benchmarks, you cannot really prove it.
            There is no accountability yet
       
              _zoltan_ wrote 16 hours 24 min ago:
              why would they publish the data you seek? I would not publish it
              either.
              
              the blog explains what issues they had and how they fixed them.
              this is good enough.
       
        stellalo wrote 17 hours 23 min ago:
        Title should be fixed: it’s about Claude models in general, not
        Claude Code
       
          dang wrote 9 hours 28 min ago:
          Fixed now, thanks! (Submitted title was "Claude Code Degradation: A
          postmortem of three recent issues".)
       
        moatmoat wrote 18 hours 14 min ago:
        TL;DR — Anthropic Postmortem of Three Recent Issues
        
        In Aug–Sep 2025, Claude users saw degraded output quality due to
        infrastructure bugs, not intentional changes.
        
        The Three Issues
         1. *Context window routing error*
            - Short-context requests sometimes routed to long-context servers.
         
            - Started small, worsened after load-balancing changes.
         
         2. *Output corruption*
            - TPU misconfigurations led to weird outputs (wrong language,
              syntax errors).
         
            - Runtime optimizations wrongly boosted improbable tokens.
         
         3. *Approximate top-k miscompilation*
            - A compiler bug in TPU/XLA stack corrupted token probability
              selection.
         
            - Occasionally dropped the true top token.
        
        Why It Was Hard to Detect
        - Bugs were subtle, intermittent, and platform-dependent.
        
        - Benchmarks missed these degradations.
        
        - Privacy/safety rules limited access to real user data for debugging.
        
        Fixes and Next Steps
        - More sensitive, continuous evals on production.
        
        - Better tools to debug user feedback safely.
        
        - Stronger validation of routing, output correctness, and
        token-selection.
       
          sebastiennight wrote 17 hours 29 min ago:
          > Privacy/safety rules limited access to real user data for
          debugging.
          
          Do their ToS really limit access to user data (prompt/response)? I
          don't remember seeing anything to that effect in their terms.
       
            favorited wrote 17 hours 15 min ago:
            I know that when you submit a thumbs up/down rating for a response,
            you need to opt-in to the whole chat conversation being shared with
            Anthropic.
       
            mcintyre1994 wrote 17 hours 25 min ago:
            I’d imagine they have a lot of internal controls, even if
            ultimately someone at the company can read the data within their
            terms. It makes sense that the teams debugging stuff wouldn’t
            have this access immediately.
       
       
   DIR <- back to front page