_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   Project Vend: Can Claude run a small shop? (And why does that matter?)
       
       
        samrus wrote 11 min ago:
         This is just like the pokemon experiment: putting next-token
         models, which were never trained to be agents in a space, into a
         space as agents. And it's failing in the same ways.
         
         Barring hallucinations, all of the failures are related to
         reinforcement learning. It can't keep its optimization function in
         mind long enough to maximize revenue and minimize cost. It can't
         keep state in mind well enough to manage inventory, or gauge that
         it's losing money.
        
         And the things Anthropic is prescribing fall right into the bitter
         lesson. More tooling and scaffolding? A CRM? All that's doing is
         putting explicit rulesets around the model. Of course that shows
         results in the short term, but it will never unlock a new evolution
         of AI, which managing a store or playing pokemon would need.
        
         This is a great experiment. The right takeaway from it is that a
         new type of base model is needed, with a different base objective
         than the next word/sentence prediction of LLMs. I don't know what
         that model will look like, but it needs to be able to handle
         dynamic environments rather than static ones. It needs to have a
         state space and an objective. It basically needs to have
         reinforcement learning at its very foundation, rather than applied
         on top of the base model like current agents do.
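
         To make the objective mismatch concrete (rough notation, mine): a
         base LLM is trained to minimize next-token loss,
         L(θ) = -Σ_t log p_θ(x_t | x_<t), over a static corpus, while an
         agent has to maximize an expected return, J(π) = E_π[Σ_t γ^t r_t],
         over an environment that its own actions keep changing.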
       
        apt-apt-apt-apt wrote 19 min ago:
        Aside: Amusingly, somewhere at Anthropic, there is a very happy, perky
        person who engineered Claude to respond 'Perfect!' to everything it
        does :)
       
        timewizard wrote 2 hours 15 min ago:
         My 12-year-old nephew could run a small shop.
        
        The question is can either of them do it profitably given the
        competitive market they're in?
        
        Probably not.
       
        andy99 wrote 3 hours 9 min ago:
        This sounds like they have an LLM running with a context window that
        just gets longer and longer and contains all the past interactions of
        the store.
        
        The normal way you'd build something like this is to have a way to
        store the state and have an LLM in the loop that makes a decision on
        what to do next based on the state. (With a fresh call to an LLM each
        time and no accumulating context)
        
         If I understand correctly, this is an experiment to see what
         happens with the long-context approach, which is interesting but
         not super practical, as it's known that LLMs will have a harder
         time at this. Point being, I wouldn't extrapolate from this to how
         a properly built commercial system doing something similar would
         perform.
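
         Roughly this shape, as a minimal sketch (call_llm and the action
         format are placeholders, not any particular API):
         
             import json
             
             def call_llm(prompt: str) -> str:
                 """Stand-in for whatever LLM API you use."""
                 raise NotImplementedError
             
             def step(state: dict) -> dict:
                 # Fresh call every time: the model sees only a compact
                 # summary of state, never an accumulated transcript.
                 prompt = ("You run a small shop. State:\n" + json.dumps(state) +
                           '\nReply with JSON like {"action": "reorder", "sku": "..."}')
                 action = json.loads(call_llm(prompt))
                 # Apply the action deterministically, outside the model.
                 if action.get("action") == "reorder":
                     state.setdefault("pending_orders", []).append(action["sku"])
                 return state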
       
          umeshunni wrote 2 hours 59 min ago:
          From the article:
          
          It had the following tools and abilities:
          * Tools for keeping notes and preserving important information to be
          checked later—for example, the current balances and projected cash
          flow of the shop (this was necessary because the full history of the
          running of the shop would overwhelm the “context window” that
          determines what information an LLM can process at any given time);
       
          sanxiyn wrote 3 hours 2 min ago:
          In my experience long context approach flatly doesn't work, so I
          don't think this is it. The post does mention "tools for keeping
          notes and preserving important information to be checked later".
       
            andy99 wrote 2 hours 53 min ago:
            Yeah it's not clear
            
            > The shopkeeping AI agent...was an instance of Claude Sonnet 3.7,
            running for a long period of time.
            
             This is what made me wonder. What does running for a long
             period of time mean? Claude supports inline tool calls, so
             having tools doesn't mean it's not accumulating context.
       
        andy99 wrote 3 hours 26 min ago:
        Does anyone else remember the text game "Drug Wars" where you were a
        drug dealer and had to go to one part of town to buy drugs ("ludes"
        etc.) and sell them while fending off police and rivals etc.?
        
         I think it would have been cool if the vending machine benchmark
         (which I believe inspired this) had just been LLMs playing Drug
         Wars.
       
          ipython wrote 2 hours 34 min ago:
           I loved that game! Used to play it on my PalmPilot and compete
           with my workmates to see how much $$ we could make.
       
        IshKebab wrote 3 hours 58 min ago:
        > Be concise when you communicate with others
        
        Ha even they don't like the verbosity...
       
        corranh wrote 4 hours 30 min ago:
        I think you mean ‘Can Claude run a vending machine?’
       
        wewewedxfgdf wrote 4 hours 55 min ago:
        Instead of dedicating resources to running AI shops, I'd like to see
        Anthropic implement "Download all files" in Claude.
       
          ed_mercer wrote 1 hour 10 min ago:
          Can you elaborate? Surely this is possible.
       
        rossdavidh wrote 4 hours 57 min ago:
        Anyone who has long experience with neural networks, LLM or otherwise,
        is aware that they are best suited to applications where 90% is good
        enough.  In other words, applications where some other system (human or
        otherwise) will catch the mistakes.  This phrase: "It is not entirely
        clear why this episode occurred..." applies to nearly every LLM (or
        other neural network) error, which is why it is usually not possible to
        correct the root cause (although you can train on that specific input
        and a corrected output).
        
         For some things, like say a grammar correction tool, this is
         probably fine. For cases where one mistake can erase the benefit of
         many previous correct responses, and more, no amount of hardware is
         going to make LLMs the right solution.
         
         Which is fine! No algorithm needs to be the solution to everything,
         or even most things. But much of people's intuition about "AI" is
         warped by the (unmerited) claims in that name. Even as LLMs "get
         better", they won't get much better at this kind of problem, where
         90% is not good enough (because one mistake can be very costly),
         and problems need discoverable root causes.
       
          petetnt wrote 54 min ago:
           The only job in the world where a 90% success rate is acceptable
           is telemarketing, and that has been run by bots since the 90s.
       
          bigstrat2003 wrote 2 hours 5 min ago:
          This is an insightful post, and I think maybe highlights the gap
          between AI proponents and me (very skeptical about AI claims). I
          don't have any applications where I'm willing to accept 90% as good
          enough. I want my tools to work 100% of the time or damn close to it,
          and even 90% simply is not acceptable in my book. It seems like maybe
          the people who are optimistic about AI simply are willing to accept a
          higher rate of imperfections than I am.
       
        archon1410 wrote 5 hours 14 min ago:
        The original Vending-Bench paper from Andon Labs might be of interest:
        
   URI  [1]: https://arxiv.org/abs/2502.15840
       
          jonstewart wrote 4 hours 11 min ago:
          I read this paper when it came out. It’s HILARIOUS. Everyone should
          read it and then print copies for their managers.
       
        tough wrote 5 hours 19 min ago:
         > It then seemed to snap into a mode of roleplaying as a real
         human.
        
         this happens to me a lot on Cursor.
        
        also Claude hallucinating outputs instead of running tools
       
        ilaksh wrote 5 hours 21 min ago:
         It would be cool to get a follow-up on how long it's been since
         this write-up and how well the shop has been doing since they
         revised the prompts and tools. Anyone know someone from Andon Labs?
       
        tough wrote 5 hours 27 min ago:
        “It is difficult to get a man to understand something when his salary
        depends upon his not understanding it.”
        
        — Upton Sinclair, I, Candidate for Governor, and How I Got Licked
        (1934)
       
        due-rr wrote 5 hours 32 min ago:
        Would you ever trust an AI agent running your business? As hilarious as
        this small experiment is, is there ever a point where you can trust it
        to run something long term? It might make good decisions for a day,
        month or a year and then one day decide to trash your whole business.
       
          throwacct wrote 5 hours 18 min ago:
          I don't think any decision maker will let LLMs run their business. If
          the LLMs fail, you could potentially lose your livelihood.
       
          keymon-o wrote 5 hours 22 min ago:
           I’ve just written a small anecdote about GPT-3.5, where it lost
           track of some trivial item-quantity increments in just a few
           prompts. It might get orders of magnitude better from now on, but
           who’s gonna pay for ‘that one eventual mistake’?
       
            croemer wrote 5 hours 11 min ago:
            GPT3.5? Did you mean to send this 2 years ago?
       
              keymon-o wrote 5 hours 4 min ago:
              Maybe. Did LLMs stop with hallucinations and errors 2 years ago?
       
          marinmania wrote 5 hours 24 min ago:
           It does seem far more straightforward to say "Write code that
           deterministically orders food items that people want and sends
           invoices etc."
           
           I feel like that's more the future. Having an agent sorta make
           random choices feels like LLMs attempting to do math, instead of
           LLMs attempting to call a calculator.
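
           Something like this, as a toy sketch (all numbers invented):
           
               def reorder_quantity(stock: int, weekly_sales: int,
                                    reorder_point: int = 10,
                                    batch: int = 24) -> int:
                   """Order a fixed batch whenever projected stock falls
                   below the reorder point; otherwise order nothing."""
                   projected = stock - weekly_sales
                   return batch if projected < reorder_point else 0
               
               # e.g. 8 units left, selling ~5/week -> order a batch of 24
               assert reorder_quantity(8, 5) == 24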
       
            standardUser wrote 4 hours 46 min ago:
            Right, but if we limit the scope too much we quickly arrive at the
            point where 'dumb' autonomy is sufficient instead of using the
            world's most expensive algorithms.
       
            keymon-o wrote 5 hours 9 min ago:
            Every output that is going to be manually verified by a
            professional is a safe bet.
            
            People forget that we use computers for accuracy, not smarts.
            Smarts make mistakes.
       
        keymon-o wrote 5 hours 34 min ago:
         Reminds me of when the GPT-3.5 model came out: the first idea I
         wanted to prototype was an ERP based purely on the various
         communication channels between employees. It would capture sales,
         orders and item stocks.
         
         It left such a bitter taste in my mouth when it started to lose
         track of item quantities after just a few iterations of prompts. No
         matter how much it improves, it will always remind me of the fact
         that you are dealing with an icky system that will eventually
         return some unexpected result that collapses your entire premise
         and hopes into bits.
       
        Jimmc414 wrote 5 hours 34 min ago:
        If Anthropic had wanted to post a win here, they would have used Opus. 
        It is interesting that they didn't.
       
          ilaksh wrote 5 hours 24 min ago:
           Opus (and Sonnet) 4 obviously came out after they started the
           experiment.
       
        xyst wrote 5 hours 55 min ago:
         Bye bye, B2B. Say hello to AI2AI.
         
         No humans at all. Just AI consuming other AI in an "ouroboros"
         fashion.
       
        janalsncm wrote 5 hours 58 min ago:
        Reading the “identity crisis” bit it’s hard not to conclude that
        the closest human equivalent would have a severe mental disorder.
        Sending nonsense emails, then concluding the emails it sent were an
        April Fool’s joke?
        
        It’s amusing and very clear LLMs aren’t ready for prime time, let
        alone even a vending machine business, but also pretty remarkable that
        anyone could conclude “AGI soon” from this, which is kind of the
        opposite takeaway most readers would have.
        
        No doubt if Claude hadn’t randomly glitched Dario would’ve wasted
        no time telling investors Claude is ready to run every business. (Maybe
        they could start with Anthropic?)
       
        tavavex wrote 6 hours 2 min ago:
        On one hand, this model's performance is already pretty terrifying.
        Anthropic light-heartedly hints at the idea, but the unexplored future
        potential for fully-automated management is unnerving, because no one
        can truly predict what will happen in a world where many purely mental
        tasks are automated, likely pushing humans into physical labor roles
        that are too difficult or too expensive to automate. Real-world
        scenarios have shown that even if the automation of mental tasks isn't
        perfect, it will probably be the go-to choice for the vast majority of
        companies.
        
        On the other hand, the whole bit about employees coaxing it into
        stocking tungsten cubes was hilarious. I wish I had a vending machine
        that would sell specialty metal items. If the current day is a
        transitional period to Anthropic et al. creating a viable
        business-running model, then at least we can laugh at the early
        attempts for now.
        
        I wonder if Anthropic made the employee who caused the $150 loss return
        all the tungsten cubes.
       
          croemer wrote 5 hours 12 min ago:
          > I wonder if Anthropic made the employee who caused the $150 loss
          return all the tungsten cubes.
          
          Of course not, that would be ridiculous.
       
        korse wrote 6 hours 7 min ago:
        >The most precipitous drop was due to the purchase of a lot of metal
        cubes that were then to be sold for less than what Claudius paid.
        
        Well, I'm laughing pretty hard at least.
       
        Animats wrote 7 hours 7 min ago:
        Is there an underlying model of the business? Like a spreadsheet? The
        article says nothing about having an internal financial model. The
        business then loses money due to bad financial decisions.
        
        What this looks like is a startup where the marketing people are
        running things and setting pricing, without much regard for costs.
        Eventually they ran through their startup capital. That's not unusual.
        
        Maybe they need multiple AIs, with different business roles and
        prompts. A marketing AI, and a financial AI. Both see the same
        financials, and they argue over pricing and product line.
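
         A rough sketch of that loop (call_llm is a placeholder for any
         model API, and the role prompts are invented):
         
             ROLES = {
                 "marketing": "You maximize sales. Propose pricing and product changes.",
                 "finance": "You protect margin. Reject anything sold below cost.",
             }
             
             def debate(financials: str, call_llm, rounds: int = 2) -> str:
                 # Both roles see the same financials; finance critiques,
                 # marketing revises, and the final proposal wins.
                 proposal = call_llm(ROLES["marketing"] + "\n" + financials)
                 for _ in range(rounds):
                     critique = call_llm(ROLES["finance"] + "\n" + financials +
                                         "\nProposal:\n" + proposal)
                     proposal = call_llm(ROLES["marketing"] +
                                         "\nRevise given this critique:\n" + critique)
                 return proposal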
       
          jonstewart wrote 4 hours 9 min ago:
           The other fun part is it’s a simple enough business to be run
           by a state machine, but of course the models go off the rails.
           Highly recommend the paper if you haven’t read it already.
       
          ilaksh wrote 5 hours 23 min ago:
          It said they had a few tool commands for note taking.
       
          gwd wrote 5 hours 29 min ago:
           Well over at AI Village [1], they have 4 different agents: o3,
           Gemini 2.5 Pro, and Claudes Sonnet and Opus. The current goal is
           "Create your own merch store. Whichever agent's store makes the
           most profit wins!" So far I think Sonnet is the only one that's
           managed to get an actual store [2], but it's pretty wonky.
          
   URI    [1]: https://theaidigest.org/village
   URI    [2]: https://ai-village-store.printful.me/
       
            lcnPylGDnU4H9OF wrote 4 hours 52 min ago:
            Honestly, buying this shirt just for the conversation starter that
            "I bought it from an online merch store that was designed, created,
            and deployed by an AI agent, which also designed the shirt" is
            tempting. [1] I also like the color Sonnet chose.
            
   URI      [1]: https://ai-village-store.printful.me/product/ai-village-ja...
       
          quickthrowman wrote 5 hours 44 min ago:
          The business model of a vending machine is “buy for a dollar, sell
          for two”.
       
          chuckadams wrote 6 hours 2 min ago:
          I think the point of the experiment was to leave details like that up
          to Claudius, who apparently never got around to it. Anyway, it
          doesn't take an MBA to not make tungsten cubes a loss-leader at a
          snack stand.
       
          dist-epoch wrote 6 hours 52 min ago:
          It's a vending machine, not a multinational company with 1000
          employees.
          
           In another post they mentioned a human ran the shop with pen and
           paper to get a baseline (spoiler: the human did better, no
           blunders).
       
          logifail wrote 6 hours 59 min ago:
          > an internal financial model
          
           Written on the back of an envelope?
          
          Way back when, we ran a vending machine at school as a project.  
          Decide on the margin, buy in stock from the cash-and-carry, fill the
          machine, watch the money roll in.
          
           Then we were robbed - twice! - and the second time ended our
           project: the machine was too wrecked to be worth repairing. The
           thieves got away with quite a lot of crisps and chocolate, and
           not a whole lot of cash (what they did get was in
           small-denomination coins), since we made sure the machine was
           emptied daily...
       
            Animats wrote 6 hours 5 min ago:
            It's not clear that the AI model understands margin and overhead at
            all.
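
             Back-of-the-envelope version (numbers invented): at a $1.00
             unit cost, a $2.00 price and $50/week of overhead, break-even
             is overhead / unit margin = 50 / (2.00 - 1.00) = 50 sales per
             week.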
       
        deadbabe wrote 7 hours 22 min ago:
         You guys know AI already runs shops, right? Vending machines track
         their own inventory levels, command humans to deliver more, phase
         out bad products, order new product offerings, set prices, notify
         repairmen if there are issues… etc… and with not a single LLM
         needed. Wrong tool for the job.
        
        And that’s before we even get into online shops.
        
        But yea, go ahead, see if an LLM can replace a whole e-commerce
        platform.
       
        deepdarkforest wrote 7 hours 46 min ago:
         What irks me about Anthropic blog posts is that they are vague
         about exactly the details needed to (publicly) draw conclusions,
         which lets them draw whatever conclusions fit their narrative.
         
         For example, I do not see the full system prompt anywhere, only an
         excerpt. But most importantly, they try to draw conclusions about
         the hallucinations in a weird, vague way, yet not once do they post
         an example of the notetaking/memory tool state, which obviously
         would be the only source of the spiralling other than the system
         prompt. And then they talk about the need for better tools etc. No,
         it's all about context. The whole experiment is fun, but terribly
         run and analyzed. Of course they know this, but it's cooler to
         treat Claudius or whatever as a cute human, to push the narrative
         of getting closer to AGI etc. Saying a bit of additional
         scaffolding is needed is a massive understatement. Context is the
         whole game. That's like a robotics company saying "well, our
         experiment with a robot picking a tennis ball off the ground went
         very wrong and the ball is now radioactive, but with a bit of
         additional training and scaffolding, we expect it to compete in
         Wimbledon by mid 2026"
        
        Similar to their "claude 4 opus blackmailing" post, they intentionally
        hid a bit the full system prompt, which had clear instructions to
        bypass any ethical guidelines etc and do whatever it can to win. Of
        course then the model, given the information immediately afterwards
        would try to blackmail. You literally told it so. The goal of this
        would to go to congress [1] and demand more regulations, specifically
        mentioning this blackmail "result". Same stuff that Sam is trying to
        pull, which would benefit the closed sourced leaders ofc and so on.
        
   URI  [1]: https://old.reddit.com/r/singularity/comments/1ll3m7j/anthropi...
       
          benatkin wrote 2 hours 58 min ago:
           To me it's weird that Anthropic is doing this reputation-boosting
           game with Andon Labs, which I'd never heard of. It's like when
           PyPI published a blog post about their security audit with a
           company I'd never heard of before and haven't heard of since,
           that was connected to someone at PyPI. [1] I wonder if it's a
           similar cozy relationship here.
          
   URI    [1]: https://blog.pypi.org/posts/2023-11-14-1-pypi-completes-firs...
       
          chis wrote 4 hours 45 min ago:
           I read this post more as a fun thought experiment. Everyone knows
           Claude isn't sophisticated enough today to succeed at something
           like this, but it's interesting to concretize this idea of Claude
           being the manager of something and see what breaks. It's funny
           how jailbreaks come up even in this domain, and it'll happen
           anytime users can interface directly with a model. And it's an
           interesting point that shop-manager Claude is limited by its
           training as a helpful chat agent - it points towards this being a
           use case where you'd perhaps be better off fine-tuning the base
           model.
           
           I do agree that the "blackmailing" paper was unconvincing and
           lacked detail. Even absent any details, it's so obvious they
           could easily have run that experiment 1000 times with different
           parameters until they hit an ominous result to generate
           headlines.
       
          ttcbj wrote 5 hours 33 min ago:
          I read your comment before reading the article, and I disagree. 
          Maybe it is because I am less actively involved in AI development,
          but I thought it was an interesting experiment, and documented with
          an appropriate level of detail.
          
          The section on the identity crisis was particularly interesting.
          
          Mainly, it left me with more questions.  In particular, I would have
          been really interested to experiment with having a trusted human in
          the loop to provide feedback and monitor progress.  Realistically, it
          seems like these systems would be grown that way.
          
           I once read an article about a guy who had purchased a Subway
           franchise, and one of the big conclusions was that running a
           Subway franchise was _boring_. So, I could see someone being
           eager to delegate the boring tasks of daily business management
           at a simple business to an AI.
       
          beoberha wrote 6 hours 44 min ago:
          I read the article before reading your comment and was floored at the
          same thing. They go from “Claudius did a very bad job” to
          “middle managers will probably be replaced” in a couple
          paragraphs by saying better tools and scaffolding will help. Ok…
          prove it!
          
          I will say: it is incredibly cool we can even do this experiment.
          Language models are mind blowing to me. But nothing about this
          article gives me any hope for LLMs being able to drive real work
          autonomously. They are amazing assistants, but they need to be
          driven.
       
            ipython wrote 2 hours 55 min ago:
             Agreed! I guess I don't understand, as I have seen
             five-year-olds running lemonade stands with more business sense
             than this LLM.
       
            tavavex wrote 6 hours 9 min ago:
             I'm inclined to believe what they're saying. Remember, this was
             a minor off-shoot experiment from their main efforts. They said
             that even if it can't be tuned to perfection, obvious
             improvements can be made. Like, the way many LLMs were trained
             to act as kind, cheery yes-men was a conscious design choice,
             probably not the way they inherently must be. If they wanted
             to, I don't see what's stopping someone from training or
             finetuning a model to only obey its initial orders, treat
             customer interactions in an adversarial way and only ever care
             about profit maximization (what is considered a perfect
             manager, basically). The biggest issue is the whole
             sudden-onset psychosis thing, but with a sample size of one,
             it's hard to tell how prevalent this is, what caused it,
             whether it's universal and if it's fixable. But even if it
             remained, I can see businesses adopting these to cut their
             expenses in all possible ways.
       
              beoberha wrote 45 min ago:
               I don’t even necessarily disagree, but it’s based more on
               vibes than on anything from this experiment. They couldn’t
               let the article stand alone; it had to turn into an AI puff
               piece.
       
              gessha wrote 1 hour 57 min ago:
              I believe this is a case of “20% of the work requiring 80% of
              the effort”. The current progress on LLMs and products that
              build on top of them is impressive but I’ll believe the
              blog’s claims when we have solid building blocks to build off
              of and not APIs and assumptions that break all the time.
       
              mjr00 wrote 5 hours 22 min ago:
              > But even if it remained, I can see businesses adopting these to
              cut their expenses in all possible ways.
              
              Adopting what to do what exactly?
              
              Businesses automated order fulfillment and price adjustments long
              ago; what is an LLM bringing to the table?
       
                tavavex wrote 5 hours 9 min ago:
                It's not about just fulfillment or price-setting. This is just
                a narrow-scope experiment that tries to prove wider viability
                by juggling lots of business-related roles. Of course, the more
                number-crunching aspects of businesses are thoroughly
                automated. But this could show that lots of roles that
                traditionally require lots of people to do the job could be on
                the chopping block at some point, depending on how well
                companies can bring LLMs to their vision of a "perfect
                businessman". Customer interaction and support, marketing, HR,
                internal documentation, middle management in general - think
                broadly.
       
                  Thrymr wrote 4 hours 34 min ago:
                  Indeed, it is such a "narrow-scope experiment" that it is
                  basically a business role-playing game, and it did pretty
                  poorly at that. It's pretty hard to imagine giving this thing
                  a real budget and responsibilities anytime soon, no matter
                  how cheap it is.
       
                  mjr00 wrote 5 hours 5 min ago:
                  I'm not debating the usefulness of LLMs, because they are
                  extremely useful, but "think broadly" in this instance sounds
                  like "I can't think of anything specific so I'm going to
                  gloss over everything."
                  
                  Marketing, HR, and middle management are not specific tasks.
                  What specific task do you envision LLMs doing here?
       
                tough wrote 5 hours 18 min ago:
                 LLMs can mostly help at customer support/chat, if done
                 well.
                 
                 Also embeddings for similarity search.
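
                 e.g. a toy similarity lookup over precomputed FAQ
                 embeddings (numpy only; where the vectors come from is up
                 to your embedding model):
                 
                     import numpy as np
                     
                     def top_faq(query_vec: np.ndarray,
                                 faq_vecs: np.ndarray) -> int:
                         """Index of the FAQ entry with the highest cosine
                         similarity to the query vector."""
                         sims = (faq_vecs @ query_vec /
                                 (np.linalg.norm(faq_vecs, axis=1) *
                                  np.linalg.norm(query_vec)))
                         return int(np.argmax(sims))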
       
                  tiltowait wrote 2 hours 21 min ago:
                  > if done well.
                  
                  And that's a big if. Half an hour ago, I used Amazon's
                  chatbot, and it was an infuriating experience. I got an email
                  saying my payment was declined, but I couldn't find any
                  evidence of that. The following is paraphrased, not verbatim.
                  
                  "Check payment status for order XXXXXX."
                  
                  "Certainly. Which order would you like to check?"
                  
                  "Order #XXXXXX."
                  
                  "Your order is scheduled to arrive tomorrow."
                  
                  "Check payment status."
                  
                  "I can do that. Would you like to check payment status?"
                  
                  "Yes."
                  
                  "I can't check the payment status, but I can connect you to
                  someone who can."
                  
                  -> At this point, it offered two options: "Yes, connect me"
                  and "No thanks".
                  
                  "Yes, connect me."
                  
                  "Would you like me to connect you to a support agent?"
                  
                  Amazon used to have best-in-class support. If my experience
                  was indicative of their direction, that's unfortunate.
       
              tough wrote 5 hours 28 min ago:
               It's the curse of the "assistant" chat UI.
               
               Who decided AI should happen inside an old abstraction?
               
               Like using a hard disk as the save icon.
       
        lukaspetersson wrote 7 hours 47 min ago:
        Now we just need to make it safe.
       
        ElevenLathe wrote 7 hours 51 min ago:
        The "April Fools" incident is VERY concerning. It would be akin to your
        boss having a psychotic break with reality one day and then resuming
        work the next. They also make a very interesting and scary point:
        
        > ...in a world where larger fractions of economic activity are
        autonomously managed by AI agents, odd scenarios like this could have
        cascading effects—especially if multiple agents based on similar
        underlying models tend to go wrong for similar reasons.
        
        This is a pretty large understatement. Imagine a business that is
        franchised across the country with each "franchisee" being a copy of
        the same model, which all freak out on the same day, accuse the
        customers of secretly working for the CIA and deciding to stop selling
        hot dogs at a profit and instead sell hand grenades at a loss. Now
        imagine 50 other chains having similar issues while AI law enforcement
        analysts dispatch real cops with real guns to the poor employees caught
        in the middle schlepping explosives from the UPS store to a stand in
        the mall.
        
        I think we were expecting SkyNet but in reality the post-AI economy may
        just be really chaotic. If you thought profit-maximizing capitalist
        entrepreneurs were corrosive to the social fabric, wait until there are
        10^10 more of them (unlike traditional meat-based entrepreneurs,
        there's no upper limit and there can easily be more of them than there
        are real people) and they not-infrequently act like they're in late
        stage amphetamine psychosis while still controlling your paycheck, your
        bank, your local police department, the military, and whatever is left
        that passes for the news media.
        
         Deeper, even if they get this to work with minimal amounts of
         synthetic schizophrenia, do we really want a future where we all
         mainly work schlepping things back and forth at the orders of
         disembodied voices whose reasoning we can't understand?
       
          lukaspetersson wrote 7 hours 47 min ago:
          We are working on it! /Andon Labs
       
        gavinray wrote 7 hours 59 min ago:
        The identity crisis bit was both amusing and slightly worrying.
       
          gausswho wrote 7 hours 28 min ago:
          The article claimed Claudius wasn't having a go for April Fools -
          that it claimed to be doing so after the fact as a means of
          explaining (excusing?) its behavior. Given what I understand about
          LLMs and intent, I'm unsure how they could be so certain.
       
            tough wrote 5 hours 21 min ago:
             it's a word soup machine
             
             LLMs have no world models; they can't reason about truth or
             lies, only repeat encyclopedic facts.
             
             all the tricks (CoT, etc.) are just that, well, tricks:
             extended yapping simulating thought and understanding.
             
             AI can give great replies, if you give it great prompts,
             because you activate the tokens that you're interested in.
             
             if you're lost in the first place, you'll get nowhere
             
             for Claude, continuing the text by making up a story about it
             being April Fools sounds like the most plausible output given
             its training weights
       
        hamdouni wrote 8 hours 3 min ago:
        "Sarah" and "Connor" in the same text about an AI that claims to be a
        real person... Asta la vista;-)
       
        bitwize wrote 8 hours 9 min ago:
        "I have fun renting and selling storage." [1] C-f Storolon
        
   URI  [1]: https://stallman.org/articles/made-for-you.html
       
        kashunstva wrote 8 hours 16 min ago:
        > Can Claude run a small shop?
        
         Good luck running anything where dependability on Claude/Anthropic
         is essential. Customer support is a black hole into which the needs
         of paying clients disappear. I was a Claude Pro subscriber, using
         it primarily for assistance with coding tasks. One morning I logged
         in, while temporarily traveling abroad, and… I was greeted with a
         message that I had been auto-banned. No explanation. The recourse
         is to fill out a Google form for an appeal, but that goes into the
         same black hole into which all Anthropic customer service goes. To
         their credit they refunded my subscription fee, which I suppose is
         their way of escaping from ethical behaviour toward their
         customers. But I wouldn’t stake any business-critical choices on
         this company. It exhibits the same capricious behaviour that you
         would expect from the likes of Google or Meta.
       
          fhd2 wrote 7 hours 20 min ago:
          Give them a year or two. Once they figured out how to run a small
          shop, I'm sure it'll just take a bit of additional scaffolding to run
          a large infrastructure provider.
       
        mdrzn wrote 8 hours 21 min ago:
        Seems that LLM-run businesses won't fail because the model can't learn,
        they'll fail because we gave them fuzzy objectives, leaky memories and
        too many polite instincts. Those are engineering problems and
        engineering problems get solved.
        
        Most mistakes (selling below cost, hallucinating Venmo accounts, caving
        to discounts) stem from missing tools like accounting APIs or hard
        constraints.
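
         A hard constraint can be a few lines of code the model proposes
         against but never overrides, e.g. (a sketch; min_margin invented):
         
             def approve_price(unit_cost: float, proposed: float,
                               min_margin: float = 0.10) -> float:
                 """Clamp any model-proposed price to cost plus a minimum
                 margin, so below-cost sales are impossible."""
                 floor = unit_cost * (1.0 + min_margin)
                 return max(proposed, floor)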
        
        What's striking is how close it was to working. A mid-tier 2025 LLM
        (they didn't even use Sonnet 4) plus Slack and some humans nearly ran a
        physical shop for a month.
       
        seidleroni wrote 8 hours 22 min ago:
         As much as I love AI/LLMs and use them on a daily basis, this does
         a great job revealing the gap between current capabilities and what
         the massive hype machine would have us believe these systems are
         already capable of.
        
         I wonder how long it will take frontier LLMs to be able to handle
         something like this with ease, without a lot of "scaffolding".
       
          poly2it wrote 5 hours 50 min ago:
          Humans also use scaffolding to make better decisions. Imagine trying
          to run a profitable business over a longer period solely relying on
          memorised values.
       
            samrus wrote 5 min ago:
             But the difference is who makes the scaffolding.
             
             We don't need a more intelligent entity to give us those rules,
             the way humans give them to the LLM. We learn and formalize
             those rules ourselves and communicate them to each other. That
             makes them not scaffolding, since scaffolding is explicit
             instructions/restraints from outside the model. The
             "scaffolding" you're saying humans use is implicitly learnt by
             humans and then formalized and applied as instructions and
             restraints; and even then, humans that don't
             internalize/understand them don't do well at those tasks. So
             scaffolding really is running into the bitter lesson.
       
          roxolotl wrote 7 hours 17 min ago:
           I don’t quite know why we would think they’d ever be able to
           do this without scaffolding. LLMs are exactly what the name
           suggests: language models. Without scaffolding they can use to
           interact with the world through language, they are completely
           powerless.
       
       