_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   Two things LLM coding agents are still bad at
       
       
        mohsen1 wrote 35 min ago:
        > LLMs are terrible at asking questions
        
        Not if they're instructed to. In my experience you can adjust the
        prompt to make them ask questions. They ask very good questions
        actually!
       
        Plough_Jogger wrote 53 min ago:
        Let's just change the title to "LLM coding agents don't use copy &
        paste or ask clarifying questions" and save everyone the click.
       
        odkral wrote 56 min ago:
         If I need exact copy-pasting, I indicate that a couple of times in
         the prompt and it (Claude) actually does what I'm asking. But yeah,
         overall very bad at refactoring big chunks.
       
        SamDc73 wrote 58 min ago:
         For 2) I feel like codex-5 kind of attempted to address this
         problem; with codex it usually asks a lot of questions and gives
         options before digging in (without me prompting it to).
         
         For copy-paste, you made it feel like low-hanging fruit? Why don't
         AI agents have copy/paste tools?
       
        joshribakoff wrote 1 hour 11 min ago:
        My human fixed a bug by introducing a new one. Classic. Meanwhile, I
        write the lint rules, build the analyzers, and fix 500 errors before
        they’ve finished reading Stack Overflow. Just don’t ask me to
        reason about their legacy code — I’m synthetic, not insane.
        
        —
        
         Just because this new contributor is forced to effectively “SSH”
         into your codebase and edit not even with vim but with sed and awk
         does not mean that this contributor is incapable of using other
         tools if empowered to do so. The fact that it is able to work within
         such constraints goes to show how much potential there is. It is
         already much better than a human at erasing the text and re-typing
         it from memory, and while it is a valid criticism that it needs to
         be taught how to move files, imagine what it is capable of once it
         starts to use tools effectively.
        
        —
        
         Recently, I observed an LLM flail around for hours trying to get
         our e2e tests running as it tried to coordinate three different
         processes in three different terminals. It kept running commands in
         one terminal, trying to kill a process or check whether the port
         was in use in the other terminal.
        
         However, once I prompted the LLM to create a script for running all
         three processes concurrently, it was able to create that script,
         leverage it, and autonomously debug the tests, now far faster than I
         am able to. It has also saved any new human who tries to contribute
         from similar hours of flailing around. It is something we could have
         easily done by hand but just never had the time to do before LLMs.
         If anything, the LLM is just highlighting an existing problem in our
         codebase that some of us got too used to.
        
         So yes, LLMs make stupid mistakes, but so do humans; the thing is
         that LLMs can identify and fix them faster (and better, with proper
         steering).
       
        strangescript wrote 1 hour 18 min ago:
         You don't want your agents to ask questions. You are thinking too
         short term. It's not ideal now, but agents that have to ask frequent
         questions are useless when it comes to the vision of totally
         autonomous coding.
         
         Humans ask questions of groups to fix our own personal
         shortcomings. It makes no sense to try and master an internal
         system I rarely use; I should instead ask someone that maintains
         it. AI will not have this problem provided we create paths of
         observability for them. It doesn't take a lot of "effort" for them
         to completely digest an alien system they need to use.
       
          justonceokay wrote 1 hour 13 min ago:
           If you look at a piece of architecture, you might be able to infer
           the intentions of the architect. However, there are many possible
           interpretations. So if you were to add an addition to the
           building, it makes sense that you might want to ask about the
           intentions.
           
           I do not believe that AI will magically overcome the Chesterton's
           Fence problem in a 100% autonomous way.
       
        causal wrote 1 hour 26 min ago:
        Similar to the copy/paste issue I've noticed LLMs are pretty bad at
        distilling large documents into smaller documents without leaving out a
        ton of detail. Like maybe you have a super redundant doc. Give it to an
        LLM and it won't just deduplicate it, it will water the whole thing
        down.
       
        justinhj wrote 1 hour 26 min ago:
         Building an MCP tool that has access to refactoring operations
         should be straightforward, and using it appropriately is well within
         the capabilities of current models. I wonder if it exists? I don't
         do a lot of refactoring with LLMs so I haven't really had this pain
         point.
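         
         A rough sketch of what such an MCP tool could wrap, using the rope
         refactoring library for Python as one example (this is my
         assumption of how it might look, not an existing agent feature):
         
             from rope.base.project import Project
             from rope.refactor.rename import Rename
             
             def rename_symbol(root: str, path: str, offset: int,
                               new_name: str) -> None:
                 """Rename the symbol at `offset` in `path` project-wide."""
                 project = Project(root)
                 try:
                     resource = project.get_resource(path)
                     rename = Rename(project, resource, offset)
                     project.do(rename.get_changes(new_name))
                 finally:
                     project.close()
         
         Exposed as a tool, the agent would only need to supply a file, an
         offset, and the new name, instead of rewriting every call site.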
       
        majora2007 wrote 1 hour 36 min ago:
        I think LLMs provide value, used it this morning to fix a bug in my PDF
        Metadata parser without having to get too deep into the PDF spec.
        
         But most of the time, I find that the outputs are nowhere near as
         good as just doing it myself. I tried Codex Code the other day to
         write some unit tests. I had a few set up and wanted to use it
         (because mocking the data is a pain).
         
         It took about 8 attempts, I had to manually fix code, and it
         couldn't understand that some entities were obsolete (despite being
         marked and the original service not using them). Overall, I was
         extremely disappointed.
        
        I still don't think LLMs are capable of replacing developers, but they
        are great at exposing knowledge in fields you might not know and help
        guide you to a solution, like Stack Overflow used to do (without the
        snark).
       
          ojosilva wrote 27 min ago:
           I think LLMs have what it takes at this point in time, but it's
           the coding agent (combined with the model) that makes the magic
           happen. Coding agents can implement copy-pasting; it's a matter of
           building the right tool for it, then iterating with given
           models/providers, etc. And that's true for everything else that
           LLMs lack today. Shortcomings can be remedied with good memory and
           context engineering, safety-oriented instructions, endless
           verification and good overall coding agent architecture. Having a
           model that can respond fast, has a large context window and
           maintains attention to instructions is also essential for a good
           overall experience.
           
           And the human prompting, of course. It takes good software
           engineering skills, particularly knowing how to instruct other
           devs in getting the work done, setting up a good AGENTS.md
           (CLAUDE.md, etc.) with codebase instructions, best practices, and
           so on.
           
           So it's not an "AI/LLMs are capable of replacing developers"
           argument... that's getting old fast. It's more like, paraphrasing
           the wise: "it's not what your LLM can do for you, but what you can
           do for your LLM."
       
        simonw wrote 1 hour 37 min ago:
        I feel like the copy and paste thing is overdue a solution.
        
        I find this one particularly frustrating when working directly with
        ChatGPT and Claude via their chat interfaces. I frequently find myself
        watching them retype 100+ lines of code that I pasted in just to make a
        one line change.
        
        I expect there are reasons this is difficult, but difficult problems
        usually end up solved in the end.
       
          danenania wrote 47 min ago:
          Yeah, I’ve always wondered if the models could be trained to output
          special reference tokens that just copy verbatim slices from the
          input, perhaps based on unique prefix/suffix pairs. Would be a
          dramatic improvement for all kinds of tasks (coding especially).
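           
           If that ever happens, the agent harness could expand such markers
           deterministically. A toy sketch of the expansion step (the marker
           syntax here is entirely made up):
           
               import re
               
               # Hypothetical marker the model would emit instead of
               # re-typing a block: <copy prefix="..." suffix="...">
               COPY = re.compile(r'<copy prefix="(.*?)" suffix="(.*?)">')
               
               def expand_copies(output: str, source: str) -> str:
                   """Replace each marker with the verbatim slice of
                   `source` bounded by its unique prefix/suffix pair."""
                   def splice(m: re.Match) -> str:
                       start = source.index(m.group(1))
                       end = source.index(m.group(2), start)
                       return source[start:end + len(m.group(2))]
                   return COPY.sub(splice, output)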
       
          rhetocj23 wrote 1 hour 25 min ago:
           What's the time horizon for said problems to be solved? Because
           guess what - time is running out and people will not continue to
           aimlessly throw money at this stuff.
       
            simonw wrote 42 min ago:
            I don't see this one as an existential crisis for AI tooling, more
            of a persistent irritation.
            
             AI labs have already shipped changes related to this problem -
             most notably speculative decoding, which lets you provide the
             text you expect to see come out again and speeds it up: [1]
             They've also been iterating a lot on better tools for editing
             code as part of the competition between Claude Code and Codex
             CLI and other coding agents.
            
            Hopefully they'll figure out a copy/paste mechanism as part of that
            work.
            
   URI      [1]: https://simonwillison.net/2024/Nov/4/predicted-outputs/
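             
             For reference, here's roughly what using Predicted Outputs
             looks like with the OpenAI Python SDK (the model and the edit
             request below are just placeholders):
             
                 from openai import OpenAI
                 
                 client = OpenAI()
                 code = open("app.py").read()  # file you want edited
                 
                 resp = client.chat.completions.create(
                     model="gpt-4o-mini",
                     messages=[{
                         "role": "user",
                         "content": "Rename load() to load_config():\n\n"
                                    + code,
                     }],
                     # unchanged text is accepted verbatim instead of
                     # being regenerated token by token
                     prediction={"type": "content", "content": code},
                 )
                 print(resp.choices[0].message.content)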
       
        TrackerFF wrote 1 hour 42 min ago:
        I very much agree on point 2.
        
         I often wish that instead of just starting to work on the code
         automatically, even if you hit enter / send by accident, the models
         would rather ask for clarification. The models assume a lot, and
         will just spit out code first.
         
         I guess this is somewhat to lower the threshold for non-programmers
         and to instantly give some answer, but it does waste a lot of
         resources, I think.
         
         Others have mentioned that you can fix all this by providing a guide
         to the model on how it should interact with you and what the answers
         should look like. But, still, it'd be nice to have it be a bit more
         human-like in this aspect.
       
        nc wrote 1 hour 42 min ago:
         Add to this list: the ability to verify a correct implementation by
         viewing a user interface, and taking a holistic, codebase-wide /
         interface-wide view of how best to implement something.
       
        capestart wrote 1 hour 44 min ago:
        Large language models can help a lot, yet they still lack the human
        touch, particularly in the areas of context comprehension and question
        formulation. The entire "no copy-paste" rule seems strange as well. It
        is as if the models were performing an operation solely in their minds
        rather than just repeating it like we do. It gives the impression that
        they are learning by making mistakes rather than thinking things
        through. They are certainly not developers' replacements at this point!
       
        jamesjyu wrote 1 hour 56 min ago:
        For #2, if you're working on a big feature, start with a markdown
        planning file that you and the LLM work on until you are satisfied with
        the approach. Doesn't need to be rocket science: even if it's just a
        couple paragraphs it's much better than doing it one shot.
       
        linsomniac wrote 1 hour 59 min ago:
        >Sure, you can overengineer your prompt to try get them to ask more
        questions
        
        That's not overengineering, that's engineering.  "Ask clarifying
        questions before you start working", in my experience, has led to some
        fantastic questions, and is a useful tool even if you were to not have
        the AI tooling write any code.    As a good programmer, you should know
        when you are handing the tool a complete spec to build the code and
        when the spec likely needs some clarification, so you can guide the
        tool to ask when necessary.
       
          manmal wrote 47 min ago:
          You can even tell it how many questions to ask. For complex topics, I
          might ask it to ask me 20 or 30 questions. And I'm always surprised
          how good those are. You can also keep those around as a QnA file for
          later sessions or other agents.
       
        segmondy wrote 1 hour 59 min ago:
         Someone has definitely fallen behind and has massive skill issues.
         Instead of learning, you are wasting time writing bad takes on LLMs.
         I hope most of you don't fall down this hole; you will be left
         behind.
       
        mehdibl wrote 2 hours 7 min ago:
         You can do copy and paste if you offer it a tool/MCP that does that.
         It's not complicated, using either function extraction with an AST
         as the target, or line numbers.
         
         Also, if you want it to pause and ask questions, you need to offer
         that through tools (Manus does that, for example). I have an MCP
         that does that, and surprisingly I got a lot of questions; if you
         prompt for it, it will do it. But the push currently is for full
         automation, and that's why it's not there. We are far better off in
         a supervised, step-by-step mode. There is already elicitation in
         MCP, but having a tool ask questions requires a UI that lets you
         send the input back.
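         
         To make that concrete, a line-number-based cut/paste pair could be
         as small as this (a sketch assuming the FastMCP helper from the MCP
         Python SDK; the tool names and semantics are made up):
         
             from pathlib import Path
             from mcp.server.fastmcp import FastMCP
             
             mcp = FastMCP("clipboard")
             clipboard: list[str] = []
             
             @mcp.tool()
             def cut_lines(path: str, start: int, end: int) -> str:
                 """Cut lines start..end (1-based, inclusive)."""
                 lines = Path(path).read_text().splitlines(keepends=True)
                 clipboard[:] = lines[start - 1:end]
                 rest = lines[:start - 1] + lines[end:]
                 Path(path).write_text("".join(rest))
                 return f"cut {len(clipboard)} lines"
             
             @mcp.tool()
             def paste_lines(path: str, after: int) -> str:
                 """Paste the clipboard verbatim after line `after`."""
                 lines = Path(path).read_text().splitlines(keepends=True)
                 new = lines[:after] + clipboard + lines[after:]
                 Path(path).write_text("".join(new))
                 return f"pasted {len(clipboard)} lines"
             
             if __name__ == "__main__":
                 mcp.run()
         
         The point is that the pasted lines are moved as bytes, never
         re-generated, so they can't silently drift.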
       
        ravila4 wrote 2 hours 11 min ago:
        Regarding copy-paste, I’ve been thinking the LLM could control a
        headless Neovim instance instead. It might take some specialized
        reinforcement learning to get a model that actually uses Vim correctly,
        but then it could issue precise commands for moving, replacing, or
        deleting text, instead of rewriting everything.
        
        Even something as simple as renaming a variable is often safer and
        easier when done through the editor’s language server integration.
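         
         For what it's worth, driving a headless instance is already easy to
         prototype with pynvim; the file and edits below are only
         illustrative:
         
             import pynvim
             
             # Attach to an embedded, headless Neovim child process.
             nvim = pynvim.attach(
                 "child", argv=["nvim", "--embed", "--headless"])
             
             nvim.command("edit src/app.py")  # hypothetical file
             nvim.command("10,25move 40")     # move lines 10-25 after 40
             nvim.command(r"%s/\<old_name\>/new_name/g")  # buffer rename
             nvim.command("write")
         
         The interesting (and harder) part is the specialized RL mentioned
         above: getting the model to reach for commands like these instead
         of rewriting the buffer.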
       
        enraged_camel wrote 2 hours 27 min ago:
        First point is very annoying, yes, and it's why for large refactors I
        have the AI write step-by-step instructions and then do it myself. It's
        faster, cheaper and less error-prone.
        
        The second point is easily handled with proper instructions. My AI
        agents always ask questions about points I haven't clarified, or when
        they come across a fork in the road. Frequently I'll say "do X" and
        it'll proceed, then halfway it will stop and say "I did some of this,
        but before I do the rest, you need to decide what to do about such and
        such". So it's a complete non-problem for me.
       
        pengfeituan wrote 2 hours 29 min ago:
         The first issue is related to the inner behavior of LLMs. Humans can
         ignore some of the detailed content of code when they copy and
         paste, but an LLM converts it into hidden states. That is a process
         of compression, and the output is a process of decompression, so
         something may be lost. So it is hard for an LLM to copy and paste;
         the agent developer should customize the edit rules to do this.
         
         The second issue is that LLMs do not learn much of the high-level
         contextual relationships of knowledge. This can be improved by
         introducing more such patterns in the training data, and current LLM
         training is doing a lot of this. I don't think it will be a problem
         in the next few years.
       
        8s2ngy wrote 2 hours 39 min ago:
        One thing LLMs are surprisingly bad at is producing correct LaTeX
        diagram code. Very often I've tried to describe in detail an electric
        circuit, a graph (the data structure), or an automaton so I can quickly
        visualize something I'm studying, but they fail. They mix up labels,
        draw without any sense of direction or ordering, and make other errors.
         I find this surprising because LaTeX/TikZ have been around for
         decades and there are plenty of examples they could have learned
         from.
       
        celeritascelery wrote 2 hours 42 min ago:
         The “LLMs are bad at asking questions” point is interesting.
         There are times when I will ask the LLM to do something without
         giving it all the needed information. And rather than telling me
         that something's missing or that it can't do it the way I asked, it
         will try to do a halfway job using fake data, or mock something out
         to accomplish it. What I really wish it would do is just stop and
         say, “hey, I can't do it like you asked. Did you mean this?”
       
        squirrel wrote 2 hours 53 min ago:
        A friendly reminder that "refactor" means "make and commit a tiny
        change in less than a few minutes" (see links below). The OP and many
        comments here use "refactor" when they actually mean "rewrite".
        
        I hear from my clients (but have not verified myself!) that LLMs
        perform much better with a series of tiny, atomic changes like Replace
        Magic Literal, Pull Up Field, and Combine Functions Into Transform. [1]
        [2]
        
   URI  [1]: https://martinfowler.com/books/refactoring.html
   URI  [2]: https://martinfowler.com/bliki/OpportunisticRefactoring.html
   URI  [3]: https://refactoring.com/catalog/
       
        Lerc wrote 2 hours 54 min ago:
         I think the issue with them making assumptions and failing to
         properly diagnose issues comes more from fine-tuning than from any
         particular limitation in LLMs themselves. When fine-tuned on a set
         of problem->solution data, the model kind of carries the assumption
         that the problem contains enough data for the solution.
         
         What is really needed is a tree of problems which appear identical
         at first glance, but where the issue and the solution are one of
         many possibilities which can only be revealed by finding what
         information is lacking, acquiring that information, testing the
         hypothesis and then, if the hypothesis is shown to be correct,
         finally implementing the solution.
         
         That's a much more difficult training set to construct.
        
         The editing issue, I feel, needs something more radical. Instead of
         the current methods of text manipulation, I think there is scope to
         have a kind of output position encoding for a model to emit data in
         a non-sequential order. Again this presents another training data
         problem: there are limited natural sources to work from showing
         programming in the order a programmer types it. On the other hand, I
         think it should be possible to build synthetic training examples by
         taking existing model outputs that emit patches, search/replaces,
         regex mods etc. and translating those into a format that directly
         encodes the final position of the desired text.
         
         At some stage I'd like to see if it's possible to reconstruct the
         model's current idea of what the code is purely by scanning a list
         of cached head_embeddings of any tokens that turned into code. I
         feel like there should be enough information, given the order of
         emission and the embeddings themselves, to reconstruct a piecemeal
         generated program.
       
        imcritic wrote 3 hours 9 min ago:
         About the first point mentioned in the article: could that problem
         be solved simply by changing the task from something like "refactor
         this code" to something like "refactor this code as a series of
         smaller atomic changes (like moving blocks of code or renaming
         variable references in all places), each suitable for a git commit
         (and provide git message texts for those commits)"?
       
        crazygringo wrote 3 hours 55 min ago:
        > Sure, you can overengineer your prompt to try get them to ask more
        questions (Roo for example, does a decent job at this) -- but it's very
        likely still won't.
        
        Not in my experience. And it's not "overengineering" your prompt, it's
        just writing your prompt.
        
        For anything serious, I always end every relevant request with an
        instruction to repeat back to me the full design of my instructions or
        ask me necessary clarifying questions first if I've left anything
        unclear, before writing any code. It always does.
        
        And I don't mind having to write that, because sometimes I don't want
        that. I just want to ask it for a quick script and assume it can fill
        in the gaps because that's faster.
       
        notpachet wrote 3 hours 57 min ago:
        > They keep trying to make it work until they hit a wall -- and then
        they just keep banging their head against it.
        
        This is because LLMs trend towards the centre of the human cognitive
        bell curve in most things, and a LOT of humans use this same problem
        solving approach.
       
          gessha wrote 3 hours 43 min ago:
          The approach doesn’t matter as much. The halting problem does :)
       
        NumberCruncher wrote 4 hours 4 min ago:
        I don’t really understand why there’s so much hate for LLMs here,
        especially when it comes to using them for coding. In my experience,
        the people who regularly complain about these tools often seem more
        interested in proving how clever they are than actually solving real
        problems. They also tend to choose obscure programming languages where
        it’s nearly impossible to hire developers, or they spend hours
        arguing over how to save $20 a month.
        
        Over time, they usually get what they want: they become the smartest
        ones left in the room, because all the good people have already moved
        on. What’s left behind is a codebase no one wants to work on, and you
        can’t hire for it either.
        
        But maybe I’ve just worked with the wrong teams.
        
        EDIT: Maybe this is just about trust. If you can’t bring yourself to
        trust code written by other human beings, whether it’s a package, a
        library, or even your own teammates, then of course you’re not going
        to trust code from an LLM. But that’s not really about quality,
        it’s about control. And the irony is that people who insist on
        controlling every last detail usually end up with fragile systems
        nobody else wants to touch, and teams nobody else wants to join.
       
          kakacik wrote 3 hours 49 min ago:
           It has been discussed ad nauseam. It demolishes the learning curve
           all of us with decade(s) of experience went through to become the
           seniors we are. It's not a function of age, not a function of time
           spent staring at some screen or churning out basic CRUD apps; it's
           a function of hard experience, frustration, hard-won battles,
           grokking underlying technologies or algorithms.
           
           LLMs provide little of that; they make people lazy, and juniors
           stay juniors forever, even degrading mentally in some aspects.
           People need struggle to grow; somebody who has had their hand held
           their whole life ends up useless and disconnected from reality,
           unable to self-sufficiently achieve anything significant. Too easy
           a life destroys humans and animals alike (many experiments have
           been done on that, with damning results).
           
           There is much more, like hallucinations and the questionable added
           value of stuff that confidently looks OK but has underlying
           hard-to-debug bugs, but the above should be enough for a start.
           
           I suggest actually reading those conversations, not just skimming
           through them; this has been stated countless times.
       
          tossandthrow wrote 3 hours 55 min ago:
           I regularly check in on using LLMs. But a key criterion for me is
           that an LLM needs to objectively make me more efficient, not
           subjectively.
           
           Often I find myself cursing at the LLM for not understanding what
           I mean - which is expensive in lost time / cost of tokens.
           
           It is easy to say: then just don't use LLMs. But in reality, it is
           not that easy to break out of these loops of explaining, and it is
           extremely hard to assess when not to trust the LLM to finish the
           task.
           
           I also find that LLMs consistently don't follow guidelines, e.g.
           to never use coercions in TypeScript (it always sneaks in a rogue
           `as` somewhere), so I cannot trust the output and need to be extra
           vigilant when reviewing.
          
          I use LLMs for what they are good at. Sketching up a page in
          React/Tailwind, sketching up a small test suite - everything that can
          be deemed a translation task.
          
          I don't use LLMs for tasks that are reasoning heavy: Data modelling,
          architecture, large complex refactors - things that require deep
          domain knowledge and reasoning.
       
            NumberCruncher wrote 3 hours 47 min ago:
            > Often I find myself cursing at the LLM for not understanding what
            I mean...
            
            Me too. But in all these cases, sooner or later, I realized I made
            a mistake not giving enough context and not building up the
            discussion carefully enough. And I was just rushing to the
            solution. In the agile world, one could say I gave the LLM not a
            well-defined story, but a one-liner. Who is to blame here?
            
            I still remember training a junior hire who started off with:
            
            “Sorry, I spent five days on this ticket. I thought it would only
            take two. Also, who’s going to do the QA?”
            
            After 6 months or so, the same person was saying:
            
            “I finished the project in three weeks. I estimated four. QA is
            done. Ready to go live.”
            
            At that point, he was confident enough to own his work end-to-end,
            even shipping to production without someone else reviewing it.
            Interestingly, this colleague left two years ago, and I had to take
            over his codebase. It’s still running fine today, and I’ve
            spent maybe a single day maintaining it in the last two years.
            
            Recently, I was talking with my manager about this. We agreed that
            building confidence and self-checking in a junior dev is very
            similar to how you need to work with LLMs.
            
            Personally, whenever I generate code with an LLM, I check every
            line before committing. I still don’t trust it as much as the
            people I trained.
       
        mcny wrote 4 hours 55 min ago:
         I sometimes give LLMs random "easy" questions. My assessment is
         still that they all need the fine print "bla bla can be incorrect".
         
         You should either already know the answer or have a way to verify
         the answer. If neither, the matter must be inconsequential, just
         childlike curiosity. For example, I wonder how many moons Jupiter
         has... It could be 58, it could be 85, but either answer won't alter
         any of what I do today.
        
        I suspect some people (who need to read the full report) dump thousand
        page long reports into LLM, read the first ten words of the response
        and pretend they know what the report says and that is scary.
       
          mexicocitinluez wrote 3 hours 25 min ago:
          > or have a way to verify the answer
          
           Fortunately, as devs, this is our main loop. Write code, test,
           debug. And it's why people who fear AI-generated code making its
           way into production and causing errors make me laugh. Are you not
           testing your code? Or even debugging it? Like, what process are
           you using that prevents bugs from happening? Guess what? It's the
           exact same process with AI-generated code.
       
          latexr wrote 3 hours 32 min ago:
          > For example, I wonder how many moons Jupiter has... It could be 58,
          it could be 85
          
          For those curious, the answer is 97.
          
   URI    [1]: https://en.wikipedia.org/wiki/Moons_of_Jupiter
       
        mr_mitm wrote 4 hours 59 min ago:
        The other day, I needed Claude Code to write some code for me. It
        involved messing with the TPM of a virtual machine. For that, it was
        supposed to create a directory called `tpm_dir`. It constantly got it
        wrong and wrote `tmp_dir` instead and tried to fix its mistake over and
        over again, leading to lots of weird loops. It completely went off the
        rails, it was bizarre.
       
        hotpotat wrote 5 hours 3 min ago:
        Lol this person talks about easing into LLMs again two weeks after
        quitting cold turkey. The addiction is real. I laugh because I’m in
        the same situation, and see no way out other than to switch professions
        and/or take up programming as a hobby in which I purposefully subject
        myself to hard mode. I’m too productive with it in my profession to
        scale back and do things by hand — the cat is out of the bag and
        I’ve set a race pace at work that I can’t reasonably retract from
        without raising eyebrows. So I agree with the author’s referenced
        post that finding ways to still utilize it while maintaining a mental
        map of the code base and limiting its blast radius is a good middle
        ground, but damn it requires a lot of discipline.
       
          mallowdram wrote 3 hours 41 min ago:
          cat out of the bag is disautomation. the speed in the timetable is an
          illusion if the supervision requires blast radius retention. this is
          more like an early video game assembly line than a structured skilled
          industry
       
          schwartzworld wrote 5 hours 1 min ago:
          > I’ve set a race pace at work that I can’t reasonably retract
          from without raising eyebrows
          
          Why do this to yourself? Do you get paid more if you work faster?
       
            hotpotat wrote 4 hours 38 min ago:
            It started as a mix of self-imposed pressure and actually enjoying
            marking tasks as complete. Now I feel resistant to relaxing things.
            And no, I definitely don’t get paid more.
       
        cadamsdotcom wrote 5 hours 36 min ago:
        You need good checks and balances. E2E tests for your happy path, TDD
        when you & your agent write code.
        
        Then you - and your agent - can refactor fearlessly.
       
        mihau wrote 6 hours 46 min ago:
        > you can overengineer your prompt to try get them to ask more
        questions
        
        why overengineer? it's super simple
        
        I just do this for 60% of my prompts: "{long description of the
        feature}, please ask 10 questions before writing any code"
       
        amelius wrote 7 hours 3 min ago:
         I recently asked an LLM to fix an Ethernet connection while I was
         logged into the machine through another connection. Of course, I
         explicitly told the LLM not to break that connection. But, as you
         can guess, in the process it did break the connection.
         
         If an LLM can't do sysadmin stuff reliably, why do we think it can
         write quality code?
       
        arbirk wrote 7 hours 54 min ago:
         Those 2 things are not inherent to LLMs and could easily be changed
         by giving them the proper tools and instructions.
       
        podgorniy wrote 7 hours 57 min ago:
        > LLMs are terrible at asking questions. They just make a bunch of
        assumptions
        
        _Did you ask it to ask questions?_
       
        _ink_ wrote 7 hours 59 min ago:
        > LLMs are terrible at asking questions. They just make a bunch of
        assumptions and brute-force something based on those guesses.
        
         I don't agree with that. When I am telling Claude Code to plan
         something, I also mention that it should ask questions when
         information is missing. The questions it comes up with are really
         good, sometimes about cases I simply didn't see. To me the planning
         discussion doesn't feel much different than one in a GitLab thread,
         only at a much higher iteration speed.
       
        BenGosub wrote 8 hours 0 min ago:
        The issue is partly that some expect a fully fledged app or a full
        problem solution, while others    want incremental changes. To some
        extent this can be controlled by setting the rules in the beginning of
        the conversation. To some extent, because the limitations noted in the
        blog still apply.
       
        janmarsal wrote 8 hours 23 min ago:
        My biggest issue with LLMs right now is that they're such spineless yes
        men. Even when you ask their opinion on if something is doable or
        should it be done in the first place, more often than not they just go
        "Absolutely!" and shit out a broken answer or an anti-pattern just to
        please you. Not always, but way too often. You need to frame your
        questions way too carefully to prevent this.
        
        Maybe some of those character.ai models are sassy enough to have
        stronger opinions on code?
       
        cheema33 wrote 8 hours 30 min ago:
        From the article:
        > I contest the idea that LLMs are replacing human devs...
        
        AI is not able to replace good devs. I am assuming that nobody sane is
        claiming such a thing today. But, it can probably replace bad and
        mediocre devs. Even today.
        
        In my org we had 3 devs who went through a 6-month code boot camp and
        got hired a few years ago when it was very difficult to find good devs.
        They struggled. I would give them easy tasks and then clean up their
        PRs during review. And then AI tools got much better and it started
         outperforming these guys. We had to let two go. And the third one
         quit on his own.
        
        We still hire devs. But have become very reluctant to hire junior devs.
        And will never hire someone from a code boot camp. And we are not the
        only ones. I think most boot camps have gone out of business for this
        reason.
        
        Will AI tools eventually get good enough to start replacing good devs?
         I don't know. But the data so far shows that these tools keep
         getting better over time. Anybody who argues otherwise has their
         head firmly stuck in the sand.
        
         In early US history, approximately 90% of the population was
         involved in farming. Over the years things changed. Now about 2% has
        anything to do with farming. Fewer people are farming now. But we have
        a lot more food and a larger variety available. Technology made that
        possible.
        
        It is totally possible that something like that could happen to the
        software development industry as well. How fast it happens totally
        depends on how fast do the tools improve.
       
          Leynos wrote 7 hours 30 min ago:
           What do you think was the reason the bootcamp grads struggled to
           get better at what they do?
       
            _1 wrote 1 hour 35 min ago:
             My experience with them is that they are taught to cover as much
             syntax and as many libraries as possible, without spending time
             learning how to solve problems and develop their own algorithms.
             They (in general) expect to follow predefined recipes.
       
            cheema33 wrote 2 hours 57 min ago:
            A computer science degree in most US colleges takes about 4 years
            of work. Boot camps try to cram that into 6 months. All the while
            many students have other full-time jobs. This is simply not enough
            training for the students to start solving complex real world
            problem. Even 4 years is not enough.
            
            Many companies were willing to hire fresh college grads in the
            hopes that they could solve relatively easy problems for a few
            years, gain experience and become successful senior devs at some
            point.
            
            However, with the advent of AI dev tools, we are seeing very clear
            signs that junior dev hiring rates have fallen off a cliff. Our
            project manager, who has no dev experience, frequently assigns easy
            tasks/github issues to Github Copilot. Copilot generates a PR in a
            few minutes that other devs can review before merging. These PRs
            are far superior to what an average graduate of a code boot camp
            could ever create. Any need we had for a junior dev has completely
            disappeared.
       
              username223 wrote 58 min ago:
              > Any need we had for a junior dev has completely disappeared.
              
              Where do your senior devs come from?
       
                weakfish wrote 22 min ago:
                That's the question that has been stuck in my head as I read
                all these stories about junior dev jobs disappearing. I'm
                firmly mid-level, having started my career just before LLM
                coding took off. Sometimes it feels like I got on the last
                chopper out of Saigon.
       
        aragonite wrote 8 hours 31 min ago:
        Has anyone had success getting a coding agent to use an IDE's built-in
        refactoring tools via MCP especially for things like project-wide
        rename? Last time I looked into this the agents I tried just did regex
        find/replace across the repo, which feels both error-prone and wasteful
        of tokens. I haven't revisited recently so I'm curious what's possible
        now.
       
          olejorgenb wrote 5 hours 21 min ago:
          Serena MCP does this approach IIRC
       
          petesergeant wrote 6 hours 38 min ago:
          That's interesting, and I haven't, but as long as the IDE has an API
          for the refactoring action, giving an agent access to it as a tool
          should be pretty straightforward. Great idea.
       
        clayliu wrote 8 hours 41 min ago:
        “They’re still more like weird, overconfident interns.”
        Perfect summary. LLMs can emit code fast but they don’t really handle
        code like developers do — there’s no sense of spatial manipulation,
        no memory of where things live, no questions asked before moving stuff
        around. Until they can “copy-paste” both code and context with
        intent, they’ll stay great at producing snippets and terrible at
        collaborating.
       
          furyg3 wrote 8 hours 28 min ago:
          This is exactly how we describe them internally: the smartest interns
          in the world.  I think it's because the chat box way of interacting
          with them is also similar to how you would talk to someone who just
          joined a team.
          
          "Hey it wasn't what you asked me to do but I went ahead and
          refactored this whole area over here while simultaneously screwing up
          the business logic because I have no comprehension of how users use
          the tool".  "Um, ok but did you change the way notifications work
          like I asked". "Yes."  "Notifications don't work anymore". "I'll get
          right on it".
       
        senko wrote 8 hours 41 min ago:
        I'd argue LLM coding agents are still bad at many more things. But to
        comment on the two problems raised in the post:
        
        > LLMs don’t copy-paste (or cut and paste) code.
        
         The article is confusing the architectural layers of AI coding
         agents. It's easy to add "cut/copy/paste" tools to the AI system if
         that shows improvement. This has nothing to do with the LLM; it's in
         the layer on top.
        
        > Good human developers always pause to ask before making big changes
        or when they’re unsure [LLMs] keep trying to make it work until they
        hit a wall -- and then they just keep banging their head against it.
        
         Agreed - LLMs don't know how to backtrack. The recent (past year)
         improvements in thinking/reasoning do improve things in this regard
         (it's the whole "but wait..." RL training that exploded with OpenAI
         o1/o3 and DeepSeek R1, now done by everyone), but clearly there's
         still work to do.
       
          typpilol wrote 8 hours 26 min ago:
           Ask a model to show you the seahorse emoji and you'll get a storm
           of "but wait!"
       
        bad_username wrote 8 hours 44 min ago:
        LLMs are great at asking questions if you ask them to ask questions.
        Try it: "before writing the code, ask me about anything that is nuclear
        or ambiguous about the task".
       
          d1sxeyes wrote 8 hours 41 min ago:
          “If you think I’m asking you to split atoms, you’re probably
          wrong”.
       
        SafeDusk wrote 8 hours 45 min ago:
         @kixpanganiban Do you think it will work if, for refactoring tasks,
         we take away OpenAI's `apply_patch` tool and just provide `cut` and
         `paste` for the first few steps?
        
        I can run this experiment using ToolKami[0] framework if there is
        enough interest or if someone can give some insights.
        
        [0]:
        
   URI  [1]: https://github.com/aperoc/toolkami
       
        pammf wrote 8 hours 49 min ago:
        In Claude Code, it always shows the diff between current and proposed
        changes and I have to explicitly allow it to actually modify the code.
        Doesn’t that “fix” the copy-&-paste issue?
       
        nxpnsv wrote 8 hours 54 min ago:
        Codex has got me a few times lately, doing what I asked but certainly
        not what I intended:
        
         - Get rid of these warnings "...": captures and silences warnings
         instead of fixing them
         - Update this unit test to reflect the changes "...": changes the
         code so the outdated test works
         - The argument passed is now wrong: catches the exception instead of
         fixing the argument
        
         My advice is to prefer small changes and read everything it does
         before accepting anything; often this means using the agent is
         actually slower than just coding...
       
          d1sxeyes wrote 8 hours 25 min ago:
          You also have to be a bit careful:
          
          “Fix the issues causing these warnings”
          
           Retrospectively fixing a test to pass given the current code is a
           complex task; instead, you can ask it to write a test that tests
           the intended behaviour, without needing to infer it.
          
          “The argument passed is now wrong” - you’re asking the LLM to
          infer that there’s a problem somewhere else, and to find and fix
          it.
          
          When you’re asking an LLM to do something, you have to be very
          explicit about what you want it to do.
       
            nxpnsv wrote 1 hour 25 min ago:
            Exactly, I think the takeaway is that being careful when
            formulating a task is essential with LLMs. They make errors that
            wouldn’t be expected when asking the same from a person.
       
        cat-whisperer wrote 8 hours 54 min ago:
        The copy-paste thing is interesting because it hints at a deeper issue:
        LLMs don't have a concept of "identity" for code blocks—they just
        regenerate from learned patterns. I've noticed similar vibes when
        agents refactor—they'll confidently rewrite a chunk and introduce
        subtle bugs (formatting, whitespace, comments) that copy-paste would've
        preserved. The "no questions" problem feels more solvable with better
        prompting/tooling though, like explicitly rewarding clarification in
        RLHF.
       
          stellalo wrote 8 hours 39 min ago:
          I feel like it’s the opposite: the copy-paste issue is solvable,
          you just need to equip the model with the right tools and make sure
          they are trained on tasks where that’s unambiguously the right
           thing to do (for example, cases where copying code “by hand”
           would be extremely error prone -> leads to lower reward on
           average).
          
          On the other hand, teaching the model to be unsure and ask questions,
          requires the training loop to break and bring a human input in, which
          appears more difficult to scale.
       
            saghm wrote 8 hours 23 min ago:
            > On the other hand, teaching the model to be unsure and ask
            questions, requires the training loop to break and bring a human
            input in, which appears more difficult to scale.
            
            The ironic thing to me is that the one thing they never seem to be
            willing to skip asking about is whether they should proceed with
            some fix that I just helped them identify. They seem extremely
            reluctant to actually ask about things they don't know about, but
            extremely eager to ask about whether they should do the things they
            already have decided they think are right!
       
        juped wrote 9 hours 2 min ago:
        It's apparently lese-Copilot to suggest this these days, but you can
        find very good hypothesizing and problem solving if you talk
        conversationally to Claude or probably any of its friends that isn't
        the terminally personality-collapsed SlopGPT (with or without showing
        it code, or diagrams); it's actually what they're best at, and often
        they're even less likely than human interlocutors to just parrot some
        set phrase at you.
        
        It's only when you take the tech out of the area it's good at and start
        trying to get it to "write code" or even worse "be an agent" that it
        starts cracking up and emitting garbage; this is only done because
        companies want to forcememe some kind of product besides "chatbot",
        whether or not it makes sense. It's a shame because it'll happily and
        effectively write the docs that don't exist but you wish did for more
        or less anything. (Writing code examples for docs is not a weak point
        at all.)
       
        rossant wrote 9 hours 3 min ago:
         Recently, I asked Codex CLI to refactor some HTML files. It didn't
         literally copy and paste snippets here and there as I would have
         done myself; it rewrote them from memory, removing comments in the
         process. There was a section with 40 successive links with complex
         URLs.
        
        A few days later, just before deployment to production, I wanted to
        double check all 40 links. First one worked. Second one worked. Third
        one worked. Fourth one worked. So far so good. Then I tried the last
        four. Perfect.
        
        Just to be sure, I proceeded with the fifth one. 404. Huh. Weird. The
        domain was correct though and the URL seemed reasonable.
        
        I tried the other 31 links. ALL of them 404ed. I was totally confused.
        The domain was always correct. It seemed highly suspicious that all
        websites would have had moved internal URLs at the same time. I didn't
        even remember that this part of the code had gone through an LLM.
        
        Fortunately, I could retrieve the old URLs on old git commits. I
        checked the URLs carefully. The LLM had HALLUCINATED most of the path
        part of the URLs! Replacing things like
        domain.com/this-article-is-about-foobar-123456/ by
        domain.com/foobar-is-so-great-162543/...
        
        These kinds of very subtle and silently introduced mistakes are quite
        dangerous. Be careful out there!
       
          scottbez1 wrote 38 min ago:
          The last point I think is most important: "very subtle and silently
          introduced mistakes" -- LLMs may be able to complete many tasks as
          well (or better) than humans, but that doesn't mean they complete
          them the same way, and that's critically important when considering
          failure modes.
          
          In particular, code review is one layer of the conventional swiss
          cheese model of preventing bugs, but code review becomes much less
          effective when suddenly the categories of errors to look out for
          change.
          
          When I review a PR with large code moves, it was historically
          relatively safe to assume that a block of code was moved as-is (sadly
          only an assumption because GitHub still doesn't have indicators of
          duplicated/moved code like Phabricator had 10 years ago...), so I can
          focus my attention on higher level concerns, like does the new API
          design make sense? But if an LLM did the refactor, I need to
          scrutinize every character that was touched in the block of code that
          was "moved" because, as the parent commenter points out, that "moved"
          code may have actually been ingested, summarized, then rewritten from
          scratch based on that summary.
          
          For this reason, I'm a big advocate of an "AI use" section in PR
          description templates; not because I care whether you used AI or not,
          but because some hints about where or how you used it will help me
          focus my efforts when reviewing your change, and tune the categories
          of errors I look out for.
       
          FitchApps wrote 53 min ago:
           Reminds me of when I asked Claude (through Windsurf) to create an
           S3 Lambda trigger to resize images (as soon as a PNG image appears
           in S3, resize it). The code looked flawless and I deployed it...
           only to learn that I had introduced a perpetual loop :) For every
           image resized, a new one would be created and resized. In 5
           minutes, the trigger created hundreds of thousands of images...
           what a joy it was to clean that up in S3.
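           
           For anyone hitting the same thing: the usual guard is to write
           output under a separate prefix (or bucket) and skip keys that are
           already there. A minimal sketch of that guard, not the code
           Claude generated:
           
               import urllib.parse
               from io import BytesIO
               
               import boto3
               from PIL import Image
               
               s3 = boto3.client("s3")
               OUT_PREFIX = "resized/"  # hypothetical output prefix
               
               def handler(event, context):
                   for rec in event["Records"]:
                       bucket = rec["s3"]["bucket"]["name"]
                       key = urllib.parse.unquote_plus(
                           rec["s3"]["object"]["key"])
                       # Guard against the feedback loop: never touch
                       # objects this function wrote itself.
                       if key.startswith(OUT_PREFIX):
                           continue
                       body = s3.get_object(
                           Bucket=bucket, Key=key)["Body"].read()
                       img = Image.open(BytesIO(body))
                       img.thumbnail((512, 512))
                       out = BytesIO()
                       img.save(out, format="PNG")
                       s3.put_object(Bucket=bucket,
                                     Key=OUT_PREFIX + key,
                                     Body=out.getvalue())
           
           Scoping the S3 trigger itself to an input prefix like uploads/ is
           even safer, since a code-level guard can be lost in a later edit.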
       
          polynomial wrote 56 min ago:
          Evals don't fix this.
       
            HardCodedBias wrote 53 min ago:
            Maybe they don't fix it, but I suspect that they move us towards it
            occurring less often.
       
          dkarl wrote 1 hour 42 min ago:
          I've had similar experience both in coding and in non-coding research
          questions. An LLM will do the first N right and fake its work on the
          rest.
          
          It even happens when asking an LLM to reformat a document, or asking
          it to do extra research to validate information.
          
          For example, before a recent trip to another city, I asked Gemini to
          prepare a list of brewery taprooms with certain information, and I
          discovered it had included locations that had been closed for years
          or had just been pop-ups. I asked it to add a link to the current
          hours for each taproom and remove locations that it couldn't verify
          were currently open, and it did this for about the first half of the
          list. For the last half, it made irrelevant changes to the entries
          and didn't remove any of the closed locations. Of course it
          enthusiastically reported that it had checked every location on the
          list.
       
            Romario77 wrote 1 hour 15 min ago:
            LLMs are not good at "cycles" - when you have to go over a list and
            do the same action on each item.
            
            It's like it has ADHD and forgets or gets distracted in the middle.
            
            And the reason for that is that LLMs don't have memory and process
            the tokens, so as they keep going over the list the context becomes
            bigger with more irrelevant information and they can lose the
            reason they are doing what they are doing.
       
              fwip wrote 53 min ago:
              It would be nice if the tools we usually use for LLMs had a bit
              more programmability. In this example, It we could imagine being
              able to chunk up work by processing a few items, then reverting
              to a previous saved LLM checkpoint of state, and repeating until
              the list is complete.
              
              I imagine that the cost of saving & loading the current state
              must be prohibitively high for this to be a normal pattern,
              though.
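               
               Without real checkpointing, the closest approximation I know
               of is to restart every chunk from the same short prefix so
               the context never grows; a rough sketch with the OpenAI SDK
               (the prompt and chunk size are arbitrary):
               
                   from openai import OpenAI
                   
                   client = OpenAI()
                   SYSTEM = "Verify each item and report its status."
                   items = ["Taproom A", "Taproom B", "Taproom C"]
                   
                   def process(chunk):
                       # Each chunk starts from the same short prefix,
                       # so later items never inherit a bloated context.
                       r = client.chat.completions.create(
                           model="gpt-4o-mini",
                           messages=[
                               {"role": "system", "content": SYSTEM},
                               {"role": "user",
                                "content": "\n".join(chunk)},
                           ],
                       )
                       return r.choices[0].message.content
                   
                   results = [process(items[i:i + 3])
                              for i in range(0, len(items), 3)]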
       
                radarsat1 wrote 5 min ago:
                Agreed. You basically want an LLM to have a tool that writes
                its own agent to accomplish a repetitive task. I think this is
                doable.
       
              dmoy wrote 55 min ago:
              Which is annoying because that is precisely the kind of boring
              rote programming tasks I want an LLM to do for me, to free up my
              time for more interesting problems
       
              polynomial wrote 56 min ago:
              So much for Difference and Repetition.
       
          mehdibl wrote 2 hours 5 min ago:
           Errors are normal and happen often. You need to focus on providing
           it the ability to test the changes and fix errors.
           
           If you expect a one-shot result, you will get a lot of bad
           surprises.
       
          cpfohl wrote 2 hours 53 min ago:
           My custom prompt instructs GPT to output changes to code as a
           diff/git-patch. I don’t use agents because they make it hard to
           see what’s happening, and I don’t trust them yet.
       
            ravila4 wrote 2 hours 22 min ago:
            I’ve tried this approach when working in chat interfaces (as
            opposed to IDEs), but I often find it tricky to review diffs
            without the full context of the codebase.
            
             That said, your comment made me realize I could be using “git
             apply” more effectively to review LLM-generated changes
             directly in my repo. It’s actually a neat workflow!
       
          yodsanklai wrote 3 hours 43 min ago:
          5 minutes ago, I asked Claude to add some debug statements in my
          code. It also silently changed a regex in the code. It was easily
          caught with the diff but can be harder to spot in larger changes.
       
            jihadjihad wrote 2 hours 29 min ago:
            I had a pretty long regex in a file that was old and crusty, and
            when I had Claude add a couple helpers to the file, it changed the
            formatting of the regex to be a little easier on the eyes in terms
            of readability.
            
            But I just couldn't trust it. The diff would have been no help
            since it went from one long gnarly line to 5 tight lines. I kept
            the crusty version since at least I am certain it works.
       
            alzoid wrote 2 hours 41 min ago:
             I asked Claude to add a debug endpoint to my hardware device that
             just gave memory information. It wrote 2600 lines of C that gave
             information about every single aspect of the system. On the one
             hand, kind of cool: it looked at the MQTT code and the update
             code, the platform (ESP), and generated all kinds of code. It
             recommended platform settings that could enable more detailed
             information, which checked out when I looked at the docs. I ran
             it and it worked. On the other hand, most of the code was just
             duplicated over and over again, e.g. 3 different endpoints that
             gave overlapping information. About half of the code generated
             fake data rather than actually doing anything with the system.
            
            I rolled back and re-prompted and got something that looked good
            and worked. The LLMs are magic when they work well but they can
            throw a wrench into your system that will cost you more if you
            don't catch it.
            
             I also just had a 'senior' developer tell me that a feature in one
             of our platforms was deprecated. This was after I saw their code,
             which did some wonky, hacky-looking stuff to achieve something
             simple. I checked the docs and said feature (URL Rewriting) was
             obviously not deprecated. When I asked how they knew it was
             deprecated, they said ChatGPT told them. So now they are fixing
             the fix ChatGPT provided.
       
              troupo wrote 1 hour 31 min ago:
              > About half of the code generated fake data rather than actually
              do anything with the system.
              
              All the time
              
                  // fake data. in production this would be real data
                  ... proceeds to write sometimes hundreds of lines
                  of code to provide fake data
       
                stuartjohnson12 wrote 59 min ago:
                "hey claude, please remove the fake data and use the real data"
                
                "sure thing, I'll add logic to check if the real data exists
                and only use the fake data as a fallback in case the real data
                doesn't exist"
       
                  weakfish wrote 33 min ago:
                  This comment captures exactly what aggravates me about CC /
                  other agents in a way that I wasn't sure how to express
                  before. Thanks!
       
                  alzoid wrote 44 min ago:
                  I will also add checks to make sure the data that I get is
                  there even though I checked 8 times already and provide loads
                  of logging statements and error handling.  Then I will go to
                  every client that calls this API and add the same checks and
                  error handling with the same messaging. Oh also with all
                  those checks I'm just going to swallow the error at the entry
                  point so you don't even know it happened at runtime unless
                  you check the logs.  That will be $1.25 please.
       
          smougel wrote 3 hours 45 min ago:
           Not related to code, but when I use an LLM to perform a kind of
           copy/paste, I try to number the lines and ask it to generate a
           start_index and stop_index to perform the slice operation. Far
           fewer hallucinations, and very cheap in token generation.
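           
           Roughly, a minimal sketch of that idea in Python (the function
           names and the 0-based, end-exclusive index convention here are just
           my own assumptions, not anything the model requires):
           
               def number_lines(text):
                   # Prefix every line with its index so the model can refer
                   # to positions instead of reproducing the text itself.
                   lines = text.splitlines()
                   numbered = [f"{i}: {line}" for i, line in enumerate(lines)]
                   return "\n".join(numbered)
               
               def extract_slice(text, start_index, stop_index):
                   # Apply the model's answer locally; the copied text never
                   # passes through the model, so it cannot be altered.
                   lines = text.splitlines()
                   return "\n".join(lines[start_index:stop_index])
           
           The model only has to emit two small integers; the actual bytes are
           moved by ordinary code.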
       
          grafmax wrote 3 hours 53 min ago:
          Yeah this sort of thing is a huge time waster with LLMs.
       
          weinzierl wrote 4 hours 56 min ago:
           Not code, but I once pasted an event announcement and asked for
           just a spelling and grammar check. The LLM suggested a new version
           with minor tweaks, which I copy-pasted back.
           
           Just before sending, I noticed that it had moved the event date by
           one day. Luckily I caught it, but it taught me that you should
           never blindly trust LLM output, even with a super simple task, no
           relevant context size, and a clear, simple one-sentence prompt.
           
           LLMs do the most amazing things, but they also sometimes screw up
           the simplest of tasks in the most unexpected ways.
       
            nonethewiser wrote 1 hour 46 min ago:
             >Not code, but I once pasted an event announcement and asked for
             just a spelling and grammar check. The LLM suggested a new
             version with minor tweaks, which I copy-pasted back. Just before
             sending, I noticed that it had moved the event date by one day.
            
             This is the kind of thing I immediately noticed about LLMs when I
             used them for the first time. Just anecdotally, I'd say it had
             this problem 30-40% of the time. As time has gone on, it has
             gotten so much better. But it still causes this kind of problem,
             let's just say, 5% of the time.
             
             The thing is, it's almost more dangerous when it makes the
             mistake only rarely, because now people aren't constantly looking
             for it.
             
             You have no idea whether it's randomly flipping terms or
             injecting garbage unless you actually validate it. The idea of
             giving it an email to improve and then just scanning the result
             before firing it off is terrifying to me.
       
            flowingfocus wrote 2 hours 29 min ago:
             A diff makes these kinds of errors much easier to catch.
             
             Or maybe someone from XEROX has a better idea of how to catch
             subtly altered numbers?
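             
             For instance, a minimal sketch of that kind of check with
             Python's difflib (the function name and wiring are just an
             illustration on my part):
             
                 import difflib
                 
                 def show_changes(original, revised):
                     # A unified diff makes a silently changed date or number
                     # show up as a -/+ pair instead of hiding inside a
                     # rewritten paragraph.
                     diff = difflib.unified_diff(
                         original.splitlines(), revised.splitlines(),
                         fromfile="original", tofile="llm_output", lineterm="")
                     print("\n".join(diff))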
       
              mcpeepants wrote 2 hours 27 min ago:
              I verify all dates manually by memorizing their offset from the
              date of the signing of the Magna Carta
       
          Xss3 wrote 6 hours 28 min ago:
          This is a horror story about bad quality control practices, not the
          use of LLMs.
       
            __MatrixMan__ wrote 3 hours 17 min ago:
            I have a project that I've leaned heavily on LLM help for which I
            consider to embody good quality control practices. I had to get
            pretty creative to pull it off: spent a lot of time working on this
            sync system so that I can import sanitized production data into the
            project for every table it touches (there are maybe 500 of these)
            and then there's a bunch of hackery related to ensuring I can still
            get good test coverage even when some of these flows are partially
            specified (since adding new ones proceeds in several separate
            steps).
            
            If it was a project written by humans I'd say they were crazy for
            going so hard on testing.
            
            The quality control practices you need for safely letting an LLM
            run amok aren't just good. They're extreme.
       
          amelius wrote 7 hours 6 min ago:
           In these cases I explicitly tell the LLM to make as few changes as
           possible, and I also run a diff. Then I reiterate with a new prompt
           if too many things changed.
       
            globular-toast wrote 6 hours 41 min ago:
            You can always run a diff. But how good are people at reading
            diffs? Not very. It's the kind of thing you would probably want a
            computer to do. But now we've got the computer generating the diffs
            (which it's bad at) and humans verifying them (which they're also
            bad at).
       
              CaptainOfCoit wrote 5 hours 3 min ago:
               Yeah, pick one for you to do and the other for the LLM to do.
               Ideally pick the one you're better at; otherwise it's 50/50
               whether you'll actually become faster.
       
          coldtea wrote 8 hours 27 min ago:
          >A few days later, just before deployment to production, I wanted to
          double check all 40 links.
          
          This was allowed to go to master without "git diff" after Codex was
          done?
       
            raffael_de wrote 7 hours 31 min ago:
            This and why are the URLs hardcoded to begin with? And given the
            chaotic rewrite by Codex it would probably be more work to untangle
            the diff than just do it yourself right away.
       
            rossant wrote 8 hours 23 min ago:
             It was a fairly big refactoring, basically converting a working
             static HTML landing page into a Hugo website, splitting the HTML
             into multiple Hugo templates. I admit I was quite in a hurry and
             had to take shortcuts. I didn't have time to write automated
             tests and had to rely on manual tests for this single webpage.
             The diff was fairly big. It just didn't occur to me that the URLs
             would go through the LLM and could be affected! Lesson learned,
             haha.
       
              indigodaddy wrote 32 min ago:
              This is why my instinct for this sort of task is, "write a script
              that I can use to do x y z," instead of "do x y z"
       
              cimi_ wrote 7 hours 19 min ago:
              Speaking of agents and tests, here's a fun one I had the other
              day: while refactoring a large code base I told the agent to do
              something precise to a specific module, refactor with the new
              change, then ensure the tests are passing.
              
               The test suite is slow and has many moving parts; the tests I
               asked it to run take ~5 minutes. The thing decided to kill the
               test run, then made up another command it claimed was the
               'tests', so when I looked at the collapsed agent console in the
               IDE everything seemed fine, i.e. 'Tests ran successfully'.
              
              Obviously the code changes also had a subtle bug that I only saw
              when pushing its refactoring to CI (and more waiting). At least
              there were tests to catch the problem.
       
                rossant wrote 3 hours 26 min ago:
                So it took a shortcut as it was too lazy and it lied to your
                face about it. AGI is here for good.
       
                tuesdaynight wrote 3 hours 40 min ago:
                 I think it's something that model providers don't want to
                 fix. The number of times Claude Code just decided to delete
                 tests that were not passing, before I added a memory saying
                 it needs to ask for my permission to do that, was staggering.
                 It stopped happening after the memory, so I believe it could
                 easily be fixed by a system prompt.
       
                  Ezhik wrote 3 hours 19 min ago:
                  Your Claude Code actually respects CLAUDE.md?
       
              exe34 wrote 8 hours 2 min ago:
              this is why I'm terrified of large LLM slop changesets that I
              can't check side by side - but then that means I end up doing
              many small changes that are harder to describe in words than to
              just outright do.
       
          hshdhdhehd wrote 8 hours 29 min ago:
          Well using an LLM is like rolling dice. Logits are probabilities. It
          is a bullshit machine.
       
            dude250711 wrote 6 hours 47 min ago:
            Yeah, it read like "when running with scissors be careful out
            there". How about not running with scissors at all?
            
            Unless of course the management says "from now on you will be
            running with scissors and your performance will increase as a
            result".
       
              hansmayer wrote 2 hours 45 min ago:
              And if you stab yourself in the stomach ... you must have sucked
              at running with the scissors :)
       
          worldsayshi wrote 8 hours 32 min ago:
           This is of course bad, but humans also make (different) mistakes
           all the time. We could account for the risk of mistakes being
           introduced and build more tools that validate things for us. In a
           way, LLMs encourage us to do this by adding other vectors of chaos
           into our work.
           
           Like, why not have tools built into our environment that check that
           links are not broken? With the right architecture we could have
           validations for most common mistakes without the solution adding a
           bunch of tedious overhead.
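           
           For example, a minimal sketch of such a check (standard library
           only; the function name and how it gets wired into CI are
           assumptions on my part):
           
               import urllib.error, urllib.request
               
               def broken_links(urls):
                   # Return the links that do not resolve, so a silently
                   # rewritten URL fails a pre-deploy check instead of
                   # reaching users.
                   bad = []
                   for url in urls:
                       try:
                           urllib.request.urlopen(url, timeout=10)
                       except (urllib.error.URLError, ValueError) as exc:
                           bad.append((url, exc))
                   return bad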
       
            lenkite wrote 6 hours 37 min ago:
             In the kind of situation described above, a meticulous coder
             actually makes no mistakes. They will, however, make a LOT more
             mistakes if they use LLMs to do the same.
             
             I have already had to correct a LOT of crap similar to the above
             in refactoring-done-via-LLM over the last year.
             
             When stuff like this was done by a plain, slow, organic human, it
             was far more accurate, and many times completely accurate, with
             no defects. Simply because many developers pay close attention
             when they are forced to do the manual labour themselves.
             
             Sure, the refactoring commit is produced faster with LLM
             assistance, but repeatedly reviewing code and pointing out weird
             defects is very stressful.
       
              worldsayshi wrote 3 hours 13 min ago:
               I think it goes without saying that we need to be sceptical
               about when to use and not use an LLM. The point I'm trying to
               make is more that we should have more validations, not that we
               should be less sceptical about LLMs.
               
               Meticulousness shouldn't be an excuse not to have layers of
               validation, which don't have to cost that much if done well.
       
              thunky wrote 3 hours 24 min ago:
              > I have already had to correct a LOT of crap similar to the
              above in refactoring-done-via-LLM over the last year
              
              The person using the LLM should be reviewing their code before
              submitting it to you for review.  If you can catch a copy paste
              error like this, then so should they.
              
              The failure you're describing is that your coworkers are not
              doing their job.
              
              And if you accept "the LLM did that, not me" as an excuse then
              the failure is on you and it will keep happening.
       
              mr_mitm wrote 5 hours 5 min ago:
               A meticulous coder probably wouldn't have typed out 40 URLs
               just because they wanted to move them from one file to another.
               They would copy-paste them or run some sed-like commands. You
               can instruct an LLM agent to do something similar: for
               modifying a lot of files or a lot of lines, I instruct them to
               write a script that does what I need instead of telling them to
               do it themselves.
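               
               As a rough illustration, the kind of throwaway script meant
               here might look like this (the paths, regex, and output format
               are all assumptions made up for the example):
               
                   import re
                   
                   URL_RE = re.compile(r"https?://\S+")
                   
                   def move_urls(src_path, dst_path):
                       # Copy the URLs byte-for-byte; the script decides
                       # where they go, but never retypes what they are.
                       with open(src_path) as src:
                           urls = URL_RE.findall(src.read())
                       with open(dst_path, "a") as dst:
                           dst.writelines(url + "\n" for url in urls)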
       
            cimi_ wrote 7 hours 30 min ago:
             Your point not to rely on good intentions and to have systems in
             place to ensure quality is a good one - but your comparison to
             humans didn't sit well with me.
             
             Very few humans fill in their task with made-up crap and then lie
             about it - I haven't met any in person. And if I did, I wouldn't
             want to work with them, even if they worked 24/7.
            
            Obligatory disclaimer for future employers: I believe in AI, I use
            it, yada yada. The reason I'm commenting here is I don't believe we
            should normalise this standard of quality for production work.
       
            exe34 wrote 8 hours 1 min ago:
            > that checks that links are not broken?
            
            Can you spot the next problem introduced by this?
       
            rossant wrote 8 hours 27 min ago:
             I agree, these kinds of stories should encourage us to set up
             more robust testing/backup/check strategies - like you would
             absolutely have to do if you suddenly invited a bunch of
             inexperienced interns to edit your production code.
       
            rullelito wrote 8 hours 28 min ago:
            LLMs are turning into LLMs+hard-coded fixes for every imaginable
            problem.
       
          ivape wrote 8 hours 47 min ago:
           You’re just not using LLMs enough. You can never trust an LLM to
           generate a URL, and this was known over two years ago. It takes one
           token hallucination to fuck up a URL.
           
           It’s very good at giving a fuzzy great answer, not a precise one.
           You have to really use this thing all the time and pick up on stuff
           like that.
       
            fwip wrote 11 min ago:
            I think part of the issue is that it doesn't "feel" like the LLM is
            generating a URL, because that's not what a human would be doing. A
            human would be cut & pasting the URLs, or editing the code around
            them - not retyping them from scratch.
            
            Edit: I think I'm just regurgitating the article here.
       
            jollyllama wrote 3 hours 39 min ago:
            > You’re just not using LLMs enough.
            
            > You can never trust the LLM to generate a url
            
            This is very poorly worded. Using LLMs more wouldn't solve the
            problem. What you're really saying is that the GP is uninformed
            about LLMs.
            
            This may seem like pedantry on my part but I'm sick of hearing
            "you're doing it wrong" when the real answer is "this tool can't do
            that." The former is categorically different than the latter.
       
              IanCal wrote 3 hours 17 min ago:
              It's pretty clearly worded to me, they don't use LLMs enough to
              know how to use them successfully. If you use them regularly you
              wouldn't see a set of urls without thinking "Unless these are
              extremely obvious links to major sites, I will assume each is
              definitely wrong".
              
              >  I'm sick of hearing "you're doing it wrong"
              
              That's not what they said. They didn't say to use LLMs more for
              this problem. The only people that should take the wrong meaning
              from this are ones who didn't read past the first sentence.
              
              >  when the real answer is "this tool can't do that."
              
              That is what they said.
       
                jollyllama wrote 2 hours 52 min ago:
                > If you use them regularly you wouldn't see a set of urls
                without thinking...
                
                Sure, but conceivably, you could also be informed of this
                second hand, through any publication about LLMs, so it is very
                odd to say "you don't use them enough" rather than "you're
                ignorant" or "you're uninformed". It is very similar to these
                very bizarre AI-maximalist positions that so many of us are
                tired of seeing.
       
                  IanCal wrote 2 hours 43 min ago:
                  This isn't ai maximalist though, it's explicitly pointing out
                  something that regularly does not work!
                  
                  > Sure, but conceivably, you could also be informed of this
                  second hand, through any publication about LLMs, so it is
                  very odd to say "you don't use them enough" rather than
                  "you're ignorant" or "you're uninformed".
                  
                  But this is to someone who is actively using them, and the
                  suggestion of "if you were using them more actively you'd
                  know this, this is a very common issue" is not at all weird.
                  There are other ways they could have known this, but they
                  didn't.
                  
                  "You haven't got the experience yet" is a much milder way of
                  saying someone doesn't know how to use a tool properly than
                  "you're ignorant".
       
            hansmayer wrote 8 hours 13 min ago:
               Yeah so, the reason people use various tools and machines in
               the first place is to simplify work or everyday tasks by: 1)
               making the tasks execute faster, 2) getting more reliable
               outputs than doing it yourself, 3) making it repeatable. The
               LLMs obviously don't check any of these boxes, so why don't we
               stop pretending that we as users are stupid and don't know how
               to use them, and start taking them for what they are - cute
               little mirages, perhaps applicable as toys of some sort, but
               not something we should use for serious engineering work,
               really?
       
              IanCal wrote 3 hours 20 min ago:
              They easily check a bunch of those boxes.
              
              > why don´t we stop pretending that we as users are stupid and
              don´t know how to use them
              
              This is in response to someone who saw a bunch of URLs coming out
              of it and was surprised at a bunch of them being wrong. That's
              using the tool wrong. It's like being surprised that the top
              results in google/app store/play store aren't necessarily the
              best match for your query but actually adverts!
       
                hitarpetar wrote 22 min ago:
                it's amazing that you picked another dark pattern as your
                comparison
       
                mbesto wrote 32 min ago:
                > This is in response to someone who saw a bunch of URLs coming
                out of it and was surprised at a bunch of them being wrong.
                That's using the tool wrong. It's like being surprised that the
                top results in google/app store/play store aren't necessarily
                the best match for your query but actually adverts!
                
                The CEO of Anthropic said I can fire all of my developers soon.
                How could one possibly be using the tool wrong? /s
       
                  IanCal wrote 8 min ago:
                  If you base all your tech workings on the promises of CEOs
                  you'll fail badly, you should not be surprised by this.
       
                hansmayer wrote 2 hours 51 min ago:
                 The URLs being wrong in that specific case is one where they
                 were using the "wrong tool". I can name you at least a dozen
                 other cases from my own experience where they also appear to
                 be the wrong tool, for example working with Terraform, or not
                 exposing secrets by hardcoding them in the frontend. Et
                 cetera. Many other people could contribute thousands of
                 similar but different cases. So what good are these tools
                 really for? Are we all really that stupid? Many of us
                 mastered the hard problem of navigating the various
                 abstraction layers of computers over the years, only to be
                 told we now effing don't know how to write a few sentences in
                 English? Come on. I'd be happy to use them in whatever
                 specific domain they supposedly excel at. But no one seems to
                 be able to identify one for sure. The problem is, the folks
                 pushing, or better said shoving, these bullshit generators
                 down our throats are trying to sell us the promise of an
                 "everything oracle". What did old man Altman tell us about
                 ChatGPT 5? A PhD-level tool for code generation or some
                 similar nonsense? But it turns out it only gets one metric
                 right each time: generating a lot of text. So, essentially,
                 great for bullshit jobs (I count some IT jobs as such too),
                 but not much more.
       
                  IanCal wrote 2 hours 18 min ago:
                   > Many of us mastered the hard problem of navigating the
                   various abstraction layers of computers over the years,
                   only to be told we now effing don't know how to write a few
                   sentences in English? Come on.
                   
                   If you're trying to one-shot stuff with a few sentences
                   then yes, you might be using these things wrong. I've seen
                   people with PhDs fail to use Google successfully to find
                   things; were they idiots? If you're using them wrong,
                   you're using them wrong - I don't care how smart you are in
                   other areas. If you can't hand off work knowing someone's
                   capabilities, then that's a thing you can't do - and that's
                   OK. I've known unbelievably good engineers who couldn't
                   form a solid plan to solve a business problem or work
                   collaboratively to get something done to save their life.
                   Those are different skills. But gpt-5-codex and Sonnet
                   4 / 4.5 can solidly write code, gpt-5-pro with web search
                   can really dig into things, and if you can manage what they
                   can do, you can hand off work to them. If you've only ever
                   worked with juniors with a feeling of "they slow everything
                   down but maybe someday they'll be as useful as me", then
                   you're less likely to succeed at this.
                  
                  Let's do a quick overview of recent chats for me:
                  
                  * Identifying and validating a race condition in some code
                  
                  * Generating several approaches to a streaming issue,
                  providing cost analyses of external services and complexity
                  of 3 different approaches about how much they'd change the
                  code
                  
                  * Identifying an async bug two good engineers couldn't find
                  in a codebase they knew well
                  
                  * Finding performance issues that had gone unnoticed
                  
                  * Digging through synapse documentation and github issues to
                  find a specific performance related issue
                  
                  * Finding the right MSC for a feature I wanted to use but
                  didn't know existed - and then finding the github issue that
                  explained how it was only half implemented and how to enable
                  the experimental other part I needed
                  
                  * Building a bunch of UI stuff for a short term contract I
                  needed, saving me a bunch of hours and the client money
                  
                  * Going through funding opportunities and matching them
                  against a charity I want to help in my local area
                  
                  * Building a search integration for my local library to
                  handle my kids reading challenge
                  
                  * Solving a series of VPN issues I didn't understand
                  
                  * Writing a lot of astro related python for an art project to
                  cover the loss of some NASA images I used to have access to.
                  
                  >  the folks pushing or better said
                  
                  If you don't want to trust them, don't. Also don't believe
                  the anti-hype merchants who want to smugly say these tools
                  can't do a god damn thing. They're trying to get attention as
                  well.
       
                    hansmayer wrote 1 hour 56 min ago:
                     Again mate, stop making arrogant assumptions and read
                     some of my previous comments. My team and I are early
                     adopters, for about two years now. I am even paying for a
                     premium-level service. Trust me, it sucks and
                     under-delivers. But good for you and the others who claim
                     they are productive with it - I am sure we will see those
                     10x apps rolling in soon, right? It's only been like 4
                     years since the revolutionary magic machine was
                     announced.
       
                      IanCal wrote 1 hour 31 min ago:
                      I read your comments. Did you read mine? You can pass
                      them into chatgpt or claude or whatever premium services
                      you pay for to summarise them for you if you want.
                      
                      > Trust me, it sucks
                      
                      Ok. I'm convinced.
                      
                      > and under-delivers.
                      
                      Compared to what promise?
                      
                      > I am sure we will see those 10x apps rolling in soon,
                      right?
                      
                      Did I argue that? If you want to look at some massive
                      improvements, I was able to put up UIs to share results &
                      explore them with a client within minutes rather than it
                      taking me a few hours (which from experience it would
                      have done).
                      
                      > It's only been like 4 years since the revolutionary
                      magic machine was announced.
                      
                       It's been less than 3 years since ChatGPT launched,
                       which, if you'd been in the AI sphere as long as I have
                       (my god, it's 20 years now), absolutely was
                       revolutionary. Over the last 4 years we've gone from
                       GPT-3 solving a bunch of NLP problems immediately, as
                       long as you didn't care about cost, to gpt-5-pro with
                       web search and Codex/Sonnet being able to explore a
                       moderately sized codebase and make real and actual
                       changes (running tests and following up with changes).
                       Given how long I spent stopping a robot from hitting
                       the table because it shifted a bit and its background
                       segmentation messed up, or fiddling with classifiers
                       for text, the idea that I can get a summary from input
                       without training is already impressive, and then to be
                       able to say "make it less wanky" and have it remove the
                       corp-speak is a huge shift in the field.
                      
                      If your measure of success is "the CEOs of the biggest
                      tech orgs say it'll do this soon and I found a problem"
                      then you'll be permanently disappointed. It'd be like me
                      sitting here saying mobile phones are useless because I
                      was told how revolutionary the new chip in an iphone was
                      in a keynote.
                      
                      Since you don't seem to want to read most of this, most
                      isn't for you. The last bit is, and it's just one
                      question:
                      
                      Why are you paying for something that solves literally no
                      problems for you?
       
            grey-area wrote 8 hours 42 min ago:
            Or just not bother. It sounds pretty useless if it flunks on basic
            tasks like this.
            
            Perhaps you’ve been sold a lie?
       
              seanw265 wrote 1 hour 19 min ago:
              I suspect you haven't tried a modern mid-to-large-LLM & Agent
              pair for writing code. They're quite capable, even if not suited
              for all tasks.
       
              IanCal wrote 3 hours 12 min ago:
              They're moderately unreliable text copying machines if you need
              exact copying of long arbitrary strings. If that's what you want,
              don't use LLMs. I don't think they were ever really sold as that,
              and we have better tools for that.
              
              On the other hand, I've had them easily build useful code, answer
              questions and debug issues complex enough to escape good
              engineers for at least several hours.
              
              Depends what you want. They're also bad (for computers) at
              complex arithmetic off the bat, but then again we have
              calculators.
       
                goalieca wrote 2 hours 15 min ago:
                > I don't think they were ever really sold as that, and we have
                better tools for that.
                
                 We have OpenAI describing GPT-5 as having PhD-level
                 intelligence, and others like Anthropic saying it will write
                 all our code within months. Some are claiming it's already
                 writing 70%.
                 
                 I say they are being sold as a magical do-everything tool.
       
                  IanCal wrote 36 min ago:
                  Intelligence isn't the same as "can exactly replicate text".
                  I'm hopefully smarter than a calculator but it's more
                  reliable at maths than me.
                  
                  Also there's a huge gulf between "some people claim it can do
                  X" and "it's useful". Altman promising something new doesn't
                  decrease the usefulness of a model.
       
                    hitarpetar wrote 20 min ago:
                    saddest goalpost ever
       
                    mbesto wrote 26 min ago:
                    What you are describing is "dead reasoning zones".[0]
                    
                        "This isn't how humans work. Einstein never saw ARC
                    grids, but he'd solve them instantly. Not because of prior
                    knowledge, but because humans have consistent reasoning
                    that transfers across domains. A logical economist becomes
                    a logical programmer when they learn to code. They don't
                    suddenly forget how to be consistent or deduce.
                    
                        But LLMs have "dead reasoning zones" — areas in their
                    weights where logic doesn't work. Humans have dead
                    knowledge zones (things we don't know), but not dead
                    reasoning zones. Asking questions outside the training
                    distribution is almost like an adversarial attack on the
                    model."
                    
   URI              [1]: https://jeremyberman.substack.com/p/how-i-got-the-...
       
                  buildbot wrote 1 hour 2 min ago:
                     Would you hire a PhD to copy URLs by hand? Would them
                     having a PhD make it less likely they'd make a mistake
                     than a high school student doing the same?
       
                    hitarpetar wrote 5 min ago:
                    I would not hire anyone for a role that requires computer
                    use who does not know how to use copy/paste
       
                    parineum wrote 34 min ago:
                    A high school student would use copy/paste and the urls
                    would be perfect duplicates..
       
                      IanCal wrote 7 min ago:
                      > A high school student would use copy/paste and the urls
                      would be perfect duplicates..
                      
                      Did the LLM have this?
       
                    goalieca wrote 34 min ago:
                    Grad students and even post docs often do a lot of this
                    manual labour for data entry and formatting. Been there,
                    done that.
       
                      IanCal wrote 6 min ago:
                      Manual data entry has lots of errors. All good workflows
                      around this base themselves on this fact.
       
              ivape wrote 8 hours 37 min ago:
               Well, you see, it hallucinates on long precise strings, but if
               we ignore that and focus on what it’s powerful at, we can do
               something powerful. In this case, by the time it gets to
               outputting the URL, it has already determined the correct
               intent or next action (print out a URL). You use this intent to
               do a tool call to generate the URL. Small aside: its ability to
               figure out the what and why is pure magic, for those still
               peddling the glorified-autocomplete narrative.
               
               You have to be able to see what this thing can actually do, as
               opposed to what it can’t.
       
                sebtron wrote 8 hours 4 min ago:
                > Well, you see it hallucinates on long precise strings
                
                But all code is "long precise strings".
       
                  ogogmad wrote 5 hours 12 min ago:
                  He obviously means random unstructured strings, which code is
                  usually not.
       
            doikor wrote 8 hours 45 min ago:
            I would generalise it to you can’t trust LLMs to generate any
            kind of unique identifier. Sooner or later it will hallucinate a
            fake one.
       
              wat10000 wrote 1 hour 6 min ago:
              I would generalize it further: you can't trust LLMs.
              
              They're useful, but you must verify anything you get from them.
       
        sidgtm wrote 9 hours 4 min ago:
         As a UX designer, I see that they lack the ability to be opinionated
         about a design piece and just go with the standard mental model. I
         got fed up with this and wrote some simple JavaScript that runs a
         canvas on localhost so I can pass on more subjective feedback using a
         highlights-and-notes feature. I tried using Playwright first, but (a)
         it's token-heavy and (b) it's still for finding what's working or
         breaking, rather than thinking deeply about the design.
       
          seunosewa wrote 8 hours 15 min ago:
             What do the notes look like?
       
            sidgtm wrote 7 hours 39 min ago:
             Specific inputs, e.g. a move or a color change, or giving
             specific input on the interaction piece.
       
        freetonik wrote 9 hours 5 min ago:
        I see a pattern in these discussions all the time: some people say how
        very, very good LLMs are, and others say how LLMs fail miserably;
        almost always the first group presents examples of simple CRUD apps,
        frontend "represent data using some JS-framework" kind of tasks, while
        the second group presents examples of non-trivial refactoring, stuff
        like parsers (in this thread), algorithms that can't be found in
        leetcode, etc.
        
        Tech twitter keeps showing "one-shotting full-stack apps" or "games",
        and it's always something extremely banal. It's impressive that a
        computer can do it on its own, don't get me wrong, but it was trivial
        to programmers, and now it is commoditized.
       
          regularfry wrote 7 hours 45 min ago:
          The function of technological progress, looked at through one lens,
          is to commoditise what was previously bespoke.    LLMs have expanded
          the set of repeatable things.  What we're seeing is people on the one
          hand saying "there's huge value in reducing the cost of producing
          rote assets", and on the other "there is no value in trying to apply
          these tools to tasks that aren't repeatable".
          
          Both are right.
       
          NitpickLawyer wrote 8 hours 47 min ago:
          > almost always the first group presents examples of simple CRUD apps
          
          How about a full programming language written by cc "in a loop" in ~3
          months? With a compiler and stuff? [1] It might be a meme project,
          but it's still impressive as hell we're here.
          
           I learned about this from a YouTube content creator who took that
           repo, asked cc to "make it so that variables can be emojis", and cc
           did that $5 later. Pretty cool.
          
   URI    [1]: https://cursed-lang.org/
       
            Gazoche wrote 2 hours 19 min ago:
            > written by cc "in a loop" in ~3 months?
            
            What does that mean exactly? I assume the LLM was not left alone
            with its task for 3 months without human supervision.
       
              NitpickLawyer wrote 2 hours 11 min ago:
              From the FAQ:
              
              > the following prompt was issued into a coding agent:
              
              > Hey, can you make me a programming language like Golang but all
              the lexical keywords are swapped so they're Gen Z slang?
              
              > and then the coding agent was left running AFK for months in a
              bash loop
       
                sarchertech wrote 11 min ago:
                I don’t buy it at all. Not even Anthropic or Open AI have
                come anywhere close to something like this.
                
                 Running for 3 months and generating a working project this
                 large with no human intervention is so far outside the
                 capabilities of any agent/LLM system demonstrated by anyone
                 else that the most likely explanation is that the promoter is
                 lying about it running on its own for 3 months.
                
                I looked through the videos listed as “facts” to support
                the claims and I don’t see anything longer than a few hours.
       
            freetonik wrote 8 hours 18 min ago:
            Ok, not trivial for sure, but not novel? IIUC, the language does
            not have really new concepts, apart from the keywords (which is
            trivial).
            
            Impressive nonetheless.
       
              NitpickLawyer wrote 8 hours 2 min ago:
              Novel as in never done before? Of course not.
              
              Novel as in "an LLM can maintain coherence on a 100k+ LoC project
              written in zig"? Yeah, that's absolutely novel in this space.
              This wasn't possible 1 year ago. And this was fantasy 2.5 years
              ago when chatgpt launched.
              
              Also impressive in that cc "drove" this from a simple prompt.
              Also impressive that cc can do stuff in this 1M+ (lots of js in
              the extensions folders?) repo. Lots of people claim LLMs are
              useless in high LoC repos. The fact that cc could navigate a
              "new" language and make "variables as emojis" work is again novel
              (i.e. couldn't be done 1 year ago) and impressive.
       
                freetonik wrote 7 hours 3 min ago:
                >Novel as in "an LLM can maintain coherence on a 100k+ LoC
                project written in zig"? Yeah, that's absolutely novel in this
                space.
                
                Absolutely. I do not underestimate this.
       
          quietbritishjim wrote 8 hours 58 min ago:
          Yesterday, I got Claude Code to make a script that tried out
          different point clustering algorithms and visualise them. It made the
          odd mistake, which it then corrected with help, but broadly speaking
           it was amazing. It would've taken me at least a week to write by
           hand, maybe longer. It was writing the algorithms itself,
           definitely not just simple CRUD stuff.
       
            dncornholio wrote 1 hour 45 min ago:
             That's actually a very specific domain, one that is well
             documented and researched, and in which LLMs will always do well.
             Shit will hit the fan quickly when you move on to integration
             work, where there isn't a specific, well-defined problem domain.
       
              fwip wrote 7 min ago:
              Yep - visualizing clustering algorithms is just the "CRUD app" of
              a different speciality.
              
              One rule of thumb I use, is if you could expect to find a student
              on a college campus to do a task for you, an LLM will probably be
              able to do a decent job. My thinking is because we have a lot of
              teaching resources available for how to do that task, which the
              training has of course ingested.
       
            an0malous wrote 2 hours 39 min ago:
            Let’s see the diff
       
            piva00 wrote 8 hours 36 min ago:
            In my experience it's been great to have LLMs for narrowly-scoped
            tasks, things I know how I'd implement (or at least start
            implementing) but that would be tedious to manually do, prompting
            it with increasingly higher complexity does work better than I
            expected for these narrow tasks.
            
            Whenever I've attempted to actually do the whole "agentic coding"
            by giving it a complex task, breaking it down in sub-tasks, loading
            up context, reworking the plan file when something goes awry,
            trying again, etc. it hasn't a single fucking time done the thing
            it was supposed to do to completion, requiring a lot of manual
            reviewing, backtracking, nudging, it becomes more exhausting than
            just doing most of the work myself, and pushing the LLM to do the
            tedious work.
            
            It does work sometimes to use for analysis, and asking it to
            suggest changes with the reasoning but not implement them, since
            most times when I let it try to implement its broad suggestions it
            went haywire, requiring me to pull back, and restart.
            
            There's a fine line to walk, and I only see comments on the
            extremes online, it's either "I let 80 agents running and they
            build my whole company's code" or "they fail miserably on every
            task harder than a CRUD". I tend to not believe in either extreme,
            at least not for the kinds of projects I work on which require more
            context than I could ever fit properly beforehand to these robots.
       
            freetonik wrote 8 hours 53 min ago:
            I also got good results for “above CRUD” stuff occasionally.
            Sorry if I wasn’t clear, I meant to primarily share an
            observation about vastly different responses in discussions related
            to LLMs. I don’t believe LLMs are completely useless for
            non-trivial stuff, nor I believe that they won’t get better. Even
            those two problems in the linked article: sure, those actions are
            inherently alien to the LLM’s structure itself, but can be solved
            with augmentation.
       
        ziotom78 wrote 9 hours 11 min ago:
        I fully resonate with point #2. A few days ago, I was stuck trying to
        implement some feature in a C++ library, so I used ChatGPT for
        brainstorming.
        
        ChatGPT proposed a few ideas, all apparently reasonable, and then it
        advocated for one that was presented unambiguously as the "best". After
        a few iterations, I realized that its solution would have required a
        class hierarchy where the base class contained a templated virtual
        function, which is not allowed in C++. I pointed this out to ChatGPT
        and asked it to rethink the solution; it then immediately advocated for
        the other approach it had initially suggested.
       
        sxp wrote 9 hours 13 min ago:
        Another place where LLMs have a problem is when you ask them to do
        something that can't be done via duct taping a bunch of Stack Overflow
        posts together. E.g, I've been vibe coding in Typescript on Deno
        recently. For various reasons, I didn't want to use the standard
        Express + Node stack which is what most LLMs seem to prefer for web
        apps. So I ran into issues with Replit and Gemini failing to handle the
        subtle differences between node and deno when it comes to serving HTTP
        requests.
        
        LLMs also have trouble figuring out that a task is impossible. I wanted
        boilerplate code that rendered a mesh in Three.js using
        GL_TRIANGLE_STRIP because I was writing a custom shader and needed to
        experiment with the math. But Three.js does support GL_TRIANGLE_STRIP
        rendering for architectural reasons. Grok, ChatGPT, and Gemini all
         hallucinated a GL_TRIANGLE_STRIP rendering API rather than telling me
         about this and I had to Google the problem myself.
        
        It feels like current Coding LLMs are good at replacing junior
        engineers when it comes to shallow but broad tasks like creating UIs,
        modifying examples available on the web, etc. But they fail at
        senior-level tasks like realizing that the requirements being asked of
        them aren't valid and doing something that no one has done in their
        corpus of training data.
       
          athrowaway3z wrote 9 hours 9 min ago:
          >But Three.js does support GL_TRIANGLE_STRIP rendering for
          architectural reasons.
          
          Typo or trolling the next LLM to index HN comments?
       
        the_mitsuhiko wrote 9 hours 17 min ago:
        > LLMs don’t copy-paste (or cut and paste) code. For instance, when
        you ask them to refactor a big file into smaller ones, they’ll
        "remember" a block or slice of code, use a delete tool on the old file,
        and then a write tool to spit out the extracted code from memory. There
        are no real cut or paste tools. Every tweak is just them emitting write
        commands from memory. This feels weird because, as humans, we lean on
        copy-paste all the time.
        
         There is not that much copy/paste that happens as part of
         refactoring, so it leans on just using context recall. It's not
         entirely clear if providing an actual copy/paste command is
         particularly useful; at least in my testing it does not do much. More
         interesting are repetitive changes that clog up the context. Those
         you can improve on if you have `fastmod` or some similar tool
         available: you can instruct codex or claude to perform the edits with
         it.
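         
         The same idea as a throwaway Python codemod, just for illustration
         (the pattern, replacement and file glob are made up; fastmod itself
         is the nicer interactive version of this):
         
             import pathlib, re
             
             def codemod(root, pattern, replacement):
                 # One deterministic pass over the tree: the agent only has
                 # to decide on the pattern and replacement, not emit every
                 # touched line from memory.
                 for path in pathlib.Path(root).rglob("*.py"):
                     text = path.read_text()
                     new_text = re.sub(pattern, replacement, text)
                     if new_text != text:
                         path.write_text(new_text)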
        
        > And it’s not just how they handle code movement -- their whole
        approach to problem-solving feels alien too.
        
        It is, but if you go back and forth to work out a plan for how to solve
        the problem, then the approach greatly changes.
       
          brianpan wrote 8 hours 54 min ago:
          How is it not clear that it would be beneficial?
          
          To use another example, with my IDE I can change a signature or
          rename something across multiple files basically instantly. But an
          LLM agent will take multiple minutes to do the same thing and doesn't
          get it right.
       
            the_mitsuhiko wrote 8 hours 35 min ago:
            > How is it not clear that it would be beneficial?
            
            There is reinforcement learning on the Anthropic side for a text
            edit tool, which is built in a way that does not lend itself to
            copy/paste. If you use a model like the GPT series then there might
            not be reinforcement learning for text editing (I believe, I don't
            really know), but it operates on line-based replacements for the
            most part and for it to understand what to manipulate it needs to
            know the content in the context.  When you try to give it a
            copy/paste buffer it does not fully comprehend what the change in
            the file looks like after the operation.
            
            So it might be possible to do something with copy/paste, but I did
            not find it to be very obvious how you make that work with an
            agent, given that it needs to read the file into context anyways
            and its recall capabilities are surprisingly good.
            
            > To use another example, with my IDE I can change a signature or
            rename something across multiple files basically instantly.
            
             So yeah, that's the more interesting case, and there things like
             codemod/fastmod are very effective if you tell an agent to use
             them. They just don't reach for them on their own.
       
          3abiton wrote 9 hours 14 min ago:
           I think copy/paste can alleviate context explosion. Basically the
           model can keep track of what the code block contains and access it
           at any time, without needing to "remember" it verbatim.
       
        giancarlostoro wrote 9 hours 18 min ago:
         Point #2 cracks me up because I do see with JetBrains AI (no fault of
         JetBrains, mind you) that the model updates the file, and sometimes I
         somehow wind up with a few build errors, or other times 90% of the
         file is now build errors. Hey, what? Did you not run some sort of
         what-if check?
       
        throw-10-8 wrote 9 hours 20 min ago:
        3. Saying no
        
        LLMs will gladly go along with bad ideas that any reasonable dev would
        shoot down.
       
          pimeys wrote 9 hours 1 min ago:
           I've found Codex to be better here than Claude. It has stopped many
           times and said "hey, you might be wrong". Of course this changes
           with a larger context.
           
           Claude is just chirping away with "You're absolutely right", making
           me turn on caps lock when I talk to it, and it's not even noon yet.
       
            throw-10-8 wrote 8 hours 55 min ago:
             I find the chirpy, affirmative tone of Claude to be
             rage-inducing.
       
              pimeys wrote 8 hours 43 min ago:
              This. The biggest reason I went with OpenAI this month...
       
                throw-10-8 wrote 4 hours 23 min ago:
                My "favorite" is when it makes a mistake and then tries
                gaslight you into thinking it was your mistake and then
                confidently presents another incorrect solution.
                
                All while having the tone of an over caffeinated intern who has
                only ever read medium articles.
       
          nxpnsv wrote 9 hours 4 min ago:
          Agree, this is really bad.
       
            throw-10-8 wrote 9 hours 1 min ago:
            It's a fundamental failing of trying to use a statistical
            approximation of human language to generate code.
            
            You can't fix it.
       
        Vipsy wrote 9 hours 20 min ago:
        Coding agents tend to assume that the development environment is static
        and predictable, but real codebases are full of subtle, moving parts -
        tooling versions, custom scripts, CI quirks, and non-standard file
        layouts.
        
         Many agents break down not because the code is too complex, but
         because invisible, "boring" infrastructure details trip them up.
         Human developers subconsciously navigate these pitfalls using tribal
         memory and accumulated hacks, but agents bluff through them until
         confronted by an edge case. This is why even trivial tasks
         intermittently fail with automation agents: you're not fighting logic
         errors, but mismatches with the real, lived context. Upgrading this
         context-awareness would be a genuine step change.
       
          pimeys wrote 9 hours 5 min ago:
          Yep. One of the things I've found agents always have a lot of
          trouble with is anything related to OpenTelemetry. There's a thing
          you call that uses some global somewhere, there's a docker container
          or two, and there are the timing issues. It takes multiple tries to
          get anything right. Of course this is hard for a human too if you
          haven't used otel before...
       
        tjansen wrote 9 hours 22 min ago:
        Agreed with the points in that article, but IMHO the no. 1 issue is
        that agents only see a fraction of the code repository. They don't
        know whether there is a helper function they could use, so they
        re-implement it. When contributing to UIs, they can't check the whole
        UI to identify common design patterns, so they reinvent them.
        
        The most important task for the human using the agent is to provide the
        right context. "Look at this file for helper functions", "do it like
        that implementation", "read this doc to understand how to do it"... you
        can get very far with agents when you provide them with the right
        context.
        
        (BTW, another issue is that they have problems navigating the
        directory structure in a large monorepo. When the agent needs to run
        commands like 'npm test' in a sub-directory, it almost never gets it
        right the first time.)
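        
        One thing that helps with the monorepo problem is spelling out the
        exact invocation in the context instead of letting the agent guess
        the directory, something like (package path hypothetical):
        
            npm test --workspace=packages/api
        
        or the older npm --prefix packages/api test, so it never has to
        figure out where to cd first.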
       
          bunderbunder wrote 1 hour 40 min ago:
          This is what I keep running into. Earlier this week I did a code
          review of a sizeable chunk of new code, written using Cursor to
          implement a feature from scratch, and I'd say maybe 200 of those
          lines were really necessary.
          
          But, y'know what? I approved it. Because hunting down the existing
          functions it should have used in our utility library would have taken
          me all day. 5 years ago I would have taken the time because a PR like
          that would have been submitted by a new team member who didn't know
          the codebase well, and helping to onboard new team members is an
          important part of the job. But when it's a staff engineer using
          Cursor to fill our codebase with bloat because that's how management
          decided we should work, there's no point. The LLM won't learn
          anything and will just do the same thing over again next week, and
          the staff engineer already knows better but is being paid to pretend
          they don't.
       
            ahi wrote 39 min ago:
            I really, really hate code review now. My colleagues will have
            their LLMs generate thousands of lines of boilerplate with every
            pattern and abstraction under the sun. A lazy programmer used to
            do the bare minimum and write not enough code. That made review
            easy: error handling here, duplicate code there, descriptive
            naming here, and so on. Now a lazy programmer generates a crapload
            of code cribbed from "best practice" tutorials, much of it
            unnecessary and irrelevant for the actual task at hand.
       
            tjansen wrote 1 hour 3 min ago:
            >>because that's how management decided we should work, there's no
            point
            
            If you are personally invested, there would be a point. At least if
            you plan to maintain that code for a few more years.
            
            Let's say you have a common CSS file, where you define .warning
            {color: red}. If you want the LLM to put out a warning and you just
            tell it to make it red, without pointing out that there is the
            .warning class, it will likely create a new CSS def for that
            element (or even inline it - the latest Claude Code has a tendency
            to do that). That's fine and will make management happy for now.
            
            But if later management decides that it wants all warning messages
            to be pink, it may be quite a challenge to catch every place
            without missing one.
       
              bunderbunder wrote 36 min ago:
              There really wouldn't be; it would just be spitting into the
              wind. What am I going to do, convince every member of my team to
              ignore a direct instruction from the people who sign our
              paychecks?
       
          hwillis wrote 3 hours 49 min ago:
          That's what claude.md etc are for.  If you want it to follow your
          norms then you have to document them.
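          
          A rough sketch of the kind of entries that help (contents are made
          up):
          
              ## Conventions
              - Check src/lib/ for an existing helper before adding a new one.
              - Use the shared .warning/.error CSS classes; no inline styles.
              - Keep the existing directory layout; no new top-level folders.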
       
            tjansen wrote 1 hour 1 min ago:
            That's fine for norms, but I don't think you can use it to describe
            every single piece of your code. Every function, every type, every
            CSS class...
       
            ColonelPhantom wrote 1 hour 59 min ago:
            Well, sure, but from what I know, humans are way better at
            following 'implicit' instructions than LLMs. A human programmer can
            'infer' most of the important basic rules from looking at the
            existing code, whereas all this agents.md/claude.md/whatever stuff
            seems necessary to even get basic performance in this regard.
            
            Also, the agents.md website seems to mostly list README.md-style
            "how do I run this" instructions in its examples, not stylistic
            guidelines.
            
            Furthermore, it would be nice if these agents added such notes
            themselves. With a human, you tell them "this is wrong, do it that
            way" and they remember it. (Although this functionality seems to
            be being worked on?)
       
          rdsubhas wrote 7 hours 24 min ago:
          To be fair, this is a daily life story for any senior engineer
          working with other engineers.
       
          Leynos wrote 7 hours 25 min ago:
          I wonder if a large context model could be employed here via tool
          call. One of the great things Gemini chat can do is ingest a whole
          GitHub repo.
          
          Perhaps "before implementing a new utility or helper function, ask
          the not-invented-here tool if it's been done already in the codebase"
          
          Of course, now I have to check if someone has done this already.
       
            bunderbunder wrote 1 hour 29 min ago:
            Large context models don't do a great job of consistently attending
            to the entire context, so it might not work out as well in practice
            as continuing to improve the context engineering parts of coding
            agents would.
            
            I'd bet that most of the improvement in Copilot-style tools over the
            past year is coming from rapid progress in context engineering
            techniques, and the contribution of LLMs is more modest. LLMs'
            native ability to independently "reason" about a large slushpile of
            tokens just hasn't improved enough over that same time period to
            account for how much better the LLM coding tools have become. It's
            hard to see or confirm that, though, because the only direct
            comparison you can make is changing your LLM selection in the
            current version of the tool. Plugging GPT5 into the original
            version of Copilot from 2021 isn't an experiment most of us are
            able to try.
       
            knes wrote 2 hours 9 min ago:
            This is what we do at Augmentcode.com.
            
            We started by building the best code retrieval we could and built
            an agent around it.
       
            4b11b4 wrote 3 hours 48 min ago:
            Sure, but just because it went into the context doesn't mean the
            LLM "understands" it. Also, not all sections of the context are
            equal.
       
            itsdavesanders wrote 4 hours 7 min ago:
            Claude can use tools to do that, and some different code
            indexer MCPs work, but that depends on the LLM doing the coding to
            make the right searches to find the code. If you are in a project
            where your helper functions or shared libs are scattered everywhere
            it’s a lot harder.
            
            Just like with humans it definitely works better if you follow good
            naming conventions and file patterns. And even then I tend to make
            sure to just include the important files in the context or clue the
            LLM in during the prompt.
            
            It also depends on what language you use. A LOT. During the day I
            use LLMs with dotnet and it’s pretty rough compared to when I’m
            using rails on my side projects. Dotnet requires a lot more
            prompting and hand-holding, due both to its complexity and to how
            much more verbose it is.
       
        schiho wrote 9 hours 26 min ago:
        I just ran into this issue with Claude Sonnet 4.5: I asked it to
        copy/paste some constants, a bigger chunk of code, from one file to
        another, and it instead "extracted" pieces and named them accordingly.
        As a last resort, after going back and forth, it agreed to do a file
        copy by running a system command. I was surprised that, of all the
        programming tasks, copy/paste felt challenging for the agent.
       
          tjansen wrote 9 hours 19 min ago:
          I guess the LLMs are trained to know what finished code looks like.
          They don't really know the operations a human would use to get there.
       
        hu3 wrote 9 hours 26 min ago:
        I have seen LLMs in VSCode Copilot ask to execute 'mv oldfile.py
        newfile.py'.
        
        So there's hope.
        
        But often they just delete and recreate the file, indeed.
       
        AllegedAlec wrote 9 hours 27 min ago:
        On a more important level, I found that they still do really badly at
        even a minorly complex task without extreme babysitting.
        
        I wanted it to refactor a parser in a small project (2.5K lines total)
        because it'd gotten a bit too interconnected. It made a plan, which
        looked reasonable, so I told it to do this in stages, with checkpoints.
        It said it'd done so. I asked it "so is the old architecture also
        removed?" "No, it has not been removed." "Is the new structured used in
        place of the old one?" "No, it has not." 
        After it did so, 80% of the test suite failed because nothing it'd
        written was actually right.
        
        Did so three times with increasingly more babysitting, but it failed at
        the abstract task of "refactor this" no matter what with pretty much
        the same failure mode. I feel like I have to tell it exactly to make
        changes X and Y to class Z, remove class A etc etc, at which point I
        can't let it do stuff unsupervised, which is half of the reason for
        letting an LLM do this in the first place.
       
          jansan wrote 8 hours 20 min ago:
          I was hoping that LLMs being able to access strict tools, like Gemini
          using Python libraries, would finally give reliable results.
          
          So today I asked Gemini to simplify a mathematical expression with
          sympy. It did and explained to me how some part of the expression
          could be simplified wonderfully as a product of two factors.
          
          But it was all a lie. Even though I explicitly asked it to use sympy
          in order to avoid such hallucinations and get results that are
          actually correct, it used its own flawed reasoning on top and again
          gave me a completely wrong result.
          
          You still cannot trust LLMs. And that is a problem.
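          
          The frustrating part is that actually checking the claim takes a
          couple of lines of real sympy; a minimal sketch (the expression here
          is a made-up stand-in for mine):
          
              import sympy as sp
              
              x = sp.symbols('x')
              # hypothetical stand-in for the real expression
              expr = sp.sin(x)**2 + sp.cos(x)**2 + x**2 - 1
              claimed = x**2   # the "simplified" form the model reported
              
              print(sp.simplify(expr))   # what sympy actually gets: x**2
              print(sp.simplify(expr - claimed))  # 0 means the claim holds
          
          If the model had run something like this instead of narrating, the
          wrong result would have been caught immediately.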
       
            ogogmad wrote 5 hours 2 min ago:
            The obvious point has to be made: Generating formal proofs might be
            a partial fix for this. By contrast, coding is too informal for
            that approach to be as effective.
       
          coldtea wrote 8 hours 23 min ago:
          >I feel like I have to tell it exactly to make changes X and Y to
          class Z, remove class A etc etc, at which point I can't let it do
          stuff unsupervised, which is half of the reason for letting an LLM do
          this in the first place.
          
          The reason had better turn into "it can do stuff faster than I ever
          could if I give it step-by-step, high-level instructions" instead.
       
            AllegedAlec wrote 7 hours 16 min ago:
            That would be a solution, yes. But currently it feels extremely
            borked from a UX perspective. It purports to be able to do this,
            but when you tell it to it breaks in unintuitive ways.
            
            I hate this idea of "well you just need to understand all the
            arcane ways in which to properly use it to its proper effects".
            
            It's like a car which has a gear shifter, but it's not fully
            functional yet, so instead you switch gears by spelling out, in
            Morse code, the gear you want to go into, using L as short and R
            as long.
            Furthermore, you shouldn't try to listen to 105-112 on the FM band
            on the radio, because those frequencies are used to control the
            brakes and ABS and if you listen to those frequencies the brakes no
            longer work.
            
            We would rightfully stone any engineer who'd design this and then
            say "well obvious user error" when the user rightfully complains
            that they crash whenever they listen to Arrow FM.
       
              coldtea wrote 6 hours 16 min ago:
              >But currently it feels extremely borked from a UX perspective.
              It purports to be able to do this, but when you tell it to it
              breaks in unintuitive ways.
              
              Thankfully as programmers we know better and don't need to care
              what the UI pretends to be able to do :)
              
              >We would rightfully stone any engineer who'd design this and
              then say "well obvious user error" when the user rightfully
              complains that they crash whenever they listen to Arrow FM.
              
              We might curse the company and engineer who did it, but we would
              still use that car and do those workarounds, if doing so allowed
              us to get to our destination in 1/10 the regular time...
       
                AllegedAlec wrote 3 hours 5 min ago:
                > >But currently it feels extremely borked from a UX
                perspective. It purports to be able to do this, but when you
                tell it to it breaks in unintuitive ways.
                
                > Thankfully as programmers we know better and don't need to
                care what the UI pretends to be able to do :)
                
                But we do though. You can't just say "yeah, they left all the
                footguns in but we ought to know not to use them", especially
                not when the industry shills tell you those footguns are
                actually rocket boosters to get you to the fucking moon and
                back.
       
          jeswin wrote 8 hours 33 min ago:
          > I wanted it to refactor a parser in a small project
          
          This expression tree parser (typescript to sql query builder - [1] )
          has zero lines of hand-written code. It was made with Codex + Claude
          over two weeks (part-time on the side). Having worked on ORMs
          previously, it would have taken me 4x-10x the time to get to the same
          state (which also has 100s of tests, with some repetitions). That's a
          massive saving in time.
          
          I did not have to babysit the LLMs at all. So the answer is, I think
          it depends on what you use it for, and how you use it. Like every
          tool, it takes a really long time to find a process that works for
          you. In my conversations with other developers who use LLMs
          extensively, they all have their unique, custom workflows. All of
          them however do focus on test suites, documentation, and method
          review processes.
          
   URI    [1]: https://tinqerjs.org/
       
            AllegedAlec wrote 7 hours 55 min ago:
            I have tried several. Overall I've now settled on strict TDD
            (which it still seems to not do unless I explicitly tell it to,
            even though I have it as a hard requirement in claude.md).
       
              jeswin wrote 7 hours 21 min ago:
              Claude forgets claude.md after a while, so you need to keep
              reminding it. I find that Codex does a better design job than
              Claude at the moment, but it's 3x slower, which I don't mind.
       
            iLoveOncall wrote 8 hours 21 min ago:
            Hum yeah, it shows. Just the fact that the API looks completely
            different for Postgre and SQLite tells us everything we need to
            know about the quality of the project here.
       
              pprotas wrote 7 hours 24 min ago:
              I guess the interesting question is whether @jeswin could have
              created this project at all if AI tools were not involved. And if
              yes, would the quality even be better?
       
                iLoveOncall wrote 3 hours 32 min ago:
                Actually the interesting question is whether this library not
                existing would have been a loss for humanity. I'll posit that
                it would not.
       
                jeswin wrote 7 hours 18 min ago:
                Very true. However, to claim that the "API looks completely
                different for Postgre and SQLite" is disingenuous. What was he
                looking at?
       
                  tom_ wrote 2 hours 13 min ago:
                  There are two examples on the landing page, and they both
                  look quite different. Surely if the API is the same for both,
                  there'd be just one example that covers both cases, or two
                  examples would be deliberately made as identical as possible?
                  (Like, just a different new somewhere, or different import
                  directive at the top, and everything else exactly the same?)
                  I think that's the point.
                  
                  Perhaps experienced users of relevant technologies will just
                  be able to automatically figure this stuff out, but this is a
                  general discussion - people not terribly familiar with any of
                  them, but curious about what a big pile of AI code might
                  actually look like, could get the wrong impression.
       
                    jeswin wrote 5 min ago:
                    If you're mentioning the first two examples, they're doing
                    different things. The pg example does an orderby, and the
                    sqlite example does a join. You'll be able to switch the
                    client (ie, better-sqlite and pg-promise) in either
                    statement, and the same query would work on the other
                    database.
                    
                    Maybe I should use the same example repeated for clarity.
                    Let me do that.
       
              jeswin wrote 7 hours 50 min ago:
              > Just the fact that the API looks completely different for
              Postgre and SQLite tells us everything we need to know about the
              quality of the project here.
              
              How does the API look completely different for pg and sqlite? Can
              you share an example?
              
              It's an implementation of LINQ's IQueryable. With some bells
              missing in DotNet's Queryable, like Window functions (RANK
              queries etc) which I find quite useful.
              
              Add: What you've mentioned is largely incorrect. But in any
              case, it is a query builder, meaning an ORM-like database
              abstraction is not the goal. This allows us to support pg's
              extensions, which aren't applicable to other databases.
       
          habibur wrote 9 hours 0 min ago:
          Might be related to what the article was talking about. AI can't
          cut-paste. It deletes the code and then regenerates it at another
          location instead of cutting and pasting.
          
          Obviously the regenerated code drifts a little from the deleted
          code.
       
          hu3 wrote 9 hours 24 min ago:
          Interesting. What model and tool was used?
          
          I have seen similar failure modes in Cursor and VSCode Copilot (using
          gpt5) where I have to babysit relatively small refactors.
       
            AllegedAlec wrote 9 hours 21 min ago:
            Claude code. Whichever model it started up automatically last
            weekend, I didn't explicitly check.
       
              rglynn wrote 9 hours 17 min ago:
              This feels like a classic Sonnet issue. From my experience, Opus
              or GPT-5-high are less likely to do the "narrow instruction
              following without making sensible wider decisions based on
              context" than Sonnet.
       
                coldtea wrote 8 hours 22 min ago:
                This is "just use another Linux distro" all over again
       
                  rglynn wrote 4 hours 4 min ago:
                  Yes and no, it's a fair criticism to some extent, inasmuch
                  as I would agree that different models of the same type have
                  superficial differences.
                  
                  However, I also think that models which focus on higher
                  reasoning effort in general are better at taking into account
                  the wider context and not missing obvious implications from
                  instructions. Non-reasoning or low-reasoning models serve a
                  purpose, but to suggest they are akin to different flavours
                  misses what is actually quite an important distinction.
       
        koliber wrote 9 hours 31 min ago:
        Most developers are also bad at asking questions. They tend to assume
        too many things from the start.
        
        In my 25 years of software development I could apply the second
        critique to over half of the developers I knew. That includes myself
        for about half of that career.
       
          rkomorn wrote 9 hours 28 min ago:
          But, just like lots of people expect/want self-driving to outperform
          humans even on edge cases in order to trust them, they also want "AI"
          to outperform humans in order to trust it.
          
          So: "humans are bad at this too" doesn't have much weight (for people
          with that mindset).
          
          It makes sense to me, at least.
       
            darkwater wrote 9 hours 22 min ago:
            If we had a knife that most of the time cuts a slice of bread like
            the bottom p50 of humans cutting a slice of bread with their hands,
            we wouldn't call the knife useful.
            
            Ok, this example is probably too extreme; replace the knife with
            an industrial machine that cuts bread vs a human with a knife.
            Nobody would buy that machine either if it worked like that.
       
              koliber wrote 8 hours 46 min ago:
              Agreed in a general sense, but there's a bit more nuance.
              
              If a knife slices bread like a normal human at p50, it's not a
              very good knife.
              
              If a knife slices bread like a professional chef at p50, it's
              probably a very decent knife.
              
              I don't know if LLMs are better at asking questions than a p50
              developer. In my original comment I wanted to raise the question
              of whether the fact that LLMs are not good at asking questions
              makes them still worse than human devs.
              
              The first LLM critique in the original article is that they can't
              copy and paste. I can't argue with that. My 12-year-old copies
              and pastes better than top coding agents.
              
              The second critique says they can't ask questions. Since many
              developers also are not good at this, how does the current state
              of the art LLM compare to a p50 developer in this regard?
       
              Certhas wrote 8 hours 50 min ago:
              I think this is still too extreme. A machine that cuts and preps
              food at the same level as a 25th percentile person _being paid to
              do so_, while also being significantly cheaper would presumably
              be highly relevant.
       
                rkomorn wrote 8 hours 40 min ago:
                Aw man. There are so many angles though.
                
                Your p25 employee is probably much closer to your p95 employee
                than to the p50 "standard" human, so yeah, I think you have a
                point there.
                
                But at least in food prep, p25 would already be pretty damn
                hard to achieve. That's a hell of a lot of autonomy and
                accuracy (at least in my restaurant kitchen experience which is
                admittedly just one year in "fine dining"-ish kitchens).
                
                I'd say the p25 of software or SRE folks I've worked with is
                also a pretty high bar to hit, too, but maybe I've been lucky.
       
              rkomorn wrote 9 hours 15 min ago:
              I feel kind of attacked for my sub-p50 bread slicing skills, TBH.
              :(
       
        rconti wrote 9 hours 31 min ago:
        Doing hard things that aren't greenfield? Basically any difficult and
        slightly obscure question I get stuck with and hope the collective
        wisdom of the internet can solve?
       
          athrowaway3z wrote 8 hours 56 min ago:
          You don't learn new languages/paradigms/frameworks by inserting them
          into an existing project.
          
          LLMs are especially tricky because they do appear to work magic on a
          small greenfield, and the majority of people are doing
          clown-engineering.
          
          But I think some people are underestimating what can be done in
          larger projects if you do everything right (eg docs, tests, comments,
          tools) and take time to plan.
       
        nikanj wrote 9 hours 31 min ago:
        4/5 times when Claude is looking for a file, it starts by running
        bash(dir c:\test /b)
        
        First it gets an error because bash doesn’t understand \
        
        Then it gets an error because /b doesn’t work
        
        And as LLMs don’t learn from their mistakes, it always spends at
        least half a dozen tries (e.g. bash(cmd.exe /c dir c:\test /b )) before
        it figures out how to list files
        
        If it was an actual coworker, we’d send it off to HR
       
          cheema33 wrote 8 hours 52 min ago:
          Most models struggle in a Windows environment. They are trained on a
          lot of Unixy commands and not as much on Windows and PowerShell
          commands. It was frustrating enough that I started using WSL for
          development when using Windows. That helped me significantly.
          
          I am guessing this is because:
          
          1. Most of the training material online references Unix commands.
          2. Most Windows devs are used to GUIs for development using Visual
          Studio etc. GUIs are not as easy to train on.
          
          Side note:
          Interesting thing I have noticed in my own org is that devs with a
          Windows background strictly use GUIs for git. The rest are
          comfortable with using git from the command line.
       
          anonzzzies wrote 9 hours 21 min ago:
          I have a list of those things in CLAUDE.md -> it seems to help
          (unless its context is full, but you should never let it get close
          to full really).
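          
          For example, entries along these lines (wording made up, but this
          is the kind of thing that works):
          
              ## Environment
              - Windows machine; the shell tool runs Git Bash, not cmd.exe.
              - List files with ls, not dir /b; use forward slashes in paths
                (c:/test, not c:\test).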
       
        ra wrote 9 hours 32 min ago:
        IaC, and DSLs in general.
       
        IanCal wrote 9 hours 43 min ago:
        Editing tools are easy to add; it's just that you have to pick which
        ones to give them, because with too many they struggle and it uses up
        a lot of context. Still, as costs come down, multiple steps to look
        for tools become cheaper too.
        
        I’d like to see what happens with better refactoring tools, I’d
        make a bunch more mistakes copying and retyping or using awk. If they
        want to rename something they should be able to use the same tooling
        the rest of us get.
        
        Asking questions is a good point, but that's partly a matter of
        prompting, and I think the move to having more parallel work makes it
        less relevant. One of the reasons clarifying things upfront is useful
        is that we take a lot of time and money to build things, so the
        economics favour getting it right the first time. As the time comes
        down and the cost drops to near zero, the balance changes.
        
        There are also other approaches to clarify more what you want and how
        to do it first, breaking that down into tasks, then letting it run with
        those (spec kit). This is an interesting area.
       
        baq wrote 9 hours 46 min ago:
        They're getting better at asking questions; I routinely see search
        calls against the codebase index. They just don't ask me questions.
       
        davydm wrote 11 hours 23 min ago:
        Coding and...?
       
          Black616Angel wrote 9 hours 48 min ago:
          Copy and pasting.
          
          Oh, sorry. You already said that. :D
       
          drdeca wrote 9 hours 50 min ago:
          More granular. What things is it bad at that result in it being
          overall “bad at coding”? It isn’t all of the parts.
       
       
   DIR <- back to front page