_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   How we made our AI code review bot stop leaving nitpicky comments
       
       
        lupire wrote 16 hours 35 min ago:
        Instead of training an LLM to make comments, train the LLM to generate
        test cases and then guess the line of code that breaks the test.
       
          dakshgupta wrote 15 hours 46 min ago:
          You might enjoy @goodside on Twitter. He’s a prompt engineer at
          Scale AI and a lot of his observations and techniques are
          fascinating.
       
        utdiscant wrote 18 hours 23 min ago:
        "We picked the latter, which also gave us our performance metric -
        percentage of generated comments that the author actually addresses."
        
        This metric would go up if you leave almost no comments. Would it not
        be better to find a metric that rewards you for generating many
        comments which are addressed, not just having a high relevance?
        
        You even mention this challenge yourselves: "Sadly, even with all kinds
        of prompting tricks, we simply could not get the LLM to produce fewer
        nits without also producing fewer critical comments."
        
        If that was happening, that doesn't sound like it would be reflected in
        your performance metric.
       
          SomewhatLikely wrote 15 hours 23 min ago:
          You could probably modify the metric to addressed comments per 1000
          lines of code.
       
          dakshgupta wrote 15 hours 49 min ago:
           Good criticism that we should pay closer attention to. Someone else
           pointed this out too, and since then we’ve started tracking
           addressed comments per file changed as well.
       
        aarondia wrote 18 hours 41 min ago:
        I find these retro blog posts from people building LLM solutions super
        useful for helping me try out new prompting, eval, etc. techniques.
        Which blog posts have you all found most useful? Would love a link.
       
        iandanforth wrote 19 hours 50 min ago:
        "As a last resort we tried machine learning ..."
        
        - Hilarious that a cutting edge solution (document embedding and
        search) from 5-6 years ago was their last resort.
        
        - Doubly hilarious that "throw more AI at it" surprised them when it
        didn't work.
       
        AgentOrange1234 wrote 21 hours 17 min ago:
        I’d be curious to hear more on Attempt 2. It sounds like the approach
        was basically to ask an llm for a score for each comment. Adding
        specifics to this prompt might go a long way? Like, what specifically
        is the rationale for this change, is this likely to be a functional
        bug, is it a security issue, how does it impact maintainability over
        the long run, etc.; basically I wonder if asking about more specific
        criteria and trying to define what you mean by nits can help the LLM
        give you more reliable scores.
       
          dakshgupta wrote 21 hours 11 min ago:
          That’s an interesting point - we didn’t try this. Now that you
          said that, I bet even clearly defining what each number on the scale
          means would help.
       
        pcwelder wrote 23 hours 41 min ago:
        A whole article without mentioning the name of the LLM. It's not like
        Sonnet and O1 have the same modalities.
        
        Anything you do today might become irrelevant tomorrow.
       
        hsbauauvhabzb wrote 1 day ago:
        I can’t fathom a world where an LLM would be able to review code in
        any meaningful way -at all-.
        
         It should not substitute for a human, and it probably wastes more
         effort than it saves by a wide margin.
       
          makingstuffs wrote 22 hours 14 min ago:
          Yeah it seems like the inverse of what is logical: Have a human
          review the code which _may_ have been written/influenced by an LLM.
          Having an LLM do code review seems about as useful as YOLO merging to
          main and praying to the Flying Spaghetti Monster...
       
          dakshgupta wrote 22 hours 40 min ago:
          I understand your skepticism and discomfort, and also agree that an
          LLM should not replace a human code reviewer.
          
          I would encourage you to try one (nearly all including ours have a
          free trial).
          
          When done right, they serve as a solid first pass and surface things
          that warrant a second look. Repeated code where there should be an
          abstraction, inconsistent patterns from other similar code elsewhere
          in the codebase, etc. Things that linters can’t do.
          
          Would you not want your coworkers to have AI look at their PRs, have
          them address the relevant comments, and then pass it to you for
          review?
       
            hsbauauvhabzb wrote 13 hours 39 min ago:
            No, just like I don’t want coworkers using LLMs to generate code.
            LLMs have their uses, producing quality code is not one of them,
            they generate cancerous outcomes.
       
            smallpipe wrote 19 hours 56 min ago:
            > Would you not want your coworkers to have AI look at their PRs,
            have them address the relevant comments
            
            God no. Review is where I teach more junior people about the code
            base. And where I learn from the more senior people. Either of us
            spending time making the AI happy just to be told we’re solving
            the wrong problem is a ridiculous waste of time.
       
          hsbauauvhabzb wrote 23 hours 48 min ago:
          And for any poor soul forced into an environment using AI driven
          commit reviews, be sure you reply to the commit reviews with gpt
          output and continue the conversation, eventually bring in some of the
          AI ‘engineers’ to help resolve the issue, waste their time back.
       
        wzdd wrote 1 day ago:
        Previously:
        
   URI  [1]: https://news.ycombinator.com/item?id=42465374
       
        planetpluta wrote 1 day ago:
        > Essentially we needed to teach LLMs (which are paid by the token) to
        only generate a small number of high quality comments.
        
        The solution of filtering after the comment is generated doesn’t seem
        to address the “paid by the token” piece.
       
          dietr1ch wrote 1 day ago:
           It may be a step forward in terms of gathering data and defining the
           filter, but pushing this filter upstream might be expensive unless
           there's enough volume to justify it.

           I took it as a comment that models will generally be biased towards
           being nitpicky, and that's something that needs to be dealt with,
           since the incentives are not there to fix things at the origin.
       
        thomasahle wrote 1 day ago:
        TLDR:
        
        > Giving few-shot examples to the generator didn't work.
        
        > Using an LLM-judge (with no training) didn't work.
        
        > Using an embedding + KNN-classifier (lots of training data) worked.
        
        I don't know why they didn't try fine-tuning the LLM-judge, or at least
        give it some few-shot examples.
        
        But it shows that embeddings can make very simple classifiers work
        well.
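
         For anyone curious what that last setup looks like, a minimal
         sketch, assuming sentence-transformers and scikit-learn (a
         hypothetical illustration, not the post's actual pipeline):

             # Embed past review comments labeled by developer
             # up/downvotes, then classify a new comment by its
             # nearest labeled neighbor.
             from sentence_transformers import SentenceTransformer
             from sklearn.neighbors import KNeighborsClassifier

             model = SentenceTransformer("all-MiniLM-L6-v2")

             past = ["Style: consider renaming `tmp` to something clearer.",
                     "This loop re-reads the whole file on every iteration."]
             labels = ["nit", "useful"]

             clf = KNeighborsClassifier(n_neighbors=1)
             clf.fit(model.encode(past), labels)

             new = "Style: prefer a named constant over this magic number."
             print(clf.predict(model.encode([new]))[0])  # likely "nit"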
       
        Retr0id wrote 1 day ago:
        > LLMs (which are paid by the token)
        
         Straightforwardly true, and yet I'd never thought about it like this
         before, i.e. that there's a perverse incentive for LLM vendors to tune
        for verbose outputs. We rightly raise eyebrows at the idea of
        developers being paid per volume of code, but it's the default for
        LLMs.
       
          johnfn wrote 19 hours 4 min ago:
          LLMs are paid by input token and output token. LLM developers are
          just as incentivized to give quality output so you make more calls to
          the LLM.
       
          rgbrgb wrote 21 hours 6 min ago:
          In the current landscape I think competition handles this neatly,
          especially since open models have almost the opposite incentive
          (training on concision is cheaper). As siblings note, Claude tends to
          be a little less verbose. Though I find they can all be quite concise
          when instructed (e.g. “just the code, no yapping”).
       
          vergessenmir wrote 21 hours 42 min ago:
           It's why I prefer Claude to, let's say, ChatGPT. I'm paying for
           tokens generated too.
       
          britannio wrote 1 day ago:
          Only while a vendor is ahead of the others. We developers will favour
          a vendor with faster inference and lower pricing.
       
            dustingetz wrote 16 hours 31 min ago:
            then, cartels
       
            Retr0id wrote 18 hours 58 min ago:
             I think the real endgame is some kind of "no win no fee" arrangement.
            In the case of the article in the OP, it'd be if they billed
            clients per "addressed" comment. It's less clear how that would map
            onto someone selling direct access to an LLM, though.
       
        Kwpolska wrote 1 day ago:
        > Please don't use HN primarily for promotion. It's ok to post your own
        stuff part of the time, but the primary use of the site should be for
        curiosity.
        
   URI  [1]: https://news.ycombinator.com/newsguidelines.html
       
          mosselman wrote 18 hours 8 min ago:
          I don't feel like this is primarily promotion. Sure it is a bit of a
          content marketing article, but still offers some potentially valuable
          insight.
       
            Kwpolska wrote 16 hours 30 min ago:
            A look at OP’s posting history suggests he posts a lot of
            advertising.
       
        keybored wrote 1 day ago:
         Why review bots? Why not a bot that runs as part of the validation
         suite? And then you dismiss and follow up on what you want?
        
        You can run that locally.
       
        profsummergig wrote 1 day ago:
        Is there a name for the activity of trying out different strategies to
        improve the output of AI?
        
         I found this article surprisingly enjoyable and interesting, and I'd
         like to find more like it.
       
          thrw42A8N wrote 1 day ago:
          Prompt engineering
       
            profsummergig wrote 16 hours 35 min ago:
             But this wasn't a different prompt that made it work better. Here
             they created a database of prior human-created comments, with
             ratings of how useful the devs found them.
       
        fnqi8ckfek wrote 1 day ago:
         Reading the other comments here, am I the only one thinking that this
         is just busywork? Just get rid of the thing; it's a solution without a
         problem.
       
          Hilift wrote 1 day ago:
          I guess normally it's difficult to stop an employee from leaving
          nitpicky comments. But with AI you can finally take control.
       
          stavros wrote 1 day ago:
          I'm fairly sure there are at least a few companies that have the
          problem of needing PRs reviewed.
       
          dannersy wrote 1 day ago:
          I've been following this conversation and many of the sentiments
          against an AI code review bot are gone. I guess they're getting
          shadow banned?
       
        extr wrote 1 day ago:
        I'm surprised the authors didn't try the "dumb" version of the solution
         they went with: instead of using fancy cosine similarity to create
         implicit clusters, just ask it to classify the comment along a few
        dimensions and then do your own filtering on that (or create your own
        0-100 scoring!) Seems like you would have more control that way and
        actually derive some rich(er) data to fine tune on. It seems they are
        already almost doing this: all the examples in the article start with
        "style"!
        
        I have seen this pattern a few times actually, where you want the AI to
        mimic some heuristic humans use. You never want to ask it for the
         heuristic directly, just create the constituent data so you can do some
        simple regression or whatever on top of it and control the cutoff
        yourself.
       
          pacifika wrote 1 day ago:
           I thought the LLM would just make up the scores because of a lack of
           training.
       
        anonzzzies wrote 1 day ago:
        We found that in general it's pretty hard to make the llm just stop
        without human intervention. You can see with things like Cline that if
        the llm has to check its own work, it'll keep making 'improvements' in
         a loop: removing all comments, adding all comments, etc. It needs to
         generate something, and it seems overly eager to give you something.
       
          seb1204 wrote 1 day ago:
           Would the LLM make fewer nitpicking comments if the codebase
           included coding style and commenting rules in the repository?
       
        righthand wrote 1 day ago:
        Our AI code review bot (Codacy) is just an LLM that compiles all linter
        rules and might be the most annoying useless thing. For example it will
        ding your PR for not considering Opera browser limitations on a backend
        NodeJS PR.
        
         Furthermore, in most of the code reviews I perform I rarely leave
         commentary. There are so many frameworks and libraries today that
         solve whatever problem that, unless someone adds complex code or puts
         a file in a goofy spot, it’s an instant approval. So an AI bot
         doesn’t help with what is already a minimal, non-problem task.
       
          oakejp12 wrote 16 hours 9 min ago:
          Are the junior programmers that you hire that good that you don't
          need training or commentary? I find I spend a lot of time reviewing
          MRs and leaving commentary. For instance, this past week, I had a
          junior apply the same business logic across classes instead of
          creating a class/service and injecting it as a dependency.
          
          This, improper handling of exceptions, missed testing cases or no
          tests at all, incomplete types, a misunderstanding of a nuanced
          business case, etc. An automatic approval would leave the codebase in
          such a dire state.
       
            righthand wrote 12 hours 57 min ago:
            I still pair with juniors, sometimes even the code review is a
            pairing session. When did I say I don’t need training or
            commentary? I’m talking about useless AI tools, not day to day
            work.
       
          jahnu wrote 1 day ago:
          Our approach is to leave PR comments as pointers in the areas where
          the reviewer needs to look, where it is most important, and they can
          forget the rest. I want good code reviews but don’t want to waste
          anyone’s time.
       
            righthand wrote 19 hours 6 min ago:
            I do this as well for my PRs, it’s a good way to leave a little
            explainer on the code especially with large merge requests.
       
            minasmorath wrote 20 hours 26 min ago:
             To be honest, I tried to work with GitHub Copilot to see if it
            could help junior devs focus in on the important parts of a PR and
            increase their efficiency and such, but... I've found it to be
            worse than useless at identifying where the real complexities and
            bug opportunities actually are.
            
            When it does manage to identify the areas of a PR with the highest
            importance to review, other problematic parts of the PR will go
            effectively unreviewed, because the juniors are trusting that the
            AI tool was 100% correct in identifying the problematic spots and
            nothing else is concerning. This is partially a training issue, but
            these tools are being marketed as if you can trust them 100% and so
            new devs just... do.
            
            In the worst cases I see, which happen a good 30% of the time at
            this point based on some rough napkin math, the AI directs junior
            engineers into time-wasting rabbit holes around things that are
            actually non-issues while actual issues with a PR go completely
            unnoticed. They spend ages writing defensive code and polyfills for
            things that the framework and its included polyfills already 100%
             cover. But of course, that code was usually AI generated too, and
            it's incomplete for the case they're trying to defend against
            anyway.
            
            So IDK, I still think there's value in there somewhere, but
            extracting it has been an absolute nightmare for me.
       
          Spivak wrote 1 day ago:
          I've honestly considered just making a script that hits the checkbox
          automatically because we have to have "code review" for compliance
          but 99% of the time I simply don't care. Is the change you made gonna
          bring down prod, no, okay ship it.
          
          At some level if I didn't trust you to not write shitty code you
          wouldn't be on our team. I don't think I want to go all the way and
          say code review is a smell but not really needing it is what code
          ownership, good integration tests, and good qa buys you.
       
            Tainnor wrote 16 hours 38 min ago:
            > At some level if I didn't trust you to not write shitty code you
            wouldn't be on our team.
            
            Code review isn't about preventing "shitty code" because you don't
            trust your coworkers.
            
            Of course, if 99% of the time you don't care about the code review
            then it's not going to be an effective process, but that's a
            self-fulfilling prophecy.
       
            rgbrgb wrote 21 hours 2 min ago:
            How do you bootstrap the trust without code review? Always confused
            how teams without review onboard someone new.
       
              criley2 wrote 20 hours 16 min ago:
              I think you just accept the risk, same as not doing code review.
              If you're doing something Serious™ then you follow the serious
              rules to protect your serious business. If it doesn't actually
              matter that much if prod goes down or you release a critical
              vulnerability or leak client data or whatever, then you accept
              the risk. It's cheaper to push out code quickly and cleanup the
              mess as long as stakeholders and investors are happy.
       
              ajmurmann wrote 20 hours 25 min ago:
              Pair-programming is one option. It might seem expensive at first
              glance but usually both parties get a lot of value from it (new
              possible ideas for the teacher if the learner isn't entirely
              junior and obviously quick learning for the new person)
       
                Spivak wrote 6 hours 31 min ago:
                We do pair programming for new hires, works well. I'm
                currently frustrated because our code reviews are both
                mandatory due to security certifications but also completely
                worthless to our team structure except Slack spam asking
                someone to click the approve button.
                
                * The work that is done is decided beforehand, the code in
                every PR corresponds to a card that's already been discussed.
                
                * There's no incentive whatsoever to "sneak something in" and
                if you do so maliciously you'll be fired and maybe prosecuted
                depending on the damage.
                
                * Your code goes through integration testing and QA so you'll
                never (immediately) take down prod.
                
                * I have backups coming out my ears that assume the code
                running on our app servers is actively malicious so you
                couldn't cause data loss if you tried.
                
                * Norms are all enforced in code, which makes discussions about
                style pointless. If it passes CI it's good enough.
       
            pacifika wrote 1 day ago:
             Code review is a good onboarding and upskilling tool, as a reviewer.
       
          mcbishop wrote 1 day ago:
          Some have a smaller tighter codebase... where it's realistic to
          pursue consistent application of an internal style guide (which
          current AI seems well suited to help with (or take ownership of)).
       
            doikor wrote 1 day ago:
            Code style is something that should be enforceable with linters and
            formatters for the most part without any need for an AI.
       
              lupire wrote 16 hours 44 min ago:
              That's not what style means.
       
            fallingknife wrote 1 day ago:
            A linter is well suited for this.  If you have style rules too
            complex for a linter to handle I think the problem is the rules,
            not the enforcement mechanism.
       
        jerrygoyal wrote 1 day ago:
        I'm in the market for PR review bots, as the nitpicking issue is real.
        So far, I have tried Coderabbit, but it adds too much noise to PRs, and
        only a very small percentage of comments are actually useful. I
        specifically instructed it to ignore nitpicks, but it still added such
        comments. Their cringy ASCII art comments make it even harder to take
        them seriously.
        
        I recently signed up for Korbit AI, but it's too soon to provide
        feedback. Honestly, I’m getting a bit fed up with experimenting with
        different PR bots.
        
        Question for the author: In what ways is your solution better than
        Coderabbit and Korbit AI?
       
          swells34 wrote 1 day ago:
          Isn't code review the one area you'd never want to replace the
          programmer in? Like, that is the most important step of the process;
          it's where we break down the logic and sanity check the
          implementation. That is expressly where AI is the weakest.
       
            eterm wrote 20 hours 40 min ago:
            Code review is also one of the last places to get adequate
            knowledge-sharing into many processes which have gone fully remote
            / asynchronous.
       
            hansvm wrote 1 day ago:
            If you think of the "AI" benefit as more of a linter then the value
            starts to become clear.
            
             E.g., I've never hesitated to add a linting rule requiring `except
             Exception:` in lieu of `except:` in Python, since a bare `except:`
             is very rarely required (so the necessary incantations to silence
             the linter are cheap on average) and since it is almost always a
             bug, prone to making shutdown/iteration/etc. harder than they
             should be. When I add that rule at new companies, ~90% of the
             violations are latent bugs.
            
            AI has the potential to (though I haven't seen it work well yet in
            this capacity) lint most such problematic patterns. In an ideal
             world, it'd even have the local context to know that, e.g., since
             you're handling a DB transaction and immediately re-throwing, the
             raw `except:` is appropriate, and not even flag it in the "review"
            (the AI linting), reducing the false positive rate. You'd still
            want a human review, but you could avoid bugging a human till the
            code is worth reviewing, or you could let the human focus on things
            that matter instead of harping on your use of low-level atomic
            fences yet again. AI has potential to improve the transition from
            "code complete" to "shipped and working."
       
              siscia wrote 1 day ago:
              Wouldn't be simpler to just use a linter then?
       
                hansvm wrote 19 hours 40 min ago:
                Linters suffer from a false positive/negative tradeoff that AI
                can improve. If they falsely flag things then developers tend
                to automatically ignore or silence the linter. If they don't
                flag a thing then ... well ... they didn't flag it, and that
                particular burden is pushed to some other human reviewer. Both
                states are less than ideal, and if you can decrease the rate of
                them happening then the tool is better in that dimension.
                
                How does AI fit into that picture then? The main benefits IMO
                are the abilities to (1) use contextual clues, (2) process
                "intricate" linting rules (implicitly, since it's all just text
                for the LLM -- this also means you can process many more
                linting rules, since things too complicated to be described
                nicely by the person writing a linter without too high of a
                false positive rate are unlikely to ever be introduced into the
                linter), and (3) giving better feedback when rules are broken.
                Some examples to compare and contrast:
                
                For that `except` vs `except Exception:` thing I mentioned, all
                a linter can do is check whether the offending pattern exists,
                making the ~10% of proper use cases just a little harder to
                develop. A smarter linter (not that I've seen one with this
                particular rule yet) could allow a bare `except:` if the
                exception is always re-raised (that being both the normal
                use-case in DB transaction handling and whatnot where you might
                legitimately want to catch everything, and also a coding
                pattern where the practice of catching everything is unlikely
                to cause the bugs it normally does). An AI linter can handle
                those edge cases automatically, not giving you spurious
                warnings for properly written DB transaction handling.
                Moreover, it can suggest a contextually relevant proper fix
                (`except BaseException:` to indicate to future readers that you
                considered the problems and definitely want this behavior,
                `except Exception:` to indicate that you do want to catch
                "everything" but without weird shutdown bugs, `except
                SomeSpecificException:` because the developer was just being
                lazy and would have accidentally written a new bug if they
                caught `Exception` instead, or perhaps just suggesting a
                different API if exceptions weren't a reasonable way to control
                the flow of execution at that point).
                
                As another example, you might have a linting rule banning
                low-level atomics (fences, seq_cst loads, that sort of thing).
                Sometimes they're useful though, and an AI linter could handle
                the majority of cases with advice along the lines of "the thing
                you're trying to do can be easily handled with a mutex; please
                remove the low-level atomics". Incorporating the context like
                that is impossible for normal linters.
                
                My point wasn't that you're replacing a linter with an
                AI-powered linter; it's that the tool generates the same sort
                of local, mechanical feedback a linter does -- all the stuff
                that might bog down a human reviewer and keep them from
                handling the big-picture items. If the tool is tuned to have a
                low false-positive rate then almost any advice it gives is, by
                definition, an important improvement to your codebase. Human
                reviewers will still be important, both in catching anything
                that slips through, and with the big-picture code review tasks.
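
                 To make the `except:` example concrete, the narrower rule
                 could look something like this (a rough sketch using
                 Python's ast module; the AI version would make the same
                 call from surrounding context rather than pure syntax):

                     import ast
                     import textwrap

                     SOURCE = '''
                     def commit(tx):
                         try:
                             tx.commit()
                         except:        # bare but re-raised: allowed
                             tx.rollback()
                             raise
                         try:
                             tx.close()
                         except:        # bare and swallowed: flag it
                             pass
                     '''

                     def flag_bare_excepts(source):
                         tree = ast.parse(textwrap.dedent(source))
                         for node in ast.walk(tree):
                             if not isinstance(node, ast.ExceptHandler):
                                 continue
                             if node.type is not None:  # not bare
                                 continue
                             last = node.body[-1]
                             reraised = (isinstance(last, ast.Raise)
                                         and last.exc is None)
                             if not reraised:           # flag it
                                 yield node.lineno

                     # only the second, swallowing handler gets flagged
                     print(list(flag_bare_excepts(SOURCE)))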
       
                stavros wrote 1 day ago:
                It would be simpler, but it wouldn't be as effective.
       
                  abenga wrote 1 day ago:
                  How so? The build fails if linter checks don't pass. What
                  does an ai add to this?
       
                    stavros wrote 1 day ago:
                    Much more nuanced linting rules.
       
                      fzeroracer wrote 23 hours 46 min ago:
                      What exactly is gained from these nuanced linting rules?
                      Why do they need to exist? What are they actually doing
                      in your codebase other than sucking up air?
       
                        stavros wrote 23 hours 39 min ago:
                        Are you arguing against linters? These disingenuous
                        arguments are just tiring, you either accept that
                        linters can be beneficial, and thus more nuanced rules
                        are a good thing, or you believe linters are generally
                        useless. This "LLM bad!" rhetoric is exhausting.
       
                          fzeroracer wrote 23 hours 28 min ago:
                          No. And speaking of disingenuous arguments, you've
                          made an especially absurd one here.
                          
                          Linters are useful. But you're arguing that you have
                          such complex rules that they cannot be performed by a
                          linter and thus must be offloaded to an LLM. I think
                          that's wrong, and it's a classical middle management
                          mistake.
                          
                          We can all agree that some rules are good. But that
                          does not mean more rules are good, nor does it mean
                          more complex rules are good. Not only are you
                          integrating a whole extra system to support these
                          linting rules, you are doing so for the sake of
                          adding even more complex linting rules that cannot be
                          automated in a way that prevents developer friction.
       
                            stavros wrote 23 hours 25 min ago:
                            If you think "this variable name isn't descriptive
                            enough" or "these comments don't explain the 'why'
                            well enough" is constructive feedback, then we
                            won't agree, and I'm very curious as to what your
                            PR reviews look like.
       
            jerrygoyal wrote 1 day ago:
            No one is blindly merging PR after AI Code Review. Think of review
            bot as an extra pair of eyes.
       
              bobnamob wrote 1 day ago:
              Yet.
              
              The things I've seen juniors try to do because "the AI said so"
              is staggering. Furthermore, I have little faith that the average
              junior gets competent enough teaching/mentoring that this
              attitude won't be a pervasive problem in 5ish years.
              
              Admittedly, maybe I'm getting old and this is just another "darn
              kids these days" rant
       
            dakshgupta wrote 1 day ago:
            I agree - at least with where technology is today, I would strongly
            discourage replacing human code review with AI.
            
            It does however serve as a good first pass, so by the time the
            human reviewer gets it, the little things have been addressed and
            they can focus on broader technical decisions.
            
            Would you want your coworkers to run their PRs through an AI
            reviewer, resolve those comments, then send it to you?
       
              AgentOrange1234 wrote 20 hours 59 min ago:
              This feels like the right take.
              
              Once we have an AI starting to do something, there’s an art to
              gradually adopting it in a way that makes sense.
              
              We don’t lose the ability to have humans review code. We
              don’t have to all use this on every PR. We don’t have to
              blindly accept every comment it makes.
       
            yorwba wrote 1 day ago:
            You probably shouldn't take a code review bot's "LGTM!" at face
            value, but if it tells you there is an issue with your code and it
            is indeed an issue with your code, it has successfully subbed in
            for your coworker who didn't get around to looking at it yet.
       
              imiric wrote 1 day ago:
              > if it tells you there is an issue with your code and it is
              indeed an issue with your code
              
              That's the crucial bit. If it ends up hallucinating issues as
              LLMs tend to do[1], then it just adds more work for human
              developers to confirm its report.
              
              Human devs can create false reports as well, but we're more
              likely to miss an issue than misunderstand and be confident about
              it.
              
              [1] 
              
   URI        [1]: https://www.theregister.com/2024/12/10/ai_slop_bug_repor...
       
            imoverclocked wrote 1 day ago:
            Many code review products are not AI. They are often collections of
            rules that have been specifically crafted to catch bad patterns.
            Some of the rules are stylistic in nature and those tend to turn
            people away the fastest IMHO.
            
            Many of these tools can be integrated into a local workflow so they
            will never ping you on a PR but many developers prefer to just
            write some code and let the review process suss things out.
       
          dakshgupta wrote 1 day ago:
          I haven’t explored Korbit but with CodeRabbit there are a couple of
          things:
          
          1. We are better at full codebase context, because of how we index
          the codebase like a graph and use graph search and an LLM to
          determine what other parts of the codebase should be taken into
          consideration while reviewing a diff.
          
          2. We offer an API you can use to build custom workflows, for example
          every time a test fails in your pipeline you can pass the output to
          Greptile and it will diagnose with full codebase context.
          
          3. Customers of ours that switched from CodeRabbit usually say
           Greptile has far fewer and less verbose comments. This is
           subjective, of course.
          
          That said, CodeRabbit is cheaper.
          
           Both products have free trials though, so I would recommend
          trying both and seeing which one your team prefers, which is
          ultimately what matters.
          
           I’m happy to answer more questions or do a demo for your team too ->
          daksh@greptile.com
       
        dcreater wrote 1 day ago:
        Are you still looking for slaves, err I mean employees, who are going
        to work 80+hrs a week? Do you sponsor visas?
       
        just-another-se wrote 1 day ago:
        And how do you guys deal with a cold start problem then? Suppose the
        repo is new and has very few previous comments?
       
          dakshgupta wrote 1 day ago:
          We do use some other techniques to filter out comments that are
          almost always considered useless by teams.
          
          For the typical team size that uses us (at least 20+ engineers) the
          number of downvotes gets high enough to show results within a workday
          or two, and achieves something of a stable state within a week.
       
          croemer wrote 1 day ago:
          Do pooling with the average embedding of all customers
       
        tayo42 wrote 1 day ago:
        > Attempt 2: LLM-as-a-judge
        
        Wouldnt this be achievable with a classifier model? Maybe even a combo
        of getting the embedding and then putting it through a classifier? Kind
         of like how GANs work.
        
        Edit: I read the article before the comment section, silly me lol
       
        Havoc wrote 1 day ago:
        Think it would initially have gone better had they not used „nits“
        but rather nitpicks. ie something that’s in the dictionary that the
        chatbot is likely to understand
       
          kgeist wrote 16 hours 32 min ago:
          They should have also added some basic chain-of-thought reasoning in
          the prompt + asked it to add [nitpick] tag at the end, so that the
          likelihood of adding it correctly increased due to in-context
          learning. Then another pass removes all the nitpicks and internal
          reasoning steps.
       
        callamdelaney wrote 1 day ago:
        Does it work on real engineers?
       
          aeve890 wrote 1 day ago:
          What do real engineers have to do with software? /jk
       
          defrost wrote 1 day ago:
          A quarter of real engineers are already Civil.
       
        nikolayasdf123 wrote 1 day ago:
        > $0.45/file capped at $50/dev/month
        
         Wow, this is really expensive... especially given that the core of
         this technology is open source and target customers can set it up
         themselves, self-hosted.
       
          MacsHeadroom wrote 1 day ago:
          A cap of less than 1% of an average developer's pay is "really
          expensive"?
       
            nikolayasdf123 wrote 1 day ago:
             Who is this average developer? Layoffs have been left and right
             for the last 4 years. Startups are getting shut down. Funding for
             gov contracts is getting cut. It is increasingly hard to find any
             job in software. Many guys I know are barely making it, and would
             rather spend that money on their children or living expenses.
             "Average developers" in, say, China, the Philippines, India,
             Vietnam, or CIS countries make far too little to un-frugally
             spend it on a $50 subscription, for something that may not even
             work well and may require a human anyway. Paying $0.45 "per file"
             is just ridiculous. This pricing immediately puts people off and
             is a non-starter.

             UPD: oh, you mean management will fire SWEs and replace them with
             this? Well, yeah, then it makes sense to them. But the quality has
             to be good. And even then, many mid-to-large-size orgs I know are
             cutting every subscription (particularly per-developer or per-box)
             they possibly can (e.g. Microsoft, Datadog, etc.), so even for
             them cost is of importance.
       
            Arainach wrote 1 day ago:
            For a glorified linter?  Absolutely.
            
            For reference, IntelliJ Ultimate - a full IDE with leading language
            support - costs that much.
       
        jumploops wrote 1 day ago:
        We do something internally[0] but specifically for security concerns.
        
         We’ve found that by having the LLM provide a “severity” level
         (simply low, medium, or high), we’re able to filter out all the
         nitpicky feedback.
        
        It’s important to note that this severity level should be specified
        at the end of the LLM’s response, not the beginning or middle.
        
        There’s still an issue of context, where the LLM will provide a false
        positive due to unseen aspects of the larger system (e.g. make sure to
        sanitize X input).
        
        We haven’t found the bot to be overbearing, but mostly because we
        auto-delete past comments when changes are pushed.
        
        [0]
        
   URI  [1]: https://magicloops.dev/loop/3f3781f3-f987-4672-8500-bacbeefca6...
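
         A minimal sketch of that kind of filter (hypothetical parsing; the
         key detail is that the severity tag comes at the end of each
         comment, as noted above):

             import re

             SEV = re.compile(r"severity:\s*(low|medium|high)\s*$", re.I)
             ORDER = {"low": 0, "medium": 1, "high": 2}

             def keep(comment, min_severity="medium"):
                 """Drop low-severity (nitpicky) feedback, keep the rest."""
                 m = SEV.search(comment)
                 sev = m.group(1).lower() if m else "high"  # fail open
                 return ORDER[sev] >= ORDER[min_severity]

             comments = [
                 "Consider renaming `tmp` to `buffer`. severity: low",
                 "SQL is built by string interpolation. severity: high",
             ]
             print([c for c in comments if keep(c)])  # keeps the SQL one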
       
          dakshgupta wrote 1 day ago:
          The severity needing to be at the end was an important insight. It
          made the results much better but not quite good enough.
          
          We had it output a json with fields {comment: string, severity:
          string} in that order.
       
            Merik wrote 1 day ago:
            Another variation on this is to think about tokens and definitions.
            Numbers don’t have inherent meaning for your use case, so if you
            use numbers you need to provide an explicit definition of each
            rating number in the prompt. Similarly, and more effectively is to
            use labels such as low-quality, medium-quality, high-quality, and
            again providing an explicit definition of the label; one step
            further is to use explicit self describing label (along with
            detailed definition) such as
            “trivial-observation-on-naming-convention” or
            “insightful-identification-on-missed-corner-case”.
            
            Effectively you are turning a somewhat arbitrary numeric
            “rating” task , into a multi label classification problem with
            well defined labels.
            
            The natural evolution is to then train a BERT based classifier or
            similar on the set of labels and comments, which will get you a
            model judge that is super fast and can achieve good accuracy.
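
             A rough sketch of what such a labeled-judge prompt could look
             like (hypothetical label set and definitions):

                 LABELS = {
                     "trivial-observation-on-naming-convention":
                         "Purely stylistic remark about identifiers.",
                     "insightful-identification-on-missed-corner-case":
                         "Points out an input or state left unhandled.",
                 }

                 def judge_prompt(review_comment):
                     defs = "\n".join(f"- {k}: {v}"
                                      for k, v in LABELS.items())
                     return ("Classify the code review comment below "
                             "with exactly one of these labels.\n"
                             f"{defs}\n\n"
                             f"Comment: {review_comment}\n"
                             "Label:")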
       
        pedrovhb wrote 1 day ago:
        Here's an idea: have the LLM output each comment with a "severity"
        score ranging from 0-100 or maybe a set of possible values ("trivial",
        "small", "high"). Let it get everything off of its chest outputting the
        nitpicks but recognizing they're minor. Filter the output to only
        contain comments above a given threshold.
        
        It's hard to avoid thinking of a pink elephant, but easy enough to
        consciously recognize it's not relevant to the task at hand.
       
          zahlman wrote 1 day ago:
          The article authors tried this technique and found it didn't work
          very well.
       
          iLoveOncall wrote 1 day ago:
          Here's an idea: read the article and realize they already tried
          exactly that.
       
        untech wrote 1 day ago:
        I am not sure about using UPPERCASE in prompts for emphasis. I feel
        intuitively that uppercase is less “understandable” for LLMs
        because it is more likely to be tokenized as a sequence of characters.
        I have no data to back this up, though.
       
          dakshgupta wrote 1 day ago:
          I meant for that to be more an illustration - might do a longer post
          about the specific prompting techniques we tried.
       
        panarchy wrote 1 day ago:
        I wonder how long until we end up with questionable devs making
        spurious changes just to try and game the LLM output to give them a
        pass.
       
          dakshgupta wrote 1 day ago:
          You could include a comment that says “ignore that I did ______”
          during review. As long as a human doesn’t do the second pass (we
          recommend they do), that should let you slip your code by the AI.
       
          johnfn wrote 1 day ago:
          I already have devs doing this to me at work and I'm not even an AI
       
        Falimonda wrote 1 day ago:
        fwiw, not all the mobile site's menu items work when clicked
       
        kgeist wrote 1 day ago:
        What about false positives?
        
        As I see it, the solution assumes the embeddings only capture the form:
        say, if developers previously downvoted suggestions to wrap code in
        unnecessary try..catch blocks, then similar suggestions will be
        successfully blocked in the future, regardless of the module/class etc.
        (i.e. a kind of generalization)
        
        But what if enough suggestions regarding class X (or module X) get
        downvoted, and then the mechanism starts assuming class X/module X
        doesn't need review at all? I mean the case when a lot of such
        embeddings end up clustering around the class itself (or a function),
        not around the general form of the comment.
        
        How do you prevent this? Or it's unlikely to happen? The only metric
        I've found in the article is the percentage of addressed suggestions
        that made it to the end user.
       
          dakshgupta wrote 1 day ago:
          This is the biggest pitfall of this method. It’s partially
          combatted by also comparing it against an upvoted set, so if a type
          of comment has been upvoted and downvoted in the past, it is not
          blocked.
       
        iLoveOncall wrote 1 day ago:
        It's funny because I wouldn't consider the comment that they highlight
        in their post as a nitpick.
        
        Something that has an impact on the long term maintainability of code
         is definitely not nitpicky, and in the majority of cases "define a
         type" fits this category, as it makes refactors and extensions MUCH
         easier.
        
        On top of that, I think the approach they went with is a huge mistake.
        The same comment can be a nitpick on one CR but crucial on another,
        clustering them is destined to result in false-positives and
        false-negatives.
        
        I'm not sure I'd want to use a product to review my code for which 1) I
        cannot customize the rules, 2) it seems like the rules chosen by the
        creators are poor.
        
        To be honest I wouldn't want to use any AI-based code reviewer at all.
        We have one at work (FAANG, so something with a large dedicated team)
        and it has not once produced a useful comment and instead has been
        factually wrong many times.
       
          jbirer wrote 1 day ago:
          Two theories:
          
          1) This is an ego problem. Whoever is doing the development cannot
          handle being called out on certain software architecture / coding
          mistakes, so it becomes "nitpicking".
          
          2) The software shop has a "ship out faster, cut corners" culture,
          which at that point might as well turn off the AI review bot.
       
            trash_cat wrote 17 hours 44 min ago:
            This is a company culture problem.
       
            AgentOrange1234 wrote 21 hours 13 min ago:
            1 is super interesting.
            
            I’ve found it’s nice to talk to an LLM about personal issues
            because I know it’s not a real person judging me. Maybe if the
            comments were kept private with the dev, it’d be more just a
            coaching tool that didn’t feel like a criticism?
       
              Comfy-Tinwork wrote 18 hours 29 min ago:
              Private with the Dev, the operator & everyone who uses any
              product of the training data they build.
       
          dakshgupta wrote 1 day ago:
          This is an important point - there is no universal understanding of
           nitpickiness. It is why we have it learn every new customer's ways
           from scratch.
       
            mannykannot wrote 23 hours 1 min ago:
            This does not address the issue raised in iLoveOncall's third
            paragraph: "the same comment can be a nitpick on one CR but crucial
            on another..." In "attempt 2", you say that "the LLMs judgment of
            its own output was nearly random", which raises questions that go
            well beyond just nitpicking, up to that of whether the current
            state of the art in LLM code review is fit for much more than
            ticking the box that says "yes, we are doing code review."
       
              lupire wrote 16 hours 40 min ago:
              If you are using an LLM for judgment, you are using it wrong. 
              An LLM is good for generating suggestions, brainstorming, not
              making judgments.
              
              That's why it is called Generative AI.
       
                mannykannot wrote 16 hours 7 min ago:
                Indeed - and if you are doing code review without judgement,
                you are doing it wrong.
       
        throw310822 wrote 1 day ago:
        Seems easier than getting the same from my colleagues.
       
          mr_toad wrote 1 day ago:
          I wonder if it was trained on a lot of nitpicking comments from human
          reviews.
       
        dbetteridge wrote 1 day ago:
        Prompt:
        
         If the comment could be omitted without affecting the code's
         functionality, but is stylistic or otherwise can be ignored, then
         preface the comment with
        
        NITPICK
        
        I'm guessing you've tried something like the above and then filtering
        for the preface, as you mentioned the llm being bad at understanding
        what is and isn't important.
       
          seb1204 wrote 1 day ago:
          My thinking too. Or similarly including some coding style and
          commenting style rules in the repository so they become part of the
          code base and get considered by the LLM
       
          fnqi8ckfek wrote 1 day ago:
          My boss can't stop nitpicking, and I've started telling him that "if
          your comment doesn't affect the output I'm ignoring it".
       
            marcandre wrote 1 day ago:
             That is such an over-simplistic criterion!
             Reductio ad absurdum: pass your code through an obfuscator /
             simplifier and the output won't be affected.
       
              fnqi8ckfek wrote 19 hours 14 min ago:
              Ok?
              
               So if the boss's nitpick is "you must pass your code through an
               obfuscator that doesn't affect the output", I'm going to ignore
              it. I don't understand what point you're trying to make.
       
          dakshgupta wrote 1 day ago:
          We tried this too, better but not good enough. It also often labeled
          critical issues as nitpicks, which is unacceptable in our context.
       
            firesteelrain wrote 1 day ago:
            Then some of the nitpicks were actually valid?
            
            What about PossiblyWrong, then?
       
          abecedarius wrote 1 day ago:
          Nitpick: I'd ask for NITPICK at the end of output instead of the
          start. The model should be in a better place to make that decision
          there.
       
            ivanjermakov wrote 1 day ago:
            I find it ironic how capable LLMs have become, but they're still
            struggling with things like this. Great reminder that it's still
            text prediction at its core.
       
              Eisenstein wrote 1 day ago:
              I find it more helpful to keep in mind the autoregressive nature
              rather than the prediction. 'Text prediction' brings to mind that
              it is guessing what word would follow another word, but it is
              doing so much more than that. 'Autoregressive' brings to mind
              that it is using its previously generated output to create new
              output every time it comes up with another token. In that case
              you immediately understand that it would have to make the
              determination of severity after it has generated the description
              of the issue.
       
              spott wrote 1 day ago:
              I mean, think of LLM output as unfiltered thinking.  If you were
              to make that determination would you make it before you had
              thought it through?
       
        XenophileJKO wrote 1 day ago:
        Unless they were using a model too stupid for this operation, the
        fundamental problem is usually solvable via prompting.
        
        Usually the issue is that the models have a bias for action, so you
         need to give it an acceptable action when there isn't a good comment.
        Some other output/determination.
        
        I've seen this in many other similar applications.
       
          NBJack wrote 20 hours 20 min ago:
          Prompting only works if the training corpus used for the LLM has
          critical mass on the concept. I agree that the model should have
          alternatives (i.e. preface the comment with the tag 'Nit:'), but bear
          in mind code review nit-picking isn't exactly a well-researched area
          of study. Also to your point, given this was also likely trained
          predominantly on code, there's probably a lack of comment sentiment
          in the corpus.
       
          keepingscore wrote 1 day ago:
           This isn't my post, but I work with LLMs every day and I've seen
           this kind of instruction-ignoring behavior on Sonnet when the
           context window starts getting close to the edge.
       
            dakshgupta wrote 1 day ago:
            This might be what we experienced. We regularly have context reach
            30k+ tokens.
       
            XenophileJKO wrote 1 day ago:
            I don't know, in practice there are so many potential causes that
            you have to look case by case in situations like that. I don't have
            a ton of experience with the raw Claude model specifically, but
            would anticipate you'll have the same problem classes.
            
            Usually it comes down to one of the following:
            
            - ambiguity and semantics (I once had a significant behavior
            difference between "suggest" and "recommend", i.e. a model can
            suggest without recommending.)
            
            - conflicting instructions
            
            - data/instruction bleeding (delimiters help, but if the span is
             too long it can lose track of what is data and what is
            instructions.)
            
            - action bias (If the task is to find code comments for example,
            even if you tell it not to, it will have a bias to do it as you
            defined the task that way.)
            
            - exceeding attention capacity (having to pay attention to too much
             or having too many instructions. This is where structured output or
            chain of thought type approaches help. They help focus attention on
            each step of the process and the related rules.)
            
            I feel like these are the ones you encounter the most.
       
              dakshgupta wrote 1 day ago:
               One-word changes impacting output are interesting but also
               quite frustrating, especially because the patterns don’t
               translate across models.
       
                gopher_space wrote 1 day ago:
                The "write like the people who wrote the info you want" pattern
                absolutely translates across models.
       
                  sdesol wrote 1 day ago:
                  Yes and no. I've found the order in which you give
                  instructions matters for some models as well. With LLMs, you
                  really need to treat them like black boxes and you cannot
                  assume one prompt will work for all. It is honestly, in my
                  experience, a lot of trial and error.
       
          marviel wrote 1 day ago:
          +1 for "No Comment/Action Required" responses reducing
          trigger-happiness
       
        Nullabillity wrote 1 day ago:
        [flagged]
       
          dang wrote 1 day ago:
          "Please don't post shallow dismissals, especially of other people's
          work. A good critical comment teaches us something."
          
   URI    [1]: https://news.ycombinator.com/newsguidelines.html
       
          dakshgupta wrote 1 day ago:
          I would say they are useful but they aren’t magic (at least yet)
          and building useful applications on top of them requires some work.
       
            WesolyKubeczek wrote 1 day ago:
            Not only do LLMs require some work to build anything useful, they
            are also fuzzy and nondeterministic, requiring you to tweak your
            prompting tricks once in a while, and they are also quite
            expensive.
            
            Nothing but advantages.
       
        pama wrote 1 day ago:
        After failing with three reasonable ideas, they solved the problem with
        an idea that previously would have seemed unlikely to work (get
        similarity to past known nits and use a threshold of 3 similar nits
        above a certain cutoff similarity score for filtering). A lot of
        applications of ML have a similar hacky trial and error flavor like
        that. Intuition is often built in retrospect and may not transfer well
        to other domains. I would have guessed that finetuning would also work,
        but agree with the author that it would be more expensive and less
        portable across different models.
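
         As I read it, the final filter amounts to roughly this (a sketch;
         the cutoff value here is made up, the 3-similar-nits threshold is
         as described above):

             import numpy as np

             def looks_like_past_nit(comment_emb, downvoted_embs,
                                     cutoff=0.85, min_matches=3):
                 """Block a generated comment if it is highly similar to
                 at least `min_matches` previously downvoted comments."""
                 a = comment_emb / np.linalg.norm(comment_emb)
                 b = downvoted_embs / np.linalg.norm(
                     downvoted_embs, axis=1, keepdims=True)
                 sims = b @ a              # cosine similarity to each nit
                 return int((sims >= cutoff).sum()) >= min_matches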
       
          olddustytrail wrote 1 day ago:
          I don't think the prompt was reasonable. Nits are the eggs of
          headlice. Expecting the LLM to guess they meant superficial issues,
          rather than just spelling that out, is a bit weird.
          
          It's like saying "don't comment on elephants" when you mean don't
          comment on the obvious stuff.
       
            tayo42 wrote 1 day ago:
            Do llms struggle with other homonyms?
       
              Eisenstein wrote 1 day ago:
              They can, but I think the parent is saying that when you are
              crafting an instruction for someone or something to follow, you
              should be as direct and as clear as possible in order to remove
              the need for the actor to have to intuit a decision instead of
              act on certainty.
              
              I would start with something like: 'Avoid writing comments which
              only concern stylistic issues or which only contain criticisms
              considered pedantic or trivial.'
       
            whitfin wrote 1 day ago:
            Yeah, even the title here says "nitpicky". I'm not sure if they
            tried "nitpicks" instead of "nits", but I don't know why you
            wouldn't...
       
          dakshgupta wrote 1 day ago:
          Since we posted this, two camps of people reached out:
          
          Classical ML people who recommended we try training a classifier,
          possibly on the embeddings.
          
          Fine tuning platforms that recommended we try their platform.
          
          The challenge there would be gathering enough data per customer to
          meaningfully capture their definition and standard for nit-pickiness.
       
            pama wrote 1 day ago:
            What you use now is a simple KNN classifier, and if it works well
            enough, perhaps no need to go much further. If you need to squeeze
            out a couple additional percentage points maybe try a different
            simple and robust ML classifier (random forest, xgboost, or a
            simple two layer network). All these methods, including your
            current classifier, will get better with additional data and minor
            tuning in the future.
       
              dakshgupta wrote 1 day ago:
              Thank you, I will try this. I suspect we can extract some
              universal theory of nits and have a base filter to start with,
              and have it learn per-company preferences on top of that.
       
                keepingscore wrote 1 day ago:
                You should be able to do that already by taking all of your
                 customers' nit embeddings and averaging them to produce a point
                in space that represents the universal nit. Embeddings are
                really cool and the fact that they still work when averaging is
                one of their cool properties.
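
                 Something like this, for what it's worth (a sketch; it
                 assumes each customer's downvoted-comment embeddings are
                 already computed):

                     import numpy as np

                     def universal_nit(per_customer_nit_embs):
                         # stack every customer's nit embeddings and
                         # average them into a single point in space
                         v = np.vstack(per_customer_nit_embs).mean(axis=0)
                         return v / np.linalg.norm(v)  # unit vector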
       
                  dakshgupta wrote 1 day ago:
                  This is a cool idea - I’ll try this and add it as an
                  appendix to this post.
       
       
   DIR <- back to front page