_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   How we decreased GitLab repo backup times from 48 hours to 41 minutes
       
       
        hinkley wrote 4 hours 49 min ago:
        This feels like some of the problems I’ve ended up tackling. The
        people too close to the problem don’t see it so someone has to reach
        across the aisle and make something happen in code they normally
        wouldn’t touch.
        
        Even with the quadratic complexity I suspect this problem has been
        brewing for a while. This commit has been biding its time for 16 years
        waiting to ruin someone’s day (let’s not confuse the utility of
        Make it Work early in a project with justifying retaining that code in
        perpetuity. Make it Right, Make it Fast.) Backups have probably been
        going over six hours for years and steadily climbing.
        
        “We” feels a lot like one or two people jumping on a grenade.
       
        fHr wrote 5 hours 38 min ago:
        absolute chads, nice!
       
        jas39 wrote 13 hours 13 min ago:
        What a poor write-up, neither defining the problem nor the solution
         with any clarity. Did anyone come away with any information that can
        be used for anything?
       
        SkiFire13 wrote 14 hours 42 min ago:
        > Ultimately, we traced the issue to a 15-year-old Git function with
        O(N²) complexity and fixed it with an algorithmic change, reducing
        backup times exponentially.
        
        I have yet to finish the article, but this means they improved the
         complexity to something like O(log N), right? I hate when people
         mistake quadratic improvements for exponential ones.
       
          globular-toast wrote 14 hours 37 min ago:
          Really annoying to see someone use exponential wrong when they're
          talking about performance of algorithms. We're supposed to know what
          this means! They went from quadratic to linear.
       
        ianks wrote 16 hours 8 min ago:
        Reminds me of the timeless advice I got for acing leet code interviews:
        “remember to use a hash map.” Tends to work out pretty well.
       
        DeepYogurt wrote 16 hours 53 min ago:
        I miss a good tech blog post. Cheers for this
       
        KETpXDDzR wrote 19 hours 56 min ago:
         With git it should be straightforward to implement an incremental,
         deduplicating backup solution, since all objects are stored with
         their hashes as filenames.
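         
         A rough sketch of that idea in Python (illustrative only; it
         assumes loose objects under .git/objects and ignores packfiles,
         which a real incremental backup would also have to handle):
         
           import os
           import shutil
           
           def incremental_object_backup(objects_dir, backup_dir):
               # Object files are named by their content hash, so a file
               # that already exists at the same relative path in the
               # backup is by definition the same object.
               for root, _dirs, files in os.walk(objects_dir):
                   rel = os.path.relpath(root, objects_dir)
                   for name in files:
                       dst_dir = os.path.join(backup_dir, rel)
                       dst = os.path.join(dst_dir, name)
                       if not os.path.exists(dst):  # skip known objects
                           os.makedirs(dst_dir, exist_ok=True)
                           shutil.copy2(os.path.join(root, name), dst)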
       
          mdaniel wrote 3 hours 37 min ago:
          Run `man git-repack` or its `man git-gc` friend and recall that most
          filesystems hate dealing with a bazillion small files
          
          I think there have been several attempts to use S3-ish blobstores as
          git backends but of the two major git hosters, only one of them is
          MIT licensed and they don't attempt that stunt, so safe to assume
          it's not a viable approach
       
        katella wrote 21 hours 40 min ago:
        I don't know much about git backups, but why would there be race
         conditions if you just create the backup off the local repo instead
         of the remote?
       
        tcdent wrote 22 hours 29 min ago:
         TLDR: if you only add objects to an array that are not already
         contained in said array, you won't have to iterate back through it
         to remove the duplicates you created.
         
         wild that a pattern like this would be part of git core, but I
         guess we all overlook stuff on a regular basis
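         
         A minimal Python sketch of the difference (purely illustrative;
         the actual fix is in C and uses git's own string-set type, not
         this code):
         
           # Quadratic: membership test scans the output list itself.
           def dedupe_scan(refs):
               out = []
               for ref in refs:
                   if ref not in out:        # O(len(out)) per insert
                       out.append(ref)
               return out
           
           # Linear on average: track what was seen in a hash set.
           def dedupe_set(refs):
               seen, out = set(), []
               for ref in refs:
                   if ref not in seen:       # O(1) average lookup
                       seen.add(ref)
                       out.append(ref)
               return out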
       
        iamcreasy wrote 22 hours 48 min ago:
        Cool! Please post another flame graph after applying the patch.
       
        niffydroid wrote 1 day ago:
        Why use this method over just having a script do a clone or pull
         elsewhere? Git is about distribution, right?
       
        kristianp wrote 1 day ago:
        It would have been nice if the post included some details, such as
        before and after code.
       
        sneak wrote 1 day ago:
        …and they say leetcode-style problems are out of date.
        
        While perhaps implementing low level sorts is silly, knowing which data
         structures to use and when, and what underlying big-O performance they
        have, is critical, as demonstrated here.
       
        einpoklum wrote 1 day ago:
         I had a somewhat similar experience when writing a "remove duplicates"
         extension for Thunderbird: [1] I used a hash to begin with, applying a
         simplistic digest function to the message headers I was comparing,
         which gave me a 4-byte hash key.
        
        That worked, but was kind of slow.
        
        Finally, the idea came to me to not apply _any_ digesting, and use the
        combined concatenated headers of the messages as hash keys. About 2k
        bytes per hash key!
        
        The result: About 20x perf improvement if memory serves.
        
         How is that possible? The reason is that the code all runs in a
         JavaScript machine, and applying the digest was not a built-in
         function; it meant looping over the headers and doing the arithmetic.
         Thousands upon thousands of JS abstract machine steps. The use of the
         large hash key may be inefficient, but it's just one JS object /
         dictionary operation, and one of the most heavily optimized in any
         implementation.
        
   URI  [1]: https://addons.thunderbird.net/en-US/thunderbird/addon/removed...
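        
         The same trick as a Python sketch (the original was JavaScript
         inside Thunderbird; the names and the toy digest below are made
         up for illustration):
         
           # Slow variant: a hand-rolled digest loop run by the interpreter.
           def digest_key(headers):
               h = 0
               for ch in "".join(headers):
                   h = (h * 31 + ord(ch)) & 0xFFFFFFFF
               return h
           
           # Fast variant: use the concatenated headers directly as the
           # key and let the built-in dict hash them natively.
           def raw_key(headers):
               return "\x00".join(headers)
           
           def find_duplicates(messages, key_fn):
               seen, dups = {}, []
               for msg in messages:
                   k = key_fn(msg["headers"])   # headers: list of strings
                   if k in seen:
                       dups.append((seen[k], msg))
                   else:
                       seen[k] = msg
               return dups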
       
        bob1029 wrote 1 day ago:
        I'm confused why you wouldn't simply snapshot the block-level device if
        the protocol of the information on top is going to cause this much
        headache. Quiescing git operations for block level activity is probably
        not trivial, but it sounds like an easier problem to solve to me.
        
        This is the approach I've taken with SQLite in production environments.
        Turn on WAL and the problem gets even easier to solve. Customer
        configures the VM for snapshots every X minutes. Git presumably doesn't
        have something approximating a WAL, so I understand the hesitation with
        this path. But, I still think the overall strategy is much more viable
        and robust to weird edges within git.
       
          lbotos wrote 1 day ago:
          > Git presumably doesn't have something approximating a WAL, so I
          understand the hesitation with this path.
          
          Bingo. One of the worst problems is helping a client piece back
          together a corrupted repo when they are using snapshots. Check my
          profile to see how I know. :)
          
          It's usually an OMG down scenario, and then you are adding in the "oh
          no, now the restore is corrupted."
          
          It's fixable but it's definitely annoying.
       
          amtamt wrote 1 day ago:
          > This is the approach I've taken with SQLite in production
          environments. Turn on WAL and the problem gets even easier to solve.
          
          A few months back a better solution was provided by SQLite:
          sqlite3_rsync
          
   URI    [1]: https://www.sqlite.org/rsync.html
       
          nonameiguess wrote 1 day ago:
          Gitlab isn't just a managed service. They release the software for
          self-hosted instances as well. There is no guarantee or requirement
          that users all run Gitlab on the same filesystem or even a filesystem
          that supports block level snapshots at all. Presumably, they want a
          universal backup system that works for all Gitlabs.
       
            bob1029 wrote 1 day ago:
            I've never heard of a .git folder that spanned multiple
            filesystems. It sounds like we are now conflating the git workspace
            with everything else in the product.
            
            There are system requirements that a customer would be expected to
            adhere to if they wanted a valid enterprise support contract with
            one of these vendors.
       
              nonameiguess wrote 1 day ago:
              I can't reply to the other reply for some reason, but what they
              said is indeed what I meant. gitlab.com might be running their
              Gitaly servers on btrfs or zfs or lvm volumes or whatever, but
              other customers may be using ext2. Gitlab the company could
              require customers to only run Gitaly on a specific filesystem,
              but up to now, they never have, it would be pretty shitty to
              suddenly change their minds after a decade plus of establishing
              one expectation, and whoever the developer is who submitted the
              patch to upstream Git and got a technical blog post out of it has
              absolutely no power to dictate contract terms to enterprise
              customers personally and instead did what is actually in their
              power to do.
       
              to11mtm wrote 1 day ago:
              I think GP's point is that the filesystems used by someone
              self-hosting gitlab may not be the same as what gitlab themselves
              are using.
              
              File systems can be weird. Sometimes the OS can be weird and
              fsync type calls may not do what you expect. At least at one
              point MacOS fsync didn't behave the same way as Linux (i.e. Linux
              should ensure the write is truly done and not just in cache so
              long as the drive isn't lying). [0]
              
              > There are system requirements that a customer would be expected
              to adhere to if they wanted a valid enterprise support contract
              with one of these vendors.
              
              Gitlab has a community edition. Not handling data well would be
              bad for their public image.
              
              [0] -
              
   URI        [1]: https://news.ycombinator.com/item?id=30372218
       
        SoftTalker wrote 1 day ago:
        Cool discovery but the article could have been about 1/10 as long and
        still communicated effectively. At least they didn't post it as a
        video, so it was easy to skim to the important details.
       
          m104 wrote 2 hours 15 min ago:
          It's the "impact" style of technical write-ups: sell the problem and
           the scale, then present the solution, which is then framed and
           understood through the lens of business and customer success.
          
          Generously, this writing style is supposed to show the business value
          of teams and individuals, for promotions or other recognition. But
          yeah, it can be frustrating to read this style.
       
          hliyan wrote 7 hours 9 min ago:
          Interesting that multiple people are noticing this same thing. For
          me, this could have been:
          
          "We found that a large portion of the 48 hours taken to backup our
          rails respository was due to a function in `git bundle create` that
          checks for duplicate references entered as command line arguments.
          The check contained a nested for loop ( O(N2) ), which we replaced
          with a map data structure in an upstream fix to Git. The patch was
          accepted, but we also backported fix without waiting for the next Git
          version. With this change, backup time dropped to 41 minutes."
       
          davidron wrote 21 hours 38 min ago:
           For those that haven't read the article yet, scroll down to the flame
           graph and start reading until it starts talking about backporting the
           fix. Then stop.
       
            wgjordan wrote 19 hours 16 min ago:
            "How we decreased reading 'How we decreased GitLab repo backup
            times from 48 hours to 41 minutes' times from 4.8 minutes to 41
            seconds"
       
          1oooqooq wrote 1 day ago:
          it could have been longer. I still don't know why  they were doing
          backup bundles with two refs :)
       
            6LLvveMx2koXfwn wrote 18 hours 37 min ago:
             They weren't. If you look at the fix [1], the dedupe loop was run
             in all cases, not just those with known dupes, so the performance
             hit was on any bundle with lots of refs.
            
            1.
            
   URI      [1]: https://github.com/git/git/commit/bb74c0abbc31da35be529995...
       
              rjmunro wrote 3 hours 38 min ago:
              But why couldn't they just dedupe the refs from the command line
              before starting the actual bundling - surely there are never more
              than a couple of hundred of those (one per branch)?
       
          kokada wrote 1 day ago:
             Yes. I read the whole article thinking that this must have been
             generated by an LLM, because the style at least resembles it.
       
            djdeutschebahn wrote 7 hours 22 min ago:
             I thought exactly the same. The reading experience of the post
             would definitely have been improved with less text.
       
            bearjaws wrote 8 hours 34 min ago:
            Don't take my bullet points away from me
       
              jorvi wrote 8 hours 4 min ago:
              They came for the em dashes, and I did not speak up. Then they
              came for the bullet points, and I did not speak up..
       
            jaygreco wrote 17 hours 15 min ago:
            Glad I wasn’t the only one who thought this. The post is also
            missing one obvious thing that I expect in any technical post: code
            snippets. Let me see the code.
            
            ChatGPT has ruined bullet points for the rest of us…
            
             No offense, but writing this blog post couldn’t take more than a
             few minutes. Why spoil it with an LLM? Shoot, use one to check
             grammar and recommend edits, even.
       
            Scaevolus wrote 18 hours 36 min ago:
            Em dashes and bullet points!
       
            ruuda wrote 23 hours 35 min ago:
            That was also my thought.
       
        prymitive wrote 1 day ago:
         Well done. Next, maybe make the web UI usable, because right now
         there’s absolutely no distinction between the UI itself and the user
         content, which IMHO, combined with action buttons in the middle of
         the page, makes for a really poor UX.
       
        divbzero wrote 1 day ago:
        There seems to be a lesson here about the balance between premature vs.
        anticipatory optimization. We’re generally warned against premature
        optimization but perhaps, as a rule of thumb, we should look for
        optimizations in frequently-called functions that are obvious and not
        onerous to implement.
       
          Quekid5 wrote 1 day ago:
           If a set-of-strings had been trivially available in the source
           language (at the time of implementation), the original programmer
           would probably have made this (relatively) trivial optimization...
           This is a symptom
          of anemic languages like C.
       
        ashishb wrote 1 day ago:
         O(n^2) is fast enough to end up in production and slow enough to cause
        problems at scale.
        
         The most troublesome cases of inefficiency in production are almost
         always O(n^2).
       
        pjmlp wrote 1 day ago:
         A very good example that writing code in C doesn't help performance
         when the algorithms or data structures aren't properly taken into
         consideration.
       
          IshKebab wrote 1 day ago:
          I would say C makes this sort of thing far more likely because it's
          usually a ton of effort to obtain suitable containers. In C++ or Rust
          they have plenty of things like `unordered_set`/`HashSet` built in,
          so people are much more likely to use it and not go "eh, I'll use a
          for loop".
          
          In this case Git already had a string set, but it's still not
          standard so there's a good chance the original author just didn't
          know about it.
       
            morepedantic wrote 15 hours 0 min ago:
            I've seen this exact problem in C code many times in my life,
            especially in kernel space where data structures and memory
            allocations are fun.
            
            Ironically, this is much _faster_ for small sets. Sometimes the
            error is intentional, because the programmer believes that all
            inputs will be small. IME, those programmers were wrong, but that's
            the inverse of survival bias.
       
              masklinn wrote 10 hours 44 min ago:
              And in fairness I can understand not considering someone might
              eventually be bundling a repository with tens of thousands of
              refs back in early 2009.
       
            masklinn wrote 1 day ago:
            > In this case Git already had a string set, but it's still not
            standard so there's a good chance the original author just didn't
            know about it.
            
            The original commit was made in January 2009 ( [1] ), strmap was
            added in November 2020 ( [2] , strset was added a few days later:
            [3] ). It was first proposed in 2018 ( [4] the proposal
            specifically mentions it fixing possibly quadratic sites).
            
            As noted in the comment, git did have a sorted string list with
            bisection search, and that's from 2008 (and it actually dates back
            to 2006 as the "path list" API, before it was renamed following the
            realisation that it was a generalised string list). Though as the
            hashmap proposal notes, it's a bit tricky because there's a single
            type with functions for sorted and functions for unsorted
             operations; you need to know whether your list is sorted or not
             independently of its type.
            
   URI      [1]: https://github.com/git/git/commit/b2a6d1c6868b6d5e7d2b4fa9...
   URI      [2]: https://github.com/git/git/commit/ae20bf1ad98bdc716879a8da...
   URI      [3]: https://github.com/git/git/commit/1201eb628ac753af57512584...
   URI      [4]: https://lore.kernel.org/git/20180906191203.GA26184@sigill....
       
            pjmlp wrote 1 day ago:
            Yeah, not something that WG14 will ever care about.
       
        divbzero wrote 1 day ago:
        The performance improvement that GitLab contributed to Git is slated to
        be released with v2.50.0:
        
   URI  [1]: https://github.com/git/git/commit/bb74c0abbc31da35be52999569ea...
       
          Nemo_bis wrote 40 min ago:
          Nice that GitLab is active upstream! Do I see correctly that the most
          active contributors to git are currently GitLab employees?
       
        Lammy wrote 1 day ago:
        > What this means for GitLab customers — [a bunch of stuff about how
        customers can now back up more frequently and more robustly]
        
        Realtalk: They should rewrite this post's headline to be in a positive
        tense instead of leading with a negative word. I'm glad I read the
        post, because it is a cool and good fix, but I saw “Decreasing […]
        repo backup” and my first thought was that it was an announcement of
        some service downgrade like some sort of cost-cutting measure.
       
          serial_dev wrote 1 day ago:
          I don’t think it’s unreasonable to expect interested people to
          read five words from the title. I know people don’t always do that,
          but complaining about it as if they did something wrong is
          ridiculous.
       
            robertlagrant wrote 1 day ago:
            No one was complaining.
       
        DamonHD wrote 1 day ago:
        Here comes an unpopular nitpick: "... we traced the issue to a
        15-year-old Git function with O(N²) complexity and fixed it with an
        algorithmic change, reducing backup times exponentially."
        
        Uh no you didn't.  Not possible.  At most a polynomial reduction is
        possible else complexity theory needs a re-write.
        
        (OK, yes, k could be doing some heavy lifting here, but I doubt it.)
        
        If you are going to quote a maths formula then please don't use
        "exponetially" to mean "lots".
        
        I stopped reading there: I don't want to have to read each word and
        wonder if they actually meant it, or it's just bad hype.
       
          bombela wrote 1 day ago:
          I can forgive exponential as a figure of speech. 
           You missed where they described using a map for the side effect of
           [key] deduplication.
          
          Instead of saying "set". The code itself uses the type "strset".
       
          immortaljoe wrote 1 day ago:
           OP here. Feedback is always welcome. I did mean exponentially in the
           colloquial sense. I do see how it is confusing here; will change it.
       
            serial_dev wrote 1 day ago:
            You can’t use exponentially in the colloquial sense in a post
            about software performance.
            
            If my barista says that his new coffee is exponentially better,
            it’s ok to use it colloquially.
            
            If my git hosting provider writes about an impactful performance
            improvement, it’s not.
       
            DamonHD wrote 1 day ago:
            Thank you.
            
            (I don't think that anyone should use "exponentially" that way: it
            is an art term with a specific and particular meaning, so find
            another word if you mean something else!  Like misusing specific
            legal or sporting terms...)
       
              taftster wrote 1 day ago:
              Which term is appropriate here? What would you suggest? (honest
              question)
       
                DamonHD wrote 14 hours 41 min ago:
                "hugely" or "a lot" or "to O(XXX)" whatever the new XXX
                complexity is.
       
              ctxc wrote 1 day ago:
              Art term?
       
                wizzwizz4 wrote 1 day ago:
                
                
   URI          [1]: https://en.wiktionary.org/wiki/term_of_art
       
                  ctxc wrote 1 day ago:
                  Ah, thanks. I could only think of art as in literal, artistic
                  art.
       
                    kccqzy wrote 1 day ago:
                    If you have that tendency, you just need to think of TAOCP,
                    this industry's best distillation of intellectual
                    achievement, with the word "art" in its name.
       
        Arcuru wrote 1 day ago:
        48 hours is a crazy amount of time to spend just to compress a git
        folder, it's only a couple GB. 41 minutes still seems like quite a long
        time.
        
        Why aren't they just snapshotting and archiving the full git repo? Does
        `git bundle` add something over frequent ZFS backups?
       
          gregorvand wrote 11 hours 18 min ago:
          15 years also seems like a long time
       
          broken_broken_ wrote 11 hours 42 min ago:
           Reading the article I thought exactly the same! I’d be curious to
          know how much time the same would take with zfs.
       
          AtlasBarfed wrote 1 day ago:
          ... so they added caching to things that should have been cached?
          
          ... is this really the way people "back up" git repos? I mean, it is
          git, so isn't there some way to mirror changes to the repo in another
          repo and just use ZFS / snapshots / backup software / etc to do that?
          It's a distributed version control system. Just make sure the version
          control information is ... distributed?
       
          unsnap_biceps wrote 1 day ago:
          zfs snapshots are difficult to offsite in non-zfs replicas, say like
          an S3 bucket.
          
           That said, there's another lesser-known feature that bundles help
           out with when used with `git clone --bundle-uri`. The client can
           specify a location for a bundle, or the server can send the client
           the bundle location in the clone results; the client can then fetch
           the bundle, unpack it, and update the delta via the git server. It's
           a lot lighter weight on the server for cloning large repos, and a
           ton faster for the client on initial clones.
       
            benlivengood wrote 1 day ago:
            I think if you want consistent snapshot backups on non-zfs
            destinations the safest thing is to clone the snapshot and rsync
            from the clone.  Not a single-step operation but preserves the
            atomicity of the snapshot.
            
            EDIT: you could also rsync from a .zfs snapshot directory if you
            have them enabled.
       
            nightfly wrote 1 day ago:
             ZFS can send to a file or whatever you want to pipe to; you can
             have incremental sends, and if you convert to bookmarks on the
             sender you don't have to keep the historical data after you send
             it.
       
          bombela wrote 1 day ago:
           > Be aware that even with these recommendations, syncing in this way
           has some risk since it bypasses Git’s normal integrity checking for
           repositories, so having backups is advised. You may also wish to do a
           git fsck to verify the integrity of your data on the destination
           system after syncing. [1]
           
           It doesn't tell you how to make a backup safely, though.
          
           On a personal scale, Syncthing and Btrfs snapshots work plenty well.
           They're as fast as the storage/network too.
          
   URI    [1]: https://git-scm.com/docs/gitfaq#_transfers
       
            ndriscoll wrote 10 hours 58 min ago:
            If filesystem snapshots weren't safe, wouldn't that also mean git
            is prone to corrupting your repo in the event of a power loss or
            crash? That seems like a bad bug.
       
            nightfly wrote 1 day ago:
            Syncthing is the only way I've ever corrupted a git repo before
       
              hobofan wrote 1 day ago:
               I think that's why they specified the "BTRFS snapshots" part.
               Yes, directly syncing a .git directory seems like a recipe for
               disaster given how often I've seen individual files lag behind
               in syncing, but I guess with BTRFS snapshots one can ensure that
               only a consistent view of a git directory is being backed up and
               synced.
       
                bombela wrote 1 day ago:
                 Nah, I truly do it the wrong way around: Syncthing on the git
                 repos. And one of my devices in the Syncthing cluster does
                 btrfs snapshots minutely for recovery and further backups.
                
                 Because it's at a personal scale, the only time I can corrupt a
                 git repo is if I work on the same repo (and its workdir) from
                 more than one device in the time it takes for Syncthing to
                 replicate the changes.
                
                But even then it's not a big deal because git fsck is quick.
                And I have my snapshots, and the syncthing versioning, and git
                defaults to two weeks before pruning. And because of how git
                works, using hash to identify contents, files are not easily
                overwritten either.
                
                 In 10y I only had one git corruption (I ran a command on the
                 same repo on a different machine via ssh, yielding a Syncthing
                 conflict). Syncthing kept copies of the conflicted file. One
                 commit disappeared from the history but not from the database.
                 It was easy to rebase the changes. I think I used git fsck to
                 delete the Syncthing-versioned files.
       
        LgWoodenBadger wrote 1 day ago:
        IME, it has always turned out to be the correct decision to eliminate
        any n^2 operation in anything I’ve written.
        
        I don’t write exotic algorithms, but it’s always astounding how
        small n needs to be to become observably problematic.
       
          solardev wrote 5 hours 34 min ago:
          For those of us without a formal comp sci background, is there a way
          for the IDE to detect and warn about these automatically? Or any
          other easy telltale signs to look for?
          
          As a self taught dev, when I encounter nested loops, I have to
          mentally go through them and try to see if they iterate through each
          item more than once. But that's not a very foolproof method.
       
            chacha102 wrote 4 hours 44 min ago:
            Too much domain knowledge for an IDE to catch. I'm self taught as
            well, and it comes down to spending more time thinking about the
            code than writing the code.
            
             It's a fairly simple thought experiment to ask yourself: what if
             there were 10x the items in this array? 100x? That is essentially
             what big-O notation is trying to quantify. You just don't need to
             do it that formally.
       
          esprehn wrote 20 hours 14 min ago:
          Modern computers are pretty great at scanning small blocks of memory
          repeatedly, so n^2 can be faster than the alternative using a map in
           cases with small N.
          
          I spent a lot of time fixing n^2 in blink, but there were some fun
          surprises: [1] For large N without a cache :nth-child matching would
          be very slow doing n^2 scans of the siblings to compute the index. On
           the other hand, for small sibling counts it turned out the cache
           overhead was noticeably worse than just doing an n^2. (See the end of
          the linked comment).
          
          This turns out to be true in a lot of surprising places, both where
          linear search beats constant time maps, and where  n^2 is better than
          fancy algorithms to compensate.
          
          Memory latency and instruction timing is the gotcha of many fun
          algorithms in the real world.
          
   URI    [1]: https://source.chromium.org/chromium/chromium/src/+/main:thi...
       
            hinkley wrote 4 hours 19 min ago:
            A lot of computations are really higher complexity order functions
            with stair steps at certain intervals based on hardware trying to
            pretend they are constant time. All operations cost the same amount
            until n doubles again and then it’s slower. If you zoom out
            toward infinity, the stair steps smooth out into a logarithmic
            curve. It becomes logarithmic in the former case and square root in
            the latter. Even dividing two numbers or doing a memory address
            lookup stops being C, which is part of why prime factoring worked
            for RSA for so long.
            
            If anyone had made clockless logic work you would see that adding 1
            + 1 is in fact faster than adding 2^63 + 1.
            
            If you put enough data into a hash table the key length has to
            increase logarithmically to the table size in order to have
            distinct keys per record. Even Knuth points out that hash tables
            are really nlogn - something I’m pretty sure my CS professors
             left out. In multiple classes. Man, did I get tired of hash tables,
             but I see now why they harped on them. Case in point: this article.
       
            kevincox wrote 9 hours 18 min ago:
            This is true. But unless you are sure that current and future
            inputs will always be small I find it is better to start with the
            algorithm that scales better. Then you can add a special case for
            small sizes if it turns up in a hot path.
            
            This is because performance is typically less important for the
            fast/small case and it is generally acceptable for processing twice
            as much to be twice (or slightly more than twice) as slow, but
             users are far more likely to hit, and really get burned by, n^2
             algorithms in things you thought would almost always be small and
             you never tested at large enough sizes to notice.
            
            I wrote more on this topic here
            
   URI      [1]: https://kevincox.ca/2023/05/09/less-than-quadratic/
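             
             A hedged sketch of that pattern in Python (the cutoff of 16 is
             arbitrary and would need measuring on the real hot path):
             
               class SeenSet:
                   """Membership tracker that starts as a tiny list and
                   switches to a hash set once it grows past a cutoff."""
                   def __init__(self, cutoff=16):
                       self._small = []
                       self._set = None
                       self._cutoff = cutoff
               
                   def add(self, item):
                       if self._set is not None:
                           self._set.add(item)
                       elif item not in self._small:
                           self._small.append(item)
                           if len(self._small) > self._cutoff:
                               self._set = set(self._small)
               
                   def __contains__(self, item):
                       if self._set is not None:
                           return item in self._set
                       return item in self._small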
       
            throwaway2037 wrote 19 hours 12 min ago:
            > where linear search beats constant time maps
            
            Can you give an example?  You said lots of good things in your
             post, but I'm struggling to believe this one.  Also, it would help
            see some wall clock times or real world impact.
       
              crote wrote 12 hours 13 min ago:
              You've got to keep in mind that computers aren't the
              1-instruction-at-a-time purely sequential machines anymore.
              
              Let's say you've got a linear array of bytes, and you want to see
              if it contains a specific value. What would a modern CPU need to
              execute? Well, we can actually compare 64 values at a time with
              _mm512_mask_cmp_epu8_mask! You still need a little bit of setup
              and a final "did any of them match" check, of course. Want to
               compare 512 values? You can probably do that in less than 10
               clock cycles on modern machines.
              
              Doing the same with a hash set? Better make sure that hash
              algorithm is fast. Sure it's O(1), but if calculating the hash
              takes 20 cycles it doesn't matter.
       
              hansvm wrote 18 hours 34 min ago:
              Pick any compiled language and test it. Pick an algorithm making
              heavy use of a small (<10, maybe up to a hundred elements)
              hashset, and try using a linear structure instead. The difference
              will be most apparent with complicated keys, but even strings of
              more than a few characters should work.
              
              Some example workloads include:
              
              1. Tokenization (checking if a word is a keyword)
              
              2. Interpretation (mapping an instruction name to its action)
              
              3. ML (encoding low-cardinality string features in something like
              catboost)
              
              4. JSON parsers (usually key count is low, so parse into a
              linear-scan hashmap rather than a general-purpose hashmap)
              
              Details vary in the exact workload, the hardware you're using,
              what other instructions you're mixing in, etc. It's a well-known
              phenomenon though, and when you're doing a microoptimization pass
              it's definitely something to consider. 2x speedups are common.
              10x or more happen from time to time.
              
              It's similar to (but not _quite_ the same as) the reason
              real-world binary search uses linear scans for small element
              counts.
              
              When you go to really optimize the system, you'll also find that
              the linear scan solution is often more amenable to performance
              improvements from batching.
              
              As to how much it matters for your composite program? Even at a
              microoptimization level I think it's much more important to pay
              attention to memory access patterns. When we wrote our protobuf
              parser that's all we really had to care about to improve
              performance (33% less execution time for the entire workload,
              proto parsing being much better than that). You're much more
              likely to be able to write sane code that way (contrasted with
              focusing on instructions and whatnot first), and it's easier to
              layer CPU improvements on top of a good memory access pattern
              than to go the other way around.
       
          vlovich123 wrote 22 hours 57 min ago:
          I have an n^3 operation that's currently a huge bottleneck at only
          10k elements. Not sure how to fix it.
       
            IshKebab wrote 5 min ago:
            I once looked into tree diffing algorithms (the literature is all
            about diffing XML even though it's really trees). The obvious dumb
            algorithm (which seems to be what everyone uses) is n^4.
       
          EdwardCoffin wrote 1 day ago:
          Bruce Dawson says: I like to call this Dawson’s first law of
          computing: O(n^2) is the sweet spot of badly scaling algorithms: fast
          enough to make it into production, but slow enough to make things
          fall down once it gets there.
          
   URI    [1]: https://bsky.app/profile/randomascii.bsky.social/post/3lk4c6...
       
            hinkley wrote 1 day ago:
             I made the “mistake” in an interview of equating two
             super-quadratic solutions. What I meant was what
            Dawson meant. It doesn’t matter because they’re both too
            ridiculous to even discuss.
       
              dieortin wrote 1 day ago:
              They’re too ridiculous… unless a more optimal solution does
              not exist
       
                hinkley wrote 18 hours 12 min ago:
                Absolutely not.
                
                If the cost of doing something goes above quadratic, you
                shouldn't do it at all. Because essentially every customer
                interaction costs you more than the one before. You will never
                be able to come up with ways to cover that cost faster than it
                ramps. You are digging a hole, filling it with cash and
                lighting it on fire.
                
                If you can't do something well you should consider not doing it
                at all. If you can only do it badly with no hope of ever
                correcting it, you should outsource it.
       
                  patrick451 wrote 1 hour 50 min ago:
                   There are too many obvious exceptions to even start taking
                  this seriously. If we all followed this advice, we would
                  never even multiply matrices.
       
                  marcinzm wrote 6 hours 39 min ago:
                  > no hope of ever correcting it
                  
                  That's a pretty bold assumption.
                  
                  Almost every startup that has succeeded was utterly
                  unscalable at first in tons of technical and business ways.
                  Then they fixed it as they scaled. Over-optimizing early has
                  probably killed far more projects and companies than the
                  opposite.
       
                    hinkley wrote 4 hours 40 min ago:
                    > That’s a pretty bold assumption.
                    
                     That’s not a bold assumption; it’s the predicate for
                    this entire sidebar. The commenter at the top said some
                    things can’t be done in quadratic time and have to be
                    done anyway, and I took exception.
                    
                    >> unless a more optimal solution does not exist
                    
                    Dropping into the middle of a conversation and ignoring the
                    context so you can treat the participants like they are
                    confused or stupid is very bad manners. I’m not grumpy at
                     you; I’m grumpy that this is the eleventeenth time this
                    has happened.
                    
                    > Almost every startup
                    
                    Almost every startup fails. Do you model your behavior on
                    people who fail >90% of the time? Maybe you, and perhaps by
                    extension we, need to reflect on that.
                    
                    > Then we fixed it as we scaled
                    
                    Yes, because you picked a problem that can be architected
                    to run in reasonable time. You elected to do it later. You
                    trusted that you could delay it and turned out to be right.
                    
                    >> unless a more optimal solution does not exist
                    
                    When the devs discover the entire premise is unsustainable
                    or nobody knows how to make it sustainable after banging
                    their heads against it, they quickly find someplace else to
                    be and everyone wonders what went wrong. There was a table
                    of ex employees who knew exactly what went wrong but it was
                    impolitic to say. Don’t want the VCs to wake up.
       
                  Tainnor wrote 11 hours 28 min ago:
                   Gaussian elimination (for square matrices) takes O(n^3)
                   arithmetic operations, and it's one of the most important
                  algorithms in any scientific domain.
       
                    hinkley wrote 4 hours 26 min ago:
                    I’ll allow that perhaps I should have said “cubic”
                    instead of “quadratic” - there are much worse orders in
                    the menagerie than n^3.  But it’s a constraint we bang
                    into over and over again. We use these systems because
                    they’re cheaper than humans, yes? People are still trying
                    to shave off hundredths of the exponent in matrix
                    multiplication for instance. It makes the front page of HN
                    every time someone makes a “breakthrough”.
       
                  crote wrote 13 hours 3 min ago:
                  That only matters when the constants are nontrivial and N has
                  a potential to get big.
                  
                  Not every app is a B2C product intending to grow to billions
                  of users. If the costs start out as near-zero and are going
                  to grow to still be negligible at 100% market share, who
                  cares that it's _technically_ suboptimal? Sure, you could
                  spend expensive developer-hours trying to find a better way
                  of doing it, but YAGNI.
       
                    hinkley wrote 4 hours 36 min ago:
                    I just exited a B2B that discovered they invested in luxury
                    features and the market tightened their belts by going with
                    cheaper and simpler competitors. Their n wasn’t really
                    that high but they sure tried their damnedest to make it
                     cubic complexity. “Power” and “flexibility”
                     outnumbered “straightforward” and even “robust”
                     by at least three to one in conversations. A lot of my
                    favorite people saw there was no winning that conversation
                    and noped out long before I did.
                    
                    The devs voted with their feet and the customers with their
                    wallets.
       
                  endgame wrote 13 hours 29 min ago:
                  I feel this is too hardline and e.g. eliminates the useful
                  things people do with SAT solvers.
       
                    hinkley wrote 7 hours 12 min ago:
                    The first SAT solver case that comes to mind is circuit
                    layout, and then you have a k vs n problem. Because you
                    don’t SAT solve per chip, you SAT solve per model and
                    then amortize that cost across the first couple years’
                    sales. And they’re also “cheating” by copy pasting
                    cores, which means the SAT problem is growing much more
                    slowly than the number of gates per chip. Probably more
                    like n^1/2 these days.
                    
                    If SAT solvers suddenly got inordinately more expensive
                    you’d use a human because they used to do this but the
                    solver was better/cheaper.
                    
                    Edit: checking my math, looks like in a 15 year period from
                    around 2005 to 2020, AMD increased the number of cores by
                    about 30x and the transistors per core by about 10x.
       
                      IshKebab wrote 12 min ago:
                      That's quite a contortion to avoid losing the argument!
                      
                      "Oh well my algorithm isn't really O(N^2) because I'm
                      going to print N copies of the answer!"
                      
                      Absurd!
       
                  delusional wrote 16 hours 33 min ago:
                  All of modern Neural Network AI is based on GEMM which are
                  O(n^2) algorithms. There are sub-cubic alternatives, but it's
                  my understanding that the cache behavior of those variants
                  mean they aren't practically faster when memory bound.
                  
                  n is only rarely related to "customers". As long as n doesn't
                  grow, the asymptotic complexity doesn't actually matter.
       
                    saagarjha wrote 10 hours 55 min ago:
                    The GEMM is O(n^3) actually. Transformers are quadratic in
                    the size of their context window.
       
                      hinkley wrote 4 hours 34 min ago:
                      I read that as a typo given the next sentence.
                      
                      I’m on the fence about cubic time. I was mostly
                      thinking of exponential and factorial problems. I think
                      some very clever people can make cubic work despite my
                      warnings. But most of us shouldn’t. General advice is
                      to be ignored by masters when appropriate. That’s also
                      the story arc of about half of kung fu movies.
                      
                      Did chess solvers really progress much before there was a
                      cubic approximation?
       
                  Retric wrote 17 hours 34 min ago:
                  Chess engines faced worse than quadratic scaling and came out
                  the other side…
                  
                  Software operates in a crazy number of different domains with
                  wildly different constraints.
       
                    crabmusket wrote 15 hours 17 min ago:
                    I believe hinkley was commenting on things that are
                    quadratic in the number of users. It doesn't sound like a
                    chess engine would have that property.
       
                      tux3 wrote 12 hours 28 min ago:
                      They did make it sound like almost anything would
                      necessarily have n scale with new users. That assumption
                       is already questionable.
                      
                      There's a bit of a "What Computational Complexity Taught
                      Me About B2B SaaS" bias going.
       
            paulddraper wrote 1 day ago:
            The second law is that O(n * log n) is for practical intents and
            purposes O(n).
       
              sn9 wrote 22 hours 28 min ago:
              Skiena has a great table in his algorithms book mapping time
              complexity to hypothetical times for different input sizes.
              
              For n of 10^9, where lgn takes 0.03 us and n takes 1 s, nlgn
              takes 29.9 s and n^2 takes 31.7 years.
       
              hinkley wrote 1 day ago:
              But sometimes a big enough C can flip which solution helps you
              hit your margins.
       
                hansvm wrote 18 hours 49 min ago:
                In my mind, that's always been the point in dropping log
                factors. The algorithms are comparable enough that the actual
                implementation starts to matter, which is all we're really
                looking for in a Big-O analysis.
       
              EdwardCoffin wrote 1 day ago:
              To be clear though, that isn't his second law, at least as of two
              months ago, according to
              
   URI        [1]: https://bsky.app/profile/randomascii.bsky.social/post/3l...
       
                brucedawson wrote 2 hours 6 min ago:
                Should it be my second law?
                
   URI          [1]: https://bsky.app/profile/randomascii.bsky.social/post/...
       
                paulddraper wrote 22 hours 47 min ago:
                Yes, that isn't actually Dawson's second law.
       
                to11mtm wrote 1 day ago:
                Fair, but `n log n` definitely is the historical "good enough
                to actually sleep at night" in my head, every time I see it I
                think of the prof who taught my first CSC course and our data
                structures course due to how often it came up.
                
                Also, the wise statement that 'memory is fairly cheap compared
                to CPU for scaling'. It's insane to see how often folks would
                 rather manually open and scan a 'static-on-deploy' 20-100MB
                 JSON file for each request vs just parsing it into structures
                 in memory (where, for most cases, the in-memory usage is a
                 fraction of the JSON itself) and just caching the parsed
                 structure for the lifetime of the application.
       
                  hinkley wrote 1 day ago:
                   Not often, but occasionally I will choose the nlogn algorithm
                  which obviously has no bugs over the O(n) algorithm with no
                  obvious bugs.
                  
                  Less brittleness is worth paying a few percent. Especially if
                  it unmuddies the waters enough for someone to spot other
                  accidental (time) complexity.
       
                    refulgentis wrote 19 hours 34 min ago:
                    Considerably more than a few percent, IMHO. :)
                    
                    But I also don't dabble in this area nearly enough to know
                    whether there's years of tears and toil finding out
                    repeatedly that O(n) is ~impossible to implement and verify
                    :)
                    
                      | n    | n log n  |
                      | 5    | 8.0472   |
                      | 10    | 23.0259  |
                      | 25    | 80.4719  |
                      | 50    | 195.6012 |
                      | 100 | 460.5170 |
       
                      hinkley wrote 18 hours 17 min ago:
                      This is magic thinking about how C, memory hierarchies,
                      networking, and system calls work.
       
          SAI_Peregrinus wrote 1 day ago:
          I'd say the exception is when `n` is under about 10, and is counting
          some sort of hardware constrained thing (e.g. some operation over all
           CAN interfaces present on an OBDII connector can be O(n^(2)) since n
          will always be between 1 and 4). If you wouldn't have to physically
          replace hardware for `n` to increase, you really need to avoid n^2
           operations. And even then, consider them carefully, perhaps
           explicitly failing if `n` gets too big, so that the need for rework
           is noticed before new hardware hits the field.
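           
           For example (Python sketch; the limit of 4 mirrors the OBD-II
           example above and is only illustrative):
           
             MAX_IFACES = 4   # hardware bound: at most a few CAN buses
             
             def pair_up(interfaces):
                 # O(n^2) over a hardware-bounded n is fine, but fail loudly
                 # if the assumption ever breaks, so the needed rework gets
                 # noticed before new hardware hits the field.
                 if len(interfaces) > MAX_IFACES:
                     raise ValueError("more CAN interfaces than expected")
                 return [(a, b) for a in interfaces
                                for b in interfaces if a is not b]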
       
            echelon wrote 1 day ago:
            > perhaps explicitly failing if `n` gets too big
            
            That's the problem. A lot of these quadratic time algorithms don't
            set limits.
            
            Even 'n!' is fine for small 'n'. Real production use cases don't
            have small 'n'.
       
              haiku2077 wrote 23 hours 14 min ago:
               I have an app that's been running an O(n^2) algorithm in
              "production" (free open source app used by various communities)
              for about half a year now.
              
              It's been fine because "n" is "number of aircraft flying in this
              flight simulator" - and the simulator's engine starts to fail
              above around 2000 anyway. So even in the worst case it's still
              going to run within milliseconds.
       
              masklinn wrote 1 day ago:
              > Real production use cases don't have small 'n'.
              
              Real production use cases absolutely have small n. You don't hear
              about them, because it's very hard for them to cause issues.
              Unless the use case changes and now the n is not small anymore
              and nobody noticed the trap.
       
              lassejansen wrote 1 day ago:
              Or, phrased differently, if n has an upper limit, the algorithm
              is O(1).
       
                connicpu wrote 1 day ago:
                 As long as you have tests that regularly exercise your
                 algorithm at n=max, where you would notice if they were
                 exceptionally slow.
       
          koala_man wrote 1 day ago:
          Good call. O(N^2) is the worst time complexity because it's fast
          enough to be instantaneous in all your testing, but slow enough to
          explode in prod.
          
          I've seen it several times before, and it's exactly what happened
          here.
       
            david422 wrote 1 day ago:
            We just had this exact problem. Tests ran great, production slowed
            to a crawl.
       
              hinkley wrote 4 hours 24 min ago:
               On the first big project I worked on, a couple of us sped up the
               db initialization scripts so we could use a less trivial set of
               test data, to stop this sort of shenanigans.
              
              Things like inserting the test data first and turning on
              constraints and possibly indexes afterward.
       
              bcrl wrote 18 hours 36 min ago:
              I was just helping out with the network at an event.  Worked
              great in testing, but it failed in production due to unicast
              flooding the network core.  Turns out that some of the PoE
              Ethernet switches had an insufficiently sized CAM for the
              deployment combined with STP topology changes reducing the
              effective size of the CAM by a factor of 10 on the larger
              switches.  Gotta love when packet forwarding goes from O(1) to
              O(n) and O(n^2)!  Debugging that in production is non-trivial as
              the needle is in such a large haystack of packets so as to be
              nearly impossible to find in the output of tcpdump and wireshark.
               The horror... The horror...
       
          parthdesai wrote 1 day ago:
           My rule of thumb for 80%-90% of the problems is: if you need a
           complicated algorithm, it means your data model isn't right. Sure,
           you do need complicated algorithms for compilers, db internals, route
           planning et al., but all things considered, those are a minority of
           the use cases.
       
            LtWorf wrote 20 hours 45 min ago:
            I wrote a funny algorithm to group together words that end the same
            way so they are written only once in my binary wordlist file, since
            there is an array that points to the start character and a \0 to
            end the word. My initial solution was O(n²), but it was too slow on
            a real wordlist, so I had to come up with something better. In the
            end I just sort the list with quicksort, but on the reversed words,
            and then the groupable ones end up next to each other.
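            
            A rough Python sketch of that trick, assuming the goal is just
            to get words sharing a suffix next to each other (the example
            words are made up):
            
                def group_by_suffix(words):
                    # sort by each word reversed, so words ending the same
                    # way (e.g. "cat", "hat", "splat") become neighbours
                    return sorted(words, key=lambda w: w[::-1])
                
                print(group_by_suffix(["cat", "dog", "hat", "splat", "fog"]))
                # -> ['dog', 'fog', 'cat', 'hat', 'splat']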
       
            paulddraper wrote 1 day ago:
            This isn't knapsack. This is a dict lookup.
       
            anttihaapala wrote 1 day ago:
            This is not a complicated algorithm. A hash map (dictionary) or a
            hash set is how you would always do deduplication in Python,
            because it is easiest to write / least keystrokes anyway. That is
            not the case in C though, as it is much easier to use arrays and
            nested loops instead of hash maps.
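            
            For reference, the Python idioms being alluded to (a set when
            order doesn't matter, a dict when first-seen order should be
            kept):
            
                items = ["a", "b", "a", "c", "b"]
                
                unique = set(items)                    # order not preserved
                in_order = list(dict.fromkeys(items))  # first-seen order kept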
       
              throwaway2037 wrote 19 hours 10 min ago:
              > That is not the case in C though, as it is much easier to use
              arrays and nested loops instead of hash maps.
              
              I am confused. There are plenty of open source, fast hash map
              impls in C.
       
                saagarjha wrote 10 hours 53 min ago:
                Yes, the problem is getting them into your project.
       
        treesknees wrote 1 day ago:
        Duplicate of
        
   URI  [1]: https://news.ycombinator.com/item?id=44197061
       
          mdaniel wrote 1 day ago:
          Since that one only has one comment, I'm pretty sure that means the
          HN front page is luck of the draw, and I wouldn't have bothered even
          commenting about the other one.
       
        tomxor wrote 1 day ago:
        One of the many reasons I moved to self-hosting. I use ZFS to back up
        every 15 minutes ... I could do it even more frequently, but that
        seems a little pointless.
        
        Also moved away from Gitlab because it's so damn slow.
       
          mdaniel wrote 1 day ago:
          It for sure has an excruciatingly slow UI, but I haven't yet been
          able to stomach the Ruby magick enough to dig in and find out
          whether all of GitLab is slow, with their culture of who-cares just
          propagating to the UI, or whether it's literally just the web
          frontend that's molasses; because so many folks interact with it
          over the web, that's where the conclusion ends up.
          
          I know their backend git proxy is written in golang, their runner
          agent is written in golang, it spawns CI jobs using containerd,
          written in golang, and they use postgresql and a redis-esque KV --
          although for that part I do believe they're still using Sidekiq
          (Ruby) for doing job dispatch, so that one could very easily lolol
          back into not actioning tasks efficiently.
          
          GitHub Actions sure does enjoy just sitting there twiddling its
          thumbs when I push "run job," and is also both Ruby and their own
          crazypants ajax-y web framework, so I don't think it's exactly the
          shining star of performance itself
       
          haroldp wrote 1 day ago:
          This was my thought too, but I figured I just didn't understand the
          problem.  Why use git commands to backup when a file system copy
          seems like it would do?  zfs snapshot takes less than a second on any
          size repo.  zfs send transfers just the changes since the last
          backup, as fast as your network, more or less.
       
            tomxor wrote 21 hours 16 min ago:
            Yup, once you've used snapshots for backups, doing any other kind
            of file-level backup just seems absurd. I now use it as the basis
            for every piece of infrastructure I add that holds valuable data.
            The only place it doesn't really fit is distributed databases.
       
          glookler wrote 1 day ago:
          ZFS is great, getting off gitlab would also be great, what did you
          switch to?
       
            tomxor wrote 21 hours 20 min ago:
            Gitea, I can't recommend it enough. Lightning fast compared to the
            proprietary offerings, a clean simple UI like an older GitHub, and
            super simple to self-host.
       
              mdaniel wrote 20 hours 31 min ago:
              So long as you have an existing CI/CD story, or are willing to
              invest in head-to-heading them to find one that fits your needs.
              Because both it and Forgejo's "we're deploying act, what could go
              wrong" give me the heebie-jeebies. And, aside from that, there
              are so many FLOSS ones they could have chosen that legitimately
              work, making that decision extra opaque to me.
       
        edflsafoiewq wrote 1 day ago:
        So they replaced a nested for loop that checked for duplicates with a
        set data structure. Surprised such a common thing was in git.
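        
        Schematically, in Python terms (not the actual git code, which is C;
        this just shows the shape of swapping a nested membership scan for a
        hashed set):
        
            def dedup_quadratic(refs):
                # old shape: check each ref against everything seen -> O(n^2)
                seen = []
                for ref in refs:
                    if ref not in seen:   # linear scan over a list
                        seen.append(ref)
                return seen
            
            def dedup_linear(refs):
                # new shape: membership test against a hash set -> O(n)
                seen = set()
                out = []
                for ref in refs:
                    if ref not in seen:   # average O(1) lookup
                        seen.add(ref)
                        out.append(ref)
                return out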
       
        SomaticPirate wrote 1 day ago:
        How was the flame graph created? (Not very familiar with C and the
        performance tools around it)
       
          landr0id wrote 1 day ago:
          You can also use [1] to make recording and viewing in the Firefox
          profiler easier.
          
          It will spin up a localhost server after the trace ends, the profiler
          uses the localhost server and nothing is shared with Firefox servers
          unless you explicitly choose to upload the data and create a
          permalink.
          
   URI    [1]: https://github.com/mstange/samply
       
          plorkyeran wrote 1 day ago:
           [1] You record performance data with `perf`, then use the scripts
          there to turn it into an SVG.
          
   URI    [1]: https://github.com/brendangregg/FlameGraph
       
            IshKebab wrote 1 day ago:
            I strongly recommend not using this. Instead use pprof - it has a
            MUCH better interactive flamegraph, plus other nice performance
            visualisations (e.g. a call graph): [1]
            
                go install github.com/google/pprof@latest
                pprof -http=: prof.out
            
            I normally collect the profiles with gperftools ( [2] ) and then
            just
            
                LD_PRELOAD=/usr/lib/libtcmalloc_and_profiler.so CPUPROFILE=prof.out
            
            I've been meaning to try Samply though. Not sure if it works with
            pprof.
            
   URI      [1]: https://github.com/google/pprof
   URI      [2]: https://github.com/gperftools/gperftools
       
            immortaljoe wrote 1 day ago:
            OP here. In this particular case, I used
            
   URI      [1]: https://github.com/flamegraph-rs/flamegraph
       
        gre wrote 1 day ago:
        looks like the commit is here:
        
   URI  [1]: https://github.com/git/git/commit/a52d459e72b890c192485002ec51...
       
          mortar wrote 1 day ago:
          Thanks, this helped me find the submission and related discussion
          thread -
          
   URI    [1]: https://lore.kernel.org/git/20250401-488-generating-bundles-...
       
        hiddew wrote 1 day ago:
        "fixed it with an algorithmic change, reducing backup times
        exponentially"
        
        If the backup times were O(n^2), are they now O(n^2 / 2^n)? I would
        guess not.
       
          morepedantic wrote 15 hours 20 min ago:
          I was also looking for the exponential algorithm.
       
          wasabi991011 wrote 21 hours 59 min ago:
          I interpreted that as n->log(n) since log and exp are inverses.
          
          Also because I've often heard that the quantum Fourier transform is
          an exponential speedup over the discrete Fourier transform, and there
          the scaling goes n^2 -> n log n.
       
          sn9 wrote 22 hours 18 min ago:
          The algorithmic complexity went down in the function they patched (a
          6x improvement in their benchmark), but in the context of how they
          were using the algorithm, the impact was much larger (it improved to
          taking 1% of the time), which is plausibly exponential (and figuring
          out the actual complexity is neither relevant nor an economic use of
          time).
       
            ndriscoll wrote 9 hours 45 min ago:
            > figuring out the actual complexity is neither relevant nor an
            economic use of time
            
            The fix was replacing a nested loop with a map. Figuring out that
            this goes from O(n^2) to O(n) (modulo details like bucket count) is
            immediate if you know what the words mean and understand enough to
            identify the problem and make the fix in the first place.
       
          csnweb wrote 1 day ago:
          If you replace an n^2 algorithm with a log(n) lookup you get an
          exponential speedup. Although a hashmap lookup is usually O(1),
          which is even faster.
       
            ndriscoll wrote 9 hours 39 min ago:
            They're still using the map in a loop, so it'd be n log n for a
            tree-based map or n for a hash map.
       
            ryao wrote 23 hours 19 min ago:
            That is not true unless n^C / e^n = log(n) where C is some
            constant, which it is not. The difference between log(n) and some
            polynomial is logarithmic, not exponential.
       
              wasabi991011 wrote 21 hours 54 min ago:
              It depends if you consider "speedup" to mean dividing the runtime
              or applying a function to the runtime.
              
              I.e. you are saying an f(n) speedup means T(n)/f(n), but others
              would say it means f(T(n)) or some variation of that.
       
                morepedantic wrote 15 hours 5 min ago:
                The man, or LLM, used the mathematically imprecise definition
                of exponential in a sentence with big-O notation. I don't
                think he's going to be writing entire arguments formally.
       
              csnweb wrote 22 hours 5 min ago:
              But if you happen to have n=2^c, then an algorithm with
              logarithmic complexity only needs c time. That's why this is
              usually referred to as an exponential speedup in complexity
              theory, just like going from O(2^n) to O(n). More concretely, if
              the first algorithm needs 1024 seconds, the second one will need
              only 10 seconds in both cases, so I think it makes sense.
       
          marcellus23 wrote 1 day ago:
          Meaningless and non-constructive pedantry.
       
            morepedantic wrote 15 hours 4 min ago:
            >pedantry
            
            I wasted 2 minutes of my life looking for the exponential
            reduction. So did many others.
            
            Now I'm wasting more of my life shit posting about it, but at least
            that's a conscious choice.
       
            msgodel wrote 1 day ago:
            I disagree. Misuse of the word "exponential" is a major pet peeve
            of mine. It's a particular case of the much more common "use
            mathematically precise phrasing to sound careful/precise" that you
            often find in less than honest writing.
            
            Here they are actually using it to refer to growth functions (which
            is rare for this error) and being honest (which is also rare IMO)
            but it's still wrong. They should have written about quadratic or
            quadratic vs linear.
            
            Regardless, sloppy language leads to sloppy thought.
       
              sneak wrote 1 day ago:
              My personal pet peeve is when the term “exponentially” is
              used to refer to a change between precisely two data points.
              
              It’s a very specific subset of the one you’re describing.
              
              “Tell me you don’t know what you’re talking about without
              telling me you don’t know what you’re talking about.”
       
                morepedantic wrote 15 hours 2 min ago:
                >My personal pet peeve
                
                Just be glad you have only one.
       
              keybored wrote 1 day ago:
              Sloppy writing is up orders of magnitude lately.
       
                0cf8612b2e1e wrote 1 day ago:
                It will decimate readership.
       
            chrisweekly wrote 1 day ago:
            I'm not the OP you're responding to, but to be fair, in a sentence
            about big-O perf characteristics, which includes the word
            "algorithms", using "exponentially" in a colloquial non-technical
            sense is an absolutely terrible word choice.
       
              linguistbreaker wrote 1 day ago:
              Exponentially bad word choice even... since we're using that word
              however we want now?
              
              I don't think this is meaningless or non-constructive pedantry -
              we're a technical community and those are technical words.
       
          umanwizard wrote 1 day ago:
          This is not the precise mathematical definition of exponential, but
          rather the colloquial one, where it just means "a lot".
       
            morepedantic wrote 15 hours 18 min ago:
            >Ultimately, we traced the issue to a 15-year-old Git function with
            O(N²) complexity and fixed it with an algorithmic change, reducing
            backup times exponentially.
            
            No, not in the exact same sentence as a big-O. That's either author
            error, or an intentional troll. Either way it's egg on their faces.
       
            cvoss wrote 1 day ago:
            You shouldn't use a word that can carry a precise mathematical
            meaning in a sentence that literally uses mathematical notation in
            order to speak precisely and then expect readers not to interpret
            the word in the precise mathematical way.
       
              MyFedora wrote 11 hours 41 min ago:
              We simplify the big O notation in computer science. This is
              standard practice.
       
                rovr138 wrote 8 hours 21 min ago:
                Just drop the constants, it doesn't matter /s
                
                Production systems running and melting here...
       
              morepedantic wrote 15 hours 17 min ago:
              Especially when the colloquial meaning derives from the
              mathematical meaning.
       
              hskalin wrote 18 hours 25 min ago:
              I find the whole article rather poorly written. Most likely using
              an LLM.
       
              sneak wrote 1 day ago:
              “Words mean things.”
              
              If you can’t agree with this, then you shouldn’t be speaking
              or writing, IMO.
              
              Those who argue that words that mean different things are
              actually equivalent have no business dealing with language.
       
                morepedantic wrote 15 hours 8 min ago:
                I understood every word, phrase, and sentence you wrote. But I
                did not understand your point. Still, I got the meaning of your
                words, so presumably you're satisfied.
       
              IshKebab wrote 1 day ago:
              You should if you expect your readers to be normal humans who
              understand obvious context, and not pedantic HN readers who
              understand obvious context but delight in nit-picking it anyway.
       
                globular-toast wrote 14 hours 34 min ago:
                Ah yes because "normal humans" know what O(n^2) means but
                damnit they are going to use exponential wrong.
       
                morepedantic wrote 15 hours 14 min ago:
                >pedantic
                
                Who the fuck do you think is the intended audience for an
                article about an algorithm in `git bundle create`? I spent
                approximately two minutes of my life trying to figure out where
                the O(n^2) algorithm was being invoked in such a way that it
                influenced an exponential.
                
                Exponential was bolded in the same sentence as a big-O. 50/50
                troll/author oversight.
       
                  saagarjha wrote 10 hours 51 min ago:
                  Maybe not 'morepedantic
       
                Dylan16807 wrote 1 day ago:
                You can, but it's not should.
       
              blharr wrote 1 day ago:
              I somewhat agree, but for lack of a better word, what would you
              use? "Quadratically" doesn't have the same punch.
       
                ndriscoll wrote 10 hours 0 min ago:
                Quadratically doesn't have the same punch because quadratic
                really is exponentially less than exponential. So using it for
                extra punch (as opposed to not knowing the correct word) in a
                technical context would just be lying. It'd be like a paper
                saying they had a result with p less than one in a trillion for
                "extra punch" when they actually had p=0.1.
       
                globular-toast wrote 14 hours 31 min ago:
                Runtimes dropped precipitously.
       
                morepedantic wrote 15 hours 10 min ago:
                Algorithmic? Big-O? Polynomially? Linear improvement? O(n^2) to
                O(n)? Or if you want to be less mathematically precise:
                enormous improvement?
                
                Using exponential in this way in any context is a faux pas,
                because it's highly ambiguous, and requires context for
                clarification. But in this situation the context clearly
                resolved to the mathematically accurate definition, except it
                was used in the other way.
       
                bobbylarrybobby wrote 19 hours 8 min ago:
                “From quadratic to linear” or “... to constant” seems
                fine.
       
                jrochkind1 wrote 23 hours 30 min ago:
                If you just mean "a lot" in a non-technical sense, there are
                plenty of words available. enormously. immensely. tremendously.
                remarkably. incredibly. vastly.
       
                jjmarr wrote 1 day ago:
                "by a factor of `n`" also sounds impressive.
       
                remram wrote 1 day ago:
                "a lot"
       
                umanwizard wrote 1 day ago:
                “Dramatically” ?
       
        jeffbee wrote 1 day ago:
        Are there any reimplementations of git, by professional programmers
        using real tools? The source in question — object.c — is "banging
        rocks together" material.
       
          fHr wrote 5 hours 38 min ago:
          lmao
       
          gonzus wrote 13 hours 29 min ago:
          
          
   URI    [1]: https://github.com/libgit2/libgit2
       
          kgeist wrote 1 day ago:
          We use go-git in one of our Go projects that relies on Git for
          version control of its data.
       
          Vilian wrote 1 day ago:
          Git, but as it was intended to be used
       
          umanwizard wrote 1 day ago:
          Ignoring your inaccurate flamebait and answering the underlying
          question: there are several reimplementations of git, including "got"
          (Game of Trees) from the OpenBSD developers, jgit (reimplementation
          in Java), the Haskell git package, the gitoxide rust crate (and I
          assume several more half-finished Rust hobby projects), and probably
          others; that's just what I found with a cursory google.
       
          landr0id wrote 1 day ago:
          
          
   URI    [1]: https://github.com/GitoxideLabs/gitoxide
       
        judofyr wrote 1 day ago:
        See also: [1] Quadratic complexity sits in an awkward sweet spot: Fast
        enough for medium-sized n to pass first QA, but doomed to fail
        eventually as n grows.
        
   URI  [1]: https://www.tumblr.com/accidentallyquadratic
       
          vjerancrnjak wrote 1 day ago:
          This particular change was not accidental. It was announced as
          quadratic.
       
       
   DIR <- back to front page