_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
       
       
       COMMENT PAGE FOR:
   URI   Why the OpenSSL punycode vulnerability was not detected by fuzz testing (2022)
       
       
        cjbprime wrote 1 day ago:
        I'm not sure the post (from 2022) is/was correct.  I've looked into it
        too, and I expect this was reachable by the existing x509 fuzzer. 
        There's a fallacy in assuming that a fuzzer will solve for all
        reachable code paths in a reasonable time, and that if it doesn't then
        there must be a problem with the harness.  The harness is a reasonable
        top-level x509 parsing harness, but starting all the way from the
        network input makes solving those deep constraints unlikely to happen
        by (feedback-driven) chance, which is what I think happened here.
        
        Of course, a harness that started from the punycode parsing -- instead
        of the top-level X509 parsing -- finds this vulnerability immediately.
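         
         Roughly what such a targeted harness looks like, as a minimal
         libFuzzer-style sketch (the entry point is the standard one; the
         decode prototype below is an illustrative stand-in for the
         OpenSSL-internal punycode API rather than a copy of it):
         
             #include <stddef.h>
             #include <stdint.h>
             
             /* Illustrative prototype for the internal punycode decoder. */
             int ossl_punycode_decode(const char *enc, size_t enc_len,
                                      unsigned int *out,
                                      unsigned int *out_len);
             
             int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
             {
                 unsigned int buf[64];
                 unsigned int buf_len = 64;
             
                 /* Feed the raw fuzzer input straight to the decoder,
                  * bypassing X.509 parsing and signature checks entirely. */
                 ossl_punycode_decode((const char *)data, size,
                                      buf, &buf_len);
                 return 0;
             }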
       
          capitainenemo wrote 1 day ago:
           If finding it required generating a valid signed certificate from
           random string generation, that seems essentially equivalent to
           "unreachable" to me...
       
            gravescale wrote 21 hours 27 min ago:
             Not necessarily, because the fuzzer has visibility into the
             execution pathways triggered, so it can "feel" its way through
             the maze.
            
            For example, say any certificate that starts with 0x00 is invalid
            (and there's an up-front test and return). Once the fuzzer has
            tried 0x00 and concluded that the pathway is a dead end, it won't
            try again, even though there are squillions of data blocks
            beginning 0x00.
            
            It's not a highly efficient way to generate a valid certificate
            (especially when you have to navigate the network code as well),
            but it's also not just rolling 'n' 256-sided dice until the end of
            the universe.
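             
             In code terms, the toy situation is something like this (made
             up, just to show what the coverage feedback sees):
             
                 #include <stddef.h>
                 
                 /* Toy parser: once the fuzzer has observed the early-return
                  * branch below, inputs starting with 0x00 stop producing
                  * new coverage, so mutation effort shifts to inputs that
                  * get past the check. */
                 int parse_cert(const unsigned char *buf, size_t len)
                 {
                     if (len == 0 || buf[0] == 0x00)
                         return -1;  /* dead end */
                     /* ... deeper parsing it still has to reach ... */
                     return 0;
                 }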
       
              cjbprime wrote 20 hours 57 min ago:
              I think the parent is pointing out that if the signature has to
              be both syntactically and cryptographically valid, then this
              would defeat the fuzzer for obvious reasons.
              
              But I don't think it does, for this vulnerability.  The signature
              is checked last.
       
          cryptonector wrote 1 day ago:
          > Of course, a harness that started from the punycode parsing --
          instead of the top-level X509 parsing -- finds this vulnerability
          immediately.
          
          Yes, it's nice to have a fuzzer that can start from a very high-level
          entry point, but it's even nicer to fuzz lots of entry points (one
          for every parser).
       
          hedora wrote 1 day ago:
          I've found that writing randomized unit tests for each small part of
          a system finds this sort of stuff immediately.
          
          In this case, a test that just generated 1,000,000 random strings and
          passed them to punycode would probably have found the problem (maybe
          not in the first run, but after a week or two in continuous
          integration).
          
          I try to structure the tests so they can run with dumb random input
          or with coverage-guided input from a fuzzer.  The former usually
          finds >90% of the bugs the fuzzer would, and does so 100x faster, so
          it's better than fuzzing during development, but worse during nightly
          testing.
          
          One other advantage for dumb random input is that it works with
          distributed systems and things written in multiple languages (where
          coverage information isn't readily available).
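           
           For concreteness, the shape I have in mind is one core check that
           both a dumb random loop and a coverage-guided fuzzer can drive
           (the names here are made up):
           
               #include <stdint.h>
               #include <stdlib.h>
               
               /* Hypothetical routine under test. */
               void check_one_input(const uint8_t *data, size_t size);
               
               /* Dumb random driver: cheap enough for every CI run. */
               void run_random(unsigned seed, int iterations)
               {
                   srand(seed);
                   for (int i = 0; i < iterations; i++) {
                       uint8_t buf[256];
                       size_t n = (size_t)(rand() % 256);
                       for (size_t j = 0; j < n; j++)
                           buf[j] = (uint8_t)rand();
                       check_one_input(buf, n);
                   }
               }
               
               /* Same check, driven by a coverage-guided fuzzer. */
               int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
               {
                   check_one_input(data, size);
                   return 0;
               }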
       
            Arnavion wrote 1 day ago:
            aka property testing.
       
            paulddraper wrote 1 day ago:
            > maybe not in the first run, but after a week or two in continuous
            integration
            
            You'd use a different seed for each CI run??
            
             That sounds like a nightmare of non-determinism, and a loss of
             trust in the CI system in general.
       
              forty wrote 1 day ago:
              Not if you log the seed of the failing runs
       
                paulddraper wrote 17 hours 50 min ago:
                Or someone just punches retry
       
                  hedora wrote 8 hours 20 min ago:
                  Ignoring test failures is a different issue.  It can range
                  from known bugs and rational tradeoffs to a hiring/hr issue.
                  
                  Establishing a healthy dev/engineering culture is hard.
       
                    paulddraper wrote 3 hours 12 min ago:
                    You don't have auto retries anywhere in your tests?
       
                hedora wrote 22 hours 21 min ago:
                Yep; I definitely log the seed and re-seed every few seconds.
                
                Most software I work on these days is non-deterministic anyway
                (it involves the network, etc.), so CI is fundamentally going
                to fail some runs and not others.
                
                Even stuff like deterministic simulation has this property: 
                Those test suites rely on having a large number of randomly
                generated schedules, so there's always a chance that running
                the test one more time (with a new seed) will find a new bug.
       
                  jhardy54 wrote 21 hours 36 min ago:
                  If you’re using Git, I’d strongly recommend the hash of
                  the current tree (not the commit). That way your tests are
                  deterministic based on the contents of your tree. For
                  example, if you add a commit and then revert, you’ll end up
                  with the same test seed as if you hadn’t committed.
       
                    hedora wrote 8 hours 22 min ago:
                    The important thing to log is the seed passed to the rng. 
                    (It’s usually wallclock time measured at nanosecond
                    granularity.)
                    
                    In a typical night for a production quality system,
                    100-1000+ hours of such tests would run, all with different
                    seeds, and in diverse configurations, etc, so the seed
                    isn’t derived from the git sha, or the source code.
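                     
                     As a sketch of what I mean (the clock source below is
                     just what I typically reach for; any high-resolution
                     source works):
                     
                         #include <stdint.h>
                         #include <stdio.h>
                         #include <time.h>
                         
                         /* Nanosecond-wallclock seed, logged so a failing
                          * run can be replayed with the same seed. */
                         uint64_t fresh_logged_seed(void)
                         {
                             struct timespec ts;
                             clock_gettime(CLOCK_REALTIME, &ts);
                             uint64_t s = (uint64_t)ts.tv_sec * 1000000000u
                                        + (uint64_t)ts.tv_nsec;
                             fprintf(stderr, "test seed: %llu\n",
                                     (unsigned long long)s);
                             return s;
                         }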
       
            mattgreenrocks wrote 1 day ago:
            I really like this idea because it avoids the issue of fuzzers
            needing to burn tons of CPU just to get down to the actual domain
            logic, which can have really thorny bugs. Per usual, the idea of
            "unit" gets contentious quickly, but with appropriate tooling I
            could foresee adding annotations to code that leverage random
            input, property-based testing, and a user-defined dictionary of
            known weird inputs.
       
        arp242 wrote 1 day ago:
        (2022)
        
        Previous:
        
        Why CVE-2022-3602 was not detected by fuzz testing - [1] - Nov 2022
        (171 comments)
        
   URI  [1]: https://news.ycombinator.com/item?id=33693873
       
          dang wrote 1 day ago:
          Year added above. Thanks!
       
        stcredzero wrote 1 day ago:
         Would it be possible to attach an LLM to a debugger session
         executing all of the fuzzer seeds, and ask it to figure out how to
         expand coverage?
       
          swatcoder wrote 1 day ago:
           Not to dismiss the novelty and power of LLMs, but why would you
           turn to a black-box language interface for that?
           
           Wouldn't you expect a designed system to be more efficient,
           complete, reliable, and auditable? Most of those characteristics
           are critical to security applications, and the nature of LLMs
           largely runs counter to them.
       
            agilob wrote 1 day ago:
             Turns out LLMs are really good at writing fuzzers [1][2]
            
   URI      [1]: https://verse.systems/blog/post/2024-03-09-using-llms-to-g...
   URI      [2]: https://google.github.io/oss-fuzz/research/llms/target_gen...
       
              swatcoder wrote 1 day ago:
               Those demonstrate that they're capable of generating effective
               ones, which is really cool but also not surprising.
              
              What matters for engineering is how that technique compares to
              others in the context of specific requirements.
              
               A big part of "falling for hype" is mistaking a new and capable
               tool for being the right or optimal tool.
       
                cjbprime wrote 1 day ago:
                It's fine to have LLM skepticism as a default, but here it's
                not justified.    Google is showing here that the LLM-written
                harnesses improve massively on the harnesses in oss-fuzz that
                were written over many years by the combined sum of open source
                security researchers.  Most dramatically, they improved
                tinyxml2 fuzzing coverage by 31% compared to the existing
                oss-fuzz harnesses, through an entirely automated flow for
                harness generation by LLMs.
                
                Whatever engineering technique you are imagining would be
                better is not one that humanity actually applied to the problem
                before the automated LLM harnesses were written.  In general,
                writing and improving fuzzing harnesses is extremely tedious
                work that is not being done (or paid for) by nearly enough
                people to adequately protect critical open source software. 
                The LLM approach is a legitimate breakthrough in the field of
                open source fuzzing.
       
                  swatcoder wrote 1 day ago:
                  Fair enough, interesting, and plausible! I looked at the
                  first link and saw it as more of a capabilities demo, but
                  didn't dig into the Google one. I'm mostly just encouraging
                  thoughtful reflection on tool choice by raising questions,
                  not making a case against. Very cool.
       
              cjbprime wrote 1 day ago:
              These links are a little different to the GP comment, though. 
              Both of these cases (which I agree show LLMs being an excellent
              choice for improving fuzzing coverage) are static analysis, going
              from the project source code to a new harness.
              
              Some issues with that are that the model probably doesn't have
              enough context to be given all the project source code, so you
              have to work out which subset to share, including definitions of
              all relevant symbols (but not too many).
              
              It helps a bit that the foundation models were already
              pre-trained on these large open source projects in oss-fuzz, so
              they already know something about the project's symbols and
              definitions from their original training sets -- and even from
              public discussions about the code! -- but that wouldn't work for
              a private codebase or for recent changes to a large public one.
              
               Then the harness source that the LLM writes might have syntax
               errors or fail to compile, and you have to deal with that
               somehow; or it might be valid but not generate any coverage
               improvement, and you have to deal with that too, and so on.
              
              GP seems to be talking about instead some form of LLM-aided
              dynamic analysis, where you are probably using some kind of
              symbolic execution to generate new seeds, not new harnesses.
              
              That's important work too, because I think in this case
              (disagreeing with the blog post author) the vulnerable function
              was actually reachable by existing harnesses, just not through
              the seed corpora (at least the public ones).
              
               One approach could be for the LLM to become a kind of symbolic
              execution constraint solver, using the debugger as a form of path
              instrumentation and producing new seeds by predicting what a new
              input would look like when you invert each interesting constraint
              that the fuzzing coverage is blocked by, as the debugger hits the
              test for that constraint (which is difficult because it can
              require actual computation, not pattern matching, and because of
              path explosion).
              
              Or perhaps more plausibly, the LLM could generate Z3 or other
               SMT-solver code to define and solve for those constraints to
              generate new seeds, replacing what is currently extremely tedious
              and time-consuming work when done by humans.
       
        TacticalCoder wrote 1 day ago:
        Why is OpenSSL using punycode? To do internationalized domain name
        parsing?
       
          cryptonector wrote 1 day ago:
          Yes.
       
          amiga386 wrote 1 day ago:
          I would assume it's needed for hostname validation. Is one hostname
          equivalent to another? Does this wildcard match this hostname?
          
          EDIT: I looked it up. It is used to implement RFC 8398
          (Internationalized Email Addresses in X.509 Certificates)
          
   URI    [1]: https://www.rfc-editor.org/rfc/rfc8398#section-5
       
        _flux wrote 1 day ago:
        I suppose it would be nice to require at least 100% line coverage for
        tested encryption-related functionality, when tested by a fuzzer.
        Deviations from this could be easily detected in CI.
        
        Testing cannot be used to prove that a flaw doesn't exist, only that it
        does.
       
          taviso wrote 1 day ago:
          > Testing cannot be used to prove that a flaw doesn't exist, only
          that it does.
          
          FWIW, I wrote a similar blog post about a different encryption bug
          that really seemed like it should have been found by fuzzing, and had
          100% coverage. [1] Not that I disagree with you, just a practical
          example.
          
   URI    [1]: https://googleprojectzero.blogspot.com/2021/12/this-shouldnt...
       
            jcims wrote 1 day ago:
            It's pretty remarkable how effective fuzzing is despite the layers
            upon layers of optimizations/assumptions that it requires in order
            to be feasible at all (eg max_len = 10000).  I haven't tinkered
            with fuzzing since afl was a toddler but its mechanism for pruning
            the test space seemed so brilliant and tweakable at the time.
            
             It would be interesting to find a way to create excursions into
             these various parameter spaces, but some of them are baked into
             the infrastructure in a way that makes that difficult.
       
          Retr0id wrote 1 day ago:
          > Testing cannot be used to prove that a flaw doesn't exist
          
          This is not universally true, formal methods can take you a long way,
          depending on the problem domain.
          
          And sometimes you can test for every possible input. This usually
          isn't feasible, but it's nice when it is.
       
            _flux wrote 1 day ago:
             True, I should have remembered to express it as "Non-exhaustive
             testing cannot..."
             
             I also don't consider formal methods to be testing. But model
             checking (e.g. with TLC) is something in between, and sort of
             takes advantage of the "This usually isn't feasible, but it's
             nice when it is" point: when you limit your domain enough, it
             becomes feasible, yet exhaustively checking each small-world
             scenario gives pretty good confidence that the model is correct.
       
            bluGill wrote 1 day ago:
            Formal methods catch a different class of problem from other tests
            and so are worth it.  However they cannot catch everything. The
            classic example: "midpoint = (high + low) / 2" is easy to formally
            prove correct - but in fact it isn't correct because high + low
            could overflow.
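             
             Concretely, with the standard fix alongside:
             
                 /* Near INT_MAX, high + low overflows (undefined behavior
                  * for signed int) even though the abstract algorithm is
                  * fine. */
                 int midpoint_naive(int low, int high)
                 {
                     return (low + high) / 2;   /* can overflow */
                 }
                 
                 /* Safe when 0 <= low <= high (the array-index case). */
                 int midpoint_safe(int low, int high)
                 {
                     return low + (high - low) / 2;
                 }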
       
              _flux wrote 1 day ago:
              There is a point here: if your implementation doesn't match the
              specification, then that can be the case. And your example
              presumes the specification assumed unbounded numbers, while an
              implementation did not have them. Well, in that case the
              specification just needs to be revised to include that
              information.
              
               But the trick is to remember to make the specification model
               the real world at a sufficient level of detail: not so detailed
               that the specification becomes unwieldy, not so relaxed that it
               skips real-world issues that need to be modeled.
              
               I don't think there are too many good solutions to that, except
               perhaps theorem provers from which you can also extract code
               (e.g. Coq), but it seems either these tools are still difficult
               for engineers to use, or there is an expectation that models
               and actual implementations are going to be different anyway,
               making it difficult to see if the implementation actually
               implements the spec.
       
              mathgradthrow wrote 1 day ago:
              I don't mean to be rude, but why are you commenting on formal
              methods when you clearly have no idea what they are?
       
                bluGill wrote 1 day ago:
                The specific example I gave was from a 1982 paper (I forget who
                wrote it, but a big name in computer science)  where that was
                used to prove an algorithm and thus argue that he didn't need
                to test it as it was proved correct.
                
                The famous Donald Knuth has also stated "Beware of bugs in the
                above code; I have only proved it correct, not tried it.",
                which again hints that there are limits to proved programs.
                
                I'll admit that I don't know much about them, what I know
                meringues me - but I can't figure out how to use them in the
                real world. However I know enough to know that they are not
                perfect.
       
                  numeri wrote 5 hours 7 min ago:
                  Is "meringues me" a typo, or a really fun new vocab word for
                  me?
       
                  Jtsummers wrote 1 day ago:
                  You use them judiciously, you learn their capabilities and
                  limitations and the scope of their applicability.
                  
                   TLA+ is very useful, but it's not going to prove out your
                   software itself; that will require testing or other methods.
                   It will help prove out your specification. I've used it to
                  good effect both for designing distributed and concurrent
                  systems and for identifying issues in existing systems
                  (without an existing formal spec, by creating it and then
                  playing with the invariants and other aspects of the TLA+
                   spec). SPARK/Ada, which I mentioned in my other comment,
                   will detect many problems and help you prove the absence
                   of many others. But it's also not suitable (yet) for all
                   projects, or
                  more than a subset of a project, since it has specific
                  limitations on what it can prove and is only compatible with
                  a subset of Ada (this is being expanded each year). The
                  subset of Ada it does apply to is actually quite significant
                  and useful, but you have to be capable of understanding that
                  it is limited and then apply it within that limited scope.
                  
                  This is true for every other formal method approach I've
                  seen. Pay attention, read, ask questions, and don't judge
                  them all by one paper you half remember from 42 years ago.
       
              thfuran wrote 1 day ago:
              If you've formally proved that correct, then either it can't
              overflow or there's a massive bug in the prover.
       
              dinosaurdynasty wrote 1 day ago:
              Any formal system worth its salt is going to model overflow.
       
                Jtsummers wrote 1 day ago:
                And with SPARK/Ada it will. When I was learning it that was
                actually a very common thing that it caught in my code (various
                forms of over/underflow errors). In practice most of them would
                not occur, but a few could have under real-world use-cases.
       
            dllthomas wrote 1 day ago:
            Formal methods aren't "testing".
            
            As you say, though, exhaustive testing is sometimes possible.
       
              Retr0id wrote 1 day ago:
              I agree that it's not testing in the strict sense, but fuzzing
              isn't really testing either.
              
                 The work done by a theorem prover and a fuzzer isn't all that
                 different (and sufficiently advanced fuzzers use SMT solvers,
                 etc.)
       
                _flux wrote 1 day ago:
                 I guess I need to disagree :). Why isn't fuzzing testing?
                 Ultimately it ends up running the code and seeing if it
                 works (for some definition of "working", that is, it doesn't
                 crash), even if it picks its inputs by considering how the
                 code behaves while running.
                 
                 Also, how is even "smart fuzzing" similar to theorem provers
                 in any way? Fuzzers still think in terms of runs, but a proof
                 covers all cases. The best fuzzer wouldn't be able to tell
                 whether a sort algorithm works for all inputs, presuming the
                 input size is unbounded. Actually, my understanding is that a
                 fuzzer wouldn't even be able to prove that the identity
                 function works for all kinds of objects it may be given.
       
                  dllthomas wrote 20 hours 23 min ago:
                  I think fuzzing is doing testing, but often using other
                  techniques (including sometimes formal methods) to improve
                  the effectiveness of that testing.
       
                NovemberWhiskey wrote 1 day ago:
                This seems to neglect the fundamental difference between static
                and dynamic analysis. Formal methods approaches don’t usually
                need me to run my code.
       
              bluGill wrote 1 day ago:
               It is normally safe to assume that exhaustive testing isn't
               possible, because the total number of states to test exceeds
               the number of atoms in the universe. There are a few
               exceptions, but in general we should assume it is impossible
               to exhaustively test programs. That means we need to use
               something else to find the limits and test only those, or use
               formal methods (both would be my recommendation - though I'll
               admit I have yet to figure out how to use formal methods).
       
                dllthomas wrote 1 day ago:
                > in general we should assume it is always impossible to
                exhaustively test programs
                
                The whole program? Yes.
                
                For testing individual components, there's no need to assume.
                The answer is very likely either clearly yes (the input is a
                handful of booleans) or clearly no (the input is several
                integers or even unbounded) and for those cases where it's not
                immediately clear (one i16? Three u8s?) it's probably still not
                hard to think through for your particular case in your
                particular context.
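                 
                 For the small cases it really is just a loop, e.g. for one
                 i16 (the two functions here are hypothetical stand-ins):
                 
                     #include <assert.h>
                     #include <stdint.h>
                     
                     int16_t component(int16_t x);   /* unit under test */
                     int16_t reference(int16_t x);   /* trusted model   */
                     
                     /* Exhaustive: all 65536 possible i16 inputs. */
                     void test_component_exhaustively(void)
                     {
                         for (int32_t x = INT16_MIN; x <= INT16_MAX; x++)
                             assert(component((int16_t)x) ==
                                    reference((int16_t)x));
                     }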
       
                  mananaysiempre wrote 1 day ago:
                  Brute forcing a single 32-bit input has also been usefully
                  performed in some cases, such as for single-precision
                  implementations of special functions[1]
                  
   URI            [1]: https://randomascii.wordpress.com/2014/01/27/theres-...
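                   
                   The shape of that exhaustive sweep, roughly (my_expf is a
                   hypothetical routine under test, and the equality check is
                   just a placeholder for whatever accuracy criterion you
                   actually care about):
                   
                       #include <math.h>
                       #include <stdint.h>
                       #include <string.h>
                       
                       float my_expf(float x);
                       
                       /* Walk all 2^32 float bit patterns. */
                       unsigned long count_mismatches(void)
                       {
                           unsigned long bad = 0;
                           uint32_t bits = 0;
                           do {
                               float x;
                               memcpy(&x, &bits, sizeof x);
                               float got = my_expf(x);
                               float want = expf(x);
                               if (got != want &&
                                   !(isnan(got) && isnan(want)))
                                   bad++;
                           } while (bits++ != UINT32_MAX);
                           return bad;
                       }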
       
                    bluGill wrote 1 day ago:
                    32 bits is a bit over 4 billion states, so that is doable. 
                    However a lot of algorithms can have more than 32 bits of
                    states and so quickly get beyond what is doable.
       
                      dllthomas wrote 20 hours 25 min ago:
                      > 32 bits is a bit over 4 billion states, so that is
                      doable.
                      
                      Whether 4B is doable depends on how much work you're
                      doing for each of those states, but yes, "4B" alone
                      certainly doesn't rule it out.    There's also a question
                      of whether it goes with your most frequently run tests or
                      is something that's only run occasionally.
                      
                      > However a lot of algorithms can have more than 32 bits
                      of states and so quickly get beyond what is doable.
                      
                      I'm pretty confident we're all in agreement there.
       
       