_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   Parsing and all that
       
       
        keybored wrote 2 days ago:
        How relevant are Context Free grammars for modern programming
        languages? It seems that most modern languages want things like “raw
        strings” where you can deal with arbitrary strings by using syntax
         like (Rust) `r#"…"#` (with however many `#`s you need). It is my
         understanding that syntax like that makes these grammars fall
         outside of Context Free, no?
       
          o11c wrote 2 days ago:
           The actual place where CFGs fail tends to be when types and
           expressions aren't parsed the same. This especially causes
           problems if your
          language's generics use `<>`, but even C has the `T * v` problem.
       
            adrian_b wrote 20 hours 39 min ago:
             In traditional programming languages, the origin of such
             problems is the straitjacket of the ASCII character set.
            
            When a programming language is designed now, the right way is to
            use distinct Unicode characters wherever necessary, easily avoiding
            all ambiguities.
       
              layer8 wrote 20 hours 2 min ago:
              I would like to agree, but since keyboards are still mostly ASCII
              (in particular with regard to punctuation characters), most
              people don’t have a Compose key configured, and also don’t
              want to be constrained to editors with an input method tailored
              to that programming language, it would be an uphill battle for
              adoption.
       
          chowells wrote 2 days ago:
          That sort of thing is typically handled during lexing. By the time
          the parser deals with it, it's a string literal. This is one of the
          advantages of splitting those into separate phases.
          
          There are other hacks in use as well. Haskell's syntax is
          context-sensitive because operator precedence and associativity can
          be declared in the same context as the operator is used. But GHC
           ignores precedence and associativity during the phase where it
           actually parses the input.
          After it finishes parsing the translation unit, another phase goes
          back, finds nested operator applications in the parse tree, and fixes
          them up based on the precedence and associativity information it has
          found.
          
          It turns out that when your language is only just barely
          context-sensitive, you can find a way to use context-free tools to
          handle it, with a couple extra passes to fix things up.
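
           Sketched in Python (a toy, not GHC's actual implementation), that
           fix-up pass over a flat chain of operator applications looks
           roughly like this; FIXITY stands in for the declarations the
           earlier pass collected:

             # Re-associate a flat operator chain using declared fixities.
             # FIXITY maps operator -> (precedence, associativity).
             FIXITY = {"+": (6, "left"), "*": (7, "left"), "$": (0, "right")}

             def resolve(items, min_prec=0):
                 # items is a flat parse like ["a", "+", "b", "*", "c"]
                 lhs = items.pop(0)
                 while items and FIXITY[items[0]][0] >= min_prec:
                     op = items.pop(0)
                     prec, assoc = FIXITY[op]
                     # left-assoc: the right side must bind strictly tighter
                     next_min = prec + 1 if assoc == "left" else prec
                     lhs = (op, lhs, resolve(items, next_min))
                 return lhs

             assert resolve(["a", "+", "b", "*", "c"]) == \
                 ("+", "a", ("*", "b", "c"))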
       
          Rusky wrote 2 days ago:
          Features like that technically make the language context-sensitive,
          but in practice that context sensitivity is entirely contained within
          the lexer, and you can still describe the rest of the language with a
          context free grammar of tokens.
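
           As a concrete illustration (a Python sketch, not Rust's actual
           lexer), the lexer can count the opening `#`s with ordinary code;
           the parser then only ever sees a finished string token:

             # Lexing a Rust-style raw string r##"..."## by counting '#'s.
             # 'pos' points at the leading 'r'.
             def lex_raw_string(src: str, pos: int) -> tuple[str, int]:
                 assert src[pos] == "r"
                 pos += 1
                 hashes = 0
                 while src[pos] == "#":            # count the opening '#'s
                     hashes += 1
                     pos += 1
                 assert src[pos] == '"'
                 pos += 1
                 start = pos
                 closer = '"' + "#" * hashes       # only this can end it
                 end = src.find(closer, pos)
                 if end == -1:
                     raise SyntaxError("unterminated raw string")
                 # the parser receives src[start:end] as one STRING token
                 return src[start:end], end + len(closer)

             text, _ = lex_raw_string('r##"a "quoted" bit"##', 0)
             assert text == 'a "quoted" bit'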
       
            keybored wrote 2 days ago:
            Okay, thanks. I think that’s good.
       
        norir wrote 2 days ago:
        There is a part of me that feels we should just forget that LR ever
        existed and insist on LL(k) for k <= 2 for all new languages. The
        supposed expressivity benefits are not worth the complexity cost in my
         not so humble opinion. For me, the best approach is to write the
         grammar in a DSL that checks that it is unambiguous and in LL(k),
         and then implement it by hand with recursive descent.
       
          o11c wrote 2 days ago:
          I've always been hesitant about LL(1) since it requires a hack for
          something as common as expression-parsing, whereas LR is fine. Note
          that the linked article is misleading about LR having a problem with
          right-recursion: it only affects (data) stack depth, rather than
          breaking the machine like the LL problem. Still, LL and LR have
          common ground against the heresy of backtracking at least.
          
           I've never met a sane grammar that's not LALR(1). It's unfortunate
           that LALR(1) is such a leap of complexity beyond SLR(1), since
           SLR(1) is not practical for real languages.
          
          I imagine that LL(1) "mostly" works after the expression hack -
          what's your particular motivation for LL(2)? A lot of weird grammars
          lead to code that's hard to parse for humans.
          
           (Note also that, for languages (Python, JavaScript, etc.) that
           lack an efficient `switch` statement, table-based code is likely
           to beat function-call-based code.)
       
            ufo wrote 1 day ago:
            Needing two tokens of lookahead is fairly common. For example, in
            Lua "local IDENTIFIER" is a local variable declaration and "local
            function" is a private function declaration.
       
          dbcurtis wrote 2 days ago:
          I tend to agree.  Having done a bunch of DSL's over the years, and
          becoming extremely fluent with Bison+FLEX, I have to say that a
          hand-written LL parser with a couple of mutually recursive
           table-driven functions to parse expressions via precedence climbing
          is the way to go.  Precedence climbing (or I think it is also called
          Pratt parsing?) is super simple and flexible.  But the BIG bonus is
          in error reporting, and let's face it, most of the user interaction
          with your parser is with the error messages, otherwise it is push
          'enter' and forget.  Getting a reasonable error message out of an LR
          parser is about as much fun as poking yourself in the eye repeatedly
          with a sharp stick.  With LL, you know what structure you are trying
          to build... with LR, you have to rummage around in parse stack
          looking for obscure clues.
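
           For reference, the core of precedence climbing fits in a few
           lines. This is a hedged Python sketch (a plain token list stands
           in for a real lexer), including the kind of pinpointed error
           message that is easy when you know what you are trying to build:

             PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

             def parse_expr(toks, min_prec=1):
                 lhs = parse_atom(toks)
                 while toks and PREC.get(toks[0], 0) >= min_prec:
                     op = toks.pop(0)
                     if not toks:
                         # we know exactly what we were building here
                         raise SyntaxError(f"expected operand after '{op}'")
                     lhs = (op, lhs, parse_expr(toks, PREC[op] + 1))
                 return lhs

             def parse_atom(toks):
                 tok = toks.pop(0)
                 if tok == "(":
                     e = parse_expr(toks)
                     if not toks or toks.pop(0) != ")":
                         raise SyntaxError("expected ')' to close '('")
                     return e
                 return tok

             assert parse_expr(["1", "+", "2", "*", "3"]) == \
                 ("+", "1", ("*", "2", "3"))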
       
            o11c wrote 2 days ago:
            I utterly reject hand-written parsers. You give up the ability to
            trust your grammar - all too often you can accidentally add an
            ambiguity.
            
            I like `bison --xml` for the fact that you can write your own code
            but trust the table.
       
              norir wrote 19 hours 48 min ago:
              Do you utterly reject essentially every mainstream language?
              Which languages do you use that do not use a handwritten parser?
       
            Calavar wrote 2 days ago:
            I am aware this is an unpopular opinion, but I think the idea that
            recursive descent is inherently better suited to error reporting
            than table-driven LR is massively cargo-culted. You can handcraft
            error messages for an LR parser if you overspecify the grammar,
            meaning that you don't just write productions for the actual
            language; you also write productions for likely errors and have the
            predicates for those productions report your handcrafted error
            message.
            
            I'm guessing that a lot of people's first impression of that will
            be "That's insane, are you seriously suggesting writing two
            grammars in one?" But that's exactly what you do when you build
            custom error reporting code into a recursive descent parser. It's
            just less obvious because the code is procedural instead of
            declarative.
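
             To make the comparison concrete: in a hand-written parser the
             "error production" is a procedural branch (hedged Python sketch
             for a hypothetical mini-language); in the LR grammar it would
             be a second, declared production whose action emits the
             message:

               # Real production:         stmt -> 'let' IDENT '=' expr
               # Likely-error production: stmt -> 'let' IDENT expr
               def parse_let(toks):
                   assert toks.pop(0) == "let"
                   name = toks.pop(0)
                   if toks[0] != "=":
                       # handcrafted message for the error production
                       raise SyntaxError(f"missing '=' after 'let {name}'")
                   toks.pop(0)
                   return ("let", name, toks.pop(0))  # expr parsing elided

               assert parse_let(["let", "x", "=", "1"]) == ("let", "x", "1")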
       
              norir wrote 19 hours 49 min ago:
              I see the appeal of adding error nodes, but I'm not a huge fan.
              If you do this rather than either bombing out immediately on the
              first error or storing a stack of errors, then the syntax errors
               pollute the generated tree. This will require you either to
               add a separate pass that builds a new tree, the same as the
               old except without the error nodes, or to handle syntax
               errors downstream in the analysis or code gen phase.
              
              A recursive descent parser does not have to be procedural. I have
              written one  in a functional language that I designed that has no
              mutable variables or loops (recursive functions handle both of
              these cases).
              
              Parser generators are great for iterating on language syntax
              design. Once you know the language syntax, handwriting the parser
              should be easy. If it isn't, then it is most likely that either
              your language design or parsing skills need work. Both require
              time and practice.
       
                Calavar wrote 16 hours 13 min ago:
                > If you do this rather than either bombing out immediately on
                the first error or storing a stack of errors, then the syntax
                errors pollute the generated tree. This will require you to
                either add a separate pass for building a new tree that is the
                same as the old except without the error nodes, or you will
                have to handle syntax errors downstream in the analysis or code
                gen phase.
                
                I don't see how this follows? There is no need to append an
                error node (or any node at all) to the AST when reducing an
                error production. You can just print an error message or push
                one onto a stack of error messages.
                
                > Parser generators are great for iterating on language syntax
                design. Once you know the language syntax, handwriting the
                parser should be easy. If it isn't, then it is most likely that
                either your language design or parsing skills need work. Both
                require time and practice.
                
                 Parser generators aren't just for people who don't know how
                 to write recursive descent parsers; they are production-ready
                 tools. Python, Ruby, PHP, and Bash all use generated parsers.
                
                I think the aversion to parser generators in general and LR
                parser generators in particular is a combination of
                not-invented-here-syndrome and cargo-culting the idea that LR
                parsers can't emit handcrafted error messages.
       
              layer8 wrote 20 hours 13 min ago:
               Yes, it’s also not uncommon for lexical grammars to have
               token classes that are more general than what is actually
               valid, for example for numerical constants.
              
               What would be useful is tooling that could check that one
               grammar generates a sub-language of another, for suitably
               restricted grammar classes (because in the general case that
               check is undecidable, of course).
       
              ckcheng wrote 2 days ago:
              That's a very interesting way of looking at errors.
              
              Wouldn’t the grammar of errors have to also be LR? Doesn’t
              that limit the kind of errors you can report?
       
                Calavar wrote 1 day ago:
                Yes, it does have to be LR. Which can be a real PITA if you're
                working with a restrictive subset of LR, like LALR. But there
                are other options too. LR(1) is pretty powerful, and GLR even
                more so.
       
          marcosdumay wrote 2 days ago:
           Well, IMO: you write a parser only a couple of times for each
           language, but you read code all the time.
          
          So, anything that makes code easier to read is a must. That includes
          the extra expressiveness of an LR grammar.
       
            haberman wrote 2 days ago:
            Do you know of any particularly readable constructs that are
            possible with LR but incompatible with LL?
            
            My general opinion is that any constructs that confuse a parser
            will also confuse a human.
       
              marcosdumay wrote 2 days ago:
              Operators with precedence rules.
              
               And given the modern idea of "everything is either an
               identifier or an operator", precedence has become very
               important.
       
                haberman wrote 2 days ago:
                Operators with precedence rules frequently appear in top-down
                parsers.  I agree that pure-LL form doesn't make this easy, but
                most languages I have seen use Pratt parsers that allow
                precedence parsing within a top-down framework.  It doesn't
                require full LR.
       
                  Rusky wrote 2 days ago:
                  Pratt parsing is essentially hand-written LR. It's certainly
                  bottom-up in the same way, despite Pratt's paper title.
                  
                  (I went into more depth on this a few years ago: [1] )
                  
   URI            [1]: https://www.abubalay.com/blog/2021/12/31/lr-control-...
       
                    chubot wrote 20 hours 40 min ago:
                    Hm I wouldn't call it bottom up -- I would say it's top
                    down, but you consult a table of integers to decide where
                    to recurse.
                    
                    I would even call it a variant of recursive descent, which
                    is top-down.  In plain recursive descent, you often choose
                    where to recurse based on the first token (e.g. if while
                    for).  In Pratt, you choose based on a table indexed by the
                     second operator token.
                    
                     You can have an entire arbitrary-length subexpression
                     before the "second token", but it is still resolved
                     "eagerly".
                    
                     e.g. consulting the loop here - [1]. I skimmed over your
                     post, but I don't really see the bottom-up argument.
                    
                    I guess I could concede that it's neither top down nor
                    bottom up, but I don't see the argument for being bottom
                    up.
                    
                    And maybe this is just a taxonomy thing, which doesn't
                    matter in practice.  It could be understood both ways -- I
                    don't think Vaughan Pratt was mistaken when he understood
                    it a certain way :)
                    
                    ---
                    
                    "where to recurse" is basically choosing the parent's
                    production before its children -- that's most similar to
                    top down
                    
                    If you choose the child's production before its parent,
                    then that's LR / bottom-up
                    
                    (using Haberman's explanation - [2] )
                    
                     May also be relevant: [3] [4] [5] I believe some people
                     claimed that Dijkstra is bottom-up and Pratt is top down.
                     That has a certain nice symmetry to it, but I'm not sure
                     it's true :)  Dijkstra does use an explicit stack and
                     Pratt uses the call stack -- maybe that is how I'd put it.
                    
   URI              [1]: https://github.com/andychu/pratt-parsing-demo/blob...
   URI              [2]: https://blog.reverberate.org/2013/07/ll-and-lr-par...
   URI              [3]: https://www.oilshell.org/blog/2017/03/31.html
   URI              [4]: https://www.oilshell.org/blog/2017/04/22.html
   URI              [5]: https://matklad.github.io/2020/04/15/from-pratt-to...
       
                      Rusky wrote 17 hours 1 min ago:
                      The fact that you parse an entire subtree before you know
                      the kind of its parent is exactly what makes it
                      bottom-up. Pratt is certainly easy to integrate into an
                      otherwise-top-down parser, because you can enter it from
                      the top and it can enter its "leaves" from the top, but
                      internally it is absolutely bottom-up, and you can
                      generalize this to any recursive ascent parser.
                      
                      To make this more explicit:
                      
                      The first thing a Pratt parser does before entering the
                      loop is based on the lookahead token's "nud." This
                      corresponds to an LR automaton in its initial state
                      shifting that token and advancing into the leaf-most
                      rule(s). The "nud" typically switches back to top-down
                      internally, where pure bottom-up would not, but the
                      choice of "nud" is necessarily made in a bottom-up way.
                      
                      Once "nud" returns, or the LR automaton reduces back to
                      the initial state, the next thing it does is based on the
                      operator token's "led." This corresponds to the LR
                      automaton shifting that token and advancing into the
                      parent rule determined by the operator. This resolution
                      is just as "eager" in LR, where it is represented by the
                      new state containing only items from the chosen rule.
                      This is the hallmark of bottom-up parsing: as you
                      progress through the children, you narrow down the
                      possibilities for the parent.
                      
                      Finally, note that none of this has anything to do with
                      the use of function calls vs a table of integers and
                      explicit stack. Both top-down and bottom-up parsers can
                      be implemented using either recursion or a table and
                      stack.
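
                       For concreteness, the loop under discussion, as a
                       minimal Python sketch of the nud/led structure (not
                       any particular implementation):

                         PREC = {"+": 1, "*": 2}

                         def nud(toks):
                             # token with nothing to its left;
                             # atoms only, to keep the sketch short
                             return toks.pop(0)

                         def parse(toks, rbp=0):
                             left = nud(toks)  # first decision: the nud
                             while toks and PREC.get(toks[0], 0) > rbp:
                                 # the led: chosen only after 'left'
                                 # is already a finished subtree
                                 op = toks.pop(0)
                                 left = (op, left, parse(toks, PREC[op]))
                             return left

                         assert parse(["1", "+", "2", "*", "3"]) == \
                             ("+", "1", ("*", "2", "3"))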
       
            vidarh wrote 2 days ago:
             I agree there are a few places like this, but when you want an
             LL parser for as much of the language as possible, they're
             usually solved either by making them part of the tokenization
             or by "escaping" a small subset of the grammar.
            
            I wish that language designers (looking at you, Ruby - as much as I
            love using Ruby, the grammar is an abomination) would at least try
            to stick to LL as much as possible, and consciously, carefully and
            in a limited fashion "escape" into LR for as small subsets as
            possible.
            
            Break rules if it actually helps developers, sure, but it'd be nice
            if people tried to make the weird bits as tiny as possible...
            
            EDIT: Also, in practice this is often what we end up doing when we
            write recursive descent anyway, only in a less than formal way.
            It'd be nice if more tooling let us write grammars that default to
            LL w/checking, but let you insert clearly delineated exceptions.
       
              marcosdumay wrote 2 days ago:
              Oh, I do agree with that!
              
              A language should be LL as much as possible. And also depend on
              the parsed data as little as possible. Stuff like C's variable
              declaration must be avoided.
              
              But I don't think either of those rules can be made universal.
              Some language for expressing exceptions would be great.
       
            mhh__ wrote 2 days ago:
            In 2012 yes, but these days having grammars that lots of tools can
            parse and output easily is a boon for tooling.
       
        kazinator wrote 2 days ago:
         Combining a finite state machine for recognizing sentential
         patterns with a stack gives us LR parsing.
        
        But push-down automata are significant not only because they have
        practical uses in parsing, but because they represent a theoretical
        class. They are more powerful than finite automata, but are not Turing
        complete.
        
         If, say, we use a push-down automaton to make a Boolean decision
         (does this input string fall into the to-be-recognized set, or
         not?), then there are some kinds of strings it won't be able to
         decide that a full Turing machine could decide. The classic example
         is the set of strings of the form a^n b^n c^n: one stack can check
         that the a's match the b's, but it is empty again by the time the
         c's arrive.
        
        The limitation of push down automata is a direct consequence of the
        stack discipline. There is only one stack, and when the stack is
        popped, information is thrown away.
        
        The tape machine that Turing described as part of developing his theory
        of computation doesn't have that limitation; when the tape head moves
        backwards, material it has written to the tape is not erased in a
        stack-like discipline.
       
          eternityforest wrote 2 days ago:
           What else can they do? Has anyone used them outside of parsing, as
           a way to satisfy the principle of least power?
       
            082349872349872 wrote 21 hours 50 min ago:
            Michael Jackson's single-threaded "Jackson Structured Programming"?
       
          mcguire wrote 2 days ago:
          Neat observation: finite automata, push-down automata, and Turing
          machines can be distinguished purely by the number of stacks they
          need.
          
          Finite automata: no stack.
          
          Pushdown automata: one stack.
          
           Turing machine: two stacks. (Use them like the tape: to move
           forward, pop the f-stack and push the symbol onto the b-stack;
           to move backward, pop the b-stack and push onto the f-stack.)
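
           A sketch of the trick in Python (a hypothetical Tape class):

             class Tape:
                 BLANK = "_"

                 def __init__(self, cells):
                     self.left = []                      # cells left of head
                     self.right = list(reversed(cells))  # head cell on top

                 def read(self):
                     return self.right[-1] if self.right else self.BLANK

                 def write(self, sym):
                     if self.right:
                         self.right[-1] = sym
                     else:
                         self.right.append(sym)

                 def move_right(self):
                     # pop the f-stack, push onto the b-stack
                     top = self.right.pop() if self.right else self.BLANK
                     self.left.append(top)

                 def move_left(self):
                     # pop the b-stack, push onto the f-stack
                     top = self.left.pop() if self.left else self.BLANK
                     self.right.append(top)

             # unlike popping a PDA stack, moving back erases nothing
             t = Tape(["a", "b"])
             t.move_right(); t.write("X"); t.move_left()
             assert t.read() == "a" and t.right == ["X", "a"]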
       
            mark_undoio wrote 20 hours 40 min ago:
            I think it's a shame this wasn't highlighted more explicitly in, at
            least, my computer science degree.
            
             We did finite state machines in quite a bit of detail, vaguely
             referred to pushdown automata, but then concentrated heavily on
             lambda calculus, with Turing machines covered less.
            
            The Turing machine description, without showing it in this
            hierarchy, seems like a really weird abstraction.  Fitting it in
            here (and combining it with the different Chomsky classes of
            grammar) makes it all feel more joined up.
       
       
   DIR <- back to front page