_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   Slightly better named character reference tokenization than Chrome, Safari, FF
       
       
        o11c wrote 1 hour 15 min ago:
         Congratulations, you've reinvented regexes. This is still a win since
         you're using the sane kind of regex and are allowing multiple accept
         states rather than just one, in both cases unlike most modern
         implementations.
         
         (I'm mostly throwing out my thoughts as they appear; some parts of
         this end up duplicating what's in the article, hopefully with more
         standard terminology though.)
        
         Note that at runtime there is no difference between a standard DFA and
         what you call a DAFSA. The difference is entirely at construction time.
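         
         (Illustrative sketch, not the article's actual data layout: at lookup
         time a DAFSA is walked exactly like any DFA -- follow a transition
         per character, then check the accept flag. The node table below is a
         made-up toy accepting "gt" and "lt" with a shared suffix.)

```python
# Toy DAFSA: two names ("gt", "lt") sharing the suffix "t".
# At runtime this traversal is indistinguishable from a plain DFA walk.
nodes = {
    0: {"accept": False, "edges": {"g": 1, "l": 1}},  # 'g' and 'l' converge
    1: {"accept": False, "edges": {"t": 2}},
    2: {"accept": True,  "edges": {}},                # end of "gt" / "lt"
}

def matches(word):
    state = 0
    for ch in word:
        edges = nodes[state]["edges"]
        if ch not in edges:
            return False
        state = edges[ch]
    return nodes[state]["accept"]
```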
        
         In lexers, your `end_of_word` is usually called `accept`, and rather
         than being a `bool` it is an integer (0 for no-accept, N for the Nth
         valid accept value, which in your case should probably be an index
         into the array of all possible characters; note that since multiple
         entity names map to the same character, you will have multiple nodes
         with the same `accept`). I think your perfect-hash approach requires
         duplicating them (which admittedly might be a win, since you are far
         from the typical lexing case where there are many possible inputs for
         some outputs). However, this does mean you can't play games with the
         bits of `accept` to encode the length of your lookup as well as the
         start. If we're saving size, I lean toward UTF-8, either
         nul-terminated or with an explicit length.
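         
         (A minimal sketch of the integer-`accept` idea; the names and table
         contents here are illustrative, not from any real implementation: 0
         means "not a word end", and N > 0 is 1 + an index into a table of
         replacement strings, so distinct names sharing a replacement share an
         `accept` value.)

```python
# Hypothetical replacement table; accept value N maps to replacements[N - 1].
replacements = ["<", ">", "&"]

# Flattened for brevity: entity name -> integer accept value (0 = no match).
accept_of = {"lt": 1, "gt": 2, "amp": 3, "AMP": 3}  # "amp"/"AMP" share accept 3

def resolve(name):
    accept = accept_of.get(name, 0)
    return None if accept == 0 else replacements[accept - 1]
```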
        
        The next thing you should do is use equivalence classes rather than
        dealing with every character individually. For this particular parsing
        problem, almost all of your equivalence classes will only have a single
        character, but you still win big by mapping all invalid characters to a
        single class. Since there are only 51 characters used in entity names,
        this means you only need 6 bits per character (which should be fast
        since you only need to special-case non-letters). And since many of
        those only appear for the first letter, you can probably deal with 5 or
        fewer with minimal logic ahead of time.
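         
         (Sketch of the equivalence-class table, using an illustrative
         alphabet of letters and digits rather than the actual character set
         derived from the entity-name data: every byte that never appears in
         a name collapses into class 0, so transition tables need one column
         per class, not one per byte.)

```python
import string

# Illustrative alphabet; the real one would be computed from the name data.
used = sorted(set(string.ascii_letters + string.digits))

# byte value -> equivalence class; class 0 = "never appears in any name".
classes = [0] * 256
for i, ch in enumerate(used, start=1):
    classes[ord(ch)] = i
```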
        
         That said - one important lesson from lexing is that it is almost
         always a mistake to lex keywords; whenever possible, just lex an
         identifier and then do a map lookup. The reason that can't be done
         here is entirely because of those entities which do not require the
         semicolon, so I suspect that the optimal approach is going to be:
         after resolving `document.write`, look ahead for a semicolon, and if
         found use the fast path; only if that fails, enter the (much smaller)
         DFA for the few that do not require a semicolon. But since you don't
         have identifiers you might not be hitting the worst case (explosive
         splitting) anyway.
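         
         (A rough sketch of the "lex an identifier, then do a map lookup"
         fast path under the assumptions above; the table contents and helper
         name are hypothetical. Scan the run of name characters after `&`,
         and take the map-lookup path only when it is terminated by `;`;
         otherwise fall back to the no-semicolon machinery.)

```python
import re

# Tiny illustrative subset of the semicolon-terminated name table.
with_semicolon = {"notin": "\u2209", "not": "\u00ac", "amp": "&"}

def fast_path(text, pos):
    """pos points just past '&'. Returns (replacement, new_pos) or None."""
    m = re.match(r"[A-Za-z][A-Za-z0-9]*", text[pos:])
    if m and text[pos + m.end():pos + m.end() + 1] == ";":
        name = m.group()
        if name in with_semicolon:
            return with_semicolon[name], pos + m.end() + 1
    return None  # fall back to the (much smaller) no-semicolon DFA
```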
        
         For something this small, binary search is probably a mistake (being
         very unpredictable for the CPU) if you're doing everything else right;
         you're better off doing a linear search if you can't just use SIMD
         magic to match them in parallel. Struct-of-arrays is probably
         pointless for a problem set that fits in L1, but might start winning
         again if you want to leave some L1 for other parts of the program.
         Storing siblings/cousins next to each other (as an accident of
         construction) means you're probably already as Eytzinger-like as you
         can be.
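         
         (The linear-scan alternative is a one-liner; data layout here is
         illustrative. With only a handful of children per node, a plain scan
         over a small array avoids the unpredictable branches of binary
         search.)

```python
# children: small list of (char, node_index) pairs, e.g. a node's out-edges.
def find_child(children, ch):
    for c, idx in children:
        if c == ch:
            return idx
    return None

# e.g. find_child([("a", 1), ("m", 2), ("p", 3)], "m") gives node 2
```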
        
        (Edit: fix incomplete and missing thoughts)
       
        masfuerte wrote 2 hours 18 min ago:
        That was a good read.  I reread the relevant section of the HTML5 spec
        and noticed an error in an example:
        
         > For example, &not;in will be parsed as "¬in" whereas &notin will be
         parsed as "∉".
         
         Only a small minority of the named character references are permitted
         without a closing semicolon, and notin is not one of them.  So &notin
         is actually parsed as "¬in".  &notin; is parsed as "∉".
        
   URI  [1]: https://html.spec.whatwg.org/#parse-error-missing-semicolon-af...
       
          squeek502 wrote 1 hour 59 min ago:
          Good catch, that does indeed look like a mistake in the spec.
          Everything past the first sentence of that error description is
          suspect, honestly (seems like it was naively adapted from the example
          in [1] but that example isn't relevant to the
          missing-semicolon-after-character-reference error).
          
          Will submit an issue/PR to correct it when I get a chance.
          
   URI    [1]: https://html.spec.whatwg.org/multipage/parsing.html#named-ch...
       
        Ndymium wrote 3 hours 22 min ago:
        Thanks to your article I just realised my HTML entity codec library
        doesn't support decoding those named entities that can omit the
        semicolon at the end. More work for me, good thing my summer vacation
        just started! :)
       
         deepdarkforest wrote 5 hours 13 min ago:
         This might not get a lot of traction because it's very technical, but
         I wanted to say a massive well done for the effort. 20k words on
         anything this specific is not a joke. I wish I would put this level of
         commitment to anything in life; this was inspiring if nothing else.
       
          squeek502 wrote 4 hours 59 min ago:
          Appreciate it (I'm the author). I'd like to think there's a good bit
          of interesting stuff in here outside of the specific topic of named
          character reference tokenization.
       
            arthurcolle wrote 4 hours 12 min ago:
            Congratulations on your newfound promotion to data structures
            person btw
       
            chaps wrote 4 hours 14 min ago:
            "no[t] a 'data structures' person"
            
            says the person who wrote an extremely technical 20k word blog post
            on data structures! <3
       
       
   DIR <- back to front page