_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
       
       
       COMMENT PAGE FOR:
   URI   The server chose violence
       
       
        saagarjha wrote 14 hours 17 min ago:
        > Since attempts at exploitation often manifest first as errors or
        misuse of APIs, a system that responds to any misbehavior by wiping the
        state of the misbehaving component ought to be harder to exploit.
        
        In this case your application is one that is a little more rigorous in
        checking what it accepts. So it has a security benefit, but not the
        kind you think it does: the attacker is not set back because you
        destroyed their progress. Rather, certain invalid states that could
        previously be chained into more desirable invalid states no longer
        work, so an attacker will look elsewhere instead of trying to do
        that.
       
        jerf wrote 1 day ago:
        I would advise the author to read up on "asynchronous exceptions" and
        check out how many systems have had them at some point and removed
        them.
        
        I'm not saying that's because they're fundamentally impossible, but
        because they have a track record of tripping up language designers and
        it's good to cross check against the experiences.
        
        Recommended languages are Java (ultimately a failure despite vast
        effort), and Haskell and Erlang, where they do work, but only after a
        lot of work of very different kinds was put in. I definitely get
        Erlang vibes from this piece so it's possible the preconditions for
        correct asynchronous exceptions are met or can be met here. But they
        are very subtle and have a tempestuous history of working 99.9% but it
        being literally impossible to get to 100%. This could be a big, big,
        big trap.
       
          beeeeerp wrote 1 day ago:
          >Recommended languages are Java (ultimately a failure despite vast
          effort)
          
          Why is Java a failure? Recent JVMs have come a long way, and GraalVM
          makes it somewhat comparable to Go-like languages.
          
          I understand the historical hate and how Oracle bought it, but it
          really isn’t that bad of a language if you’re using modern Java.
       
            j16sdiz wrote 1 day ago:
            The parent was talking about async exceptions in Java, like
            InterruptedException. They are hard to work with or reason about.
       
            steveklabnik wrote 1 day ago:
            They're referring to that specific feature of Java being a failure,
            not the language in general.
       
          steveklabnik wrote 1 day ago:
          I am not familiar with what you're talking about (but Cliff may
          already know), I'll have to look into it. But Hubris is a synchronous
          system, and also, these faults aren't catchable, so I'm not sure how
          directly relevant it is. What's the specific issue you're worried
          about?
          
          Your Erlang vibes are there for good reason, it's certainly an
          influence.
       
            jerf wrote 1 day ago:
            Reaching out and nuking other... whatever you call them, "execution
            contexts" is what I go for to be maximally generic (thread/async
            task/continuation/generator/etc.), can particularly cause problems
            if the context was going to do X, Y, and Z and expected to be able
            to be guaranteed to run Z, but Y killed it. The standard example is
            for X to be taking a lock and Z to release it, but there are a lot
            of ways to get into trouble and the obvious first solutions don't
            work.
            
            Erlang solves it by locking what things it has that can have that
            problem behind other execution contexts that don't get killed when
            the main one dies, so they can still clean up. ("Ports", in their
            terminology.) Haskell solves it by being a functional language and
            beating the collective community's head against it for several years.
            (Immutability helped a lot, laziness took out back.)
            
            If that sounds impossible... hey, great! Then I just pattern
            matched on something that wasn't a match. If that doesn't sound
            impossible, then it may be worth a look around.
            
            Synchronousness may not really matter, I've kind of thought that
            "asynchronous exception" is not a good name for the issue for a
            while, but it's what it gets called. It's really about one
            execution context lobbing errors/exceptions into others. Although
            being synchronous would avoid the worst timing issues.
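            
            The X/Y/Z scenario above can be modeled in a few lines. This is a
            toy sketch in Python, not code from any of the systems mentioned:
            `run_until_killed` is a hypothetical helper that stands in for an
            external kill, and the "context" is just a list of steps.

```python
import threading


def run_until_killed(steps, kill_after):
    """Run steps in order, but destroy the context after `kill_after` steps.

    Hypothetical helper: it models an external kill by simply never
    running the remaining steps.
    """
    completed = []
    for index, (name, step) in enumerate(steps):
        if index == kill_after:
            break  # the context is gone; the remaining steps never run
        step()
        completed.append(name)
    return completed


lock = threading.Lock()
steps = [
    ("X", lock.acquire),   # X: take the lock
    ("Y", lambda: None),   # Y: some work, after which the kill arrives
    ("Z", lock.release),   # Z: release the lock (expected to always run)
]

completed = run_until_killed(steps, kill_after=2)
lock_still_held = lock.locked()
```

            After the kill, Z never ran and the lock stays held, so any other
            context waiting on it is stuck. That is the class of problem the
            comment is warning about.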
       
              steveklabnik wrote 1 day ago:
              Ah, that problem in general I am familiar with, yes.
              
              Tasks in Hubris are independently compiled programs, not threads
              in a shared context. So I don't believe that it's an issue. You
              don't share locks between tasks, you create a task that holds the
              shared resource, and have the two tasks that want to share it
              talk to that task, patterns like that.
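              
              The pattern described above (one task owns the resource, the
              others send it requests) can be sketched as follows. This is
              an illustration only: Python threads and queues stand in for
              Hubris tasks and IPC, and `counter_task` and `call` are
              hypothetical names.

```python
import queue
import threading


def counter_task(requests):
    """Owns the shared counter; no other task touches it directly."""
    count = 0
    while True:
        op, reply = requests.get()
        if op == "increment":
            count += 1
            reply.put(count)
        elif op == "stop":
            reply.put(count)
            return
        else:
            reply.put(None)  # unknown operation


def call(requests, op):
    """Send a request and block until the owner replies (synchronous call)."""
    reply = queue.Queue(maxsize=1)
    requests.put((op, reply))
    return reply.get()


def client(requests, n):
    for _ in range(n):
        call(requests, "increment")


requests = queue.Queue()
owner = threading.Thread(target=counter_task, args=(requests,))
owner.start()

clients = [threading.Thread(target=client, args=(requests, 100)) for _ in range(2)]
for t in clients:
    t.start()
for t in clients:
    t.join()

final_count = call(requests, "stop")
owner.join()
```

              No lock is shared between the clients, so killing one of them
              cannot leave a lock held; the owner keeps serving the other.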
       
          theamk wrote 1 day ago:
          Huh? I do not see anything asynchronous in the author's work. It's
          all synchronous, because IPCs in Hubris are synchronous too.
       
        cryptoxchange wrote 1 day ago:
        It’s interesting how in a system where one team writes all the code,
        nuking your clients from orbit when they look at you funny can improve
        iteration speed.
        
        It’s funny to wake up and read this after falling asleep reading
        about algebraic effects.
        
        If you squint the right way, this is a kernel that lets a server
        perform an effect that the client cannot handle.
        
        I feel like this would make code reuse and composition much harder, but
        provides a much simpler execution model. Definitely the right trade off
        in a static embedded system. You can always just vendor and modify a
        task if you need to reuse it.
       
          theamk wrote 1 day ago:
          I don't think this will make reuse much worse even in general
          programs, as long as there is a good division between expected errors
          (file not found) and unexpected (invalid operation code). In fact,
          there are a lot of ignorable errors in Unix which IMHO should have
          been raising a fatal signal instead, as this would substantially
          improve general software quality.
          
          As an example: trying to close() an invalid FD is a non-fatal error
          which is very often ignored. But it is actually super dangerous,
          especially in multi-threaded apps: closing the wrong fd will
          harmlessly fail most of the time, but 1% of the time you'll close a
          logging socket or a database lock file or some unrelated IPC
          connection. That's how you get unreliable software everyone hates.
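          
          Both outcomes of a stale close() can be reproduced in a small
          sketch. It assumes a POSIX system and a single-threaded process
          (so the kernel hands out the lowest free descriptor numbers);
          `close_result` is a hypothetical helper. Python surfaces the error
          as an exception, whereas C code gets a return value that is easy
          to ignore.

```python
import errno
import os


def close_result(fd):
    """Close fd and report "ok" or the symbolic errno name."""
    try:
        os.close(fd)
        return "ok"
    except OSError as exc:
        return errno.errorcode[exc.errno]


# Case 1: double close with nothing opened in between -> harmless EBADF.
r, w = os.pipe()
os.close(r)
os.close(w)
first = close_result(w)

# Case 2: the stale number has been reused by an unrelated descriptor.
r, w = os.pipe()
stale = w
os.close(w)                     # legitimate close
r2, w2 = os.pipe()              # unrelated pipe; reuses the freed number
reused = stale in (r2, w2)
second = close_result(stale)    # "succeeds" by closing the unrelated pipe end

# Clean up whatever is still open.
for fd in (r, r2, w2):
    if fd != stale:
        os.close(fd)
```

          The second close reports success, which is exactly why the bug goes
          unnoticed until the unrelated descriptor is something that matters.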
       
            cryptoxchange wrote 1 day ago:
            I agree with you in general.
            
            However, in your example it’s the kernel that is deciding the
            request (message) is bad. In Hubris it is the message receiver.
            
            This is a bit contrived, but imagine you’re receiving some
            stringly typed data from an external source and sending a message
            to a parsing task that either throws or messages you back with a
            list of some type t. Maybe it is returning ints and you as the
            client know that if something isn’t parsable as an int you want
            it to treat it as a ‘0’ because you’re summing the list.
            Somewhere else you want to call the same task, but you want strings
            that can’t be parsed to be treated as ‘1’ unless they can’t
            be parsed due to overflow (in which case you rethrow) because
            you’re taking the product.
            
            In some situations it’s natural for the client to know more than
            the server about how to handle errors. With this nuke from orbit
            model, there’s some forced coupling between the client and server
            (mutual agreement over what causes a REPLY_FAULT).
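            
            The contrived example above might look like this when the parser
            reports errors as data and each client applies its own policy.
            This is a sketch only: `parse_all`, `total`, `product` and the
            32-bit limit are hypothetical choices made for illustration.

```python
from math import prod

LIMIT = 2**31 - 1  # assumed overflow threshold for this sketch


def parse_all(items):
    """Parsing "server": returns one (tag, value) pair per input string."""
    results = []
    for text in items:
        try:
            number = int(text)
        except ValueError:
            results.append(("bad", text))
            continue
        if abs(number) > LIMIT:
            results.append(("overflow", text))
        else:
            results.append(("ok", number))
    return results


def total(items):
    """Summing client: anything unparsable counts as 0."""
    return sum(value if tag == "ok" else 0 for tag, value in parse_all(items))


def product(items):
    """Product client: unparsable counts as 1, but overflow is re-raised."""
    values = []
    for tag, value in parse_all(items):
        if tag == "overflow":
            raise OverflowError(value)
        values.append(value if tag == "ok" else 1)
    return prod(values)
```

            If the parser instead faulted its caller on the first bad string,
            neither client could apply its own rule, which is the coupling
            the comment describes.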
       
        lloydatkinson wrote 1 day ago:
        I’m really enjoying his posts on this
       
        ezekiel68 wrote 1 day ago:
        > There is no way to “fix” the problem and resume the task. This
        was a conscious choice to avoid some subtle failure modes and simplify
        reasoning about the system.
        
        One of Einstein's famous quotes is, "...as simple as possible, but no
        simpler."  I'm pretty sure this design violates the latter portion. 
        I'm not interested in operating environments that can tolerate no
        real-world chaos, and I'm not aware of any commercially viable realms
        which would either.  What -- push it back to the init system to keep
        trying again?  But by what mechanism would that strategy be able to
        understand the fault that occurred, in order to try again better?
        
        Anyway, kudos for purity of conviction (I guess).
       
          vvanders wrote 1 day ago:
          >  that can tolerate no real-world chaos, and I'm not aware of any
          commercially viable realms which would either.
          
          Watchdog timers will happily kill/restart your processes that don't
          poke them often enough. Even in my hobby exercises I've seen I2C
          buses hang up often enough (and bring the whole system down!) when
          some protocol bit goes wrong that I think the design is actually
          quite inspired. As I understand it this isn't talking about known
          error cases (that are handled) but protocol mismatches and other
          things that shouldn't ever happen.
          
          Many other comments touched on it, but it's a purpose-built OS. Much
          in the same way that I'm not going to build a UI in Erlang, Hubris
          seems well positioned for the space that it occupies.
       
          bcantrill wrote 1 day ago:
          Hubris is not an academic exercise: it runs at the heart of every
          element of the Oxide rack (compute sled, switch, power shelf
          controller) -- and its design is informed by delivered utility above
          all else.  Indeed -- and as Cliff elaborated in the blog --
          REPLY_FAULT was something that he thought initially perhaps too
          aggressive, but it was our own experience in building, deploying, and
          (it must be said!) debugging the system that gave him the confidence
          that it would make our systems more robust, not capriciously faulty.
          
          For more details on the thinking here and what it looks like in
          practice, see (e.g.) [1] and [2].
          
   URI    [1]: https://www.mattkeeter.com/blog/2024-03-25-packing/
   URI    [2]: https://cliffle.com/blog/who-killed-the-network-switch/
       
          cryptoxchange wrote 1 day ago:
          It’s a 2000-line Rust embedded systems kernel that doesn’t
          support adding new tasks at runtime. It is written to go deep in the
          guts of the 0xide server racks.
       
          crote wrote 1 day ago:
          >  But by what mechanism would that strategy be able to understand
          the fault that occurred, in order to try again better?
          
          I think the general idea is to apply this to problems which are
          clearly the result of an invalid program state, and therefore not
          reasonably recoverable. They are either caused by bugs, an attack, or
          corrupted hardware. In all cases you shouldn't continue, because
          there's something seriously wrong with the caller. If the caller
          continues, it could only cause more damage.
          
          It sounds a bit like Erlang/OTP's "let it crash" philosophy. Erlang
          is used in quite a lot of mission-critical hardware and is famous
          for its reliability, so it might not be such a huge dealbreaker in
          practice.
       
            sillywalk wrote 1 day ago:
            > It sounds a bit like Erlang/OTP's "let it crash" philosophy.
            
            Which was based partly on ideas from Tandem Computers' NonStop /
            Guardian. Hardware and software were fail-fast i.e. they would work
            correctly or stop, so they couldn't corrupt data. If there was a
            problem, the whole processor / process would be stopped, and a
            backup took over, which seems somewhat similar to the "supervisor"
            tasks in Hubris.
            
            These are quite different use cases, though: an embedded OS for
            microcontrollers vs. large OLTP applications. Both could be
            considered "mission critical", at least for the people who own/make
            money with them.
       
              vvanders wrote 1 day ago:
              From a "system engineering" (not to be confused with software
              engineering) perspective they seem quite similar. In my view even
              something like a watchdog timer (which just about every CPU/core
              has these days) is just a hardware version of similar
              philosophies. This [1] is one of my favorite overviews of Erlang
              and what drives some of those design decisions. You can
              absolutely apply the same systematic thinking to other
              domains/places without having to bring OTP or even Erlang into
              the conversation.
              
   URI        [1]: https://ferd.ca/the-zen-of-erlang.html
       
        loeg wrote 1 day ago:
        > Take Unix for example. If you call close on a file descriptor you
        never opened, you get an error code back. If you call open and hand it
        a null pointer instead of a pathname? You get an error code back. Both
        of these are violations of a system call’s preconditions, and both
        are handled through the same error mechanism that handles “file not
        found” and other cases that can happen in a correct program.
        
        > On Hubris, if you break a system call’s preconditions, your task is
        immediately destroyed with no opportunity to do anything else.
        
        Oh, yeah.  I've long thought EBADF and EINVALs (and EFAULT, I guess)
        should basically always be fatal.
       
        ahepp wrote 1 day ago:
        It sounds like this may be similar to using signals for error handling
        in a Unix system?
       
          steveklabnik wrote 1 day ago:
          In some sense, yes, this is kind of like the kernel sending SIGKILL
          to a process.
       
            saagarjha wrote 14 hours 19 min ago:
            Which some kernels do, actually. Not Linux but the ones that think
            you’re messing with things you shouldn’t be will SIGKILL you at
            the earliest opportunity.
       
              josephcsible wrote 1 hour 8 min ago:
              Linux will sometimes, e.g., if a process violates seccomp.
       
        crdrost wrote 1 day ago:
        OK, we need to get this as an April Fools RFC for HTTP.
        
        I propose HTTP 499 “Shame on you.” A client receiving 499 (perhaps
        on a request that it must have originated with a specific header like
        “Strict: true”) must terminate, in a language-dependent manner, the
        task which issued the request.
        
        It perfectly balances the “WTF... But actually, hey” that one sees
        in those contexts.
       
        rjbwork wrote 1 day ago:
        Reminds me of Vigil.
        
   URI  [1]: https://github.com/munificent/vigil
       
        Animats wrote 1 day ago:
        That's QNX-type interprocess communication. QNX doesn't offer
        interprocess kill, though.
       
          steveklabnik wrote 1 day ago:
          The designer of Hubris (and several folks who work on it) are
          familiar with QNX, for sure.
       
        hinkley wrote 1 day ago:
        I wonder if they’re going to find this creates security issues.
        
        Processes keep state to analyze abuse of various kinds, and killing a
        process presumably wipes its memory. Unless there’s some way to
        retain state across restarts?
       
          bcantrill wrote 1 day ago:
          Yes, we have an in situ dump facility, which Cliff mentioned at the
          end of [1]; it's been essential for debugging these issues when we
          hit them.
          
   URI    [1]: https://cliffle.com/blog/who-killed-the-network-switch/
       
        samus wrote 1 day ago:
        I find Humility is a great name for a debugger. Many are the
        programmers who refuse to use debuggers and just stare the code down
        until it yields its errors, under the assumption that "good" code
        doesn't need debugging!
       
          YZF wrote 1 day ago:
          It's partly a religious thing, partly what you're used to and partly
          using the right tool for the job. Some programmers use debuggers as a
          crutch and some complex systems (e.g. that involve multiple
          distributed components or are timing dependent) can't be easily
          debugged using traditional debuggers.
          
          EDIT: yet another factor is sometimes you may not even have access to
          the system you need to troubleshoot. Being able to reason about code
          execution without observing it is a useful skill (and still a
          debugger is a useful tool).
       
          r2_pilot wrote 1 day ago:
          I find this attitude bizarre. Just earlier today I used Python
          debugging to quickly figure out why an error was occurring. Being
          able to see the state of the variables without having to print each
          one helped solve it instantly.
       
          hinkley wrote 1 day ago:
          I find more bugs with a debugger. There’s typically the bug I was
          looking for, and then smaller bugs that didn’t technically cause
          the problem but contributed, and may be involved in the next issue. I
          want to fix those too, and sometimes first.
       
        ahepp wrote 1 day ago:
        > The Hubris IPC scheme is deliberately designed to work a lot like a
        function call, at least from the perspective of the client.
        
        That's a bona fide remote procedure call, isn't it?
       
          steveklabnik wrote 1 day ago:
          In a sense, though most would think of "remote" as being "over the
          network," and that's not the case here.
       
        optimalsolver wrote 1 day ago:
        Title sounds like it concerns a really fed-up waiter.
       
          adonovan wrote 1 day ago:
          In a sense, it does: waiting is one of the main jobs of an OS kernel.
       
        manishsharan wrote 1 day ago:
        I recall server ABENDs in Novell NetWare. I think it was the OG of
        server violence.
       
        layer8 wrote 1 day ago:
        Does REPLY_FAULT cascade? Meaning, if A is waiting in a SEND to B, and
        B is waiting in a SEND to C, and C does REPLY_FAULT, does A get killed
        along with B (and any further tasks that may be waiting on A)? Because
        if not, a malicious task could just delegate its experiments to a
        helper task. And if yes, that seems rather brittle overall (without
        having any further familiarity with Hubris). Furthermore, if SENDs can
        be circular/reciprocal, a task may also inadvertently kill itself that
        way — which (for scenarios like B –> A –> B) may incentivize not
        using REPLY_FAULT.
       
          ironhaven wrote 1 day ago:
          I think when B gets faulted, A would get an error about a dead
          server and would have the opportunity to resend the same message to
          a newly reset server, rather than suffering a cascading crash.
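          
          The behavior guessed at above can be written down as a toy model.
          This encodes only this comment's reading (the direct caller is
          restarted, and its own client sees a dead-server error); it is not
          Hubris's actual API, and every name here is hypothetical.

```python
class ReplyFault(Exception):
    """Raised by a handler to ask the toy kernel to fault its caller."""


class DeadServer(Exception):
    """Seen by a client whose server was restarted mid-request."""


class Task:
    def __init__(self, name, handler=None):
        self.name = name
        self.handler = handler
        self.restarts = 0


def send(client, server, msg):
    """Toy synchronous SEND from client to server."""
    try:
        return server.handler(server, msg)
    except ReplyFault:
        client.restarts += 1           # only the direct caller is faulted
        raise DeadServer(client.name)  # whoever waits on it sees a dead server


def c_handler(c, msg):
    if msg == "garbage":
        raise ReplyFault()
    return "C handled " + msg


def b_handler(b, msg):
    # Assumed bug: B corrupts its first request; a restart clears the bug.
    forwarded = "garbage" if b.restarts == 0 else msg
    return send(b, c, forwarded)


a = Task("A")
b = Task("B", b_handler)
c = Task("C", c_handler)


def a_request(msg, attempts=2):
    """A retries when its server died, instead of dying with it."""
    for _ in range(attempts):
        try:
            return send(a, b, msg)
        except DeadServer:
            continue
    return None


result = a_request("hello")
```

          In this model B is restarted once, A is never restarted, and A's
          retry succeeds against the freshly reset B.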
       
          samus wrote 1 day ago:
          It seems that Hubris is not designed as a general-purpose operating
          system. Processes are defined at build time.
          
          The reason why servers can shoot back at their clients is
          reliability, not security. Errors are thought to originate from bugs,
          not from deliberate attacks. The extreme reaction of the kernel
          ensures that developers find them as soon as possible.
          
          Of course, there is an overlap with security, and this can be a
          useful fallback measure in the event that a process tries to do
          something that it isn't supposed to do.
       
            steveklabnik wrote 1 day ago:
            > It seems that Hubris is not designed as a general-purpose
            operating system. Processes are defined at build time.
            
            These are both correct.
            
            Well, I mean, Hubris is general in the sense that, if you're doing
            an embedded system and you can deal with the constraints it has,
            like the latter, it can work for your projects. But it's not trying
            to be anything other than a good embedded OS, or to handle any
            project.
       
        e-dant wrote 1 day ago:
        I’m wondering if this really is too aggressive.
        
        On Linux, sure it’s not possible to directly crash another program
        you’re talking to via a socket alone (ignoring bad data on the
        socket).
        
        But you can absolutely kill them. Anything running as root can kill
        anything else. Can even reboot and bring down the whole system.
        
        Maybe a bit harder and a bit more unusual, but at least for containers,
        root privileges are common. And yeah, sure, there’s a cgroup there
        and you’re more limited. But you get the idea.
        
        It’s also a bit different from the (conventional?) wisdom about being
        “liberal in what you accept, conservative in what you emit” though
        that’s a bit more tied to networked systems.
        
        Though, maybe it’s inevitable that a system has to be liberal in what
        it accepts.
        
        How else can you change the api slightly without breaking existing
        programs?
       
          gary_0 wrote 1 day ago:
          Hubris isn't a general-purpose OS, it runs on a low-level processor
          inside the Oxide server rack. I believe Hubris doesn't even allow new
          kinds of processes at runtime; all possible executables must be
          determined at compile time.
       
            sillywalk wrote 1 day ago:
            > I believe Hubris doesn't even allow new kinds of processes at
            runtime; all possible executables must be determined at compile
            time.
            
            Correct. From [1]:
            
            "Hubris is an aggressively static system. The configuration file
            defines the full set of tasks that may ever be running in the
            application. These tasks are assigned to sections of address space
            by the build system, and they will forever occupy those sections.
            
            Hubris has no operations for creating or destroying tasks at
            runtime. Task resource requirements are determined during the build
            and are fixed once deployed. This takes the kernel out of the
            resource allocation business. Memories are the most visible
            resources we handle this way, but it applies to all allocatable or
            routable resources, including hardware interrupts and memory-mapped
            registers – all are explicitly wired up at compile time and
            cannot be changed at runtime."
            
   URI      [1]: https://cliffle.com/blog/on-hubris-and-humility/
       
            steveklabnik wrote 1 day ago:
            > I believe Hubris doesn't even allow new kinds of processes at
            runtime; all possible executables must be determined at compile
            time.
            
            This is correct, yes.
       
              gary_0 wrote 1 day ago:
              "Perfection is achieved, not when there is nothing more to add,
              but when there is nothing left to take away."
       
        greenbit wrote 1 day ago:
        Reminded me of the line from Errand of Mercy, "You will find there are
        many rules and regulations. They will be posted. Violation of the
        smallest of them will be punished by death."
       
        autocole wrote 1 day ago:
        Very enjoyable read, and this single supervisor is similar to how I set
        up an application at a previous startup, where we unwrapped everything.
        This reminds me of one of my favorite posts:
        
   URI  [1]: https://medium.com/@mattklein123/crash-early-and-crash-often-f...
       
        rcarmo wrote 1 day ago:
        Hubris and Humility (its debugger) are two pieces of tech I would love
        to be deeply engrossed in if I had the time (or the mandate). But alas,
        that is not possible.
       
        pavlov wrote 1 day ago:
        > ‘But REPLY_FAULT also provides a way to define and implement new
        kinds of errors — application-specific errors — such as access
        control rules. For instance, the Hubris IP stack assigns IP ports to
        tasks statically. If a task tries to mess with another task’s IP
        port, the IP stack faults them. This gets us the same sort of “fail
        fast” developer experience, with the smaller and simpler code that
        results from not handling “theoretical” errors that can’t occur
        in practice.‘
        
        This sounds good when the system is small and tight and applications
        are written mostly by people who designed the whole system.
        
        But as an application developer, I’d be somewhat scared to interface
        with third-party code over an IPC model where the other service can at
        any time send back an instant death pill to my process.
        
        I guess I just don’t trust other app developers that much. The world
        is full of terrible drivers and background processes written by
        stressed-out developers harassed by management. They’ll drop in a
        bunch of potentially unsuitable default REPLY_FAULTs if it means they
        get to go home before 8pm.
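        
        The port-ownership rule in the quoted passage can be sketched as a
        server with no "permission denied" return path at all. This is a toy
        model, not the Hubris IP stack: the port table, task names and
        `ip_stack_send` are hypothetical.

```python
class ReplyFault(Exception):
    """In this toy model, raising this stands in for faulting the caller."""


# Assumed static assignment, fixed at build time in this sketch.
PORT_OWNERS = {7: "echo", 53: "dns"}


def ip_stack_send(task, port, payload):
    """Toy server call: a task may only use the port assigned to it."""
    if PORT_OWNERS.get(port) != task:
        # No error code is returned; the caller is simply faulted.
        raise ReplyFault(f"{task} does not own port {port}")
    return len(payload)  # pretend the payload was sent; report bytes accepted


sent = ip_stack_send("echo", 7, b"hi")

try:
    ip_stack_send("echo", 53, b"hi")
    faulted = False
except ReplyFault:
    faulted = True
```

        A well-behaved client never needs code for the failure case, which is
        the "smaller and simpler code" the quote refers to. The concern in
        this comment is what happens when the server's rule is itself wrong.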
       
          saagarjha wrote 14 hours 23 min ago:
          To be fair, this is how abort works in any library you call into
          that’s in your process.
       
          aidenn0 wrote 1 day ago:
          > But as an application developer, I’d be somewhat scared to
          interface with third-party code over an IPC model where the other
          service can at any time send back an instant death pill to my
          process.
          
          For service, think "OS interface". If you make a bogus kernel call on
          a monolithic kernel, it would be reasonable for the OS to kill you. 
          Also note that when you say "process" it might be different than you
          think because threads all share the same address space on hubris.
       
          ahepp wrote 1 day ago:
          It seems like in an embedded environment, it's good to resolve these
          misunderstandings immediately when they occur, regardless of whose
          fault it is.
          
          The server says "that client is bad!" so the kernel kills it. The
          problem is really that the two didn't understand each other.
       
          toast0 wrote 1 day ago:
          > This sounds good when the system is small and tight and
          applications are written mostly by people who designed the whole
          system.
          
          Swift death to deviance is a way to keep the system tight. The
          designed scope probably keeps it small anyway. Scopes have a way of
          creeping, but I don't think people will want to force tasks into
          Hubris that would be better on the host rather than in its embedded
          controllers.
       
          password4321 wrote 1 day ago:
          > not handling “theoretical” errors that can’t occur in
          practice
          
          The Dennis Nedry approach to counting dinosaurs in Jurassic Park.
       
          mikaraento wrote 1 day ago:
          Indeed, this happened with Symbian. An IPC server could panic the
          client. As an application developer without access to the OS source
          code this was pretty terrible. Not all preconditions were easily
          understood and could vary between devices and OS versions.
       
          skitter wrote 1 day ago:
          > This sounds good when the system is small and tight and
          applications are written mostly by people who designed the whole
          system.
          
          I think that's intentional because that's what Hubris is aimed at.
       
            mannykannot wrote 1 day ago:
            ...and in that circumstance, the author reports finding, apparently
            serendipitously, that it helped with development: "Initially I was
            concerned that I’d made the kernel too aggressive, but in
            practice, this has meant that errors are caught very early in
            development. A fault is hard to miss, and literally cannot be
            ignored the way an error code might be."
       
       