gopher://codevoid.de/1/hn/comments

        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (unofficial)
       
       
       COMMENT PAGE FOR:
   URI   Launch HN: RunRL (YC X25) - Reinforcement learning as a service
       
       
        ripbozo wrote 2 hours 14 min ago:
        Was excited to see something about reinforcement learning as I'm
        working on training an agent to play a game, but apparently all
        reinforcement learning nowadays is for LLMs.
       
          -_- wrote 1 hour 21 min ago:
          Have you heard of [1] ? Might fit your use case
          
   URI    [1]: https://puffer.ai
       
          ag8 wrote 2 hours 8 min ago:
           Yeah, for better or worse, the way the median startup interfaces with
           AI these days is through an LLM API, and that's what all the
           workflows are built around, so that's what we're targeting. Though,
           depending on what you're trying to do, I wouldn't discount the value
           of starting with a pretrained model: there was that famous result
           from 2022 showing that pretraining a model on _Wikipedia_ made
           training on Atari games more than twice as efficient [1]. These days,
           LLMs have huge amounts of priors about the real world that make them
           great starting points for a surprisingly diverse set of tasks (e.g.
           see the chemistry example in our video!)
           
   URI    [1]: https://arxiv.org/abs/2201.12122
       
        nextworddev wrote 3 hours 41 min ago:
        Is there any credence to the view that these startups are basically
        DSPy wrappers?
       
          omneity wrote 1 hour 11 min ago:
          Perhaps less about DSPy, and rather about this:
          
   URI    [1]: https://github.com/OpenPipe/ART
       
          -_- wrote 3 hours 23 min ago:
          DSPy is great for prompt optimization but not so much for RL
          fine-tuning (their support is "extremely EXPERIMENTAL"). The nice
          thing about RL is that the exact prompts don't matter so much. You
          don't need to spell out every edge case, since the model will get an
          intuition for how to do its job well via the training process.
       
            nextworddev wrote 2 hours 58 min ago:
            Isn't the latest trend in RL mostly about prompt optimization, as
            opposed to full fine-tuning?
       
              ag8 wrote 2 hours 44 min ago:
              Prompt optimization is very cool, and we use it for certain
              problems! The main goal with this launch is to democratize access
              to "the real thing"; in many cases, full RL allows you to get the
              last few percent in reliability for things like complex agentic
              workflows where prompt optimization doesn't quite get you far
              enough.
              
              There are also lots of interesting possibilities, such as RLing a
              model on a bunch of environments and then prompt optimizing it on
              each specific one, which seems way better than, like, training
              and hot-swapping many LoRAs. In any case, _someone_ ought to
              provide a full RL API, and we're here to do that well!
       
                nextworddev wrote 2 hours 41 min ago:
                Thanks. Is this mainly for verifiable tasks or any general task?
       
                  ag8 wrote 2 hours 28 min ago:
                  It's for any task that has an "eval", which usually means
                  verifiable tasks or ones that can be judged by LLMs (e.g. see
                  [1]). There's also been recent work such as BRPO [2] and
                  similar approaches to make more and more "non-verifiable"
                  tasks have verifiable rewards!
                  
   URI            [1]: https://runrl.com/blog/funniest-joke
   URI            [2]: https://arxiv.org/abs/2506.00103
       
                  -_- wrote 2 hours 31 min ago:
                  There needs to be some way of automatically assessing
                  performance on the task, though this could be with a Python
                  function or another LLM as a judge (or a combination!)
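
To make the last point concrete, here is a minimal sketch of what "a Python
function or another LLM as a judge (or a combination!)" could look like as a
reward function. This is not RunRL's API: the names format_reward,
combined_reward, stub_judge and judge_weight are all hypothetical, the
"ANSWER: <integer>" convention is an assumed task format, and the judge is a
plain callable standing in for a real LLM call.

```python
import re
from typing import Callable

# Assumed task format: the completion must end with a line "ANSWER: <integer>".
ANSWER_LINE = re.compile(r"ANSWER:\s*-?\d+")


def format_reward(completion: str) -> float:
    """Programmatic check: 1.0 if the last line matches the assumed format."""
    lines = completion.strip().splitlines()
    if not lines:
        return 0.0
    return 1.0 if ANSWER_LINE.fullmatch(lines[-1].strip()) else 0.0


def combined_reward(
    prompt: str,
    completion: str,
    judge: Callable[[str, str], float],
    judge_weight: float = 0.5,
) -> float:
    """Blend the programmatic check with a judge score, both in [0, 1]."""
    rule_score = format_reward(completion)
    if rule_score == 0.0:
        # Hard failure: skip the (usually more expensive) judge call.
        return 0.0
    # Clamp the judge output so a misbehaving judge cannot dominate the reward.
    judge_score = min(1.0, max(0.0, judge(prompt, completion)))
    return (1.0 - judge_weight) * rule_score + judge_weight * judge_score


def stub_judge(prompt: str, completion: str) -> float:
    """Hypothetical stand-in for an LLM judge: prefers concise completions."""
    return 1.0 if len(completion) < 200 else 0.3
```

Calling combined_reward("What is 2+2?", "2+2 is 4.\nANSWER: 4", stub_judge)
scores a well-formed, concise answer, while a completion that lacks the final
"ANSWER:" line gets 0.0 without the judge being consulted. Gating on the cheap
check first is one possible design choice; a weighted sum without the gate
would work too.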
       
       