_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
       
       
       COMMENT PAGE FOR:
   URI   Show HN: Magnitude – open-source, AI-native test framework for web apps
       
       
        retreatguru wrote 11 hours 43 min ago:
         Any advice about using AI to write test cases? For example, recording a
         video while using an app and converting that to test cases. Seems like
        it should work.
       
          anerli wrote 7 hours 23 min ago:
           This is definitely top of mind for us! There are a lot of ways to
           potentially approach it. We want to make sure test case execution
           works really well, so our focus is there, but we also want to think
           about test case generation going forward. Recording a video,
           especially with small VLMs that can tokenize videos, would be super
           neat.
       
        o1o1o1 wrote 13 hours 50 min ago:
        Thanks for sharing, this looks interesting.
        
        However, I do not see a big advantage over Cypress tests.
        
        The article mentions shortcomings of Cypress (and Playwright):
        
        > They start a dev server with bootstrapping code to load the component
        and/or setup code you want, which limits their ability to handle
        complex enterprise applications that might have OAuth or a complex
        build pipeline.
        
        The simple solution is to containerise the whole application (including
        whatever OAuth provider is used), which then allows you to simply
        launch the whole thing and then run the tests. Most apps (especially in
         enterprise) should already be containerised anyway, so most of the
         time we can just go ahead and run any tests against them.
        
        How is SafeTest better than that when my goal is to test my application
        in a real world scenario?
       
        arendtio wrote 15 hours 7 min ago:
        It looks pretty cool. One thing that has bothered me a bit with
        Playwright is audio input. With modern AI applications, speech
        recognition is often integrated, but with Playwright, using voice as an
        input does not seem straightforward. Given that Magnitude has an AI
        focus, adding a feature like that would be great:
        
          test('can log in and see correct settings')
            .step('log in to the app')
              .say('my username is user@example.com')
       
          anerli wrote 7 hours 25 min ago:
           Huh, that's an interesting use case. Yeah, using an AI-driven system
           definitely opens up some cool possibilities that aren't possible
           with Playwright alone. Would be curious to hear more about what
          you’re trying to test this way with audio.
       
        chrisweekly wrote 16 hours 42 min ago:
        This looks pretty cool, at least at first glance. I think "traditional
        web testing" means different things to different people. Last year, the
         Netflix engineering team published "SafeTest" [1], an interesting
         hybrid/superset of unit and e2e testing. Have you guys (Magnitude devs)
        considered incorporating any of their ideas?
        
        1.
        
   URI  [1]: https://netflixtechblog.com/introducing-safetest-a-novel-appro...
       
          anerli wrote 15 hours 26 min ago:
          Looks cool! Thanks for sharing! The idea of having a hybrid framework
          for component unit testing + end to end testing is neat. Will
           definitely consider how this might be applicable to Magnitude.
       
        aoeusnth1 wrote 19 hours 30 min ago:
        Why not make the strong model compile a non-ai-driven test execution
        plan using selectors / events? Is Moondream that good?
       
          anerli wrote 16 hours 29 min ago:
          Definitely a good question. Using an actual LLM as the execution
          layer allows us to more easily swap to the planner agent in the case
           that the test needs to be adapted. We don't want to store just a
           selector-based test because it's difficult to determine when it
           requires adaptation, and it's inherently more brittle to subtle UI
           changes. We think using a tiny model like Moondream makes this cheap
           enough that these benefits outweigh an approach where we cache actual
           Playwright code.
       
        sergiomattei wrote 1 day ago:
        Hi, this looks great! Any plans to support Azure OpenAI as a backend?
       
          anerli wrote 1 day ago:
           Hey! We can add this pretty easily! We find that Gemini 2.5 Pro works
          the best as the planner model by a good margin, but we definitely
          want to support a variety of providers. I'll keep this in mind and
          implement soon!
          
          edit: tracking here
          
   URI    [1]: https://github.com/magnitudedev/magnitude/issues/6
       
        SparkyMcUnicorn wrote 1 day ago:
        This is pretty much exactly what I was going to build. It's missing a
        few things, so I'll either be contributing or forking this in the
        future.
        
        I'll need a way to extract data as part of the tests, like screenshots
         and page content. This will allow supplementing the tests with
         non-Magnitude features, as well as adding things that are a bit more
         deterministic. Assert that the added todo item exactly matches what was
        used as input data, screenshot diffs when the planner fallback came
        into play, execution log data, etc.
        
        This isn't currently possible from what I can see in the docs, but
        maybe I'm wrong?
        
         It'd also be ideal if it had an LLM-free executor mode to reduce costs
         and increase speed (caching outputs, or maybe using the accessibility
         tree instead of a VLM), and also to fit requirements where the planner
         should not automatically kick in.
       
          anerli wrote 1 day ago:
          Hey, awesome to hear! We are definitely open to contributions :)
          
           We plan to (very soon) enable mixing standard Playwright or other
           code in between Magnitude steps, which should let you do exact
           assertions or anything else you want to do.
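
           As a rough sketch, that mixing might look something like this (the
           .custom() hook is hypothetical, not a committed API; page is assumed
           to be a regular Playwright Page):
           
           test('can add a todo')
             .step('add a todo called "buy milk"')
             .custom(async ({ page }) => {
               // exact, deterministic assertion with plain Playwright
               const item = page.getByText('buy milk');
               if (!(await item.isVisible())) {
                 throw new Error('todo item not rendered exactly as entered');
               }
             })
             .step('mark the todo as done');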
          
           Definitely understand the need to reduce costs / increase speed,
           which we think will mainly be enabled by our plan-caching system,
           where cached plans get executed by Moondream (a 2B model). Moondream
           is very fast and also has self-hosted options. However, there's no
           reason we couldn't have an option to generate pure Playwright for
           people who would prefer that instead.
          
           We have a Discord as well if you'd like to easily stay in touch about
          contributing:
          
   URI    [1]: https://discord.gg/VcdpMh9tTy
       
        pandemic_region wrote 1 day ago:
        Bang me sideways, "AI-native" is a thing now? What does that even mean?
       
          anerli wrote 1 day ago:
           Well, yeah, it's kind of ambiguous; it's just our way of saying that
           we're trying to use AI to make testing easier!
       
          mcbuilder wrote 1 day ago:
           Upon first hearing it, it definitely means something, probably an
           app designed around being interacted with by an LLM. Browser
           interaction is one of those things that is a great killer app for
           LLMs IMO.
          
          For instance, I just discovered there are a ton of high quality scans
          of film and slides available at the Library of Congress website, but
          I don't really enjoy their interface. I could build a scraping tool
           and get too much info, or suffer through just clicking around their
           search UI. Or I could ask my browser-tool-wielding LLM agent to
          automate the boring stuff and provide a map of the subjects I would
          be interested in, and give me a different way to discover things.
           I've just discovered the entire browser automation thing, and I'm
           having fun having my LLM go "research" for a few minutes while I go
           do something else.
       
          Alifatisk wrote 1 day ago:
          Had to look it up too!
          
   URI    [1]: https://www.studioglobal.ai/blog/what-is-ai-native
       
        jcmontx wrote 1 day ago:
         Does it only work for Node projects? Can I run it against a staging
         environment without mixing it with my project?
       
          anerli wrote 1 day ago:
           You can run it against any URL, not just Node projects! You'll still
           need a skeleton Node project for the actual Magnitude tests, but you
          could configure some other public or staging URL as the target site.
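
           For example, something like this in the project config (a sketch;
           the exact key names are an assumption, so check the docs):
           
           // magnitude.config.ts
           export default {
             // any reachable URL works: local dev, a staging env, or prod
             url: 'https://staging.example.com',
           };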
       
        dimal wrote 1 day ago:
        > Pure vision instead of error prone "set-of-marks" system (the
        colorful boxes you see in browser-use for example)
        
         One benefit of not using pure vision is that it's a strong signal to
         developers to make pages accessible. This would let them off the hook.
        
        Perhaps testing both paths separately would be more appropriate. I
        could imagine a different AI agent attempting to navigate the page
        through accessibility landmarks. Or even different agents that simulate
        different types of disabilities.
       
          anerli wrote 1 day ago:
           Yeah, good criticism for sure. We definitely want to keep this in
           mind as we continue to build. Some kind of accessibility test that
           runs in parallel with each visual test and is only allowed to use
           the accessibility tree could make it much easier for developers to
           identify how to address different accessibility concerns.
       
        tobr wrote 1 day ago:
        Interesting! My first concern is - isn’t this the ultimate
        non-deterministic test? In practice, does it seem flaky?
       
          anerli wrote 1 day ago:
          So the architecture is built with determinism in mind. The
          plan-caching system is still a work in progress, but especially once
          fully implemented it should be very consistent. As long as your
          interface doesn't change (or changes in trivial ways), Moondream
          alone can execute the same exact web actions as previous test runs
          without relying on any DOM selectors. When the interface does
          eventually change, that's where it becomes non-deterministic again by
          necessity, since the planner will need to generatively update the
          test and continue building the new cache from there. However once
          it's been adapted, it can once again be executed that way every time
          until the interface changes again.
       
            engfan wrote 9 hours 36 min ago:
            Anerli wrote: “When the interface does eventually change, that's
            where it becomes non-deterministic again by necessity, since the
            planner will need to generatively update the test and continue
            building the new cache from there.”
            
             But what determines that the UI has changed for a specific URL? Is
             it your software, independent of the planner LLM, or do you
             require the visual LLM to make a determination of change?
            
             You should also stop saying 100% open source when test plan
             generation and execution depend on non-open-source AI components.
             It just doesn't make sense.
       
              anerli wrote 7 hours 18 min ago:
               The small VLM (Moondream) decides when the interface changes /
               its actions no longer line up.
              
              We say 100% open source because all of our code (test runner and
              AI agents) is completely open source. It’s also completely
               possible to run an entire OSS stack because you can configure an
               open-source planner LLM, and Moondream is open source.
               You could even run it all locally if you have solid hardware.
       
            daxfohl wrote 1 day ago:
            In a way, nondeterminism could be an advantage. Instead of using
            these as unit tests, use them as usability tests. Especially if you
            want your site to be accessible by AI agents, it would be good to
            have a way of knowing what tweaks increase the success rate.
            
            Of course that would be even more valuable for testing your MCP or
            A2A services, but could be useful for UI as well. Or it could be
            useless. It would be interesting to see if the same UI changes
            affect both human and AI success rate in the same way.
            
             And if not, could an AI be trained to correlate more closely with
             human behavior? That could be a good selling point if possible.
       
              anerli wrote 1 day ago:
              Originally we were actually thinking about doing exactly this and
              building agents for usability testing. However, we think that
               LLMs are much better suited for tackling well-defined tasks
              rather than trying to emulate human nuance, so we pivoted to
              end-to-end testing and figuring out how to make LLM browser
              agents act deterministically.
       
        NitpickLawyer wrote 1 day ago:
        > The idea is the planner builds up a general plan which the executor
        runs. We can save this plan and re-run it with only the executor for
        quick, cheap, and consistent runs. When something goes wrong, it can
        kick back out to the planner agent and re-adjust the test.
        
         I've recently been thinking about testing/QA with VLMs + LLMs. One
         area that I haven't seen explored (but should 100% be feasible) is to
         have the first run be LLM + VLM, and then have the LLM(s?) write
         repeatable "cheap" tests with traditional libraries (Playwright,
         Puppeteer, etc.). On every run you do the "cheap" traditional checks;
         if any fail, go with the LLM + VLM again and see what broke, and only
         fail the test if both fail. Makes sense?
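
         Roughly, that flow could look like this (a sketch with hypothetical
         helper names, not any particular library's API):
         
         type AgentResult = { passed: boolean; trace: unknown };
         declare function runCheapPlaywrightTest(name: string): Promise<boolean>;
         declare function runAgentTest(name: string): Promise<AgentResult>;
         declare function regenerateCheapTest(name: string, trace: unknown): Promise<void>;
         
         // Cheap deterministic check first; fall back to the LLM + VLM agent,
         // and only fail the test if both fail.
         async function runHybridTest(name: string): Promise<boolean> {
           if (await runCheapPlaywrightTest(name)) return true; // selectors/events
           const agent = await runAgentTest(name);              // LLM + VLM rerun
           if (agent.passed) {
             // UI changed but the flow still works: regenerate the cheap test
             await regenerateCheapTest(name, agent.trace);
             return true;
           }
           return false; // both failed -> likely a real regression
         }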
       
          tomatohs wrote 22 hours 15 min ago:
          This is exactly our workflow, though we defined our own YAML spec [1]
          for reasons mentioned in previous comments.
          
           We have multiple fallbacks to prevent flakes: the "cheap" command, a
           description of the intended step, and the original prompt.
          
          If any step fails, we fall back to the next source.
          
          1.
          
   URI    [1]: https://docs.testdriver.ai/reference/test-steps
       
          anerli wrote 1 day ago:
           So this is a path that we definitely considered. However, we think
           it's a half-measure to generate actual Playwright code and just run that.
          Because if you do that, you still have a brittle test at the end of
          the day, and once it breaks you would need to pull in some LLM to try
          and adapt it anyway.
          
          Instead of caching actual code, we cache a "plan" of specific web
          actions that are still described in natural language.
          
          For example, a cached "typing" action might look like:
          {
              variant: 'type';
              target: string;
              content: string;
          }
          
          The target is a natural language description. The content is what to
          type.
          Moondream's job is simply to find the target, and then we will click
          into that target and type whatever content.
          This means it can be full vision and not rely on DOM at all, while
          still being very consistent. Moondream is also trivially cheap to run
          since it's only a 2B model.
           If it can't find the target or its confidence changes significantly
          (using token probabilities), it's an indication that the action/plan
          requires adjustment, and we can dynamically swap in the planner LLM
          to decide how to adjust the test from there.
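
           Replaying a cached action then looks roughly like this (a sketch
           with hypothetical helper names and a simplified confidence check):
           
           import type { Page } from 'playwright';
           
           interface TypeAction {
             variant: 'type';
             target: string;   // natural-language description of the element
             content: string;  // text to type
           }
           
           declare function locateWithMoondream(page: Page, target: string):
             Promise<{ x: number; y: number; confidence: number }>;
           declare function requestPlanAdjustment(action: TypeAction): Promise<void>;
           
           async function replay(action: TypeAction, page: Page) {
             // ground the natural-language target to pixel coordinates
             const { x, y, confidence } = await locateWithMoondream(page, action.target);
             if (confidence < 0.8) {
               // plan no longer matches the UI -> hand back to the planner LLM
               return requestPlanAdjustment(action);
             }
             await page.mouse.click(x, y);             // pure vision, no DOM selectors
             await page.keyboard.type(action.content);
           }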
       
            ekzy wrote 1 day ago:
            Did you consider also caching the coordinates returned by
             Moondream? I understand that it is cheap, but it could be useful to
             detect if an element has changed position, as it may be a regression.
       
              anerli wrote 1 day ago:
               So the problem is if we cache the coordinates and click blindly
               at the saved positions, there's no way to tell if the interface
               changes or if we are actually clicking the wrong things (unless
               we try and do something hacky like listen for events on the DOM).
               Detecting whether elements have changed position, though, would
               definitely be feasible when re-running a test with Moondream; we
               could compare against the coordinates of the last run.
       
                chrisweekly wrote 16 hours 57 min ago:
                sounds a lot like snapshot testing
       
        badmonster wrote 1 day ago:
        How does Magnitude differentiate between the planner and executor LLM
        roles, and how customizable are these components for specific test
        flows?
       
          anerli wrote 1 day ago:
          So the prompts that are sent to the planner vs executor are
          completely distinct. We allow complete customization of the planner
          LLM with all major providers (Anthropic, OpenAI, Google AI Studio,
          Google Vertex AI, AWS Bedrock, OpenAI compatible).
          The executor LLM on the other hand has to fit very specific criteria,
           so we only support the Moondream model right now. For a model to act
           as the executor, it needs to be able to target specific pixel
           coordinates (only a few models support this, for example
           OpenAI/Anthropic computer use, Molmo, Moondream, and some others). We
           like Moondream because it's super tiny and fast (2B). This means as
           long as we still have a "smart" planner LLM, we can have very
           fast/cheap execution and precise UI interaction.
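
           As a sketch, that split might be configured something like this
           (the key names here are an assumption, not the exact schema):
           
           export default {
             planner: {
               provider: 'google-ai',    // or Anthropic, OpenAI, Bedrock, ...
               model: 'gemini-2.5-pro',  // "smart" model that devises the plan
             },
             executor: {
               provider: 'moondream',    // tiny model that outputs pixel coords
               apiKey: process.env.MOONDREAM_API_KEY,
             },
           };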
       
            badmonster wrote 1 day ago:
            does Moondream handle multi-step UI tasks reliably (like opening a
            menu, waiting for render, then clicking), or do you have to
            scaffold that logic separately in the planner?
       
              anerli wrote 1 day ago:
              The planner can plan out multiple web actions at once, which
              Moondream can then execute in sequence on its own. So Moondream
              is never deciding how to execute more than one web action in a
              single prompt.
              
              What this really means for developers writing the tests is you
              don't really have to worry about it. A "step" in Magnitude can
              map to any number of web actions dynamically based on the
              description, and the agents will figure out how to do it
              repeatably.
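
               For example (illustrative only, not actual planner output), a
               step like "create a new task called Demo" might get planned
               into something like:
               
               [
                 { variant: 'click', target: 'New Task button' },
                 { variant: 'type', target: 'task name input', content: 'Demo' },
                 { variant: 'click', target: 'Save button' },
               ]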
       
        grbsh wrote 1 day ago:
         I know Moondream is cheap / fast and can run locally, but is it good
        enough? In my experience testing things like Computer Use, anything but
        the large LLMs has been so unreliable as to be unworkable. But maybe
        you guys are doing something special to make it work well in concert?
       
          anerli wrote 1 day ago:
          So it's key to still have a big model that is devising the overall
          strategy for executing the test case. Moondream on its own is pretty
          limited and can't handle complex queries. The planner gives very
          specific instructions to Moondream, which is just responsible for
          locating different targets on the screen. It's basically just the
          layer between the big LLM doing the actual "thinking" and grounding
          that to specific UI interactions.
          
           Where it gets interesting is that we can save the execution plan
          that the big model comes up with and run with ONLY Moondream if the
          plan is specific enough. Then switch back out to the big model if
          some action path requires adjustment. This means we can run repeated
          tests much more efficiently and consistently.
       
            grbsh wrote 1 day ago:
            Ooh, I really like the idea about deciding whether to use the big
            or small model based on task specificity.
       
              tough wrote 1 day ago:
              You might like
              
   URI        [1]: https://pypi.org/project/llm-predictive-router/
       
                anerli wrote 1 day ago:
                Oh this is interesting. In our case we are being very specific
                about which types of prompts go where, so the planner
                essentially creates prompts that will be executed by Moondream,
                instead of trying to route prompts generally to the appropriate
                model. The types of requests that our planner agent vs
                Moondream can handle are fundamentally different for our use
                case.
       
                  tough wrote 1 day ago:
                   Interesting, will check out yours. I'm mostly interested in
                   these dynamic routers so I can mix local and API-based models
                   depending on needs; I cannot run some models locally, but
                   most of the tasks don't even require such power (for
                   building AI agentic systems).
                  
                  there's also [1] and other similar
                  
                   I guess your system is not as oriented toward open-ended
                   tasks, so you can just build workflows deciding which model
                   to use at each step; these routing mechanisms are more
                   useful for open-ended tasks that don't fit into a workflow
                   so well (maybe?)
                  
   URI            [1]: https://github.com/lm-sys/RouteLLM
       
       