_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   DIR   Show HN: Arch-Router – 1.5B model for LLM routing by preferences, not benchmarks
       
       
        pseudosavant wrote 14 hours 57 min ago:
        Not that LLMs are terribly latency sensitive (you wait on a lot of
        tokens), but what kind of latency impact does this have on requests
        that go through the proxy?
       
          adilhafeez wrote 14 hours 34 min ago:
           Short answer: the latency impact is minimal.
           
           We use Envoy as the request handler, which forwards requests to a
           local service written in Rust. Envoy is proven to be
           high-performance, low-latency, and highly efficient at request
           handling. If I had to put a number on it, it would be single-digit
           milliseconds per request. I will have a more detailed benchmark in
           the coming days.
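           
           For reference, a rough way to measure the end-to-end number
           yourself (a sketch, assuming the gateway exposes an
           OpenAI-compatible endpoint on localhost:12000; adjust host, port,
           and model name to your deployment, and compare against calling
           the upstream LLM directly to isolate the proxy's share):
           
             import time
             import requests
             
             URL = "http://localhost:12000/v1/chat/completions"
             payload = {
                 "model": "arch-router",  # illustrative model name
                 "messages": [{"role": "user", "content": "ping"}],
                 "max_tokens": 1,
             }
             
             samples = []
             for _ in range(20):
                 t0 = time.perf_counter()
                 r = requests.post(URL, json=payload, timeout=30)
                 r.raise_for_status()
                 samples.append((time.perf_counter() - t0) * 1000)
             
             samples.sort()
             # 10th and 19th of 20 sorted samples ~ p50 and p95
             print(f"p50={samples[9]:.1f} ms  p95={samples[18]:.1f} ms")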
       
          cotran2 wrote 14 hours 40 min ago:
           The model is a compact 1.5B, so most GPUs can serve it locally,
           and it has <100ms e2e latency. On an L40S, it's 50ms.
       
        _nh_ wrote 15 hours 51 min ago:
        How do you compare with RouteLLM?
       
          cotran2 wrote 14 hours 41 min ago:
          There is a case study comparing with RouteLLM in the appendix.
       
          sparacha wrote 15 hours 8 min ago:
          RouteLLM is essentially a benchmark-driven approach. Their framework
          chooses between a weak and a strong model and helps developers
          optimize for a metric called APGR (Average Performance Gap Recovered)
          — a measure of how much of the stronger model’s performance can
          be recovered when routing some queries to the weaker, cheaper model.
          However, their routing models are trained to maximize performance on
          public benchmarks like MMLU, BBH, or MT-Bench. These benchmarks may
          not capture subjective, domain-specific quality signals that surface
          in practice.
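           
           For intuition, the per-setting metric behind APGR (PGR, from the
           RouteLLM paper) can be sketched like this (variable names are
           mine):
           
             def pgr(router_score, weak_score, strong_score):
                 # 1.0 -> the router matched the strong model's quality;
                 # 0.0 -> it did no better than the weak, cheap model.
                 return (router_score - weak_score) / (strong_score - weak_score)
             
             # APGR averages this recovery across evaluation settings to
             # summarize a router's quality/cost trade-off in one number.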
          
           Arch-Router takes a different approach. Instead of focusing on
           benchmark scores, it lets developers define routing policies in
           plain language based on their preferences, like “contract
           analysis → GPT-4o” or “lightweight brainstorming → Gemini
           Flash.” Our 1.5B model learns to map prompts (along with
           conversational context) to these policies, enabling routing
           decisions that align with real-world expectations, not abstract
           leaderboards. Our approach also doesn't require retraining the
           router model when new LLMs are swapped in or when preferences
           change.
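           
           To make that concrete, here is a minimal, unofficial sketch of
           calling the model with transformers (the exact prompt format is
           documented on the model card; the route names, descriptions, and
           instruction text below are illustrative):
           
             import json
             from transformers import AutoModelForCausalLM, AutoTokenizer
             
             model_id = "katanemo/Arch-Router-1.5B"
             tokenizer = AutoTokenizer.from_pretrained(model_id)
             model = AutoModelForCausalLM.from_pretrained(
                 model_id, device_map="auto"
             )
             
             # Plain-language routing policies, defined by the developer.
             routes = {
                 "contract_analysis": "review or analyze legal contracts",
                 "brainstorming": "lightweight, open-ended idea generation",
             }
             policy_to_llm = {  # illustrative model assignments
                 "contract_analysis": "gpt-4o",
                 "brainstorming": "gemini-flash",
             }
             
             messages = [
                 {"role": "system", "content": "Select the best route for "
                  "the user query. Routes: " + json.dumps(routes)},
                 {"role": "user", "content": "Flag risky clauses in this NDA."},
             ]
             inputs = tokenizer.apply_chat_template(
                 messages, add_generation_prompt=True, return_tensors="pt"
             ).to(model.device)
             out = model.generate(inputs, max_new_tokens=32)
             route = tokenizer.decode(
                 out[0][inputs.shape[-1]:], skip_special_tokens=True
             ).strip()
             print(route, "->", policy_to_llm.get(route, "default"))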
          
          Hope this helps.
       
        jgant13 wrote 17 hours 28 min ago:
         Solid. Can you show us when to use this vs., say, OpenRouter? The
         performance seems strong for sure. TIA.
       
          sparacha wrote 16 hours 28 min ago:
          Arch is developer friendly, but designed for enterprise-grade
          customers in mind. The core contributors of Envoy redesigned the
          proxy substrate to handle prompts - offering something that is battle
          tested in terms of resiliency, speed, and deployments. Second,
          OpenRouter offers choice of models, but dynamically routing to LLMs
          based on user-defined usage policies is uniquely available in Arch.
          Hope that helps
       
        jedisct1 wrote 17 hours 59 min ago:
         I tried to use it to rate the difficulty level of coding tasks (for
         InferSwitch, an LLM router), but it performed far worse than
         Qwen2.5-Coder-7B (though granted, that's 1.5B vs 7B).
       
          cotran2 wrote 17 hours 33 min ago:
          According to the post, the model is fine-tuned for routing to
          different tasks/domains. Classifying difficulty level is probably not
          the intended use case.
       
          sparacha wrote 17 hours 44 min ago:
          Can you share more about your evaluation setup? I would love to see
          the specific usage pattern as we have tested our model against
          smaller LLMs and foundational models and our results show things
          differently. Of course, routing policies should follow best practices
          here: [1] Nonetheless, super curious to learn more and see what we
          may be able to improve. This is technically not a classifier model -
          its a usage prediction model (feels like a classifier, but not quite
          in terms of intended usage)
          
   URI    [1]: https://docs.archgw.com/guides/llm_router.html
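           
           To illustrate the difference, a hypothetical contrast (the linked
           guide has the actual guidance on writing policies):
           
             # Usage policies, the kind the router is trained to predict:
             routes_as_intended = {
                 "code_generation": "write new functions or scripts",
                 "code_understanding": "explain or summarize existing code",
             }
             
             # Ordinal difficulty grading is a classification task, which
             # falls outside the intended routing use case:
             routes_off_label = {
                 "easy": "a simple coding task",
                 "hard": "a complex coding task",
             }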
       
        tmaly wrote 19 hours 17 min ago:
         Do you think it would be possible to quantize this model and still
         get good results?
       
          sparacha wrote 19 hours 15 min ago:
           Yes, we have already published a quantized version here: [1]. The
           performance difference with the quantized version is negligible.
           I'll run another analysis and update the thread shortly.
          
   URI    [1]: https://huggingface.co/katanemo/Arch-Router-1.5B.gguf
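           
           To try the GGUF build locally, a minimal sketch with
           llama-cpp-python (the quant filename pattern is illustrative;
           check the repo's file list for the actual names):
           
             from llama_cpp import Llama
             
             llm = Llama.from_pretrained(
                 repo_id="katanemo/Arch-Router-1.5B.gguf",
                 filename="*Q4_K_M.gguf",  # glob; assumes a 4-bit quant
                 n_ctx=4096,
             )
             out = llm.create_chat_completion(
                 messages=[{"role": "user",
                            "content": "Summarize this contract."}],
                 max_tokens=32,
             )
             print(out["choices"][0]["message"]["content"])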
       
            sparacha wrote 17 hours 33 min ago:
             Overall performance degrades from 93.17 to 92.99 with the
             quantized version.
       
        sparacha wrote 21 hours 42 min ago:
        Hi HN! I am one of the co-authors of the paper. If there are any
        questions about our approach, I would love to answer them.
       
       
   DIR <- back to front page