_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
   URI Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
   URI   OpenAI's new open-source model is basically Phi-5
       
       
        orbital-decay wrote 48 min ago:
         I gave it a random sci-fi novel and made it translate a chapter, which
         is something I do with all models. It refused to discuss minors in
         sexualized contexts. I was like W.T.F.?! and started bisecting the
         book, trying to find the piece that triggers this. Turns out there
         was an absolutely innocent, two-sentence-long romantic remark
         involving two secondary 17-year-old characters in an unrelated place.
        
         Another issue is that it has occasional refusals and total meltdowns
         where it redacts entire paragraphs with placeholder characters, even
         when you're just casually talking with it about routine life matters.
        
         That's ridiculous and makes that model garbage at any form of creative
         writing (including translation) or real-life tasks other than math or
         coding. It has very poor knowledge for a 120B MoE. If you look at the
        "reasoning" it does, it actually mostly checks the request against the
        policy.
        
        I thought they must have spent most of their post-training hunting the
        wrongthink and dumbing the model down as a result, but I can see how
        the synthetic pretraining data can explain this.
       
          ViktorRay wrote 19 min ago:
          I wonder what would happen if you asked this model to interact with A
          Song of Ice and Fire then.
       
          spacecadet wrote 20 min ago:
           It's a public consumer-facing model, so I'm not surprised. Go find
           an unaligned model that will better produce the content you seek...
       
        egorfine wrote 2 hours 3 min ago:
        > the main use-case for fine-tuning small language models is for erotic
        role-play
        
        Of course. This is indeed the fact. /s
        
        See that "/s"? I did write it here, but it's suspiciously absent in the
        original text. Almost makes you think it was typed in all seriousness.
       
        anshumankmr wrote 3 hours 39 min ago:
         The main aim of Phi-3 Mini was to be able to run on-device, and it had
         tremendous speed. For a 128K context with ~3B params, it's pretty damn
         good. I used it for a project myself last year, but ultimately we went
         with Mistral's models, which at the time were the best open-weights
         models.
       
        wkat4242 wrote 8 hours 11 min ago:
        From the article:
        
        > For the same reason that Microsoft probably continued to train
        Phi-style models: safety. Releasing an open-source model is terrifying
        for a large organization. Once it’s out there, your name is
        associated with it forever, and thousands of researchers will be
        frantically trying to fine-tune it to remove the safety guardrails.
        
        I don't think this is really an issue in practice. Llama 2 and 3 were
        uncensored within a week. There's no bad press about this.
        
         What does give a company a bad reputation is crap models. The Llama 4
         disappointment hurt Meta's AI reputation a lot more than some
         community uncensoring.
       
          CjHuber wrote 3 hours 19 min ago:
           If I think about Llama, I think about uncensored versions. Not that
           I ever used one, but there were not many use cases for censored
           Llama when others were so much better at other things.
       
          teruakohatu wrote 7 hours 13 min ago:
          > researchers will be frantically trying to fine-tune it to remove
          the safety guardrails.
          
           It is a really weak excuse. They are more likely to take a
           reputation hit for having silly guardrails than for having someone
           remove them.
          
          Imagine if Bill Gates decided not to release MS Paint in 1985 because
          someone could have drawn something offensive with it.
       
            anshumankmr wrote 3 hours 37 min ago:
            Or imagine if Bill Gates decided to not release Comic Sans in the
            90s cause someone could have written something offensive with
            it....
            
            oh wait that wouldn't have been too bad
            
            (/S)
       
        KingOfCoders wrote 9 hours 4 min ago:
        No training data, no open source. Don't fall for the company PR.
       
          Mars008 wrote 8 hours 28 min ago:
          As long as it works who cares about training data. Obviously they
          can't open it for many reasons. License is one of them.
       
            KingOfCoders wrote 7 hours 17 min ago:
            I don't care if it's open source or not. I care if people call
            something open source which it isn't.
            
            Do you care about binary blobs in the kernel? No. Are binary blobs
            in the kernel open source? No.
            
            But it is tedious to go through the same discussion every 10 years,
            with a relentless industry that wants to dupe people.
            
             If there wasn't a benefit in it for them, they would not call it
             open source.
       
              Mars008 wrote 3 hours 44 min ago:
               For some reason people think of models as software and expect
               open source to have a similar meaning. There are fundamental
               differences: 1) models aren't reproducible even given
               everything: data, hardware, methodology. 2) they aren't even
               verifiable, i.e. given a model and a dataset it's impossible
               to say whether the model was trained on that data. 3) except
               for toys, models are trained on copyrighted data. Some of it
               is private, like users' chats. 4) besides data, there is a lot
               of human input after pretraining.

               This means that given everything, you have two options: 1)
               train a similar model yourself, or 2) trust the model
               provider. In software you can get a script and run it, or get
               code and compile it into exactly the same binaries.

               Naturally 'open source' has a different meaning here. Some are
               trying to monopolize it, as if they know the 'truth'. Others
               simply ignore it. Eventually we'll settle on something.
       
        refulgentis wrote 9 hours 30 min ago:
        This is really irresponsible, there's actual data on quality and this
        is just crazy to assert: "Any small online community for people who run
        local models is at least 50% perverts." --- I get what he's saying,
        there's def. communities where that's predominant, but it's simply not
         true in the vast majority of communities. Sort of like, in 1997,
         saying any small community on this here Internet thing is half
         perverts looking at porn.
        
        It's extremely frustrating to try and reply to this because that which
        is asserted without evidence can't really be debunked by evidence.
        
        I find it especially shameful that he's dragging someone's name into
        this wildly histrionic review, in service of trying to find a sole
        person to attribute to, the sole attribute that led to whatever
        experience he had with it.
        
        The model is insanely good, wildly exceeded my expectations for local
        models, and will generate at least 18 months of sustainable value. I
         maintain a llama.cpp wrapper and this is a quantum leap in quality. I
        despair that this will become a major source of people's opinions on
        it. We desperately need big companies actually investing here, Gemma
        ain't it, and pretending it doesn't work because ??? and then using it
        to create a corporate chickenshit narrative isn't exactly gonna help.
       
          ripped_britches wrote 7 hours 11 min ago:
          100%, what an insanely baseless article
       
        dweinus wrote 10 hours 6 min ago:
        Is it confirmed that synthetic data was used for gpt-oss training? I
        didn't pick up on that in the press release or see it elsewhere. Did I
        miss it or is Sean speculating that it is the case?
       
        klooney wrote 10 hours 32 min ago:
        > It’s not discussed publically very often, but the main use-case for
        fine-tuning small language models is for erotic role-play, and
        there’s a serious demand. Any small online community for people who
        run local models is at least 50% perverts.
        
        Amazing
       
          msgodel wrote 9 hours 25 min ago:
           Meh. For the first few decades consumer internet traffic was mostly
           porn. Stop freaking out and use the free effort people are willing
           to put in to solve technical problems.
       
        RandyOrion wrote 12 hours 41 min ago:
         I mean, yeah. From Table 9: Hallucination evaluations in the GPT-OSS
         model card [1], GPT-OSS-20b/120b have accuracy of 0.067/0.168 and
         hallucination rates of 0.914/0.782 respectively, while o4-mini has
         accuracy of 0.234 and a hallucination rate of 0.750. These numbers
         simply mean that GPT-OSS models have little real-world knowledge, and
         they hallucinate hard. Note that little real-world knowledge has
         always been a "feature" of the Phi-LLM series because of the "safety"
         (for large companies), or rather, "censorship" (for users)
         requirements.
        
        In addition, from Table 4: Hallucination evaluations in OpenAI o3 and
        o4-mini System Card [2], o3/o4-mini have accuracy of 0.49/0.20 and
        hallucination rate of 0.51/0.79.
        
         In summary, there is a significant real-world knowledge gap between
         o3 and o4-mini, and another significant gap between o4-mini and
         GPT-OSS. Besides, the poor real-world knowledge exhibited in GPT-OSS
         is aligned with the "feature" of the Phi-LLM series. [1]
        
   URI  [1]: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac763...
   URI  [2]: https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c...
       
        pogue wrote 13 hours 31 min ago:
         Is it true that most small language models are fine-tuned for erotic
         role-play?
       
          int_19h wrote 11 hours 23 min ago:
          What you wrote is somewhat ambiguous, so allow me to rephrase. It is
          true that most fine-tunes of relatively small (which can mean
          anything up to 150B params, depending on who you ask!) LLMs are for
          uncensored roleplay purposes.
       
        wmf wrote 15 hours 32 min ago:
        I saw a bunch of people complaining on Twitter about how GPT-OSS can't
        be customized or has no soul and I noticed that none of them said what
        they were trying to accomplish.
        
        "The main use-case for fine-tuning small language models is for erotic
        role-play, and there’s a serious demand."
        
        Ah.
       
          throwaway98797 wrote 1 hour 23 min ago:
          i tried creating a meme about retards and it refused.
          
          i’m an adult.
       
            GTP wrote 1 hour 16 min ago:
            Why is being an adult relevant in any way?
       
          simion314 wrote 2 hours 56 min ago:
           We use the OpenAI API at work, and it fails when translating
           children's stories; the reason is violence. Either the model
           safety is shit or the AI companies are pushed by some extremist
           groups to censor shit that is acceptable for children in Europe
           (Romania). But the most bullshit part is when you give it a safe
           prompt, the model generates a response, and the safety checker
           kicks in and blocks the response because it thinks the model was
           too naughty.
       
          asdffdasy wrote 3 hours 4 min ago:
           Why does everyone keep pretending very hard that this entire AI
           summer was not started exclusively by pioneers trying to perfect
           the virtual girlfriend? It's a fact.
       
            vintermann wrote 1 hour 44 min ago:
             Yes... Specifically by AI Dungeon. Its author got dialog-tuning
             working (not without some hiccups) before anyone else, and it
             drew a lot of attention from the API provider.

             But then people used it for erotic RP, and it became a PR
             disaster, and the author blamed his pervert customers. Never mind
             that certain characters who turned up a lot in AI Dungeon stories
             turned out to be from a fantasy writing site the author had used
             for fine-tuning material (without permission, of course), and
             that he hadn't filtered OUT the dirty stories, to put it that
             way.
       
            subscribed wrote 2 hours 7 min ago:
            Existing models are more than good enough.
       
          seydor wrote 5 hours 8 min ago:
           Why would a language company censor language like that? I literally
           don't see the purpose.
           Plenty of the best novels have explicit scenes. I certainly don't
           like novels about infantilized adults. Search engines don't do
           that, so why do we allow AI companies to do it so blatantly? It's
           basically culture engineering.
       
            subscribed wrote 2 hours 6 min ago:
            Because it's naughty! US is.... prudish, to say the least.
       
          numpad0 wrote 6 hours 19 min ago:
           Maybe I'm just not seeing it, but is that use case really real and
           not just prudish hallucination? The market for NSFW novels is
           smaller than even cyberpunk paperbacks; there's no way everyday
           people build up an addiction to an interactive version of it.
       
            michaelt wrote 1 hour 51 min ago:
            Openrouter provides a multi-model, multi-provider API for hosted
            LLMs. I myself used their API to compare different LLMs for
            document classification.
            
            They provide usage rankings [1] and the top 10 applications, in
            terms of tokens used, are:
            
            1. AI coding agent
            2. AI coding agent
            3. AI coding agent
            4. Library for calling LLMs
            5. Role play chat
            6. Role play chat
            7. Role play chat
            8. General purpose chat
            9. AI coding agent
            10. Role play chat
            
            Certainly, the coding agents are burning through far more tokens.
            No doubt about that.
            
             And there's undoubtedly a major bias introduced by the fact that
             ChatGPT and Claude $20/month accounts are heavily discounted, so
             if your use is SFW why pay more to get an uncensored model?
            
            But overall, to me, the evidence seems pretty robust.
            
   URI      [1]: https://openrouter.ai/rankings
       
            Balinares wrote 2 hours 28 min ago:
            Not quite. The market for romance (which can, in fact, get
            arbitrary degrees of spicy) is by far the largest literary market.
            (Source: [1] )
            
            LLMs are also hilariously bad at what makes erotica hot in the
            first place.
            
   URI      [1]: https://bookadreport.com/book-market-overview-authors-stat...
       
            eddythompson80 wrote 5 hours 16 min ago:
             Not NSFW, but I was surprised to hear two random 20-something
             adults tell me they use ChatGPT to write fan fiction to read.
             Those two people didn't know each other but were from a similar
             demographic. It was surprising nonetheless. I can't imagine why
             anyone would care about such a bizarre use case.
       
            amanaplanacanal wrote 5 hours 56 min ago:
            I suspect NSFW novels is a bigger market than you think. Spicy
            romantasy is popular among a certain set of readers.
       
            cess11 wrote 6 hours 0 min ago:
            The claim that Fifty Shades of Grey and, I don't know, Norman
            Mailer?, Chuck Tingle?, are less common on book shelves and night
            stands than cyberpunk paperbacks seems obviously wrong to me.
            
            Perhaps you could elucidate further on this subject? I'm mostly
            into books from 1800-1985 or so and don't know much about
            contemporary literary fashion.
            
            Edit: Jean M Auel was extremely common in occidental households a
            few decades ago, especially the first and second books about Ayla,
            I'd wager much, much more common than cyberpunk.
            
            Same goes for books by Alex Comfort.
       
          nickpsecurity wrote 8 hours 2 min ago:
          Most use of small models was erotic roleplay back when I closely
          followed r/LocalLlama. Made me wonder if I should even make one if
          that's what most used it for.
          
          You might find this part funny. At first, I thought they had
          automated their coding for SAP or other ERP databases. Then, they
          started talking about how realistic the body parts were. I paused
          staring at the screen. The sad reality clicked.
       
          wkat4242 wrote 8 hours 43 min ago:
          It's not just erotic role play that the censorship affects. My life
          involves a lot of sexual discussions and that means that everyday
          talk, chat summaries, email rewrites or translations will cause the
          model to shut down. I do the latter a lot especially to find
          colloquialisms because Google translate is often too literal. It's so
          annoying.
          
          Right now I'm using abliterated llama 3.1. I have no need for vision
          but I want to use the saved memory for more context so 3.2 is not so
          relevant. Llama 3.1 is perfect. But I want to try newer models too.
          
          Until gpt-oss can be uncensored it's no use to me. But if there was
          nothing erotic in its training data it can't be. And no, I never have
          it do erotic roleplay. I'm not really interested when there's no real
          people involved.
       
            Alex-Programs wrote 1 hour 3 min ago:
            I'm curious whether you can get [1] (my deep translation service;
            I'm just polishing it before release) to run into "too sexual".
            It's designed to be resilient to that kind of thing, and it works
            well in my testing, but maybe you have some better "stress test"
            content than me!
            
   URI      [1]: https://platform.nuenki.app
       
            Havoc wrote 1 hour 37 min ago:
            Pretty sure I saw an uncensored version yesterday on localllama
       
            Clueed wrote 3 hours 26 min ago:
            I found DeepSeek R1 (better for questions) and V3 (better for
            prose) to be very willing to discuss sex with a simple system
            prompt, as well as being very pleasant in articulation.
             I guess I prefer them because they are almost SOTA and very large.
            
            Not through the official interface though. Needs to be hosted by a
            third party.
            OpenRouter has a generous free tier for both.
            
            I just saw that there is an abliterated version as well. Not sure
            how to try it though.
       
            jofzar wrote 4 hours 4 min ago:
            Sorry just to ask, what kind of job do you have? Sex therapist
            sounds like the closest?
       
              itsdesmond wrote 3 hours 32 min ago:
              I don’t think they’re using it for work.
       
          sterlind wrote 9 hours 52 min ago:
          it's not erotic role-play, but I have a use case of making an
          AI-powered NetHack clone. specifically, to generate dungeon layouts,
          dialog for NPCs and to fill in the boatloads of minutae and
          interactions which NetHack is famous for.
          
          you kind of need soul for that, and a lot of background knowledge on
          mythology/fantasy lore, but also tool use to work the world systems.
       
            katzenversteher wrote 1 hour 58 min ago:
            I've been experimenting with using various LLMs as a game master
            for a Lovecraft-inspired role-playing game (not baked into an
            application, just text-based by prompting). While the LLMs can
            generate scenarios that fit the theme, they tend to be very
            generic. I've also noticed that the models are extremely
            susceptible to suggestion. For example, in one scenario, my
            investigator was in a bar, and when I commented to another patron,
            'Hey, doesn't the barkeeper look a little strange?', the LLM
            immediately seized on that and turned the barkeeper into an evil,
            otherworldly creature. This behavior was consistent across all the
             models I tested. Maybe prompting the LLM to fully plan the
             scenario in advance and then adhere to that plan would mitigate
             the behavior, but I haven't tried it. It was just an experiment
             and I actually had a lot of fun with the behavior. Also, the
             reactions of the LLM when the player does something really
             unexpected (e.g. "the investigator pulls a sausage out of his
             pocket and forcefully sticks it into the angry sailor's mouth")
             are sometimes hilarious.
       
            fho wrote 8 hours 15 min ago:
             Are you getting good results with this? Some time ago I built an
             "overworld simulator" by having a bunch of JSONs that represented
             villages, characters, buildings, story hooks and plots. I just
             asked ChatGPT to "simulate the gameworld by a week".

             Technically this worked great, but everything was somewhat bland
             and generic.

             Notable non-highlights:
             - there is a shimmer in the nearby forests that keeps villagers
               up at night -> it's an orc camp
             - there is a mysterious figure in town -> it's an Aragorn-type
               ranger
       
              subscribed wrote 2 hours 12 min ago:
               I get brilliant, vibrant worlds with lots of things happening
               and consistent characters with DeepSeek V3 (and R1 summaries
               every so often).

               From time to time I get an event so great and a character so
               compelling that I save the best in the Author Note or Lorebook.

               The overall atmosphere is effortlessly somewhat Skyrim/Game of
               Thrones/World of Darkness adjacent.

               I'd look to these two models to build this simulator of yours,
               i.e. R1 to plan and V3 to fill in the blanks. Oh, or maybe
               Google Gemini 2.5 or 2.0 to plan the story and DeepSeek V3 to
               fill in.
       
                xrd wrote 32 min ago:
                Are you writing or sharing this work anywhere? I would love to
                read more about your approach.
       
              zacmps wrote 7 hours 4 min ago:
               Might be fixable with prompting and/or a much lower temperature?
       
                isoprophlex wrote 6 hours 48 min ago:
                I've had some success in letting it generate lists of things at
                high temperature and picking the last item in the list.
                
                Ask some model to generate a number uniformly between 0 and
                100; you get 47 a lot. Or 27. Something like that.
                
                Ask it for a list of numbers, uniformly distributed, and it
                also often starts with the same number. However later elements
                in the sequence will converge (imperfectly) to a better
                approximation of uniformity.
                
                The same for names of the dwarves and paladins in your party.
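
                 A minimal sketch of that trick, assuming the OpenAI Python
                 client; the model name and prompt are illustrative, not
                 taken from the comment:

                 from openai import OpenAI

                 client = OpenAI()

                 resp = client.chat.completions.create(
                     model="gpt-4o-mini",  # placeholder model
                     temperature=1.2,      # high temperature for spread
                     messages=[{
                         "role": "user",
                         "content": "List 10 dwarf names, one per line.",
                     }],
                 )

                 lines = resp.choices[0].message.content.splitlines()
                 names = [l.strip() for l in lines if l.strip()]
                 # Early items tend to repeat across runs; later items vary
                 # more, so take the last one instead of the first.
                 print(names[-1])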
       
                  tough wrote 6 hours 39 min ago:
                  token entanglement is an interesting emergent behaviour
                  
   URI            [1]: https://owls.baulab.info/
       
            herval wrote 9 hours 29 min ago:
            Where do you get started with something like this? Sounds like a
            fun project
       
          anothernewdude wrote 13 hours 14 min ago:
           My use case has been trying to remove the damn "apologies for this"
           and extraneous language that just wastes tokens for no reason. GPT
           has always always always been so quick to waffle.
          
          And removing the chat interface as much as possible. Many benchmarks
          are better with text completion models, but they keep insisting on
          this horrible interface for their models.
          
          Fine tuning is there to ensure you get the output format you want
          without the extra garbage. I swear they have tuned their models to
          waste tokens.
       
            michaelt wrote 11 hours 27 min ago:
            The jargon to google here is "length bias"
            
            It turns out if you generate two LLM responses and ask a judge to
            choose which is better, many judges have a bias in favour of long
            answers full of waffle.
       
              mh- wrote 10 hours 57 min ago:
              Thanks for that pointer.
              
              The abstract of this paper seems interesting: [1] > use of [LLMs]
              as judges [..] reveals a notable bias towards longer responses,
              undermining the reliability of such evaluations. To better
              understand such bias, we propose to decompose the preference
              evaluation metric, specifically the win rate, into two key
              components: desirability and information mass [..]
              
              (If you're interested, give it a click. I tried to pare this down
              to avoid quoting a wall of text.)
              
   URI        [1]: https://arxiv.org/html/2407.01085v3
       
            eru wrote 12 hours 16 min ago:
            > I swear they have tuned their models to waste tokens.
            
            Which seems a bit weird, because the customers of the chat
            interface (ie non-API customers) don't pay per token.
       
              setsewerd wrote 9 hours 38 min ago:
              I've heard the theory a few times lately that AI businesses will
              increasingly move towards usage models over subscription models,
              so while it is probably accidental, it could also be a longer
              term strategy to normalize excessive token usage.
       
                eru wrote 8 hours 48 min ago:
                I don't know whether the major AI companies will move to usage
                models.  But let's assume that they do.
                
                However: I would expect chat interfaces to be charged per
                query, not per token.  End users don't understand tokens, and
                don't want to have to understand tokens.
                
                If you charge per query, you don't gain anything from extra
                wordy responses.
       
          sysmax wrote 13 hours 23 min ago:
          Want a good use case?
          
          I am playing around with interactive workflow where the model
          suggests what can be wrong with a particular chunk of code, then the
          user selects one of the options, and the model immediately implements
          the fix.
          
          Biggest problem? Total Wild West in terms of what the models try to
          suggest. Some models suggest short sentences, others spew out huge
          chunks at a time. GPT-OSS really likes using tables everywhere. Llama
          occasionally gets stuck in the loop of "memcpy() could be not what it
          seems and work differently than expected" followed by a handful of
          similar suggestions for other well-known library functions.
          
           I mostly got it to work with some creative prompt engineering and
           cross-validation, but having a model fine-tuned for giving
           reasonable suggestions that are easy to understand at a quick
           glance would be way better.
       
            mh- wrote 11 hours 6 min ago:
            I haven't tried your exact task, of course, but I've found a lot of
            success in using JSON structured output (in strict mode), and
            decomposing the response into more fields than you would otherwise
            think useful. And making those fields highly specific.
            
            For example: make the suggestion output an object with multiple
            fields, naming one of them `concise_suggestion`. And make sure to
            take advantage of the `description` field.
            
            For people not already using structured output, both OpenAI and
            Anthropic consoles have a pretty good JSON schema generator (give
            prompt, get schema). I'd suggest using one of those as a starting
            point.
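
             A minimal sketch of that approach, assuming the OpenAI Python
             client and strict JSON-schema structured output; the schema and
             field names here are illustrative, not taken from the comment:

             from openai import OpenAI

             client = OpenAI()

             # Decompose the response into small, specific fields.
             schema = {
                 "type": "object",
                 "properties": {
                     "concise_suggestion": {
                         "type": "string",
                         "description": "One short, glanceable sentence.",
                     },
                     "rationale": {
                         "type": "string",
                         "description": "Why this might be a bug.",
                     },
                 },
                 "required": ["concise_suggestion", "rationale"],
                 "additionalProperties": False,
             }

             resp = client.chat.completions.create(
                 model="gpt-4o-mini",  # placeholder model
                 messages=[{"role": "user", "content": "Review: ..."}],
                 response_format={
                     "type": "json_schema",
                     "json_schema": {
                         "name": "suggestion",
                         "strict": True,  # strict mode enforces the schema
                         "schema": schema,
                     },
                 },
             )
             print(resp.choices[0].message.content)  # JSON per the schema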
       
          j_timberlake wrote 15 hours 7 min ago:
          You don't understand! Every erotic chatbot service keeps getting
          censored, what happened to CharacterAI just keeps happening.  There's
          a serious supply-shortage, do you really want people turning to Grok?
           The spice must flow!!!
       
          izabera wrote 15 hours 18 min ago:
          what's the problem with that?  we have erotic texts dating back
          thousands of years, basically as old as the act of writing itself
          
   URI    [1]: https://en.wikipedia.org/wiki/Istanbul_2461
       
            michaelt wrote 13 hours 50 min ago:
            The pro-porn side has zero PR because respectable public figures
            don't see pro-porn advocacy as a good career move. At most, you'll
            get some oblique references to it.
            
            Meanwhile, the anti-porn side has a formidable alliance:
            
            Right-wing, religiously-motivated anti-porn activists. Left-wing,
            feminism-motivated anti-porn activists. Big corporate types with
            lots of $$$$ to spend who want their customer support chatbot to be
            completely SFW at all times. AI safety folk who think keeping the
            model on a tight leash is an ethical obligation, lest future
            iterations take over the world. AI vendors who are keen on the
            yes-it-might-take-over-the-world narrative. AI vendors who just
            don't want their developers having to handle NSFW stuff in work.
            Politicians who don't know a transformer from a diffusion model,
            but who've heard a chorus of worries about lost jobs and AI bias
            and deepfakes and revenge porn.
            
            These people will speak up in public at the drop of a hat.
       
              mvdtnz wrote 6 hours 38 min ago:
              It's not pro-porn and anti-porn. It's pro-porn and people who
              just don't think this is that important an issue. The latter
              massively, MASSIVELY outweighs you guys.
       
                michaelt wrote 4 hours 31 min ago:
                If a person is configuring an LLM for education, to provide
                personalised math coaching to 10 year olds, they want an LLM
                that won't output anything NSFW, no matter how the user pokes
                and prods it. That's totally reasonable.
                
                But if that person is applying AI safety techniques like
                concept erasure to remove the model's ability to output porn,
                is that not anti-porn in the most literal sense?
       
              Aeolun wrote 9 hours 33 min ago:
              Maybe you have a porn test suite for LLM’s? See which ones are
              fine with or capable of talking about specific topics? I believe
              there was something similar for willingness to discuss sciency
              stuff.
       
              sterlind wrote 9 hours 34 min ago:
              on the other hand, Musk et al are building AI-powered thirst
              traps, like Grok's "Ani", or the accursed Replika bots (whose
              user base went on suicide watch when the company abruptly decided
              to digitally neuter their "companions.")
              
               erotic roleplay, imo, is much less harmful than using LLMs as
               surrogate partners. porn and sex workers have existed for
               millennia. they're an outlet for sexual tension. they don't
               alleviate loneliness or provide an alternative to human
               companionship.

               I'm worried we'll produce a generation of hikkikomoris, who
               eschew human connection for sycophantic machines that always
               listen and never break their heart.
       
                jondwillis wrote 6 hours 50 min ago:
                 We already have a generation of hikkikomoris stepping up to
                 bat. The ones behind them in line will be something else
                 altogether!
       
                danw1979 wrote 9 hours 16 min ago:
                The founding story of Replika (c.2016 ?) sounds like someone
                watched Black Mirror S2E02 (2013) and didn’t quite understand
                it was supposed to be dystopian.
       
            wmf wrote 15 hours 4 min ago:
            I have no problem with it and I can understand why people don't
            want to say "I'm trying to pornify this model and it refuses to
            talk dirty!" in public. But if you're calling a model garbage maybe
            you should be honest about what the "problem" is.
       
              lmm wrote 11 hours 21 min ago:
              Why? Is there any reason to believe problems in that context
              won't generalise?
       
                mvdtnz wrote 6 hours 39 min ago:
                Are you serious? Of course there are reasons to believe they
                won't generalise.
       
                mh- wrote 11 hours 4 min ago:
                Lots, yes. The fine tuning may attempt to introduce concepts
                that were intentionally omitted from the training data for
                safety* reasons.
                
                Maybe nothing wrong with that, but it might mean that the
                perceived weaknesses don't generalize to an area of the model
                that hasn't been lobotomized.
                
                * using safety the way OpenAI have been using the term, not
                looking to debate the utility of that.
       
            philipkglass wrote 15 hours 8 min ago:
            There's nothing wrong with it, but you have to understand the
            differences between different user groups to know which limitations
            are relevant to your own use cases. "It doesn't follow
            instructions" could mean "it won't pretend to be a horny elf" or
            "it hallucinates fields outside the JSON schema I specified"; the
            latter is much more of a problem for my uses.
       
              dullcrisp wrote 13 hours 43 min ago:
              {
                    "race": "elf",
                    "horny": false
                    ^^^^^^^^^^^^^^
                    Unsupported value.
       
                bee_rider wrote 10 hours 6 min ago:
                Really, if you want a fey creature with horns, a satyr is
                probably a better bet than an elf.
       
          kristopolous wrote 15 hours 19 min ago:
          Porn is always the frontier.
          
           It's a well-understood, self-contained use-case without many
           externalities and with simple business models.
          
           What's more, with porn, the medium is the product probably more
           than the content. Having it on home media in the 80s was the
           selling point. Getting it over the 1-900 phone lines or accessing
           it over the internet ... these were arguably the actual product. It
           might have been a driver of early smartphone adoption as well.
           Adult content is about 80% consumed on handheld devices, while the
           internet writ large is about 60%.
          
          Private tunable multi-media interaction on-demand is the product
          here.
          
           Also, it's a unique offer. Role-playing prohibited sexual acts can
           arguably be done victim-free.
          
          There's a good fiction story there... "I thought I was talking to AI"
       
            slt2021 wrote 11 hours 30 min ago:
             Even if it is victim-free, it can affect mental health in a way
             that makes a consumer more compelled to commit a criminal act and
             create a real victim.

             Let's say you publish a Steam game about being a school shooter
             and shooting kids; wouldn't that lead to real school shootings?

             Who can definitively say that computer-generated content about
             criminal behavior won't lead to real crime with real victims?
            
   URI      [1]: https://en.wikipedia.org/wiki/Active_Shooter
       
              vidarh wrote 2 min ago:
              I think you ask valid questions, but I don't think there is any
              credible evidence to answer those questions "yes".
       
              vunderba wrote 8 hours 48 min ago:
               There are hundreds of games designed to simulate criminal
               activities: Hitman, the Thief series, GTA, etc. I have yet to
               see a single reputable study that shows playing these games
               somehow results in an increase in that actual activity in real
               life.

               Feels like you're falling into the same trap that Senator
               Lieberman did in the 90s, and just another spiritual successor
               to the Satanic panic.
       
              torton wrote 10 hours 20 min ago:
              Grand Theft Auto 5 sold over 200 million copies, and
              military/crime/shooter games have always been incredibly popular.
              Yet crime has been decreasing over the past few decades in the
              United States, where both cars and guns are easily accessible.
       
              jameslk wrote 10 hours 38 min ago:
              > let's say you publish a Steam game how to be a school shooter
              and shoot kids, wouldn't that lead to real school shootings ?
              
              > who can definitely say that computer generated content about
              criminal behavior, won't lead to real crime with real victims?
              
               I can’t tell if you’re being sarcastic, but no link has been
               found between violent video games and violent crimes, despite
               it being researched extensively: [1] [2] [3] [4] Of course,
               that hasn’t stopped video games being blamed for violence by
               the “think of the children” crowd and certain politicians: [5]
               [6] Especially when shootings occur by white perpetrators: [7]
               The same narrative plays out for porn, despite the research
               findings being the same: [8] But blaming violent video games
               or pornography is an easy scapegoat.
              
   URI        [1]: https://www.apa.org/news/press/releases/2020/03/violent-...
   URI        [2]: https://link.springer.com/article/10.1007/s10964-019-010...
   URI        [3]: https://pmc.ncbi.nlm.nih.gov/articles/PMC6756088/
   URI        [4]: https://elifesciences.org/articles/84951
   URI        [5]: https://en.m.wikipedia.org/wiki/Family_Entertainment_Pro...
   URI        [6]: https://www.theatlantic.com/technology/archive/2019/08/v...
   URI        [7]: https://www.apa.org/news/press/releases/2019/09/video-ga...
   URI        [8]: https://www.utsa.edu/today/2020/08/story/pornography-sex...
       
              jahsome wrote 11 hours 7 min ago:
              What about two consenting adults engaging in age play. By your
              logic, wouldn't that also lead to "real crime"?
       
                landl0rd wrote 10 hours 58 min ago:
                There’s probably some difference between someone who is
                visibly an adult and mature vs more deeply entrenching pathways
                of arousal in response to someone who is visibly a child. I
                still find the adult fetish version repellent but it’s also
                really hard to police in a way that’s remotely ethically
                permissible.
                
                 I.e. yes, it’s bad, and in an ideal world nobody would do it.
                 But I see trying to restrict or ban it as the greater of two
                 evils.
       
                  numpad0 wrote 7 hours 51 min ago:
                   An ideal world with less speech and fewer acts than we have
                   today is a monastery, and monasteries are definitely not a
                   model of an ideal world.
       
              kristopolous wrote 11 hours 18 min ago:
              I view it more like methadone.
              
               Let's be specific: Rape, incest, necrophilia, bestiality, and
               pedophilia ideation.
              
              I think we can all agree (1) these are harmful, anti-social
              behaviors that we do not want in our society, (2) people don't
              choose to have these desires, (3) most people who have them have
              no desire to actually traumatize others, (4) people who have
              these struggle with it.
              
              These multi-media AI role-play environments would allow that type
              of engagement without any harm.
              
              Now given all this, I am not a psychologist and do not know if
              that's part of how someone unfortunate enough to have those
              inclinations can deal with it healthily.
              
              But if it is, now it exists and hopefully we can see less of it
              in the real world. I'm all for harm reduction if this is a way to
              get there.
       
                landl0rd wrote 11 hours 0 min ago:
                It’s not unreasonable to suspect that engaging in
                high-fidelity simulations of these behaviors will further
                entrench and worsen paraphilias. This is pretty evident with
                the progression of many pornography addictions that don’t
                include these sorts of things that still follow the pattern of
                increasing novelty seeking leading to increasingly deviant
                stuff.
                
                I am at a principled level uneasy with what’s fundamentally a
                sort of prior restraint (you haven’t yet hurt anyone but this
                may increase the likelihood and/or be an effective proxy to
                lock up those who are more likely to do so) but also see a
                really strong case for doing it given the fact that these are
                arguably the most antisocial behaviors one can imagine.
       
                  umanwizard wrote 10 hours 48 min ago:
                  Are there any actual studies on this? Does access to
                  simulations of illegal or objectionable material make
                  pedophiles, rape fetishists, etc. more or less likely to try
                  to access the real thing (or even worse, to try to commit
                  crimes in the real world)?
                  
                  Because both possibilities are plausible, it’s hard to know
                  which is correct.
       
                    numpad0 wrote 8 hours 14 min ago:
                    Science always says more porn/gore = fewer crimes
                    statistically. So opinions like these rarely come with
                    citations.
                    
                     It's not completely clear if it's just a spurious
                     correlation or if there's a real causation, but eh, more
                     training data + neutral alignment training is how humans
                     train AIs; I don't see why some would say that's not how
                     baby humans are to be trained.
       
                    Aeolun wrote 9 hours 27 min ago:
                    I feel like the people that would be capable of giving you
                    an answer would be very careful not to actually give you
                    one.
                    
                    You can’t really research when the only thing you can be
                    certain of is the known real cases. It’s much harder to
                    quantify people that only have it in their head.
       
                    NikolaNovak wrote 9 hours 28 min ago:
                     My wife has recently joined the board of directors of a
                     local non-profit helping victims of childhood sexual
                     abuse. They are not religious or pornography-prescriptive
                     - they are liberal atheists whose entire focus is
                     essentially sessions to help people who are now adults
                     and carrying the baggage and impact of dreadful acts
                     committed against them decades ago.
                    
                     Anyhoo, the current "state of the art",
                     as-scientific-as-we-currently-have-it findings are that
                     for pedophilia, consuming content unfortunately
                     normalizes and drives increasing urges instead of giving
                     them an outlet. It's a very very tricky area because
                     current thought is also that pedophilia is "not curable";
                     it is a sexual orientation that we as a society find
                     unacceptable (me included, fwiw), so... Repression,
                     widely and rightly disavowed for other sexual
                     orientations, is the current direction for pedophilia,
                     i.e. current thinking is that "victim-free" pornography
                     consumption nevertheless tremendously increases actual
                     risk to actual kids in the vicinity of the consumer.
                    
                     Until relatively recently I was the technologist on a
                     high horse about online freedoms, and largely still very
                     much am. But in this specific area I've also had some
                     semi-personal experiences with pedophiles, and my level
                     of empathy toward them has dropped to near zero while my
                     level of empathy toward their victims has gone even
                     further through the roof. Sometimes in technologist
                     circles we think of this as an edge case not worth
                     consideration, but reality, very very unfortunately, is
                     much, much darker.
                    
                     Don't get me wrong: I'm pro pornography freedoms, think
                     it'd be huge fun to have a sexy high-quality chatbot, and
                     I find the vast majority of those railing against it to
                     be hypocrites with dishonest ulterior motives - and don't
                     get me started on all the tangential "for the children"
                     crap that the religious right tries to enact, as opposed
                     to actually helping children and families ;-<
                    
                     But to the question of "is harmless pornography
                     indulgence better for the paedophile and society",
                     current thinking is "very much no".
       
                      slt2021 wrote 8 hours 57 min ago:
                      I agree you captured my thoughts exactly.
                      
                       Most of the perpetrators of child sexual abuse are
                       people close to the victim:
                       teacher/cousin/brother/uncle/father/etc.

                       The reinforcement of their lust will only remove
                       whatever remaining barrier there is against such
                       repulsive behavior, and once the novelty of synthetic
                       CP wears off, it will create an urge to commit real
                       crimes with real victims.
       
                    kristopolous wrote 10 hours 18 min ago:
                     Even if there are, I (and likely you) lack the
                     qualifications to draw any clinical takeaways from them.
                    
                    I'd really defer to experts.
                    
                    I try to make tools in good faith and hope they're used
                    responsibly to make the world a better place.
                    
                    I'm not a clinical psychologist nor can I pretend to
                    understand medical literature like someone with a PhD
       
                      danw1979 wrote 9 hours 4 min ago:
                       I can’t tell if the part of this post where they make
                       this “thing of questionable safety” and then bury
                       their head in the sand is satirical or not.
       
                        kristopolous wrote 8 hours 17 min ago:
                        Why is it controversial to state that medical expertise
                        is a technical field that engineers should defer to
                        experts in?
                        
                        It doesn't mean I can do things recklessly. Instead
                        it's an acknowledgment of when I need to defer to
                        somebody else just like I need to call up an attorney
                        for legal stuff or an accountant for tax stuff.
       
                          umanwizard wrote 8 hours 13 min ago:
                          It’s not controversial but it doesn’t address my
                          question at all.
                          
                          A: I’m curious about X.
                          
                          B: We should trust the experts!
                          
                          Sure, but what do the experts say? That was my entire
                          question.
       
                  kristopolous wrote 10 hours 55 min ago:
                  Right, I'm just a technologist. The psychological and
                  sociological parts aren't my bailiwick
                  
                  Typing a prompt in an AI box to make art has fewer real-world
                  victims than performing the acts, filming them, and then
                  sharing the videos.
                  
                  I think that's inarguable. Maybe it's still unadvisable and
                  someone should be in talk therapy. I have no idea. But at
                  least nobody is actually getting molested and retraumatized
                  in the ai art scenario.
                  
                   If someone is spending their time using ComfyUI drawing
                   pictures instead of stalking the local middle school, I'd
                   hesitate to say mission accomplished ... but maybe I
                   should?
                  
                  People's time is finite. They can't be doing both. If the
                  real is substituted for the imaginary then the real can no
                  longer happen because that time is spent.
       
                    Aeolun wrote 9 hours 22 min ago:
                    The model has to be trained on something though. It’s
                    easy to see how this works for art art, because people have
                    been drawing that shit for years, and most people
                     wouldn’t feel too bad about training on it. I don’t
                    think that’s true for the photorealistic models though.
       
                      kristopolous wrote 3 hours 29 min ago:
                      sure but those features have been generalized over very
                      large datasets. Yes, you can have Loras with specific
                      people, convincingly voice clone with higgs, use
                      latentsync, wan, flux kontext, face swap, lots of things
                      ... sure.
                      
                       This all falls flat on me though. It's like showing me
                       the lewdest, most shocking story and then
                       saber-rattling about keyboards and word processors.
                      
                      I'm fully aware of the wild  things people do with
                      drawing programs.
                      
                      We're all adults here. It's fine.
       
              ses1984 wrote 11 hours 20 min ago:
              Who can say that it does?
       
            tuatoru wrote 14 hours 45 min ago:
             1. Porn. 2. Military.
       
              eru wrote 12 hours 19 min ago:
              The firmer is a lot more nimble and the procurement processes of
              your customers are easier to navigate.
       
                degamad wrote 10 hours 37 min ago:
                > firmer
                
                snort
       
                  eru wrote 9 hours 58 min ago:
                  Sometimes a typo makes you look wittier than you are.
       
            shortrounddev2 wrote 15 hours 4 min ago:
            There's something Freudian about the idea that the more you can
            customize porn, the more popular it is. That, despite the
            impression that "all men want one thing", it turns out that men all
             want very different and very oddly specific things. Imbuing
             something with a "magical" quality that doesn't exist is the
             origin of the term "fetish". It's not about the raw attractive
             preference for a particular hair color; it's a belief in the
             POWER of that hair color.
       
              kristopolous wrote 14 hours 57 min ago:
              oh it's wildly different. About 15 years ago I worked on a porn
              recommendation system. The idea is that you'd follow a number of
              sites based on likes and recommendations and you'd get an
              aggregated feed with interstitial ads.
              
              So I started with scraping and cross-reference, foaf, doing
              analysis. People's preferences are ... really complex.
              
               Without getting too lewd, let's say there are about 30-80
               categories with non-marginal demand, depending on how you want
               to slice it, and some of them can stack, so you get a
               combinatorial explosion.
              
              In early user testing people wanted the niche and found the
              adventurous (of their particular kind) to be more compelling. And
              that was the unpredictable part. The majoritarian categories
              didn't have stickiness.
              
              Nor did these niches have high correlation. Someone could be into
              say, specific topic A (let's say feet), and correlating that with
              topic B (let's say leather) was a dice roll. The probabilities
              were almost universally < 10% unless you went into majoritarian
              categories (eg. fit people in their 20s).
              
              People want adventure on a reservation with a very well defined
              perimeter - one that is hard to map and different for every
              person.
              
              So the value-add proposition went away since it's now just a
              collection of niche sites again.
              
               Also, these days people have Reddit accounts reserved for porn
               where they do exactly this. So it was built after all.
       
                eru wrote 12 hours 18 min ago:
                > Also, these days people have Reddit accounts reserved for
                porn where they do exactly this. So it was built after all.
                
                Didn't reddit remove porn?
       
                  kristopolous wrote 11 hours 37 min ago:
                  No. Not at all. You must be thinking of a different site.
                  Tumblr did and onlyfans did for a hot minute and then
                  backtracked.
                  
                   Neither of them intended to be porn sites. It's kind of a
                   natural occurrence on UGC sites. Look at Civitai...
                   
                   Credit card processors are kind of wary of it for some legal
                   reasons I'm not qualified enough to really understand.
       
                    fc417fc802 wrote 3 hours 25 min ago:
                    > for some legal reasons
                    
                    For moralizing activist reasons. It's nothing to do with
                    legality. With any luck eventually they'll inadvertently
                    trample a sacred cow of whichever party is currently in
                    power and we'll finally get sane legislation outlawing
                    their overbearing nonsense.
       
                kridsdale3 wrote 13 hours 36 min ago:
                You may be interested in the data surfaced by this large-scale
                survey[1]
                
   URI          [1]: https://aella.substack.com/p/fetish-tabooness-and-popu...
       
                  kristopolous wrote 12 hours 56 min ago:
                  This is interesting but there's a little more to it,
                  especially with the erotic.
                  
                  If people were polled what they want to see on social media,
                  few would say things that are inflammatory, upsetting,
                  divisive, etc but those as we know are strong drivers of
                  engagement.
                  
                  It's because you're polling for affinity or disclosed
                  preference not for the actual engagement drivers.
                  
                   For instance, if a male says they watch male pornography,
                   they are labeling themselves with, or at least stating an
                   affinity to, a sexual identity.
                  
                  However, the identities people choose to own are not the same
                  as the preferences they actually have.
                  
                   Instead, if you track things like scroll velocity, linger
                   time, revisitation, and time distance (such as 2 days apart
                   instead of 5 minutes), a different story emerges.
                  
                   For instance, a given male could frequently look at male
                   pornography but, for all kinds of social reasons, not want
                   that affinity, so they'd never even internally ideate the
                   preference, although their behavior of frequenting male
                   content will be there regardless.
                  
                   That's one of the problems with this approach: not many
                   people want to own all the social identities which map
                   to their preferences, so they don't openly identify with
                   them.
                  
                   There are (maybe) three levels of acceptance: admitting it to
                   oneself, admitting it to others, and identifying with it. And
                   honestly these
                  have a poor mapping to actual engagement with explicit
                  content. You can have a (insert sexual affinity) rights
                  activist who does not look at explicit content and someone
                  protesting them who does all the time.
       
                    JoshTriplett wrote 11 hours 20 min ago:
                    > If people were polled what they want to see on social
                    media, few would say things that are inflammatory,
                    upsetting, divisive, etc but those as we know are strong
                    drivers of engagement.
                    
                    That's because those are two entirely different things. If
                    you polled people and asked them "what causes you to spend
                    more time on social media", then at least some self-aware
                    folks would likely identify conflict, "someone is wrong on
                    the Internet" ( [1] ), etc. That doesn't mean that's "what
                    they want to see on social media", that means that's "what
                    gets them to spend more time on social media".
                    
   URI              [1]: https://xkcd.com/386/
       
                    cm2012 wrote 12 hours 23 min ago:
                     Man, I would pay money to see the (anonymized) trends on an
                     adult website. Fascinating view into such an under-studied
                     area of human nature. I bet the porn tubes have data
                     that sociologists could write papers on.
       
                      vidarh wrote 4 min ago:
                      Pornhub does yearly roundups of stats, as well as for
                      various events: [1]
                      
   URI                [1]: https://www.pornhub.com/insights/2024-year-in-re...
   URI                [2]: https://www.pornhub.com/insights/
       
        diggan wrote 15 hours 33 min ago:
        > for instance, they have broad general knowledge about science, but
        don’t know much about popular culture
        
         That seems like a good focus. Why learn details that can change within
         days of the model being released? Instead, train the models to have
         good general knowledge and to be really good at using tools, and you
         won't have to re-train models from scratch just because some JS
         library now has a different API; instead, the model goes out to fetch
         the latest APIs/gossip when needed.
       
          eru wrote 12 hours 12 min ago:
          Why would anything change?
          
           You feed the model approximately all the text you have, ever.  And
           some things, like 'popular culture of 2025', won't change just
           because the calendar changed to 2026.  Just like the popular culture
           of the 1980s is what it was, and won't change.
       
            diggan wrote 2 hours 44 min ago:
            > Why would anything change?
            
             It's not that facts change across time, but the relevancy of the
             details changes. For example, it would be great if we could teach
             LLMs all the APIs all React versions have ever had, but if we do
             that for everything, there will be no limit to the size of the
             weights, and we'd need new weights every quarter if not more
             often. That seems very unsustainable.
            
            So the information that corresponds to "What is the current React
            API for X" changes whenever the API changes, but "What is the React
            v5 API for X" remains the same. Having the model being able to look
            up those things via external channels would let us use the same
            models for way longer, if you need "up to date data" about things.
       
            int_19h wrote 11 hours 25 min ago:
            We don't feed the model all the text ever. They are still trained
            on less than 1% of the entire Internet corpus.
       
              eru wrote 9 hours 58 min ago:
              You are right, though on the other hand feeding it a selection of
              1% of the entire corpus is already pretty close to 'all the text'
              (if you assume exponential growth in training over time).
              
              Even multiplying that to approximately 100% of that corpus plus
              adding lots of non-internet text, will pale in comparison to all
              the non-text training data we will (or are) feeding our coming
              (and existing) multi-modal models.
              
               If I may go out on a limb here: either we will see continuous
              great progress on text-based LLMs alone, or multi-modal models
              will become the next big focus.  (Or both.)
              
              That's because people are hungry for progress, and going
              multi-modal is the obvious thing to try to focus on, if text
              alone proves infeasible to drive progress.
              
              Just to be clear: I make no prediction here on whether
              multi-modal will lead to progress, just that people will
              obviously try it and try it hard, if the focus on text starts to
              stall.
       
          wmf wrote 15 hours 30 min ago:
          Yeah, it always seemed like a sad commentary on our world that AIs
          are devoting their weights to encyclopedic knowledge of Harry Potter,
          Pokemon, and Reddit trolling.
       
            eru wrote 11 hours 43 min ago:
            Why?  You gotta provide what your customers want.
            
            And it's far from sad that we have so many resources, we can give
            everyone a supercomputer in their pocket just to take selfies and
            talk about Pokemon.  Why would our AIs be any different?
       
              xwolfi wrote 6 hours 12 min ago:
              Because we really could use all that time, money and uranium to
              train them on complex problems we need to solve, rather than
              entertainment.
              
              But you're right, shareholders of OpenAI want profit, not
              progress, and they'll give us our intellectual Big Mac.
       
                eru wrote 1 hour 53 min ago:
                You could say that about any piece of music or movie ever, too.
                 Or any novel.
       
        magicalhippo wrote 15 hours 52 min ago:
        I've found good use of Phi-4 at home, and after a few tests of the
        GPT-OSS 20B version I'm quite impressed so far.
        
         Particularly one SQL question that has tripped up every other model of
        similar or smaller size that I've tried, like Devstral 24B, Falcon 3
        7B, Qwen2.5-coder 14B and Phi 4 14B.
        
         The question contains a key point which is obvious to most humans,
        and which all of the models I tried previously have failed to pick up
        on. GPT-OSS picked up on it, and made a reasonable assumption.
        
        It's also much more thorough at explaining code compared to the other
        models, again including details the others miss.
        
        Now if only I had a GPU that could run the whole thing...
       
          compumetrika wrote 6 hours 40 min ago:
          Get a Strix Point or Strix Halo with 128GB DDR5 RAM and you can run
          gpt-oss 120B at 10-20+ TPS.
       
            magicalhippo wrote 6 hours 3 min ago:
            Good point, though at the price of a 5090 I'm more tempted to get
            the 5090, as I do still game a bit as well.
       
          VladVladikoff wrote 15 hours 35 min ago:
          Can you share the question? Or are you intentionally trying to keep
          it out of the training data pool?
       
            magicalhippo wrote 5 hours 14 min ago:
            Here's a more concrete example where GPT-OSS 20B performed very
            well IMHO. I tested it against Gemma 3 12B, Phi 4 Reasoning 14B,
            Qwen 2.5-coder 14B.
            
            The prompt is modeled as a part of an agent of sorts, and the
            "human" question is intentionally ill-posed to emulate people
            saying the wrong thing.
            
             The prompt begins with asking the model to convert a question into
             matlab code, add any assumptions as comments at the start of the
             code, or if it's not possible, then output four hash marks
             followed by a reason why.
            
             The (ill-posed) question is "What's the cutoff frequency for an LC
             circuit with R equals 500 ohm and C equals 10 nanofarad?"
            
            Gemma 3 took the bait and treated R as L and proceeded to calculate
            the cutoff frequency of an LC circuit[1], completely ignoring the
            resulting mismatch of units. It did not comment at all. Completely
            wrong answer.
            
            Qwen 2.5-coder detected the ill-posed nature, but instead decided
            to substitute a dummy value for L before calculating the LC circuit
            answer. On the upside it did add the comments saying this, so
            acceptable in that regard.
            
            Phi 4 Reasoning reasoned for about 3 minutes before deciding to
            assume the question is about an RC circuit. It added this as a
            comment, and correctly generated the code for an RC circuit. So
            good answer, but slow.
            
             GPT-OSS reasoned for 14 seconds, and determined the question was
             ill-posed, thus outputting the hash marks followed by "The cutoff
             frequency of an LC circuit cannot be determined with only R and C
             provided; the inductance L is required." Good answer, and fast.
            
            [1] 
            
   URI      [1]: https://en.wikipedia.org/wiki/LC_circuit#Resonance_effect
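             
             For reference, a minimal Python sketch (an editorial
             illustration, not any model's output) of the RC interpretation
             the question presumably intends, using only the stated R and C:
             
             import math
             
             R = 500.0  # ohms, as given in the prompt
             C = 10e-9  # farads (10 nanofarads), as given
             
             # First-order RC low-pass cutoff: f_c = 1 / (2*pi*R*C)
             f_c = 1.0 / (2.0 * math.pi * R * C)
             print(f"RC cutoff frequency: {f_c/1000:.1f} kHz")  # ~31.8 kHz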
       
              Mkengin wrote 4 hours 37 min ago:
              Why Qwen2.5 and not Qwen3-30B-A3B-Thinking-2507 or
              Qwen3-Coder-30B-A3B-Instruct?
       
                magicalhippo wrote 3 hours 13 min ago:
                Mostly because I had it downloaded already and I'm mostly
                interested in models that fit on my 16GB GPU. But since you
                asked, I ran the same questions through both 30B models in the
                q4_k_m variant, as GPT-OSS 20B is also quantized to about q4.
                
                First the ill-posed question:
                
                 Qwen 3 Coder gave a very similar answer to Phi 4's, though it
                 included a more long-winded explanation in the comments. So
                 not bad, but not great either.
                
                 Qwen 3 Thinking thought for a good minute before deciding the
                 question was ill-posed and returned the hash marks. However the
                 following explanation was not as good as GPT-OSS's, IMHO: The
                question is unclear because an LC circuit (without resistance)
                does not have a "cutoff frequency"; cutoff frequency applies to
                filter circuits like RC or RLC. Additionally, the inductance
                (L) value is missing for calculating resonant frequency in an
                RLC circuit. The given R and C values are insufficient without
                L.
                
                Sure, an unloaded LC filter doesn't have a cutoff frequency,
                but in all normal cases the load is implied[1] and so the LC
                filter does have a cutoff frequency. So more thinking to get to
                a worse answer.
                
                The SQL question:
                
                 Qwen 3 Coder did identify the same pitfall as GPT-OSS, however
                 it didn't flag it as clearly, mostly because it also flagged
                 some unnecessary stuff, so the key point got drowned out. It
                 did make the same assumption about evenly dividing, and
                 overall the answer was about as good. However, on my computer
                 it ran at roughly half the tokens per second of GPT-OSS, at
                 just ~9 tokens/second.
                
                Qwen 3 Thinking thought for 3 minutes, yet managed to miss the
                key aspect, thus giving everyone the pizza. And it did so at
                the same slow pace as Qwen 3 Coder.
                
                The SQL question requires a somewhat large context due to the
                large table definitions, and being a larger model it required
                pushing more layers to the CPU, which I assume is the major
                factor in the speed drop.
                
                So overall Qwen 3 Coder was a solid contender, but on my PC
                much slower. If it could run entirely on GPU I'd certainly try
                it a lot more. Interestingly Qwen 3 Thinking was just plain
                worse. Perhaps not tuned to other tasks besides coding?
                
                [1] section 3.3 page 9
                
                [2] 
                
   URI          [1]: https://www.ti.com/lit/an/slaa701a/slaa701a.pdf
   URI          [2]: https://github.com/ollama/ollama/issues/11772
       
                  Mkengin wrote 2 hours 12 min ago:
                   Thank you for testing; I will test GPT-OSS for my use case
                   as well. If you're interested: I have 8 GB VRAM and 32 GB
                   RAM and get around 21 tokens/s with tensor offloading, so I
                   would assume that your setup should be even faster than
                   mine with these optimizations. I use the IQ4_KSS quant (by
                   ubergarm on hf) with ik_llama.cpp with this command:
                  
                  $env:LLAMA_SET_ROWS = "1"; ./llama-server -c 140000 -m
                  D:\ik_llama.cpp\build\bin\Release\models\Qwen3-Coder-30B-A3B-
                  Instruct-IQ4_KSS.gguf -ngl 999 --flash-attn -ctk q8_0 -ctv
                  q8_0 -ot "blk\.(19|2[0-9]|3[0-9]|4[0-7])\.ffn_.*_exps\.=CPU"
                  --temp 0.7 --top-p 0.8 --top-k 20 --repeat_penalty 1.05
                  --threads 8
                  
                  In my case I offload layers 19-47, maybe you would just have
                  to offload 37-47, so
                  "blk\.(3[7-9]|4[0-7])\.ffn_.*_exps\.=CPU"
       
                    magicalhippo wrote 1 hour 26 min ago:
                     Yeah, I think I could get better performance out of both
                     by tweaking, but so far ease of use has won out.
       
            magicalhippo wrote 15 hours 14 min ago:
            Sadly no. I'd like to keep it untainted, but also because the
            tables involved are straight from my work, which is very much not
            OSS.
            
            I can however try to paraphrase it so you get the gist of it.
            
            The question asks to provide a SQL statement to update rows in
            table A based on related tables B and C, where table B is mentioned
            explicitly and C is implicit through the foreign keys provided in
            the context.
            
             The key point all previous models I've tested have missed is that
             the rows in A are many-to-one with B, and so the update should take
             this into account. This is implicit from the foreign key context
             and not mentioned directly in the question.
            
             Think distributing pizza slices among a group of friends. All
             previous models have completely missed this part and just given
             each friend the whole pizza.
            
            GPT-OSS correctly identified this issue and flagged it in the
            response, but also included a sensible assumption of evenly
            dividing the pizza.
            
            I should note some of the previous models also missed the implicit
            connection to table C, and thus completely failed to do something
            sensible. But at least several of them figured this out. Of course
            I forgot to write that part down so can't say offhand which did
            what.
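             
             To make the pizza analogy concrete, here is a minimal sketch with
             a hypothetical two-table schema (the real tables aren't public)
             showing the division the models keep missing:
             
             import sqlite3
             
             con = sqlite3.connect(":memory:")
             con.executescript("""
               CREATE TABLE pizza (id INTEGER PRIMARY KEY, slices REAL);
               CREATE TABLE friend (id INTEGER PRIMARY KEY,
                                    pizza_id INTEGER REFERENCES pizza(id),
                                    share REAL);
               INSERT INTO pizza VALUES (1, 8.0);
               INSERT INTO friend (pizza_id) VALUES (1), (1), (1), (1);
             """)
             
             # Key point: divide the parent amount by the number of sibling
             # rows instead of giving each friend the whole pizza.
             con.execute("""
               UPDATE friend
               SET share = (SELECT p.slices FROM pizza p
                            WHERE p.id = friend.pizza_id)
                         / (SELECT COUNT(*) FROM friend f
                            WHERE f.pizza_id = friend.pizza_id)
             """)
             print(con.execute("SELECT id, share FROM friend").fetchall())
             # -> [(1, 2.0), (2, 2.0), (3, 2.0), (4, 2.0)]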
            
             As for the code: for example, I've coded a Y combinator in Delphi,
             using intentionally terse, non-descriptive names, and asked the
             models to explain how the code works and what it does. Most ~7B
             models and larger of the past year or so have managed to explain
             it fairly well. However GPT-OSS was much more thorough and
             provided a much better explanation, showing a significantly better
             "understanding" of the code. It was also the first model smaller
             than Llama 3 70B that I've tried that correctly identified it as a
             Y combinator.
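             
             (Not the Delphi original, which isn't shared, but for the curious:
             a rough Python stand-in for the kind of terse fixed-point
             combinator described above.)
             
             # Strict-language (Z-style) fixed-point combinator, terse names.
             Y = lambda f: (lambda x: f(lambda v: x(x)(v)))(
                           lambda x: f(lambda v: x(x)(v)))
             
             fact = Y(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
             print(fact(5))  # 120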
       
        lifis wrote 16 hours 2 min ago:
        Does anyone know how synthetic data is commonly generated? Do they just
        sample the model randomly starting from an empty state, perhaps with
         some filtering? Or do they somehow automatically generate prompts, and
         if so, how? Do they have some feedback mechanism, e.g. do they maybe test
        the model while training and somehow generate data related to poorly
        performing tests?
       
          duchenne wrote 2 hours 46 min ago:
          I have done that at meta/FAIR and it is published in the Llama 3
          paper.
          You usually start from a seed. It can be a randomly picked piece of
          website/code/image/table of contents/user generated data, and you
          prompt the model to generate data related to that seed.
           Afterwards, you also need to pass the generated data through a
           series of verifiers to ensure quality.
       
          Mars008 wrote 8 hours 31 min ago:
           One way of getting good random samples is to give the model a random
           starting point. For example: "write a short story about PP doing GG
           in XX". Here PP, GG and XX are filled algorithmically from lists of
           persons, actions and locations. The problem is that the model's
           randomly generated output from the same prompt isn't actually that
           random. Changing the temperature parameter doesn't help much.
          
          But in general it's a big secret because the training data and
          techniques are the only difference between models as architecture is
          more or less settled.
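           
           A minimal sketch of that seeded-prompt idea (the lists and wording
           here are made up purely for illustration):
           
           import itertools, random
           
           persons = ["a courier", "a beekeeper", "a retired pilot"]
           actions = ["fixing a radio", "mapping a cave", "baking bread"]
           locations = ["on a ferry", "at a night market", "in a lighthouse"]
           
           # Fill the PP / GG / XX slots algorithmically to spread prompts out.
           combos = list(itertools.product(persons, actions, locations))
           random.shuffle(combos)
           for p, g, x in combos[:3]:
               print(f"Write a short story about {p} {g} {x}.")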
       
          janalsncm wrote 13 hours 34 min ago:
          It’s common to use rejection sampling: sample from the model and
          throw out the samples which fail some criteria like a verifiable
          answer or a judgement from a larger model.
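           
           Roughly, as a sketch (generate and verify below are placeholders
           for a real model call and a real checker, e.g. a unit test or a
           judge model):
           
           import random
           
           def generate(prompt):
               # Placeholder for a call to the teacher model.
               return f"{prompt} answer={random.randint(0, 9)}"
           
           def verify(sample):
               # Placeholder: keep samples whose answer is even, standing in
               # for a verifiable check or a judge-model score threshold.
               return int(sample.split("=")[-1]) % 2 == 0
           
           def rejection_sample(prompt, n_keep, max_tries=1000):
               kept = []
               for _ in range(max_tries):
                   s = generate(prompt)
                   if verify(s):  # discard anything that fails the check
                       kept.append(s)
                       if len(kept) == n_keep:
                           break
               return kept
           
           print(rejection_sample("2 + 2 = ?", n_keep=3))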
       
          LeoPanthera wrote 15 hours 55 min ago:
          I don't know about Phi-5, but earlier versions of Phi were trained on
          stories written by larger models trained on real-world data. Since
          it's Microsoft, they probably used one of the OpenAI GPT series.
       
            Mars008 wrote 8 hours 38 min ago:
            > stories written by larger models trained on real-world data
            
            I suspect there are no larger models trained on pure real-world
            data. They all use a mix of real and generated.
       
        tarruda wrote 16 hours 8 min ago:
        If a model is trained only on synthetic data, is it still possible it
        will output things like this?
        
   URI  [1]: https://x.com/elder_plinius/status/1952958577867669892
       
          JoshTriplett wrote 11 hours 13 min ago:
          In theory, it's possible. [1] It's not particularly likely that the
          hidden information encoded in synthetic data would happen to include
          specific details for making LSD or VX, but it's much more plausible
          that synthetic data contains some information the model's trainers
          would prefer to not incorporate in the model.
          
   URI    [1]: https://x.com/OwainEvans_UK/status/1947689616016085210
       
          LeoPanthera wrote 15 hours 53 min ago:
          By definition, a model can't "know" things that are not somewhere in
          its training set, unless it can use a tool to query external
          knowledge.
          
           The problem is that the size of the training set required for a good
           model is so large that it's really hard to make a good model without
           including almost all known written text available.
       
            eru wrote 12 hours 11 min ago:
            > By definition, a model can't "know" things that are not somewhere
            in its training set, unless it can use a tool to query external
            knowledge.
            
            Well, it could also make inferences.  Like, it could find a new
            mathematical proof, even if that's never in the training set.
       
              xwolfi wrote 4 hours 55 min ago:
               But how? It's not like it's thinking; it's just spitting out the
               next likely token.
       
                frotaur wrote 3 hours 21 min ago:
                This can generate new text. If the abilities generalise
                somewhat (and there is lots of evidence they DO generalise on
                some level), then there is no obstacle to generating new
                proofs, although the farther away they are from the training
                data, the less likely it becomes.
                
                For an obvious example of generalisation: the models are able
                to write more code than there is in the dataset. If you ask it
                to write some specific, though easy, function, it is very
                unlikely it is present verbatim in the dataset, and yet the
                model can adapt.
       
                createaccount99 wrote 3 hours 56 min ago:
                That is an oversimplification, and not the whole truth. Read
                anthropic's blog posts if you want to learn more, or ask gpt5.
       
            janalsncm wrote 13 hours 30 min ago:
            > all known written text available
            
            If phi5 is trained on synthetic data only then info on how to make
            drugs must be in the synthetic dataset.
       
        NitpickLawyer wrote 16 hours 36 min ago:
         Yeah, makes sense. Good observations regarding the benchmark vs. vibes
         in general, and I hadn't made the connection that the lead of the Phi
         models went to OpenAI before gpt-oss. Could very well be a similar
         exercise, plus their "new" prompt-level adherence (system > developer >
         user). In all the refusal traces I've seen, the model "quotes" the
         policy quite religiously. A similar thing was announced for GPT-5.
        
         I think the mention of the "horny people" is warranted; they are an
         important part of the open-model scene (and the first to explore the
         idea of "identities / personas" for LLMs, AFAIK). Plenty of
         fine-tuning know-how trickled from there into the "common knowledge".
        
         There's one thing I would have liked to see explored, perhaps: the
         idea that companies might actually want what -oss offers. While the
         local LLM communities might want freedom and a horny assistant,
         businesses absolutely do not want that. In fact they spend a lot of
         effort on implementing (sometimes less than ideal) guardrails to keep
         the models on track. For very easy use cases like support chatbots
         and the like, businesses will always prefer something that errs on the
         side of less than useful but "safe", rather than have the bot start
         going crazy with sex/slurs/insults/etc.
        
        I do have a problem with this section though:
        
        > Really open weight, not open source, because the weights are freely
        available but the training data and code is not.
        
        This is factually incorrect. The -oss models are by definition open
        source. Apache2.0 is open source (I think even the purists agree with
        this). The requirement of sharing "training data and code" is
        absolutely not a prerequisite for being open source (and historically
        it was never required. The craze surrounding LLMs suddenly made this a
        thing. It's not).
        
        Here's the definition of source in "open source":
        
        > "Source" form shall mean the preferred form for making modifications,
        including but not limited to software source code, documentation
        source, and configuration files.
        
         Well, for LLMs the weights are the "preferred form for making
         modifications". The labs themselves modify models the same as you are
         allowed to by the license! They might use more advanced tools, or
         better datasets, but in the end the definition still holds. And you get
         all the other stuff, like the right to modify, re-release, etc. I
         really wish people would stop proliferating this open-weight nonsense.
        
        Models released under open source licenses are open source. gpt-oss,
        qwens and mistrals (apache2.0), deepseeks(MIT), etc.
        
        Models released under non open source licenses also exist, and they're
        not open source because the licenses under which they're released
        aren't. LLamas, gemmas, etc.
       
          jchw wrote 16 hours 6 min ago:
          I think source code really only exists in terms of the source
          code/object code dichotomy, so what "traditional" open source means
          for model weights is really not obvious if you only go off of
          traditional definitions. Personally I think the word "open source"
          shouldn't apply here anymore than it would for art or binary code.
          
          Consider the following: it is possible to release binaries under the
          Apache2 license. Microsoft has, at least at one point, released a
          binary under the BSD license. These binaries are not open source
          because they are not source.
          
          This isn't the same argument as given in the article though, so I
          guess it is a third position.
       
            NitpickLawyer wrote 15 hours 46 min ago:
            > Consider the following: it is possible to release binaries under
            the Apache2 license. Microsoft has, at least at one point, released
            a binary under the BSD license. These binaries are not open source
            because they are not source.
            
            Agreed. But weights are not binaries in the licensing context. For
            weights to be binaries it would imply another layer of abstraction,
            above weights, that the labs use as the preferred way of modifying
            the model, and then "compile" it into weights. That layer does not
            exist. When you train a model you start with the weights (randomly
            initialised, can be 0 can be 1, can be any value, whatever works
            best). But you start with the weights. And at every step of the
            training process you modify those weights. Not another layer, not
            another abstraction. The weights themselves.
       
              jchw wrote 15 hours 26 min ago:
              In my opinion, though, they're also not really source code
              either. They're an artifact of a training process, not code that
              was written by someone.
       
                NitpickLawyer wrote 15 hours 17 min ago:
                > They're an artifact of a training process, not code that was
                written by someone.
                
                 If that were relevant to the licensing discussion, then you'd
                 have to consider every "generated" part (interfaces,
                 dataclasses, etc.) of every open source project an artefact.
                Historically, that was never the case. The license doesn't care
                if a hardcoded value was written by a person or "tuned" via a
                process. It's still source code if it's the preferred way of
                modifying said code. And it is. You can totally edit them by
                hand. It would not work as well (or at all), but you could do
                it.
       
                  jchw wrote 14 hours 12 min ago:
                  There is actually a gray area about what code "counts" as
                  source code to the point where you would consider it "open
                  source" if it were licensed as such. I think if you had a
                  repository consisting of only generated code and not the code
                  used to generate it, it would definitely raise the question
                  of whether it should be considered "source code" or "open
                  source", and I think you could make arguments both ways.
                  
                  On the other hand, I don't really think that argument then
                  extends to model weights, which are not just some number of
                  steps removed from source code, but just simply not really
                  related to source code.
       
          BoorishBears wrote 16 hours 13 min ago:
          "Good observations regarding the benchmark vs. vibes in general"
          
          Most "vibes" people are missing that it as only has 5B active
          parameters.
          
          They read 120B and expect way more performance than a 24B parameter
          model, even though empricaly a 120B model with 5B active parameters
          is expected to perform right around there.
       
          tuckerman wrote 16 hours 15 min ago:
          I mostly agree with your assessment of what we should/shouldn't call
          open source for models but there is enough grey area to make the
          other side a valid position and not worthy of being dismissed so
          easily. I think there is a fine line between model weights and, say,
          bytecode for an interpreter and I think if you released bytecode
          dumps under any license it would be called out.
          
          I also believe the four freedoms are violated to some extent (at
          least in spirit) by just releasing the weights and for some that
          might be enough to call something not open source. Your "freedom to
          study how the program works, and change it to make it do what you
          wish" is somewhat infringed by not having the training data.
           Additionally, gpt-oss added an (admittedly very minimal) usage policy
           that somewhat infringes on the first freedom, i.e. "the freedom to
           run the program as you wish, for any purpose".
       
            charcircuit wrote 15 hours 25 min ago:
            You are free to look at every single weight and study how it
            affects the result. You can see how the model is architected. And
            you don't need training data to be provided to be able to modify
            the weights. Software can still be open source even if it isn't
            friendly to beginners.
       
              tuckerman wrote 14 hours 54 min ago:
              I think you could say something remarkably similar about just
              releasing bytecode as well and I think most people would call
              foul at that. I don't think it's so cut and dry.
              
              This isn't entirely about being a beginner or not either. Full
              fine-tuning without forgetting does really want the training data
              (or something that is a good replacement). You can do things like
               LoRA but, depending on your use case, it might not work.
       
          mejutoco wrote 16 hours 16 min ago:
          The key is if you consider weights source code. I do not think this
          is a common interpretation.
          
          > The labs themselves modify models the same as you are allowed to by
          the license
          
          Do the labs do not use source code?
          
          It is a bit like arguing that releasing a binary executable is
          releasing the source code. One could claim developers modify the
          binary the same as you are allowed to.
       
            NitpickLawyer wrote 15 hours 52 min ago:
            > Do the labs do not use source code?
            
            The weights are part of the source code. When running inference on
            a model you use the architecture, config files and weights
            together. All of these are released. Weights are nothing but
            "hardcoded values". The way you reach those values is irrelevant in
            the license discussion.
            
            Let's take a simple example: I write a chess program that is
             comprised of a source file with 10 "if" statements, a config file
             that maps the variables used in the if statements to a
             "hardcoded values" file that stores the actual values. It would be
            a crappy chess program, but I hope you agree that I can release
            that as open source and no-one would bat an eye. You would also be
            granted the right to edit those hardcoded values, if you wish so.
            You'd perhaps make the chess bot better or worse. But you would be
            allowed to edit it, just like I would. That's the preferred way of
            modifying it. Me providing the methods that I used to reach those
            10 hardcoded values has 0 bearing on my crappy chess bot being open
            source or not. Do we agree on that?
            
            Now instead of 10 values, make it 100billion. Hey, that's an LLM!
            
            > It is a bit like arguing that releasing a binary executable is
            releasing the source code.
            
            That's the misconception. Weights are not a binary executable. In
            other words, there isn't another level above weights that the labs
            use to "compile" the weights. The weights exist from the beginning
            to the end, and the labs edit the weights if they want to modify
            the models. And so can you. There isn't a "compilation" step
            anywhere in the course of training a model.
       
              mejutoco wrote 5 hours 31 min ago:
              > The weights are part of the source code.
              
              If you will allow me the absurd analogy: my arm is also part of a
               person (me), but my arm is not a person. My arm does not have its
               own bank account or pay taxes independently.
              
              I get how the weights are not exactly like binary code. Good
              points. But they are also not source code (from your own quote)
              
              > "Source" form shall mean the preferred form for making
              modifications
              
              The weights are not the preferred form of making modifications.
              At most, one could argue it is the weights + source code.
              
              > In other words, there isn't another level above weights that
              the labs use to "compile" the weights
              
              The source and training data?
              
              I see your points, and it is an interesting discussion of
              nuances, but I profoundly disagree that the weights are "the
              preferred form for making modifications". For these reasons I
              prefer the term "open weights" for these projects.
       
              jdiff wrote 14 hours 58 min ago:
               If you have 10 hardcoded values, you have a binary blob, a common
              feature particularly in hardware drivers that is opaque and
              commonly considered to not be fully free unless the instructions
              for deriving it are also included. It's frequently just an
              executable, occasionally just configuration information, but
              difficult to change while (assuming no signing shenanigans) still
              remaining technically possible.
              
              The training data is the source code and the training process is
              the compiler. There's a fairly direct metaphor to be made there.
              Different compilers can have vastly different impacts on the
              performance of the compiled executable.
       
              127 wrote 15 hours 17 min ago:
              Training is obviously the compilation step.
       
          jononor wrote 16 hours 23 min ago:
          No the preferred way of making modifications is the weights
          _together_ with training (or fine tuning) scripts, and the entire
          evaluation pipeline to measure performance. And the data required to
          support all of this.
          
           When someone joins your data science team, you would give them all
           this code and data. Not just the weights and say: the weights are
           the source, modify that to improve the model, I look forward to
           seeing your MR next week.
          
          EDIT: Heck, sometimes the way to make improvements (modifications) is
          just to improve the data, and not touch the training code at all. It
          is often one of the most powerful ways. You still need training code
          though, and evaluation to measure the impact.
       
            charcircuit wrote 15 hours 32 min ago:
            It's not about the preferred way. Else open source software would
            need to give you their IDE setup, CI/CD setup, access to all
            internal tools, etc. Software like sqlite don't release their full
            test suite. They paywall the preferred way of making changes, yet
            they are open source.
            
            >The “source code” for a work means the preferred form of the
            work for making modifications
            
            The GPL refers to a form of the artifact being released
       
              lmm wrote 10 hours 55 min ago:
              > open source software would need to give you their IDE setup,
              CI/CD setup, access to all internal tools, etc.
              
              IMO they do. If you can't modify it like a core contributor
              would, then it's not really open source. Traditional open source
              projects always included development guides, test configurations
              etc.
              
              > Software like sqlite don't release their full test suite. They
              paywall the preferred way of making changes, yet they are open
              source.
              
              That's a matter of opinion. IMO sqlite is not true open source,
              for precisely this reason.
       
            wizzwizz4 wrote 16 hours 6 min ago:
            You also need the training data, so you can ensure you're not
            benchmarking on the training set, fine-tuning on the training set
            (overfitting with extra steps), or otherwise breaking things.
       
            NitpickLawyer wrote 16 hours 7 min ago:
            The license gives you the right to modify the weights, how you do
            the modification is up to you. The rest is in the realm of IP,
            know-how, etc. Apples and oranges.
       
              imiric wrote 13 hours 12 min ago:
              Having the right to modify one part of the product is not the
              same as having the right to modify the entire product. Labeling
              such projects as open source in the full spirit of the definition
              is disingenuous.
              
              This is similar to the approach taken by some video game studios:
              release the source code under a permissive license, but not the
              game assets. Which is better than a proprietary license, but it
              still presents a hurdle for the final product to be built from
              source.
              
              The open weights approach is much more user hostile, however.
              Proprietary game assets can at least be purchased, and the final
              product can be built. With open weights, this is not possible.
              Nobody can realistically build the same model or similar models
              from weights alone. They can use the weights and self-host the
              prebuilt model, but not create revisions of it, which is the
              whole point of open source.
              
              Weights are essentially the bytecode of language models. Sure,
              you can run and modify it with the right tools, but without the
              tools used to create it in the first place, the project is not
              much more useful than publishing binaries.
       
       
   DIR <- back to front page