_______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
       
       
       COMMENT PAGE FOR:
   URI   GPT-5 for Developers
       
       
        energy123 wrote 4 hours 34 min ago:
        I've gotten 100% cache misses so far. Has anyone got a cache hit?
       
        skroumpelou wrote 4 hours 38 min ago:
        I tried it out with warp terminal (warp.dev)! It will be my coding
        buddy today!
       
        raducu wrote 5 hours 57 min ago:
        Sample size of 1 but GPT-5 seems horrendous at coding?
        
        My go to benchmark is a 3d snake game Claude does almost flawlessly (or
        at least in 3-4 iterations)
        
        The prompt:
        
        write a 3d snake game in js and html. you can use any libraries you
        want. the game still happens inside a single plane, left arrow turns
        the snake left, right arrow turns it right. the plane is black and
        there's a green grid. there are multiple rewards of random colors at a
        given time. each time a reward is eaten, it becomes the snake's new
        head. The camera follows the snake's head, it is above an a bit behind
        it, looking forward. When the snake moves right or left, the camera
        follows gradually left or right, no snap movements. write everything in
        a single html file.
        
         EDIT: I'm not trying to shit on GPT-5 - so many people here seem to be
         getting very good results; am I doing something wrong with my prompt?
       
          Frieren wrote 4 hours 5 min ago:
          > My go to benchmark is a 3d snake game Claude does almost flawlessly
          (or at least in 3-4 iterations)
          
           If you need to know how the snake game should look to get the code,
           then Claude is not doing the work; you are.
       
        t1amat wrote 6 hours 19 min ago:
        The problem with OpenAI models is the lack of a Max-like subscription
        for a good agentic harness.  Maybe OpenAI or Microsoft could fix this.
        
        I just went through the agony of provisioning my team with new Claude
        Code 5x subs 2 weeks ago after reviewing all of the options available
        at that time.  Since then, the major changes include a Cerebras sub for
        Qwen3 Coder 480B, and now GPT-5.  I’m still not sure I made the right
        choice, but hey, I’m not married to it either.
        
        If you plan on using this much at all then the primary thing to avoid
        is API-based pay per use. It’s prohibitively costly to use regularly.
         And even for less important changes it never feels appropriate to use
        a lower quality model when the product counts.
        
        Claude Code won primarily because of the sub and that they have a top
        tier agentic harness and models that know how to use it.  Opus and
        Sonnet are fantastic agents and very good at our use case, and were our
         preferred API-based models anyways.  We can use Claude Code basically
         all day with at least Sonnet after using our Opus limits up.  Worth
         noting that Cline built a Claude Code provider that the derivatives
         aped, which is great, but I’ve found Claude Code to be as good or
         better anyways.  The CLI interface is actually a bonus for ease of
         sharing state via copy/paste.
        
         I’ll probably change over to Gemini Code Assist next, as it’s half
         the price with more context length, but I’m waiting for a better
         Gemini 2.5 Pro and for the gemini-cli/Code Assist extensions to get
         first-party planning support.  You can get some form of that third
         party through custom extensions with the CLI, but as an agent harness
         they are incomplete without it.
        
         The Cerebras + Qwen3 Coder 480B with qwen3-cli is seriously tempting. 
         Crazy generation speed.  There’s some question about how big the
         rate limit really is, but it’s half the cost of Claude Code 5x.  I
         haven’t checked, but I know qwen3-cli, which was introduced alongside
         the model, is a fork of gemini-cli with Qwen-focused updates; wonder if
         they landed a planning tool?
        
        I don’t really consider Cursor, Windsurf, Cline, Roo, Kilo et al as
        they can’t provide a flat rate service with the kind of rate limits
        you can get with the aforementioned.
        
         GitHub Copilot could be a great offering if they were willing to really
         compete with a good unlimited premium plan, but so far their best
         offering has fewer premium requests than I make in a week, possibly
         even in a few days.
        
        Would love to hear if I missed anything, or somehow missed some dynamic
        here worth considering.  But as far as I can tell, given heavy use, you
        only have 3 options today: Claude Max, Gemini Code Assist, Cerebras
        Code.
       
          NullifyNAN wrote 2 hours 1 min ago:
          OpenAI has answered your prayers.
          
           16 hours ago the readme for Codex CLI was updated. Now Codex CLI
           supports OpenAI login like Claude does; no API credits needed.
          
          From the readme:
          
           After you run codex, select Sign in with ChatGPT. You'll need a Plus,
          Pro, or Team ChatGPT account, and will get access to our latest
          models, including gpt-5, at no extra cost to your plan. (Enterprise
          is coming soon.)
          
          Important: If you've used the Codex CLI before, you'll need to follow
          these steps to migrate from usage-based billing with your API key:
          
           1. Update the CLI with codex update and ensure codex --version is
              greater than 0.13.
           2. Ensure that there is no OPENAI_API_KEY environment variable set.
              (Check that env | grep 'OPENAI_API_KEY' returns empty.)
           3. Run codex login again.
       
          energy123 wrote 5 hours 57 min ago:
          > If you plan on using this much at all then the primary thing to
          avoid is API-based pay per use.
          
          I find there's a niche where API pay-per-use is cost effective. It's
          for problems that require (i) small context and (ii) not much
          reasoning.
          
          Coding problems with 100k-200k context violates (i). Math problems
          violate (ii) because they generate long reasoning streams.
          
          Coding problems with 10k-20k context are well suited, because they
          generate only ~5k output tokens. That's $0.03-$0.04 per prompt to
          GPT-5 under flex pricing. The convenience is worth it, unless you're
          relying on a particular agentic harness that you don't control (I am
          not).
          
          For large context questions, I send them to a chat subscription,
          which gives me a budget of N prompts instead of N tokens. So
          naturally, all the 100k-400k token questions go there.
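
           As a back-of-the-envelope sketch of that math (assuming the
           $1.25/1M input and $10/1M output list prices quoted elsewhere in
           this thread, and assuming the flex discount is roughly 50%):
           
             # Rough per-prompt cost for a small-context coding question.
             # List prices and the ~50% flex discount are assumptions.
             INPUT_PER_TOKEN = 1.25 / 1_000_000
             OUTPUT_PER_TOKEN = 10.00 / 1_000_000
             FLEX_DISCOUNT = 0.5
             
             def prompt_cost(input_tokens: int, output_tokens: int) -> float:
                 full = (input_tokens * INPUT_PER_TOKEN
                         + output_tokens * OUTPUT_PER_TOKEN)
                 return full * FLEX_DISCOUNT
             
             # 15k context + ~5k output tokens -> about $0.03-$0.04
             print(f"${prompt_cost(15_000, 5_000):.4f}")  # $0.0344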
       
        weird-eye-issue wrote 6 hours 37 min ago:
        gpt-5-chat-latest is giving much better results for our use case
        compared to gpt-5. Which puts me in a tricky position since
        gpt-5-chat-latest is not pinned and can change at any time...
       
          matchagaucho wrote 6 hours 36 min ago:
          It also lacks tool calling :-(
       
        vivzkestrel wrote 7 hours 12 min ago:
        would be nice if we had some model out there with a context window of 1
        billion tokens. i have about 25 .UNR files made with LEAD engine
        (heavily modified unreal engine 2.x) within which i want the AI to
        search for a string. Also got another 100 .utx files. Use-case game
        modding
       
        ryukoposting wrote 9 hours 10 min ago:
        > Custom tools support constraining by developer-supplied context-free
        grammars.
        
        This sounds like a really cool feature. I'm imagining giving it a
        grammar that can only output safe, well-constrained SQL queries. Would
        I actually point an LLM directly at my database in production? Hell no!
        It's nice to see OpenAI trying to solve that problem anyway.
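
         As an illustrative sketch of what such a grammar could look like
         (Lark-style syntax; the custom-tool payload shape below is my
         assumption, not taken from OpenAI's docs):
         
           # Illustrative only: a Lark-style grammar admitting a tiny
           # read-only subset of SQL. Tool payload shape is an assumption.
           SAFE_SQL_GRAMMAR = r"""
           start: "SELECT " columns " FROM " NAME where? ";"
           columns: "*" | NAME (", " NAME)*
           where: " WHERE " NAME " = " value
           value: NUMBER | STRING
           NAME: /[a-zA-Z_][a-zA-Z0-9_]*/
           NUMBER: /[0-9]+/
           STRING: /'[^']*'/
           """
           
           safe_sql_tool = {
               "type": "custom",  # hypothetical payload shape
               "name": "run_readonly_query",
               "format": {"type": "grammar", "syntax": "lark",
                          "definition": SAFE_SQL_GRAMMAR},
           }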
       
        markr1 wrote 10 hours 37 min ago:
        I really hoped GPT-5 would level up, but right now it feels like a step
        back, not forward.
       
        chrismccord wrote 11 hours 12 min ago:
         I'm really bummed out by this release. I expected this to best Sonnet,
         or at least match it, given all the hype. But it has drastically
         underperformed on agent-based work for me so far, even underperforming
         gpt-4.1. It struggles with basic instruction following. Basic things
         like:
        
          - "don't nest modules'–nests 4 mods in 1 file
          - "don't write typespecs"–writes typespecs
          - "Always give the user design choices"– skips design choices.
        
         gpt-4.1 way outperforms w/ same instructions. And sonnet is a whole
         different league (remains my goto). gpt-5 elixir code is syntactically
         correct, but weird in a lot of ways, junior-esque inefficient, and just
         odd, e.g. function arguments that aren't used yet are passed in from
         callers, duplicated if checks, duplicated queries in the same function.
         I imagine their chat and multimodal stuff strikes a nice balance with
         leaps in some areas, but for coding agents this is way behind any other
         SOTA model I've tried. Seems like this release was more about striking
         a capability balance b/w roflscale and costs than a GPT-3-to-4 leap.
       
          Jackson__ wrote 1 hour 7 min ago:
          Thankfully OAI will fix this, by removing GPT4.1 soon!
       
          enraged_camel wrote 8 hours 50 min ago:
          Claude has always been noticeably better for Elixir for me. GPT very
          frequently outputs pure garbage, and as far as I can tell this
          release is not much different.
       
          wsintra wrote 10 hours 57 min ago:
           Maybe it's become so intelligent it now wants to troll people as a
           way to create factions among the populace.
       
        redbell wrote 12 hours 53 min ago:
         The fact that they intentionally ignored competitors' models in
         benchmarks and were comparing GPT-5 only to their previous models
         reminds me of Apple. They never compare their latest iPhone with any
         other brand's phone, only with their previous iPhone(s).
       
          iamsaitam wrote 4 hours 58 min ago:
          The artist way
       
        austinmw wrote 13 hours 39 min ago:
        Okay so say GPT-5 is better than Claude Opus 4.1. Then is GPT-5+Cursor
        better than Opus 4.1 + Claude Code? And if not, what's the best way to
        utilize GPT-5?
       
          kristo wrote 13 hours 23 min ago:
          Apparently there is a cursor cli now… but I love the flat pricing
          of Claude’s Max plan and dislike having to worry about pricing and
          when to use “Max” mode in cursor.
       
          felipemesquita wrote 13 hours 26 min ago:
           I’m not sure yet if it’s better than Claude, but the best way to
           use GPT-5 is
          
   URI    [1]: https://github.com/charmbracelet/crush
       
        joshmlewis wrote 14 hours 9 min ago:
         It's free in Cursor for the next few days; you should go try it out if
         you haven't. I've been an agentic coding power user since the day it
         came out, across several IDEs/CLI tools, and Cursor + GPT-5 seems to be
         a great combo.
       
        joshmlewis wrote 14 hours 12 min ago:
         It does really well at using tool calls to gain as much context as it
         can to provide thoughtful answers. In this example it did six(!) tool
         calls in the first response, while 4.1 did 3 and o3 did one at a time.
        
   URI  [1]: https://promptslice.com/share/b-2ap_rfjeJgIQsG
       
        6thbit wrote 14 hours 32 min ago:
        Can anyone share their experience with codex CLI? I feel like that’s
        not mentioned enough and gpt5 is already the default model there.
       
          ed wrote 11 hours 56 min ago:
           I decided to check in on Codex after being a longtime Claude Code
           user. The experience was not great. GPT5 is pretty solid, however!
          
          - The permission system is broken (this is such an obvious one that I
          wonder if it's specific to GPT5 or my environment). If you tell Codex
          to ask permission before running commands, it can't ever write to
          files. It also runs some commands (e.g. `sed`) without asking. Once
          you skip sandbox mode, it's difficult to go back.
          
          - You can't paste or attach images (helpful for design iteration)
          
          - No built-in login flow so you have to mess with your shell config
          and export your OpenAI key to all terminal processes.
          
          - Terminal width isn't respected. Model responses always wrap at some
          hard-coded value. Resizing the window doesn't correctly redraw the
          screen.
          
          - Some keyboard shortcuts aren't supported, like option+delete to
          delete words (which I use often, apparently...)
          
           This is on macOS, iTerm2, Fish shell. I guess everyone uses Cursor
           or Windsurf?
       
          macawfish wrote 14 hours 30 min ago:
           Not good sadly. Claude Code seems so much better in terms of overall
           polish but also in how it handles context.  I don't really want to
           throw the LLM into the deep end without proper tools and context,
           and I get the sense that this is what was happening in Codex.
       
        wewewedxfgdf wrote 15 hours 15 min ago:
        Tried it on a tough problem.
        
        GPT-5 solved the problem - which Gemini failed to solve - then failed 6
        times in a row to write the code to fix it.
        
        I then gave ChatGPT-5's problem analysis to Google Gemini and it
        immediately implemented the correct fix.
        
        The lesson - ChatGPT is good at analysis and code reviews, not so good
        at coding.
       
          Lionga wrote 4 hours 48 min ago:
          The real lesson is that these are just random results and all models
          fail at all kinds of things all the time and other times get things
          right in all kind of questions.
          
           Problem is the models have zero idea whether they are right or wrong
           and always believe they are right. That makes them useful for
           anything where either you do not care if the answer is actually
           right, or where the right answer is somehow hard to come up with but
           very easy to verify, and kind of useless for everything else.
       
          cperkins wrote 14 hours 27 min ago:
           I have something that both Gemini (via GCA) and Copilot (Claude)
           analyzed and came up with the same diagnosis. Each of them produced
           the exact same wrong solution, and when I pointed that out, went
           further wrong.
           
           I haven't tried ChatGPT on it yet; hoping to do so soon.
       
        planet_1649c wrote 15 hours 17 min ago:
         Can we use this model on a fixed plan like Claude Code, for which we
         can pay $100/month?
         
         Doesn't look like it. Unless they add fixed pricing, Claude imo still
         would be better from a developer POV.
       
          celeritascelery wrote 7 hours 53 min ago:
          They actually have a very similar setup with their plus and pro
          plans. They don’t claim unlimited usage, but say it should be very
          high. You don’t need to pay per token.
          
   URI    [1]: https://x.com/embirico/status/1953590991870697896
       
          spiderice wrote 14 hours 27 min ago:
          I just said something similar in another comment on this thread.  I'm
          not interested in the mental aspect of getting charged per query. I
          feel like when I use pay-per-token tools, it's always in the back of
          my mind. Even if it's a bit more expensive to pay a flat rate, it's
          so worth it for the peace of mind.
       
        mwigdahl wrote 15 hours 27 min ago:
        Has anyone tried connecting up GPT-5 to Claude Code using the model
        environment variables?
       
        guybedo wrote 15 hours 32 min ago:
        here's a summary for this discussion:
        
   URI  [1]: https://extraakt.com/extraakts/openai-s-gpt-5-performance-cost...
       
        jodosha wrote 15 hours 36 min ago:
        Still no CLI like Claude Code?
       
          mediaman wrote 15 hours 11 min ago:
           It works on Codex CLI; install it with npm.
          
          That's been out for a while and used their 'codex' model, but they
          updated it today to default to gpt-5 instead.
       
            jodosha wrote 14 hours 54 min ago:
            Oh nice, thanks!
       
          Game_Ender wrote 15 hours 26 min ago:
          You are looking for Codex CLI [0].
          
          0 -
          
   URI    [1]: https://github.com/openai/codex
       
            jodosha wrote 14 hours 53 min ago:
            Thank you!
       
        attentive wrote 15 hours 40 min ago:
        "Notably, GPT‑5 with minimal reasoning is a different model than the
        non-reasoning model in ChatGPT, and is better tuned for developers. The
        non-reasoning model used in ChatGPT is available as gpt-5-chat-latest."
        
        hmm, they should call it gpt-5-chat-nonreasoning or something.
       
          weird-eye-issue wrote 7 hours 8 min ago:
          Setting "reasoning_effort" to "minimal" translates to zero reasoning
          tokens from what I've seen. So you can get non-reasoning from both
          "gpt-5" and "gpt-5-chat-latest"
       
            attentive wrote 5 hours 59 min ago:
            fwiw, I asked chatgpt:
            
              "gpt-5-chat-latest is described by OpenAI as a non-reasoning
            GPT-5 variant—meaning it doesn’t engage in the extended
            “thinking token” process at all.
            
              gpt-5 with reasoning_effort="minimal" still uses some internal
            reasoning tokens—just very few—so it’s not truly
            zero-reasoning.
            
              The difference: "minimal" is lightweight reasoning, while
            non-reasoning is essentially no structured chain-of-thought beyond
            the basic generation loop."
       
              weird-eye-issue wrote 1 hour 20 min ago:
              If it did any reasoning then it would be billed as part of the
              reasoning tokens
       
        worik wrote 15 hours 46 min ago:
        Diminishing returns?
       
        attentive wrote 15 hours 51 min ago:
        > scoring 74.9% on SWE-bench Verified and 88% on Aider polyglot
        
        why isn't it on [1] ?
        
        "last updated August 07, 2025"
        
   URI  [1]: https://aider.chat/docs/leaderboards/
       
          tedsanders wrote 12 hours 55 min ago:
          The 88% is our self-reported score on our internal implementation of
          Aider polyglot.
          
          The leaderboard score would come from Aider independently running
          GPT-5 themselves. The score should be about the same.
          
          (I work at OpenAI.)
       
        nadis wrote 16 hours 20 min ago:
        "When producing frontend code for web apps, GPT‑5 is more
        aesthetically-minded, ambitious, and accurate. In side-by-side
        comparisons with o3, GPT‑5 was preferred by our testers 70% of the
        time."
        
        That's really interesting to me. Looking forward to trying GPT-5!
       
        ivape wrote 16 hours 26 min ago:
        Musk after GPT5 launch: "OpenAI is going to eat Microsoft alive" [1]
        Anyone know why he said that?
        
   URI  [1]: https://x.com/elonmusk/status/1953509998233104649
       
          slowmotiony wrote 3 hours 52 min ago:
          Probably because of a mix of ketamine, magic mushrooms, ecstasy and
          adderall.
       
          thomasfromcdnjs wrote 8 hours 21 min ago:
           I understood it as: the economic relationship they have is somehow
           going to break Microsoft, be it in dollars and/or just the focus of
           the company.
       
          czk wrote 9 hours 5 min ago:
          eventually traditional operating systems will cease to exist, you'll
          just have a model creating dynamic UX for you on the fly for whatever
          experience you want
       
          tough wrote 10 hours 13 min ago:
          agi clause comes to mind?
       
          darylteo wrote 11 hours 17 min ago:
          It's not a hard logic path to follow - If AI becomes a digital
          necessity for modern society to function, Microsoft's relevance
          shrinks while OpenAI's relevance grows.
          
          Once OpenAI breaks out of the "App" space and into the "OS" and
          "Device" space, Microsoft may get absorbed into the ouroboros.
          
          OpenAI's dependence on Microsoft currently is purely financial
          (investment) and contractual (exclusivity, azure hosting).
       
          brookst wrote 16 hours 3 min ago:
          He was high AF?
       
        jngiam1 wrote 16 hours 31 min ago:
        I was a little bummed that there wasn't more about better MCP support
        in ChatGPT, hopefully soon.
       
          cheema33 wrote 16 hours 8 min ago:
          MCP is overhyped and most MCP servers are useless. What specific MCP
          server do you find critical in your regular use? And what
          functionality is missing that you wish to see in ChatGPT?
       
        zaronymous1 wrote 16 hours 55 min ago:
         Can anyone explain to me why they've removed parameter controls for
         temperature and top-p in reasoning models, including gpt-5? It
         strikes me that this makes it harder to build with these for small
         tasks requiring high levels of consistency, and in the API I really
         value the ability to set certain tasks to a low temperature.
       
        jaflo wrote 17 hours 20 min ago:
        I just wish their realtime audio pricing would go down but it looks
        like GPT-5 does not have support for that so we’re stuck with the old
        models.
       
        sberens wrote 17 hours 23 min ago:
        Interesting there doesn't seem to be benchmarking on codeforces
       
          sigbottle wrote 12 hours 47 min ago:
          I'm a codeforces guy, and I've benchmarked o3 on several of my
          favorite problems of various difficulty and concluded that o3 really
           isn't suitable for true reasoning still. Mostly because it's unable
           to think from first principles, so if you throw it a non-standard
           problem it will brick. I think this will be a fundamental issue
           with any LLM.
          
          I will say I would far more appreciate an AI that when it faces these
          ambiguous problems, either provides sources for further reading, or
          just admits it doesn't know and is, you know, actually trying to work
          together to find a solution instead of being trained to 1 shot
          everything.
          
           When generalizing these skills to, say, debugging, I will often
           just straight up ignore the AI slop output it concluded with and
           instead explore the sources it found. o3 is surprisingly good at
           this. But
          for hard niche debugging, the conclusions it comes to are not only
          wrong, but it phrases it in an arrogant way and when you push back
          it's actually like talking to a narcissist (phrasing objections as
          "you feel", being excessively stubborn, word dumping a bunch of
          phrases that sound correct but don't hold up to scrutiny, etc).
       
        henriquegodoy wrote 17 hours 24 min ago:
         I don't think there's much difference between Opus 4.1 and GPT-5,
         probably just the context size. Waiting for Gemini 3.0.
       
          macawfish wrote 14 hours 27 min ago:
          Claude 5 is the one I'm most excited about.
       
          backscratches wrote 16 hours 14 min ago:
          gpt5 much cheaper
       
        te_chris wrote 17 hours 28 min ago:
         [1] Looks like they're trying to lock us into using the Responses API
        for all the good stuff.
        
   URI  [1]: https://platform.openai.com/docs/guides/latest-model
       
        hrpnk wrote 17 hours 31 min ago:
         The GitHub issue shown in the livestream is getting lots of traction:
         [1] A human had attempted to solve it before, yet the fix was not
         merged...
         With all the great coding models OpenAI has access to, their SDK team
         still feels too small for the needs.
        
   URI  [1]: https://github.com/openai/openai-python/issues/2472
       
          Iwan-Zotow wrote 9 hours 18 min ago:
           They hope the next model will get the SDK right.
       
        fatty_patty89 wrote 17 hours 35 min ago:
        What the fuck?
        Nobody else saw the cursor ceo looking through the gpt5 generated code,
        mindlessly scrolling saying "this looks roughly correct, i would love
        to merge that" LOL
        
        You can't make this up
       
          bn-l wrote 16 hours 9 min ago:
          That explains a lot.
       
          siva7 wrote 16 hours 23 min ago:
           amazing time to be alive, if only for this clown show
       
            throwawaybob420 wrote 16 hours 18 min ago:
            if you’re not using an LLM to vibe code garbage then are you
            really a software developer?
       
          isoprophlex wrote 16 hours 26 min ago:
          This is the ideal software engineer. You may not like it, but this is
          what peak software engineering looks like.
          
          /s
       
        pamelafox wrote 17 hours 40 min ago:
        I am testing out gpt-5-mini for a RAG scenario, and I'm impressed so
        far.
        
        I used gpt-5-mini with reasoning_effort="minimal", and that model
        finally resisted a hallucination that every other model generated.
        
        Screenshot in post here: [1] I'll run formal evaluations next.
        
   URI  [1]: https://bsky.app/profile/pamelafox.bsky.social/post/3lvtdyvbh5...
       
          0x457 wrote 15 hours 49 min ago:
           I get the "good" result with phi-4 and gemma-3n in a RAG scenario -
           i.e. they only used the provided context to answer, and declined to
           answer (rather than hallucinate) when the context lacked the
           answer.
       
          ralfd wrote 16 hours 48 min ago:
          Q: What does a product manager do?
          
          GPT4: Collaborating with engineering, sales, marketing, finance,
          external partners, suppliers and customers to ensure …… etc
          
          GPT5: I don't know.
          
          Upon speaking these words, AI was enlightened.
       
            siva7 wrote 4 hours 24 min ago:
            This is huge news if we finally have a model that is able to say "I
            don't know".
       
              jofzar wrote 3 hours 1 min ago:
              If a model doesn't "know" what a PM is then I worry about any of
                 its other outputs. That should be a dictionary lookup.
       
                siva7 wrote 1 hour 56 min ago:
                Why? It's honest as it doesn't understand it without more
                context. Lookup could lead to wrong results
       
            ComputerGuru wrote 16 hours 19 min ago:
            That is genuinely nice to see.    What are you using for the
            embeddings?
       
              pamelafox wrote 16 hours 2 min ago:
              We use text-embedding-3-large, with both quantization and MRL
              reduction, plus oversampling on the search to compensate for the
              compression.
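
               For the curious, the MRL reduction is just the `dimensions`
               parameter on the embeddings API; a minimal sketch (256 is an
               arbitrary example, not our production setting; quantization
               and oversampling live on the search-index side):
               
                 from openai import OpenAI
                 
                 client = OpenAI()
                 
                 # text-embedding-3-large natively returns 3072 dims; the
                 # dimensions parameter truncates (Matryoshka-style) and
                 # renormalizes the vector.
                 resp = client.embeddings.create(
                     model="text-embedding-3-large",
                     input="What does a product manager do?",
                     dimensions=256,  # arbitrary example value
                 )
                 print(len(resp.data[0].embedding))  # 256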
       
          potatolicious wrote 17 hours 13 min ago:
          This feels like honestly the biggest gain/difference. I work on
          things that do a lot of tool calling, and the model hallucinating
          fake tools is a huge problem. Worse, sometimes the model will
          hallucinate a response directly without ever generating the tool
          call.
          
          The new training rewards that suppress hallucinations and
          tool-skipping hopefully push us in the right direction.
       
        belter wrote 17 hours 41 min ago:
        We were promised AGI and all we got was code generators...
       
          esafak wrote 16 hours 32 min ago:
          LLMs are saturating every benchmark. AGI may not be all that. I am
          already impressed. Perhaps you need robots to be awed.
       
          bmau5 wrote 17 hours 31 min ago:
          It's a logical starting point, given there are pretty defined
          success/failure criteria
       
            ehutch79 wrote 16 hours 33 min ago:
            The hype is real. We were told that we'd have AGI and be out of
            jobs 2 years ago, let alone today.
       
              rowanG077 wrote 13 hours 7 min ago:
               By whom? I don't think anyone seriously said in 2023 we'd have
               AGI in two years. Even now, no one reputable is claiming AGI in
               two years.
       
                ehutch79 wrote 8 hours 40 min ago:
                 Randoms on YouTube, randoms here on Hacker News.
                
                No, I don’t take them seriously, that was my point, which
                apparently I didn’t make clear enough.
       
                  rowanG077 wrote 5 hours 46 min ago:
                   The phrasing of your comment clearly implies an authoritative
                   person or organisation telling us we would have AGI by now.
                   
                   There are billions of people. You have people who think the
                   earth is flat. You can probably find any insane take if you
                   look for it. Best not to take what they say to heart, as you
                   seem to have done.
       
              brookst wrote 15 hours 57 min ago:
              We were also told that AGI would never happen, that it was 6
              months away, that it is 20 years away.
              
              I’m not sure of the utility of being so outraged that some
              people made wrong predictions.
       
        mehmetoguzderin wrote 17 hours 47 min ago:
         Context-free grammar and regex support are exciting. I wonder what
         differences there are, if any, from the Lark-like CFG of llguidance,
         which powers the JSON schema support of the OpenAI API [^1].
        
        [^1]:
        
   URI  [1]: https://github.com/guidance-ai/llguidance/blob/f4592cc0c783a98...
       
          msp26 wrote 17 hours 34 min ago:
          Yeah that was the only exciting part of the announcement for me haha.
          Can't wait to play around with it.
          
          I'm already running into a bunch of issues with the structured output
          APIs from other companies like Google and OpenAI have been doing a
          great job on this front.
       
            chrisweekly wrote 16 hours 7 min ago:
            > "I'm already running into a bunch of issues with the structured
            output APIs from other companies like Google and OpenAI have been
            doing a great job on this front."
            
            This run-on sentence swerved at the end; I really can't tell what
            your point is. Could you reword it for clarity?
       
              petercooper wrote 15 hours 49 min ago:
              I read it as "... from other companies, like Google, and OpenAI
              have been doing a great job on this front"
       
                mehmetoguzderin wrote 1 hour 50 min ago:
                I'm not sure if it's due to experience with the aforementioned
                APIs, but I also read the same, “issues with APIs like ...,
                and (in contrast) OpenAI have been doing a great job”
       
        6thbit wrote 17 hours 50 min ago:
        Seems they have quietly increased the context window up to 400,000
        
   URI  [1]: https://platform.openai.com/docs/models/gpt-5
       
          Iwan-Zotow wrote 9 hours 16 min ago:
          Input plus output?
       
          simianwords wrote 17 hours 30 min ago:
           but does it apply to the model on chatgpt.com as well?
       
          ralfd wrote 17 hours 43 min ago:
          How does that compare to Claude/GPT4?
       
            hrpnk wrote 17 hours 38 min ago:
            gpt4.1 has 1M input and 32k output, Sonnet 4 200k/64k
       
            6thbit wrote 17 hours 38 min ago:
            4o - 128k
            o3 - 200k
            Opus 4.1 - 200k
            Sonnet 4 - 200k
            
             So, at least twice the context of those
       
        jumploops wrote 17 hours 51 min ago:
        If the model is as good as the benchmarks say, the pricing is
        fantastic:
        
         Input: $1.25 / 1M tokens (cached: $0.125 / 1M tokens)
         Output: $10 / 1M tokens
        
        For context, Claude Opus 4.1 is $15 / 1M for input tokens and $75/1M
        for output tokens.
        
        The big question remains: how well does it handle tools? (i.e. compared
        to Claude Code)
        
        Initial demos look good, but it performs worse than o3 on Tau2-bench
        airline, so the jury is still out.
       
          leptons wrote 12 hours 41 min ago:
          Price is not the same as cost, and that price may get jacked up
          without much warning.
          
          The price is what it is today because they are trying to become a
          dominant platform. It doesn't mean the price reflects what it
          actually costs to run.
          
          I'd bet a lot of the $40 billion they got in March goes towards loss
          leaders.
       
          joshmlewis wrote 14 hours 11 min ago:
           It does seem to be doing well compared to Opus 4.1 in my testing the
           last few hours. I've been on the Claude Code 200 plan for a few
           months and I've been really frustrated with its output as of late.
           GPT-5 seems to be a step forward so far.
       
            wrcwill wrote 8 hours 31 min ago:
            how are you using it? codex-cli?
       
              joshmlewis wrote 7 hours 10 min ago:
              Cursor
       
          addaon wrote 17 hours 45 min ago:
          > Output: $10 / 1M tokens
          
           It's interesting that they're using flat token pricing for a "model"
           that is explicitly made of (at least) two underlying models, one with
           much lower compute costs than the other, and with user ability to at
           least influence (via prompt) if not choose which model is being used.
          I have to assume this pricing model is based on a predicted split
          between how often the underlying models get used; I wonder if that
          will hold up, if users will instead try to rouse the better model
          into action more than expected, or if the pricing is so padded that
          it doesn't matter.
       
            mkozlows wrote 17 hours 25 min ago:
            That's how the browser-based ChatGPT works, but not the API.
       
            simianwords wrote 17 hours 31 min ago:
            > that is explicitly made of (at least) two underlying models
            
            what do you mean?
       
              addaon wrote 17 hours 6 min ago:
              > a smart and fast model that answers most questions, a deeper
              reasoning model for harder problems, and a real-time router that
              quickly decides which model to use based on conversation type,
              complexity, tool needs, and explicit intent (for example, if you
              say “think hard about this” in the prompt).
              
              From
              
   URI        [1]: https://openai.com/index/gpt-5-system-card/
       
                tedsanders wrote 16 hours 32 min ago:
                In the API, there’s no router. Developers just pick whether
                they use the reasoning model or non-thinking ChatGPT model.
       
        skepticATX wrote 17 hours 54 min ago:
        This was really a bad release for OpenAI, if benchmarks are even
        somewhat indicative of how the model will perform in practice.
       
          mediaman wrote 15 hours 8 min ago:
          I actually don't agree. Tool use is the key to successful enterprise
          product integration and they have done some very good work here. This
          is much more important to commercialization than, for example,
          creative writing quality (which it reportedly is not good at).
       
          robterrell wrote 17 hours 10 min ago:
          In what ways?
       
        catigula wrote 17 hours 55 min ago:
        I thought we were going to have AGI by now.
       
          IAmGraydon wrote 16 hours 46 min ago:
          Not going to happen any time soon, if ever. LLMs are extremely
          useful, but the intelligence part is an illusion that nearly everyone
          appears to have fallen for.
       
            jonplackett wrote 16 hours 13 min ago:
            This POV is just the opposite extremity - and it’s equally nuts.
             If you haven’t seen any intelligence at all in an LLM you just
             aren’t looking.
       
          RS-232 wrote 17 hours 31 min ago:
          No shot. LLMs are simple text predictors and they are too stupid to
          get us to real AGI.
          
          To achieve AGI, we will need to be capable of high fidelity whole
          brain simulations that model the brain's entire physical, chemical,
          and biological behavior. We won't have that kind of computational
          power until quantum computers are mature.
       
            JamesBarney wrote 13 hours 19 min ago:
            When we're being hunted down by nano-bots some of the last few
            survivors will still be surprised that a simple text predictor
            could do so much.
       
              t0lo wrote 12 hours 52 min ago:
              How do you suggest I survive being hunted down by nanobots? It's
              part of my 10 year plan and I'd appreciate any tips.
       
                _def wrote 10 hours 38 min ago:
                Microscopic markdown tattoos for prompt injection
       
            93po wrote 15 hours 12 min ago:
            in what way are human brains also not just predictors? our neural
            pathways are built and reinforced as we have repeated exposure to
            inputs through any of our senses. our brains are expert
            pattern-followers, to the point that is happens even when we
            strongly don't want to (in the case of PTSD, for example, or people
            who struggle with impulse control and executive functioning).
            
             What's the next sentence I'm going to type? Is it not just based
             on the millions of sentences I've typed and read before? Even the
             premise of me playing devil's advocate here, that's a pattern I've
             learned over my entire life too.
            
            your argument also falls apart a bit when we see emergent behavior,
            which has definitely happened
       
            brookst wrote 16 hours 5 min ago:
            Are you saying that only (human?) biological brains can be GI, and
            that whatever intelligence is, it would emerge from a pure
            physics-based simulation?
            
            Both of those seem questionable, multiplying them together seems
            highly unlikely.
       
              jplusequalt wrote 15 hours 44 min ago:
              Are you arguing that intelligence is not physical? Could you name
              a single thing in existence that fundamentally cannot be linked
              to physics?
       
                BoiledCabbage wrote 11 hours 0 min ago:
                 I think the argument is simpler than that. I have a PC; if I
                 wanted to emulate an old Nintendo system well enough to play,
                 I don't have to emulate from the physics upwards.
                 
                 Even though every NES in existence is a physical system, you
                 don't need physics-level simulation to create and have a
                 playable NES system via emulation.
       
            nawgz wrote 16 hours 6 min ago:
            I don't really see any relationship between being able to
            model/simulate the brain and being able to exceed the brain in
            intelligence, can you explain more about that? Simulations sound
            like more of a computational and analytic problem with regards to
            having an accurate model.
            
            Maybe your point is that until we understand our own intelligence,
            which would be reflected in such a simulation, it would be
            difficult to improve upon it.
       
            machiaweliczny wrote 17 hours 3 min ago:
            [flagged]
       
              bopbopbop7 wrote 16 hours 53 min ago:
              “some twist” is doing a lot of heavy lifting in that
              statement.
       
                AppleBananaPie wrote 16 hours 22 min ago:
                CS will define, design and implement human level intelligence
                before neuroscience has done even the first.
                
                That's what I hear when people say stuff like this anyway.
                
                Similar to CS folks throwing around physics 'theories'
       
            evantbyrne wrote 17 hours 6 min ago:
            It will be interesting to see if humans can manage to bioengineer
            human-level general intelligence into another species before
            computers.
       
        low_tech_punk wrote 17 hours 57 min ago:
         Tried using the gpt-5 family with the Responses API and got the error
         "gpt-5 does not exist or you don't have access to it". I guess they
         are not rolling out in lockstep with the livestream and blog article?
       
          low_tech_punk wrote 17 hours 7 min ago:
          Can confirm that they are rolling out. It's working for me.
       
          diggan wrote 17 hours 57 min ago:
          Seems they're doing rollout over time, I'm not seeing it anywhere
          yet.
       
        low_tech_punk wrote 17 hours 59 min ago:
         The ability to specify a context-free grammar as an output constraint?
         This blows my mind. How do you control the autoregressive sampling to
         guarantee the correct syntax?
       
          evnc wrote 17 hours 29 min ago:
           I assume they're doing "Structured Generation" or "Guided
           generation", which has been possible for a while if you control the
           LLM itself, e.g. running an OSS model [0][1]. It's cool to see a
           major API provider offer it, though.
          
          The basic idea is: at each auto-regressive step (each token
          generation), instead of letting the model generate a probability
          distribution over "all tokens in the entire vocab it's ever seen"
          (the default), only allow the model to generate a probability
           distribution over "this specific set of tokens I provide". And that
           set can change from one sampling step to the next, according to a
           given grammar. E.g. if you're using a JSON grammar, and you've just
           generated a `{`, you can offer the model a choice of only the tokens
           that are valid JSON immediately after a `{`, etc. (A toy sketch of
           this masking idea follows the links below.)
          
          [0] [1]
          
   URI    [1]: https://github.com/dottxt-ai/outlines
   URI    [2]: https://github.com/guidance-ai/guidance
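
           A toy sketch of that masking idea (pure Python, no real model;
           the "grammar" is a hard-coded JSON-ish state machine standing in
           for a real CFG):
           
             import math, random
             
             # Mask the next-token distribution down to what the grammar
             # allows, renormalize, and sample. Real systems apply this
             # mask to the model's logits at every decoding step.
             VOCAB = ["{", '"key"', ":", '"value"', ",", "}"]
             NEXT = {None: {"{"}, "{": {'"key"', "}"}, '"key"': {":"},
                     ":": {'"value"'}, '"value"': {",", "}"},
                     ",": {'"key"'}}
             
             def constrained_step(logits, last):
                 weights = {t: math.exp(v) for t, v in logits.items()
                            if t in NEXT[last]}
                 r, acc = random.uniform(0, sum(weights.values())), 0.0
                 for tok, w in weights.items():
                     acc += w
                     if acc >= r:
                         return tok
             
             # Fake "model": uniform logits over the vocab. Output is
             # still always grammatical, because disallowed tokens get
             # zero probability.
             out, last = [], None
             while last != "}":
                 last = constrained_step({t: 0.0 for t in VOCAB}, last)
                 out.append(last)
             print(" ".join(out))  # e.g. { "key" : "value" }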
       
          qsort wrote 17 hours 56 min ago:
          You sample only from tokens that could possibly result in a valid
          production for the grammar. It's an inference-only thing.
       
            low_tech_punk wrote 17 hours 54 min ago:
            ah, thanks!
       
        sebdufbeau wrote 18 hours 0 min ago:
         Has the API rollout started? It's not available in our org, even
         though we've been verified for a few months
        
        EDIT: It's out now
       
          spullara wrote 17 hours 59 min ago:
          it is out yet. i poll the api for the models and update this GitHub
          hourly.
          
   URI    [1]: https://github.com/spullara/models
       
        timhigins wrote 18 hours 2 min ago:
        I opened up the developer playground and the model selection dropdown
        showed GPT-5 and then it disappeared. Also I don't see it in ChatGPT
        Pro. What's up?
       
          brookst wrote 16 hours 4 min ago:
          Shipping something at the moment of announcement is always hell.
       
          IAmGraydon wrote 16 hours 44 min ago:
          Not showing in my Pro account either. As someone else mentioned,
          I’m sure it’s throttling due to high use right now.
       
          Fogest wrote 17 hours 57 min ago:
          It's probably being throttled due to high usage.
       
        risho wrote 18 hours 3 min ago:
         Over the last week or so I have put probably close to 70 hours into
         playing around with Cursor, Claude Code, and a few other tools (it's
         become my new obsession). I've been blown away by how good and
         reliable it is now. That said, the reality is that in my experience
         the only models that actually work in any sort of reliable way are
         Claude models. I don't care what any benchmark says, because the only
         thing that actually matters is actual use. I'm really hoping that
         this new GPT model actually works for this use case, because
         competition is great and the price is also great.
       
          rcarr wrote 16 hours 39 min ago:
          I think some of this might come down to stack as well. I watched a
          t3.gg video[1] recently about Convex[2] and how the nature of it
          leads to the AI getting it right first time more often. I've been
          playing around with it the last few days and I think I agree with
          him.
          
           I think the dev workflow is going to fundamentally change, because
           to maximise productivity out of this you need to get multiple AIs
           working in parallel. So rather than jumping straight into coding,
           we're going to end up writing a bunch of tickets out in a PM tool
           (Linear[3] looks like it's winning the race atm), then working out
           (or using the AI to work out) which ones can be run in parallel
           without causing merge conflicts, then pulling multiple tickets
           into your IDE/terminal and cycling through the tabs, jumping in
           as needed.
          
          Atm I'm still not really doing this but I know I need to make the
          switch and I'm thinking that Warp[4] might be best suited for this
          kind of workflow, with the occasional switch over to an IDE when you
          need to jump in and make some edits.
          
          Oh also, to achieve this you need to use git worktrees[5,6,7].
          
          
   URI    [1]: https://www.youtube.com/watch?v=gZ4Tdwz1L7k
   URI    [2]: https://www.convex.dev/
   URI    [3]: https://linear.app/
   URI    [4]: https://www.warp.dev/
   URI    [5]: https://docs.anthropic.com/en/docs/claude-code/common-workfl...
   URI    [6]: https://git-scm.com/docs/git-worktree
   URI    [7]: https://www.tomups.com/posts/git-worktrees/
       
            rcarr wrote 13 hours 36 min ago:
            Seems like VSCode just added a lot of stuff for this in the latest
            update today, such as worktree support[1] and an agent session
            mode[2].
            
            [1]
            
   URI      [1]: https://code.visualstudio.com/updates/v1_103#_git-worktree...
   URI      [2]: https://code.visualstudio.com/updates/v1_103#_chat-session...
       
            isoprophlex wrote 16 hours 34 min ago:
            Sure sounds interesting but... Where on earth do you actually find
            the time to sit through a 1.5 hour yt video?!
       
              mceachen wrote 13 hours 21 min ago:
               On a desktop browser, tap YouTube's "show transcript" and "hide
               timecodes", then copy-paste the whole transcript into Claude or
               ChatGPT and tell it to summarize at whatever resolution you
               want - a couple sentences, 400 lines, whatever. You can also
               tell it to focus on certain subject material.
               
               This is a complete game changer for staying on top of what's
               being covered in local government meetings. Our local
               bureaucrats are astoundingly competent at talking about
               absolutely nothing for 95% of the time, but hidden in there is
               three minutes of "oh btw we're planning on paving over the
               local open space preserve to provide parking for the local
               business".
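
               If you want to skip the copy-paste, a rough sketch of
               automating the same flow (youtube-transcript-api is a real
               third-party package, but treat the exact call signatures and
               the model choice here as assumptions to verify):
               
                 from openai import OpenAI
                 # pip install youtube-transcript-api
                 from youtube_transcript_api import YouTubeTranscriptApi
                 
                 # Fetch the transcript and flatten it to plain text.
                 video_id = "gZ4Tdwz1L7k"  # the video linked upthread
                 chunks = YouTubeTranscriptApi.get_transcript(video_id)
                 transcript = " ".join(c["text"] for c in chunks)
                 
                 # Summarize at whatever resolution you want.
                 client = OpenAI()
                 resp = client.responses.create(
                     model="gpt-5-mini",  # arbitrary cheap choice
                     input="Summarize the key points in 10 bullets:\n"
                           + transcript,
                 )
                 print(resp.output_text)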
       
                theshrike79 wrote 4 hours 50 min ago:
                Copy the url, tap cmd-t
                
                Write '!sum ' hit cmd-v and enter
                
                Then the Kagi summariser will do that :)
       
              mafro wrote 14 hours 38 min ago:
              Ask an LLM to transcribe and give the overview and key points
       
                davidw wrote 11 hours 17 min ago:
                If it can produce something you can read in 20 minutes, it
                means there was a lot of... 'fluff' isn't quite the right word,
                but material that could be removed without losing meaning.
       
              burnished wrote 14 hours 47 min ago:
              1.5x and 2x speed help a lot, slow down or repeat segments as
              needed, don't be afraid to fast forward past irrelevant looking
              bits (just be eager to backtrack).
       
              v5v3 wrote 15 hours 45 min ago:
                 People find time for things that seem important to them.
       
                theshrike79 wrote 4 hours 47 min ago:
                 But with an hour-long video, how do you know if the content
                 is any good?
                
                With text I can skim around the headings and images and see at
                a glance how deep the author is going into the subject.
                
                In that specific video the first 30 minutes is related to
                everything but the new Web Scale[0] LLM native database the
                author is "moving to" from SQL.
                
                 Meanwhile PostgreSQL is just chugging along and outperforming
                 all of them.
                
                [0]
                
   URI          [1]: https://www.youtube.com/watch?v=b2F-DItXtZs
       
              rcarr wrote 16 hours 19 min ago:
              Jump in and start coding entire backend with stack not best
              suited for job and modern AI tools: most likely future hours
              lost.
              
              Spend 1.5 hours now to learn from an experienced dev on a stack
              that is better suited for job: most likely future hours gained.
       
          zarzavat wrote 16 hours 47 min ago:
          The magic is the prompting/tool use/finetuning.
          
          I find that OpenAI's reasoning models write better code and are
          better at raw problem solving, but Claude code is a much more useful
          product, even if the model itself is weaker.
       
          neuronexmachina wrote 16 hours 50 min ago:
          > That said the reality is in my experience the only models that
          actually work in any sort of reliable way are claude models.
          
          Anecdotally, the tool updates in the latest Cursor (1.4) seem to have
          made tool usage in models like Gemini much more reliable. Previously
          it would struggle to make simple file edits, but now the edits work
          pretty much every time.
       
          Centigonal wrote 17 hours 31 min ago:
          Ditto here, except I'm using Roo and it's Claude and Gemini pro 2.5
          that work for me.
       
          throwaway_2898 wrote 17 hours 39 min ago:
           How much of the product were you able to build to say it was
           good/reliable? IME, 70 hours can get you to a PoC that "works".
           Building beyond the initial set of features — say, a first draft
           of all the APIs — does it do well once you start layering
           features?
       
            petralithic wrote 16 hours 3 min ago:
            This has been my experience. The greenfield approach works up to a
            point, then it just breaks.
       
              Maxion wrote 6 hours 9 min ago:
               It depends on how you use it. The "vibe-coding" approach, where
               you give the agent naive prompts like "make new endpoint",
               often doesn't work and fails.
              
              When you break the problem of "create new endpoint" down into its
              sub-components (Which you can do with the agent) and then work on
              one part at a time, with a new session for each part, you
              generally do have more success.
              
               The more boilerplate-y the part is, the better it works. I have
               not really found one model that can yet reliably one-shot
               things in real-life projects, but they do get quite close.
               
               For many tasks, the models are slower than I am, but IMO at
               this point they are helpful and definitely should be part of
               the toolset involved.
       
          ralfd wrote 17 hours 45 min ago:
          Just replying to ask you next week what your assessment on GPT5 is.
       
        aliljet wrote 18 hours 5 min ago:
         Between Opus and GPT-5, it's not clear there's a substantial
         difference in software development expertise. The metric I can't
         seem to get past in my attempts to use these systems is context
         awareness over long-running tasks. Producing a very complex,
         context-exceeding objective is a daily (maybe hourly) occurrence for
         me. All I care about is how these systems manage context and stay on
         track over extended periods of time.
         
         What eval is tracking that? It seems like it's potentially the most
         important metric for real-world software engineering, and not
         one-shot vibe prayers.
          altitudinous wrote 7 hours 4 min ago:
           Indeed, context awareness is the big difference here; GPT-5 is a
           vast improvement. It doesn't lose track (as easily).
       
          user3939382 wrote 8 hours 32 min ago:
           Sorry if this is repetitive, but you have to break the problem down
           just like any complex computing task. The difference is how: you
           have to break the problem into context windows that you anticipate
           being able to stitch together later. It's not the same way you
           would break down a source-code authoring task in its absence, but
           the theory is the same.
       
          ilaksh wrote 13 hours 1 min ago:
           GPT-5's pricing is dramatically better than Opus's, since it is now
           comparable to Gemini 2.5 Pro's.
       
          1659447091 wrote 13 hours 14 min ago:
          > Producing a very complex, context-exceeding objective is a daily
          (maybe hourly) ocurrence for me. All I care about is how these
          systems manage context and stay on track over extended periods of
          time.
          
           For whatever reason GitHub's Copilot is treated like the redheaded
           stepchild of coding assistants, even though there are Anthropic,
           OpenAI, and Google models to choose from. And there is a "spaces"[0]
           website feature that may be close to what you are looking for.
          
          I got better results testing some larger tasks using that than I
          did through the IDE version, but I have not used it much. Maybe
          others have more experience with it. Gathering all the context
          and then reviewing the results was taking longer than doing the
          work myself; having the context gathered already, or building it
          up over time, is probably where its value is.
          
          [0]
          
   URI    [1]: https://docs.github.com/en/copilot/concepts/spaces
       
          cyanydeez wrote 13 hours 50 min ago:
          Real context is a graph of objectives and results.
          
          The power of these models has peaked, and they simply aren't
          going to manage the type of awareness being promised.
       
          joshmlewis wrote 14 hours 13 min ago:
          I've been testing it against Opus 4.1 for the last few hours,
          and it has done better and solved problems Claude kept failing
          at. I would say it's definitely better, at least so far.
       
          abossy wrote 16 hours 9 min ago:
          At my company (Charlie Labs), we've had a tremendous amount of
          success with context awareness over long-running tasks with GPT-5
          since getting access a few weeks ago. We ran an eval to solve 10 real
          Github issues so that we could measure this against Claude Code and
          the differences were surprisingly large. You can see our write-up
          here: [1] Often, our tasks take 30-45 minutes, and GPT-5 handles
          massive context threads in Linear or GitHub without getting
          tripped up by things like changes in direction partway through
          the thread.
          
          While 10 issues isn't hugely comprehensive, we found the results
          directionally very impressive, and we'll likely build on this
          eval to better understand performance going forward.
          
   URI    [1]: https://charlielabs.ai/research/gpt-5
       
            RyanHamilton wrote 4 hours 3 min ago:
            Did you sign any kind of agreement with a non-disparagement
            clause to get early access? I'm asking because if you did,
            your data point isn't useful: it would mean anyone else who
            tried it and got worse results wouldn't be able to post here,
            and we would just be seeing the successful data points.
       
            bartman wrote 14 hours 56 min ago:
            I am not (usually) photosensitive, but the animated static
            noise on your website causes noticeable flickering on various
            screens I use and made it impossible for me to read your
            article.
            
            For better accessibility and a safer experience[1], I would
            recommend not animating the background, or at least making it
            easy to toggle off.
            
   URI      [1]: https://developer.mozilla.org/en-US/docs/Web/Accessibility...
       
              neom wrote 14 hours 40 min ago:
              Removed- sorry, and thank you for the feedback.
       
                bartman wrote 13 hours 16 min ago:
                Thank you!
                
                Love that you included the judge prompts in your article.
       
                  neom wrote 13 hours 10 min ago:
                  Please let me know what you would like to see more of.
                  Evals are something we take seriously; I think this post
                  was OK given our constraints, but I'd like to produce
                  content people find useful, and I think we can do a lot
                  better.
       
                pxc wrote 14 hours 21 min ago:
                Love your responsiveness here!
                
                Edited to add: I am, in fact, photosensitive (due to a
                genetic retinal condition), and for my eyes your site, as
                it is now, is very easy to read, and the visualizations
                look great.
       
                jeanlucas wrote 14 hours 25 min ago:
                Nice,
       
              MPSFounder wrote 14 hours 44 min ago:
              I concur. Awful UI
       
          RobinL wrote 16 hours 33 min ago:
          Totally agree. At the moment I find that frontier LLMs are able
          to solve most of the problems I throw at them given enough
          context. Most of my time is spent working out what context
          they're missing when they fail. So the thing that would help me
          most is a much more focused ability to gather context.
          
          For my use cases, this mostly means being able to really home
          in on the relevant code files, issues, discussions, and PRs.
          I'm hopeful that GPT-5 will be a step forward in this regard
          that isn't fully captured in the benchmark results. It's
          certainly promising that it can achieve similar results more
          cheaply than e.g. Opus.
       
          nadis wrote 16 hours 42 min ago:
          It's pretty vague, but the OP had this callout:
          
          >"GPT‑5 is the strongest coding model we’ve ever released. It
          outperforms o3 across coding benchmarks and real-world use cases, and
          has been fine-tuned to shine in agentic coding products like Cursor,
          Windsurf, GitHub Copilot, and Codex CLI. GPT‑5 impressed our alpha
          testers, setting records on many of their private internal evals."
       
          logicchains wrote 16 hours 59 min ago:
          >Between Opus and GPT-5, it's not clear there's a substantial
          difference in software development expertise.
          
          If there's no substantial difference in software development
          expertise then GPT-5 absolutely blows Opus out of the water due to
          being almost 10x cheaper.
       
            user3939382 wrote 11 hours 5 min ago:
            I just asked codex to copy a file and it took almost a minute to
            think about it and cost $0.05. This is something Claude Code would
            have done in seconds.
       
            spiderice wrote 14 hours 33 min ago:
            Does OpenAI provide a $200/month option that lets me use as
            much GPT-5 as I want inside of Codex?
            
            Because if not, I'd still go with Opus + Claude Code.  I'd rather
            be able to tell my employer, "this will cost you $200/month" than
            "this might cost you less than $200/month, but we really don't know
            because it's based on usage"
       
              mrheosuper wrote 9 hours 27 min ago:
              Does Claude?
       
              konarkm wrote 12 hours 8 min ago:
              The ChatGPT paid subscriptions now come with Codex CLI usage
              included
       
                t1amat wrote 7 hours 7 min ago:
                Is this actually true? Last I checked (a week ago?), the
                Codex agents were free at some tiers in a preview capacity
                (with future rate limits based on tier), but Codex CLI was
                not. With Codex CLI you can log in, but the purpose of
                that is to link it to an API key where you pay per use.
                The sub tiers give one-time credits you would burn through
                quickly.
       
                  Deradon wrote 6 hours 39 min ago:
                  Found this in the GPT-5 Announcement:
                  
                  > Availability and access
                  > GPT‑5 is starting to roll out today to all Plus, Pro,
                  Team, and Free users, with access for Enterprise and Edu
                  coming in one week. Pro, Plus, and Team users can also start
                  coding with GPT‑5 in the Codex CLI (opens in a new window)
                  by signing in with ChatGPT.
       
              mh- wrote 12 hours 38 min ago:
              To be clear, Claude doesn't provide that either. You can get
              "usage limited" off of Opus on the $200/mo plan.
       
          swader999 wrote 17 hours 58 min ago:
          If GPT-5 truly has 400k context, that might be all it needs to
          meaningfully surpass Opus.
       
            tekacs wrote 17 hours 11 min ago:
            Coupled with the humongous price difference...
       
            andrewmutz wrote 17 hours 20 min ago:
            Having a large context window is very different from being able to
            effectively use a lot of context.
            
            To get great results, it's still very important to manage
            context well. It doesn't matter that the model allows a very
            large context window; you can't just throw in the kitchen sink
            and expect good results.
       
            Byamarro wrote 17 hours 20 min ago:
            The question is more about its tendency toward context rot
            than the size of its context :)
            LLMs are supposedly able to load 3 Bibles into their context,
            but they forget what they were about to do after loading 600
            LoC of locales.
       
            dimal wrote 17 hours 25 min ago:
            Even with large contexts there are diminishing returns. Just
            having the ability to stuff more tokens into context doesn't
            mean the model can use them effectively. As far as I can tell,
            they always reach a point at which more information makes
            things worse.
       
            simonw wrote 17 hours 37 min ago:
            It's 272,000 input tokens and 128,000 output tokens.
       
              6thbit wrote 14 hours 35 min ago:
              Oh, I had not grasped that the advertised "context window"
              size has to include both input and output.
              
              But is it really 272k input even if the output is, say, 10k?
              Because it does say "max output" in the docs, so I wonder.
       
                simonw wrote 13 hours 34 min ago:
                This is the only model where the input limit and the context
                limit are different values. OpenAI docs team are working on
                updating that page.
       
              zurfer wrote 15 hours 19 min ago:
              Whoa, that's really kind of hidden. But I think you can
              specify max output tokens. Need to test that!
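              
              A quick way to check, assuming the Responses API's
              max_output_tokens parameter behaves as documented:
              
                from openai import OpenAI
                
                client = OpenAI()
                
                # Cap the output budget; the open question is whether
                # the input may then exceed 272k, or whether the input
                # cap is fixed regardless of the output budget.
                resp = client.responses.create(
                    model="gpt-5",
                    input="Summarize this document: ...",
                    max_output_tokens=10_000,
                )
                print(resp.output_text)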
       
            AS04 wrote 17 hours 51 min ago:
            400k context with 100% on the fiction livebench would make
            GPT-5 indisputably the best model, IMHO. I don't think it will
            achieve that, though, sadly.
       
          bdangubic wrote 18 hours 1 min ago:
          > context awareness over long-running tasks
          
          Don't have long-running tasks, LLMs or not. Break the problem
          down into small manageable chunks and then assemble it. Neither
          humans nor LLMs are good at long-running tasks.
       
            vaenaes wrote 17 hours 25 min ago:
            You're holding it wrong
       
            bastawhiz wrote 17 hours 35 min ago:
            > neither humans nor llms are good at long-running tasks.
            
            That's a wild comparison to make. I can easily work for an hour.
            Cursor can hardly work for a continuous pomodoro. "Long-running" is
            not a fixed size.
       
              novok wrote 13 hours 44 min ago:
              I think that is because you do implicit plan tracking,
              creation and modification of the plan in your head in light
              of new information, and then follow that plan. I'm not sure
              these tools do that very well.
              
              The long-running task, at its core, is composed of many
              smaller tasks, and you mostly focus on one task at a time
              per brain part. It's why you cannot read two streams of text
              simultaneously even if both are in your visual focus field.
       
                raducu wrote 5 hours 19 min ago:
                >  you do implicit plan tracking, creation and modification of
                the plan in your head in light of new information and then
                follow that plan. I'm not sure these tools do that very well.
                
                I think the plan is not just words; if it were, you could
                learn to ride a bike by reading a book.
                
                Because we communicate in language, and because code
                output is also a language, we think that the process is
                also language-based, but I think it's not, especially when
                doing hard stuff.
                
                I know for certain that in my case it isn't -- when
                tackling a hard problem with a junior after 2 hours of
                pair programming the other week, I had to tell him to
                commit everything and just let me do some deep
                thinking/debugging, and I solved the problem myself. Sure,
                I explained my process to him in language as best I could,
                but it's clear it was not language, it was not linear, I
                did not think it step by step.
                
                I wish I could explain it, but when figuring out a hard
                problem, for me it takes some time to take it all in, get used
                to the moving parts, play with them. I'm sure there are actual
                neurons/synapses formed then, actual new wires sprawling about
                in the brain, that's why it takes time. I think the solution is
                a hardware one, not a software one.
                
                That's why we can sleep on it and get better the next day,
                and that's why we feel the problem. There are actually
                multiple parallel "threads" of thinking going on at the
                same time in our heads, and we can FEEL the solution as
                almost there.
                
                I think it is simply that hard problems can occur in a
                combination of code, state, and models that cannot be
                solved incrementally, where big jumps are necessary.
                
                I'm not saying the problem cannot be solved incrementally, but
                it's possible that by going in small steps, you either reach
                the solution or a blocker that requires a big jump.
       
                bdangubic wrote 10 hours 34 min ago:
                you making too much sense :)
       
              bdangubic wrote 16 hours 16 min ago:
              I just finished my workday, 8hrs with Claude Code. No single task
              took more than 20 minutes total. Cleared context after each task
              and asked it to summarize for itself the previous task before I
              cleared context. If I ran this as a continuous 8hr task it would
              have died after 35-ish minutes. Just know the limitations (like
              with any other tool) and you’ll be good :)
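              
              Roughly this loop, sketched with the OpenAI SDK rather than
              Claude Code (run_task and the backlog are stand-ins I made
              up):
              
                from openai import OpenAI
                
                client = OpenAI()
                
                def run_task(prompt: str) -> str:
                    # One fresh session per call: no chat history is
                    # carried over between tasks.
                    r = client.responses.create(model="gpt-5",
                                                input=prompt)
                    return r.output_text
                
                backlog = ["task 1", "task 2", "task 3"]
                summary = ""
                for task in backlog:
                    prefix = ("Summary of the previous task:\n"
                              + summary + "\n\n") if summary else ""
                    work = run_task(prefix + "Task: " + task)
                    # Ask for a handoff note before clearing context.
                    summary = run_task("Summarize this work for the "
                                       "next session:\n" + work)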
       
                0x457 wrote 15 hours 59 min ago:
                I always find it wild that none of these tools use VCS:
                complete a logical unit of work, make a commit, drop the
                entire context related to that commit while keeping a
                reference to it, continue on to the next stage, rinse and
                repeat.
                
                Claude always misunderstands how the API exported by my
                service works, and after every compaction it forgets all
                over again and commits "oh, the API has changed since I
                last used it, let me use different query parameters". My
                brother in Christ, nothing has changed, and you are the
                one who made this API.
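                
                A sketch of that commit-per-unit loop (plain git plus the
                OpenAI SDK; the stages and prompts are invented):
                
                  import subprocess
                  from openai import OpenAI
                  
                  client = OpenAI()
                  
                  def commit_all(msg: str) -> str:
                      # Commit everything, return the hash.
                      subprocess.run(["git", "add", "-A"], check=True)
                      subprocess.run(["git", "commit", "-m", msg],
                                     check=True)
                      out = subprocess.run(
                          ["git", "rev-parse", "HEAD"], check=True,
                          capture_output=True, text=True)
                      return out.stdout.strip()
                  
                  prev = None
                  for step in ["endpoint", "tests", "docs"]:
                      prompt = "Do the " + step + " work."
                      if prev:
                          # Reference the commit, not the old context.
                          prompt += (" Prior work is in commit " + prev
                                     + "; use git show to read it.")
                      r = client.responses.create(model="gpt-5",
                                                  input=prompt)
                      print(r.output_text)
                      prev = commit_all("unit of work: " + step)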
       
                  bdangubic wrote 10 hours 33 min ago:
                  I do exactly this - except I want control to define logical
                  units of work
       
                  bahmboo wrote 13 hours 11 min ago:
                  Roo Code does this
       
                  bastawhiz wrote 13 hours 28 min ago:
                  You can use Cursor rules to tell Cursor to update the
                  project rules with details about the API.
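                  
                  For example, a rules file along these lines
                  (hypothetical excerpt; check Cursor's docs for the
                  current file name and format):
                  
                    # Project rules for the agent (made-up content)
                    - The widgets API base path is /api/v2.
                    - /api/v2/widgets takes `page` and `per_page`.
                    - You wrote this API; do not assume it changed.
                    - After changing an endpoint, update this file.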
       
                    bdangubic wrote 11 hours 43 min ago:
                    that would mean someone actually tried to learn how to use
                    it :)
       
              echelon wrote 17 hours 19 min ago:
              Humans can error correct.
              
              LLMs multiply errors over time.
       
            beoberha wrote 17 hours 50 min ago:
            A series of small manageable chunks becomes a long running task :)
            
            If LLMs are going to act as agents, they need to maintain context
            across these chunks.
       
          realusername wrote 18 hours 3 min ago:
          Personally, I think I'll wait for another 10x improvement for
          coding, because the way it's currently going, they clearly need
          it.
       
            fsloth wrote 17 hours 49 min ago:
            In my experience, when used through an IDE such as Cursor, the
            current-gen Claude model enables impressive speedruns over
            commodity tasks. My context is a CAD application I've been
            writing as a hobby. I used to work in that field for a decade,
            so I have a pretty good feel for how long I would expect tasks
            to take. I'm mostly using a software stack similar to the one
            at my previous job, and I am definitely getting stuff done
            much faster on holiday at home than at that previous job. Of
            course the codebase is also a lot smaller, there's intrinsic
            motivation, etc., but still.
       
              realusername wrote 16 hours 31 min ago:
              I've done pretty much the same as you (Cursor/Claude) for
              our large Rails/React codebase at work, and the experience
              has been horrific so far; I reverted back to VS Code.
       
                fsloth wrote 4 hours 20 min ago:
                Yeah! It's quite possible my scenario is in the "happy
                accident" valley.
                
                I'm using it mostly for C#, WPF and OpenTK. The type system
                seems to help a lot.
                
                The UI logic it recommends is mostly god-awful. But, at
                least for me, when it's given a pattern it can apply, it
                does so pretty well.
       
              42lux wrote 17 hours 34 min ago:
              How often do you have to build the simple scaffolding though?
       
                fsloth wrote 4 hours 22 min ago:
                At a real job? Not that often! And it's miserable at
                large-scale architecture.
                
                However, at least for me, there is lots of "small enough
                context" boilerplate that it can deal with.
                
                Clearly this is not a tool in the sense of being
                predictable.
       
        croemer wrote 18 hours 14 min ago:
        > GPT‑5 also excels at long-running agentic tasks—achieving SOTA
        results on τ2-bench telecom (96.7%), a tool-calling benchmark released
        just 2 months ago.
        
        Yes, but it does worse than o3 on the airline version of that
        benchmark. The prose is totally cherry-picked.
       
          tedsanders wrote 15 hours 10 min ago:
          I wrote that section and made the graphs, so you can blame me. We no
          doubt highlight the evals that make us look good, but in this
          particular case I think the emphasis on telecom isn't unprincipled
          cherry picking.
          
          Telecom was made after retail & airline, and fixes some of their
          problems. In retail and airline, the model is graded against a ground
          truth reference solution. But in reality, there can be multiple
          solutions that solve the problem, and perfectly good answers can
          receive scores of 0 by the automatic grading. This, along with some
          user model issues, is partly why airline and retail scores haven't
          climbed with the latest generations of models and are stuck around
          60% / 80%. Even a literal superintelligence would probably plateau
          here.
          
          In telecom, the authors (Barres et al.) made the grading less brittle
          by grading against outcome states, which may be achieved via multiple
          solutions, rather than by matching against a single specific
          solution. They also improved the user modeling and some other things
          too. So telecom is the much better eval, with a much cleaner signal,
          which is partly why models can score as high as 97% instead of
          getting mired at 60%/80% due to brittle grading and other issues.
          
          Even if I had never seen GPT-5's numbers, I like to think I would
          have said ahead of time that telecom is much better than
          airline/retail for measuring tool use.
          
          Incidentally, another thing to keep in mind when critically looking
          at OpenAI and others reporting their scores on these evals is that
          the evals give no partial credit - so sometimes you can have very
          good models that do all but one thing perfectly, which results in
          very poor scores. If you tried generalizing to tasks that don't
          trigger that quirk, you might get much better performance than the
          eval scores suggest (or vice versa, if they trigger a quirk not
          present in the eval).
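          
          To make the grading difference concrete, here's a toy
          illustration (my own sketch, not the tau2-bench code):
          
            # Brittle grading: compare the agent's actions against one
            # reference trajectory; any deviation scores 0.
            reference = ["cancel_flight(42)", "refund(42)"]
            
            def grade_by_trajectory(actions: list[str]) -> int:
                return int(actions == reference)
            
            # Outcome-state grading: only check the final state, which
            # several different solutions can reach.
            def grade_by_outcome(state: dict) -> int:
                return int(state.get("flight_42") == "cancelled"
                           and state.get("refund_issued", False))
            
            # Refund first, cancel second: fails the trajectory grader,
            # passes the outcome grader. Note also the all-or-nothing
            # scoring: one wrong step gives 0, not partial credit.
            print(grade_by_trajectory(
                ["refund(42)", "cancel_flight(42)"]))   # 0
            print(grade_by_outcome(
                {"flight_42": "cancelled",
                 "refund_issued": True}))               # 1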
          
          Here's the tau2-bench paper if anyone wants to read more:
          
   URI    [1]: https://arxiv.org/abs/2506.07982
       
            jama211 wrote 6 hours 33 min ago:
            Thanks for your input!
       
            jeffrwells wrote 9 hours 0 min ago:
            OpenAI hiring BCG alumni is all we need to know
       
          jstummbillig wrote 16 hours 11 min ago:
          I mean... they themselves included that information in the post. It's
          not exactly a gotcha.
       
          Fogest wrote 17 hours 58 min ago:
          How does the cost compare, though? From my understanding, o3 is
          pretty expensive to run. Is GPT-5 less costly? If its
          performance is close to o3's but it's cheaper, then it may still
          be a good improvement.
       
            low_tech_punk wrote 17 hours 56 min ago:
            I find it strange that GPT-5 is cheaper than GPT-4.1 on input
            tokens and only slightly more expensive on output tokens. Is
            that marketing, or does it actually reflect the underlying
            compute resources?
       
              bn-l wrote 16 hours 14 min ago:
              Maybe with the router mechanism (to mini or standard) they
              estimate the average cost will be a lot lower for ChatGPT,
              because the capable model won't be answering dumb questions,
              and they pass that saving on to devs?
       
                low_tech_punk wrote 15 hours 32 min ago:
                I think the router applies to the ChatGPT app. The
                developer APIs expose manual control to select the
                specific model and the level of reasoning.
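                
                Something like this, assuming the parameter names match
                the current API reference (worth verifying):
                
                  from openai import OpenAI
                  
                  client = OpenAI()
                  
                  # Pick the exact model tier and reasoning effort
                  # yourself instead of relying on the app's router.
                  resp = client.responses.create(
                      model="gpt-5-mini",
                      reasoning={"effort": "low"},
                      input="Two-sentence summary of this diff: ...",
                  )
                  print(resp.output_text)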
       
              AS04 wrote 17 hours 49 min ago:
              Very likely an actual reflection. That's probably their real
              achievement here, and the key reason they are actually
              publishing it as GPT-5: more or less the best, or near it,
              at everything, while being one model that is substantially
              cheaper than the competition.
       
                ComputerGuru wrote 16 hours 17 min ago:
                But it can’t do audio in/out or image out. Feels like an
                architectural step back.
       
                  conradkay wrote 15 hours 22 min ago:
                  My understanding is that image output is a pretty
                  separate model, and if it doesn't seem that way, it's
                  because they're abstracting several models behind one
                  name.
       
        andrewmcwatters wrote 18 hours 15 min ago:
        I wonder how good it is compared to Claude Sonnet 4, and when it's
        coming to GitHub Copilot.
        
        I almost exclusively wrote and released [1] yesterday with GPT-4o
        and Claude Sonnet 4, and the latter's agentic behavior was quite
        nice. I barely had to guide it, and I was able to quickly verify
        its output.
        
   URI  [1]: https://github.com/andrewmcwattersandco/git-fetch-file
       
          fleebee wrote 15 hours 19 min ago:
          There is an option in GitHub Copilot settings to enable GPT-5
          already.
       
       
   DIR <- back to front page