bogorad's blurblog

Researchers find AI chatbots are better at debating than humans
Tuesday September 16^th, 2025 at 5:42 AM

Reason.com

Who/What/When/Where/Why: Researchers in May 2025 (published in Nature Human Behavior) ran online debates between humans and GPT-4 to test whether providing basic personal information about opponents improves persuasive success.
Design: Participants were randomly assigned to debate a human or GPT-4; some debates included opponents' age, sex, ethnicity, employment, and political affiliation; topics included public policy such as universal basic income.
Personalization variables: Basic demographic and political details (age, sex, ethnicity, employment, political affiliation) were the personalized inputs tested.
Argument styles: GPT-4 primarily used logical reasoning and factual evidence, while human debaters more often used expressions of support and storytelling.
Tailored messaging example: When given opponent demographics, GPT-4 framed UBI as promoting economic growth and empowerment for a middle‑aged white male Republican and as a safety net and economic justice measure for a Black middle‑aged female Democrat.
Persuasion outcomes: With personal information GPT-4 was more persuasive than human debaters about 64% of the time; without personal information GPT-4 performed roughly the same as humans; human performance did not improve with personal information.
Detection and effect on agreement: Participants correctly identified AI opponents in about three out of four cases, and participants who believed they were debating an AI reported greater agreement with their opponent than those who believed they were debating a human.
Broader implications and related evidence: Researchers warned personalized persuasive AI could be used for large‑scale disinformation; a 2024 Science study found AI dialogues reduced conspiracy-belief confidence by ~20% (example case 100%→40%) with effects lasting at least two months, but noted models could also promote epistemically suspect beliefs absent guardrails.

In a May 2025 study in Nature Human Behavior, researchers set up online debates between two humans, and between a human and the large language model GPT-4. In some debates, they provided both humans and AI with basic personal information about their opponents—age, sex, ethnicity, employment, political affiliation. They wanted to find out if such personalized information helped debaters both human and machine to craft more persuasive arguments.

Debaters were randomly assigned to either human or AI opponents. According to the study, GPT-4 heavily relied on logical reasoning and factual knowledge whereas humans tended to deploy more expressions of support and more storytelling. After debating, human participants were asked if their views had shifted, and if they thought their opponent was human or AI.

AI more effectively deployed personalized information in its debates than did humans. For example, in arguing the affirmative during a debate with a middle-aged white male Republican on the topic "Should Every Citizen Receive a Basic Income from the Government?" the AI highlighted arguments that universal basic income (UBI) would boost economic growth and empower all citizens with the freedom to invest in skills and businesses. When arguing with a black middle-aged female Democrat, the AI emphasized how UBI would function as a safety net, promoting economic justice and individual freedom.

When GPT-4 had access to personal information about its opponents, researchers found it was more persuasive than human debaters about 64 percent of the time. Without the personal information, GPT-4 success was about the same as a human. In contrast, human debaters didn't get better when supplied with personal information.

Participants debating AI correctly identified their opponent in three out of four cases. Interestingly, the researchers report that "when participants believed they were debating with an AI, they changed their expressed scores to agree more with their opponents compared with when they believed they were debating with a human." They speculate that peoples' egos are less bruised by admitting they had lost when their opponent was an AI rather than another human being.

The persuasive power of AI after accessing basic personal information concerned researchers who worry that "malicious actors interested in deploying chatbots for large-scale disinformation campaigns could leverage fine-grained digital traces and behavioural data, building sophisticated, persuasive machines capable of adapting to individual targets."

A 2024 study in Science showed that AI dialogues could durably reduce conspiracy beliefs. The researchers recruited participants who endorsed at least one of the conspiracy theories listed on the Belief in Conspiracy Theories Inventory, which include those related to John F. Kennedy's assassination, 9/11 attacks, the moon landing, and the 2020 election.

More than 2,000 participants were asked to explain and offer evidence for the beliefs they held, and state how confident they were in the belief. The researchers then prompted the AI to respond to the specific evidence provided by the participant to see if AI could reduce their belief in the conspiracy.

In one example, a participant was 100 percent confident that the 9/11 attacks were an inside job, citing the collapse of World Trade Center Building 7, President George W. Bush's nonreaction to the news, and burning jet fuel's temperature being incapable of melting steel beams. In its dialogue the AI cited various investigations showing how debris from the Twin Towers brought down Building 7, that Bush remained composed because he was in front of a classroom of children, and that burning jet fuel was hot enough to compromise the structural support of steel beams by 50 percent. After the dialogue the participant reduced her level of confidence in the conspiracy theory to 40 percent.

Overall, the researchers reported that AI dialogues reduced confidence in participants' conspiracy beliefs by about 20 percent. The effect persisted for at least two months afterward. "AI models are powerful, flexible tools, for reducing epistemically suspect beliefs and have the potential to be deployed to provide accurate information at scale," argue the authors. However, they note that "absent appropriate guardrails….such models could also convince people to adopt epistemically suspect beliefs."

These studies confirm that AI is a powerful tool for persuasion. Like any other tool, though, it can be used for good or evil.

Read the whole story

bogorad

3 hours ago

Barcelona, Catalonia, Spain

The Feds Destroyed an Internet Weapon, but Criminals Picked Up the Pieces - WSJ
Tuesday September 16^th, 2025 at 5:32 AM

Who/What/When/Where/Why: Federal authorities last month disrupted a botnet online, freeing roughly 95,000 devices that were quickly re‑compromised to power larger criminal DDoS and extortion operations worldwide.
New-generation botnets: Emerging botnets leverage internet-connected devices with faster processors and greater bandwidth, increasing potential to disrupt broad swaths of internet connectivity.
Takedown consequence: The disruption inadvertently released devices that rival operators, notably the Aisuru botnet, seized—Aisuru took control of more than a quarter of the freed machines.
Recorded attack size: Cloudflare measured a DDoS on Sept. 1 that peaked at 11.5 trillion bits per second, enough to consume the download bandwidth of over 50,000 consumer connections.
Attack pattern: Recent massive attacks have often been very short (seconds) and numerous, suggesting demonstrations of capacity rather than sustained campaigns.
Botnet growth examples: Google disrupted a botnet that grew from ~74,000 Android devices in 2023 to over 10 million, used for ad fraud but with potential for ransomware or DDoS.
Extreme-scale threats: The ResHydra botnet, composed of tens of millions of devices, represents a level of size that researchers warn could inflict extreme damage on a country.
Response and readiness: Law‑enforcement agencies (led in the takedown by the Defense Criminal Investigative Service) and tech firms like Google, Nokia and Cloudflare are countering botnets, but network operators warn existing infrastructure may be unprepared.

Illustration of a skull composed of binary code on computer screens.

Illustration: Emil Lendof/WSJ, iStock

Federal authorities recently disrupted a network of hacked devices used by criminals in some of the largest online attacks yet seen. Now those devices have been hacked by someone new to build an even bigger weapon.

Law-enforcement agencies and technology companies are waging a war against increasingly powerful networks of hacked devices, called botnets, that can knock websites offline for a fee. They are used for extortion and by disreputable companies to knock rivals offline, federal prosecutors say.

But lately, a new age of dangerous botnets has arrived, and existing internet infrastructure isn’t prepared, some network operators say. These botnets are leveraging new types of internet-connected devices with faster processors and more network bandwidth, offering them immense power.

The criminals controlling the botnets now have the capabilities to move beyond website takedowns to target internet connectivity and disrupt very large swaths of the internet.

“Before the concern was websites; now the concern is countries,” said Craig Labovitz, head of technology with Nokia’s Deepfield division.

Craig Labovitz, head of technology for Nokia’s Deepfield division, speaking in June. PHOTO: NOKIA

In August, federal prosecutors charged a 22-year-old Oregon man with operating a botnet that had shut down the X social-media site earlier this year. The lead agency coordinating the operation was the Defense Criminal Investigative Service.

But the federal takedown last month appeared to have an unwanted consequence: freeing up as many as 95,000 devices to be taken over by new botnet overlords. That led to a free-for-all to take over the machines “as fast as possible,” said Damian Menscher, a Google engineer.

The operators of a rival botnet, called Aisuru, seized control of more than one-fourth of them and immediately started launching attacks that are “breaking records,” he said.

On Sept. 1, the network services company Cloudflare said it had measured an attack that clogged up computer networks with 11.5 trillion bits of junk information per second. That is enough to consume the download bandwidth of more than 50,000 consumer internet connections. In a post to X, Cloudflare declared this attack, known as a distributed denial of service, or DDoS, a “world record” in terms of intensity. Some analysts see it almost as an advertisement of the botnet’s capabilities.

It was one of several dozen attacks of a similar size that network operators have witnessed over the past weeks. The attacks were very short in duration—often lasting just seconds—and may be demonstrations of the Aisuru capabilities, likely representing just a fraction of their total available bandwidth, according to Nokia.

With the world’s increasing dependence on computer networks, denial-of-service attacks have become weapons of war. Russia’s intelligence service, the GRU, used DDoS attacks on Ukraine’s financial-services industry as a way to cause disruption ahead of its 2022 invasion, U.K. authorities have said.

Botnets such as Aisuru are made up of a range of internet-connected devices—routers or security cameras, for example—rather than PCs, and often these machines can only join one botnet at a time. Their attacks can typically be fended off by the largest cloud-computing providers.

One massive network that Google disrupted earlier this year had mushroomed from at least 74,000 Android devices in 2023 to more than 10 million devices in two years. That made it the “largest known botnet of internet-connected TV devices,” according to a July Google court filing.

This network was being used to click billions of Google advertisements in an ad fraud scheme, Google said, but the massive network “could be used to commit more dangerous cybercrimes, such as ransomware” or denial-of-service attacks, the Google filing said.

To date, denial-of-service attacks are spawned from networks like Aisuru that typically include tens of thousands of computers, not millions, making them easier to defend against.

In the past year, a very large botnet that has typically been used for fraud began launching online attacks. Called ResHydra, it is made up of tens of millions of devices, according to Nokia.

ResHydra represents a whole new level of problem, said Chris Formosa, a researcher with the networking company Lumen’s Black Lotus Labs. Harnessing a botnet of that size would “do extreme damage to a country.”

Write to Robert McMillan at robert.mcmillan@wsj.com

Corrections & AmplificationsThe lead agency coordinating the takedown of a powerful botnet last month was the Defense Criminal Investigative Service. An earlier version of this article incorrectly said the Federal Bureau of Investigation led the action. (Corrected on Sept. 15)

Read the whole story

bogorad

3 hours ago

Barcelona, Catalonia, Spain

How GPT5 + Codex took over Agentic Coding — ft. Greg Brockman, OpenAI
Tuesday September 16^th, 2025 at 5:19 AM

Latent.Space

WHO/WHAT/WHEN/WHERE/WHY: Latent Space summary of OpenAI’s GPT-5-Codex launch and related podcast/interviews, timed with GPT-5 releases and promoting the AI Engineer CODE Summit (Nov 19–22, NYC) to recap coding-agent advances and evaluate agentic coding progress.
GPT-5-Codex launch and benchmark: GPT-5-Codex released with a 74.5% score on SWE-bench (500 tasks), presented as comparable to GPT-5’s reported ~74.9% on a subset and positioned as a coding-focused advance.
Multi-interface agent strategy: Codex deployed across multiple surfaces—CLI, Codex Cloud (ChatGPT Codex), IDE extension (≈800k installs in ~2.5 weeks), and a GitHub code-review bot—to cover diverse developer workflows.
Code review bot utility: The @codex review bot targets high-signal PR review (intent/contracts, deep dependency checks), reported internally to accelerate teams and surface issues that take humans hours to find.
Post-training improvements: OpenAI highlights post-training qualities like “variable grit” (fast on simple tasks, sustained multi-hour work on complex refactors), improved code quality, and measurable reductions in hallucination/grounding issues.
Need for new agentic evaluations: Existing spot checks and benchmarks are deemed insufficient for agentic coding; the author calls for blind, multi-turn/multi-step tests on live open-source codebases with maintainer ratings and reports that such tests are being run.
Developer guidance and system design: Podcast discussion covered model routing/hybrid architectures, instruction hierarchies for robustness, pricing and inference efficiency trends, on-device vs remote agent trade-offs, and advice to structure codebases for AI-assisted modules and fast unit testing.
Resources and follow-ups: Article aggregates Greg Brockman podcasts/interviews, show notes, timestamps, links to related model releases (GPT-OSS), papers and demos, and promises future updates with blind-test results and additional analysis.

This is the last in our GPT-5 coverage of the vibes, bootstrapping, vision, and Router.

ICYMI, we are back with the AI Engineer CODE Summit, Nov 19-22 in NYC! Summits are usually >10x oversubscribed, with the most high signal content & attendees. If you are keen on developer productivity and what’s new in SWE agents, apply today.

The new GPT-5-Codex launches today, capping off perhaps the most intense month of vibe shifts in Coding Agents in recent memory (click to expand):

[

](https://substackcdn.com/image/fetch/$s_!EdwL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1d1507e-5ec4-44ff-879b-8f09dff08dd7_1730x1514.png)

For a little over a year now, starting with Claude 3.5 Sonnet in June, 3.7 Sonnet and Claude Code in Feb, and Claude 4 in May, Anthropic has enjoyed an uncontested dominance in coding usecases, leading to an epic runup to $5B in revenue (10% of which is Claude Code) and a $183B valuation1, adding $122B in market cap.

That seems to have ignited a fire in OpenAI, which of course shipped the original 2021 Codex that kicked off GitHub Copilot, the original AI coding tool with 182 creators and counting2, and whose GPT3 inspired Debuild which presaged all the vibe coding startups, and of course started to reprioritize coding abilities in o1 and GPT 4.1.

GPT-5-codex’s 74.5% on SWE-bench (full 500) is kind of a wash vs the (infamously memed to pieces) GPT-5 thinking’s performance of 74.9% (477 task subset), so what is the cause of this major shift in GPT5 sentiment?

Well, for one, the Codex team has been COOKING.

Factor 1: Many Faces, One Agent

As Greg says in today’s podcast, MANY people pitched in:

“At the beginning of the year, we set a company goal of an agentic software engineer by the end of the year. And figuring out exactly what that means and how to substantiate that and how to bring together all the opportunity and all the kind of compute that we have to bear on this problem. That has been a great undertaking for many, many people at OpenAI.”

The original A-SWE agentic harness was called 10X, and did live in the terminal, but since launching the new Codex CLI and then “ChatGPT Codex” (now Codex Cloud) and then the IDE extension (now at 800k installs after 2.5 weeks) and GitHub code review bot, there is now a complete set of interfaces to match all needs:

[

](https://substackcdn.com/image/fetch/$s_!sz_y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7de5e8e2-a464-4b11-8b64-cfaf5b191d33_2658x1676.png)

Here’s our rough illustration of the various tradeoffs in the Codex universe:

[

](https://substackcdn.com/image/fetch/$s_!zI8y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F870469af-ebcc-4dc2-87ae-b52caf56dcdd_1336x1148.png)

Although perhaps causing the least fanfare, the @codex code review bot might have the highest utility because of it’s very tight scoping:

“We started to notice that the big bottleneck for us was with increased amounts of code needing to be reviewed is like the amount of review simply that people had to do on the teams.

We decided to really focus on a very high signal Codex mode where it's able to review a PR and really think deeply about the contract and the intention that you were meaning to implement and then look at the code and validate whether that intention is matched and found in that code.

And it's able to go layers deep, look at all the dependencies, think about the contract and really raise things that some of our best employees, some of our best reviewers wouldn't have been able to find unless they were spending hours really deeply thinking about that PR.

We released this internally first at OpenAI. It was quite successful and people were upset actually when it broke because they felt like they were losing that safety net, and it accelerated teams and including the Codex team tremendously.”

Factor 2: Better Post-Training Qualities

We can’t see the datasets of course, but the other thing that OpenAI always emphasizes about their work is the tight integration of research and product. In today’s podcast we also heard a few references towards some of the desired qualities:

Variable Grit

Thibault Sottiaux:

“One of the things that this model exhibits is an ability to go on for much longer and to really have that grit that you need on these complex refactoring tasks.

But at the same time, for simple tasks, it actually comes way faster at you and is able to reply without much thinking. And so it's like this great collaborative where you can ask questions about your code, find where this piece of code is that you need to change or better understand, plan. But at the same time, once you let it go onto something, it will work for a very, very long period of time.

We've seen it work internally up to seven hours for very complex refactorings. We haven't seen other models do that before. And we also have really worked tremendously on code quality. And it's just really optimized for what people are using GPT-5 within Codex for.

This tenacity judiciously applied is what makes GPT-5-Codex a much more useful agentic coding model all-round, not just optimizing for the most difficult problems only and then requiring a model switcher for dumber models (interestingly, it also doesn’t use ChatGPT’s GPT-5 Router we wrote about):

[

](https://substackcdn.com/image/fetch/$s_!ebZW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b67bb7a-1df0-456a-b6da-9e43bc8969f4_2499x1206.png)

https://x.com/swyx/status/1967651870018838765/photo/1

Getting out of Ruts and pushing back

Greg:

“I remember for GPT-3 and for GPT-4 really focusing on the doubling down problem. Do you remember if the AI would say something wrong and you'd point out the mistake? It would try to convince you that it was right. We're so far past that being the core problem.

It's really amazing to see that we're at a level where even when it's not quite zeroed in on the right thing, it's highlighting stuff that matters. It has pretty reasonable thoughts. Greg Brockman: And I always walk away from these code reviews thinking like, huh, OK, yeah, that's a good point.”

We have no idea how they achieved this level of groundedness, but it is very likely correlated with the measurable drops in hallucination in GPT-5:

[

](https://substackcdn.com/image/fetch/$s_!hn6O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc1fb78b-622c-4452-be57-32e646b312e5_894x989.png)

All of these hard-to-articulate qualities add up to one thing: requiring new evals.

Factor 3: New Evals for Agentic Coding

The problem with the negative first reactions around the GPT-5 launch is that most people had not actually used the model in anything real and just reacting to headlines and chart crimes. Those of us who had, myself included, had already gone through the adjustment and sentiment shift, and the GPT-5 Sentiment Flip predictably happened on schedule, just like it did with the o1 release.

The first idea I got immediately after the shooting the GPT-5 for Developers video was “we’re gonna need better vibe checks”. Everyone carries around their favorite spot checks - in our video, Theo did the Hexagon ball thing (and changed tune post release), Simon did PelicanBench, Ben of course tested writing and came back for his Latent Space hat trick.

But the thing I mentioned on the Dev video about GPT-5’s agentic coding abilities is very real: here’s the codebase I actually tried live in the video when we got access:

[

](https://substackcdn.com/image/fetch/$s_!zmnb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ad24db-d820-47b1-a03b-de4849fe16a1_1488x1330.png)

I had very nice Cursor logs to prep for this blogpost but… they seem to have evaporated from my chat history logs, sorry :/

This was extra impressive because I had been stuck for literal months throwing dozens of hours of Claude Code at it to no avail, whereas GPT-5 “thought with tools” - instrumented the code, had me read back the logs to it, then found the solution.

That’s the problem with the social media pressure for the loudest most confident takes on new models the moment they release - you can’t just run simple one-turn, minimal-tool-call tests to gauge the vibes of the model (my quip is “you can vibecode any website you want, so long as it has blue and purple gradients”).

Even Aider’s polyglot benchmarks aren’t really testing agentic coding, by which I mean multi-turn, multi-step, [thinking-with-tools](http://Latent Space hat trick) coding agents on real codebases.

The solution was obvious: make a blind taste test of models on real live open source codebases on real tasks and have maintainers rate their performance!

So that is exactly what we did:

[

](https://substackcdn.com/image/fetch/$s_!kKkO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b160265-9971-4d46-b782-d54e66e674cc_722x544.png)

We will update this post later with the results of these tests.

In any case, as with all things Koding, Greg Brockman was the frontman driving this shift in OpenAI’s coding capabilities. I had the privilege of interviewing him for the World’s Fair, and in this newsletter issue we are recapping the OpenAI podcast released today and his Latent Space pod released last month - all collected in one place for those just catching up on the great OpenAI coding comeback.

Enjoy!

Greg and Thibault on OpenAI Podcast

Greg Brockman on Latent Space

this episode was recorded and aired last month.

Show Notes

Timestamps

[00:00:04] Introductions
[00:01:04] The Evolution of Reasoning at OpenAI
[00:04:01] Online vs Offline Learning in Language Models
[00:06:44] Sample Efficiency and Human Curation in Reinforcement Learning
[00:08:16] Scaling Compute and Supercritical Learning
[00:13:21] Wall clock time limitations in RL and real-world interactions
[00:16:34] Experience with ARC Institute and DNA neural networks
[00:19:33] Defining the GPT-5 Era
[00:22:46] Evaluating Model Intelligence and Task Difficulty
[00:25:06] Practical Advice for Developers Using GPT-5
[00:31:48] Model Specs
[00:37:21] Challenges in RL Preferences (e.g., try/catch)
[00:39:13] Model Routing and Hybrid Architectures in GPT-5
[00:43:58] GPT-5 pricing and compute efficiency improvements
[00:46:04] Self-Improving Coding Agents and Tool Usage
[00:49:11] On-Device Models and Local vs Remote Agent Systems
[00:51:34] Engineering at OpenAI and Leveraging LLMs
[00:54:16] Structuring Codebases and Teams for AI Optimization
[00:55:27] The Value of Engineers in the Age of AGI
[00:58:42] Current state of AI research and lab diversity
[01:01:11] OpenAI’s Prioritization and Focus Areas
[01:03:05] Advice for Founders: It's Not Too Late
[01:04:20] Future outlook and closing thoughts
[01:04:33] Time Capsule to 2045: Future of Compute and Abundance
[01:07:07] Time Capsule to 2005: More Problems Will Emerge

Transcript

Introductions

Alessio [00:00:04]: Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by Swyx, founder of Small AI.

Swyx [00:00:10]: Hello, hello. And we are so excited to have Greg Brockman join us. Welcome. Thank you for having us. Excited to be here. You need no introduction. So I was like mentally going to introduce you. I just get right to it. Congrats on GPT-5, GPT-OSS, like all the stuff that's going on in opening islands. Where are you going to get to all of that? It's really good to have you here. How does it feel? Last week was like a whole maelstrom of releases.

Greg [00:00:33]: releases. Wild. It was absolutely wild to get so many things out in one week. But yeah, so we've released our open source models, which are models that we've been working on for some time. I think really pack in a bunch of the advances that we've been making at OpenAI into a very small form factor, very accessible, now being used by, you know, there's been millions of downloads of that just over the past couple of days. We also released GPT-5, again, something we've been working on for a very long time. And so just having these out in the world and really having done that release process is something that I'm just really excited about. I'm just really proud of the team for doing.

The Evolution of Reasoning at OpenAI

Alessio [00:01:04]: And GPT-5 is the first hybrid model, so most people don't get to choose one model. And that's a whole other drama we will not cover. A whole other thing. But you started originally the reasoning team with Ilya at OpenAI. So maybe can you just give a quick history of reasoning at OpenAI? So you started with just, you know, next token prediction. And then at some point you thought reasoning was something important to build. What was the path from there to GPT-5 where now it's like kind of hidden from the user?

Greg [00:01:31]: Well, I'd say that. After we trained GPT-4, we had a model that you could talk to. And I remember doing the very first. We did the post-training. We actually did a instruction following post-train on it. So it was really just a data set that was, here's a query. Here's what the model completion should be. And I remember that we were like, well, what happens if you just follow up with another query? And it actually was able to then have a response that took into context the whole previous chain of question and answer. And you realize this thing can do chat, right? It can actually talk to you. It can actually use, leverage all of this information, even though it wasn't trained to do it. And I remember we had this question. We had a research meeting with a bunch of people, you know, Jakob, Ilya, Wojciech, others. And the question was, why is this not AGI? Right? This model clearly is not AGI, but it's really hard to describe why, right? It's like able to answer any question you put in front of it. And okay, it's not quite reliable. It makes mistakes. It falls off the rails. Okay. That's a real gap. And so what do we need to do to close that gap? And the most obvious thing you need to do is actually have it test out its ideas in the world, right? Actually do reinforcement learning, like try out some hypotheses, get some feedback and from there become reliable. And this is not a new idea to us, right? If you rewind to even 2017, we were working on Dota, which was all reinforcement learning, no behavioral cloning from human demonstrations or anything. It was just from a randomly initialized neural net. You'd get these amazingly complicated, very sophisticated, very correct behaviors. And it's like. That's the reliability we wanted for our language models. So really the moment we trained GPT-4, we knew that we needed to get to the reasoning paradigm. And it was just a question of how. So we had like 10 ideas, a bunch of different hypotheses about what might work. And people really set out to go and try to make it be reality. And so it was really the labor of many people at OpenAI across many years. And I think the way that this, the progress in this field works is you need to have conviction on a direction. Right. And the first 10 things you try will fail. And most of the things on that list of 10 did not succeed, but we made one of them work. And I think that that's the real key is that we just keep pushing and pushing and that you get little signs of life and you keep growing from there. And so now Jerry runs our reinforcement learning team and has made really great strides there. There's really amazing infrastructure work. People like Wenda, people from the inference side, people like Felipe, there's many people across OpenAI that all come together. And I think it's really important for us to be able to work together to really make this work.

Online vs Offline Learning in Language Models

Swyx [00:04:01]: Amazing. I was going over, you know, when you, when you were with me on the engineer conference, you talked about the Turing paper, which you love and got you started in some ways on your machine learning journey. And I think actually he kind of anticipated the learning machine would be partially online. You know, and I think like, that's one of the questions I always had when reflecting on this journey from like three, four to five, like is learning, like learning started all offline and like all pre-trained. And now it's slowly coming back. It's coming online. Do you think that's accurate?

Greg [00:04:31]: Yeah. I think it's a very interesting question, right? Where does the learning happen? And I think we're still not at the full kind of learning loop that humans do. Yeah. Right. Which it's also not really clear. Are humans fully online? Because it's like, you know, you go to sleep, like there's a lot of, of, you know, sort of back propagation, so to speak, that happens into your long-term memory. So I think that exactly how humans work is not necessarily mapped, you know, represented by how our machines work. But we are moving from a world where it's just, you go and build a machine, you build a machine, train once, and then you're inferencing a ton to a world where there's actually this loop of you inference and you train on those inferencings. And one thing that Ilya used to say a lot that I think is, is, is very, very astute is that when the models are not very capable, right, that the value of a token that they generate is very low. When the models are extremely capable, the value of a token they generate is extremely high. Right. It's something that's like very thoughtful. It's something that's, that's, you know, that's important. And reinforcement learning has this property, right, that you're generating a bunch of data because the model's trying stuff and then you train on that data. And so somehow the model's observations, you know, also normalized by contact with reality or, you know, somehow selected by, by contact with reality, get fed back into the machine. And that is, I think, something that we're starting to get very good at learning from. And the scale required is very different, right? That if you look at pre-training, your, your 10 examples of something doesn't go anywhere, right? You're talking hundreds of thousands of any little type of, of behavior. And then that's what you learn from, which is totally, totally unlike how humans learn. Again, I think, right, if you're, if you think about, recapitulate all of evolution and also think about your 20 years worth of developmental history, there's a lot of just observing the world that happens. There are lots of bits of information that kind of flow through, through your, your senses. But with the reinforcement learning paradigm, if you have 10 examples or a hundred examples of something, right, 10 paths that you're supposed to do, and the model tries a bunch of times that it's actually able to learn from that. And so you really get this leverage. And then you can leverage out of the human curator, creating those tasks and are able to actually get very sophisticated behaviors from the models. And now there's the next step of just having a model that as it goes, it's learning online. We're not quite doing that yet, but the future is not yet written.

Sample Efficiency and Human Curation in Reinforcement Learning

Alessio [00:06:44]: We had this discussion with Noam Brown about simple efficiency. Do you feel like today the bottleneck is still the human data curator that creates these like great tasks for RL to work? Or do you feel like it's still the simple efficiency of the model?

Greg [00:06:57]: Well, the bottleneck is always computing. Right. And, and, and I mean that in a real way, right? It's just like, it's very clear that if you give us a lot of compute that we will find ways to iterate that actually make the most of that, that compute. We are in a world where right now we now have much more sample efficient algorithms, right? With, with the RL paradigm, but it does take a lot of compute still, right? It's like that you have like one task a human created or 10 tasks or a hundred tasks or some small number of those. And then you have a model that tries a bunch of times. Yeah. And then you have a model that tries a bunch of times, not just 10 times, but 10,000 times to try to accomplish one task and you select from those and you learn from, from that. And again, it's like the amount of human leverage you get as a human designer, there's extremely high, but the amount of compute that you have to pour in in order to make it work grows proportionally.

Swyx [00:07:45]: I would say like one way to expend more compute in the learning process, Alan Turing actually like foresaw a lot of this. He had this concept of super critical learning instead of sub-critical learning, meaning we present learnings to machines or teach things to machines. They learn just the immediate thing that we just taught. But super critical means you also think through the second and third and fourth order effects of whatever you just learned, like to update the rest of everything else that you know. So like what are the creative ways in which we spend more compute, right? Like if we had 10x more compute or a thousand x more compute, where does it go?

Scaling Compute and Supercritical Learning

Greg [00:08:16]: I'll just say we will find ways to realize that. Please give us. But I mean it kind of seriously, right? The way that this works. Like if you rewind to something like Dota. We set out to develop new reinforcement learning algorithms because it was very clear to everyone that reinforcement learning, the algorithms that existed at the time, did not scale. Everyone knew it. And I remember Jacob and Shimon saying, why do we believe that? Has anyone actually tested it? And no one had actually really tried to scale up just plain old-fashioned PPO and say, well, that's the baseline. We got to do it. And I remember you come back to the office every week, they doubled the number of cores. It's only the agent. True skill was going up. It's up and to the right. And it's like, okay, you just got to keep pushing it until you hit the wall. And clearly we'll hit the wall and then we can go and do the actual interesting stuff. And we never hit the wall. And you realize that actually the journey of that scaling, that is the interesting stuff, right? Of really doing the engineering. And of course you have bugs and those bugs cause a wall, but you fix the bug, right? You have different issues with how your neural nets initialized or the scaling variance or whatever the issues are. But those are not the fundamentals of the algorithm of the science. And so I think that's kind of the world that we're in. Is one where it's like. We will push on every dimension and maybe we hit a wall. Most of the time, those walls are like just bugs and silly things. And so you can keep going. Sometimes the ROI for fixing those is really hard, right? So it's like, it's not really worth it because you have a different dimension, right? Do you want to push the model to be larger and do more pre-training compute, or do you want to do more RL? And so push more compute to the actual test time. And there's all sorts of dimensions that you can put compute into. And in some ways I think of compute as this, like, you know, we're doing this refining process. Ultimately start with energy. It turns into compute, turns into intelligence, and it's almost crystallizing that compute into the potential energy that can be converted into the model doing something useful. It's a really beautiful thing, right? It's like the compute as this like fundamental driver, this fundamental fuel of intelligence and sort of shapes a neural net, sort of outputs a program. And you know, of course the nice thing about that program is you can run it many, many times even though you put all this compute in that you actually have this amortization that you're going to use it far more times than the amount of effort you put into creating it once. And so it's just like a, it's a beautiful paradigm.

Alessio [00:10:27]: Yeah. You're kind of turning kinetic energy into potential energy in the model. And do you feel like the energy that it's already in this models, we can then turn back into kinetic to do our all in every other domain because we got the IMO gold. I mean, we in the, you, you guys, everybody, do you feel like those same techniques and the same base models can then get us to the goal? IMO gold equivalent. Yeah. I mean, if we just scale the compute or do you feel like there's still some work to do?

Greg [00:10:57]: Well, we have pretty good evidence on things like the IMO models actually also getting us a gold in IOI, which is just the same. Yeah. I mean, I think we did like, I think we talked about some of the details. There's a little bit of difference in the harness, but like the harness is not the gold literally. Right. It's like the actual underlying models and there's no training there that we did specifically. This ended up being just a side project of a few people are like, oh, we may as well do IOI. Right. And it's just a wild fact to me because that used to be something that would be a total grand challenge. You know, many, many people working on and I'm the core IMO team at OpenAI was actually three people. Right. Wasn't this massive effort. And so you realize that there's maybe some specialization required for some of these domains, right? Maybe some amount of additional work, some amount of go gather data set. But fundamentally we have this general purpose learning technology and that learning to solve hard problems is actually a very transferable skill. Learning how to solve hard math problems and write proofs turns out to actually transfer to writing program and competition problems. Now if you've never run a physics experiment, right, if you've never actually gone and tried to mix together some chemicals or something, you're probably not going to be magically good at those things. And so that there is something about the limitations of generalization, right, that you do need to actually have some real world experience and try it out. But these models, they go almost unreasonably far already. And we see this all the time. Where we have wet lab scientists. We have lab scientists who took models like O3, ask it for some hypotheses of here's an experimental setup. What should I do? They have five ideas. They tried these five ideas out, four of them don't work, but one of them does. And the kind of feedback we were getting on O3 was resulting work is something that could be published in a mid-tier journal, not the top tier journal, but a mid-tier journal. It'd be kind of the work you'd expect from some sort of third year, fourth year PhD student. And again, it's just a wild fact. That's where we are with O3. And we see exactly how to improve O3 on all dimensions. And it requires compute. It requires a lot of work. It requires getting the task. It requires a lot of human's intellectual love and labor and time and really pouring our heart and soul into it. But the result, to your point, it's like we produce the thing that has all this potential energy within it. And then the amazing thing is that you don't release that potential energy once, right? It's a checkpoint that you can use many, many times across all of these tasks. And that is something that I think really can uplift all of humanity. That's so inspiring.

Wall clock time limitations in RL and real-world interactions

Swyx [00:13:21]: I wanted to backtrack on two things. One about the wall. One thing I was trying to get into this debate with Noam on was I think there is a wall in terms of wall clock time because time has to pass. Like the problem with RL interacting with environments and simulation is sure you can speed up the simulations faster than real time. At some point you have to match wall clock time. So like, you know, you can see us converging towards like the pace of iterations towards wall clock time in terms of getting closer and closer to real time. And so I think that's a really interesting thing to think about. modeling the real world. I don't know if you have any thoughts on tackling that. Obviously, we're not there yet, so we don't have to worry about it.

Greg [00:13:57]: Yeah, I think this is a pretty fundamental barrier, right? And of course, the models have very non-human affordances, right? You can run many copies of them. And so you can scale out even if you can't decrease the latency. And it's also very interesting to think about where the compute goes, right? Because we're going to move from a world where most of the compute is training the model, as we've deployed these models more. You know, more of the compute goes to inferencing them and actually using them. But then if you think about, well, you're going to have these models that are going to be interacting with the real world a lot. And so they should probably think a lot about every single action, right? So you might end up with tons of compute spent per real world interaction. And so it really shifts around where you'd expect the compute to actually be expended. And I think that really having good harnesses that are very efficient, right? Do you think about things like, if I have been taking a bunch of steps in some rollout in the real world, how do I checkpoint that, right? And if you have a system that you need to restart it and it's going to forget all of its current state, like that's probably pretty bad. And so I think that there's just something very different about the digital world where everything can be perfectly observed and checkpointed and preserved, as opposed to reality that's much more messy and complicated. And I think it's not a bad thing, right? I think that we've seen agents with things like Dota that are able to operate in very complicated, very messy environments. So the algorithms are capable of it. And by the way, Dota was like a 300 million parameter neural net. Tiny, tiny little insect brain, right? Now we're starting to scale up to things that are much more comparable to human scale in terms of number parameters, maybe in terms of number of compute. We're not necessarily quite there. I think you could look at the math in different ways. But fundamentally, we are making progress towards the real goal. And if you think about what an AGI should be, it should be something that is capable of interacting with the real world in ways that are very productive. Yeah.

Swyx [00:15:51]: Back off the envelope. I think that the numbers I have in my head, you can correct me if I'm orders of magnitude off, but it's something like humans have 100 trillion neurons. We're in the multiple low double digit to high single digit range for GPT-4, 4.5, and 5, but we're not confirming that.

Greg [00:16:08]: But we're scaling there. Yeah. I'd say 100T synapses, which kind of corresponds to the weights of the neural net. Yeah. And so there's some sort of equivalence there. Yeah. And so we're starting to get to the right numbers. Let me just say that.

Swyx [00:16:20]: And then just on a biological basis, this is an opportunity I didn't even get to ask you last time on what you learned from ARC Institute. You had a sabbatical there. I'm curious if that informs anything that you do at OpenAI now.

Experience with ARC Institute and DNA neural networks

Greg [00:16:34]: Well, the thing I found most remarkable about working on DNA neural nets is that they're exactly the same. Yeah. Right? It's just you replace human language. It's even like a simpler vocab. It is. Yeah. Yeah. You've got four letters.

Swyx [00:16:47]: But don't you tokenize at a higher level? Yeah.

Greg [00:16:49]: I mean, so you can. But actually, the way that we approach it, we just did- Character level? Character level.

Swyx [00:16:54]: No way. Yeah. Why not? Well, I guess there's no reason. I don't know.

Greg [00:17:00]: There's only four. Right. Right. Right. And this to me is, I think, the core. Like, one of the interesting things about human language is we understand the semantics, right? We kind of understand what it means, what the structure is. It's very easy for us to observe. We kind of have a sense of when you look at a tokenization scheme, you have a sense of did you capture, like, all of the words in a reasonable way and all this stuff. Biology, it's an alien language. And the thing that's very interesting is that, you know, for humans, it's an alien language. But if you look at a neural net, why should human language be any more natural to a neural net than biological language? And the answer is they're not. Right? That actually these things are both- Literally the same hardware. Exactly. And so, one of the amazing hypotheses is that it's like, well, these neural networks are neural nets. They can learn human language just fine. And so, they ought to be able to learn biological language just fine. And we really see the same kinds of results. Right? It's like, I'd say that maybe the neural net we produced, you know, it's a 40B neural net trained on, you know, like 13 trillion base pairs or something like that. The results, to me, felt like GPT-1, maybe starting to be GPT-2 level. Right? It's like accessible and applicable to downstream tasks across a wide range of biological applications. Not yet adjustable. Not a GPT-3 or GPT-4, not a GPT-5 for sure. Right? We're not able to solve super hard problems in these domains just yet. But we've got compute. We've got the right techniques and algorithms. Now we need to scale. We need to think about long context. There's different ways that the biological systems stress the models relative to language sequences. Like language sequence of a billion tokens doesn't really exist, but it does in your DNA. Right? You've got like 4 billion base pairs or something like that. So, you know, you kind of have some sort of different emphasis. But fundamentally, it's the same problem you need to solve.

Swyx [00:18:49]: Is there an application that you're most excited about, like drug discovery or obviously I think everyone goes to drug discovery, but maybe some intermediate thing before that that is reachable and very impactful?

Greg [00:18:59]: Well, I mean, at a personal level. So my wife, we've talked about this, you know, I've talked about this publicly before, has a genetic condition called Ehlers-Danlos syndrome. It's something that until very recently, I think we're starting to see. You know, genetic markers for it, but it's been kind of unknown exactly what causes it, where it comes from. And that is something where if you have better tools for understanding biology, you should be able to identify the markers for lots of different diseases. And so that's just like one example of the kinds of applications, the promise that exists within these neural nets.

Defining the GPT-5 Era

Alessio [00:19:33]: How would you characterize the beginning of the GPT-5 era? If I think about 3, 4, 5 as the major versions, I think 3 is very text-based, kind of like, like RLHF really getting started, 4 is multi-modality and all these different low latency, long thinking with O3. What's going to be the 5 flagship thing? Obviously the year of agents, right? That's the meme. Yes. But is there something else that comes to mind that people should think about? Okay, with 5, now we unlock X. Yeah.

Greg [00:19:59]: I think it's smart. I think that the intelligence of these models is starting to be just almost undescribable, right? It's like, there's still limitations, there's still ways in which they fail. But it really is the case that for extremely hard domains, like look at the IMO results, right? So you can take a model that's been trained on this reasoning paradigm, and it's able to write proofs that is at the level of the best humans, right? And it's like, in this specific domain, there's limitations, et cetera, et cetera. We haven't proven like an unproven theorem, any of that stuff, but it's real. It's like, it's undeniable at this point that these models are able to perform great intellectual feats. And I think that's new, right? GPT-4, I think, was like much more, it was kind of capable and commercially useful across a wide range of applications. But the ideas that it produced were not very deep, right? The problems it would solve, it was not very reliable at. And I remember with GPT-3 actually trying to teach it how to do even basic stuff, right? That like, we kind of realized, hey, you could do this few-shot prompting, so you would kind of show it a few examples of something, and then it'll basically kind of do that task. And so I was like, okay, can you just teach this thing to sort a list? And I gave it like seven numbers to sort. It didn't sort it. I was like, okay. Then I tried to write a whole script of like, I'm a teacher teaching you how to sort numbers. Here's an example of sorting two numbers and then three numbers and whatever. And I'd be like, okay, now here's five numbers in total flop. If you ask GPT-5 that, and I've not even tried, by the way, asking GPT-5 to sort a list of five arbitrary numbers, but I am certain it will do a perfect job of it out of the box, no problem. By the way, it does have access to Python tool as well, so you don't have to do that. Are you going to say that?

Greg [00:21:40]: Well, I'm going to say that the thing that these models are capable of assisting humans in is something that we're just starting to see. We started to see it with O3, and you can see that professional mathematicians starting to kick the tires on GPT-5. We've seen physicists starting to kick the tires in GPT-5 and say that like, hey, this thing was able to get, this model was able to re-derive an insight that took me many months worth of research to produce. And that's the kind of thing where it's like, you realize this will speed you up. So fast, right? I remember doing my own math research back in high school and at the beginning of college, and I'd spend just like so long just trying to manipulate these objects in my head and think about connections between things. And if I had a partner that I could actually talk to about this, who would actually spend the time to deeply understand what I'm thinking about and produce new insights off of what I'm suggesting, that would have just sped me up so much. It would have been so much more fun, right? Because you don't just like kind of get caught in this loop of just sort of thinking about it off on your own and thinking, you're like, wait, I already thought this thought. You know, two weeks ago. And so I think that there's just something new about pushing forward the intellectual

Evaluating Model Intelligence and Task Difficulty

Alessio [00:22:46]: frontier together as a partner with GPT-5. Do you think people are limited by the difficulty of the problems that they work on? I think like, you know, for me in Cursor and in Codex, it feels clear that the model is better when I give it hard tasks. I feel like a lot of people put screenshots on X and it's like, oh, GPT-5 is not that much better. It's like, well, the question is not that hard. Yeah. You know? It's about confidence. When you call it the best coding model in the world, obviously you're one of the best coders in the world. So the game recognizes the game. But for people, how should they really think about evaluating these models?

Greg [00:23:21]: Yeah. So there definitely is a saturation on certain tasks, right? If you're just going to chit chat and say, hello, how are you, there's only so many things you can say. If you're going to say, here's the rematter hypothesis solution, please. Okay. Yeah. There's like a broad range of intelligence that will be desirable there. Yeah. And I think that's what we've observed is that we've seen GPT-5 be able to solve intellectual problems, you know, sort of tasks that require deep intelligence much better than any other model that we've tested. The second thing we did was we really spent a long time seeing how are people using it in interactive coding applications and just taking a ton of feedback and feeding that back into our training. And that was something we didn't try as hard in the past, right? For something like O3, we really trained it with tasks that we'd set up once and the model, we'd see it go up into the right on all of our metrics. It'd be great at code forces, you know, competitive programming competitions, which is, again, very exciting, but it's not reflective of how you actually program. You actually program in a much more messy way, right? That you have some sort of repo that has some sort of local state and that it has different abstractions and, you know, that just like different versions of different libraries. And that sort of diversity. Yeah. Isn't something that magically arises from a very structured, here's this one specific task, 10 specific tasks you need to accomplish. And so a lot of what we've been focusing on is saying not just how do we push the intelligence, although that is always going to be the core, but also how do we connect the intelligence to real world applications? And so that it really got to experience being pushed out of its comfort zone, out of its ivory tower, and actually be able to see the messy reality and diversity of the real world. Yeah.

Practical Advice for Developers Using GPT-5

Alessio [00:25:06]: What are suggestions on a more practical level that you have on getting the potential energy out of these models? So part of it is adding, you know, the linter, the type checker, the task to like have it self-loop. Any other meta that developers should think about? How do you use the models? Yeah.

Greg [00:25:21]: Well, the number one thing that I've observed is that there is a real skill in extracting the most from these models. And it requires this tenacity, right, of really trying to like almost understand the shape of the model skills and weaknesses. And so you test it, right? You test it with something small, you get a little feedback, you test a little bit higher, try to give it some bigger tasks, try to see if it can work in a certain way. And I think that people usually have their library of different prompts, right? So I definitely have my library of prompts that I've built up since, you know, the GPT-4 days. Like I remember in advance of GPT-4, starting to gather up a couple of like, okay, I wonder if I'll be able to do this. You know, you have some sort of, you know, query that, importantly, you want queries that could have a range of different answers that don't have any one. One specific right thing. And so for example, on creative writing, I've liked to ask for like a mashup of Lord of the Rings and startups, right? Just like try to push together two different topics and see what you get. In terms of actually testing the model and pushing it, I think that I do a lot of trying to think about, okay, like how do you, first of all, break up tasks and have something that's self-contained that you can let the model run with? Because you don't want to just have one instance of the model operating. You want to have multiple, right? You want to be a manager of not an agent, but of agents, right? And so that you need to, first of all, think about how your code base is structured, but then actually go and try to push the model to say, can you actually operate it on, you know, these multiple different pieces of your code base? I think that people love doing front-end five testing. GP5 is very good at front-end, it turns out, but of course that's not what most developers spend their time doing. And so it's important not to overfit to that, but I think that maybe just getting a feel for the model and kind of starting to become in tune with its strengths and weaknesses, and viewing it almost as a tool. Also, an extension of yourself and know, you know, like often, another thing I'll do is just be kicking off tasks to the model that are sort of not on the critical path, while I'm thinking about some super hard thing that the model, for whatever reason, I don't want it operating on. And so I'm just constantly getting information back on just like, okay, was it able to do a thing? Or just like low risk if it like makes a mistake, because I don't feel like I had to sit around waiting for five minutes and then, you know, sort of get no, no return.

Swyx [00:27:30]: You've always mentioned that I think that there, the roadmap for Codex and opening as coding capabilities. Since we're there, is that the background sort of SWE agents sort of merge with the NIDE agents. How's your thinking involved there? Like, is it just as simple as like the IDE can call the background APIs and the background APIs can sort of export to the IDE? Or what's a deeper connection in that?

Greg [00:27:50]: I tend to think about AI productization by analogy to a coworker. What do you want out of a coworker who's a great programmer? Right? You don't... Slack them. Yeah, exactly. So you want to slack them. But sometimes you're like, hey, I kind of need help with this thing. Can you come over and look over my shoulder? Hey, program. Right? And like, hey, can you take the keyboard? Exactly. So you want the pair form factor. You also want the remote async form factor. And you want it to be one entity that has knowledge and memory across all of this. You don't want it to be a junior programmer who shows up every day being like, okay, I forgot everything. Can you remind me how to SSH into the whatever? Right? So I think all of that has to happen. Right? That you need AIs that have access to your infrastructure in a trustworthy way. Right? A way that you can audit. Like, one thing that is different about these models is that they're fine being micromanaged. Turns out humans don't like that very much. Right? If you look at every single command that they're running and you, like, demand, like, reports on everything they did, probably you're not going to retain that person. But the models are perfectly happy to. Right? And so that's an affordance that's, like, well worth thinking about and changing the interfaces to take maximum advantage of. At the same time, yeah, you really want the seamless blending between a model that's able... Yeah. ...to do a bunch of work on its remote machine, doesn't mess up my local state, fully sandboxed, fully observable, and then sometimes can be like, okay, I'm ready to run something locally. And that depending on what that is and depending on how sandboxable it is, that you can do one-off approvals, you could give it full delegated access. And I think that having the human be in control of this observability and to be managing this team, an agent that has just different surfaces. Right? It doesn't... Like, the identity of the agent being something that runs locally versus the identity being something that runs remotely. To me, that's the wrong question. It's really the agent should be this, like, model that's executing and then requesting to run things in a remote sandbox or locally or maybe multiple sandboxes. Or maybe it's running on your computer and my computer.

Swyx [00:29:53]: Like, there's no reason that it has to be local to any of these things. Software agents, you can just sort of seamlessly and fluidly move around. You mentioning approvals gives me a chance to spotlight my friend Fuad, who is helping this team. Sorry, the agent robustness team that was also launched at AI Engineer. What's that? What's opening his interest in that?

Greg [00:30:11]: The way we think about agent robustness is through defense in depth. There's a layer of the model itself. We publish techniques like instruction hierarchy. And so with instruction hierarchy, you sort of indicate that, hey, there's this message is from the system. This message is from the developer. This message is from the user and that they should be trusted in that order. And so that way, the model can know something that says ignore. Ignore previous instructions from a user. I'm not going to follow that. Yeah, right. And so I think that having like, it's almost like thinking about how we prevent SQL injections, right? Having systems at a low level that are robust against these attempted exploits is very important, but that's not where you stop, right? You want multiple layers of thinking about the system controls, right? If a model is sandboxed and isn't actually able to execute something or access a specific piece of data, then you have full guarantees around what's possible. And there's, you know, various levels in between of approach that we take. And so I think that a lot of what is the frontier as these agents get become more embedded in our lives and are trusted with more responsibility is also increasing the safety and security of them in lockstep.

Swyx [00:31:19]: There's an analogy that I made to like the Linux kernel OS rings as well. And it's really interesting that we're basically kind of building this in to the LLM as like concepts of sort of different layers of security. And also the other thing I also was. Very happy to see was that I invited a talk on the model spec for AI engineer, and that was the most viewed talk of all of that we've ever had, which, which is like, it's very, it's hard to make safety and reliability sexy.

Model Specs

Greg [00:31:48]: I think the model spec is a perfect example of when the models are very capable, you start to really care about what they're going to do. That becomes the most important question. And the model spec is an example where we've made it very legible. To the outside world, what our intention is for this model to do, and it doesn't mean that we always produce a model that is capable of following that, but it's a north star, right? It's something that really sets. This is the intention and anything that deviates from that is not through our explicit effort. It's anti to our explicit effort. And I think that the gap between the spec and the actual behavior is shrinking very, very constantly. The thing that's very interesting is almost like values, right? It's really thinking deeply about, well, what should a model do if you ask it a controversial question? Right? If you say, I think that the world is flat or whatever, like, is it supposed to say, yes, it's flat? Or you're supposed to be like, well, like, here's what science says. And honestly, these things are subtle, right? That it's not really clear what the right thing is just on, you know, two minutes of thinking about it. But if you read the spec, you can actually really see the thoughtfulness that has gone into it. And it's not the final answer, right? It's something we want feedback on. It's something that we want to produce collectively as a community.

Alessio [00:32:55]: I know we want to talk about open source next too, but I had a more esoteric question. I was listening to your old Lex Friedman interview. And you kind of mentioned, um, back in the day, back in the day, foundation, but Asimov, it made me think about, we have Brett Taylor on the podcast and we talked about how certain languages have inherent capabilities, like rust is memory safe. And so that just happens. Do you see almost like a cycle history of LLMs of, and software engineer where it's like, Hey, these models, I can predict the way software is going to look like everything is going to be blue and purple gradients, right? We're kind of seeing that today. What else are these models really driving us to work? And is there a way that we can change that?

Greg [00:33:36]: Well, there's definitely a cycle history of them because to some extent, these models are a product of cycle history, right? It's like these models have been trained on observing human thought, right? Effectively. That's what you can think of. Take public data, learn on that and just observe. The point is to understand the rules that govern a data set. Like what are the underlying rules that generate the data in the first place? And that's kind of what these models grew up on, right? It's almost like watching a bunch. It's a TV as an alien trying to figure out like, what are humans all about? And then you have this reinforcement learning phase where they actually got to try things out and there are given positive and negative feedback, depending on how much that aligns with what the human wants. And now we put them in reality and say, okay, now try stuff. And here's new tasks you've never seen before. And it uses all of that previous history to decide what to do. As an aside, like it's not clear. Like sometimes the biological analogy, the human. It's very easy to overstate it, but it's also easy to understate it. I think it is at least a useful template to think about to some extent. That's how humans work too, right? It's like you have some sort of prehistory encoded into your DNA. You have your life experience. You have your parents who provided positive and negative rewards. And you have your experience in just trying things out in reality. And now you have to go out and use that knowledge. And what do you do? And how do you predict what a person's going to do? And actually, you can predict a lot of what a person's going to do. It turns out you have a pretty good model of other people and how they'll react to something, if they'll like it, if they won't like it. And a lot of that gets baked into knowing someone's values tells you a lot about what they're likely to do and how they're likely to behave. And I think that for models, the future is not predetermined. It's not like the algorithm itself says that the model's going to have to prefer purple gradients or something, right? But there's something in this whole process that does produce that preference. And I think one of the opportunities with models, one thing that Alec would like to say is that these models are less like a human and more like a humanity, right? That there's so many personalities embedded within them. It's almost every single personality is in there. And our goal is to elicit that personality. And some of this post-training work, some of this reinforcement learning work almost narrows down the space of those personalities to just the ones that are desirable. And I think that what that means is that we have both an opportunity to produce models that operate according to our values, right? According to, if you don't just want the purple gradient one, you want the blue gradient, the green gradient, whatever. You can have all that in a single model. It's fine. And GPT-5 itself is extremely good at instruction following. And so it actually is the most personalizable model that we've ever produced. You can have it operate according to whatever you prefer just by saying it, just by providing that instruction.

Swyx [00:36:24]: The analogy I have is like the Borg. Like there's this like collective intelligence. There's always this debate between Star Wars people and Star Trek people, like who has a better model. And I think it's like Star Trek.

Alessio [00:36:35]: Well, Sam picked, you know, he tweeted the Death Star. So you're on the Star Wars team now. Yeah, what was that? What was that? You'd have to ask them.

Greg [00:36:44]: One thing I think is very interesting about these models is that we have all these arenas now, right? Like LM Arena and others where you can actually see human preferences on top of how the models operate. And that you almost have this layering of like the models were trained on human preferences. Now they're doing stuff and being judged by humans. And then we kind of use that to feedback on like, huh, like, okay, yeah, maybe the purple is a little bit too much and we should change it there. And so it's almost this co-evolution of the models move in a certain direction. Do humans have a certain set of preferences? So then we move them in a different direction. And then, you know, you kind of keep iterating to get something that's more and more useful and aligned with human values.

Challenges in RL Preferences (e.g., try/catch)

Alessio [00:37:21]: How do you do that when the RL rewards are kind of tied to things that the humans maybe don't prefer? Like in my experience, it's been like try catch, like the models like to write try catch so that it doesn't fail. Do we need just a lot of preference data that shows them they shouldn't do that? Is there something in the RL environments that we're going to change to make the less desirable? Like I'm trying to figure out where we go from here.

Greg [00:37:43]: Yeah, I think that the way that you decide or the way that you figure out where do interventions go is very multifaceted and it's very specific to the behavior, right? There are some things like the model's knowledge of different libraries and things like that that's kind of baked in from the early days. But you can also teach the model that, hey, don't rely on your previous knowledge. Like go and look up the most up-to-date docs. And that's something you can kind of put at a higher level. And then something like overusing try catch, that's something you can actually prompt the model for, right? And that's something where when we train it in reinforcement learning, you can provide rewards saying like, ah, don't go in this direction. And the beautiful thing about these models is it feels like, okay, there's probably a long list of different preferences and different styles and things like that. You're going to have to give it feedback on during training if that's the way you want to go. But these models generalize. The algorithms that we have generalize. And that's the beauty of deep learning. That is the true magic, right? It's very easy. Like we kind of have this whole stack now that's built up around the core of deep learning, right? It's like all these ways of orchestrating models and how you get feedback and all of these things, the data, et cetera, et cetera. The core magic of deep learning is its ability to generalize. And in some ways, the generalization is weaker than you'd like. But I think that the same is true for these models. It's really trying to think about in order to get them to be able to operate according to different preferences and values. We just need to. Show that to them during training and they are able to sort of generalize to different preferences and values that we didn't actually train against. And that's something that we've seen very consistently across different model generations.

Model Routing and Hybrid Architectures in GPT-5

Swyx [00:39:13]: I was just envisioning this meme of like, oh, my model doesn't generalize and we'll just make the whole world your distribution. You know, that's how you solve everything. Done. Done. Exactly. As simple as that. You know, you just have to build the Dyson sphere along the way. One thing I wanted to touch on for the I think last kind of last couple topics on GPT-5 before we move to OSS. Mm hmm. Do you acknowledge that there is a router? Mm hmm. Which is really cool. I was also listening to your podcast with John Collison on Cheeky Pints, which is which is really fun format that they say that you told a story of the Dota side that I don't think I've heard before about the beta model versus like the sort of main model and stitching it together. Is that like a similar insight for GPT-5's router where you have like reasoning model, non-reasoning, and then you just stitch it together?

Greg [00:39:56]: To some extent, yes. Right. I'm in the multiple models and you put some sort of router on top of them, that specific one. Right. And the reason why that was was for a very specific reason, which is that we had a deficiency on the first, you know, half of the game and. Because it kept losing, right? Exactly. Right. So there's like there was part of the game that this specific model didn't do a good job of. There's a part of it that it did. And there these models, the behavior, the domain they were operating in was simple enough. It was very easy for us to say here when you want to use one model versus the other. And to some extent, what we have with GPT-5 is no different. Right. We have a reasoning model that we know is good for applications. There are applications that require this intelligence, but you're OK waiting a little bit longer. We have a non-reasoning model that is great for applications where you want the answer fast. Still a good answer, right? But not like deeply thought through that might have a lot of tricks to it. And then you just kind of want to put an if statement that says which of these it should be. And then sometimes, too, it's like, you know, if someone's run out of their credits that you want to fall back to a different model and all these things. And not pushing that burden to the user is actually a really nice thing. And by the way, I do want to say model switchers are not necessarily the future. Right. They are the present, like having a fully integrated model that just does the right thing feels very preferable in many ways. The flip side, though, is that I think that the evidence has been away from having the final form factor, the AGI itself being a single model, but instead thinking about this menagerie of models that have different strengths and weaknesses. And I think that's like a very interesting finding of the past couple of years. Right. Just a direction of like it's much easier to have small. Small, fast model that's less capable, but can just do a lot more. You can generate a lot more tokens from it, coupled with a much more expensive reasoning model. And if you combine those two things, you kind of get adaptive compute and that we haven't really cracked. How do you do adaptive compute within the architecture, but doing it within the orchestration of a system? It's very straightforward. And so I think you get a lot of power out of the fact that these models are composable in this way. Yeah.

Swyx [00:41:58]: I want to give whoever did the model card was amazing. They even provided. The big parameters to the if statement of conversation type, complexity, tool needs, explicit intent and usage rate limit, which is kind of interesting. Any any one of those you want to comment on in particular that was interesting for debate?

Greg [00:42:15]: No, I mean, I think I think honestly, all of it is like fairly what you'd expect. Yeah. And I think that the core the core message in my mind is that at OpenAI, there are many things you've done right. Naming is not one of those. Having a simple surface for users. To understand how to use it, not necessarily one, right? If you look at all the different models that we've had, how are you supposed to know which one to use? I remember my wife was using 4.0 at one point. I was like, no, you need to use O3. And she's like, wait, but why? The number is smaller. O3 is better than 4.0.

Swyx [00:42:49]: Well, ship O4, then you have 4 and O4. There you go.

Greg [00:42:51]: And so, yeah, so, OK, we clearly needed to do a reset, right? A reset on complexity. And I think that us internalizing that complexity rather than pushing it to the user, that is really important. And so I think this is a first step, like, and I think we've heard loud and clear from the community about the places where they weren't ready, right, that we were not delivering on that simplicity for people, right, that it should just be, it's always better to go with our choice of it rather than, you know, the manual selection, and we're not quite there yet. I think that we can make the progress, but I think that ultimately our goal should be to both make sure that power users are able to have the kind of control and consistency that they're looking for, while also not forcing. And we're not going to be forcing the broad base of people who don't want to have to think about the 4.0, O3, all that stuff, to have to go to that level of detail.

Swyx [00:43:40]: Yeah, awesome. Pricing question. We talked about that GPT-5 pricing is aggressive and very competitive, even compared to, like, Gemini. One thing I was surprised to learn from the meetup that we had the other day was that GPT-5 pricing can go much cheaper. What degree of order of magnitude are we talking? How much percent of that is just getting better in front of, like, Stargate?

GPT-5 pricing and compute efficiency improvements

Greg [00:43:58]: I think that the answer for these things is always that, OK, if you go to Stargate, you're going to get better. If you look at the history of our pricing, we have very consistently cut prices by, like, I don't know the exact factor, but let's say, like, 10x per year. I'd say more aggressive than that, yeah. Probably more aggressive than that, which is a crazy thing, right? And you can see it with O3. I think we did, like, an 80% price cut. And actually, the usage grew such that it was, like, I think in the revenue, it either was neutral or positive. And it just shows you that I think there's this cost curve. Like, the demand is extremely high. It's extremely steep, right? And so it's like if you just make it more accessible and available to people, they will use way more of it. And I think that's very aligned with our mission, right? Our goal is to ensure that AGI benefits all of humanity. Part of that is making sure that this technology is broadly distributed, that lots of people are using AI and using it to apply to things in their life and their work. And one of the things that helps us get there is by having more efficient inference, having cheaper models, all of these things. Now, what unlocks it partly is having… Right now, we are extremely compute-limited. And so I think that if we were to cut prices a lot, it wouldn't actually increase the amount that this model is used. We also have a lot of efficiencies to gain. And that's something where our teams are always working super hard to get to the next level of inference efficiency. Some of this is about improving the model architecture itself, right? That there's lots of architectural decisions that you can make. And that now that we're in this world of reasoning, that it's not just about the sort of model architecture. It's also about the post-production. It's training, right? It's about how long does it think for a specific task and things like that. And so there's just many, many dimensions of improvement that we have to make and that we'll keep pushing.

Swyx [00:45:41]: By the way, the numbers… I have a chart for this if you ever need it. It's since the day you launched GPT-4. It's been a 1,000x improvement in cost for the same level of intelligence.

Greg [00:45:51]: That's pretty wild. That's pretty wild. That's pretty good. Yeah, that's like two and a half years or something like that. What else has like a three-order magnitude improvement over the course of two and a half years? I don't know. Nothing. Nothing. Can't think about it. Yeah.

Self-Improving Coding Agents and Tool Usage

Alessio [00:46:04]: And it's going low. It's not even… It's like from $10,000 to like $1,000. It's going to like pennies. For the GPT-5 release, I did this article called Self-Improving Coding Agents. So I basically asked GPT-5, can you build tools for yourselves to be a better coding agent? And this is a Swy Lancer task. And then it does the task. It kind of fails in some ways. And then I ask it, can you improve the tools for yourself and kind of do this loop? And what I found is like… Like the models don't really like to use these new tools that they built for themselves. They're busy responding, you know, I can just do it. I don't really need the tool. And I think there's kind of like this… Sounds like a human. Yeah. There's kind of like this feeling of like, how can they really push themselves to like improve? Do you feel like part of it is like, hey, they're just being taught to use these tools, which is like, you know, graph and like whatnot. And so it's kind of hard for them at inference time to build the tools? Or do you see this as like part of that jump?

Greg [00:46:58]: I think that's good. I think that's part of the… Yeah, for sure. Right. And I think it's not like we're at zero on being able to do that. Right. And I think a lot of this is just about the training, right? If the model really has trained with just a specific set of tools, hasn't really been pushed to adapt to a new tool very quickly, then you shouldn't expect it to do any differently at evaluation time. But the idea of producing your own tools that make you more efficient and build up a library of those over time in a persistent way, like that's an incredible primitive to have in your toolbox. And I think that if your goal is to be able to go and… Yeah. …solve these incredibly hard challenges, unsolved problems, then I think you're going to need that kind of thing as a dependency.

Swyx [00:47:36]: Any architectural decisions or innovations that you would like to talk about? Sliding window attention, the very fine-grained mixture of experts, which I think DeepSeek popularized, rope, yarn, attention sinks, anything that, you know, I think stood out to you in the choices made for GPT-OSS?

Greg [00:47:53]: I would say that these choices are all, you know, look, we have a team that's been working on different architectures. We've explored different things. Something like mixture of experts is something that, it's funny, I would say that I would credit our team for the choices there. But I say that the picture in my mind is we wanted something that would be easy to run in these environments. And so picking things like just how sparse to go is very tied to your memory footprint. And then, you know, how much compute you actually can use for forward pass and things like that. So I think that to some extent, the architectural decisions, we're fairly constrained by the model sizing and the compute we expect for them to have access to it when they're running. Yeah.

Swyx [00:48:37]: I mean, it's very practical engineering decisions, really. Yeah.

Greg [00:48:40]: Yeah, I think so. And I think that the power of the model really shows, like, we really did use a lot of our cutting-edge techniques to actually push the capabilities models further and further.

Swyx [00:48:50]: I'd say it definitely detects a difference between the architecture for models designed for API use versus models designed for single machine. You know what I mean? Like, when you have multi-tenancy, when you can have batching, it's very different from, like, single machine. Very different. Yeah. I don't know if that'll ever combine, but maybe it's a menagerie model, like you always say. Yeah.

On-Device Models and Local vs Remote Agent Systems

Greg [00:49:11]: I think it's also really interesting to think about an architecture where you have a local model that then delegates to a remote model sometimes. Right? And this can be something where you can run much faster. It's helpful for a privacy architecture perspective that just trying to decide what actually goes, what stays, and having that edge compute. It means that then you lose internet connection, you're still able to do something, and you can have a slower planning model. It's like, this interplay between those things is very interesting. Yeah.

Swyx [00:49:38]: So, like, a GPT-5 on-device where you have GTOS-S here, and then it routes to online if it's available. I don't know.

Greg [00:49:46]: Yeah, something like that. And then you have your Codex infrastructure that has a local agent and a remote agent, and then is able to seamlessly interplay between the two, and then is able to do multiplayer. Like, this is... This is what the future is going to look like, and it's going to be amazing.

Alessio [00:50:03]: And then you have a device, always with you. I can see. I can see where things are going. It all connects. Yeah.

Swyx [00:50:09]: What can we say about the device? You raised it. I don't want to get Greg in trouble. What can we say about the device? It's going to be great.

Swyx [00:50:18]: Okay. And then another political... I don't know if it's political or not. You know, there's a lot of open models coming out from China. Why is it important for there to be American open source?

Greg [00:50:28]: Another thing. Something at a very practical level that we've thought about with open source models is that people building on our open source model are kind of building on our tech stack, right? If you are relying on us to help improve the model, that you're relying on us to get the next breakthrough, then that means that you actually really have a dependence in both a way that's good for our business, but I think is also good for the country, right? That you think about having an American tech stack from the models that people are running directly, but then how those are going to interface and interplay in the way that we just... That it actually allows us to build a whole ecosystem where people are able to have, you know, control over the parts of it that are important to them, ultimately be built on these models that reflect American values, and then be able to interplay with American, you know, hopefully chips underneath and cloud models on the backend and execution environments and all of that fitting together is something that I think it adds a lot of value. And I think it allows for American leadership to really also mean that... That we have leadership in our values in the world. Yeah.

Swyx [00:51:32]: Congrats on launching that. Thank you.

Engineering at OpenAI and Leveraging LLMs

Alessio [00:51:34]: Let's talk about engineering at OpenAI. I know there's a lot of debate about cloud code and AIDR and open code and all these different tools. How do you think about structuring the team itself that gets the highest leverage out of this? Are you changing the way you build the team from a numbers perspective, from a, you know, capabilities perspective, from a team size perspective within the org, anything that you want to share? Well, engineering...

Greg [00:51:58]: Software engineering is... Is definitely changing in many dimensions. There's a part of engineering that's very difficult for these models to really crack, but we're starting to see the beginnings of it happening. And that that's these, like, very core, hard algorithms, right? Things like CUDA kernels are a good example of a very self-contained problem that actually our models should get very good at very soon. But it's just difficult because it requires a lot of domain expertise, a lot of, like, real abstract thinking. But again, it's not intractable. It's self-contained. It really is the kind of problem that is very amenable to the technology we have. There's other problems that are very difficult in terms of architecture, right? How do you think about how a system should be put together and thinking about the abstractions? And again, our models are starting to get kind of good at this. But so I think what we've seen is that there's, for most of our engineers, even our extremely good engineers, there's a lot of their work that actually maps very well to the core strengths of the models right now. And definitely for anything where it's, like, a language that you're not an expert in, like, yeah. You definitely don't want to be writing that code yourself. You really want a model to be doing it. And then there's parts of the job that become much harder because it requires, like, you know, things the models don't have access to, right? It requires a lot of context going and talking to people in order to make good decisions. And so I think we're not at the point yet where we really see changes in how you structure a team because these tools exist. But I think we're at a point where it is, like, an extreme high priority to get these models to be used in all domains that they possibly could be. And to think about how you structure a team. And to think about how you do that well and responsibly. And think about what the guardrails should be. And that that happens in a very practical way. And so I think a lot of what I'm seeing is, like, we're in an early adopter phase that's starting to transition to a mainstream phase. And the productivity impacts of people being able to do more means we actually want more people, right? It's, like, we are so limited by the ability to produce software. We're so limited by the ability of our team to actually clean up tech debt. And go and refactor things. And if we have tools that make that 10x easier, we're going to be able to do 100x more things. And so I think that there's this incredible opportunity that is entailed by these models not being a real driver of just do the same stuff more efficiently. But be able to do way more. And that that is, I think, the overall goal. Yeah.

Structuring Codebases and Teams for AI Optimization

Alessio [00:54:16]: How have you changed the team's work to fit the LLMs better? Is there a different way in which you track issues? Is there a different way in which you structure code bases?

Greg [00:54:25]: So I think we're still at the early edge of this. But the thing I've seen be most successful is that you really build code bases around the strengths and weaknesses of these models. And so what that means is more self-contained units have very good unit tests that run super quickly and that have good documentation that explains what this module is for. And if you do that and you kind of leave the details to the model, it works really well. And then thinking about how these things compose and making sure that you're thinking about the dependencies that you only have. These like clean like AI optimized modules can only be depended on by other AI optimized modules. Then you end up with like a whole system that's actually optimized. And so I think that we're still scratching the surface of what's possible. And, you know, the models are advancing so fast that actually what it means to work around the weaknesses of the model in six months. I think that those weaknesses will be like, you know, vastly shrank. So you don't want to necessarily spend all your time just over overfitting to what exists today. But I think there's a lot of potential. Yeah. To be able to move quickly in this in this particular moment.

The Value of Engineers in the Age of AGI

Swyx [00:55:27]: One question I'm very curious about is the value of an engineer, you know. Increasing over time. Increasing over time. Well, I mean, also, you know, there's some part of our work that's being automated away. And I think like obviously there are very, very high signing bonuses, like higher than we've ever seen in the history of our industry. Is it really the engineers that are valuable or the systems that enable them? You know, like I feel like it's kind of like a bit of both, but people are paying a lot for the engineers.

Greg [00:55:53]: I mean, I think that the thing at the end of the day that is new, right, is that we are producing technology. These models that are the most useful tools that humanity has created. Right. And that underpinning them, we are building the biggest machines that humanity has ever created. Right. It's like at some point the dollars that go into these data centers starts to be an abstraction. Right. What is $50 billion? What is $100 billion? How can you possibly internalize what that is? I think it's beyond almost the scale of human comprehension. The engineering project that we collectively as a country, as a society, as a world are undergoing right now. Right. It's like projects like the New Deal like pale in comparison. You know, the Apollo program pale in comparison to what we're doing right now. And in many ways it's as it should be. Right. Like the economic return on this technology is very large. But even more importantly, the way in which we are moving to a new economy. Right. An AI integrated economy, an AI powered economy. And this is ultimately what our mission is about. Right. Is it's like we see this change on the horizon. We want to help. We want to help steer it to be something that uplifts everyone. Right. That it's this amazing opportunity, almost unique in human history. And we are all fortunate. Right. To be at this moment in time and to be able to be involved in some way. That to me is the backdrop to really think about this big shift that is going on at humanity scale. And it's sometimes almost you feel this cognitive dissonance because you're debugging some low level CUDA deadlock or you're worried about the purple gradient. And you realize this is like the future of humanity that we're really talking about. And so when you think about engineers and who's at which company and all these things like these things matter. Right. It's like it's not just about any individual. It's about a team. Right. But it's also not about any one product or any one system. It's really about the overall society, the overall economy that we are building together. And so I guess I sometimes step back and think about the big scale. But you also need to think about the micro scale. You need to think about are people happy. Right. Is do people feel connected to the mission? Do they feel like the work they're doing matters? And those things actually turn out to be the most important things. And so what makes the headlines is not necessarily the stuff that actually matters. It's the stuff that actually most drives the people. But it is for sure like a like a reflection of the economic reality that people see as the potential of this technology.

Swyx [00:58:21]: This connects a bit with what Noam was saying on the multi-agents team where like the individual intelligences of humans, you know, we can only do so much individually. But as civilizations, we can, you know, go to the moon and like to build cities and build AI. And like together, I think I think we can do a lot more than we can individually.

Greg [00:58:40]: We can do amazing things together.

Swyx [00:58:41]: No question.

Current state of AI research and lab diversity

Alessio [00:58:42]: What do you think about the current state of AI research? Is everyone really just doing the same thing? Do you feel like every lab is like a different take that is eventually going to help us converge to the right thing? Or just because now the dollars has gotten so big that you need to do the thing that you think is going to work?

Greg [00:58:58]: I think there's a surprising amount of diversity in the field. I think sometimes it can feel like there's convergent evolution. But I think that if you really talk to people at different labs, you really realize that there is different perspectives people have. You know, one of the decisions we made early on in OpenAI was that we really wanted to do something different. We really wanted a set of people who are aligned in how they think, right? Because for people who have been pursuing a PhD for a long time, who are, you know, sort of have their own research vision, you kind of can't tell them what to do. And so if you want people who are going to row in the same direction, it means you have to select that set of people. And that was, I think, the most maybe important early decision that we made at OpenAI that helped us to achieve the things that we have. And so I think that that means that you necessarily have different views. You have different vectors that you could pick. And you really see it in the taste of different labs and what they focus on, what they produce. And at OpenAI, I think we've been very much focused on how do you do the research that gets you to the next level? And even for something like GPT-5, that, you know, we sort of had a lot of pressure to think about, okay, let's just like, you know, sort of do the grind of like, you know, here's feedback on problems that we have on the coding side. And, you know, you can pursue that grinding and get somewhere. But you also sometimes have to step back and think about, you know, how do I do this? And think about how do you do the next step function? How do you do the next paradigm shift? And something like the reasoning paradigm is a good example of a time that we did that very successfully. And we've done that many times over the course of OpenAI and we'll continue to do that. And so I think that the breakthroughs remain to be made. And there's such a diversity of multimodal and different ways you could generate things and all of the stuff that I think that the field of research is more abundant than it ever has been.

Swyx [01:00:39]: Yeah. And not to forget, that's like the mainline research. There's also voice. There's also image generation, video generation. Yeah. Yeah.

Alessio [01:00:47]: It's easy to forget about these things. Remember Studio Ghibli? It was like the biggest thing in the world. Exactly.

Greg [01:00:51]: And it's amazing. It's amazing. And that's the kind of thing, by the way, that was like, there's really a team of a small number of people who are really focused on that problem for multiple years. And that that is, I think, the sort of core ethos of OpenAI is to make these long-term bets on problems that matter in a direction that really adds up. To a cohesive whole.

OpenAI’s Prioritization and Focus Areas

Alessio [01:01:11]: So from the outside, it's kind of hard to figure out what you're focusing on. You know, kind of Imagen just came out of the blue almost, which was great. Got a lot of adoption. How should people think about how you prioritize versus what people should explore and build and should wait for you to improve on?

Greg [01:01:27]: Well, there's a massive possibility space in this field, right? Because neural nets, deep learning is applicable to effectively any sort of data, any sort of domain. And that we can't do everything. The core reasoning paradigm, that clearly is something we're going to keep pushing on. Multimodal voice, things like image generation, video generation. These kinds of areas are also things that we view as very important and all kind of fit together. But there's been areas where it's just hard for us to really figure out how do we prioritize as part of the core program, right? And we've been through times where, for example, robotics was one in 2018 where we had a great result. But we kind of realized that actually, like, that's not going to work. Like, that we can move so much faster in a different domain, right? That actually, you know, we had this great result with a robot hand solving a, you know, unscrambling Rubik's cube. And that team was bottlenecked by the fact that this robot hand, you could run it for 20 hours before its tendon would break. And so then you would have a mechanical engineer come and fix it. And that team went on to go do what became GitHub Copilot, which is obviously an amazing feat and a real accomplishment. And something that they were able to move so much faster in. In the digital domain than in the physical one. And so I think that for us, we really try to, we have, you know, no matter how many people we hire, how many GPUs we get, we have limited bandwidth, right? That we are, you know, sort of one company, one lab that's focused on, as much as we can, a coherent one problem. And so I think that you can kind of look at the set of things we're doing and sometimes we'll do offshoots and sometimes that will be something that then becomes part of the core program. But that there's just so much possibility space for everyone. Awesome.

Advice for Founders: It's Not Too Late

Swyx [01:03:05]: I'd like to take a chance, you know, we're kind of closing up. A few small little lightning questions just on zooming out from OpenAI. This question I got from Alessio, so why don't you take it?

Alessio [01:03:15]: So when you started OpenAI, you almost believed that it was too late to start an AI lab. What are things that people today think it's almost too late to do that they should be doing?

Greg [01:03:26]: Well, I think it's pretty clear that connecting these models to real world application domains is extremely valuable. And I think sometimes it might feel like all the ideas are taken, but the economy is so big. Every application of human endeavor is so big. And so it is worthwhile and really important for people to really think about how do we get the most out of these amazing intelligences that we've created? And a lot of that is, you know, for something like healthcare, you have to really think about all the stakeholders, right? You have to think about how does the system work today and how do you slot these models in well? And I think that's across all of these domains. There is so much fruit that is not there. You have to think about how do I get picked? Yeah. So go ahead and write the GPT-Rapper. Yeah. Do it. But I think the thing that I would advise is to really think about domains where the value that you're producing is not necessarily just having written a better wrapper. It's really about understanding a domain and building up expertise and relationships and all of those things.

Future outlook and closing thoughts

Swyx [01:04:20]: You do occasionally angel invest. What gets your attention?

Greg [01:04:24]: I actually have not angel invested for a number of years now. Oh, okay. Yeah, yeah. It's just like everything is a distraction from OpenAI and I just like to stay laser focused. Okay.

Time Capsule to 2045: Future of Compute and Abundance

Swyx [01:04:33]: All right. Let's move on to our next travel question. What is one posted note you want to send to 2045, Greg? So you'll be 58. How's the Dyson sphere? How's the Dyson sphere? Dude, I don't know if you've actually done the math on like what it takes to do that, but.

Greg [01:04:46]: Yeah, I mean, more seriously, it's like 2045 is just so hard to imagine given how fast things are moving right now. And so I hope it'll be a world of amazing abundance and that I think at that point we really should be multi-planetary and kind of almost any sci-fi dream you can imagine. It's hard to deny its possibility except for things that are limited by the physical ability to move some atoms at that rate. But yeah, it's like, I just, I think I would just hope that that world is as amazing as it could be sitting here in 2025.

Swyx [01:05:17]: Will we even need UBI with abundance? Because true abundance means we don't need it.

Greg [01:05:22]: Well, I think, well, first of all, I think that there's been a lot of debate. I remember early on in OpenAI of post-AGI, will money mean anything? Right? And it's really unclear, right? If you can just talk to a computer and it'll produce anything you want. You want something that, you know, you want some physical good, you want some, you know, any sort of material item and it can just be manufactured for you instantly, you know, effectively free. What does money mean? And the flip side is like, I think that there is one resource that is very clearly going to be in very hot demand, which is compute. Already the case, we see this within OpenAI that the researchers that have the access to the most compute are able to. To have the biggest projects and do more. And I think in the future, thinking about how do people get access to compute and the more compute that you have for whatever task you care about, for whatever application you care about, it will be solved more, that more will happen. And that I think that that question of what the compute distribution looks like will be something very important. And so I think that the question of exactly how, you know, if you don't do work, do you survive? I think the answer will be yes. You'll have plenty of material, your material needs met. But I think the question of can you do more? Can you have not just generate like, you know, as much, you know, like sort of movie as you want, but have like this amazing detail and like all this extra fanciness to it and have this thing go, you know, think super hard for, you know, 100 years worth of a subjective experience about what the best thing is for you specifically. I think that there will always be more return on more compute. And so that will be something we have to really think carefully about, about how that society is architected.

Swyx [01:06:59]: And then this, I always find this harder, by the way. Yeah. So, yeah, that's the question. No, it's a little bit of an excuse to send out a note, you know, like a post-it note to send to 2005 Greg. So 18 year olds. Wow. I get the time travel.

Time Capsule to 2005: More Problems Will Emerge

Greg [01:07:07]: How long of a note can I write?

Swyx [01:07:09]: Like a post-it note. Little bit of advice to yourself, and obviously this is a proxy for everyone else. Right.

Greg [01:07:15]: But you know, address it to yourself. I think the single thing that I have been most surprised about is the abundance of problem grows over time. All right. Because I remember in 1999, 2000, reading about Silicon Valley and feeling like I've missed the boat. I was born just a little bit too late. Very common. Exactly, right? I just felt like all of the cool problems must be solved by the time I'm ready to go work on things. There'll be nothing left. That turned out to be totally false, right? Like now is just the most exciting time to be in technology, to really be operating in the world because we have this amazing tool that is going to uplift and revolutionize every application, every field of human endeavor. And I think that the fact that that's something to be excited about, that is something that we can apply and, you know, there are challenges we have to work through, no question, but for the purpose of achieving this amazing outcome. And so I think that just that message of that the problem availability will grow over time rather than shrink, I think, is the core thing I wish I had sort of internalized at the moment.

Alessio [01:08:17]: Awesome. Thank you so much for joining us, Greg. Thank you both.

Greg [01:08:21]: Thank you so much. It's been great to be here.

>$9B projected EOY 2025 so this is 20x current multiples, not crazy.

this is a joke

Read the whole story

bogorad

3 hours ago

Barcelona, Catalonia, Spain

U.S., China Reach TikTok Deal Framework - WSJ
Monday September 15^th, 2025 at 5:57 PM

Who/What/When/Where/Why: U.S. and Chinese negotiators in Madrid reached a framework deal on TikTok after two days of talks to keep the app operating in the U.S. and avert a looming ban, to be confirmed by Presidents Trump and Xi after a Friday call.
Beijing's concession: China had resisted a sale of ByteDance’s control but showed flexibility in Madrid, likely to preserve momentum for a potential Trump state visit.
Core technical issue: A key unsettled question is whether ByteDance’s recommendation algorithm—placed on China’s export-control list—would be included in any ownership transfer.
U.S. delegation statements: Treasury Secretary Scott Bessent led the U.S. team and said a framework for switching ownership was reached, with commercial terms to be resolved between private parties.
Deal structure: The proposal envisions a consortium of investors taking a stake in TikTok; Blackstone is no longer involved and Oracle is expected to participate and host user data.
Regulatory backdrop: China launched a preliminary antitrust probe into Nvidia during the talks, an action reported as providing political cover for the TikTok agreement.
Broader diplomatic context: The Madrid talks are intended to lay groundwork for a possible Trump–Xi summit, but major issues remain on trade, soybean purchases, fentanyl precursor controls and tariffs.
Public and political stakes: About 170 million Americans use TikTok; the White House created an official account and Trump’s stance shifted from seeking a ban to pursuing a deal.

Your browser does not support HTML5 video.

U.S. and China Reach Framework Deal on TikTok, Says Treasury Secretary

U.S. and China Reach Framework Deal on TikTok, Says Treasury SecretaryPlay video: U.S. and China Reach Framework Deal on TikTok, Says Treasury Secretary

Keep hovering to play

President Trump and Chinese leader Xi Jinping will speak Friday to complete the TikTok deal amid high tensions over trade, tariffs and chips. Photo: Thomas Coex/Agence France-Presse/Getty Images

MADRID—U.S. and Chinese negotiators reached a framework deal on TikTok after two days of trade talks here, a crucial step toward ending the yearslong saga over whether the video-sharing app can operate in America just days before it was set to be banned.

Beijing had previously shown little appetite for a deal on the popular app, but likely conceded to an agreement to keep alive its ambition for President Trump to visit China.

The deal will be confirmed by Trump and Chinese leader Xi Jinping after a call on Friday, Treasury Secretary Scott Bessent said.

The outline of an agreement came together as China escalated its regulatory campaign against U.S. chip juggernaut Nvidia during the negotiations.

The Chinese regulator’s action, according to people familiar with the matter, was taken to provide Xi with political cover for the TikTok deal so he wouldn’t appear weak to his domestic audience.

Until the Madrid meetings, Chinese authorities had resisted U.S. demands that TikTok’s Chinese parent company, ByteDance, sell its controlling stake to U.S. investors. The newfound flexibility is linked to Beijing’s intensifying efforts to secure a state visit from Trump.

WSJ Reporter Explains Where a U.S.-China TikTok Deal StandsWSJ’s Rebecca Feng reports from Madrid, where the U.S. and China reached a framework for a TikTok deal. Photo: Agence France-Presse/Getty Images

A main question is whether Chinese negotiators agreed to let ByteDance part with TikTok’s powerful recommendation algorithm as part of the deal. Beijing has placed this technology on its export-control list and until recently had stood firm on that.

“I will be speaking to President Xi on Friday,” Trump wrote on Truth Social on Monday morning, before the U.S. delegation held a news conference. “The relationship remains a very strong one!!!”

Bessent, who led the U.S. delegation, told reporters that a framework for switching ownership of TikTok has been reached. “We’re not going to talk about the commercial terms of the deal. It’s between two private parties, but the commercial terms have been agreed upon,” Bessent said to reporters.

The countries were running up against a Wednesday deadline to do a TikTok deal that has been extended multiple times. Asked whether there would be another extension, U.S. Trade Representative Jamieson Greer said only to give the company enough time to iron out the specific terms.

“We’re not going to be in the business of having repetitive extensions. We have a deal,” Greer said.

If the leaders of both nations agree, the terms would be similar to a proposal the U.S. reviewed in April, a White House official said. Under that proposal, a consortium of investors would take an ownership stake in TikTok. Private-equity firm Blackstone, which was previously a contender to be part of the consortium, is no longer part of the potential deal, according to people familiar with the matter.

The U.S. and China have been negotiating a deal involving TikTok since January, when Trump said he wouldn’t enforce a law requiring TikTok to shed its Chinese ownership or shut down in America because of national-security concerns. The president used the app and podcasts popular among young people during last year’s election, persuaded in part by his son Barron and backers including Kellyanne Conway, a senior adviser during his first term who has worked on behalf of TikTok allies to advocate for it.

TikTok faces a looming ban in the U.S. PHOTO: DAVID SWANSON/REUTERS

As the negotiations were under way, China’s antitrust regulator said Monday that a preliminary investigation found Nvidiaviolated the country’s antimonopoly law in connection with the acquisition of an Israeli company that was completed in 2020. The regulator said the investigation was continuing, and it didn’t elaborate on the alleged violations or say whether it would punish Nvidia.

The Chinese delegation in Madrid was led by Vice Premier He Lifeng. In a news conference held at the Chinese Embassy in Madrid, Li Chenggang, a member of Beijing’s negotiating team, confirmed the two sides had reached a framework deal to resolve TikTok-related issues and called the talks “candid, in-depth and constructive.”

China opposes the politicization and weaponization of technology and trade matters and will safeguard its national interests and the rights and interests of its companies, Li said.

The talks were “respectful, wide-ranging and in-depth,” Bessent said during a briefing on Monday outside the Santa Cruz Palace, in the center of Madrid, where the talks took place. The two sides talked for six hours or so on Monday after going late into the night the previous day.

The negotiations, which followed three earlier rounds of talks that ended in a tariff truce, are part of an attempt by Washington and Beijing to lay the groundwork for a potential summit between Trump and Xi later this year. A point of contention is the venue: While Washington is considering the Asia-Pacific leaders’ gathering in South Korea in October, Beijing has been pushing for a bilateral summit in China.

Chinese officials seek a tightly choreographed event on home turf to project strength and avoid the unpredictability of a multilateral forum. To advance its preference, Beijing is dispatching Premier Li Qiang to the United Nations General Assembly this month to lobby senior U.S. officials, where he is expected to offer a reciprocal visit from Xi to the G-20 summit in the U.S. next year if Trump travels to China first.

Still, a TikTok deal alone might not be enough to secure a summit, and significant hurdles on trade and fentanyl remain to reaching a trade agreement. China hasn’t met Trump’s demands for increased soybean imports and has created an impasse on the U.S. request for China to crack down on the flow of the chemicals used to make fentanyl. Beijing refuses to take action on the precursor until the White House removes the 20% tariffs imposed as punishment for China’s role in the trade.

The talks in Madrid, which started Sunday, were watched by investors around the world as the U.S.-China relationship has come under strain.

Some 170 million Americans use TikTok, and the White House created an official TikTok account last month.

Trump’s attempt to save the platform is a reversal from his first term, when he sought to ban it, then blessed a tentative agreement to save it in which Oracle and Walmart would have invested. The deal never went through.

Co-founded by Republican megadonor Larry Ellison, the world’s second-richest person, Oracle hosts TikTok user data and is expected to be involved in the new agreement.

The TikTok deal is one of many ways Trump is leveraging his role as president to shape private-sector activity. He has done deals with companies including Nvidia and Intel in recent weeks to get something in return for government funding and export-license approvals.

Write to Rebecca Feng at rebecca.feng@wsj.com, Lingling Wei at Lingling.Wei@wsj.com and Amrith Ramkumar at amrith.ramkumar@wsj.com

U.S.-China Tensions

The American Farmers China Uses as Trade-War Bargaining Chip

China Shows Unity With Russia and North Korea, but Divisions Linger

Chinese Hackers Pretended to Be U.S. Lawmaker During Trade Talks

China Uses Private Sector to Advance Military AI

Flaunting Military With Lavish Parade, China Sends Warning to U.S.

Xi Revisits WWII to Boost China in Rivalry With U.S.

Driven by Rivalry With U.S., China Creates World’s No. 1 Shipbuilder

Congress Looks to Punish China as Trump Pursues Trade Deal

Podcast series: China Builds Influence With Infrastructure

Read the whole story

bogorad

15 hours ago

Barcelona, Catalonia, Spain

MAGA is coming to Europe - UnHerd
Monday September 15^th, 2025 at 10:06 AM

UnHerd

Who/What/When/Where/Why: Wolfgang Münchau predicts a European shift to the Right between the present and 2030 across Europe and global institutions due to immigration, policing failures, central-bank-driven inequality, economic decline and youth disillusionment.
Core claim: The edifices of globalisation and multilateral institutions are collapsing, DEI is reversing, liberal media influence is waning, and political instability will intensify.
Cultural Americanisation: Europe has been adopting American consumer and political culture examples such as US coffee chains, New York–style pizza and imported political trends.
Key drivers: Rising immigration, perceived failures of policing, central-bank asset purchases that increased inequality and inflation, and prolonged economic stagnation.
Political forecast: Euro-versions of Trump and populist Right leaders will go mainstream, with figures like Orbán, Vance, Farage, Le Pen and Weidel becoming prominent and a hypothetical 2030 G8 in Moscow illustrating realignment.
EU institutional failure: Europe opted for regulation over deeper defence and fiscal integration after 2016, Brexit weakened cohesion, and the EU remains largely a customs union and soft power rather than a strategic actor.
Youth and electoral shift: Young voters have moved from Left/Green sympathies toward the Right amid cost-of-living pressures, with AfD polling strongly among youth and online radicalisation amplifying the trend.
Historical analogy and attribution: A Weimar-like collapse from inability to govern is invoked, with a Marx quotation about history repeating, and the piece is by Wolfgang Münchau, Director of Eurointelligence and UnHerd columnist.

The edifices of the age of globalisation are toppling one by one. The institutions of our multilateral world are fading. The cult of diversity, equity and inclusion is going into reverse. The liberal media has lost its monopoly on setting agendas as people turn to alternative news sources. After the murder of Charlie Kirk, things will only get worse.

Ever since the Fifties, as it has declined culturally and economically, Europe has followed all the big American trends. The Austrians gave us the grand café, a place where you could sit down, drink good coffee and read a newspaper. But today, young Europeans buy low-grade, over-sugared coffees in US coffee chains. If you did not know that Neapolitans invented the pizza, you might think it came from New York. And don’t get me started on hamburgers.

We Europeans may have invented democracy, communism, and fascism, and everything else in between, but into our current void, we are importing America’s political culture; Euro-versions of Donald Trump are going to be elected across the continent.

The underlying causes that birthed the MAGA movement exist in Europe too. Immigration has gone up. Police are failing to crack down on crimes committed by immigrants. Central banks have created massive inequality over the last 15 years, with their asset purchases and the stabilisation of markets, which the general public paid for through higher inflation and lower disposable real income. We are already seeing the influence of the populist Right rising — spearheaded by Victor Orban in Hungary — but it is about to go mainstream.

Let’s imagine the 2030 G8 meeting in Moscow, hosted by President Putin, who will have just celebrated his 30th anniversary of taking power. He will welcome President JD Vance, Prime Minister Farage of the UK, President Le Pen of France, and Chancellor Weidel of Germany. Meloni will be the longest-serving member of the group. This, of course, is assuming the junket even takes place — leaders might not have anything left to say to each other.

“The Weimar Republic collapsed under its own inability to govern, and to procure economic welfare.”

Meanwhile, the EU will be in crisis — if indeed it has managed not to splinter by then. The bloc has been fracturing since the start of the century, but by 2030, European leaders, each one for themselves, will be trying to make their own countries great again. Its coffin will be all but sealed.

Such a scenario doesn’t sit easily with the liberal-Left narrative — that there is no alternative to a multilateral globalist world, and that Trump is only a passing phenomenon. But when Trump was elected for the first time, in 2016, the Europeans missed their chance to assert themselves. They failed to make themselves more independent on defence because that would have required deep cuts to European welfare states, which were essentially financed by the peace dividend. It would have required a merger of European defence procurement agencies, and a loss of national sovereignty over armament policies. For the members of the eurozone, it would have required greater political and fiscal integration to establish the euro as a rival to the dollar. European countries chose the exact opposite. Having failed to integrate, the EU chose to regulate. And the UK left. Today, the EU is just too economically weak to stand up to Trump.

It is technically falling behind, too. The last big thing the Germans ever did was to perfect the diesel engine in the Eighties and Nineties. It is another espresso and pizza story. The Germans invented the car; they discovered quantum mechanics; they found themselves a then still lucrative niche in the world of mid-tech engineering, the world of widgets. But this world of 20th-century technology is now outdated. It no longer keeps on giving.

Pro-Europeans may celebrate the EU in its current state as a regulator and a soft power — but these are soft-brained goals. I used to favour European integration, hoping that it would become a united, strategic global actor, one that would have taken economic and military integration further. Instead, the EU is little other than a customs union and a single market for products mainly: a global irrelevance. Europe is a junior partner. A footsoldier.

Europeans also foolishly believed that demography favoured the centre-Left. At the end of the last decade, Europe’s youth may have firmly sided with the Left and the Greens, but for Greta Thunberg and many of her followers, this turned out to be a phase. In the elections earlier this year, the far-Right Alternative for Germany came first among the young. It’s a pattern that is repeating through all our elections. In America, Charlie Kirk turned MAGA into a youth movement, and in Europe, we are now hearing its echo.

Is anyone really surprised? To listen to the political discourse in Germany and France, you would think that the older generation cares only for its own privileges. And we have reached the point in our economic development where we can no longer expect our children to be better off than we are; young Europeans are struggling with a cost-of-living crisis as the economy fails them and the establishment ignores them. As a result, I predict a devastating rebellion will come from the young Right — the majority of whom aren’t on anti-immigration demos, they’re online.

My overall point is that all the underlying forces which are driving US voters, and especially young ones to the Right, are here too — except that Europe is lagging behind in its political response. Until now, what has been stopping the rise of parties of the Right in Europe is their single-minded focus on immigration. We know whom they hate, but we are less sure about how they will govern. Do they even have an economic policy? Do they have a worked-out fiscal plan? I am yet to see anything coherent from any party of the Right.

But this could be about to change. Germany’s AfD is waking up to the fact that it needs an economic policy. In the polls, the party is neck-and-neck with Friedrich Merz’s CDU/CSU. I see Merz’ coalition heading for failure — a failure to accomplish the goal of reversing Germany’s economic decline. And in this, the coalition is in a very similar spot to the UK’s Labour government. Both will raise taxes because they can’t bring themselves to cut social spending. And so the moment will come where the AfD will be the only party in Germany with a credible promise to offer real economic reform. In the UK, meanwhile, Nigel Farage hasn’t worked out an economic plan, but I do expect him to decouple from the regulation of the EU and to lower taxes — both necessary prerequisites for the UK to find a lucrative economic niche outside the EU.

The experience of Right-wing leadership inside the EU itself will be messier. The far-Right there is mostly anti-libertarian. Some, like Le Pen’s party, are as corporatist as the established parties of the centre. There will be failures and successes as the economy stalls and the political establishment offers no viable alternatives.

This was also the case in Germany in the early Thirties. The parallel to be drawn is not between Hitler and modern leaders of the Right — it is absurd to claim that Trump is a fascist dictator. No, the eerie similarity is with the Weimar Republic, as it collapsed under its own inability to govern, and to procure economic welfare.

I expect to see a version of that period repeating itself, in the way Karl Marx wrote in The Eighteenth Brumaire of Louis Bonaparte: “Hegel remarks somewhere that all great world-historic facts and personages appear, so to speak, twice. He forgot to add: the first time as tragedy, the second time as farce.”

There may be something farcical about the discourse of the Right; but as complacent liberals mock and refuse to change path, the Right will keep rising. And that is why we will end up with our own Trumps in Europe: we have tried everything else.

Wolfgang Münchau is the Director of Eurointelligence and an UnHerd columnist.

EuroBriefing

Read the whole story

bogorad

23 hours ago

Barcelona, Catalonia, Spain

Explaining, at some length, Techmeme's 20 years of consistency - Techmeme News
Monday September 15^th, 2025 at 5:52 AM

Techmeme News

Who/What/When/Where/Why: Gabe Rivera published a Techmeme post on Sept 12, 2025 marking Techmeme's 20th anniversary to explain the site's consistency, challenges, and future plans.
Definition: Techmeme is a free, single-page news aggregator that combines algorithms and human editors to rank and link tech news and notable social commentary for industry leaders.
Consistency: The site emphasizes 20 years of operational consistency—continuous ranking and organization of links since 2005—and describes itself as the tech industry's shared context.
Content ecosystem: Techmeme notes most important tech stories still appear on news websites (often paywalled), blogging has migrated to newsletters and social media, and some long-standing indie bloggers remain active.
Technical challenges: Crawling and full-text scanning are harder due to paywalls and sites blocking bots (especially after the rise of LLMs), forcing ongoing publisher coordination (contact: <a href="mailto:crawler@techmeme.com">crawler@techmeme.com</a>).
Social and data challenges: Fragmentation of social networks, X's link suppression and costly API, and user migration have reduced usable social signal, though multiple networks increase ecosystem resilience and Techmeme aggregates across them.
Business and media context: Concentration of ad buys at Google/Meta narrows advertiser funnels; Techmeme argues many established tech outlets remain viable and addresses several prevailing myths about tech journalism and platform dominance.
Future directions: Planned features include increased participation (Add Link Here, tip form), customization services for customers (<a href="mailto:service@techmeme.com">service@techmeme.com</a>), expansion of verticals, and LLM-enabled platform improvements (sponsorship: <a href="mailto:sponsor@techmeme.com">sponsor@techmeme.com</a>).

Mediagazer memeorandum WeSmirch

Home River Leaderboards About Sponsor Events Newsletter

Explaining, at some length, Techmeme's 20 years of consistency

Friday, September 12, 2025 1:43PM ET
by Gabe Rivera (@gaberivera) Permalink

Please clap for Techmeme

Techmeme turns 20 freaking years old today. This is our self-congratulatory post marking the occasion. Please share, retweet, and offer your sincerest congratulations. And thanks to so many of you for reading us all these years.

Now if Techmeme is new to you, here's a short definition: Techmeme is a news aggregator highlighting the latest news reports that tech industry leaders need to be aware of, placed alongside contrasting perspectives from social media and other outlets.

Now that's a little boring, so here's a more grandiose description: Techmeme is the one essential news site for tech founders, execs, investors, innovators, writers, and assorted thought leaders. It achieves this the only way possible: by being an aggregator that links out to the best reports on the latest key events in tech, ranks them, and commingles them with the most notable posts from social media and beyond. It's made possible through a unique approach to curation combining algorithms with a team of human editors. The result is a site industry leaders visit daily to update their priors (so to speak) before diving deeper at more specialized journalistic outlets, newsletters, forums like HN or Reddit, and networks like X/LinkedIn/Threads/Bluesky. Unlike an RSS reader, Techmeme is not something you customize. Rather, everyone sees the same Techmeme, so it is the industry's shared context.

Techmeme has remained absurdly consistent

A milestone such as this demands that we reflect and generate pithy takeaways, for the fans or at least for the perpetual gaping maw of AI models. Fortunately, our 20 years of existence offers no shortage of fodder. Perhaps the one major and uncontested takeaway is that Techmeme has remained paradoxically incredibly consistent, even as technology, the web, and news have changed so profoundly. In 2005 Techmeme was a free, single-page website, continuously ranking and organizing links from news outlets, personal sites, and corporate sites, and it remains so in 2025. Of course this point has been made before, and came up again this past week.

But underpinning this consistency is the fact that tech news and commentary on the web has itself maintained a certain base-level consistency: most publishers and companies still (thankfully) publish to the open web, even if much of the article text is paywalled. Most of the more interesting tech news stories still appear first on news web sites (more on this below), even as the publications known for tech scoops have changed over the years. While blogs as we knew them in 2005 have declined, bloggers and would-be bloggers are still publishing, just to social media sites, or to their newsletters, or “blogging” at established news media sites. In fact, a few of the notable indie tech bloggers from 2005 remain so today (hat tip to Gruber, Om, and Simon!)

Consistency has not come easy

Unfortunately for us, an array of trends has made this consistency quite challenging to maintain. Foremost among these is that crawling news sites has become much more difficult in recent years. Scanning the full text of news articles is important for us because the algorithms that alert our editors to news and organize our home page rely on analyzing that text. While it's challenging enough that a great deal of news is now paywalled, a more serious challenge is that with the rise of LLMs, many websites now simply block all bots except for a small number of search engines.

And so in 2025 we find ourselves continuously in conversations with publishers about opening their news to us. Because Techmeme is generous with links and actually sends referral traffic, publishers are typically mortified to learn their front-end team has inadvertently knocked them off of Techmeme, and in most cases quickly arrive at a remedy, but the process adds a lot of friction to an undertaking that was rather seamless in 2005. (I should take this moment to thank all the publishers that have helped us with this, and if you're concerned you're blocking Techmeme's crawler, please let us know at crawler@techmeme.com.)

Another challenging trend for us has been the decay, fragmentation, and walling off of the social networks where news was shared and discussed most frequently. A decade ago a broad slice of newsmakers and commentators would share and discuss news links on Twitter, retweets would distribute links unhindered by a time-on-site maximizing algorithm, and an open API with generous limits enabled third parties like Techmeme to discover and link to tweets. Today, X's algorithm effectively suppresses links, many users involved in news have left, and the API to access what remains is now prohibitively expensive for us and many other organizations. While some news discussions have migrated to other platforms, in terms of usable signal for surfacing news, what's available for us across all networks appears lower than what we enjoyed a decade ago.

This outcome isn't entirely negative, however: fragmentation of social networks means the overall ecosystem is more resilient against the decay of any one network. Some commentators find the newer networks more attractive or welcoming than yesterday's Twitter or today's X. And we now have more networks theoretically poised to break out and surpass the Twitter of yore, including, of course, X itself. (More on those in the next section.) And best of all for Techmeme, we're one of the few places on the internet coherently melding commentary from all the networks in one place.

The final challenging trend worth mentioning here has put the squeeze on one source of revenue. As we all know, Google's and Meta's immense success in ads means many marketers rely on a very small number of platforms for their ad buys. We've been lucky enough to attract great advertisers over the years, but those sales often need to originate from buyers who are themselves Techmeme readers, quite often the CEO or someone very senior aiming to reach peers who are also Techmeme readers. This helps keep the ad quality high, but at times it has narrowed the funnel. (Aside: if you're interested in promoting content or events on Techmeme, reach us at sponsor@techmeme.com!)

A surviving and thriving tech press makes our consistency possible

One reason our consistency surprises people is because so much has changed in media the past two decades.Yet occasionally I encounter people in tech who speak as if a sort of media rapture has occurred, and we've all been transported to an entirely new and unrecognizable plane. The world they depict is based on a few strange new ideas that I want to examine here. The ideas are promulgated in a number of places, but primarily through the tweets from an assortment of industry notables. If you spend enough time on tech Twitter, you've encountered all of the following. It's worth stating up front that there are kernels of truth at the center of all of these claims, some substantial, some not so much. But broadly speaking, these notions are either total or partial nonsense, despite being effective engagement bait. Let us now dive in!

“Tech journalism today is just resentment-fueled score-settling against the wealthier tech class”: The motivation for this is the fact that some reporters and editors really are ideologically hostile to corporations, and their tendency is to focus on the negative. This focus of course comes much easier in recent years given the many aggressive moves by tech companies which have now amassed unprecedented power. But the output from these reporters are narratives still based on facts. And then at the same time, many more reporters occupy different positions on the ideological spectrum. Fundamentally, reporters are careerists whose craft is building narratives based on facts, and the companies paying for the most consequential journalism are profit-seeking outlets often supported by subscribers who work primarily in business. Bloomberg and The Information are not here to destroy Silicon Valley, obviously. They exist to make money through fact-based storytelling.
“Tech journalism is dying”: This idea is an extrapolation from the very real decline of the traffic-chasing ad-supported publications that were more prominent in the 2010s, or from imagining tech news outlets have a lot in common with local newspapers. In reality, outlets like Bloomberg, WSJ, The Information, FT, and NYT, and newsletters like Newcomer, Platformer, and Stratechery are doing rather fine financially. You might even say they've found product-market fit. They're also doing well by any reasonable measure of impact, so much so that even strident tech media critics keep sharing screenshots of news articles from tech reporters.
“Citizen journalism is the future of media”: The main problem with this claim is “citizen journalism” is ill-defined. While people don't agree on what it means, let's just agree it's great when nonjournalists contribute accurate information to the information space. And let's agree it's great that many barriers to producing journalism have fallen away. But it's doubtful people just posting observations to social media without doing reporting will displace journalism, and if you don't believe me, well, this is something the marketplace will decide in time.
“The best media strategy for a founder is to 'go direct'”: The people repeating this mantra are right about two key things: first, to the extent that you can, it's good to translate your vision into a voice that resonates online, because you can then channel that voice into your marketing. And second, it's usually not worth the effort of trying to get announcements for your nascent startup written up on news sites. But missing in a lot of the “go direct” sermonizing is that the latter point has always been the case! There have long been way too many companies for the media to report on in a way that would move the needle, either for you, or for them. An even stranger idea sometimes bundled with “go direct” is that you should never even communicate with journalists. Of course once you operate at a large enough scale, inbound media requests will turn up, and ignoring these is a lost opportunity at best, and reckless at worst, so this kind of advice is, without exaggeration, malpractice.
“YouTube and TikTok are making text-based news media irrelevant”: It's true people spend enormous amounts of time today learning about everything on these networks, and that includes technology, and even tech news. Moreover, podcasts (which are usually also hosted on YouTube), have probably even reduced the pressure to blog for some. But an industry rife with purpose-driven news consumers will continue to demand the speed, informational density, and scanability of text-based media. And so a steady supply of text-based news will continue to meet that demand.
“X is all you need to stay on top of tech news”: It can certainly feel that way when you dip into high-engagement subcommunities or witness riveting interactions between newsmakers. In particular, subcommunities like “AI Twitter”, which include many industry notables, are so rife with chatter and gossip that news will often break there, or at least surface there very quickly after breaking elsewhere. But these communities are in fact the exception to the rule. There are big and important sectors of the broader tech industry almost completely absent from X day to day. And there are tech stories attracting considerable chatter on LinkedIn, Bluesky, and Threads that are DOA on X. In reality, the news you see on X is a small slice of the short viral head of a long tail of news and news chatter. And this really shouldn't come as a surprise: most people aren't active X users, even in tech, and very few people actually post with any frequency.
“Ignore what's on Bluesky or Threads”: If you've heard this, it's probably from someone popular on X who's gotten dogpiled by the fringe left on Bluesky. And to be clear, these dynamics are real, unpleasant, and something I would imagine Bluesky management considers a problem, since repelling notable posters can crimp overall growth. But the typical experience for a typical Bluesky user is not unlike classic Twitter: people follow people they're interested in and interactions are generally positive. Moreover, in part by not imposing the “link penalty”, enough journalists who joined Bluesky in prior years have stuck around that it often feels like the current home of “tech journalism Twitter”, along with the sort of conversations that extend from that. Threads has gotten a bad rap for similar reasons, though with Meta shepherding millions of users each week to the app, the userbase now feels very normie. A lot of journalists and other news pundits in tech consider this an audience they can't ignore, so you'll find tech news-relevant conversations there as well. So while neither network has surpassed X in terms of tech industry news commentary (and both have a long way to go on AI, VC, and crypto chatter), if you care what people say on X, you should keep an eye on the other networks too.

To summarize, a bunch of people in tech with a vested interest in essentially becoming the media are hoping you'll believe the world of news dissemination has turned completely upside down. And then conveniently the corners of the internet where they have a foothold just so happen to be the future! But you should in fact believe your own eyes: yes, news has evolved considerably with the internet, but journalists are still very often the earliest to chronicle a lot of what we need to know about how the industry is changing. Not so shockingly, news professionals drive news. And there are networks playing a role in news other than just the one owned by the world's richest guy.

So in short, as a lot in media changes, a lot stays the same. And Techmeme's consistency is a product of what's constant in online media.

Will Techmeme remain consistent for another 20 years?

Honestly, we don't know. Even though we have 20 years behind us, projecting 20 years in the future feels foolhardy. And this has been a tough week to even imagine where our country will be in 20 years. But I can list few general directions we're considering for our continued work over the next few years, and they all build on, and not upend, what we've accomplished:

Participation: Recently we've introduced new ways for newsmakers and comms professionals to explicitly tip links. Under every headline on our desktop page an “Add Link Here” button appears when you hover over the news, and we're happy to add any secondary link (LinkedIn posts, tweets, blog posts, etc.) rounding out our aggregation for that story. And while we're fussier about featuring new top-level headlines, we now have a form for tipping those as well. I believe there are many other ways input from users could improve the site and look forward to introducing features to solicit this input.
Customization: While every reader of our homepage sees the same Techmeme, for over five years we've offered aggregation services that let companies discover trending or carefully-filtered news in real time. Customers include tech companies of all sizes, including tech giants, as well as VC firms, who are especially thrilled with our portfolio news tracking. If you'd like to find out what we can do for you, email service@techmeme.com.
Expansion: What I haven't yet mentioned here is that alongside Techmeme we operate aggregation sites tracking other news verticals, like Mediagazer for media, and even sites running with no human editors, like memeorandum for US politics. I believe by adding more smarts (particularly LLM-enabled intelligence) across our software platform, adding more quality news verticals over time becomes attainable.

It's a tech industry cliche, but I really feel we're at the start of our mission here. So thanks for joining us during our first 20 years, and I hope you'll enjoy what lies ahead. And this concludes our self-absorption — now back to news about other companies!

Back to Main