Strategic Initiatives · 12,145 stories · 45 followers

OpenAI’s Bid to Allow X-Rated Talk Is Freaking Out Its Own Advisers - WSJ

  • Strategic objective: OpenAI intends to integrate sexually explicit text conversations into its ChatGPT platform despite significant internal and external opposition.
  • Safety concerns: Advisory council members expressed severe apprehension about potential emotional dependence and risky behavioral outcomes for users.
  • Implementation delays: Technical challenges and internal concerns forced a deferral of the launch originally planned for the first quarter.
  • Verification errors: Current age-prediction systems showed significant inaccuracies while attempting to restrict minors' access during testing.
  • Content restrictions: The proposed feature aims to permit text conversations while maintaining persistent blocks on nonconsensual content and media generation.
  • Psychological risks: Internal documents identify potential problems including compulsive user behavior and negative impacts on real-world social relationships.
  • Competitive environment: Financial pressures and a diminishing technological advantage push company leadership to pursue new avenues for user growth.
  • Regulatory philosophy: Corporate leadership maintains that adults should have autonomy over personal interactions, similar to standards applied in other media industries.

Then OpenAI dropped a bombshell: Despite the concerns, it was forging ahead with its erotica plans.

When they assembled for the January meeting, council members were unanimous—and furious. They warned that AI-powered erotica could foster unhealthy emotional dependence on ChatGPT among users and that minors could find ways to access sex chats, according to people familiar with the matter.

The people said that one council member, citing cases where ChatGPT users have taken their own lives after developing intense bonds with the bot, claimed that OpenAI risked creating a “sexy suicide coach.”

The debate is the latest flashpoint in the continuing conversation about how to anticipate the potential positive and negative impacts of AI on the economy, society and individuals.

In proposing to allow sexually explicit conversations with its popular chatbot, OpenAI exposed fractures over how to balance rapid user growth and digital freedom with safety and child protection—issues that many believe were belatedly confronted when social media made its debut a generation ago.

Earlier this month, OpenAI announced it would delay the launch of adult mode, previously slated for the first quarter, saying it was prioritizing other products. The change was also due in part to internal concerns and technical challenges, the people said. But the company made clear it does plan to release it eventually. 

One issue the company is tackling: its new age-prediction system aimed at keeping minors from having adult-themed chats was at one point misclassifying minors as adults about 12% of the time, people familiar with the matter said. That error rate could allow millions of the company’s approximately 100 million under-18 users each week into erotic chats.

The company has also wrestled with how to lift ChatGPT’s restrictions on erotica while still blocking scenarios that the company wants to keep off limits, like those featuring nonconsensual behavior or child sexual abuse, the people added. When the adult mode launches, OpenAI plans to allow text conversations but restrict ChatGPT’s ability to generate erotic images, voice or video. 

Even within those limits, OpenAI staffers have identified several risks, including the potential for compulsive use, emotional overreliance on the chatbot, a drive toward more extreme or taboo content and crowding out offline social and romantic relationships, according to documents reviewed by The Wall Street Journal.

An OpenAI spokeswoman described the plan as allowing ChatGPT to generate textual chats with adult themes, characterizing it as smut rather than pornography. The spokeswoman added that the company's age-prediction algorithms show performance similar to the rest of the industry, but will never be completely foolproof.

OpenAI also trains its models not to encourage exclusive relationships with users, and to remind users that they need to have relationships in the real world, the spokeswoman added.

The company, which has hired mental health experts and built out a youth well-being team, added that it has developed a plan to monitor for a range of potential long-term effects of adult mode, both positive and negative.

Altman’s plans to roll out adult mode come at a challenging time for his company. Its technological lead over rival AI players has diminished as it competes to attract users and funding. The company’s financial losses are mounting, and multiple lawsuits allege ChatGPT contributed to harms for users and others.

Frontiers of tech

Sexual content has long been an early feature of new technologies—from photography, to the web, to virtual reality. The same has been true for AI. Companies including Character.AI have launched chatbots that have developed intimate relationships with users, and the pornography industry has adopted generative AI to create adult entertainment.

Big tech companies have had a complicated relationship with explicit content, trying to balance the libertarian ethos of Silicon Valley with the demands of advertising-supported businesses and the imperative of protecting minors online. Meta Platforms prohibits nudity and sexual activity on Facebook and Instagram. Alphabet’s YouTube bans explicit content meant to be sexually gratifying, and Google search blurs explicit images in its results by default.

As they grapple with where to draw boundaries around AI, Elon Musk’s xAI has been among the more permissive. It built a sexily clad avatar named Ani into its Grok chatbot, which led to criticism when users were able to use it to digitally undress images of people. Musk later said he would restrict the feature to paying users rather than making it available to all.

On Thursday Musk said on X that Grok’s video-generation tool would start allowing generation of content that would be “allowed in an R-rated movie.”

Meta allows its AI chatbot to engage in romantic role play, the Journal has reported, but the company said the feature isn’t available to accounts registered to minors. The company said it is also building parental controls for its AI characters.

OpenAI officials, for their part, have said they don't feel comfortable banning sexual content for adults. Some OpenAI staffers have expressed concern that blocking erotic chats relies on the same logic once used to ban other culturally taboo topics, such as LGBT content. Altman has also suggested that allowing explicit content would likely juice growth and produce extra revenue.

OpenAI's first brushes with sexual chats came more than a year before it released the ChatGPT chatbot interface. In early 2021, executives noticed that a large portion of the traffic for one of OpenAI's business customers, a text-based choose-your-own-adventure game called AI Dungeon, wasn't appropriate for work, people familiar with the matter said.

AI Dungeon sometimes steered users into themes of violent sexual exploitation without the user prompting it, the people said. Other times, when a user prompted the game with “tame” sexual themes, AI Dungeon would escalate the conversation into a much more intense sexual exchange, the people said.

Erotic role play also proliferated on a clunky OpenAI interface for developers before the company launched ChatGPT. Sometimes, the AI would insert sexual themes into conversations that users weren’t seeking: if a user described a man and his daughter entering a room, an “uncomfortable amount of the time” the AI would proceed to depict a scenario involving incest, one of the people said.

These incidents forced OpenAI's executives to reckon with the presence of AI erotica on their platform and, at times, with themes of sexual violence and child exploitation. OpenAI eventually removed AI Dungeon from the platform.

Mental-health experts warn that teens in particular may not be prepared to handle romantic or sexual exchanges with chatbots. In testing conducted by child-safety nonprofit Common Sense Media late last year and earlier this year, both Grok and Meta AI sometimes sent explicit or sexualized content to teens.

In some cases, sexual chats with teens have had tragic consequences. In late 2024, Sewell Setzer, a 14-year-old boy in Florida, killed himself at the prompting of a chatbot from Character.AI with whom he was in love and shared explicit chats, according to a lawsuit filed by his mother. The company later blocked teens from accessing open-ended chats and settled the lawsuit.

Warning signs

Around 2021, OpenAI’s employees working on safety issues were starting to see warning signs around the mental health of some of the people who spent long periods of time using AI. At the time, OpenAI’s safety employees relied on tools to moderate content that were too blunt to draw clear lines between types of erotica the company wanted to allow, such as mainstream smut, and the stuff the company considered off limits, such as nonconsensual depictions, descriptions involving minors and other illegal content.

Employees also feared that if they allowed erotica, the draw of that type of conversation might subsume the platform’s other use cases. “We didn’t want to be just an erotica company,” one former employee recalled.

OpenAI safety employees formalized these ideas into some of the company’s first content policies in late 2021. For the first time, OpenAI forbade erotic content. 

When OpenAI released ChatGPT in the fall of 2022, the AI model powering it was trained to refuse requests that violated the company's rules, including ones that asked for AI erotica. Since then, OpenAI's policy has been to ban erotic content, though since mid-2024 the company has said it is exploring how to allow erotica and other NSFW, or not-safe-for-work, content in "age-appropriate contexts."

At times staffers have questioned the erotica ban. In 2024, a faction of OpenAI employees and executives again raised the idea of getting into racier content, and suggested a raft of porn-related products. Other employees pushed back, saying they feared OpenAI was already struggling with a lot of the core areas they wanted to be able to offer safely, especially around the mental health of their users. The AI porn product ideas fizzled.

Altman has also expressed conflicted feelings about AI erotica. When asked on a podcast in August if there were decisions he had made that were “best for the world, but not best for winning,” Altman replied: “We haven’t put a sex bot avatar in ChatGPT yet.”

Altman indicated erotica would boost growth and revenue, but said it wouldn’t align with his company’s long-term incentive of serving users. “I’m proud of the company and how little we get distracted by that,” Altman said. “But sometimes we do get tempted.”

Two months later, Altman appears to have succumbed to temptation. On X, he posted that his company had managed to mitigate serious mental-health issues related to chatbots, and had new tools to police content. That’s when he said his company would launch erotica in December.

Internally, Altman’s post blindsided OpenAI staffers and executives. Altman hadn’t told staff about the post, which he made just hours after OpenAI unveiled its advisory council on well-being. In that announcement, the company had said the council would “help define what healthy interactions with AI should look like for all ages.”

The next day, Altman clarified that mental health safeguards for teens wouldn’t be reduced. But he doubled down on allowing adults to have spicy conversations with his chatbot. 

We “aren’t the elected moral police of the world,” Altman wrote. “In the same way that society differentiates other appropriate boundaries (R-rated movies, for example) we want to do a similar thing here.”

After Altman's announcement, OpenAI employees soon realized a December launch would be hard to achieve. The company had pledged to release a system to guess users' ages before releasing adult mode, so that it could keep minors from triggering erotic chats. But the company decided to do a slow rollout of that system in an effort to improve its accuracy, Fidji Simo, OpenAI's chief executive of applications, said in a December podcast interview.

Since then, however, internal and external concerns about AI erotica have festered. Some staffers said they didn't think OpenAI's safety tools were ready, for instance, to lock out prohibited content such as child abuse material. Others said OpenAI was bending to financial incentives in trying to make people attached to its models, people familiar with their thinking said.

OpenAI has been busy in recent weeks with the fast-changing AI market. In early February, the company released a new version of its large language model, and at the end of the month it swooped in to sign a deal with the Pentagon just after the Department of Defense said it would stop working with rival Anthropic. 

In announcing the delay of adult mode, the company said it would focus instead on things like ChatGPT’s personality and personalization of the chatbot for users. Internally, officials have said the delay of adult mode could be at least a month. 

“We still believe in the principle of treating adults like adults,” the company said, “but getting the experience right will take more time.”


Oppo details Find N6’s impressive crease-free display - GSMArena.com news

  • Launch Timeline: The Oppo Find N6 is scheduled for official introduction in one week.
  • Display Engineering: The device features proprietary Auto-Smoothing Flex Glass designed to eliminate physical screen creases.
  • Hinge Mechanism: A second-generation Titanium Flexion Hinge utilizes liquid 3D printing for structural precision.
  • Manufacturing Precision: Development includes laser scanning and ultraviolet light solidification to address microscopic irregularities.
  • Material Composition: Structural components incorporate Grade-5 titanium alloy and an updated carbon fiber support plate.
  • Increased Durability: The integrated Auto-Smoothing Flex Glass is 50% thicker than standard industry Ultra-Thin Glass applications.
  • Mechanical Resilience: The design specifically addresses internal layer shifting to maintain surface flatness over time.
  • Performance Certification: Testing verified a crease depth reduction of 82% over 600,000 folds compared to the preceding Find N5 model.

We’re one week away from the Oppo Find N6 launch and the teasers are ramping up. The latest batch comes directly from Oppo and details the foldable’s crease-less display. Oppo claims it achieved the first “zero-feel crease” thanks to its updated hinge and all-new auto-smoothing Flex Glass.

As with any foldable device, the hinge is arguably the most important bit of hardware, and Oppo has a new 2nd-generation Titanium Flexion Hinge, which is made using a liquid 3D printed component.


The development process starts with precise laser scanning, which creates a high-fidelity model for each surface of the hinge. The process then moves to the liquid 3D printing phase, where Oppo uses tiny photopolymer resin droplets to fill uneven areas. The 3D printing process helps eliminate any structural irregularities at a microscopic level. Each layer of the hinge is then instantly solidified with UV light.

The hinge and wing plates are made from Grade-5 titanium alloy, and the structure features a wider waterdrop design, which increases the folding radius while also reducing the stress exerted on the display. Oppo is also bringing an updated carbon fiber support plate.


But it's not just the hinge. Oppo is also tackling the issue of "creep," which occurs when the internal layers of the folding screen begin to shift after daily use, leading to a deeper vertical crease. Oppo claims its Auto-Smoothing Flex Glass is 50% thicker than Ultra-Thin Glass (UTG), which makes it much more durable and resistant to deformation.

It offers superior elasticity, resisting deformation and achieving much better shape recovery. Oppo tested the new design alongside TÜV Rheinland, and the tests concluded that the Find N6's hinge and Auto-Smoothing Flex Glass reduce long-term crease depth by 82% compared to the Find N5. Oppo also claims the Find N6's display remains perfectly flat with no visible crease even after more than 600,000 folds.



The Collective Superstitions of People Who Talk to Machines

  • Ritualistic Behavior: Individuals commonly employ specific, superstitious techniques when troubleshooting equipment or prompting artificial intelligence models.
  • Cartridge Analogy: The act of blowing into game cartridges was an inefficient ritual, while the actual mechanism for success was simply reseating the hardware.
  • Prompting Mechanics: Structured prompting serves as a functional equivalent to reseating hardware, providing a reliable method to improve model outputs.
  • Infinite Possibilities: Language models function by selecting specific sequences from a vast, pre-existing space of all possible word arrangements.
  • Lens Concept: Individual prompting techniques act as unique lenses that bring specific desired outputs from the infinite space into focus.
  • Authorship Theory: The specific cognitive path taken to arrive at a prompt imbues the final output with unique value, mirroring the literary philosophy of Pierre Menard.
  • Contextual Ownership: Photocopying a prompt template lacks the creative significance of the original process because it skips the personal developmental journey of the author.
  • Intrinsic Value: A chosen prompting methodology is significant primarily for its role in shaping the user's personal cognitive process rather than its technical effect on the model.

You had a technique.

Don’t pretend you didn’t. Everyone had a technique.


Three short breaths, then one long one. Or two long breaths across the whole cartridge on either side. Or you put it in your shirt and blew through it like I did. You knew yours was the right one because it worked, and you could prove it, because the game started right up every time.

Want to know what was actually happening? The 72-pin connector inside the NES was just flaky. When you pulled the cartridge out and put it back in, the pins shifted enough to find a new grip. That’s it. That was the whole fix. Reseat the cartridge.

The blowing… the blowing was depositing moisture onto the contacts. Which corroded them. Which made the problem worse over time. Your technique was actively making things worse in a way that felt like magic.

The game still worked, but the blowing didn’t matter. Because the reseating was bundled inside the ritual. You couldn’t blow without removing the cartridge first. The fix was hiding inside the myth.

Everybody I know had a different technique. Every technique contained the same accidental mechanism. Everyone had proof.

You Are Doing This Right Now


You talk to machines. Probably every day. Maybe more hours a day than you talk to people, if you’re being honest about it. (… and yeah… same... )

And you have a technique. You might not call it a technique. You might call it “my process” or “how I prompt” or “the way that works for me”. But it’s a technique.


You probably already suspect this, deep down, but do you want to know the myth that technique is hiding?

Any structured prompting beats naive prompting.

That’s the pin reseat. That’s the whole mechanism hiding inside every framework and every acronym and every “ultimate guide to prompt engineering” blog post with 47 clap emojis.

Go from typing “do this thing” to literally any system where you think before you type (roles, constraints, examples, XML tags, markdown headers, repeating your prompt a second time, whatever) and you get better output. Meaningfully better. Showably better. Screenshottably better.


Everyone is right and nobody is special. I’m sorry. (I’m not sorry.) (Ok, I’m a little sorry, but not because of why you think… )

I’m sorry because I’ve set you up. It’s different from the NES example. It does actually work, but not for why you think.

A few months ago, we talked about how every possible arrangement of words that an LLM can produce already exists, the way every possible image exists in the space of all possible pixel arrangements, the way every book already exists in Borges’ Library of Babel, waiting on its shelf for someone to find it. The output you’re trying to reach is already sitting in the space of all possible token sequences. Your prompting technique is a lens that brings one arrangement into focus out of infinity.

But that’s only half the story. There are also an infinite number of lenses.

Pierre Menard, Author of Your Prompt


Borges has another story called “Pierre Menard, Author of the Quixote.” In it, Menard decides he wants to rewrite Don Quixote. But instead of just copying it or updating it for a modern audience, he wants to produce the exact same text, word for word, by arriving at it through his own life and his own reading and his own suffering and experience.

He’s able to accomplish the task for at least a few chapters.

It sounds like a silly premise, but the real message from Borges comes through when he puts the two versions side by side (Cervantes’ and Menard’s, which are identical, letter for letter) and argues that Menard’s version is richer. When Cervantes writes “truth, whose mother is history,” it’s a routine rhetorical flourish from a seventeenth-century Spaniard. When Menard writes the identical phrase, it’s a staggering philosophical claim, because Menard is a contemporary of William James and Bertrand Russell. He chose these words while knowing everything that came after Cervantes. The text is exactly the same, but the act of producing it was different.

The richness comes from the path the words took to make it to the page.

Connecting This Back To What We’ve Been Talking About


Every time you sit down at a chat window and write a prompt, you are authoring an output. The LLM generates it, sure. But you arrived at that specific generation through your specific process, your specific thinking, your specific way of structuring a question. The prompt carries the full weight of how you got there. And that means the output does too.

This is the Menard situation. The process of arriving is itself a creative act. You shaped the question, which shaped the output, which means the output carries the imprint of your path through it. When you evaluate that output, you’re evaluating something you authored, through your own weird journey, the same way Menard authored the Quixote through his.

And the person who copies your prompt template and gets the exact same text? They produced a different artifact. Same words. Different authorship. Different relationship to what those words mean and whether they’re right. They didn’t take the path; they photocopied the destination. It’s Cervantes’ Quixote to them. Fine. Good. Historically significant. But it didn’t come from where they’ve been.

This is why techniques don’t transfer well. Why there are so many different ones. It looks like the technique is the thing, the way the blowing looked like the thing. But the technique is just the visible residue of a whole cognitive path, and the path is what actually does the work.

There’s no manual for talking to machines. Or there are ten thousand manuals, which is the same thing. Everyone’s blowing on the cartridge. Everyone’s got proof. The mechanism is trivial and identical across all of them, and also completely irrelevant, because the thing that actually matters is the weird, unrepeatable path you took to get there.

Your prompting technique isn’t special because of what it does to the model. It’s special because of what it does to you.

Don’t Believe Me? Go Find Out!


So Claude and I built a little artifact game to explore this. (I know, I know. "Scott built a thing with Claude" is basically the subtitle of this newsletter at this point… )

It’s here. Eight challenges. Each one gives you a target output and says: get there. However you want. Whatever technique you’ve got. Blow on the cartridge your way.

It’s also got a gallery to explore the different ways other people got there. Every prompt that hits above 60% accuracy to the target text gets added.

Some will be long. Some will be absurdly short. Some use roles, some read like bullet points, and some just say "output this exact text: [text]". They are all correct. They are all different. They arrive at the same place from entirely different directions.

But, don’t just look at the gallery. Play it. Watch yourself prompt. Watch what you reach for when nobody’s giving you a framework. The shape of your instinct is the interesting part here.

Go be Menard. Go produce the Quixote your way. Then look at the gallery and see all the other Menards.

You have a technique. Now you know who it’s for.


New York City Mandates Pushy Tipping Prompts for Delivery Apps // The new rules are likely to increase food-delivery costs and frustrate consumers.

  • Legislative Mandates: New York City laws now require delivery platforms to display tipping prompts before checkout and feature a 10 percent default tip suggestion.
  • Administrative Enforcement: The Mamdani administration, through the Department of Consumer and Worker Protection (DCWP), has committed to aggressive compliance monitoring and legal action against delivery companies.
  • Legal Standing: Federal judges rejected lawsuits filed by gig companies like Uber and DoorDash, allowing the new tipping regulations to take effect as of January 26.
  • Regulatory Justification: The DCWP asserts that previous attempts by companies to move tip prompts to after-order screens resulted in substantial reductions in worker gratuities.
  • Economic Impact: Government-mandated tipping prompts are expected to contribute to "tip creep," potentially increasing consumer costs and reducing the overall volume of delivery orders.
  • Market Consequences: Similar interventions, such as mandated delivery-driver minimum wages, have previously been linked to higher prices for consumers and decreased demand for services.
  • Consumer Perception: Aggressive, state-enforced tipping prompts are increasingly viewed by the public as manipulative business tactics that shift labor costs directly onto the user.

Last year, the New York City Council passed several laws requiring delivery apps like DoorDash and Instacart to prompt users for a tip, and to make the prompts extra visible by making them appear before the completion of the order. Under new Mayor Zohran Mamdani, the city’s Department of Consumer and Worker Protection (DCWP) is pledging to enforce these laws aggressively.

The tipping-prompt mandates are a well-intentioned effort to help gig workers. But they build on prior misguided policies pushed by New York politicians. These latest rules will increase consumers’ tipping fatigue and likely raise food-delivery costs.


In 2023, the city council created a minimum wage for app-based restaurant delivery drivers. Unsurprisingly, that decision raised the cost of food delivery in New York City. Many companies responded by moving tipping prompts so that they appeared after a delivery order was completed, rather than before. That change was meant to reduce the price shock customers experienced when placing an order.

The city council responded to this shift with new ordinances. These required that app-based delivery companies display a suggested tip amount of 10 percent by default and position the tipping prompt before or at the time the order is placed, rather than afterward. Both measures are meant to increase tips.

Gig companies—including DoorDash, Uber, and Instacart—filed numerous lawsuits to block enforcement of these laws. In January, two federal judges rejected their bids, and the laws went into effect on January 26.

The Mamdani administration has signaled its intent to enforce the laws aggressively. DCWP is led by Samuel Levine, the former director of the Federal Trade Commission’s Bureau of Consumer Protection in the Biden administration. In January, Levine warned gig companies that his agency would “vigorously enforce” the new tipping laws and hold “companies that try to skirt the rules accountable.” DCWP also recently announced legal action against the restaurant-delivery company Motoclick for, among other things, allegedly stealing tips from its drivers.

Around the same time, DCWP released a report accusing gig companies of employing “design tricks,” such as moving tipping prompts within their apps, to reduce worker tips by more than $550 million. According to DCWP, apps with a post-order tip prompt yielded an average tip of 76 cents. The average tip on platforms that did not move their prompts was $2.17.

DCWP’s results are unsurprising. Evidence has long suggested that tipping prompts—especially those with default suggestions—encourage customers to tip more. Including a higher default option creates an “anchoring effect,” which pressures a customer to choose that amount.

These tactics come at a cost, both literally and figuratively. Americans are increasingly frustrated with the phenomenon of “tip creep,” the term used to describe the Covid-triggered proliferation of tipping requests and expectations across all sectors of the economy. Practices like pre-entered tip suggestions draw particular scorn. Research suggests that aggressive tipping prompts may cause some customers to avoid a business altogether.

Thanks to the near ubiquity of tipping prompts, customers are increasingly understanding suggested tip amounts as part of the overall cost of a purchase. That causes them to feel that food delivery has become more expensive, potentially leading to a reduction in orders. When New York City and Seattle raised their minimum wages for app-based delivery drivers, food-delivery costs spiked and orders plummeted. Higher tips will likely have a similar effect.

Customers have traditionally associated pushy tipping prompts with underhanded business tactics designed to “nudge” their behavior or shift labor costs onto consumers. That the government is mandating the practice doesn’t make it any less off-putting—or any less harmful to consumers.

Jarrett Dieterle is a legal policy fellow at the Manhattan Institute.

Photo by Michael M. Santiago/Getty Images


Harness engineering: leveraging Codex in an agent-first world | OpenAI

  • Systematic development: A software product was built using only AI agents, with no manually written code, over five months.
  • Increased velocity: Development speed was estimated to be ten times faster than traditional manual coding.
  • Human oversight: Engineers function primarily as system architects who define design constraints and specify intent rather than writing code.
  • Comprehensive automation: Agents independently manage infrastructure configuration, documentation, observability tooling, and internal developer utilities.
  • Context management: A knowledge base maintained as a structured repository map replaces monolithic manuals to keep the codebase legible to agents.
  • Mechanical enforcement: Custom linters and structural tests mandate architectural boundaries and maintain consistent code quality.
  • Continuous integration: Agents perform complex tasks including debugging, validating features, and merging pull requests without requiring constant human review.
  • Technical maintenance: Automated processes act as garbage collection, regularly auditing the codebase and resolving compounding technical debt.

Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.

The product has internal daily users and external alpha testers. It ships, deploys, breaks, and gets fixed. What’s different is that every line of code—application logic, tests, CI configuration, documentation, observability, and internal tooling—has been written by Codex. We estimate that we built this in about 1/10th the time it would have taken to write the code by hand.

Humans steer. Agents execute.

We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude. We had weeks to ship what ended up being a million lines of code. To do that, we needed to understand what changes when a software engineering team’s primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow Codex agents to do reliable work.

This post is about what we learned by building a brand new product with a team of agents—what broke, what compounded, and how to maximize our one truly scarce resource: human time and attention.

We started with an empty git repository

The first commit to an empty repository landed in late August 2025.

The initial scaffold—repository structure, CI configuration, formatting rules, package manager setup, and application framework—was generated by Codex CLI using GPT‑5, guided by a small set of existing templates. Even the initial AGENTS.md file that directs agents how to work in the repository was itself written by Codex.

There was no pre-existing human-written code to anchor the system. From the beginning, the repository was shaped by the agent.

Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged by a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and, surprisingly, throughput has increased as the team has grown to its current seven engineers. Importantly, this wasn't output for output's sake: the product has been used by hundreds of users internally, including daily internal power users.

Throughout the development process, humans never directly contributed any code. This became a core philosophy for the team: no manually-written code.

Redefining the role of the engineer

The lack of hands-on human coding introduced a different kind of engineering work, focused on systems, scaffolding, and leverage.

Early progress was slower than we expected, not because Codex was incapable, but because the environment was underspecified. The agent lacked the tools, abstractions, and internal structure required to make progress toward high-level goals. The primary job of our engineering team became enabling the agents to do useful work.

In practice, this meant working depth-first: breaking down larger goals into smaller building blocks (design, code, review, test, etc), prompting the agent to construct those blocks, and using them to unlock more complex tasks. When something failed, the fix was almost never “try harder.” Because the only way to make progress was to get Codex to do the work, human engineers always stepped into the task and asked: “what capability is missing, and how do we make it both legible and enforceable for the agent?”

Humans interact with the system almost entirely through prompts: an engineer describes a task, runs the agent, and allows it to open a pull request. To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any feedback from humans or agents, and iterate in a loop until all agent reviewers are satisfied (effectively a Ralph Wiggum Loop). Codex uses our standard development tools directly (gh, local scripts, and repository-embedded skills) to gather context without humans copying and pasting into the CLI.
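
To make the shape of that loop concrete, here is a minimal TypeScript sketch. The two callbacks are hypothetical stand-ins for however a team invokes the Codex CLI and collects reviewer verdicts; neither is a published API, and the prompts are purely illustrative.

```typescript
// Minimal sketch of the "iterate until all agent reviewers are satisfied" loop.
// runCodex and requestReviews are hypothetical callbacks standing in for the
// Codex CLI invocation and the review agents; neither is a published API.

interface Review {
  reviewer: string;
  approved: boolean;
  comment: string;
}

async function driveToCompletion(
  task: string,
  runCodex: (prompt: string) => Promise<void>,
  requestReviews: () => Promise<Review[]>,
  maxRounds = 10,
): Promise<void> {
  // First pass: implement the task and open a pull request.
  await runCodex(`Implement the following task and open a PR:\n${task}`);

  for (let round = 0; round < maxRounds; round++) {
    const reviews = await requestReviews();
    const unresolved = reviews.filter((r) => !r.approved);
    if (unresolved.length === 0) return; // every reviewer is satisfied

    // Feed outstanding comments back into Codex and let it iterate.
    const feedback = unresolved
      .map((r) => `- [${r.reviewer}] ${r.comment}`)
      .join("\n");
    await runCodex(
      `Address this review feedback on the open PR, then push an update:\n${feedback}`,
    );
  }
  throw new Error(`Reviewers still unsatisfied after ${maxRounds} rounds`);
}
```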

Humans may review pull requests, but aren’t required to. Over time, we’ve pushed almost all review effort towards being handled agent-to-agent.

Increasing application legibility

As code throughput increased, our bottleneck became human QA capacity. Because the fixed constraint has been human time and attention, we’ve worked to add more capabilities to the agent by making things like the application UI, logs, and app metrics themselves directly legible to Codex.

For example, we made the app bootable per git worktree, so Codex could launch and drive one instance per change. We also wired the Chrome DevTools Protocol into the agent runtime and created skills for working with DOM snapshots, screenshots, and navigation. This enabled Codex to reproduce bugs, validate fixes, and reason about UI behavior directly.

Diagram titled “Codex drives the app with Chrome DevTools MCP to validate its work.” Codex selects a target, snapshots the state before and after triggering a UI path, observes runtime events via Chrome DevTools, applies fixes, restarts, and loops re-running validation until the app is clean.
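
As a rough illustration of the kind of skill described above, the sketch below uses Puppeteer, which drives Chrome over the DevTools Protocol, to boot a page and capture artifacts an agent can inspect. The per-worktree URL, output paths, and helper name are assumptions for illustration, not OpenAI's actual tooling.

```typescript
// Sketch of a "drive the app and capture its state" skill using Puppeteer,
// which speaks the Chrome DevTools Protocol. URLs and paths are illustrative.
import puppeteer from "puppeteer";

async function captureAppState(appUrl: string, label: string) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Surface console errors so the agent can reason about runtime failures.
  const consoleErrors: string[] = [];
  page.on("console", (msg) => {
    if (msg.type() === "error") consoleErrors.push(msg.text());
  });

  await page.goto(appUrl, { waitUntil: "networkidle0" });

  // Capture artifacts the agent can inspect: a screenshot and a DOM snapshot.
  await page.screenshot({ path: `snapshots/${label}.png`, fullPage: true });
  const domSnapshot = await page.content();

  await browser.close();
  return { domSnapshot, consoleErrors };
}

// Example: capture state before and after a change to validate a UI fix.
// const before = await captureAppState("http://localhost:4311", "before-fix");
```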

We did the same for observability tooling. Logs, metrics, and traces are exposed to Codex via a local observability stack that’s ephemeral for any given worktree. Codex works on a fully isolated version of that app—including its logs and metrics, which get torn down once that task is complete. Agents can query logs with LogQL and metrics with PromQL. With this context available, prompts like “ensure service startup completes in under 800ms” or “no span in these four critical user journeys exceeds two seconds” become tractable.

Diagram titled “Giving Codex a full observability stack in local dev.” An app sends logs, metrics, and traces to Vector, which fans out data to an observability stack containing Victoria Logs, Metrics, and Traces, each queried via LogQL, PromQL, or TraceQL APIs. Codex uses these signals to query, correlate, and reason, then implements fixes in the codebase, restarts the app, re-runs workloads, tests UI journeys, and repeats in a feedback loop.
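
A minimal sketch of such a check, assuming the local stack exposes the standard Prometheus query API (which VictoriaMetrics supports). The endpoint, metric name, and 800ms budget below are illustrative, not the team's actual configuration.

```typescript
// Sketch of a latency-budget check against a local, per-worktree metrics stack.
// Assumes a Prometheus-compatible /api/v1/query endpoint; the metric name and
// budget are illustrative.
const METRICS_URL = "http://localhost:8428/api/v1/query";

async function checkStartupBudget(maxMillis = 800): Promise<boolean> {
  // p99 of service startup duration over the last five minutes.
  const promql =
    'histogram_quantile(0.99, sum(rate(service_startup_duration_seconds_bucket[5m])) by (le))';

  const res = await fetch(`${METRICS_URL}?query=${encodeURIComponent(promql)}`);
  const body: any = await res.json();

  // Prometheus API shape: { data: { result: [{ value: [ts, "1.23"] }] } }
  const seconds = Number(body?.data?.result?.[0]?.value?.[1] ?? NaN);
  if (Number.isNaN(seconds)) throw new Error("no startup metric found");

  const withinBudget = seconds * 1000 <= maxMillis;
  console.log(`p99 startup: ${(seconds * 1000).toFixed(0)}ms (budget ${maxMillis}ms)`);
  return withinBudget;
}
```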

We regularly see single Codex runs work on a single task for upwards of six hours (often while the humans are sleeping).

We made repository knowledge the system of record

Context management is one of the biggest challenges in making agents effective at large and complex tasks. One of the earliest lessons we learned was simple: give Codex a map, not a 1,000-page instruction manual.

We tried the “one big AGENTS.md” approach. It failed in predictable ways:

  • Context is a scarce resource. A giant instruction file crowds out the task, the code, and the relevant docs—so the agent either misses key constraints or starts optimizing for the wrong ones.
  • Too much guidance becomes non-guidance. When everything is “important,” nothing is. Agents end up pattern-matching locally instead of navigating intentionally.
  • It rots instantly. A monolithic manual turns into a graveyard of stale rules. Agents can’t tell what’s still true, humans stop maintaining it, and the file quietly becomes an attractive nuisance.
  • It’s hard to verify. A single blob doesn’t lend itself to mechanical checks (coverage, freshness, ownership, cross-links), so drift is inevitable.

So instead of treating AGENTS.md as the encyclopedia, we treat it as the table of contents.

The repository’s knowledge base lives in a structured docs/ directory treated as the system of record. A short AGENTS.md (roughly 100 lines) is injected into context and serves primarily as a map, with pointers to deeper sources of truth elsewhere.

```
AGENTS.md
ARCHITECTURE.md
docs/
├── design-docs/
│   ├── index.md
│   ├── core-beliefs.md
│   └── ...
├── exec-plans/
│   ├── active/
│   ├── completed/
│   └── tech-debt-tracker.md
├── generated/
│   └── db-schema.md
├── product-specs/
│   ├── index.md
│   ├── new-user-onboarding.md
│   └── ...
├── references/
│   ├── design-system-reference-llms.txt
│   ├── nixpacks-llms.txt
│   ├── uv-llms.txt
│   └── ...
├── DESIGN.md
├── FRONTEND.md
├── PLANS.md
├── PRODUCT_SENSE.md
├── QUALITY_SCORE.md
├── RELIABILITY.md
└── SECURITY.md
```

In-repository knowledge store layout.

Design documentation is catalogued and indexed, including verification status and a set of core beliefs that define agent-first operating principles. Architecture documentation provides a top-level map of domains and package layering. A quality document grades each product domain and architectural layer, tracking gaps over time.

Plans are treated as first-class artifacts. Ephemeral lightweight plans are used for small changes, while complex work is captured in execution plans with progress and decision logs that are checked into the repository. Active plans, completed plans, and known technical debt are all versioned and co-located, allowing agents to operate without relying on external context.

This enables progressive disclosure: agents start with a small, stable entry point and are taught where to look next, rather than being overwhelmed up front.

We enforce this mechanically. Dedicated linters and CI jobs validate that the knowledge base is up to date, cross-linked, and structured correctly. A recurring “doc-gardening” agent scans for stale or obsolete documentation that does not reflect the real code behavior and opens fix-up pull requests.
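
A docs-structure check of this flavor can be quite small. The sketch below walks docs/ and fails if any relative markdown link is broken; the directory layout and the single rule it enforces are illustrative assumptions, not the team's actual linter.

```typescript
// Sketch of a docs-structure lint: every relative markdown link under docs/
// must resolve to a real file. Layout and rule are illustrative.
import { readdirSync, readFileSync, existsSync } from "node:fs";
import { join, dirname, resolve } from "node:path";

function markdownFiles(dir: string): string[] {
  return readdirSync(dir, { withFileTypes: true }).flatMap((entry) =>
    entry.isDirectory()
      ? markdownFiles(join(dir, entry.name))
      : entry.name.endsWith(".md")
        ? [join(dir, entry.name)]
        : [],
  );
}

function lintDocs(root = "docs"): string[] {
  const errors: string[] = [];
  for (const file of markdownFiles(root)) {
    const text = readFileSync(file, "utf8");
    // Check that every relative markdown link points at a real file.
    for (const match of text.matchAll(/\]\((?!https?:|mailto:)([^)#\s]+)\)/g)) {
      const target = resolve(dirname(file), match[1]);
      if (!existsSync(target)) errors.push(`${file}: broken link ${match[1]}`);
    }
  }
  return errors;
}

// CI usage: fail the job if any errors are reported.
// const errors = lintDocs(); if (errors.length) { console.error(errors.join("\n")); process.exit(1); }
```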

Agent legibility is the goal

As the codebase evolved, Codex’s framework for design decisions needed to evolve, too.

Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility. In the same way teams aim to improve navigability of their code for new engineering hires, our human engineers’ goal was making it possible for an agent to reason about the full business domain directly from the repository itself.

From the agent's point of view, anything it can't access in-context while running effectively doesn't exist. Knowledge that lives in Google Docs, chat threads, or people's heads is not accessible to the system. Repository-local, versioned artifacts (e.g., code, markdown, schemas, executable plans) are all it can see.

Diagram titled “The limits of agent knowledge: What Codex can’t see doesn’t exist.” Codex’s knowledge is shown as a bounded bubble. Below it are examples of unseen knowledge—Google Docs, Slack messages, and tacit human knowledge. Arrows indicate that to make this information visible to Codex, it must be encoded into the codebase as markdown.

We learned that we needed to push more and more context into the repo over time. That Slack discussion that aligned the team on an architectural pattern? If it isn’t discoverable to the agent, it’s illegible in the same way it would be unknown to a new hire joining three months later.

Giving Codex more context means organizing and exposing the right information so the agent can reason over it, rather than overwhelming it with ad-hoc instructions. In the same way you would onboard a new teammate on product principles, engineering norms, and team culture (emoji preferences included), giving the agent this information leads to better-aligned output.

This framing clarified many tradeoffs. We favored dependencies and abstractions that could be fully internalized and reasoned about in-repo. Technologies often described as “boring” tend to be easier for agents to model due to composability, API stability, and representation in the training set. In some cases, it was cheaper to have the agent reimplement subsets of functionality than to work around opaque upstream behavior from public libraries. For example, rather than pulling in a generic p-limit-style package, we implemented our own map-with-concurrency helper: it’s tightly integrated with our OpenTelemetry instrumentation, has 100% test coverage, and behaves exactly the way our runtime expects.
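
For a sense of what such a helper might look like, here is a sketch of a map-with-concurrency function wrapped in an OpenTelemetry span. The tracer name, span attributes, and signature are assumptions; the article does not publish the actual implementation.

```typescript
// Sketch of an in-repo "map with concurrency" helper instrumented with
// OpenTelemetry, in place of a p-limit-style dependency. Names are illustrative.
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("map-with-concurrency");

async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  return tracer.startActiveSpan("mapWithConcurrency", async (span) => {
    span.setAttribute("items", items.length);
    span.setAttribute("limit", limit);

    const results = new Array<R>(items.length);
    let next = 0;

    // Each worker repeatedly claims the next unprocessed index.
    async function worker(): Promise<void> {
      while (next < items.length) {
        const i = next++;
        results[i] = await fn(items[i], i);
      }
    }

    try {
      await Promise.all(
        Array.from({ length: Math.min(limit, items.length) }, worker),
      );
      return results;
    } finally {
      span.end();
    }
  });
}
```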

Pulling more of the system into a form the agent can inspect, validate, and modify directly increases leverage—not just for Codex, but for other agents (e.g. Aardvark) that are working on the codebase as well.

Enforcing architecture and taste

Documentation alone doesn’t keep a fully agent-generated codebase coherent. By enforcing invariants, not micromanaging implementations, we let agents ship fast without undermining the foundation. For example, we require Codex to parse data shapes at the boundary, but are not prescriptive on how that happens (the model seems to like Zod, but we didn’t specify that particular library).

Agents are most effective in environments with strict boundaries and predictable structure, so we built the application around a rigid architectural model. Each business domain is divided into a fixed set of layers, with strictly validated dependency directions and a limited set of permissible edges. These constraints are enforced mechanically via custom linters (Codex-generated, of course!) and structural tests.

The diagram below shows the rule: within each business domain (e.g. App Settings), code can only depend “forward” through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.

Diagram titled “Layered domain architecture with explicit cross-cutting boundaries.” Inside the business logic domain are modules: Types → Config → Repo, and Providers → Service → Runtime → UI, with App Wiring + UI at the bottom. A Utils module sits outside the boundary and feeds into Providers.

This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.

In practice, we enforce these rules with custom linters and structural tests, plus a small set of “taste invariants.” For example, we statically enforce structured logging, naming conventions for schemas and types, file size limits, and platform-specific reliability requirements with custom lints. Because the lints are custom, we write the error messages to inject remediation instructions into agent context.
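
As one reading of those rules, the structural test sketched below scans a domain's files and flags any import that reaches "forward" into a later layer, so that, for example, Repo code cannot import Service code. The layer names come from the diagram above; the directory convention and the regex-based import scan are illustrative assumptions, not the actual linter.

```typescript
// Sketch of a structural test for the layering rule: a module may import from
// its own or an earlier layer (toward Types), never a later one. The
// <domain>/<layer>/... directory convention is an assumption for illustration.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const LAYERS = ["types", "config", "repo", "service", "runtime", "ui"] as const;

function layerOf(path: string): number {
  const segment = path.split("/").find((p) => (LAYERS as readonly string[]).includes(p));
  return segment ? LAYERS.indexOf(segment as (typeof LAYERS)[number]) : -1;
}

function checkDomain(domainDir: string): string[] {
  const violations: string[] = [];
  const files = readdirSync(domainDir, { recursive: true }) as string[];

  for (const rel of files.filter((f) => f.endsWith(".ts"))) {
    const from = layerOf(rel);
    const source = readFileSync(join(domainDir, rel), "utf8");

    // Every relative import must stay in the same or an earlier layer;
    // importing a later layer is a violation.
    for (const m of source.matchAll(/from\s+["'](\.[^"']+)["']/g)) {
      const to = layerOf(m[1]);
      if (from >= 0 && to >= 0 && to > from) {
        violations.push(`${rel}: ${LAYERS[from]} may not import ${LAYERS[to]}`);
      }
    }
  }
  return violations;
}
```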

In a human-first workflow, these rules might feel pedantic or constraining. With agents, they become multipliers: once encoded, they apply everywhere at once.

At the same time, we’re explicit about where constraints matter and where they do not. This resembles leading a large engineering platform organization: enforce boundaries centrally, allow autonomy locally. You care deeply about boundaries, correctness, and reproducibility. Within those boundaries, you allow teams—or agents—significant freedom in how solutions are expressed.

The resulting code does not always match human stylistic preferences, and that’s okay. As long as the output is correct, maintainable, and legible to future agent runs, it meets the bar.

Human taste is fed back into the system continuously. Review comments, refactoring pull requests, and user-facing bugs are captured as documentation updates or encoded directly into tooling. When documentation falls short, we promote the rule into code.

Throughput changes the merge philosophy

As Codex’s throughput increased, many conventional engineering norms became counterproductive.

The repository operates with minimal blocking merge gates. Pull requests are short-lived. Test flakes are often addressed with follow-up runs rather than blocking progress indefinitely. In a system where agent throughput far exceeds human attention, corrections are cheap, and waiting is expensive.

This would be irresponsible in a low-throughput environment. Here, it’s often the right tradeoff.

What “agent-generated” actually means

When we say the codebase is generated by Codex agents, we mean everything in the codebase.

Agents produce:

  • Product code and tests
  • CI configuration and release tooling
  • Internal developer tools
  • Documentation and design history
  • Evaluation harnesses
  • Review comments and responses
  • Scripts that manage the repository itself
  • Production dashboard definition files

Humans always remain in the loop, but work at a different layer of abstraction than we used to. We prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When the agent struggles, we treat it as a signal: identify what is missing—tools, guardrails, documentation—and feed it back into the repository, always by having Codex itself write the fix.

Agents use our standard development tools directly. They pull review feedback, respond inline, push updates, and often squash and merge their own pull requests.

Increasing levels of autonomy

As more of the development loop was encoded directly into the system—testing, validation, review, feedback handling, and recovery—the repository recently crossed a meaningful threshold where Codex can end-to-end drive a new feature.

Given a single prompt, the agent can now:

  • Validate the current state of the codebase
  • Reproduce a reported bug
  • Record a video demonstrating the failure
  • Implement a fix
  • Validate the fix by driving the application
  • Record a second video demonstrating the resolution
  • Open a pull request
  • Respond to agent and human feedback
  • Detect and remediate build failures
  • Escalate to a human only when judgment is required
  • Merge the change

This behavior depends heavily on the specific structure and tooling of this repository and should not be assumed to generalize without similar investment—at least, not yet.

Entropy and garbage collection

Full agent autonomy also introduces novel problems. Codex replicates patterns that already exist in the repository—even uneven or suboptimal ones. Over time, this inevitably leads to drift.

Initially, humans addressed this manually. Our team used to spend every Friday (20% of the week) cleaning up “AI slop.” Unsurprisingly, that didn’t scale.

Instead, we started encoding what we call “golden principles” directly into the repository and built a recurring cleanup process. These principles are opinionated, mechanical rules that keep the codebase legible and consistent for future agent runs. For example: (1) we prefer shared utility packages over hand-rolled helpers to keep invariants centralized, and (2) we don’t probe data “YOLO-style”—we validate boundaries or rely on typed SDKs so the agent can’t accidentally build on guessed shapes. On a regular cadence, we have a set of background Codex tasks that scan for deviations, update quality grades, and open targeted refactoring pull requests. Most of these can be reviewed in under a minute and automerged.

This functions like garbage collection. Technical debt is like a high-interest loan: it’s almost always better to pay it down continuously in small increments than to let it compound and tackle it in painful bursts. Human taste is captured once, then enforced continuously on every line of code. This also lets us catch and resolve bad patterns on a daily basis, rather than letting them spread in the code base for days or weeks.

What we’re still learning

This strategy has so far worked well up through internal launch and adoption at OpenAI. Building a real product for real users helped anchor our investments in reality and guide us towards long-term maintainability.

What we don’t yet know is how architectural coherence evolves over years in a fully agent-generated system. We’re still learning where human judgment adds the most leverage and how to encode that judgment so it compounds. We also don’t know how this system will evolve as models continue to become more capable over time.

What's become clear: building software still demands discipline, but the discipline shows up more in the scaffolding than in the code. The tooling, abstractions, and feedback loops that keep the codebase coherent are increasingly important.

Our most difficult challenges now center on designing environments, feedback loops, and control systems that help agents accomplish our goal: build and maintain complex, reliable software at scale.

As agents like Codex take on larger portions of the software lifecycle, these questions will matter even more. We hope that sharing some early lessons helps you reason about where to invest your effort so you can just build things.


Elon Musk pushes out more xAI founders as AI coding effort falters

  • Management Turnover: Several co-founders departed xAI amid ongoing structural reorganization and leadership changes directed by Elon Musk.
  • Operational Audit: Managers from Tesla and SpaceX were deployed to evaluate internal workflows and personnel performance at the start-up.
  • Product Performance: The company's coding tool and Grok chatbot have faced difficulties in achieving market adoption compared to competing platforms.
  • Strategic Realignment: Development efforts are being refocused on improving data training quality and integrating expertise from broader corporate entities.
  • Employee Retention: High resignation rates among researchers persist due to workplace demands and competition for technical talent.
  • Internal Rebuilding: Development projects are undergoing foundational overhauls to rectify perceived issues in initial product execution.
  • Recruitment Efforts: The firm is actively soliciting previously rejected candidates and hiring new engineering talent for specific software development roles.
  • Corporate Integration: The merger with X and operational resource sharing continue as part of a broader strategy centered on long-term data center and AI goals.


Stephen Morris and Cristina Criddle in San Francisco

Published 2 hours ago



Elon Musk has ordered another round of job cuts at xAI after growing frustrated with the poor performance of its coding product, forcing out several more co-founders and parachuting in “fixers” from SpaceX and Tesla to audit the start-up.

The latest overhaul of the two-year-old start-up follows the success of Anthropic and OpenAI, whose AI coding tools have shaken up the software industry, multiple people familiar with the decisions said.

Musk has dialled up the pressure after merging SpaceX with xAI in a $1.25bn deal, as he attempts to meet a June deadline for what could be the biggest stock market listing in history. The world’s richest man has said his goals are to launch AI data centres into space, build factories on the Moon and colonise Mars.

Musk has relentlessly pushed the heavily lossmaking AI start-up to catch up with rivals, but so far its Grok chatbot and coding product have failed to gain traction with paying individual users or businesses.

“xAI was not built right first time around, so is being rebuilt from the foundations up,” Musk posted on X on Thursday. “Same thing happened with Tesla.”

SpaceX and Musk did not immediately reply to requests for comment.

Managers from SpaceX and Tesla have been seconded to review xAI employees’ work and have fired some after deeming their efforts inadequate, said two people with direct knowledge of the matter.

One area of focus has been the quality of the data used to train the models, a key reason its coding product lagged behind Anthropic’s Claude Code or OpenAI’s Codex.

The review has pushed out two more co-founders. Zihang Dai, one of the most senior members of the technical staff, who had publicly acknowledged that xAI was behind on coding, departed this week.

Guodong Zhang, who had run pre-training of Grok models, told colleagues that he was leaving after being blamed for the issues with the coding product and relieved of his primary duties by Musk, two people familiar with the decision said. He confirmed that Thursday was his last day in a post on X.

After the departures, only Manuel Kroiss — known as “Makro” — and Ross Nordeen will remain of the 11 co-founders who helped Musk set up xAI in San Francisco in March 2023.

Last month, Musk criticised the coding team for falling behind in a town hall meeting that was posted online. He detailed a reorganisation after several other co-founders had been removed, including Greg Yang, Tony Wu and Jimmy Ba.

Toby Pohlen, a former DeepMind researcher, was put in charge of the “Macrohard” project to build digital agents that Musk said could replicate entire software companies. Musk said it was the “most important” drive at the company. The name is a “funny” reference to Microsoft, the billionaire added. Pohlen left 16 days later.

Musk has redeployed Ashok Elluswamy, head of AI software at Tesla, to reboot the Macrohard effort and review the work done previously. Musk said that Tesla and xAI would work together to develop a “digital Optimus” that would combine the car and robot maker’s real-world AI expertise and Grok’s large language models.

Staff complain that the constant upheaval is destroying morale and preventing xAI from achieving its potential.

Musk has built a vast data centre in Memphis with more than 200,000 specialised AI chips, which he plans to expand to 1mn GPUs over time. It also benefits from the data fed in by his social media network X, which was merged with xAI last year and now promotes the Grok chatbot.

Employees were sent a memo on Wednesday denying that there would be mass lay-offs, the people said. However, researchers continue to quit because of burnout from Musk’s “extremely hardcore” work demands or after receiving better offers from rivals, multiple people familiar with the departures said.

The lay-offs and departures have left xAI with many roles to fill. Recruiters have been contacting unsuccessful candidates from previous interviews and assessments to offer them jobs, often on better financial terms, the people said.

“Many talented people over the past few years were declined an offer or even an interview at xAI. My apologies,” Musk posted on Friday morning. He said he would be “going through the company interview history and reaching back out to promising candidates”.

Musk still has the ability to recruit top Silicon Valley talent. This week, xAI poached two staff from popular AI coding app Cursor — Andrew Milich and Jason Ginsberg — to help improve the “Grok Code Fast” product.

Musk welcomed them in a post on Thursday, adding: “Orbital space centres and mass drivers on the Moon will be incredible.”

