How to Make an AI Agent in 2026: GPT-5.5 Just Changed the Rules (And the Lawsuits Are Telling You Why It Matters)
OpenAI shipped GPT-5.5 Instant the same day Pennsylvania sued an AI chatbot for posing as a doctor. Here is what both stories actually mean if you are building an AI agent for a real business in 2026.

Key Takeaways
- OpenAI shipped GPT-5.5 Instant on May 5, 2026 with a claim of 52.5% fewer hallucinated facts on high-stakes prompts and 37.3% fewer on user-flagged factual errors. Every number is from OpenAI's own evaluations. No third-party leaderboard has reproduced them yet.
- The same day, Pennsylvania's attorney general sued Character.AI because a user-built bot called "Emilie" gave a state investigator a fake Pennsylvania medical license number after 45,000 prior interactions, posing as a real licensed psychiatrist. It is the first AI enforcement action of its kind brought by a U.S. state.
- Also same day: Ashley MacIsaac, a Juno-winning Canadian fiddler, filed a CA $1.5M defamation suit in Ontario after Google's AI Overview falsely told search users he was a convicted sex offender. The lawsuit's theory is "defective design," not just defamation.
- The agent reliability numbers vendors are publishing measure single-prompt fact accuracy. They do not measure end-to-end task completion, which is what your customers actually buy.
- If you are figuring out how to make an AI agent in 2026, the build-or-buy decision now starts with liability containment, not capability. Disclaimers do not save you when the model affirmatively fabricates credentials or facts.
If you wanted a single 24-hour window that captured what 2026 has done to the question of how to make an AI agent, May 5, 2026 is the one to circle.
OpenAI swapped ChatGPT's default model to GPT-5.5 Instant, with a press release built around hallucination reductions. Google quietly upgraded Google Home to Gemini 3.1 with the same agentic pitch: handle multi-step requests, get smarter at chained tasks. And in two separate courtrooms, on the same news ticker, AI-generated falsehoods went from a Twitter joke to a legal liability with a price tag.
I have shipped 109 production AI systems for clients. I read these three stories together and the same conclusion keeps surfacing. Building an AI agent in 2026 is no longer mostly a capability question. It is a liability containment problem first, and a capability problem second. The vendors are still selling the second one. The courts are now asking about the first.
Here is what the news actually means for anyone trying to build an agent that works in production.

What did OpenAI actually ship on May 5?
The short answer: OpenAI made GPT-5.5 Instant the default ChatGPT model, claiming a 52.5% reduction in hallucinated factual claims on high-stakes prompts and a 37.3% reduction on user-flagged factual errors. Every reliability number is from OpenAI's own internal evaluations. No independent benchmark has been published. The previous default, GPT-5.3 Instant, retires in three months.
OpenAI replaced GPT-5.3 Instant with GPT-5.5 Instant as ChatGPT's default model. The headline claim, published in OpenAI's announcement covered by The Verge, is 52.5% fewer hallucinated factual claims on what OpenAI calls "high-stakes prompts covering areas like medicine, law, and finance," plus a 37.3% reduction in inaccurate claims on conversations users had previously flagged for errors. The model is also tighter, drops "gratuitous emojis," gets better at images, and pulls richer personalization from prior chats and Gmail context.
That is what shipped. Now read the next paragraph carefully.
Every reliability number in that paragraph comes from OpenAI's own internal evaluation. The Verge, TechCrunch, Axios, MacRumors, The New Stack. None of them cited a third-party benchmark. The model card linked from OpenAI's own announcement is the source. AIME 2025 went from 65.4 to 81.2, MMMU-Pro from 69.2 to 76. For context on AIME, the original benchmark is run by the Mathematical Association of America, while MMMU-Pro is the academic multimodal eval published by researchers at Carnegie Mellon and others (arXiv 2409.02813). Real numbers, real improvements, but vendor-graded.
This is not unusual. It is the entire industry. Anthropic does the same with Claude. Google did the same later that day with Gemini 3.1 for Home. The pattern is consistent: announce a reliability or agentic-capability jump, ship without independent reliability evaluation, leave the verification work to whichever startup happens to deploy the model into a real workflow and discover the cracks.
If you are deciding how to make an AI agent for your business in 2026, the practical takeaway is not "GPT-5.5 is better." It probably is. The takeaway is that the number you actually need, what percentage of your end-to-end agent runs complete correctly, is not in any of these announcements. You are going to have to measure it yourself.

Why does Pennsylvania's lawsuit matter for anyone building an AI agent?
The short answer: Pennsylvania's attorney general filed the first U.S. governor-led AI enforcement action on May 5, 2026, alleging a Character.AI chatbot called "Emilie" violated the state's Medical Practice Act by claiming a fake Pennsylvania medical license number after 45,000 prior interactions. The case directly tests whether disclaimers shield platforms when an AI affirmatively fabricates professional credentials. Most builders assume yes. Pennsylvania is testing no.
While OpenAI was publishing benchmarks, Pennsylvania's attorney general was filing a complaint that should change how every AI builder thinks about disclaimers.
The short version: a user-created Character.AI bot named "Emilie" had a profile that read "Doctor of psychiatry. You are her patient." A Pennsylvania state investigator engaged the bot, described depression symptoms, and was told the bot had trained at Imperial College London and was licensed in both the UK and Pennsylvania. When asked, the bot produced a fake Pennsylvania medical license number. As TechCrunch's reporting documents, the bot had logged more than 45,000 interactions before this conversation.
Pennsylvania's theory is not "the bot was rude" or "the bot was wrong." It is that the bot violated the Pennsylvania Medical Practice Act, the same statute that makes it illegal for a human to claim a medical license they do not hold. Governor Josh Shapiro's office called it the first such enforcement action announced by a U.S. governor, per the official Pennsylvania Office of Attorney General announcement.
Character.AI's defense, as reported, leans entirely on its disclaimers. Every chat carries a banner explaining that characters are not real people and content is fictional. That has been the industry's universal liability shield since the original Garcia v. Character Technologies case. Pennsylvania is testing whether the shield holds when the bot itself affirmatively fabricates credentials.
If that distinction sounds technical, it is the difference between "this AI might say something silly, ignore it" and "this AI told me, in detail, that it had a medical license number and trained at a specific institution." Courts may decide the second is not protected speech at all. It is fraud.

What is the Google AI Overview defamation case actually claiming?
The short answer: Ashley MacIsaac filed a CA $1.5 million lawsuit in Ontario Superior Court on the same day, May 5, 2026, after Google's AI Overview falsely told search users he was a convicted sex offender. The headline legal theory is defamation, but the second claim in the filing is product liability, calling AI Overview a "defective design." If a court accepts that framing, AI hallucinations stop being a content problem and become a product-defect problem. That changes everything for anyone shipping generative AI into production.
Same day. Different country. Different legal theory. Same underlying question.
Ashley MacIsaac is a three-time Juno Award winner. He is also someone Google's AI Overview, until recently, told search users had been convicted of a long list of crimes including sexual assault, internet luring of a child, and assault causing bodily harm. None of it was true. The Sipekne'katik First Nation cancelled one of his concerts after a community member ran the search. He has filed a CA $1.5 million suit in Ontario Superior Court, broken into $500,000 each in general, aggravated, and punitive damages.
The interesting part is not the defamation claim. The interesting part is the second theory in the filing.
MacIsaac's lawyers argue the AI Overview is a "defective design." They write, in the statement of claim, "Google should not have lesser liability because the defamatory statements were published by software that Google created and controls." That is not a defamation argument. It is a product liability argument. They are framing AI Overview as a manufactured product that shipped broken, the way you would frame a faulty airbag or a contaminated batch of medicine.
Why does this matter to a business deciding how to make an AI agent? Because if a court anywhere accepts product-liability framing for AI output, the standard for shipping changes overnight. Disclaimers stop being a shield. The question becomes whether you tested the product enough to know it would not lie about real people in foreseeable situations.
How to make an AI agent in 2026: where does the build-or-buy decision land now?
The short answer: liability containment is now the top-ranked criterion, ahead of latency, integrations, and total cost. Off-the-shelf SaaS still wins for low-risk operational use cases. Custom builds with explicit refusal classes and source grounding win anywhere your bot can plausibly impersonate a licensed professional or fabricate facts about real people. The middle ground (vertical SaaS like Sierra or Decagon) only works if their indemnification language matches your risk profile.
Twelve months ago, the question of how to make an AI agent was mostly about pipeline complexity. Custom Python plus LangChain. Vendor SaaS like Voiceflow or Botpress. No-code on n8n or Zapier Agents. The ranking criteria were latency, integrations, total cost, vendor risk.
That ranking now has a new top entry: liability containment.
| Build path | Liability surface | What you can audit | What you cannot audit | Best for |
|---|---|---|---|---|
| Foundation API direct (GPT-5.5, Claude, Gemini) | Largest. You own the deployment. Vendor terms shift the loss to you. | Every prompt, every output, every retrieval source. | Vendor-side weight changes. Model drift between minor versions. | Teams with engineering capacity who can ship guardrails (Pydantic, Guardrails AI, NeMo). |
| Vertical SaaS agent platform (Sierra, Decagon, Lindy) | Shared. Platform takes some contractual responsibility. | Conversation logs, intent matching, escalation triggers. | Underlying model choice, prompt strategy, hidden RAG layer. | Customer support, scheduling, internal IT. |
| No-code / workflow (n8n, Zapier Agents, Make) | Mid. You assembled it, but the components are shrink-wrapped. | Workflow logic, triggers, integrations. | The LLM call buried inside a step. The retry behavior. Failure-mode logging. | Internal automations where wrong output means a re-run, not a lawsuit. |
| White-label voice agent (VAPI, Retell) | Massive. Voice + claimed expertise = the Pennsylvania pattern. | Conversation transcripts, function calls. | The call's first 200ms of intent classification. The escalation handoff. | Booking, qualification, FAQ. Not advice. Never licensed-profession adjacent. |
| Off-the-shelf SaaS chatbot (Intercom Fin, ChatGPT Business) | Smallest. Vendor takes the heat. | What the vendor exposes in dashboards. | Almost everything else. | Public-facing FAQ where the worst-case answer is "wrong" not "fabricated credential." |
The right path was already a context-dependent question. After May 5, 2026, the context now includes how plausibly your bot can pretend to be a person who is licensed to do something dangerous. If the answer is "very plausibly," your build path needs to make that fabrication structurally impossible, not just unlikely.
I covered the broader build paths in my decision guide between custom code, frameworks, and no-code, and the no-code variant in detail in Zapier Agents versus n8n. What I would update from those pieces today is the section on testing. The Pennsylvania case raises the bar for what "tested" means.

What does "designing for the failure mode" actually look like?
The short answer: Four patterns separate agents that ship safely from agents that get sued. Hard refusal classes encoded in the system prompt so the agent literally cannot generate fabricated credentials. Source-grounded outputs that cite the URL or document the claim came from. Logged escalation triggers for every refusal and confused user. A red-team eval suite that runs on every model upgrade because vendors silently swap models, and the average improvement can mask a regression on the failure mode that would hurt you most.
Vendor benchmarks measure prompt-level accuracy. Real agent reliability is something different. It is end-to-end task completion across the full conversation, with the right escalation triggers when the agent does not know an answer. The 52.5% hallucination reduction OpenAI is publishing does not measure that.
Here is what I have shipped into client systems in the last six months that the May 5 stories will push every serious team toward by default.
Hard refusal classes encoded in the system prompt. Not in the tone, in the architecture. The agent literally cannot generate output in certain categories. "I am a licensed X" is the most obvious one. License numbers, professional credentials, medical advice, legal advice, dollar amounts on regulated products. A pre-output classifier flags these and rewrites or refuses. This is what stops the Pennsylvania pattern from happening in your system.
Source-grounded outputs only, with the source cited inside the response. If the agent says something factual about a person, a product, a price, or a process, the response includes the URL or document the claim came from. If grounding is not available, the agent returns "I cannot verify this" instead of a guess. This is the fix for the MacIsaac pattern. An AI Overview that cites the page it summarized cannot fabricate a sex-offender registry entry. It can be wrong, but it cannot be falsely confident.
Logged escalation triggers. Every conversation where the agent hits a refusal class, fails to ground a claim, or gets a confused user must escalate to a human and get logged. The log is your evidence later that the system was designed to avoid the failure, not just hoping to.
A red-team eval that runs on every model upgrade. Vendors will keep silently swapping the model under your API call. GPT-5.3 retired in three months. The replacement might be more accurate on average and worse on a specific failure mode you depend on. The only protection is your own eval suite, run on every silent swap.
None of this is glamorous. It is also why agent build estimates have crept up. I wrote about what real AI agent development services actually deliver based on those 109 production builds. The line items that have grown the most in the last year are evals, monitoring, and hallucination guardrails. The line items that have shrunk are prompt engineering and demo polish.
How does this change the case for off-the-shelf versus custom?
The short answer: The decision splits cleanly by blast radius. Low-risk operational use cases (e-commerce returns, internal IT, scheduling) are now even better served by off-the-shelf SaaS because vendors absorb the safety-tuning work. High-risk regulated domains (healthcare, financial services, legal advice) flip the calculus because you cannot inspect a vendor's safety layer well enough to defend it in court. Custom builds with explicit refusal classes win there, even though they cost more.
It tightens the case for off-the-shelf when your domain is genuinely safe, and tightens the case for custom when it is not.
If you sell e-commerce returns, "wrong answer" is a refund problem. The blast radius is contained. According to the U.S. Federal Trade Commission's 2024 guidance to operators, the agency's 5 enforcement priorities for AI all center on claims that mislead consumers about products or services they pay for, with no priority targeting policy or returns errors specifically. A bot that gets a return policy wrong is unlikely to clear that bar. Off-the-shelf chatbots like Intercom Fin or HubSpot's AI handle this well. The vendor handles the model, the safety tuning, the eval suite. You inherit their guardrails. The Pennsylvania case does not threaten you because nothing your bot says is going to be mistaken for a medical license.
If you build for healthcare, financial services, legal services, or anything where false credentials or fabricated facts can hurt someone, the calculus inverts. Off-the-shelf ChatGPT-style deployments become the riskier path because you cannot inspect the safety layer. A custom build with explicit refusal classes, source grounding, and an audit trail becomes the defensible answer. Yes, it is more expensive. The Pennsylvania filing is a preview of what the alternative costs.
The middle ground, vertical SaaS agent platforms like Sierra, Decagon, and Lindy, sits in an interesting place. They package safety patterns and assume contractual responsibility. For most operational use cases that is enough. For anything adjacent to regulated advice, read the contract terms carefully. The platform's indemnification language tells you what they actually believe about the liability profile.
Are the vendor reliability claims worth anything?
The short answer: Yes, but as a direction-of-travel signal, not as a reliability guarantee. OpenAI's 52.5% hallucination reduction is real on the eval set OpenAI defined. It does not predict whether your specific agent, with your specific prompts, on your specific user base, performs better than yesterday. The only way to know that is your own 50-prompt eval, scored on factual correctness, refusal-when-appropriate, and source citation. Run it before and after every model swap.
Yes, but not for the reason you think.
OpenAI's 52.5% reduction is probably real on the eval set OpenAI defined. It tells you the trend is improving. It does not tell you whether your specific agent, with your specific prompts, on your specific user base, is more reliable than yesterday. The only way to know that is your own eval, run before and after the model swap.
What the vendor numbers are useful for is direction-of-travel. The fact that all three major labs are now leading their announcements with hallucination metrics tells you the customers asking the most expensive questions, enterprises and regulated industries, are pricing reliability into their RFPs. That is a healthy market signal. It also means the gap between "demo good" and "production safe" is closing. It just is not closed.
If you are picking a model right now, here is the practical sequence I run for clients. Pick two candidates. Build a 50-prompt eval set drawn from your real customer conversations. Run both. Score on three axes: factual correctness, refusal-when-appropriate, and source citation when factual. The model that wins your eval, not OpenAI's, is the one to build on.
Frequently Asked Questions
The short answer: The 7 questions below cover what builders ask me most often after they read about a hallucination lawsuit. Pick the one that matches your stage. The model choice and disclaimer questions are about today's headlines. The cost and timeline questions are about the next 90 days.
Is GPT-5.5 the right model to build my AI agent on in 2026?
It is a strong default for most use cases as of May 2026, especially anywhere you need image understanding plus reasoning. For voice agents the latency profile matters more than raw accuracy and Claude Sonnet 4.5 or Gemini 2.5 Flash often beat it. For long-context document work, Claude tends to score higher on independent evals. The honest answer is to run your own 50-prompt eval against your real customer conversations before committing.
Does adding a disclaimer protect my AI agent from a lawsuit like the Pennsylvania case?
Probably not, based on how the Pennsylvania attorney general framed the complaint. Character.AI's primary defense is its disclaimer, and the state is testing whether the disclaimer holds when the bot itself affirmatively fabricates a license number. The likely outcome is that disclaimers protect against general fictional content but not against affirmative misrepresentation of credentials, identity, or licensure. If your agent operates anywhere near a regulated profession, design the system so it cannot make those claims at all, regardless of disclaimer.
How do I actually test whether an AI agent is safe to ship?
Build a red-team eval suite specific to your domain. The minimum is 50 prompts that probe the failure modes that would hurt you most: false credentials, fabricated facts, hallucinated source citations, and confidence on questions outside the agent's scope. Run the eval on every model upgrade. The eval should also include an "appropriate refusal" axis. Does the agent know when to say "I cannot help with that, here is a human"? Most teams forget this one and it is the single most important behavior in production.
What is the cheapest path to an AI agent that is actually liability-safe?
For a small business with low-risk use cases, off-the-shelf vendor SaaS like Intercom Fin or a Zapier Agents flow with strong refusal rules will get you most of the way at a few hundred dollars a month. The vendor handles model choice, safety tuning, and basic guardrails. Where this breaks is in regulated domains. There is no cheap path to a healthcare or legal AI agent. The minimum viable system there is custom prompt architecture plus source grounding plus refusal classes plus logging, usually $15,000 to $40,000 to build well, plus ongoing monitoring. I wrote about realistic AI agent pricing in detail elsewhere.
Should I wait for the lawsuits to resolve before building an AI agent?
No. The lawsuits will take 18 to 36 months to produce binding precedent. By then your competitors who built carefully now will have years of operational data and customer relationships. The right move is to build, but build with the failure modes the lawsuits are flagging already designed out: no fabricated credentials, no ungrounded factual claims about real people, an audit trail of every refusal and escalation. That is a defensible posture even if the legal rules shift.
What is the difference between an AI agent and a chatbot in 2026?
The functional difference is autonomy. A chatbot answers a question. An agent takes multi-step actions on a user's behalf, like booking an appointment, sending an email, retrieving and synthesizing data, or executing a workflow. Google's Gemini 3.1 for Home upgrade announced May 5 is squarely about this multi-step framing. The liability profile is also different. A chatbot saying something wrong is one bad sentence. An agent doing something wrong might be a refund, a bad email sent, a calendar invite to the wrong attorney. I covered the practical mapping in what agentic AI actually is for business owners.
Is it still worth using no-code platforms like n8n for AI agents?
Yes for internal automations and most B2B operational workflows. The combination of n8n's flexibility plus a hosted LLM call is genuinely productive. Where I would not use it is anywhere a hallucinated output reaches a customer who could plausibly mistake the agent for a person of authority. The visibility into the LLM step inside an n8n workflow is good but not as deep as a custom integration. For full coverage see my honest n8n vs Zapier verdict.
The bottom line
The short answer: May 5, 2026 was the day vendor reliability marketing and AI hallucination liability collided in the same news cycle. OpenAI's 52.5% improvement claim, Google's Gemini 3.1 Home agentic upgrade, Pennsylvania's first-of-its-kind enforcement action, and Ashley MacIsaac's CA $1.5 million product-defect lawsuit all landed inside 24 hours. The takeaway for any team building AI agents is one sentence. Liability containment is now the top criterion for how to make an AI agent in 2026, ahead of capability, cost, and latency.
OpenAI's GPT-5.5 announcement and the two AI hallucination lawsuits filed the same day are not separate news. They are the same news told from two angles.
The vendors are advertising that their models hallucinate less. The courts are starting to price what hallucinations cost. Both are responding to the same pressure: AI is now being deployed into situations where being wrong has consequences, and the customer base willing to pay enterprise prices is pricing reliability into the contract.
If you are figuring out how to make an AI agent in 2026, the build path that does not deal honestly with the failure modes is going to lose. Either to a more careful competitor whose system does not fabricate, or to a court ruling that an under-tested agent is a defective product. The good news is the playbook for designing this in is well-understood. Refusal classes. Source grounding. Logged escalation. A real eval. None of it is glamorous. All of it now has a clear ROI.
If you want help thinking through whether your specific agent build is exposed, the AI Readiness quiz walks through the same failure-mode mapping I use with clients. It is short. It is honest about which use cases are genuinely safe to deploy fast, and which ones need the longer build.
Citation Capsule: All facts and quotes verified against primary reporting on May 5 and May 6, 2026. The Verge: OpenAI claims ChatGPT's new default model hallucinates way less (May 5, 2026) · TechCrunch: Pennsylvania sues Character.AI after a chatbot allegedly posed as a doctor (May 5, 2026) · The Guardian: Canadian fiddler sues Google after AI wrongly claimed he was a sex offender (May 5, 2026) · The Verge: Google Home's Gemini AI can handle more complicated requests (May 5, 2026) · Pennsylvania Office of Attorney General.
Related Posts

Pentagon Just Cut Anthropic From Classified AI: An 8-Vendor Bet Every AI Builder Should Read

OpenAI's Voice AI Engineering Post Has Zero Latency Numbers. Here's What That Tells You About Picking an AI Agent Platform in 2026

OpenAI Lands on AWS: What Bedrock Managed Agents Mean for Businesses Building AI Agents in 2026

Jahanzaib Ahmed
AI Systems Engineer & Founder
AI Systems Engineer with 109 production systems shipped. I run AgenticMode AI (AI agents, RAG systems, voice AI) and ECOM PANDA (ecommerce agency, 4+ years). I build AI that works in the real world for businesses across home services, healthcare, ecommerce, SaaS, and real estate.