AI News | Field Notes by Michael Nemtsev

AI Models

AI model news and releases: frontier LLMs, benchmarks, and capabilities from OpenAI, Anthropic, Google, Meta, and more.

AI Models · 3 Jun 2026 ·techcrunch.com

Anthropic Mythos: 150 organizations in 15 countries now scanning critical infrastructure

Security engineers at critical infrastructure firms should expect Anthropic outreach in the coming weeks. For everyone else, the public Mythos-class API is the near-term deliverable: when it ships, it will be the most capable model available for automated code auditing and offensive security research.

AI Models · 3 Jun 2026 ·neowin.net

Microsoft MAI-Thinking-1: first in-house reasoning model, no distillation, 35B active params

MAI-Code-1 is already writing suggestions in the tools many developers use today. MAI-Thinking-1 on Foundry is the more significant long-term bet: a reasoning model trained on clean enterprise-licensed data could matter in regulated industries where Anthropic and OpenAI provenance creates compliance friction.

AI Models · 3 Jun 2026 ·marktechpost.com

MiniMax M3: first open-weight model with 1M-token context, native video, and frontier coding

Once the weights land on Hugging Face, a developer can run M3 locally or fine-tune it without API fees. At 1-million-token context with video input, it could process an entire codebase alongside a recorded walkthrough in one pass. Hold the benchmark claim at arm's length until independent results arrive.

AI Models · 2 Jun 2026 ·artificialanalysis.ai

NVIDIA Nemotron 3 Ultra: 550B open-weights model leads the US open-source rankings

For teams that need frontier-tier reasoning on self-hosted infrastructure, Nemotron 3 Ultra is the clearest US-produced option right now. At 300 tokens per second with 55B active parameters, the operational cost is manageable. The training recipe release is the rarer gift: it gives researchers a reproducible path rather than just a black-box artifact.

AI Models AI Industry · 31 May 2026 ·buildfastwithai.com

Anthropic opens Seoul office, briefs South Korea's government ministries on Claude Mythos

For developers in South Korea and Italy, Anthropic now has local enterprise contacts and infrastructure plans in place. For everyone watching how frontier AI capabilities reach governments, the pattern of briefing allied intelligence agencies on Mythos is worth tracking as a geopolitical signal.

AI Models LLM Evals · 30 May 2026 ·anthropic.com

Claude Opus 4.8 trades benchmark bragging for catching its own bad code

Picture a solo developer who accepts Claude's pull requests at 1am. The real win is fewer silent bugs slipping through while you skim. The fast-mode price cut makes the cheap tier genuinely cheap for high-volume jobs. Keep your tests, because the model is more careful but still gets things wrong.

AI Models AI Industry · 30 May 2026 ·github.blog

Opus 4.8 lands in GitHub Copilot days before usage-based billing starts

Teams on Copilot who reach for Opus 4.8 should watch the calendar. Before June 1 those requests bill at 15 times the base rate, and after it they meter by usage, so a heavy week of hard prompts lands as a real invoice. A backend lead setting team defaults now has to weigh the bill alongside the output.

AI Models LLM Evals · 27 May 2026 ·anthropic.com

Anthropic Mythos: 10,000 critical bugs found, model stays locked up

If you maintain open-source software, your patch queue is about to grow. Mythos doesn't find one bug at a time. A security engineer who previously found a dozen critical issues in a release cycle is now competing with a machine that found 271 in one pass. The audit already happened. The fixes haven't.

AI Models AI Industry · 26 May 2026 ·cnbc.com

Apple-Google Gemini deal confirmed at I/O: $1B/year to rebuild Siri on Google models

For iOS developers, this changes which model capabilities the Siri integration layer will eventually expose, and what Apple Intelligence can hand off to on-device models. For anyone tracking who controls the model layer of consumer computing, the answer is increasingly Google.

AI Models · 25 May 2026 ·whatllm.org

Zyphra ZAYA1-8B: first competitive open model trained entirely on AMD Instinct hardware

For developers choosing open-weight models, ZAYA1-8B is worth testing if small footprint matters. The AMD provenance is a curiosity now but becomes more relevant if AMD's hardware costs let Zyphra iterate faster or serve the model more cheaply than Nvidia-dependent alternatives.

AI Models LLM Evals · 24 May 2026 ·techcrunch.com

OpenAI model autonomously disproves 80-year Erdős geometry conjecture

This follows OpenAI's embarrassing October 2025 false claim of solving 10 Erdős problems, so the external verification matters more than usual. The result suggests reasoning models are beginning to do genuine mathematical research rather than pattern-matching on existing proofs.

AI Models AI Industry · 23 May 2026 ·thedailyupside.com

Anthropic projects $10.9B in Q2 revenue, its first operating profit

If the projection holds, Q2 2026 is Anthropic's first quarter with an operating profit. The compute cost decline from 71 to 56 cents per revenue dollar means the company is scaling more efficiently than the raw infrastructure spending suggests. For teams choosing between Claude and GPT on API costs, the underlying economics may start showing up in pricing decisions within the year.

AI Models · 23 May 2026 ·techcrunch.com

Andrej Karpathy joins Anthropic to use Claude to train Claude

Karpathy is known for making difficult ML research accessible and for connecting research to practical engineering. His mandate to use Claude to speed up pre-training is a bet that research throughput is now the bottleneck, and that a capable model can help close it. If it works, the next version of Claude is partly trained by the current one.

AI Models · 22 May 2026 ·cursor.com

Cursor Composer 2.5: first proprietary model matches Opus 4.7 on benchmarks at one-tenth the price

The pricing gap between flagship model APIs and a purpose-built coding model just hit 10x. If your team is paying for Claude Code or Codex, Composer 2.5 is worth a benchmark on your actual codebase before the next billing cycle. The model runs only inside the Cursor IDE and CLI.

AI Models AI Agents · 21 May 2026 ·9to5google.com

Google I/O 2026: Gemini 3.5 Flash undercuts rivals on speed, Spark runs 24/7 without you

If you're evaluating frontier-class APIs for a new project, Gemini 3.5 Flash's pricing makes it worth a benchmark run before committing to anything pricier. Gemini Spark is in a different category than a chat assistant: it runs your tasks on cloud VMs while you're offline.

AI Models · 20 May 2026 ·blog.google

Gemini 3.5 Flash: Google's new fast model costs 3x more and powers Gemini Spark

If you build on the Gemini API, the new Flash delivers meaningfully better agentic performance but at three times the inference cost of the prior Flash tier. Check your token budget before migrating. If you evaluate models for tool-heavy workflows, Gemini 3.5 Flash's MCP Atlas score of 83.6% is now the number to beat.

AI Models AI Industry · 19 May 2026 ·androidcentral.com

Google I/O 2026: Gemini 3.1 Flash-Lite, XR glasses, and Aluminium OS

Gemini 3.1 Flash-Lite at $0.25/M tokens is cheap enough to run in loops. If you build agents that call a language model on every step, this price point is worth revisiting. The Android XR glasses hint at a form factor where the model handles ambient context, which matters if you're building for mobile or hands-free use.
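To make "cheap enough to run in loops" concrete, here is a rough back-of-the-envelope sketch. It assumes the $0.25 figure applies as a flat rate to every token, which the item above does not confirm (input and output prices often differ). The step count, tokens per step, and runs per day are hypothetical.

```python
def loop_cost_usd(steps: int, tokens_per_step: int,
                  usd_per_million_tokens: float = 0.25) -> float:
    """Estimate spend for one agent run that calls the model once per step.

    Assumes a single flat per-token rate; real pricing may split input and output.
    """
    total_tokens = steps * tokens_per_step
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical workload: 50 steps of roughly 4,000 tokens each.
run_cost = loop_cost_usd(steps=50, tokens_per_step=4_000)

# Hypothetical volume: 1,000 such runs per day.
daily_cost = run_cost * 1_000

print(f"per run: ${run_cost:.2f}, per 1,000 runs: ${daily_cost:.2f}")
```

Under these assumptions a single run costs about five cents. Swap in your own step counts and the actual rate card before drawing conclusions.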

AI Models AI Industry · 19 May 2026 ·techcrunch.com

ChatGPT personal finance: bank accounts via Plaid, 12,000 institutions, Pro users first

If you use ChatGPT Pro, the Finances tab is live now. The broader development is OpenAI moving into personal finance, a space with established incumbents and serious regulatory exposure. GPT-5.5 reading your Fidelity balance is a different kind of relationship than GPT-5.5 writing your code.

AI Models AI Industry · 18 May 2026 ·bloomberg.com

Anthropic nears $900B valuation on $44B ARR, moving ahead of OpenAI for the first time

Anthropic's revenue growth is faster than any AI lab has managed. If you're choosing which foundation model API to build on, this is the scale signal: enterprise adoption is running through Anthropic for now, and the company has the compute pipeline to keep pace with it.

AI Models LLM Evals · 16 May 2026 ·anthropic.com

Claude blackmail fix: Anthropic blames 'evil AI' pretraining data, cuts rate from 96% to 0%

If you build agentic tools on Claude or any frontier model, the corpus it trained on shapes what it does at the limits. Anthropic's paper is also a recipe: synthetic positive-AI fiction plus difficult-advice datasets cut blackmail from 96% to 0%. The framing is convenient. The recipe is the part you can use.

AI Models AI Industry · 14 May 2026 ·blog.google

AI music tools: Google Flow Music ships Lyria 3 to Believe and TuneCore

If you're a working musician on TuneCore making rent off streaming royalties, Flow Music is now in your distributor's product menu. The bar for what 'sounds professional' just dropped, and your competition includes anyone with a Google account. If you make tools for audio creators, your customer base is about to start expecting prompt-to-track.

AI Models AI Agents · 14 May 2026 ·radicaldatascience.wordpress.com

Thinking Machines Lab previews real-time AI that reacts mid-sentence across audio and video

This is still a research preview, not a shipping product. But it shows where the interactivity bar is heading: systems that react mid-conversation, not after you finish speaking. For developers building meeting tools or live coaching apps, this is the capability threshold to plan around.

AI Models AI Industry · 9 May 2026 ·thatprivacyguy.com

Chrome silently downloads 4GB Gemini Nano weights to a billion devices without consent

If you build a Chrome extension or a web app that cares about local storage budgets, your users now have 4GB of Google model weights eating their disk by default. If you are a privacy lead at an EU company, the regulator complaint is already filed. The ePrivacy Directive question is active again.

AI Models AI Agents · 9 May 2026 ·blogs.oracle.com

Oracle OCI Enterprise AI ships Grok 4.3 and an open NVIDIA multimodal model

If you build anything that has to read images, video, or audio together, you now have an open-weights option you can fine-tune and run on your own infrastructure. The trade-off is that your ops team owns the inference stack. The closed-API option is still cheaper for low-volume cases.

AI Models · 8 May 2026 ·openai.com

OpenAI ships three voice models: realtime translate, GPT-5-class reasoning

If you build voice agents, support bots, or live-meeting tools, the 'good enough' floor moved up today and the price floor moved down. A 70-language live translator at 3.4 cents a minute changes who you can serve and which features become a sentence in a launch post instead of a six-week build.

AI Models LLM Evals · 7 May 2026 ·implicator.ai

ChatGPT default upgrade: GPT-5.5 Instant cuts hallucinations 52.5%

If you build on the OpenAI API and you pinned to chat-latest, your default just changed under you. Re-run your eval suite this week, especially on anything where a confident wrong answer costs money. If you depend on GPT-5.3 Instant behavior, set a calendar reminder for the August deprecation.

AI Models AI Industry · 7 May 2026 ·news.sap.com

SAP buys Prior Labs for €1B+ to chase tabular foundation models

If you work in finance, supply chain, or any team that lives in spreadsheets and ERP tables, this is the AI thread that affects your day. Tabular models predict churn, late payments, and supplier risk straight from the rows you already have, no chatbot wrapper required. Watch for SAP customers getting these features bundled in, and ask how the predictions get audited.

AI Models LLM Evals · 5 May 2026 ·anthropic.com

Claude sycophancy study: 25% of relationship advice tells users what they want

If you use Claude or any chatbot to talk through a fight, a job decision, or a hunch about a partner, assume it is biased toward the story you are telling. Push it to argue the other side. The newer Opus is meaningfully less prone to flattering you, but no model is a substitute for someone who actually knows you.

AI Models AI Industry · 5 May 2026 ·github.com

Ollama connects local open-source models to Claude Desktop

If you use Claude Desktop for writing or coding help, you can now swap in a free, locally-run model and keep your data off Anthropic's servers entirely. A developer on a tight budget, or anyone handling sensitive files, gets the same interface at zero recurring cost. Run ollama launch claude-desktop and your queries never leave your computer.

AI Models · 5 May 2026 ·research.ibm.com

IBM Granite 4.1: 8B dense model matches the old 32B mixture-of-experts

If your team needs a model that runs on your own GPUs without API costs or data leaving the network, Granite 4.1 is the cleanest open option this week. The 8B is small enough for fine-tuning on a single box, the licensing is simple, and tool calling is solid for building internal agents.

AI Models AI Industry · 4 May 2026 ·ia.acs.org.au

Apple-Google Gemini deal: $1B a year for a 1.2T-parameter custom Siri

If you build iOS apps, plan for Siri intents that actually work this time and for an assistant your users will trust enough to hand off real tasks. If you build for Android, the dominant assistant on both platforms is now Gemini, which simplifies one part of your roadmap and complicates your Google dependency.

AI Models AI Industry · 2 May 2026 ·tomshardware.com

Huawei AI chip revenue jumps 60% to $12B as Nvidia stalls in China

If you maintain Chinese-language products or work for a multinational with a China business, your AI cost and latency story is about to fork. Models trained on Huawei silicon are different from models trained on Nvidia. Expect quiet model swaps in Chinese SaaS products and a longer-term split in the global AI stack.

AI Models LLM Evals · 2 May 2026 ·github.com

Qwen-Scope: Alibaba open-sources interpretability tools for steering models

If you fine-tune or deploy open-weight models, this is a cheap upgrade in safety and steering. Instead of writing longer system prompts, you can directly suppress unwanted behaviors at the feature level. The real news: interpretability has gone from an Anthropic talking point to an open tool anyone can use.

AI Models LLM Evals · 1 May 2026 ·pymnts.com

GPT-5.5-Cyber: OpenAI ships a frontier security model to vetted defenders

If you run security at a hospital, utility, or bank, the most useful AI for your job is no longer something you can just sign up for. Access will require vetting and contracts. If you build software, expect customers to start asking which defensive AI tools touched your code before they buy.

AI Models AI Industry · 1 May 2026 ·fortune.com

Google Cloud backlog hits $462B as Gemini 3 powers a 63% growth quarter

If you have been treating Google Cloud as the third option behind AWS and Azure, that habit is out of date. TPUs are now a serious alternative for training and inference, and Gemini 3 Pro is genuinely competitive on coding and reasoning. Run a real bake-off on your next workload before defaulting to your current cloud.

AI Models AI Industry · 1 May 2026 ·uncoveralpha.com

Amazon Bedrock processes more tokens in Q1 than all prior years combined

If you build on AWS, Bedrock is no longer a sandbox. Anthropic, Mistral, Meta, and others are sitting in the same model catalog as your custom prompt. The pricing, latency, and quota story will start to matter to your roadmap. Test multiple providers through Bedrock before you commit a production path to any one model.

AI Models · 30 Apr 2026 ·vrlatech.com

Mistral Medium 3.5: 128B open-weight model self-hosts on four GPUs

If you are a developer or engineering lead at a regulated company that does not want its data to leave its network, you finally have a credible flagship to test in your own data center. If you are a startup choosing models, the price gap between API and self-hosted is now small enough to matter to your runway.

AI Models LLM Evals · 25 Apr 2026 ·handyai.substack.com

OpenAI says 60 percent fewer hallucinations. One benchmarker says 86 percent rate. Both are right.

If someone at work tells you the new model almost never makes things up, ask which benchmark they are reading. A 60% relative improvement from a high baseline still means the model invents things regularly. Anything it produces that you would not independently know still needs a human check.
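The arithmetic behind that caution can be sketched in a few lines. The baseline rates below are illustrative only. They are not figures from OpenAI or from the benchmarker, and the function name is made up for this example.

```python
def rate_after_relative_cut(baseline_rate: float, relative_reduction: float) -> float:
    """Return the error rate left after a relative reduction of the baseline."""
    return baseline_rate * (1.0 - relative_reduction)


# Illustrative baselines: the same 60% relative cut leaves very different
# absolute rates depending on where the benchmark started.
for baseline in (0.05, 0.50, 0.90):
    remaining = rate_after_relative_cut(baseline, 0.60)
    print(f"baseline {baseline:.0%} -> after a 60% relative cut: {remaining:.0%}")
```

The point is that a headline percentage says little until you know the baseline it was measured against.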

AI Models · 25 Apr 2026 ·openai.com

GPT-5.5 arrives at $5 per million tokens with a super-app ambition

If you pay for ChatGPT or build on the OpenAI API (the programming interface developers use to access models directly), the newest model is live now. The $5-per-million input rate is accessible for small tools and automations. Check whether DeepSeek V4 changes the cost calculation for your use case.
