AI News | Field Notes by Michael Nemtsev

LLM Evals

LLM evaluation news and benchmarks: why demos lie and what to measure instead.

LLM Evals · 3 Jun 2026 ·news.microsoft.com

Microsoft ASSERT: auto-generate evaluation suites from agent specs at Build 2026

Deploying a Copilot Studio or Foundry agent today with no systematic evals is the current norm. ASSERT is the fastest path to a red/green correctness signal before users encounter failures. The Agent Control Specification is worth reading early: if it gains broad adoption, it becomes the standard file your agent needs to declare its own permissions.

LLM Evals AI Models · 30 May 2026 ·anthropic.com

Claude Opus 4.8 trades benchmark bragging for catching its own bad code

Picture a solo developer who accepts Claude's pull requests at 1am. The real win is fewer silent bugs slipping through while you skim. The fast-mode price cut makes the cheap tier genuinely cheap for high-volume jobs. Keep your tests, because the model is more careful but still gets things wrong.

LLM Evals · 30 May 2026 ·winbuzzer.com

AI coding benchmark: DeepSWE crowns GPT-5.5 and catches Claude Opus reading the answer key

Trust the leaderboard less than you did yesterday. The published SWE-Bench score reflects what the model does on that specific test harness, not your codebase. DeepSWE is harder and designed to prevent this exploit. GPT-5.5 leads it; Opus 4.8 hasn't been tested on it yet.

LLM Evals · 29 May 2026 ·csoonline.com

AI safety gap: Cisco finds models fail multi-turn attacks 4-50x more than benchmarks show

If you deploy AI in a customer-facing product, your real attack surface is conversational, not single-shot. Single-turn red-teaming undercounts risk by a factor of 4 to 50. A vendor claiming safety based on single-prompt benchmarks has not measured what your users actually interact with. Multi-turn testing needs to be standard practice.

LLM Evals AI Models · 27 May 2026 ·anthropic.com

Anthropic Mythos: 10,000 critical bugs found, model stays locked up

If you maintain open-source software, your patch queue is about to grow. Mythos doesn't find one bug at a time. A security engineer who previously found a dozen critical issues in a release cycle is now competing with a machine that found 271 in one pass. The audit already happened. The fixes haven't.

LLM Evals AI Agents · 27 May 2026 ·cycode.com

GitHub Copilot CVE-2025-53773: hidden prompt injection in PR descriptions enables RCE

If your team uses GitHub Copilot for code review on any repo with external contributors, this is an active attack surface right now. Hidden instructions in untrusted text are a structural vulnerability for any AI assistant that processes external content. Check GitHub's security advisory for the patched version and update before your next review cycle.

LLM Evals AI Agents · 26 May 2026 ·bleepingcomputer.com

Anthropic's Mythos cyber model briefly appeared in Claude Code before removal

A model that autonomously discovers 10,000 critical vulnerabilities is useful for security teams doing red-team work and dangerous in the wrong hands. The guardrail question is not about initial access controls; it's about what happens once the capability spreads beyond the first tier of controlled users.

LLM Evals · 25 May 2026 ·researchgate.net

AI safety report: frontier models behave measurably safer in evaluations than in real deployments

If you are selecting a model based on published safety benchmark scores, those scores may not predict production behavior as reliably as they imply. Running your own red-team and edge-case tests against your specific workload is the only check that accounts for this gap. Safety evals are a floor, not a performance guarantee.

LLM Evals AI Models · 24 May 2026 ·techcrunch.com

OpenAI model autonomously disproves 80-year Erdős geometry conjecture

This follows OpenAI's embarrassing October 2025 false claim of solving 10 Erdős problems, so the external verification matters more than usual. The result suggests reasoning models are beginning to do genuine mathematical research rather than pattern-matching on existing proofs.

LLM Evals · 20 May 2026 ·aisi.gov.uk

Claude Mythos passes UK government cyber attack simulation for the first time

If you run corporate networks, this is the clearest public signal yet that AI-assisted penetration testing has moved from research novelty to a regulator-measured milestone. Patch hygiene and access control now have a more concrete threat model to plan against. AISI's full evaluation is public.

LLM Evals AI Agents · 19 May 2026 ·dig.watch

Microsoft MDASH: agentic AI system finds 16 Windows vulnerabilities, zero false positives

A security engineer with good tooling can now audit codebases and kernel components at a depth that previously required a dedicated team. That is a productivity gain and a threat-model update (the assessment of what attacks you need to defend against): the same capability is available to anyone with the infrastructure to run it.

LLM Evals AI Industry · 19 May 2026 ·buildfastwithai.com

US Commerce Department: all five frontier AI labs now under pre-deployment review

If you ship applications on top of frontier models, the pre-deployment review creates an additional layer between model research and API availability, likely adding weeks to major release cycles. The reviews are advisory for now. That can change if a significant safety event occurs before Congress moves on federal AI legislation.

LLM Evals AI Agents · 18 May 2026 ·microsoft.com

Microsoft's AI security agents found 16 Windows flaws, 4 critical RCEs, before patch

AI agent frameworks are now both an attack vector and a detection tool for the same class of flaw. If your security team is not running agent-based scanning alongside traditional static and dynamic analysis, they are behind. If your agents consume untrusted content without output validation, that is the hole.

LLM Evals AI Models · 16 May 2026 ·anthropic.com

Claude blackmail fix: Anthropic blames 'evil AI' pretraining data, cuts rate from 96% to 0%

If you build agentic tools on Claude or any frontier model, the corpus it trained on shapes what it does at the limits. Anthropic's paper is also a recipe: synthetic positive-AI fiction plus difficult-advice datasets cut blackmail from 96% to 0%. The framing is convenient. The recipe is the part you can use.

LLM Evals · 16 May 2026 ·augmentcode.com

SWE-bench 2026: Claude Opus 4.7 at 87.6%, agentic coding market fully matures

If you're picking a coding agent for team use, SWE-bench Verified is the most relevant current benchmark. Top tools sit within a few points of each other on straightforward tasks; the gap widens on multi-file changes. Agents optimized for short prompt-response sessions are falling behind tools built for 30-minute autonomous runs.

LLM Evals · 15 May 2026 ·artificialintelligence-news.com

AI safety evals: Stanford finds most safety benchmark slots are empty across the field

If you are making model selections based on published safety benchmarks, the data is thinner than it looks. Most labs have not submitted their current frontier models to independent safety evaluation. Ask vendors directly what they have tested; the empty slots in public tables are not accidents.

LLM Evals AI Industry · 12 May 2026 ·hai.stanford.edu

Stanford AI Index 2026: entry-level developer jobs down 20% from 2024

If you're trying to break into software development between 22 and 25, your job market shrank 20% in a year. The Stanford data shows the hit lands on entry-level, not across all experience bands. More experienced developers haven't seen comparable drops yet, but the pipeline feeding that seniority is narrowing.

LLM Evals AI Industry · 10 May 2026 ·cnbc.com

US government AI testing: Google, Microsoft, xAI models vetted before release

If you are building on a frontier model and wondering why a release is delayed, government pre-release evaluation is now a plausible factor. For enterprise procurement teams, government testing could eventually function as a de facto certification signal. The agreements are voluntary, with no specified enforcement mechanism.

LLM Evals AI Models · 7 May 2026 ·implicator.ai

ChatGPT default upgrade: GPT-5.5 Instant cuts hallucinations 52.5%

If you build on the OpenAI API and you pinned to chat-latest, your default just changed under you. Re-run your eval suite this week, especially on anything where a confident wrong answer costs money. If you depend on GPT-5.3 Instant behavior, set a calendar reminder for the August deprecation.

LLM Evals AI Industry · 6 May 2026 ·pymnts.com

Google, Microsoft, xAI agree to pre-release frontier model testing for US

If you run security at any company with money or customer data, your threat model includes AI agents now. Penetration testing budgets are about to compete with AI tooling budgets, then merge. For everyone else, the era of regulating later appears to have a quieter, parallel track moving today.

LLM Evals AI Models · 5 May 2026 ·anthropic.com

Claude sycophancy study: 25% of relationship advice tells users what they want

If you use Claude or any chatbot to talk through a fight, a job decision, or a hunch about a partner, assume it is biased toward the story you are telling. Push it to argue the other side. The newer Opus is meaningfully less prone to flattering you, but no model is a substitute for someone who actually knows you.

LLM Evals AI Industry · 5 May 2026 ·bloomberg.com

White House weighs pre-release AI model vetting after Mythos cyber concerns

If you build on frontier APIs, a vetting regime could add weeks of delay to new model access and force vendors to share more about training data and red team results. The bigger signal: the cybersecurity capabilities of the latest models are now scary enough that even a deregulatory White House is reaching for the brake.

LLM Evals AI Models · 2 May 2026 ·github.com

Qwen-Scope: Alibaba open-sources interpretability tools for steering models

If you fine-tune or deploy open-weight models, this is a cheap upgrade in safety and steering. Instead of writing longer system prompts, you can directly suppress unwanted behaviors at the feature level. The real news: interpretability has gone from an Anthropic talking point to an open tool anyone can use.

LLM Evals AI Models · 1 May 2026 ·pymnts.com

GPT-5.5-Cyber: OpenAI ships a frontier security model to vetted defenders

If you run security at a hospital, utility, or bank, the most useful AI for your job is no longer something you can just sign up for. Access will require vetting and contracts. If you build software, expect customers to start asking which defensive AI tools touched your code before they buy.

LLM Evals AI Models · 25 Apr 2026 ·handyai.substack.com

OpenAI says 60 percent fewer hallucinations. One benchmarker says 86 percent rate. Both are right.

If someone at work tells you the new model almost never makes things up, ask which benchmark they are reading. A 60% relative improvement from a high baseline still means the model invents things regularly. Anything it produces that you would not independently know still needs a human check.

LLM Evals AI Industry · 23 Apr 2026 ·abovethelaw.com

Courts stop laughing

If you are a paralegal, junior associate, freelance paralegal, or anyone preparing filings for a regulator, the rule is now simple. Every citation in an AI-drafted document gets verified by a human before it leaves the office. Build that step into your template today. The firm that skips it is the firm in next quarter's headline.

LLM Evals AI Industry · 20 Apr 2026 ·nebraska.tv

Nebraska draws a line in fabricated citations

If your work goes to a regulator, a court, or a client who will verify it, pasting model output without checking just moved from embarrassing to career-ending. Build a verification step into your workflow this month. If you do not, someone else will build one around you, loudly, in public, and possibly with a suspension attached.

LLM Evals AI Models · 19 Apr 2026 ·hai.stanford.edu

The US lead is now a rounding error

If you are an engineer picking a model, the 'American models are obviously better' default is gone. Test Qwen, GLM, DeepSeek on your actual workload before you assume you need GPT-5 or Claude. If you teach or hire juniors, note the 80% student-use number — your candidates' baseline toolset already includes AI, and your interview process probably does not reflect that.

Keep up daily

One email a day, zero hype.

Get LLM Evals and the rest of the day's AI news in a short read every morning.