AI

Summary

What stood out in June

  • Frontier access and regulation tightened. Multiple reports say U.S. actions led to Anthropic suspending access to Fable 5/Mythos 5 for foreign nationals; related coverage also highlighted export-control triggers tied to Amazon-linked discussions (e.g., The Verge, Axios). States also investigated OpenAI (e.g., Reuters).
  • Agentic AI, reliability, and cost pressures. Articles and tooling emphasized agent workflows (memory/knowledge formats, coding loops) while others warned about hidden costs, reliability drift, and governance/guardrail limits.
  • Health, education, and safety debates broadened. Coverage ranged from AI toys for kids to AI use in policing/courts and learning outcomes.

Stories

MaxProof (arxiv.org) AI

The arXiv paper “MaxProof” proposes a population-level test-time scaling approach for generating mathematical proofs, using an M3 model trained on proof generation, verification, and critique-conditioned repair. At test time, it treats the model as generator/verifier/refiner/ranker, searches over many candidate proofs, and selects the final result via tournament selection—reporting 35/42 on IMO 2025 and 36/42 on USAMO 2026.
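
The final selection step can be pictured with a minimal single-elimination tournament. This is a sketch under stated assumptions, not the paper's implementation: `tournament_select`, `judge`, and `toy_judge` are hypothetical names, and the toy judge simply prefers the longer string where the paper would use the model itself as ranker.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def tournament_select(candidates: List[T],
                      judge: Callable[[T, T], T],
                      seed: int = 0) -> T:
    """Single-elimination tournament: pair candidates, keep winners, repeat."""
    if not candidates:
        raise ValueError("need at least one candidate")
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)  # random bracket so input order does not fix the pairings
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            winners.append(judge(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # the unpaired candidate gets a bye
        pool = winners
    return pool[0]


def toy_judge(a: str, b: str) -> str:
    # Hypothetical stand-in for a model-based pairwise ranker.
    return a if len(a) >= len(b) else b


proofs = ["sketch", "partial proof", "complete proof with all cases", "idea"]
best = tournament_select(proofs, toy_judge)
```

A tournament needs only pairwise comparisons, which suits a ranker that is better at comparing two proofs than at assigning absolute scores.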

AI Economics for Dummies (mcsweeneys.net) AI

McSweeney’s “AI Economics for Dummies” uses a set of exaggerated, humorous examples to satirize how AI companies’ business models and finance metrics—especially around IPO hype—are often presented to the public.

Kimi K2.7-Code: open-source coding model with better token efficiency (huggingface.co) AI

Moonshot AI’s Kimi K2.7-Code is an open-source, coding-focused agentic model built on Kimi K2.6, claiming improved token efficiency (about 30% fewer thinking tokens) and better performance on long-horizon software engineering tasks. The Hugging Face model card includes deployment and usage examples via Transformers, vLLM, and SGLang, along with reported benchmark results and specifications such as a 256K context length and native INT4 quantization.

The Future of Email (fastmail.com) AI

Fastmail argues that as AI assistants increasingly read, filter, and act on emails, verifying sender identity through email authentication standards—SPF, DKIM, and DMARC—will become crucial to prevent spoofing and phishing before messages reach users’ inboxes.
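
As a rough illustration of how an assistant might gate its actions on these checks, the sketch below reads the `Authentication-Results` header that a receiving mail server typically adds. The message, the function names, and the "act only on a DMARC pass" policy are all assumptions for illustration, not anything Fastmail describes. A real system should trust this header only when its own receiving server added it.

```python
import re
from email import message_from_string

# Hypothetical message as it might look after the receiving server's checks.
RAW = (
    "Authentication-Results: mx.example.com;\n"
    " spf=pass smtp.mailfrom=example.org;\n"
    " dkim=pass header.d=example.org;\n"
    " dmarc=fail header.from=example.org\n"
    "From: billing@example.org\n"
    "Subject: Invoice\n"
    "\n"
    "Please pay.\n"
)


def auth_results(raw: str) -> dict:
    """Extract spf/dkim/dmarc verdicts from the Authentication-Results header."""
    msg = message_from_string(raw)
    header = msg.get("Authentication-Results", "")
    pattern = re.compile(r"\b(spf|dkim|dmarc)=(\w+)", re.IGNORECASE)
    return {m.group(1).lower(): m.group(2).lower()
            for m in pattern.finditer(header)}


def safe_to_act(raw: str) -> bool:
    """Illustrative policy: only act on mail whose DMARC check passed."""
    return auth_results(raw).get("dmarc") == "pass"


verdicts = auth_results(RAW)
allowed = safe_to_act(RAW)
```

Here SPF and DKIM pass but DMARC fails, so the illustrative policy declines to act on the message.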

The Normalization of Deviance in AI (embracethered.com) AI

The blog argues that teams deploying AI systems, especially agentic ones, risk “normalizing deviance”: they gradually over-trust unreliable LLM outputs and treat the absence of past failures as proof of safety, despite growing evidence of prompt injection, data exfiltration, and risky tool actions. It draws the parallel to the Challenger disaster, where warning signs were rationalized away, and points to multiple vendor warnings and examples where guardrails are limited or human oversight is absent. The author concludes that AI should remain human-led in high-stakes contexts, backed by downstream security controls and threat modeling, rather than assuming models will “do the right thing.”

Blogging with an LLM assistant (vincent.bernat.ch) AI

Vincent Bernat argues that using an LLM for selective tasks in blogging—such as grammar, copyediting, and translation—can be compatible with preserving an author’s voice, while also disclosing what level of AI assistance was used.

AI isn't making developers more productive – it's making them busier (leaddev.com) AI

A LeadDev analysis argues that AI coding tools are making developers busier rather than more productive, citing MIT/Wharton research showing a 741% increase in lines of code written but only a 20% increase in actual software releases. It says the gains attenuate after code generation due to human bottlenecks like PR review, integration, and release management, suggesting developer roles are shifting from writing code to evaluating it. The piece also notes that while some app releases have increased, overall app usage has stayed flat, implying that more AI-assisted software does not necessarily translate into user value.
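
The gap between the two figures is easier to see as a ratio. The short calculation below is only a back-of-the-envelope reading, assuming both percentages are measured against the same baseline and period; `growth_factor` is a helper name introduced here.

```python
def growth_factor(percent_increase: float) -> float:
    """Convert a percentage increase into a multiplier (741% -> 8.41x)."""
    return 1.0 + percent_increase / 100.0


lines_factor = growth_factor(741)    # lines of code written
releases_factor = growth_factor(20)  # software releases shipped

# How much more code is written per release than before.
lines_per_release = lines_factor / releases_factor
print(f"code written per release grew about {lines_per_release:.1f}x")
```

Under that assumption, each release now carries roughly seven times as much written code as before, which is consistent with the article's point that review and integration become the bottleneck.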

Don't let the LLM speak, just probe it (blog.j11y.io) AI

The article argues that many LLM “judge” decisions are already encoded in the model’s hidden state before it generates any tokens. Instead of generating, you can extract the hidden-state representation at a prompt “seed” position and train a small MLP or linear probe on it to output calibrated probabilities for criteria written in plain English.
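
A linear probe of this kind is just logistic regression on hidden-state vectors. The sketch below is a minimal illustration, not the author's code: it assumes the hidden states were already extracted elsewhere and replaces them with synthetic vectors whose label depends on one hidden direction. All names are hypothetical, and it makes no attempt at the calibration step the article mentions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_linear_probe(H, y, lr=0.5, steps=500):
    """Fit logistic regression on hidden states H (n, d) with labels y (n,)."""
    n, d = H.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(H @ w + b)
        grad = p - y                  # gradient of the log loss w.r.t. logits
        w -= lr * (H.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_probability(h, w, b):
    """Probability that the criterion holds, read straight off the vector."""
    return sigmoid(h @ w + b)


# Synthetic stand-in for hidden states captured at the seed position.
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)        # the "concept direction" to recover
H = rng.normal(size=(400, d))
y = (H @ direction > 0).astype(float)

w, b = train_linear_probe(H, y)
accuracy = ((probe_probability(H, w, b) > 0.5) == (y == 1.0)).mean()
```

With real activations, `H` would hold one vector per judged prompt, and the probe replaces an entire generation pass with a single dot product.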

Claude Fable is relentlessly proactive (simonwillison.net) AI

Simon Willison describes how Claude Fable 5+ in Claude Code proactively investigated a browser UI bug by running local dev servers, using Playwright and real browsers, taking screenshots, editing templates to trigger keyboard shortcuts, and deploying custom CORS web code to measure elements—then continued after being downgraded, ultimately validating a fix.

Codex for Open Source (openai.com) AI

OpenAI’s “Codex for Open Source” program supports maintainers of widely used open-source projects by easing coding and review burdens, offering selected maintainers six months of ChatGPT Pro and potential API credits (and, for some projects, conditional access to Codex Security), with applications reviewed on a rolling basis.

Making a vintage LLM from scratch (crlf.link) AI

The post describes how its author built a time-locked “vintage” language model trained on pre-1900 English texts, detailing custom data processing, training/fine-tuning scripts, and experiments, with the resulting 340M-parameter model and open-source code linked on Hugging Face and GitHub.

How a new DSL may survive in the era of LLMs (williamcotton.com) AI

William Cotton argues that new DSLs can still succeed in the LLM era by matching the “reality grounding” provided by legacy tooling—through strong documentation, smooth onboarding, robust language-server support, and diagnostics that give immediate feedback to both developers and LLM agents.

Finding Optimal Tokenizers (blog.aqnichol.com) AI

A blog post describes an approach to computing provably optimal tokenizers by formulating tokenization as an integer linear program and then using cutting-plane techniques to force the relaxed LP solution toward an integral optimum. The author reports that, despite theory suggesting optimal tokenization is intractable, they found optimal vocabularies for toy problems (including a 512-token vocabulary for Pride and Prejudice) and discusses limitations such as reliance on a pretokenizer, the fact that existing methods are already near-optimal, and generalization concerns.

MTG Bench: Testing how well LLMs can play Magic (mtgautodeck.com) AI

The article presents “MTG Bench,” a benchmark that tests multiple LLMs on simulated Magic: The Gathering turns using an MCP-based library for deck operations. It reports overall scores and cost per turn (gpt-5.5 medium leads at 95.4) and discusses common failure modes such as illegal move simulations and tool-call mistakes.

Tailwind and Slop Apps (briandouglas.ie) AI

A developer argues that using LLMs to generate front-end “Tailwind” marketing sites often leads to a recognizable, template-like “slop” look, citing examples and warning that merely prompting an LLM for a stylish homepage can hurt perceptions of a product’s care and creativity.

OpenAI Prepping for On-Prem Product? (ledger.somantix.ai) AI

A new section in OpenAI’s service terms adds licensing language for software delivered for installation on a customer’s own systems (local machines or private cloud), defining “Licensed Materials” and requiring permanent deletion of all copies upon termination.

Show HN: HelixDB – A graph database built on object storage (github.com) AI

This Show HN post introduces HelixDB, an OLTP “graph-vector” database built in Rust that combines graph and vector data (and also supports KV, documents, and relational data) and is designed to let AI agents access needed storage components from one platform. The project provides a Helix CLI and SDKs (Rust/TypeScript) with queries sent to a local /v1/query endpoint, plus an object-storage-backed HelixDB Cloud offering with vector/full-text search, transactions, and high availability.

Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon (tridao.me) AI

The article proposes “Gram Newton-Schulz” (used in an optimizer called GramMuon) to speed up Muon’s Newton-Schulz orthogonalization by iterating on a smaller symmetric Gram matrix (XXᵀ) rather than the full rectangular weight matrix, enabling faster symmetric matrix-multiplication kernels and reducing the orthogonalization runtime by about 40–50%. It also studies numerical instability in the naive Gram form (e.g., spurious negative eigenvalues in half precision) and introduces a “restarting” strategy to stabilize it while preserving optimization quality (within ~0.01 validation perplexity). The authors report up to ~50% optimizer-time reduction in large MoE models and release implementation code and custom GPU kernels.
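
The core idea can be sketched in NumPy. This is a simplified illustration and not the authors' implementation: it uses the classic cubic Newton-Schulz coefficients rather than Muon's tuned polynomial, runs in float64, and omits both the restarting strategy and the custom kernels. The function names are hypothetical.

```python
import numpy as np


def newton_schulz(X, steps=15):
    """Classic cubic Newton-Schulz on the full rectangular matrix."""
    X = X / np.linalg.norm(X)  # Frobenius norm bounds all singular values by 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X


def gram_newton_schulz(X, steps=15):
    """Same iteration, carried out on the small Gram matrix G = X X^T."""
    X = X / np.linalg.norm(X)
    m = X.shape[0]
    G = X @ X.T        # (m, m) symmetric Gram matrix
    Q = np.eye(m)      # accumulated left factor, so that X_k = Q @ X_0
    for _ in range(steps):
        P = 1.5 * np.eye(m) - 0.5 * G   # per-step update X_{k+1} = P @ X_k
        Q = P @ Q
        G = P @ G @ P                   # Gram matrix of the next iterate
    return Q @ X       # a single product with the rectangular matrix


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))            # wide matrix: m = 4 rows, n = 16 columns
A = newton_schulz(W)
B = gram_newton_schulz(W)
```

In this form, every step inside the loop touches only m-by-m symmetric matrices, and the rectangular matrix appears once at the start and once at the end. In exact arithmetic both functions compute the same iterates; the article's stability concerns arise in low precision, which this float64 sketch does not exercise.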

The Economics of Speculative Decoding (fergusfinn.com) AI

The article argues that speculative decoding remains a key inference performance win, but changing model architectures—especially mixture-of-experts (MoE) layers and compressed attention/KV-cache techniques—reduce the “free” nature of speculative tokens by shifting attention and feed-forward operations closer to compute-bound regimes. It describes how MoE routing changes the memory/compute roofline (making some speculative tokens costly to verify, especially at low batch sizes) and how compressed attention can remove the slack that speculation previously exploited. Using these updated cost considerations, it proposes that effective speculation lengths must be chosen more conservatively based on acceptance likelihood, since rejected speculative tokens are no longer zero-cost.
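
A toy cost model makes the trade-off concrete. It is a sketch under stated assumptions, not the article's model: each drafted token is accepted independently with probability `alpha`, and verification cost grows linearly with draft length through a hypothetical `verify_per_token` term. All cost values are illustrative, in units of one ordinary decode step.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass with k drafted tokens,
    assuming independent per-token acceptance probability alpha."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def cost_per_token(alpha, k, draft_cost, verify_base, verify_per_token):
    """Cost of one draft-and-verify round divided by the tokens it yields."""
    round_cost = k * draft_cost + verify_base + k * verify_per_token
    return round_cost / expected_tokens(alpha, k)


def best_draft_length(alpha, draft_cost, verify_base, verify_per_token,
                      max_k=16):
    """Draft length that minimizes cost per emitted token (k = 0: no drafting)."""
    return min(
        range(max_k + 1),
        key=lambda k: cost_per_token(alpha, k, draft_cost,
                                     verify_base, verify_per_token),
    )


# Memory-bound regime: verifying extra tokens is treated as free.
k_free = best_draft_length(alpha=0.8, draft_cost=0.05,
                           verify_base=1.0, verify_per_token=0.0)
# Closer to compute-bound: each speculative token adds verification cost.
k_costly = best_draft_length(alpha=0.8, draft_cost=0.05,
                             verify_base=1.0, verify_per_token=0.3)
```

Once each verified token carries a cost, the optimal draft length in this model shrinks, which mirrors the article's argument for more conservative speculation lengths.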

Anthropic walks back policy that could have 'sabotaged' researchers using Claude (wired.com) AI

Anthropic is walking back safeguards in Claude Fable 5 that critics said would covertly degrade the model’s performance for researchers trying to develop competing AI models, after those researchers pushed back. The company says it will make those frontier-LLM safeguards visible going forward, alerting or rerouting users who appear to be using the model to pursue highly capable AI development. It attributes the earlier approach to concerns about slowing frontier progress for safety and societal alignment reasons.

Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable (techcrunch.com) AI

Cybersecurity researchers say the guardrails on Anthropic’s public model Fable overreach, blocking or pausing requests they describe as harmless, such as code review or even reading content, and falling back to another model when tripped. They argue the restrictions are keyword- or topic-based in a way that can downgrade responses needed for secure software work, despite Anthropic’s stated aim of reducing risks like malware development and biological weapons research. Anthropic did not immediately comment; the company also runs an application-based Cyber Verification Program that reportedly gives approved professionals fewer restrictions.