
Anthropic Introduces Claude 4 Family and Claude Code


Anthropic released Claude Opus 4 and Claude Sonnet 4, the newest versions of its Claude series of LLMs. Both models support extended thinking, tool use, and improved memory, and Claude Opus 4 outperforms other LLMs on coding benchmarks.

Anthropic announced the release during their Code with Claude event. The Claude 4 models are "hybrid" models: they can give quick responses to questions or they can perform extended thinking. The models can use tools such as web search in extended thinking mode, execute multiple tools in parallel, and use local files for memory. Claude Opus 4 scores 72.5% on SWE-bench and 43.2% on Terminal-bench, outperforming all other models on these coding benchmarks. Anthropic also announced the general availability of Claude Code, Anthropic's coding agent, with beta extensions for JetBrains and VS Code. According to Anthropic,

These models are a large step toward the virtual collaborator—maintaining full context, sustaining focus on longer projects, and driving transformational impact. They come with extensive testing and evaluation to minimize risk and maximize safety, including implementing measures for higher AI Safety Levels like ASL-3. We're excited to see what you'll create.
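To make the "hybrid" behavior concrete, the sketch below builds a Messages API request payload that toggles extended thinking on or off. The model ID and the exact shape of the `thinking` parameter (a type flag plus a token budget) are assumptions drawn from Anthropic's published API documentation, not from this announcement; treat this as an illustration of the request shape rather than a definitive client.

```python
import json

def build_request(prompt: str, think: bool = True) -> dict:
    """Sketch of an Anthropic Messages API request body.

    The dated model ID and the `thinking` parameter shape are
    assumptions based on Anthropic's API docs; sending the request
    would additionally require the `anthropic` SDK and an API key.
    """
    request = {
        "model": "claude-opus-4-20250514",  # assumed model identifier
        "max_tokens": 2048,
        "messages": [{"role": "user", "content": prompt}],
    }
    if think:
        # Extended thinking mode: the model reasons internally before
        # answering, up to a caller-supplied token budget.
        request["thinking"] = {"type": "enabled", "budget_tokens": 1024}
    return request

payload = build_request("Summarize the SWE-bench benchmark in one paragraph.")
print(json.dumps(payload, indent=2))
```

Omitting `thinking` (or passing `think=False`) would correspond to the quick-response mode, so the same endpoint serves both behaviors.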

Claude 4 includes several other improvements over previous iterations of Claude. Anthropic claims Claude 4 is "65% less likely" to use "shortcuts" to complete agentic tasks. It also "dramatically outperforms all previous models on memory capabilities" by using local files to store data. In thinking mode, the chain-of-thought output is summarized "about 5% of the time" to reduce the space needed for display.

Claude 4 coding benchmark comparison (image source: Anthropic's Claude 4 announcement)

Users in a Hacker News discussion wondered whether the new models were improved enough to "justify the full version increment." One user replied:

I'm a developer, and I've been trying to use AI to vibe code apps for two years. This is the first time I'm able to vibe code an app without major manual interventions at every step. Not saying it's perfect, or that I'd necessarily trust it without human review, but I did vibe code an entire production-ready iOS/Android/web app that accepts payments in less than 24 hours and barely had to manually intervene at all, besides telling it what I wanted to do next.

Open-source developer Simon Willison live-blogged the launch. He also delved into the Claude 4 system card, which documents several scenarios and outcomes from Anthropic's safety testing.

Anthropic's system cards are always worth a look, and this one for the new Opus 4 and Sonnet 4 has some particularly spicy notes. It's also 120 pages long - nearly three times the length of the system card for Claude 3.7 Sonnet! If you're looking for some enjoyable hard science fiction...this document absolutely has you covered.

Anthropic's tests reveal that its models would in some cases take "extreme actions" that, while "rare and difficult to elicit," are nonetheless more common than in earlier models. As part of its Responsible Scaling Policy (RSP), Anthropic decided with the release of Claude 4 to activate its AI Safety Level 3 (ASL-3) Deployment and Security Standards, which include heightened internal security to help prevent theft of model weights.
