SWE-Doctor: Runtime Diagnosis from Multi-Faceted Bug Reproduction Tests
The paper finds that directly using bug-reproduction tests to guide software-engineering agents is unreliable: fail-to-pass tests often cover only one manifestation of the reported issue and lead to partial patches, while fail-to-fail tests mislead agents when used as patch-generation targets. SWE-Doctor instead generates multi-faceted reproduction tests and converts their executions into runtime diagnoses that guide patching, reaching average resolution rates of 75.7% on SWE-bench Verified and 59.4% on SWE-bench Pro across ten LLM-benchmark combinations.
JingLabs read
The transferable pattern is reproduction-first with diagnosis feedback: have the agent write several reproduction tests and feed the runtime output back as diagnostic context, rather than treating a single failing test as the target to make pass. That is implementable today in any coding-agent harness without waiting for framework support.
Cross-Machine Replay Study Questions Performance-Optimization Agent Benchmarks
The study replays the official reference patches of three repository-level performance-optimization benchmarks — GSO, SWE-Perf, and SWE-fficiency — across four common Google Cloud machine types, covering 740 code-optimization tasks. Reference patches satisfy the benchmarks' own validity rules in every cross-machine replay for only 39 of 102 GSO tasks, 11 of 140 SWE-Perf tasks, and 411 of 498 SWE-fficiency tasks, with SWE-Perf especially fragile because many reference patches produce close-to-zero runtime changes.
JingLabs read
Treat performance-optimization leaderboard scores as weak evidence when choosing a coding agent — much of the signal is runtime instability and scoring-rule artifacts, not agent capability. If optimization work matters for a client engagement, benchmark candidate agents on the client's actual hardware and workloads instead.
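One way to sanity-check a claimed optimization on your own hardware is to compare the speedup against run-to-run noise. The sketch below is an illustration only: `speedup_is_credible`, the 1.05 minimum speedup, and the two-times-noise rule are assumptions chosen for the example, not the validity rules of GSO, SWE-Perf, or SWE-fficiency.

```python
import statistics


def speedup_is_credible(baseline_s, patched_s, min_speedup=1.05):
    """Return (speedup, credible) from repeated wall-clock timings in seconds.

    Each list needs at least two measurements taken on the target machine.
    """
    base_med = statistics.median(baseline_s)
    patch_med = statistics.median(patched_s)
    speedup = base_med / patch_med
    # Relative run-to-run spread of the noisier of the two samples.
    noise = max(
        statistics.stdev(baseline_s) / base_med,
        statistics.stdev(patched_s) / patch_med,
    )
    # Assumed rule: the gain must clear a floor and exceed twice the noise.
    credible = speedup >= min_speedup and (speedup - 1.0) > 2 * noise
    return speedup, credible


# A clear win: the patched code runs in about half the time.
clear = speedup_is_credible([10.0, 10.2, 9.8, 10.1, 9.9], [5.0, 5.1, 4.9, 5.0, 5.0])

# A close-to-zero change that is indistinguishable from noise.
marginal = speedup_is_credible([10.0, 10.4, 9.6, 10.2, 9.8], [9.9, 10.3, 9.5, 10.1, 9.7])

print(clear, marginal)
```

Running the same check on each candidate machine type shows quickly whether a result survives a change of hardware.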
Claude Code 2.1.200 and Agent SDK 0.3.199-200: Manual Mode, Approval Correlation, Masked Credentials
Claude Code 2.1.200, released July 3, renames the default permission mode to Manual across the CLI and IDE integrations and stops AskUserQuestion dialogs from auto-continuing by default, with an opt-in idle timeout via /config. The parallel Claude Agent SDK releases 0.3.199 and 0.3.200 add a requestId to the canUseTool callback for correlating out-of-band permission responses, a blocked field on workflow progress events indicating when the auto-mode safety classifier stopped an agent, and masked-credential injection into sandboxed commands via sandbox.credentials.
JingLabs read
If you run unattended Claude Code sessions, note that AskUserQuestion prompts now block until answered unless you configure the idle timeout — automated pipelines built on the old auto-continue behaviour will stall. The masked-credential injection is the more durable piece: it lets sandboxed agents use secrets without ever seeing them, a pattern worth adopting for any agent that touches client credentials.
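The general masking pattern can be sketched in a few lines. This is not how sandbox.credentials is implemented, which the release notes summarised here do not describe; the `{{NAME}}` placeholder syntax, the `vault` dictionary, and the functions `resolve` and `redact` are all hypothetical. The agent only ever handles the placeholder, and a trusted boundary swaps in the real value on the way out and masks it on the way back.

```python
import re

# Hypothetical placeholder syntax the agent is allowed to see and emit.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def resolve(headers: dict[str, str], vault: dict[str, str]) -> dict[str, str]:
    """Swap {{NAME}} placeholders for real secrets at the trusted boundary.

    An unknown placeholder raises KeyError, so the request fails closed.
    """
    return {
        key: PLACEHOLDER.sub(lambda m: vault[m.group(1)], value)
        for key, value in headers.items()
    }


def redact(text: str, vault: dict[str, str]) -> str:
    """Mask any secret value before output flows back to the agent."""
    for name, secret in vault.items():
        text = text.replace(secret, "{{" + name + "}}")
    return text


vault = {"CRM_TOKEN": "s3cr3t-value"}  # held outside the sandbox
agent_headers = {"Authorization": "Bearer {{CRM_TOKEN}}"}  # what the agent wrote

outbound = resolve(agent_headers, vault)
echoed = redact("401 Unauthorized for token s3cr3t-value", vault)
print(outbound, echoed)
```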
MCP TypeScript SDK v2 Beta: Package Split, Standard Schema, 2026-07-28 Spec
The official MCP TypeScript SDK published v2.0.0-beta.1 on June 30, followed by beta.2 on July 2. The monolithic package is split into @modelcontextprotocol/server and @modelcontextprotocol/client with optional Express, Hono, Fastify, and Node adapters; tool schemas accept any Standard Schema library (Zod v4, ArkType, Valibot) or plain JSON Schema; and server setup consolidates to serveStdio() for local servers and createMcpHandler() for HTTP using web-standard Request/Response, portable across Node, Bun, Deno, and Workers. The beta implements the stateless 2026-07-28 spec while remaining compatible with legacy clients, and a codemod (npx @modelcontextprotocol/codemod@beta v1-to-v2) automates the mechanical migration; beta.2 adds CommonJS output alongside ESM.
JingLabs read
This is the JavaScript counterpart to the Python v2 beta covered on July 2, and the same advice applies: the API surface is now substantially frozen, so TypeScript MCP servers should be ported against the beta rather than deferring to the stable release in late July. The web-standard HTTP handler is the practical win for Next.js shops — an MCP server becomes an ordinary route handler with no framework glue.
Pydantic AI v2.4.0: Span-Based Trajectory Evaluators and File-Upload Controls
Pydantic AI v2.4.0, released July 2, adds five span-based agentic evaluators — ToolCorrectness, TrajectoryMatch, ArgumentCorrectness, MaxToolCalls, and MaxModelRequests — that score an agent's execution trace rather than only its final answer, alongside a GEval evaluator and standard quality-metric rubrics for LLMJudge. The release also splits the previous preserve_file_data setting into a distinct allow_uploaded_files control for inbound file security, separate from the opt-in for AG-UI representation.
JingLabs read
Process-aware evaluation — scoring tool selection, arguments, and trajectory shape separately from outcome — has been a recurring theme in recent papers and is now shipping in a mainstream framework's built-in eval tooling. If you run agents on Pydantic AI, adopt these evaluators for regression suites; if not, they are a reasonable template for what your harness should measure.
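For teams not on Pydantic AI, the idea behind trace-level checks can be sketched without any framework. The names below (`ToolCall`, `trajectory_match`, `within_budget`) are hypothetical and do not reproduce the Pydantic AI evaluator API; they only show what scoring a trace, rather than a final answer, might look like.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


def trajectory_match(actual: list[ToolCall], expected: list[str]) -> float:
    """Fraction of expected tool names found, in order, in the actual trace."""
    if not expected:
        return 1.0
    matched = 0
    for call in actual:
        # Advance only when the next expected tool appears; extra calls in
        # between are tolerated by this (assumed) ordering rule.
        if matched < len(expected) and call.name == expected[matched]:
            matched += 1
    return matched / len(expected)


def within_budget(actual: list[ToolCall], max_tool_calls: int) -> bool:
    """True when the agent stayed within its tool-call budget."""
    return len(actual) <= max_tool_calls


trace = [
    ToolCall("search", {"q": "invoice 42"}),
    ToolCall("fetch", {"id": 42}),
    ToolCall("search", {"q": "invoice 42 status"}),
    ToolCall("summarize", {"id": 42}),
]

full = trajectory_match(trace, ["search", "fetch", "summarize"])
partial = trajectory_match(trace, ["fetch", "search", "write"])
print(full, partial, within_budget(trace, 3))
```

Checks like these slot into a regression suite alongside outcome assertions, so a run that reaches the right answer through a wasteful or wrong path still fails.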
Agent Memory as a Data Management Problem: A Module-Level Experimental Study
The paper argues that agent memory has evolved into a data management system — persistent storage, retrieval, update, consolidation, and lifecycle governance — yet is still benchmarked as a monolithic black box through end-to-end task scores such as F1 or BLEU. It presents a systematic experimental study that decomposes memory systems into four modules — representation and storage, extraction, retrieval and routing, and maintenance — and measures operational cost, architectural trade-offs between modules, and robustness under dynamic knowledge updates, concerns that end-to-end metrics leave unexplored.
JingLabs read
If you are choosing a memory layer for a client assistant, this is the right question to ask vendors: per-module cost and behaviour under fact revision, not a headline benchmark score. The four-module decomposition is a usable checklist for auditing whatever memory stack you already run.
Claude Sonnet 5 Released: 1M-Token Context at Sonnet Pricing, Default in Claude Code
Anthropic released Claude Sonnet 5 on June 30 with a native 1M-token context window and positions it as an agentic model built to plan multi-step work, drive browsers and terminals, and run autonomously at a level previously requiring larger models. Introductory API pricing is $2/$10 per million tokens through August 31, then $3/$15. It is now the default model in Claude Code (version 2.1.197 and later) and for Free and Pro users on claude.ai, and is available in the Claude API and major coding tools.
JingLabs read
The notable number is 1M tokens of context at mid-tier pricing: for many SMB workloads that shrinks the need for elaborate retrieval pipelines over mid-size document sets. Test it against your current default model on real agent tasks before August 31 — the promotional window makes the comparison nearly free.
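To see what the promotional window is worth, a rough per-run cost comparison helps. The sketch assumes the first figure in each price pair is the input price and the second the output price, and the 800,000-input / 20,000-output token run is a made-up workload, not a measured one.

```python
def run_cost_usd(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of one run, given prices in USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical long-context run: a large document set in, a short report out.
promo = run_cost_usd(800_000, 20_000, input_price=2, output_price=10)
regular = run_cost_usd(800_000, 20_000, input_price=3, output_price=15)

print(f"promotional: ${promo:.2f}, regular: ${regular:.2f}")
# prints "promotional: $1.80, regular: $2.70"
```

Under these assumptions, input tokens dominate the bill, so the comparison against a retrieval pipeline hinges on how often the full context is resent.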
Claude Code 2.1.198: Background Agents Auto-Open Draft PRs, Agent Notification Hooks
Background agents launched from the claude agents view now commit, push, and open a draft pull request when they finish code work in a worktree, instead of stopping to ask. Sessions that need input or finish now fire the Notification hook with agent_needs_input and agent_completed events, the built-in Explore subagent inherits the main session's model instead of running on Haiku, and subagents and context compaction inherit the session's extended-thinking configuration. The release also makes Claude in Chrome generally available.
JingLabs read
Auto-opening draft PRs changes the default behaviour of unattended agents — decide whether that is acceptable in your repositories before upgrading automated runners. The notification-hook events are the practical piece: they let you wire agent completion into chat or ticketing systems without polling.
MCP Python SDK v2.0.0b1: First Beta with Full 2026-07-28 Spec Support
The first v2 beta of the official MCP Python SDK, published June 30, ships full support for the upcoming 2026-07-28 MCP specification, including the stateless protocol. It finalises the dispatcher/runner pipeline that replaces the session-centric v1 architecture, renames FastMCP to MCPServer, and adds resolver dependency injection, RFC 6570 URI templates, and OpenTelemetry tracing; protocol types now live in a standalone mcp-types package. The stable release is targeted for late July alongside the final spec.
JingLabs read
With the beta out, the migration window is now measured in weeks, not months: start porting Python MCP servers against v2 now, since the API surface is substantially frozen. Keep the mcp<2 upper bound in production dependencies until stable lands.
BioInsight: Typed Intermediate Artifacts for Multi-Agent Evidence Reporting
Static AI-generated biomedical reports are often insufficient for research decisions, where users need to inspect evidence, assess uncertainty, and refine hypotheses. BioInsight is a multi-agent system that organises disease-specific evidence through typed intermediate artifacts — ranked pathways, literature evidence packets, reasoning notes, citation-grounded reports, and dashboard schemas — decouples evidence retrieval from mechanistic reasoning, normalises citations through deterministic components, and renders the same structured evidence as an interactive interface instead of a text-only report.
JingLabs read
The transferable idea is the pipeline shape: agents exchanging typed, validated artifacts with deterministic components handling citations, rather than passing free text between prompts. That pattern suits any regulated-domain reporting agent in the EU, where provenance must survive the pipeline. Watch, and borrow the artifact design.
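A minimal sketch of that pipeline shape, assuming plain dataclasses as the artifact layer. `EvidencePacket`, `normalise_citations`, and the source identifiers are hypothetical and not taken from BioInsight; the sketch only shows a validated artifact plus a deterministic, non-LLM citation step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidencePacket:
    """A typed artifact passed between agents instead of free text."""
    claim: str
    source_ids: tuple[str, ...]

    def __post_init__(self):
        # Validation happens at construction, so a malformed artifact
        # never reaches the next stage of the pipeline.
        if not self.claim.strip():
            raise ValueError("claim must be non-empty")
        if not self.source_ids:
            raise ValueError("every claim needs at least one source")


def normalise_citations(packets: list[EvidencePacket]):
    """Number sources in first-seen order, with no model in the loop."""
    numbering: dict[str, int] = {}
    rendered = []
    for packet in packets:
        refs = []
        for source_id in packet.source_ids:
            numbering.setdefault(source_id, len(numbering) + 1)
            refs.append(numbering[source_id])
        rendered.append(packet.claim + " " + "".join(f"[{n}]" for n in refs))
    return rendered, numbering


packets = [
    EvidencePacket("Pathway A is upregulated.", ("src:001", "src:002")),
    EvidencePacket("Pathway A interacts with gene B.", ("src:002", "src:003")),
]
rendered, numbering = normalise_citations(packets)
print(rendered, numbering)
```

Because the numbering is deterministic, the same packets always render the same citations, which is what lets provenance survive later stages.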
Vercel AI SDK 6.0.202 Patches Tool-Approval Forgery in generateText/streamText
The approval-replay path in generateText and streamText reconstructed approved tool calls from the client-supplied messages array and executed them without re-validating input against the tool's schema or re-checking that the tool actually requires approval, so a client could forge an assistant message containing a pre-approved tool-call part and have the server execute a tool with attacker-chosen arguments. The fix verifies the HMAC signature when experimental_toolApprovalSecret is configured, re-validates tool-call input against the input schema, and re-resolves whether the tool requires approval before execution. A companion 5.x patch (5.0.198) hardens UI message stream processing against prototype pollution from chunk IDs.
JingLabs read
If you run AI SDK chat endpoints with approval-gated tools, update now and configure the tool-approval secret: message history sent from the browser is attacker-controlled input, yet the approval state stored in it was previously trusted. A good reminder to treat replayed client state as untrusted in any agent backend, whatever the framework.
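The same three checks can be applied in any backend. The sketch below is a language-agnostic illustration written in Python, not the AI SDK's implementation: `SECRET`, `sign_approval`, `may_execute`, and the `delete_record` tool are hypothetical. It re-validates arguments, re-resolves whether approval is required, and only then accepts a server-issued HMAC over the exact call.

```python
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # hypothetical; load from configuration in practice


def sign_approval(tool: str, args: dict) -> str:
    """Server-side signature over the exact tool call that was approved."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def may_execute(tool: str, args: dict, signature, tools: dict) -> bool:
    """Decide whether a tool call replayed from client history may run."""
    spec = tools.get(tool)
    # 1. Unknown tools and schema-invalid arguments are rejected outright.
    if spec is None or not spec["validate"](args):
        return False
    # 2. Re-resolve the approval requirement on the server, not from the client.
    if not spec["needs_approval"]:
        return True
    # 3. The approval must carry a signature this server issued for this call.
    if not isinstance(signature, str):
        return False
    expected = sign_approval(tool, args)
    return hmac.compare_digest(expected.encode(), signature.encode())


tools = {
    "delete_record": {
        "needs_approval": True,
        "validate": lambda a: isinstance(a.get("id"), int),
    },
}

approved = sign_approval("delete_record", {"id": 7})
print(may_execute("delete_record", {"id": 7}, approved, tools))   # genuine approval
print(may_execute("delete_record", {"id": 8}, approved, tools))   # swapped arguments
print(may_execute("delete_record", {"id": 7}, "forged", tools))   # forged signature
```

Signing the arguments along with the tool name matters: an approval for one call cannot be reused for another with attacker-chosen input.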
MCP Python SDK v2.0.0a1: First Alpha of the Stateless v2 Redesign
The first v2 alpha of the official MCP Python SDK restructures the package around a stateless request/response design: a new Dispatcher pipeline replaces ServerSession, low-level server handlers become constructor parameters instead of decorators, return values are no longer auto-wrapped, field names move to snake_case with stricter validation, and FastMCP is renamed MCPServer. The alpha implements only the 2025-11-25 spec revision; support for the upcoming 2026-07-28 spec will arrive incrementally, with a beta targeted for June 30 and a stable release for July 28, 2026. Maintainers of dependent packages are advised to add an upper bound such as mcp>=1.27,<2.
JingLabs read
Don't build on the alpha, but do add the <2 upper bound to any Python MCP server you ship today — when stable lands in July, unpinned dependencies will break. The FastMCP-to-MCPServer rename and the handler-interface overhaul mean a real migration, so budget time for it now rather than discovering it in a broken deploy.
Claw-SWE-Bench: Comparing Heterogeneous Agent Harnesses on Coding Tasks
General-purpose agent harnesses are hard to measure under SWE-bench because a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. The paper introduces a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous harnesses comparable under fixed prompt, runtime budget, workspace contract, patch-extraction procedure, and evaluator: 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup, plus an 80-instance Lite subset selected by a cost-aware, rank-aware procedure. An open adapter framework currently supports five harnesses.
JingLabs read
Harness choice, not just model choice, is increasingly the deployment decision, and most published numbers conflate the two. Worth testing if you are choosing between coding agents: the adapter framework lets you run the comparison on your own language mix before committing.
MedCTA: Process-Aware Evaluation of Clinical Tool Agents
Existing medical AI benchmarks largely evaluate isolated perception or single-turn question answering, giving limited visibility into failures of planning, tool recruitment, and rollout reliability. MedCTA evaluates medical tool agents on 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, grounded in multimodal inputs including radiology images, pathology slides, and reports. Its evaluation is process-aware, scoring tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality rather than only the final answer.
JingLabs read
Even outside healthcare, the process-aware rubric — scoring tool selection, argument validity, and execution stability separately from outcome — is a transferable template for auditing domain agents in regulated EU sectors, where how the agent reached an answer matters as much as the answer. Watch, and borrow the rubric.
Infini Memory: Topic-Structured Documents for Long-Term LLM Agent Memory
Long-term LLM agents need persistent memory that can track changing facts and supply relevant evidence across sessions, but existing systems store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and maintenance difficult. The paper proposes Infini Memory, a text-based persistent memory architecture that treats agent memory as topic-structured documents: new observations are staged in a buffer and periodically consolidated, and at inference time the LLM reads memory through iterative tool calls rather than a single retrieval step. The system reaches a 64.7% overall score on MemoryAgentBench.
JingLabs read
Plain-text, topic-structured memory is auditable and easy to inspect, correct, or delete — a practical advantage over opaque vector stores for GDPR-conscious deployments. Worth testing for assistants that must remember client facts across sessions; the buffer-then-consolidate pattern is implementable without new infrastructure.
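The buffer-then-consolidate pattern fits in a small class. This is a sketch under stated assumptions, not the paper's architecture: `TopicMemory`, its `buffer_limit`, and the last-write-wins merge rule are hypothetical simplifications, and the paper's own consolidation step may work quite differently.

```python
from collections import defaultdict


class TopicMemory:
    """Stage observations in a buffer, then merge them into topic documents."""

    def __init__(self, buffer_limit: int = 3):
        self.buffer: list[tuple[str, str, str]] = []
        self.documents: dict[str, dict[str, str]] = defaultdict(dict)
        self.buffer_limit = buffer_limit

    def observe(self, topic: str, key: str, value: str) -> None:
        self.buffer.append((topic, key, value))
        if len(self.buffer) >= self.buffer_limit:
            self.consolidate()

    def consolidate(self) -> None:
        # Assumed rule: later observations overwrite earlier ones, so a
        # revised fact replaces the stale one instead of sitting beside it.
        for topic, key, value in self.buffer:
            self.documents[topic][key] = value
        self.buffer.clear()

    def read(self, topic: str) -> str:
        """Render one topic as plain text that a person can audit or edit."""
        facts = self.documents.get(topic, {})
        return "\n".join(f"{key}: {value}" for key, value in facts.items())


memory = TopicMemory(buffer_limit=2)
memory.observe("client:acme", "billing_contact", "anna@example.com")
before = memory.read("client:acme")  # still empty: the fact is only staged
memory.observe("client:acme", "billing_contact", "ben@example.com")
after = memory.read("client:acme")
print(repr(before), repr(after))
```

Deleting a client's data then amounts to removing one topic document, which is easier to demonstrate than purging embeddings.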
Claude Code 2.1.172: Nested Sub-Agents up to Five Levels Deep
Version 2.1.172 lets sub-agents spawn their own sub-agents up to five levels deep, adds a search bar to the plugin marketplace browser, and reads the AWS region from ~/.aws config files for Bedrock when AWS_REGION is unset. Fixes include sessions getting permanently stuck when using 1M context without credits, background agents reading incorrect project settings from pre-warmed workers, and wildcard domain rules not matching subdomains in WebFetch permissions.
JingLabs read
Nested sub-agents enable deeper delegation hierarchies — useful for research-then-implement pipelines — but each level multiplies token spend and makes failures harder to trace, so add depth only where a flat orchestrator demonstrably falls short. The wildcard-domain permission fix is worth noting for anyone relying on domain allowlists as a security boundary.
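A back-of-envelope calculation shows how quickly depth compounds. It assumes a uniform fan-out (every agent spawns the same number of sub-agents) and a flat 20,000 tokens per agent; both figures are invented for illustration, not measured.

```python
def agents_spawned(fan_out: int, depth: int) -> int:
    """Total agents in a tree with the main session at level 0."""
    return sum(fan_out ** level for level in range(depth + 1))


TOKENS_PER_AGENT = 20_000  # hypothetical average, for illustration only

flat = agents_spawned(fan_out=3, depth=1)   # orchestrator plus 3 workers
deep = agents_spawned(fan_out=3, depth=5)   # the new maximum nesting depth

print(flat, flat * TOKENS_PER_AGENT)  # prints "4 80000"
print(deep, deep * TOKENS_PER_AGENT)  # prints "364 7280000"
```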
OpenAI Codex 0.139.0: Web Search in Code Mode, MCP Schema Fidelity
Codex 0.139.0 adds standalone web search to code mode with plaintext results, including web-search calls nested inside JavaScript tool calls. MCP tool schemas now preserve oneOf and allOf constructs, and large schemas keep more of their structure during compaction, improving compatibility with complex MCP servers. The release also scopes MCP startup warnings per thread and keeps sandbox execution on approved escalation decisions and proxy-only networking.
JingLabs read
The schema-fidelity change matters if you expose MCP servers with complex input schemas — previously flattened oneOf/allOf constraints silently degraded tool-call accuracy. If your MCP tools misbehaved under Codex, retest on this release before redesigning the schemas.
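A small example makes the loss concrete. The schema, the `valid_flattened` form, and the helper functions are hypothetical: how Codex previously flattened schemas is not specified here, and the checker handles only `required` lists rather than full JSON Schema. It still shows why dropping oneOf lets invalid calls through.

```python
def matching_branches(args: dict, branches: list[dict]) -> int:
    """Count branches whose required keys are all present in args."""
    return sum(all(key in args for key in branch["required"]) for branch in branches)


def valid_one_of(args: dict, branches: list[dict]) -> bool:
    # oneOf semantics: exactly one branch may match.
    return matching_branches(args, branches) == 1


def valid_flattened(args: dict) -> bool:
    # Hypothetical flattened form: both properties optional, exclusivity lost.
    return set(args) <= {"url", "file_path"}


# A tool that takes either a URL or a file path, but never both or neither.
one_of = [{"required": ["url"]}, {"required": ["file_path"]}]

calls = [
    {"url": "https://example.com/a.pdf"},
    {},
    {"url": "https://example.com/a.pdf", "file_path": "/tmp/a.pdf"},
]
results = [(valid_one_of(c, one_of), valid_flattened(c)) for c in calls]
print(results)
```

The second and third calls pass the flattened check yet violate the original contract, which is the kind of silent degradation worth retesting for.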
Claude Code 2.1.169–170: Post-Session Hook, Safe Mode, Fable 5
Version 2.1.169 adds a post-session lifecycle hook for self-hosted runners to snapshot work after sessions end, a --safe-mode flag that disables all customisations for troubleshooting, and a /cd command that moves a session to a new working directory without invalidating the prompt cache; it also fixes MCP policy enforcement on server reconnect. Version 2.1.170 introduces Claude Fable 5, described in the changelog as a Mythos-class model made safe for general use, as a selectable model within the CLI.
JingLabs read
The post-session hook is immediately useful for teams running Claude Code in automated pipelines where agent workspace state must be captured after a session ends. Fable 5 is now selectable from the existing CLI without config changes, so it is worth testing against current coding-agent workflows.
Skill Rewriting for LLM Agents: Quality-Cost Trade-offs on SkillsBench
LLM agents rely on skills — reusable procedural documents encoding workflows, tool calls, and domain rules. The paper finds that treating skill rewriting as pure prompt compression backfires: shorter skills can remove sparse operational anchors that the agent depends on for recovery and debugging, raising downstream agent-token cost rather than reducing it. Experiments on SkillsBench show API/code anchoring, workflow guarding, and rule/formula anchoring each suit different task families; a learned selection policy reduces total cost by 7% and downstream agent-token cost by 6%.
JingLabs read
Directly applicable to teams maintaining agent prompt libraries or system-prompt skill repositories. Profiling which anchors the agent actually uses before compressing for cost is worth the effort — blind compression can increase total spend. The SkillsBench evaluation setup is reusable for auditing internal skill corpora.
Text World Models: Survey and Framework for LLM Agent Planning
Many LLM-based agents are reactive — mapping observations directly to actions without an explicit model of how environments are structured or evolve. The paper introduces text world models (TWMs): learned transition functions that take a state and a candidate action and predict the resulting webpage, terminal output, API response, or user reply. The survey organises TWM approaches by construction method and application, covering web navigation, code editing, tool use, and long-horizon dialogue, and shows benefits for lookahead planning, sample-efficient learning, and principled benchmark evaluation.
JingLabs read
Relevant for building agents that must plan multi-step sequences before committing to external API or browser actions. TWMs can cut costly real-environment trial-and-error in enterprise automation pipelines — worth tracking as implementations mature in open-source frameworks.
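One-step lookahead with a world model can be sketched as follows. A real TWM would be a learned model that predicts text; here a lookup table stands in for it, and `choose_action`, `predict`, `score`, and the page names are all hypothetical. The planner ranks candidate actions by their predicted outcome before touching the real environment.

```python
from typing import Callable, Sequence


def choose_action(state: str, actions: Sequence[str],
                  predict: Callable[[str, str], str],
                  score: Callable[[str], float]) -> str:
    """Pick the action whose predicted next state scores highest."""
    return max(actions, key=lambda action: score(predict(state, action)))


# Stand-in for a learned text world model: (state, action) -> predicted text.
TRANSITIONS = {
    ("login_page", "submit_empty_form"): "error: password required",
    ("login_page", "fill_and_submit"): "dashboard",
    ("login_page", "click_help"): "help_page",
}


def predict(state: str, action: str) -> str:
    # Unknown transitions are assumed to leave the state unchanged.
    return TRANSITIONS.get((state, action), state)


def score(predicted: str) -> float:
    # Hypothetical goal: reach the dashboard.
    return 1.0 if predicted == "dashboard" else 0.0


best = choose_action(
    "login_page",
    ["submit_empty_form", "fill_and_submit", "click_help"],
    predict,
    score,
)
print(best)  # prints "fill_and_submit"
```

Only the chosen action is then executed for real, so the failed form submission is never sent.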