Alibaba’s AI Just Coded for 35 Hours Straight Without Human Help

Last week, Alibaba’s Qwen team dropped something wild: their new model Qwen3.7-Max ran autonomously for 35 hours straight, optimizing code for their custom AI chip. No human intervention. No breaks. Just an AI agent engineering itself.

This isn’t a chatbot that answers questions. This is an AI that does work — the kind of work that would normally take a senior engineer over a week.

And it’s directly relevant to the AI Junior Developer series I started yesterday.

What Actually Happened

Here’s the breakdown:

The Task: Optimize low-level code for Alibaba’s proprietary AI accelerator chip. Think of that chip’s equivalent of CUDA kernels, plus memory management and instruction scheduling: the kind of work where a 5% improvement means millions in savings at scale.

The Agent: Qwen3.7-Max, a model built specifically for long-running autonomous tasks. Not a general chat model — an agent model.

The Runtime: 35 hours. The agent:
– Read the existing codebase
– Identified optimization opportunities
– Made changes
– Ran benchmarks
– Iterated based on results
– Repeated for 35 hours
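To make that loop concrete, here is a minimal sketch of "change, benchmark, keep it only if it helped". Everything in it is a toy assumption on my part: the "code" is a single tunable number, `toy_benchmark` is a made-up latency curve, and `optimize` is a hypothetical helper, not anything Alibaba has published.

```python
import random
from typing import Callable


def optimize(
    candidate: float,
    propose: Callable[[float], float],
    benchmark: Callable[[float], float],
    max_iterations: int = 200,
) -> tuple[float, float]:
    """Keep a change only when the benchmark improves (lower is better)."""
    best, best_score = candidate, benchmark(candidate)
    for _ in range(max_iterations):
        trial = propose(best)        # make a change
        score = benchmark(trial)     # run the benchmark
        if score < best_score:       # iterate based on the result
            best, best_score = trial, score
    return best, best_score


# Toy stand-ins (assumptions): a fake latency curve with its minimum at 64,
# and a proposer that nudges the current value by a small random amount.
rng = random.Random(0)


def toy_benchmark(tile_size: float) -> float:
    return (tile_size - 64.0) ** 2 + 10.0


def toy_propose(tile_size: float) -> float:
    return tile_size + rng.uniform(-8.0, 8.0)


baseline_latency = toy_benchmark(16.0)
best_tile, best_latency = optimize(16.0, toy_propose, toy_benchmark)
print(f"baseline={baseline_latency:.1f} best={best_latency:.1f}")
```

A real agent replaces `toy_propose` with a model editing source files and `toy_benchmark` with an actual hardware run, but the accept-or-discard decision has the same shape.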

The Result: Code optimizations that matched what human engineers would produce — but done in a day and a half instead of a week.

Why This Matters for AI Agents

There’s a key distinction here that most people miss: Qwen3.7-Max was built for agent work, not chat.

Most LLMs today are optimized for:
– Single-turn Q&A
– Short conversations (5-10 turns)
– Clear, immediate tasks

Agent work is different:
– Long horizon: Tasks that take hours or days
– Self-correction: The agent needs to recognize when something isn’t working and try a different approach
– State management: Remembering what was tried 10 hours ago
– Tool use: Running code, reading files, executing benchmarks
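State management and self-correction are easier to picture with a small example. The sketch below is one hypothetical way to do it, assuming the agent keeps a journal of attempts that it can serialize to JSON and reload later. `Attempt`, `Journal`, and the approach names are all invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Attempt:
    approach: str
    succeeded: bool
    note: str = ""


@dataclass
class Journal:
    attempts: list[Attempt] = field(default_factory=list)

    def record(self, approach: str, succeeded: bool, note: str = "") -> None:
        self.attempts.append(Attempt(approach, succeeded, note))

    def next_approach(self, candidates: list[str]) -> Optional[str]:
        """Return the first candidate that has not already failed, if any."""
        failed = {a.approach for a in self.attempts if not a.succeeded}
        for candidate in candidates:
            if candidate not in failed:
                return candidate
        return None

    def to_json(self) -> str:
        return json.dumps([asdict(a) for a in self.attempts])

    @classmethod
    def from_json(cls, text: str) -> "Journal":
        return cls([Attempt(**row) for row in json.loads(text)])


journal = Journal()
journal.record("unroll inner loop", succeeded=False, note="benchmark regressed")

# Hours later, possibly in a fresh context: reload the journal and move on.
restored = Journal.from_json(journal.to_json())
choice = restored.next_approach(["unroll inner loop", "reorder memory loads"])
print(choice)  # reorder memory loads
```

The point of writing the journal outside the model’s context window is that it survives a restart, so the agent does not retry an idea it already ruled out.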

Qwen3.7-Max is designed for the second category. And on benchmarks for autonomous agent tasks, it matches Claude Opus 4.6 — the current state-of-the-art for agent work.

The Benchmark Wars (Quick Context)

For context, here’s how the models stack up on agent benchmarks:

Model	Agent Score	Best For
Claude Opus 4.6	~92	Complex reasoning, long-horizon tasks
Qwen3.7-Max	~92	Autonomous coding, hardware optimization
DeepSeek V4 Pro	~88	Cost-effective general tasks
Kimi K2.6	~85	Long-context reading

I’ve been using DeepSeek V4 Pro for my Hermes orchestrator (cost efficiency), but for the actual code agent? This Qwen3.7-Max result is compelling.

The Robot Demo

The Qwen team didn’t stop at code optimization. They also demoed Qwen3.7-Max steering a four-legged robot — the kind you see from Boston Dynamics.

The model:
– Processed sensor data
– Made real-time movement decisions
– Adjusted gait based on terrain
– Recovered from stumbles

This is the same architecture as the code optimization agent: perceive → decide → act → iterate. The only difference is the output (motor commands vs. code commits).
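Here is a toy illustration of that claim: one loop, two very different environments. This is a sketch under heavy assumptions. `ToyRobot` and `ToyKernel` are invented classes, and the "decide" step is a simple proportional correction rather than a model call.

```python
from typing import Protocol


class Environment(Protocol):
    def observe(self) -> float: ...
    def apply(self, action: float) -> None: ...


def run_agent(env: Environment, target: float, steps: int) -> float:
    """Perceive, decide, act, repeated; returns the final observation."""
    for _ in range(steps):
        reading = env.observe()             # perceive
        action = 0.5 * (target - reading)   # decide: close half the gap
        env.apply(action)                   # act
    return env.observe()


class ToyRobot:
    """Hypothetical robot: body tilt in degrees, nudged by motor commands."""

    def __init__(self, tilt: float) -> None:
        self.tilt = tilt

    def observe(self) -> float:
        return self.tilt

    def apply(self, action: float) -> None:
        self.tilt += action


class ToyKernel:
    """Hypothetical kernel: latency in ms, changed by each 'commit'."""

    def __init__(self, latency_ms: float) -> None:
        self.latency_ms = latency_ms

    def observe(self) -> float:
        return self.latency_ms

    def apply(self, action: float) -> None:
        self.latency_ms += action


final_tilt = run_agent(ToyRobot(tilt=8.0), target=0.0, steps=10)
final_latency = run_agent(ToyKernel(latency_ms=120.0), target=100.0, steps=10)
print(final_tilt, final_latency)
```

`run_agent` never needs to know which environment it is driving. Only `observe` and `apply` change between the two.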

What This Means for Your AI Junior Developer

If you’re following my AI Junior Developer series, here’s the takeaway:

The technology exists today. Alibaba isn’t describing something that might become possible in five years. They’re running production agent workloads right now: 35-hour autonomous sessions that produce real engineering output.

Your AI junior developer doesn’t need to wait for better models. The stack is:

1. Orchestrator (Hermes Agent): manages the workflow
2. Code Agent (Claude Code, or potentially Qwen3.7-Max): does the actual work
3. Memory (CLAUDE.md): accumulates project knowledge
4. Tools (MCP servers): gives the agent hands to work with

The limiting factor isn’t the model anymore. It’s:
– How well you define the task
– How good your verification loop is
– Whether you have human-in-the-loop at the right checkpoints
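As a rough sketch of what a verification loop with a human checkpoint can look like, the gate below rejects a change if any automated check fails and escalates large changes to a person. The names (`Change`, `review_gate`) and the five-file threshold are assumptions I picked for illustration, not part of any tool mentioned here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Change:
    description: str
    files_touched: int
    tests_pass: bool


def review_gate(
    change: Change,
    checks: dict[str, Callable[[Change], bool]],
    max_files_without_human: int = 5,
) -> str:
    """Run automated checks first, then decide whether a human must look."""
    failed = [name for name, check in checks.items() if not check(change)]
    if failed:
        return "rejected: " + ", ".join(failed)
    if change.files_touched > max_files_without_human:
        return "needs human review"
    return "approved"


# Hypothetical checks; real ones would run a test suite, linter, and so on.
checks = {
    "tests": lambda c: c.tests_pass,
    "has description": lambda c: bool(c.description.strip()),
}

small = review_gate(Change("fix rounding bug", 2, True), checks)
large = review_gate(Change("refactor billing module", 14, True), checks)
broken = review_gate(Change("speed up export", 1, False), checks)
print(small, "|", large, "|", broken)
```

Where you set the threshold is the real design decision: too low and you review everything by hand, too high and large changes merge unseen.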

The Real Shift

Here’s what I’m realizing after building my own AI agent and seeing what Alibaba shipped:

We’re not building “AI assistants” anymore. We’re building AI employees.

An assistant helps you write code. An employee writes the code for you and opens a PR while you sleep.

The difference isn’t semantic — it’s architectural:

Assistant	Employee
Waits for your next prompt	Works autonomously on a goal
Single-turn tasks	Multi-hour workflows
You manage context	Agent manages its own context
You verify every line	You review the final PR

Qwen3.7-Max running for 35 hours is an employee. My Hermes + Claude Code setup is an employee. And if you’re building this stuff, you’re not a developer anymore — you’re a manager of AI engineers.

The Cost Question

Everyone asks: “But how much does this cost?”

Alibaba didn’t disclose, but we can estimate:
– Qwen3.7-Max via API: ~$0.50-1.00 per 1K tokens (guessing based on comparable models)
– 35 hours of agent work: probably 50K-100K tokens
– Total cost: ~$25-100 for a week’s worth of senior engineer work
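The arithmetic behind that range is short enough to write out. This uses only the guessed figures above. `run_cost` is a throwaway helper, and neither the price nor the token count is a disclosed number.

```python
def run_cost(tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost in USD for a run that consumes `tokens` tokens."""
    return tokens / 1000 * usd_per_1k_tokens


# Low end: fewest tokens at the cheapest guessed price.
low = run_cost(50_000, 0.50)
# High end: most tokens at the priciest guessed price.
high = run_cost(100_000, 1.00)

print(f"${low:.0f} to ${high:.0f}")  # $25 to $100
```

Both inputs are guesses and the result scales linearly with each, so a run that used ten times the tokens would cost ten times as much.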

My setup (DeepSeek V4 Pro + Claude Sonnet) runs about $0.30 per ticket. For 2-3 tickets a day, that’s roughly $18-27 a month.

The ROI is obvious.

What’s Next

I’m watching Qwen3.7-Max closely. If it’s as good at agent work as the benchmarks suggest, I might swap it in as the code executor for my AI junior developer — replacing Claude Code with a Qwen-based agent.

That’s a future post in this series. For now, the takeaway is simple:

Autonomous AI agents aren’t coming. They’re here.

Alibaba’s model coded for 35 hours without human help. My agent fixes ERP bugs while I sleep. What’s stopping you from building one?

This post is part of the “AI Junior Developer” series. Next: [Post 3] Setting Up Claude Code — CLAUDE.md mastery, skills, and subagents.
