Alibaba’s AI Just Coded for 35 Hours Straight Without Human Help

Last week, Alibaba’s Qwen team dropped something wild: their new model, Qwen3.7-Max, ran autonomously for 35 hours straight, optimizing code for their custom AI chip. No human intervention. No breaks. Just an AI agent doing engineering work on its own.

This isn’t a chatbot that answers questions. This is an AI that does work — the kind of work that would normally take a senior engineer over a week.

And it’s directly relevant to the AI Junior Developer series I started yesterday.


What Actually Happened

Here’s the breakdown:

The Task: Optimize low-level code for Alibaba’s proprietary AI accelerator chip. Think the chip’s equivalent of CUDA kernels, plus memory management and instruction scheduling: the kind of work where a 5% improvement means millions in savings at scale.

The Agent: Qwen3.7-Max, a model built specifically for long-running autonomous tasks. Not a general chat model — an agent model.

The Runtime: 35 hours. The agent:
– Read the existing codebase
– Identified optimization opportunities
– Made changes
– Ran benchmarks
– Iterated based on results
– Repeated for 35 hours
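The loop above can be sketched as a benchmark-driven search: propose a change, measure it, keep it only if it is faster. This is a minimal sketch and not Alibaba’s actual harness. The `optimize` function, the `tile` and `unroll` parameters, and the synthetic `toy_benchmark` are all hypothetical stand-ins for a real codebase and a real benchmark run on hardware.

```python
import random
from typing import Callable, Dict, Tuple

Config = Dict[str, int]


def optimize(
    baseline: Config,
    propose: Callable[[Config], Config],
    benchmark: Callable[[Config], float],
    max_iterations: int,
) -> Tuple[Config, float, int]:
    """Keep a candidate change only if the benchmark says it is cheaper."""
    best, best_cost = baseline, benchmark(baseline)
    accepted = 0
    for _ in range(max_iterations):
        candidate = propose(best)        # make a change
        cost = benchmark(candidate)      # run the benchmark
        if cost < best_cost:             # iterate based on the result
            best, best_cost = candidate, cost
            accepted += 1
    return best, best_cost, accepted


def toy_benchmark(cfg: Config) -> float:
    # Hypothetical cost with its minimum at tile=64, unroll=4 (made-up numbers).
    return abs(cfg["tile"] - 64) + 10 * abs(cfg["unroll"] - 4)


def make_proposer(seed: int) -> Callable[[Config], Config]:
    """Build a proposer that nudges one parameter at a time (seeded for repeatability)."""
    rng = random.Random(seed)

    def propose(cfg: Config) -> Config:
        key = rng.choice(sorted(cfg))
        step = rng.choice([-8, 8]) if key == "tile" else rng.choice([-1, 1])
        changed = dict(cfg)              # copy, so the current best is never mutated
        changed[key] = max(1, cfg[key] + step)
        return changed

    return propose


if __name__ == "__main__":
    start = {"tile": 8, "unroll": 1}
    best, cost, accepted = optimize(start, make_proposer(0), toy_benchmark, 200)
    print(best, cost, accepted)
```

A real agent replaces `propose` with model-generated code edits and `benchmark` with timing runs on the chip, but the accept-or-discard decision has the same shape.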

The Result: Code optimizations that matched what human engineers would produce — but done in a day and a half instead of a week.


Why This Matters for AI Agents

There’s a key distinction here that most people miss: Qwen3.7-Max was built for agent work, not chat.

Most LLMs today are optimized for:
– Single-turn Q&A
– Short conversations (5-10 turns)
– Clear, immediate tasks

Agent work is different:
– Long horizon: Tasks that take hours or days
– Self-correction: The agent needs to recognize when something isn’t working and try a different approach
– State management: Remembering what was tried 10 hours ago
– Tool use: Running code, reading files, executing benchmarks
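State management and self-correction can be illustrated with a simple attempt log: the agent records each approach and its outcome, then uses that record to avoid retrying something that already failed. This is a hedged sketch, not how Qwen3.7-Max works internally. `AttemptLog`, `run_with_fallback`, and the approach names are hypothetical, and a real long-running agent would persist the log to disk instead of keeping it in memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class AttemptLog:
    """Record of which approaches were tried and whether they worked."""
    outcomes: Dict[str, bool] = field(default_factory=dict)

    def record(self, approach: str, succeeded: bool) -> None:
        self.outcomes[approach] = succeeded

    def next_untried(self, candidates: List[str]) -> Optional[str]:
        """Return the first candidate not attempted yet, or None if all were."""
        for approach in candidates:
            if approach not in self.outcomes:
                return approach
        return None


def run_with_fallback(
    candidates: List[str],
    try_approach: Callable[[str], bool],
    log: AttemptLog,
) -> Optional[str]:
    """Try untried approaches in order; stop at the first one that succeeds."""
    while True:
        approach = log.next_untried(candidates)
        if approach is None:
            return None                  # everything was tried; nothing worked
        succeeded = try_approach(approach)
        log.record(approach, succeeded)  # remember the outcome for later
        if succeeded:
            return approach


if __name__ == "__main__":
    log = AttemptLog()
    # Hypothetical approach names; only the last one "works" in this demo.
    plans = ["vectorize", "reorder_loads", "fuse_loops"]
    print(run_with_fallback(plans, lambda name: name == "fuse_loops", log))
    print(log.outcomes)
```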

Qwen3.7-Max is designed for the second category. And on benchmarks for autonomous agent tasks, it matches Claude Opus 4.6 — the current state-of-the-art for agent work.


The Benchmark Wars (Quick Context)

For context, here’s how the models stack up on agent benchmarks:

Model | Agent Score | Best For
Claude Opus 4.6 | ~92 | Complex reasoning, long-horizon tasks
Qwen3.7-Max | ~92 | Autonomous coding, hardware optimization
DeepSeek V4 Pro | ~88 | Cost-effective general tasks
Kimi K2.6 | ~85 | Long-context reading

I’ve been using DeepSeek V4 Pro for my Hermes orchestrator (cost efficiency), but for the actual code agent? This Qwen3.7-Max result is compelling.


The Robot Demo

The Qwen team didn’t stop at code optimization. They also demoed Qwen3.7-Max steering a four-legged robot — the kind you see from Boston Dynamics.

The model:
– Processed sensor data
– Made real-time movement decisions
– Adjusted gait based on terrain
– Recovered from stumbles

This is the same architecture as the code optimization agent: perceive → decide → act → iterate. The only difference is the output (motor commands vs. code commits).


What This Means for Your AI Junior Developer

If you’re following my AI Junior Developer series, here’s the takeaway:

The technology exists today. Alibaba isn’t doing something theoretically possible in 5 years. They’re running production agent workloads right now — 35-hour autonomous sessions that produce real engineering output.

Your AI junior developer doesn’t need to wait for better models. The stack is:

1. Orchestrator (Hermes Agent): manages the workflow
2. Code Agent (Claude Code, or potentially Qwen3.7-Max): does the actual work
3. Memory (CLAUDE.md): accumulates project knowledge
4. Tools (MCP servers): give the agent hands to work with

The limiting factor isn’t the model anymore. It’s:
– How well you define the task
– How good your verification loop is
– Whether you have human-in-the-loop at the right checkpoints
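A verification loop with a human checkpoint can be sketched as follows: the agent retries against an automated check, a passing patch goes to a person for review, and repeated failures are escalated. This is one possible shape under stated assumptions, not the setup used in this series. `work_ticket`, `TicketResult`, and the `generate` and `verify` callables are hypothetical placeholders for a code agent and a test suite.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TicketResult:
    status: str      # "ready_for_review" or "escalated"
    attempts: int
    patch: str


def work_ticket(
    task: str,
    generate: Callable[[str, List[str]], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> TicketResult:
    """Retry until the automated check passes, then hand off to a human."""
    feedback: List[str] = []
    patch = ""
    for attempt in range(1, max_attempts + 1):
        patch = generate(task, feedback)     # agent proposes a patch
        passed, message = verify(patch)      # e.g. run the test suite
        if passed:
            # Human checkpoint: a person still reviews the final PR.
            return TicketResult("ready_for_review", attempt, patch)
        feedback.append(message)             # feed the failure into the next try
    # Out of attempts: stop and ask a human instead of looping forever.
    return TicketResult("escalated", max_attempts, patch)


if __name__ == "__main__":
    # Hypothetical stand-ins: the second attempt is the one that passes.
    fake_generate = lambda task, feedback: f"fix-v{len(feedback) + 1}"
    fake_verify = lambda patch: (patch == "fix-v2", f"{patch} failed tests")
    print(work_ticket("Fix invoice rounding bug", fake_generate, fake_verify))
```

Capping `max_attempts` is the design choice that matters here: it bounds cost and turns a stuck agent into a visible escalation.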


The Real Shift

Here’s what I’m realizing after building my own AI agent and seeing what Alibaba shipped:

We’re not building “AI assistants” anymore. We’re building AI employees.

An assistant helps you write code. An employee writes the code for you and opens a PR while you sleep.

The difference isn’t semantic — it’s architectural:

Assistant | Employee
Waits for your next prompt | Works autonomously on a goal
Single-turn tasks | Multi-hour workflows
You manage context | Agent manages its own context
You verify every line | You review the final PR

Qwen3.7-Max running for 35 hours is an employee. My Hermes + Claude Code setup is an employee. And if you’re building this stuff, you’re not a developer anymore — you’re a manager of AI engineers.


The Cost Question

Everyone asks: “But how much does this cost?”

Alibaba didn’t disclose, but we can estimate:
– Qwen3.7-Max via API: ~$0.50-1.00 per 1K tokens (guessing based on comparable models)
– 35 hours of agent work: probably 50K-100K tokens
– Total cost: ~$25-100 for a week’s worth of senior engineer work
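The arithmetic behind that range is just tokens times price. The sketch below reproduces it using only the guessed figures above. The `run_cost` helper is hypothetical, and none of the inputs are published pricing or measured token counts.

```python
def run_cost(tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost in USD for a run that consumes `tokens` at a per-1K-token price."""
    return tokens / 1000 * usd_per_1k_tokens


# Inputs are the guesses from the estimate above, not disclosed numbers.
low = run_cost(50_000, 0.50)     # fewest tokens at the cheapest price
high = run_cost(100_000, 1.00)   # most tokens at the priciest price
print(f"${low:.0f} to ${high:.0f}")   # prints "$25 to $100"
```

Swap in real token counts and prices once they are known; the total scales linearly with both.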

My setup (DeepSeek V4 Pro + Claude Sonnet) runs about $0.30 per ticket. At 2-3 tickets a day, that’s $0.60-0.90 a day, or roughly $18-27 over a 30-day month.

The ROI is obvious.


What’s Next

I’m watching Qwen3.7-Max closely. If it’s as good at agent work as the benchmarks suggest, I might swap it in as the code executor for my AI junior developer — replacing Claude Code with a Qwen-based agent.

That’s a future post in this series. For now, the takeaway is simple:

Autonomous AI agents aren’t coming. They’re here.

Alibaba’s model coded for 35 hours without human help. My agent fixes ERP bugs while I sleep. What’s stopping you from building one?


This post is part of the “AI Junior Developer” series. Next: [Post 3] Setting Up Claude Code — CLAUDE.md mastery, skills, and subagents.

Related: Post 4: Building the Agent Team — Supervisor, Coder, Reviewer, QC.

