Alibaba’s AI Just Coded for 35 Hours Straight Without Human Help
Last week, Alibaba’s Qwen team dropped something wild: their new model Qwen3.7-Max ran autonomously for 35 hours straight, optimizing code for their custom AI chip. No human intervention. No breaks. Just an AI agent engineering itself.
This isn’t a chatbot that answers questions. This is an AI that does work — the kind of work that would normally take a senior engineer over a week.
And it’s directly relevant to the AI Junior Developer series I started yesterday.
What Actually Happened
Here’s the breakdown:
The Task: Optimize low-level code for Alibaba’s proprietary AI accelerator chip. Think of the work CUDA kernels do on Nvidia GPUs: compute kernels, memory management, instruction scheduling. This is the kind of work where a 5% improvement means millions in savings at scale.
The Agent: Qwen3.7-Max, a model built specifically for long-running autonomous tasks. Not a general chat model — an agent model.
The Runtime: 35 hours. The agent:
– Read the existing codebase
– Identified optimization opportunities
– Made changes
– Ran benchmarks
– Iterated based on results
– Repeated for 35 hours
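Alibaba hasn’t published the agent’s internals, so the following is only a minimal sketch of the change, benchmark, keep-or-discard cycle described above. The names `optimize`, `toy_benchmark`, and `toy_propose` are hypothetical, and the "benchmark" is a toy cost function standing in for a real hardware measurement.

```python
import random
from typing import Callable, List, Tuple


def optimize(
    initial: List[int],
    benchmark: Callable[[List[int]], float],
    propose: Callable[[List[int], random.Random], List[int]],
    iterations: int,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Keep a candidate change only when the benchmark score improves."""
    rng = random.Random(seed)
    best, best_score = list(initial), benchmark(initial)
    for _ in range(iterations):
        candidate = propose(best, rng)   # "made changes"
        score = benchmark(candidate)     # "ran benchmarks"
        if score < best_score:           # lower is better, e.g. latency
            best, best_score = candidate, score
    return best, best_score


def toy_benchmark(params: List[int]) -> float:
    # Hypothetical cost: pretend 8 is the ideal value for every parameter.
    return float(sum((p - 8) ** 2 for p in params))


def toy_propose(params: List[int], rng: random.Random) -> List[int]:
    # Nudge one randomly chosen parameter up or down by one.
    out = list(params)
    i = rng.randrange(len(out))
    out[i] += rng.choice([-1, 1])
    return out


start = [1, 20, 5]
best, score = optimize(start, toy_benchmark, toy_propose, iterations=500)
print(best, score)
```

A real agent would replace `toy_propose` with model-generated edits and `toy_benchmark` with a run on the target chip, but the accept-only-on-improvement structure stays the same.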
The Result: Code optimizations that matched what human engineers would produce — but done in a day and a half instead of a week.
Why This Matters for AI Agents
There’s a key distinction here that most people miss: Qwen3.7-Max was built for agent work, not chat.
Most LLMs today are optimized for:
– Single-turn Q&A
– Short conversations (5-10 turns)
– Clear, immediate tasks
Agent work is different:
– Long horizon: Tasks that take hours or days
– Self-correction: The agent needs to recognize when something isn’t working and try a different approach
– State management: Remembering what was tried 10 hours ago
– Tool use: Running code, reading files, executing benchmarks
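To make "state management" and "self-correction" concrete, here is a small sketch of an attempt journal that survives restarts and steers the agent away from approaches it already tried. `Attempt` and `Journal` are hypothetical names for illustration, not part of any Qwen or Claude API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Attempt:
    approach: str
    outcome: str  # e.g. "improved", "regressed", "failed"
    note: str = ""


@dataclass
class Journal:
    attempts: List[Attempt] = field(default_factory=list)

    def record(self, approach: str, outcome: str, note: str = "") -> None:
        self.attempts.append(Attempt(approach, outcome, note))

    def already_tried(self, approach: str) -> bool:
        return any(a.approach == approach for a in self.attempts)

    def next_untried(self, candidates: List[str]) -> Optional[str]:
        # Self-correction: skip anything the journal says was attempted.
        for candidate in candidates:
            if not self.already_tried(candidate):
                return candidate
        return None

    def to_json(self) -> str:
        return json.dumps([asdict(a) for a in self.attempts])

    @classmethod
    def from_json(cls, text: str) -> "Journal":
        return cls([Attempt(**row) for row in json.loads(text)])


journal = Journal()
journal.record("unroll inner loop", "regressed", "slower on the benchmark")

# Simulate a restart hours later: state is reloaded from serialized text.
restored = Journal.from_json(journal.to_json())
choice = restored.next_untried(["unroll inner loop", "fuse kernels"])
print(choice)  # fuse kernels
```

Keeping the journal outside the model’s context window is the design choice that matters here: the record of what failed does not depend on the conversation history still being available.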
Qwen3.7-Max is designed for the second category. And on benchmarks for autonomous agent tasks, it matches Claude Opus 4.6 — the current state-of-the-art for agent work.
The Benchmark Wars (Quick Context)
For context, here’s how the models stack up on agent benchmarks:
| Model | Agent Score | Best For |
| --- | --- | --- |
| Claude Opus 4.6 | ~92 | Complex reasoning, long-horizon tasks |
| Qwen3.7-Max | ~92 | Autonomous coding, hardware optimization |
| DeepSeek V4 Pro | ~88 | Cost-effective general tasks |
| Kimi K2.6 | ~85 | Long-context reading |
I’ve been using DeepSeek V4 Pro for my Hermes orchestrator (cost efficiency), but for the actual code agent? This Qwen3.7-Max result is compelling.
The Robot Demo
The Qwen team didn’t stop at code optimization. They also demoed Qwen3.7-Max steering a four-legged robot — the kind you see from Boston Dynamics.
The model:
– Processed sensor data
– Made real-time movement decisions
– Adjusted gait based on terrain
– Recovered from stumbles
This is the same architecture as the code optimization agent: perceive → decide → act → iterate. The only difference is the output (motor commands vs. code commits).
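The perceive, decide, act, iterate loop can be written once and reused across very different environments. The sketch below is an illustration under that assumption, not Alibaba’s implementation; `Environment`, `run_agent`, `Slope`, and `step_toward_zero` are all hypothetical names.

```python
from typing import Any, Callable, Optional, Protocol


class Environment(Protocol):
    def observe(self) -> Any: ...
    def apply(self, action: Any) -> None: ...


def run_agent(
    env: Environment,
    decide: Callable[[Any], Optional[Any]],
    max_steps: int,
) -> int:
    """Run perceive -> decide -> act until the policy returns None."""
    for step in range(max_steps):
        observation = env.observe()    # perceive
        action = decide(observation)   # decide
        if action is None:             # policy says the goal is reached
            return step
        env.apply(action)              # act, then iterate
    return max_steps


class Slope:
    """Toy environment: a position that should be driven to zero."""

    def __init__(self, position: int) -> None:
        self.position = position

    def observe(self) -> int:
        return self.position

    def apply(self, action: int) -> None:
        self.position += action


def step_toward_zero(position: int) -> Optional[int]:
    if position == 0:
        return None
    return -1 if position > 0 else 1


env = Slope(5)
steps = run_agent(env, step_toward_zero, max_steps=100)
print(steps, env.position)  # 5 0
```

Swapping `Slope` for a code repository or a robot changes what `observe` and `apply` do, while `run_agent` stays untouched.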
What This Means for Your AI Junior Developer
If you’re following my AI Junior Developer series, here’s the takeaway:
The technology exists today. Alibaba isn’t doing something theoretically possible in 5 years. They’re running production agent workloads right now — 35-hour autonomous sessions that produce real engineering output.
Your AI junior developer doesn’t need to wait for better models. The stack is:
1. Orchestrator (Hermes Agent): manages the workflow
2. Code Agent (Claude Code, or potentially Qwen3.7-Max): does the actual work
3. Memory (CLAUDE.md): accumulates project knowledge
4. Tools (MCP servers): gives the agent hands to work with
The limiting factor isn’t the model anymore. It’s:
– How well you define the task
– How good your verification loop is
– Whether you have human-in-the-loop at the right checkpoints
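As a rough sketch of what a verification loop with a human checkpoint could look like, the code below retries generation with feedback until checks pass, then hands the result to a reviewer. Everything here is hypothetical: `run_task`, `Result`, and the toy `generate`, `verify`, and `approve` callables stand in for a real code agent, test suite, and PR review.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Result:
    patch: str
    attempts: int
    approved: bool


def run_task(
    task: str,
    generate: Callable[[str, str], str],
    verify: Callable[[str], Tuple[bool, str]],
    approve: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[Result]:
    """Regenerate with feedback until verification passes, then ask a human."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        patch = generate(task, feedback)
        ok, feedback = verify(patch)   # automated checks, e.g. tests
        if ok:
            # Human-in-the-loop checkpoint: only verified work gets reviewed.
            return Result(patch, attempt, approve(patch))
    return None                        # give up and escalate


def generate(task: str, feedback: str) -> str:
    # Toy agent: adds tests only after being told they are missing.
    return task + (" +tests" if feedback else "")


def verify(patch: str) -> Tuple[bool, str]:
    if "+tests" in patch:
        return True, ""
    return False, "missing tests"


result = run_task("fix bug", generate, verify, approve=lambda patch: True)
print(result)
```

In this toy run the first attempt fails verification, the feedback is passed back, and the second attempt reaches the reviewer.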
The Real Shift
Here’s what I’m realizing after building my own AI agent and seeing what Alibaba shipped:
We’re not building “AI assistants” anymore. We’re building AI employees.
An assistant helps you write code. An employee writes the code for you and opens a PR while you sleep.
The difference isn’t semantic — it’s architectural:
| Assistant | Employee |
| --- | --- |
| Waits for your next prompt | Works autonomously on a goal |
| Single-turn tasks | Multi-hour workflows |
| You manage context | Agent manages its own context |
| You verify every line | You review the final PR |
Qwen3.7-Max running for 35 hours is an employee. My Hermes + Claude Code setup is an employee. And if you’re building this stuff, you’re not a developer anymore — you’re a manager of AI engineers.
The Cost Question
Everyone asks: “But how much does this cost?”
Alibaba didn’t disclose, but we can estimate:
– Qwen3.7-Max via API: ~$0.50-1.00 per 1K tokens (guessing based on comparable models)
– 35 hours of agent work: probably 50K-100K tokens
– Total cost: ~$25-100 for a week’s worth of senior engineer work
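The arithmetic behind that range is simple enough to check in a few lines. Both inputs are the guesses stated above, not published pricing or usage figures, and `api_cost` is a hypothetical helper.

```python
def api_cost(tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost in USD for a given token count at a per-1K-token price."""
    return tokens / 1000 * usd_per_1k_tokens


# Guessed inputs from the estimate above (not disclosed by Alibaba).
low = api_cost(50_000, 0.50)
high = api_cost(100_000, 1.00)
print(low, high)  # 25.0 100.0
```

The estimate scales linearly with token count, so if a 35-hour session consumes more tokens than guessed, the cost grows in the same proportion.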
My setup (DeepSeek V4 Pro + Claude Sonnet) runs about $0.30 per ticket. For 2-3 tickets a day, that works out to roughly $18-27 over a 30-day month.
The ROI is obvious.
What’s Next
I’m watching Qwen3.7-Max closely. If it’s as good at agent work as the benchmarks suggest, I might swap it in as the code executor for my AI junior developer — replacing Claude Code with a Qwen-based agent.
That’s a future post in this series. For now, the takeaway is simple:
Autonomous AI agents aren’t coming. They’re here.
Alibaba’s model coded for 35 hours without human help. My agent fixes ERP bugs while I sleep. What’s stopping you from building one?
This post is part of the “AI Junior Developer” series. Next: [Post 3] Setting Up Claude Code — CLAUDE.md mastery, skills, and subagents.
Related: Post 4: Building the Agent Team — Supervisor, Coder, Reviewer, QC.