May 28, 2026 · 5 min read
Last week, Alibaba’s Qwen team dropped something wild: their new model Qwen3.7-Max ran autonomously for 35 hours straight, optimizing code for their custom AI chip. No human intervention. No breaks. Just an AI agent doing engineering work on its own.
This isn’t a chatbot that answers questions. This is an AI that does work — the kind of work that would normally take a senior engineer over a week.
And it’s directly relevant to the AI Junior Developer series I started yesterday.
What Actually Happened
Here’s the breakdown:
The Task: Optimize low-level code for Alibaba’s proprietary AI accelerator chip. Think of that chip’s equivalent of CUDA kernels, plus memory management and instruction scheduling: the kind of work where a 5% improvement means millions in savings at scale.
The Agent: Qwen3.7-Max, a model built specifically for long-running autonomous tasks. Not a general chat model — an agent model.
The Runtime: 35 hours. The agent:
- Read the existing codebase
- Identified optimization opportunities
- Made changes
- Ran benchmarks
- Iterated based on results
- Repeated for 35 hours
The Result: Code optimizations that matched what human engineers would produce — but done in a day and a half instead of a week.
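To make that loop concrete, here is a minimal sketch of "make a change, run the benchmark, keep it only if it helps". It is not Alibaba's harness (they haven't published one that I know of). The `optimize`, `toy_benchmark`, and `toy_propose` names are hypothetical, and the "code" being optimized is just a pair of numbers standing in for tunable kernel parameters.

```python
import random
from typing import Callable


def optimize(
    candidate: list[float],
    benchmark: Callable[[list[float]], float],
    propose: Callable[[list[float], random.Random], list[float]],
    iterations: int,
    seed: int = 0,
) -> tuple[list[float], float, int]:
    """Keep a change only if the benchmark score improves (lower is better)."""
    rng = random.Random(seed)
    best, best_score = candidate, benchmark(candidate)
    accepted = 0
    for _ in range(iterations):
        trial = propose(best, rng)      # "made changes"
        score = benchmark(trial)        # "ran benchmarks"
        if score < best_score:          # "iterated based on results"
            best, best_score = trial, score
            accepted += 1
    return best, best_score, accepted


# Toy stand-ins (assumptions for illustration only): a synthetic "latency"
# function whose minimum sits at (3, -1), and a random small tweak.
def toy_benchmark(params: list[float]) -> float:
    return (params[0] - 3.0) ** 2 + (params[1] + 1.0) ** 2


def toy_propose(params: list[float], rng: random.Random) -> list[float]:
    return [p + rng.uniform(-0.5, 0.5) for p in params]


best, score, accepted = optimize(
    [0.0, 0.0], toy_benchmark, toy_propose, iterations=500
)
print(f"best={best}, score={score:.4f}, accepted {accepted} of 500 changes")
```

A real agent replaces `toy_propose` with an LLM editing source files and `toy_benchmark` with an actual benchmark run, but the accept-or-discard skeleton stays the same. That skeleton is also why a 35-hour run can't end up worse than where it started: regressions are never kept.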
Why This Matters for AI Agents
There’s a key distinction here that most people miss: Qwen3.7-Max was built for agent work, not chat.
Most LLMs today are optimized for:
- Single-turn Q&A
- Short conversations (5-10 turns)
- Clear, immediate tasks
Agent work is different:
- Long horizon: Tasks that take hours or days
- Self-correction: The agent needs to recognize when something isn’t working and try a different approach
- State management: Remembering what was tried 10 hours ago
- Tool use: Running code, reading files, executing benchmarks
Qwen3.7-Max is designed for the second category. And on benchmarks for autonomous agent tasks, it matches Claude Opus 4.6 — the current state-of-the-art for agent work.
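The state-management point is the easiest one to underestimate. One simple way to handle it, sketched below under my own assumptions, is an append-only journal on disk that the agent checks before trying something again. The `AttemptJournal` class, its file format, and the sample entries are all hypothetical. Nothing here comes from Qwen's implementation.

```python
import json
import tempfile
import time
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Attempt:
    approach: str
    outcome: str  # e.g. "improved", "regressed", "failed"
    note: str
    timestamp: float


class AttemptJournal:
    """Append-only JSON Lines log that survives restarts and context resets."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def record(self, approach: str, outcome: str, note: str = "") -> None:
        entry = Attempt(approach, outcome, note, time.time())
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(entry)) + "\n")

    def history(self) -> list[Attempt]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [Attempt(**json.loads(line)) for line in f if line.strip()]

    def already_tried(self, approach: str) -> bool:
        return any(a.approach == approach for a in self.history())

    def dead_ends(self) -> list[str]:
        return [a.approach for a in self.history() if a.outcome != "improved"]


# Made-up entries, purely for illustration.
journal_path = Path(tempfile.mkdtemp()) / "attempts.jsonl"
journal = AttemptJournal(journal_path)
journal.record("unroll inner loop x4", "improved", "benchmark got faster")
journal.record("swap tile size to 64", "regressed", "more cache misses")

# A fresh object reading the same file stands in for the agent resuming later.
resumed = AttemptJournal(journal_path)
print(resumed.already_tried("swap tile size to 64"))
print(resumed.dead_ends())
```

The design choice that matters is that memory lives outside the model's context window, so "what was tried 10 hours ago" is a file read rather than something the model has to keep in its head.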
The Benchmark Wars (Quick Context)
For context, here’s roughly how the models stack up on agent benchmarks (scores are approximate, higher is better):
| Model | Agent Score | Best For |
|---|---|---|
| Claude Opus 4.6 | ~92 | Complex reasoning, long-horizon tasks |
| Qwen3.7-Max | ~92 | Autonomous coding, hardware optimization |
| DeepSeek V4 Pro | ~88 | Cost-effective general tasks |
| Kimi K2.6 | ~85 | Long-context reading |
I’ve been using DeepSeek V4 Pro for my Hermes orchestrator (cost efficiency), but for the actual code agent? This Qwen3.7-Max result is compelling.
The Robot Demo
The Qwen team didn’t stop at code optimization. They also demoed Qwen3.7-Max steering a four-legged robot — the kind you see from Boston Dynamics.
The model:
- Processed sensor data
- Made real-time movement decisions
- Adjusted gait based on terrain
- Recovered from stumbles
This is the same architecture as the code optimization agent: perceive → decide → act → iterate. The only difference is the output (motor commands vs. code commits).
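Here is a rough sketch of what "same loop, different output" looks like as an interface. Everything in it is an assumption for illustration: `Loop`, `ToyRobot`, and the proportional correction are hypothetical and far simpler than a real gait controller.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")


@dataclass
class Loop(Generic[Obs, Act]):
    """perceive -> decide -> act, repeated. Only the three callables change."""

    perceive: Callable[[], Obs]
    decide: Callable[[Obs], Act]
    act: Callable[[Act], None]

    def run(self, steps: int) -> None:
        for _ in range(steps):
            observation = self.perceive()
            action = self.decide(observation)
            self.act(action)


class ToyRobot:
    """One-dimensional stand-in for a robot: a tilt we want to drive to zero."""

    def __init__(self, tilt: float) -> None:
        self.tilt = tilt

    def read_tilt(self) -> float:
        return self.tilt

    def apply_correction(self, correction: float) -> None:
        self.tilt += correction


robot = ToyRobot(tilt=8.0)
loop = Loop(
    perceive=robot.read_tilt,
    decide=lambda tilt: -0.5 * tilt,  # correct half of the error each step
    act=robot.apply_correction,
)
loop.run(steps=10)
print(robot.tilt)
```

For a coding agent you would plug in "read the benchmark output" for `perceive`, "pick the next edit" for `decide`, and "apply the patch" for `act`. The `run` method would not change.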
What This Means for Your AI Junior Developer
If you’re following my AI Junior Developer series, here’s the takeaway:
The technology exists today. Alibaba isn’t demonstrating something that might become possible in 5 years. They’re running production agent workloads right now: 35-hour autonomous sessions that produce real engineering output.
Your AI junior developer doesn’t need to wait for better models. The stack is:
- Orchestrator (Hermes Agent) — manages the workflow
- Code Agent (Claude Code, or potentially Qwen3.7-Max) — does the actual work
- Memory (CLAUDE.md) — accumulates project knowledge
- Tools (MCP servers) — gives the agent hands to work with
The limiting factor isn’t the model anymore. It’s:
- How well you define the task
- How good your verification loop is
- Whether you have human-in-the-loop at the right checkpoints
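The last two items are where I'd spend the effort, so here is a minimal sketch of a verification gate with an optional human checkpoint. The names (`run_checks`, `gate`, `GateResult`) and the stand-in check commands are my own assumptions. In a real setup the commands would be your test suite, linter, or benchmark script.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    passed: bool
    reason: str


def run_checks(commands: list[list[str]], timeout_s: int = 600) -> GateResult:
    """Run each verification command; stop at the first failure or timeout."""
    for cmd in commands:
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return GateResult(False, f"timed out: {' '.join(cmd)}")
        if proc.returncode != 0:
            return GateResult(False, f"failed: {' '.join(cmd)}")
    return GateResult(True, "all checks passed")


def gate(
    commands: list[list[str]],
    needs_human: bool,
    approve: Callable[[str], bool],
) -> GateResult:
    """Automated checks first; ask a human only if they pass and it's required."""
    result = run_checks(commands)
    if not result.passed:
        return result
    if needs_human and not approve(result.reason):
        return GateResult(False, "rejected at human checkpoint")
    return result


# Stand-in checks that only need the Python interpreter itself.
ok_check = [sys.executable, "-c", "assert 1 + 1 == 2"]
bad_check = [sys.executable, "-c", "raise SystemExit(1)"]

auto = gate([ok_check], needs_human=False, approve=lambda summary: False)
blocked = gate([ok_check, bad_check], needs_human=False, approve=lambda summary: True)
reviewed = gate([ok_check], needs_human=True, approve=lambda summary: False)
print(auto, blocked, reviewed, sep="\n")
```

Running the cheap automated checks before involving a person is deliberate: the human only sees work that already passes, which keeps checkpoints rare enough that people actually pay attention to them.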
The Real Shift
Here’s what I’m realizing after building my own AI agent and seeing what Alibaba shipped:
We’re not building “AI assistants” anymore. We’re building AI employees.
An assistant helps you write code. An employee writes the code for you and opens a PR while you sleep.
The difference isn’t semantic — it’s architectural:
| Assistant | Employee |
|---|---|
| Waits for your next prompt | Works autonomously on a goal |
| Single-turn tasks | Multi-hour workflows |
| You manage context | Agent manages its own context |
| You verify every line | You review the final PR |
Qwen3.7-Max running for 35 hours is an employee. My Hermes + Claude Code setup is an employee. And if you’re building this stuff, you’re not a developer anymore — you’re a manager of AI engineers.
The Cost Question
Everyone asks: “But how much does this cost?”
Alibaba didn’t disclose, but we can estimate:
- Qwen3.7-Max via API: ~$0.50-1.00 per 1K tokens (guessing based on comparable models)
- 35 hours of agent work: probably 50K-100K tokens
- Total cost: ~$25-100 for a week’s worth of senior engineer work
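The arithmetic behind that range, as a quick sketch. Both inputs are my guesses from the bullets above, not disclosed figures, and `run_cost_usd` is just a helper name I made up.

```python
def run_cost_usd(tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost of a run given total tokens and a per-1K-token price."""
    return tokens / 1000 * usd_per_1k_tokens


# Guessed inputs from the estimate above (assumptions, not published numbers).
low = run_cost_usd(tokens=50_000, usd_per_1k_tokens=0.50)
high = run_cost_usd(tokens=100_000, usd_per_1k_tokens=1.00)
print(f"${low:.0f} to ${high:.0f}")  # prints "$25 to $100"
```

The total scales linearly with both inputs, so if the real token count is 10x my guess, the bill is 10x too.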
My setup (DeepSeek V4 Pro + Claude Sonnet) runs about $0.30 per ticket. At 2-3 tickets a day, that’s roughly $18-27 over a 30-day month.
The ROI is obvious.
What’s Next
I’m watching Qwen3.7-Max closely. If it’s as good at agent work as the benchmarks suggest, I might swap it in as the code executor for my AI junior developer — replacing Claude Code with a Qwen-based agent.
That’s a future post in this series. For now, the takeaway is simple:
Autonomous AI agents aren’t coming. They’re here.
Alibaba’s model coded for 35 hours without human help. My agent fixes ERP bugs while I sleep. What’s stopping you from building one?
This post is part of the “AI Junior Developer” series. Next: [Post 3] Setting Up Claude Code — CLAUDE.md mastery, skills, and subagents.
Related: how to tame rogue AI agent processes on Linux and why shipping your AI agent beats over-optimizing it.