Autonomous coding agents — AI that takes a task description and produces working code without you writing a line — have gone from demos to production tools in the space of 18 months. Six are worth serious evaluation in 2026: Devin, Codex, Claude Code, Jules, Cosine, and OpenHands. This guide organizes them by workflow mode, gives you a decision framework, and is honest about what the benchmarks actually tell you.
First: a word on benchmarks
SWE-bench, SWE-Lancer, and similar benchmarks measure how often an agent can resolve a real task on a real codebase: GitHub issues in the case of SWE-bench, paid freelance jobs in the case of SWE-Lancer. The numbers are useful but incomplete:
- They measure success on well-specified, isolated issues — not the ambiguous, cross-cutting tasks that dominate real engineering work.
- A 70% success rate on SWE-bench does not mean the tool successfully closes 70% of your tickets. Real tickets are often underspecified.
- Published numbers vary widely with test conditions. Before comparing two scores, check whether each was measured "with scaffolding" (the agent is given hints) or "without scaffolding."
Use benchmarks to rule tools out, not to pick a winner. A tool at 30% on SWE-bench probably isn't ready for production. The gap between a tool at 65% and one at 72% is much smaller in practice than the numbers suggest.
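To make the 65% versus 72% comparison concrete, here is a rough sketch of how a headline score might translate to your own backlog. The blending model and every number other than the two benchmark scores are assumptions for illustration: it supposes that only a share of your tickets resemble benchmark tasks (`well_specified_share`) and that either tool resolves the remaining tickets at the same, much lower rate (`ambiguous_rate`).

```python
def expected_resolution_rate(benchmark_rate, well_specified_share, ambiguous_rate):
    """Blend a benchmark score with an assumed success rate on ambiguous tickets."""
    return (well_specified_share * benchmark_rate
            + (1 - well_specified_share) * ambiguous_rate)

# Hypothetical backlog: 40% of tickets look like benchmark tasks, and both
# tools resolve the remaining 60% at an assumed 20%.
SHARE, AMBIGUOUS = 0.40, 0.20

tool_a = expected_resolution_rate(0.65, SHARE, AMBIGUOUS)
tool_b = expected_resolution_rate(0.72, SHARE, AMBIGUOUS)

print(f"Tool A: {tool_a:.1%}, Tool B: {tool_b:.1%}, gap: {tool_b - tool_a:.1%}")
```

Under these assumptions the seven-point benchmark gap shrinks to under three points on the blended backlog. Substitute your own estimates for the share and the ambiguous-ticket rate; the point is the shape of the calculation, not the specific figures.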
Three workflow modes
The six agents split into three modes. Pick the mode that fits how your team works, then choose within it.
Mode 1: Async-and-review
You assign a task, do other work, and review a PR or set of changes when they're done. No real-time involvement.
Jules — Google's async agent for GitHub Issues. Free tier: 15 tasks/day. It shows you its implementation plan before writing a single line, so you can redirect it before code is committed. Powered by Gemini 2.5 Pro. Best for: developers who want to review the approach before it executes.
Codex — OpenAI's async cloud agent, bundled with ChatGPT Plus/Pro. Can run multiple tasks in parallel. Zero setup: if you already pay for ChatGPT, it's included. Best for: developers who want maximum parallelism and have a ChatGPT subscription.
Cosine — Purpose-built for ticket automation across GitHub, Jira, and Linear. Picks up issues from your tracker, implements them, and opens PRs. 72% on SWE-Lancer. Generous free trial (80 tasks, no credit card). Best for: engineering teams with high ticket volume who want to automate the boring tickets.
Which to pick:
- Free, see the plan first → Jules
- Already on ChatGPT, want parallel tasks → Codex
- Living in Jira/Linear, want tracker automation → Cosine
Mode 2: Interactive-and-steer
You're present while the agent works. You monitor its progress and redirect it mid-task if it goes off course. Closer to pair programming than delegation.
Claude Code — Anthropic's terminal-native agent. Runs in your terminal (also available in VS Code, JetBrains, Slack). Reads your full codebase, edits files, runs commands, iterates. You can watch, interrupt, and redirect. Best for: developers who want deep agentic capability with the option to stay in control.
Devin — Positioned as a full "AI software engineer." Works in a browser-based interface where you can watch it operate. Includes Windsurf IDE access. Shows its steps as it works. Best for: teams that want a broader autonomous engineer with transparency into the process.
Which to pick:
- Terminal-first, want VS Code/JetBrains integrations → Claude Code
- Want browser-based oversight + Windsurf IDE included → Devin
Mode 3: Self-hosted
You run the agent on your own infrastructure, with your own models, on your own terms.
OpenHands — Open-source (MIT), runs locally or on your servers, supports any LLM via API. The choice when cloud processing of your code isn't acceptable or you want full control over the agent's environment. No SaaS subscription. Best for: platform teams, security-sensitive organizations, and developers who want maximum flexibility and cost control.
Which to pick:
- Need on-premise or want full control → OpenHands (only option in this category)
Decision matrix
| | Jules | Codex | Cosine | Claude Code | Devin | OpenHands |
|---|---|---|---|---|---|---|
| Free tier | 15 tasks/day | Via ChatGPT free | 80 tasks trial | Limited | Limited quota | Free (self-hosted) |
| Starting price | $20/mo (Google AI Pro) | $8/mo (Go via ChatGPT) | $20/mo | $20/mo (Claude Pro) | $20/mo | Free |
| Mode | Async | Async (parallel) | Async | Interactive | Interactive | Self-hosted |
| Plan preview | Yes | No | No | Partial | Yes | No |
| Tracker integration | GitHub | GitHub | GitHub + Jira + Linear | Manual | GitHub | GitHub |
| Parallel tasks | 3 concurrent (free) | Yes — multiple | Credit-based | No | No | No |
| Model | Gemini 2.5 Pro | OpenAI Codex | Cosine model | Claude Opus 4.6 | Multiple | Any (BYOK) |
| On-premise | No | No | No | No | No | Yes |
Five questions to narrow your choice
1. Can your code leave your network? If no: OpenHands only.
2. Do you have a ChatGPT subscription? If yes: Codex is already available — try it before adding another subscription.
3. Where do your tickets live? GitHub Issues only → any tool works. Jira or Linear → Cosine has native integration others lack.
4. Do you want to watch and steer? Yes → Claude Code or Devin. No → Jules, Codex, or Cosine.
5. What's your budget? Free to start → Jules (15 tasks/day) or OpenHands. Under $20/mo → Codex Go ($8) or Cosine Hobby. Full budget → Claude Code Max or Devin Max for power users.
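The first four questions can be sketched as a small function. This is a simplified illustration, not a complete selector: the function name, parameter names, and tracker labels are hypothetical, and the budget question is left out because it depends on pricing that changes.

```python
def shortlist(code_can_leave_network, has_chatgpt, tracker, wants_to_steer):
    """Return candidate agents by walking the questions above.

    tracker is one of "github", "jira", "linear" (hypothetical labels).
    """
    # Question 1: if code cannot leave your network, only one option remains.
    if not code_can_leave_network:
        return ["OpenHands"]

    # Question 4: interactive agents if you want to steer, async otherwise.
    if wants_to_steer:
        candidates = ["Claude Code", "Devin"]
    else:
        candidates = ["Jules", "Codex", "Cosine"]

    # Question 3: Jira or Linear narrows the async options to Cosine.
    if tracker in ("jira", "linear") and "Cosine" in candidates:
        candidates = ["Cosine"]

    # Question 2: an existing ChatGPT subscription moves Codex to the front.
    if has_chatgpt and "Codex" in candidates:
        candidates.remove("Codex")
        candidates.insert(0, "Codex")

    return candidates


print(shortlist(True, True, "github", False))
print(shortlist(True, False, "linear", False))
```

The ordering is a design choice: the network constraint is checked first because it is the only question that eliminates every hosted tool outright.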
What most teams actually do
Most teams that commit to agents settle on a two-layer setup:
Layer 1 — Interactive agent (Claude Code or Devin) for complex tasks that need steering: new features, architecture changes, anything where the spec isn't perfectly clear upfront.
Layer 2 — Async agent (Cosine or Jules) for routine tasks: dependency upgrades, test coverage gaps, small bug fixes, documentation updates. These can run overnight and be reviewed in the morning.
The common mistake is buying one tool and trying to use it for everything. Async agents are fast and cheap for well-scoped work; interactive agents are more reliable for ambiguous tasks.
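A team might encode the two-layer split as a simple routing rule. The sketch below is one possible version: the task-kind labels and the `spec_is_clear` flag are hypothetical names, and the rule (routine and well scoped goes async, everything else goes interactive) is an interpretation of the split described above rather than a fixed policy.

```python
# Routine task kinds named in the text; the string labels are hypothetical.
ROUTINE_KINDS = {"dependency_upgrade", "test_coverage", "small_bug_fix", "docs_update"}


def pick_layer(kind, spec_is_clear):
    """Send a task to the async layer only if it is routine and well scoped."""
    if kind in ROUTINE_KINDS and spec_is_clear:
        return "async"
    return "interactive"


tasks = [
    ("dependency_upgrade", True),
    ("new_feature", False),
    ("small_bug_fix", False),  # routine kind, but the spec is unclear
]
for kind, clear in tasks:
    print(kind, "->", pick_layer(kind, clear))
```

Defaulting to the interactive layer when in doubt reflects the trade-off in the text: a misrouted ambiguous task costs more in review time than a routine task handled interactively.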
