Round 2: LLM To-Do App Battle Royale - 13 Models Tested!

🚀 The LLM Arena Expands: Round 2 Results

After our previous battle of 7 LLMs, the AI landscape has evolved dramatically! We’re back with 13 models - including returning champions and exciting newcomers. The same To-Do app challenge, but with fresh competition and updated models. Let’s see who claims the crown! 👑


📊 Complete Leaderboard

| Rank | LLM | Speed | Cost | Quality | Overall |
|------|-----|-------|------|---------|---------|
| 🥇 | Gemini Flash 2.5-0520 | 10/10 | 9/10 | 9/10 | 9.3 |
| 🥈 | Devstral Small 2505 | 10/10 | 10/10 | 6/10 | 8.7 |
| 🥉 | Gemini Flash 2.5 experimental | 10/10 | 10/10 | 9/10 | 9.7 |
| 4 | Gemini Flash 2.0 | 10/10 | 10/10 | 8/10 | 9.3 |
| 5 | Llama 3.3 | 10/10 | 10/10 | 6/10 | 8.7 |
| 6 | DeepSeek V3 0324 | 6/10 | 7/10 | 8/10 | 7.0 |
| 7 | OpenAI GPT 4.1 | 7/10 | 6/10 | 9/10 | 7.3 |
| 8 | Sonnet 4.0 | 6/10 | 2/10 | 7/10 | 5.0 |
| 9 | Llama 4 Maverick | 4/10 | 9/10 | 6/10 | 6.3 |
| 10 | Mistral Large 24-11 | 6/10 | 10/10 | 5/10 | 7.0 |
| 11 | DeepSeek R1 | 3/10 | 6/10 | 7/10 | 5.3 |
| 12 | Claude Sonnet 3.7 | 5/10 | 2/10 | 8/10 | 5.0 |
| 13 | Gemini Pro Preview 05-06 | 7/10 | 2/10 | 8/10 | 5.7 |

🔍 Detailed Model Reviews

🥇 Gemini Flash 2.5-0520 - The New Champion!

  • Speed: 10/10 (215 tokens/s) ★★★★★★★★★★
  • Cost: 9/10 ($0.009) ★★★★★★★★★☆
  • Quality: 9/10 ★★★★★★★★★☆

🎯 The Perfect Balance: This model delivers exceptional performance across all metrics! Lightning-fast generation, reasonable cost, and outstanding code quality with comprehensive documentation and modern JavaScript practices. The drag-and-drop works flawlessly, and the dark mode implementation is top-notch. Minor deviations from prompt requirements prevent a perfect score.
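To make "top-notch dark mode" concrete, here's a minimal sketch of the usual pattern: toggle a class on `<body>` and persist the choice in `localStorage`. All names here (`nextTheme`, `#theme-toggle`, the `dark` class) are ours for illustration, not taken from any model's actual output.

```javascript
// Pure helper: compute the next theme. Easy to test in isolation.
function nextTheme(current) {
  return current === "dark" ? "light" : "dark";
}

// DOM wiring only runs in a browser.
if (typeof document !== "undefined") {
  const apply = (theme) =>
    document.body.classList.toggle("dark", theme === "dark");

  // Restore the saved preference on load, defaulting to light.
  apply(localStorage.getItem("theme") || "light");

  document.querySelector("#theme-toggle").addEventListener("click", () => {
    const theme = nextTheme(localStorage.getItem("theme") || "light");
    localStorage.setItem("theme", theme);
    apply(theme);
  });
}
```

Keeping the toggle logic in a pure function is what separated the cleaner submissions from the ones that scattered theme state across handlers.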


🥈 Devstral Small 2505 - Speed Demon

  • Speed: 10/10 (199 tokens/s) ★★★★★★★★★★
  • Cost: 10/10 ($0.00054) ★★★★★★★★★★
  • Quality: 6/10 ★★★★★★☆☆☆☆

⚡ Blazing Fast & Cheap: Incredible speed and cost efficiency, but significant drag-and-drop implementation issues hold it back. Great foundation with clean UI, but the broken interactive features prevent higher quality scores.
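For reference, here's a hedged sketch of the drag-and-drop reordering that several models fumbled. The reorder step is a pure function; the HTML5 DnD wiring below it is where submissions usually break (forgetting `preventDefault()` on `dragover` means the `drop` event never fires). The `#todo-list` id and function names are ours, not from any model's output.

```javascript
// Pure reorder: move the task at `from` to position `to`.
function reorder(tasks, from, to) {
  const next = tasks.slice();
  const [moved] = next.splice(from, 1);
  next.splice(to, 0, moved);
  return next;
}

// Browser-only wiring for HTML5 drag and drop.
if (typeof document !== "undefined") {
  let tasks = ["Buy milk", "Walk dog", "Write post"];
  let dragIndex = null;

  document.querySelectorAll("#todo-list li").forEach((li, i) => {
    li.draggable = true;
    li.addEventListener("dragstart", () => { dragIndex = i; });
    // Without preventDefault here, the drop event never fires.
    li.addEventListener("dragover", (e) => e.preventDefault());
    li.addEventListener("drop", () => {
      tasks = reorder(tasks, dragIndex, i);
      // re-render the list from `tasks` here
    });
  });
}
```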


🥉 Gemini Flash 2.5 experimental - Returning Champion

  • Speed: 10/10 (155 tokens/s) ★★★★★★★★★★
  • Cost: 10/10 ($0.0042) ★★★★★★★★★★
  • Quality: 9/10 ★★★★★★★★★☆

🏆 Consistent Excellence: Our previous top performer maintains its high standards! Excellent prompt adherence, clean code structure, and smooth functionality. Still one of the best all-around choices for developers.


Gemini Flash 2.0 - Speed & Value King

  • Speed: 10/10 (177.5 tokens/s) ★★★★★★★★★★
  • Cost: 10/10 ($0.0024) ★★★★★★★★★★
  • Quality: 8/10 ★★★★★★★★☆☆

💨 Fast & Furious: Excellent speed and cost metrics with solid functionality. The drag-and-drop has some UX limitations (only works on specific elements), but overall delivers great value for money.


Llama 3.3 - The Speedster

  • Speed: 10/10 (140 tokens/s) ★★★★★★★★★★
  • Cost: 10/10 ($0.0038) ★★★★★★★★★★
  • Quality: 6/10 ★★★★★★☆☆☆☆

🚀 Need for Speed: Maintains its reputation for blazing speed and ultra-low cost, but still struggles with incomplete drag-and-drop functionality and code repetition issues. Great for rapid prototyping!


DeepSeek V3 0324 - Quality Focused

  • Speed: 6/10 (82.59 tokens/s) ★★★★★★☆☆☆☆
  • Cost: 7/10 ($0.0102) ★★★★★★★☆☆☆
  • Quality: 8/10 ★★★★★★★★☆☆

🎨 Quality Over Speed: Delivers well-structured, maintainable code with good practices. Slower generation and slightly higher cost, but the code quality justifies the premium for production applications.


OpenAI GPT 4.1 - Feature Rich

  • Speed: 7/10 (109.25 tokens/s) ★★★★★★★☆☆☆
  • Cost: 6/10 ($0.041) ★★★★★★☆☆☆☆
  • Quality: 9/10 ★★★★★★★★★☆

🌟 Premium Experience: Goes above and beyond with toast notifications, accessibility features, and demo data. Excellent code quality and comprehensive features, but at a premium price point.


Sonnet 4.0 - The Overachiever

  • Speed: 6/10 (90.4 tokens/s) ★★★★★★☆☆☆☆
  • Cost: 2/10 ($0.108) ★★☆☆☆☆☆☆☆☆
  • Quality: 7/10 ★★★★★★★☆☆☆

💰 Premium But Pricey: Well-structured class-based architecture with extra features, but the cost is prohibitive and the output doesn’t precisely follow the prompt requirements. Good for complex applications where budget isn’t a concern.


Llama 4 Maverick - The Inconsistent

  • Speed: 4/10 (64.85 tokens/s) ★★★★☆☆☆☆☆☆
  • Cost: 9/10 ($0.0032) ★★★★★★★★★☆
  • Quality: 6/10 ★★★★★★☆☆☆☆

⚠️ Mixed Results: Good cost efficiency, but a broken dark mode implementation and slow generation. The core functionality works, but the polish issues keep it off our recommended list.


Mistral Large 24-11 - The Newcomer

  • Speed: 6/10 (99 tokens/s) ★★★★★★☆☆☆☆
  • Cost: 10/10 ($0.0007653) ★★★★★★★★★★
  • Quality: 5/10 ★★★★★☆☆☆☆☆

💸 Budget Champion: Incredibly cost-effective with decent speed, but significant bugs in drag-and-drop and dark mode implementation. Good foundation that needs refinement.


DeepSeek R1 - The Traditionalist

  • Speed: 3/10 (52.55 tokens/s) ★★★☆☆☆☆☆☆☆
  • Cost: 6/10 ($0.034) ★★★★★★☆☆☆☆
  • Quality: 7/10 ★★★★★★★☆☆☆

🐌 Slow But Steady: Solid functionality and good prompt adherence, but it leans heavily on outdated inline event handlers and generates code slowly. Everything works; it just isn’t written to modern standards.
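To show what we mean by "outdated inline event handlers": markup like `<button onclick="deleteTask(3)">x</button>` couples HTML to global functions. The modern fix is one delegated listener plus a pure state function. This is a sketch of the pattern we score for, not R1's actual code; every name here is illustrative.

```javascript
// Pure state transition: returns a new task array, never mutates.
function handleAction(tasks, action, index) {
  if (action === "delete") return tasks.filter((_, i) => i !== index);
  if (action === "toggle") {
    return tasks.map((t, i) => (i === index ? { ...t, done: !t.done } : t));
  }
  return tasks;
}

// Browser-only wiring: one listener on the list handles every button
// via data-* attributes, instead of an onclick per element.
if (typeof document !== "undefined") {
  let tasks = [];
  document.querySelector("#todo-list").addEventListener("click", (e) => {
    const btn = e.target.closest("button[data-action]");
    if (!btn) return;
    tasks = handleAction(tasks, btn.dataset.action, Number(btn.dataset.index));
  });
}
```

Beyond style points, the delegated version keeps working when tasks are added dynamically, which is exactly where inline-handler submissions tend to break.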


Claude Sonnet 3.7 - The Expensive Expert

  • Speed: 5/10 (79.69 tokens/s) ★★★★★☆☆☆☆☆
  • Cost: 2/10 ($0.06966) ★★☆☆☆☆☆☆☆☆
  • Quality: 8/10 ★★★★★★★★☆☆

💎 Quality at a Price: Excellent code structure and comprehensive features, but the cost is 13x over budget! Great for enterprise applications where cost isn’t the primary concern.


Gemini Pro Preview 05-06 - The Preview Problem

  • Speed: 7/10 (104 tokens/s) ★★★★★★★☆☆☆
  • Cost: 2/10 ($0.081) ★★☆☆☆☆☆☆☆☆
  • Quality: 8/10 ★★★★★★★★☆☆

🔬 Preview Pricing: High-quality implementation with excellent prompt adherence, but preview pricing makes it impractical for production use. Shows promise for future releases!


🏆 Winners & Losers

  • Big Winner: Gemini Flash 2.5-0520 takes the crown with the best overall balance
  • Speed Champions: Multiple models now hit 150+ tokens/s (vs. just 2 in Round 1)
  • Cost Kings: Several models achieve sub-$0.005 costs (major improvement!)
  • Quality Leaders: 9/10 scores are now achievable (up from 8/10 max in Round 1)

📊 Compared to Round 1:

  • Speed Improvements: Average speed increased from ~120 to ~140 tokens/s
  • Cost Reductions: More models hitting the $0.005 sweet spot
  • Quality Gains: Better prompt adherence and fewer critical bugs
  • New Players: 6 new models joined the competition

💡 Best Use Cases:

  • Rapid Prototyping: Llama 3.3, Devstral Small 2505
  • Production Apps: Gemini Flash 2.5-0520, Gemini Flash 2.5 experimental
  • Budget Projects: Mistral Large 24-11, Llama 3.3
  • Enterprise: OpenAI GPT 4.1, Claude Sonnet 3.7 (if budget allows)

🎯 Final Recommendations

🥇 Overall Champion

Gemini Flash 2.5-0520 - The perfect balance of speed, cost, and quality. Your go-to choice for most projects.

💰 Best Value

Gemini Flash 2.5 experimental - Proven track record with an excellent cost-to-quality ratio.

⚡ Speed Demon

Devstral Small 2505 - When you need code fast and cheap, despite some quality trade-offs.

🏢 Enterprise Choice

OpenAI GPT 4.1 - Premium features and quality justify the higher cost for complex applications.


🔮 Looking Ahead

The LLM landscape continues to evolve rapidly! We’ve seen significant improvements in speed and cost efficiency since Round 1, with more models achieving our quality benchmarks. The competition is heating up, and we can expect even better performance in future rounds.

What’s next? We’re planning Round 3 with even more challenging prompts and newer model releases. Stay tuned! 🚀


Which model surprised you the most? Have you tried any of these for your own projects? Let us know in the comments below! 💬