Testing shows ChatGPT 5.5 performing strongly in isolated command-line tool tasks but struggling with extended, multi-step software engineering problems. Results from Terminal-Bench 2.0 and SWE-Bench ...