Testing shows ChatGPT 5.5 performing strongly in isolated command-line tool tasks but struggling with extended, multi-step software engineering problems. Results from Terminal-Bench 2.0 and SWE-Bench ...