A new chess-based benchmark, ChessArena, has revealed that large language models (LLMs) still struggle with genuine strategic reasoning: none surpasses human amateur level, and some lose to ...