DESCRIPTION – A framework and report for comparing LLMs across key dimensions: Speed (time to first token, tokens per second/throughput), Efficiency (tokens consumed in/out, context-window utilisation, pricing model), and Risk (temperature, confidence/log probabilities, hallucination rate). The ultimate goal is to determine which model performs best, and why, by auditing every chess move with the above metadata to analyse each model's deductive and inductive reasoning and style of output.
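To make the three dimensions concrete, here is a minimal TypeScript sketch of what a per-move audit record and a simple aggregate might look like. The type and field names (`MoveAudit`, `meanThroughput`, etc.) are illustrative assumptions, not the app's actual schema.

```typescript
// Hypothetical shape of the per-move audit record (illustrative field names).
interface MoveAudit {
  model: string;             // e.g. "gpt-4o" or "claude-3-5-sonnet"
  moveSan: string;           // the move in Standard Algebraic Notation
  // Speed
  timeToFirstTokenMs: number;
  tokensPerSecond: number;
  // Efficiency
  promptTokens: number;
  completionTokens: number;
  contextWindowUsed: number; // fraction of the model's context window consumed
  costUsd: number;           // derived from the provider's pricing model
  // Risk
  temperature: number;
  avgLogProb?: number;       // mean log probability of output tokens, if the API exposes it
  hallucinated: boolean;     // e.g. the proposed move was illegal for the current board
}

// Example aggregate: mean tokens-per-second for one model across a set of audited moves.
function meanThroughput(audits: MoveAudit[], model: string): number {
  const rows = audits.filter((a) => a.model === model);
  if (rows.length === 0) return 0;
  return rows.reduce((sum, a) => sum + a.tokensPerSecond, 0) / rows.length;
}
```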
Tooling – I built this app (see iframe below) and deployed it as a Docker container on Google Cloud Platform using Cloud Run. Core technologies include React (frontend framework), TypeScript and JavaScript (languages), Vite (build tool), Tailwind CSS (styling), chess.js (game logic), and several supporting libraries. All results and log files are stored on the server. Note: you will need to provide your own API keys to run the app, as running these models in autoplay is not cheap.
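As a rough sketch of how chess.js can anchor the audit, the snippet below validates an LLM-proposed move and flags an illegal move as one concrete, automatically checkable hallucination signal. The `askModelForMove` helper is a hypothetical stand-in for the app's provider calls, not its actual code.

```typescript
import { Chess } from "chess.js";

// Hypothetical provider call: streams a move for the given FEN position
// and reports timing/token metadata alongside it.
type AskModelForMove = (fen: string) => Promise<{
  moveSan: string;
  timeToFirstTokenMs: number;
  tokensPerSecond: number;
}>;

async function auditOneMove(chess: Chess, askModelForMove: AskModelForMove) {
  const reply = await askModelForMove(chess.fen());

  // chess.js enforces legality: newer versions throw on an illegal move,
  // older versions return null, so handle both.
  let legal = true;
  try {
    legal = chess.move(reply.moveSan) !== null;
  } catch {
    legal = false;
  }

  return { ...reply, hallucinated: !legal };
}
```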
GCP Serverless App URL: