The geometric mean, as used by the Computer Language Benchmarks Game,
gives too much weight to outliers and produces unstable rankings
that poorly reflect overall performance consistency. The Elo rating
system can be applied to the same results to obtain a better ranking,
one where the influence of small speed perturbations is minimized and
all-round performance is favored.
Even though the Great Computer Language Shootout is now called the
Computer Language Benchmarks Game, its
results are often misused to compare the performance of language
implementations (and, even worse, of the languages themselves) by
merely comparing the scores without any further thought. Many of the problems
stem from the choice of the geometric mean of the execution time relative to
the fastest entry for each benchmark and implementation. More on that below.
If it's a game, why not use an actual rating system like those used in chess
or go? I use the Elo rating system.
A win (see below for the definition of what constitutes a win) is worth one
point, a draw 0.5 and a loss 0, and the expected score of program A against
program B is

    E_A = 1 / (1 + 10^((R_B - R_A) / 500))

where R_X represents the "strength" (current rating) of
the language implementation X.
All the possible pairings of benchmark implementations are considered, and the
ratings are updated at the end of each round (this is essential to avoid
dependence on the order of the pairings) with

    R_A' = R_A + K (S_A - E_A)
    R_B' = R_B - K (S_A - E_A)

where S_A is the actual score of A against B.
(K is a suitably small coefficient to avoid oscillations.)
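Using the constants from the script given below (K = 5 and a 500-point spread), the update step can be sketched in a few lines of OCaml; the `update` helper here is an illustrative name for what the full script calls new_ratings:

```ocaml
(* Expected score of a player rated ra against one rated rb,
   on the 500-point logistic scale used by the script below. *)
let expected_score ra rb = 1. /. (1. +. 10. ** ((rb -. ra) /. 500.))

let k = 5.  (* small K-factor, to avoid oscillations *)

(* Symmetric Elo update: A gains k * (actual - expected), B loses the same. *)
let update ra rb actual =
  let d = actual -. expected_score ra rb in
    (ra +. k *. d, rb -. k *. d)

(* Two evenly matched programs have expected score 0.5, so a win
   moves each rating by k / 2 = 2.5 points. *)
let () =
  let ra', rb' = update 1000. 1000. 1. in
    Printf.printf "%.1f %.1f\n" ra' rb'  (* prints: 1002.5 997.5 *)
```

Note that the update is zero-sum: whatever A gains, B loses, so the average rating stays constant across a round.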
By choosing the scoring function carefully, the resulting ratings admit
an interesting interpretation and allow one to draw more meaningful conclusions
than the geometric mean.
Here follow the Elo ratings for the x64 Ubuntu (Intel Q6600 quad-core) data as of
2008-09-01, where a program is considered the winner when it runs in at most
half as much time as its opponent. (The programs used to generate them and the
justification of this scoring function are given below.)
This ranking includes languages omitted in the
original one
and the relative order differs a bit. This is because this rating reflects
something different, arguably more interesting.
Source code
I use a small Ruby script (scrap.rb) to screen-scrape the
shootout site and dump the data in a JSON format that looks like
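The dump itself is not reproduced here, but given the shootout_data type declared in the OCaml code below (benchmark → implementation → time), the JSON is an object of this rough shape; the benchmark names and timings in this fragment are made up for illustration:

```json
{
  "binary-trees": { "OCaml": 1.23, "Ruby": 45.6 },
  "nbody":        { "OCaml": 2.34, "Python": 67.8 }
}
```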
open Printf

module H = ExtHashtbl.Hashtbl
module L = ExtList.List

let (|>) x f = f x

let init_rating = 1000.
let k = 5.                  (* small K-factor, to avoid oscillations *)
let rounds = 100
let win_ratio = 1.0 /. 2.0  (* a win = at most half the opponent's time *)

(* actual score: 1 = win, 0 = loss, 0.5 = draw *)
let actual_score ta tb =
  if ta <= win_ratio *. tb then 1.
  else if tb <= win_ratio *. ta then 0.
  else 0.5

(* logistic expected score, on a 500-point scale *)
let expected_score ra rb = 1. /. (1. +. 10. ** ((rb -. ra) /. 500.))

let new_ratings ra rb ta tb =
  let d = actual_score ta tb -. expected_score ra rb in
    (ra +. k *. d, rb -. k *. d)

let rating_deltas ra rb ta tb =
  let r1, r2 = new_ratings ra rb ta tb in
    (r1 -. ra, r2 -. rb)

(* benchmark -> (implementation -> time); json-static generates the
   shootout_data_of_json conversion function from this declaration *)
type json shootout_data = (string * (string * float) assoc) assoc

let get t lang = H.find_default t lang init_rating
let inc t v k delta = H.replace t k (H.find_default t k v +. delta)

let run data =
  let ratings = H.create 13 in
  let do_match deltas (lang1, t1) (lang2, t2) =
    if lang1 <> lang2 then
      let d1, d2 = rating_deltas (get ratings lang1) (get ratings lang2) t1 t2 in
        inc deltas 0. lang1 d1;
        inc deltas 0. lang2 d2 in
  let round () =
    (* accumulate the deltas for all the pairings, then apply them at once,
       so the result does not depend on the order of the matches *)
    let ds = H.create 13 in
      L.iter (fun (_, rs) -> L.iter (fun l1 -> L.iter (do_match ds l1) rs) rs) data;
      H.iter (inc ratings init_rating) ds
  in
    for i = 1 to rounds do round () done;
    H.fold (fun lang r l -> (r, lang) :: l) ratings [] |> L.sort |> L.rev |>
      L.iter (fun (rating, lang) -> printf "%16s %5.0f\n" lang rating)

let () = match Array.length Sys.argv with
    2 -> Json_io.load_json Sys.argv.(1) |> shootout_data_of_json |> run
  | _ -> printf "elo_rating <file>\n";
         exit 0
The only remarkable thing is the use of the
json-static syntax extension
to load the JSON data type-safely with a single line of code, which declares
the type and generates the shootout_data_of_json function used to verify and
load the JSON data:
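```ocaml
type json shootout_data = (string * (string * float) assoc) assoc
```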