The geometric mean, as used by the Computer Language Benchmarks Game,
gives too much weight to outliers and produces unstable rankings
that poorly reflect overall performance consistency. The Elo rating
system can be applied to the same results to obtain a better ranking,
one where the influence of small speed perturbations is minimized and
all-round performance is favored.
Even though the Great Computer Language Shootout is now called the
Computer Language Benchmarks Game, its
results are often misused to compare the performance of language
implementations (and, even worse, of the languages themselves) by
merely comparing the scores without any further thought. Many of the problems
stem from the choice of the geometric mean of the execution time relative to
the fastest entry for each benchmark and implementation. More on that below.
If it's a game, why not use an actual rating system like those used in chess
or go? I use the Elo rating system.
A win (see below for the definition of what constitutes a win) is worth one
point, a draw 0.5 and a loss 0, and the expected score of program A against
program B is

    E_A = 1 / (1 + 10^((R_B - R_A) / 500))

where R_X represents the "strength" (current rating) of
the language implementation X.
All the possible pairings of benchmark implementations are considered, and the
ratings are updated at the end of each round (this is essential to avoid
dependence on the order of the pairings) with

    R_A' = R_A + K (S_A - E_A)
    R_B' = R_B - K (S_A - E_A)

where S_A is the actual score of A against B.
(K is a suitably small coefficient to avoid oscillations.)
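Using the constants from the script given below (K = 5 and a 500-point spread), the update step can be sketched in a few lines of OCaml; the `update` helper here is an illustrative name for what the full script calls new_ratings:

```ocaml
(* Expected score of a player rated ra against one rated rb,
   on the 500-point logistic scale used by the script below. *)
let expected_score ra rb = 1. /. (1. +. 10. ** ((rb -. ra) /. 500.))

let k = 5.  (* small K-factor, to avoid oscillations *)

(* Symmetric Elo update: A gains k * (actual - expected), B loses the same. *)
let update ra rb actual =
  let d = actual -. expected_score ra rb in
    (ra +. k *. d, rb -. k *. d)

(* Two evenly matched programs have expected score 0.5, so a win
   moves each rating by k / 2 = 2.5 points. *)
let () =
  let ra', rb' = update 1000. 1000. 1. in
    Printf.printf "%.1f %.1f\n" ra' rb'  (* prints: 1002.5 997.5 *)
```

Note that the update is zero-sum: whatever A gains, B loses, so the average rating stays constant across a round.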
By choosing the scoring function carefully, the resulting ratings admit
an interesting interpretation and allow one to draw more meaningful conclusions
than the geometric mean.
Here follow the Elo ratings for the x64 Ubuntu (Intel Q6600 quad-core) data as of
2008-09-01, where a program is considered the winner when it runs in at most
half as much time as its opponent. (The programs used to generate them and the
justification of this scoring function are given below.)
This ranking includes languages omitted in the
original one
and the relative order differs a bit. This is because this rating reflects
something different, arguably more interesting.
Source code
I use a small Ruby script (scrap.rb) to screen-scrape the
shootout site and dump the data in a JSON format that looks like
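The dump itself is not reproduced here, but given the shootout_data type declared in the OCaml code below (benchmark → implementation → time), the JSON is an object of this rough shape; the benchmark names and timings in this fragment are made up for illustration:

```json
{
  "binary-trees": { "OCaml": 1.23, "Ruby": 45.6 },
  "nbody":        { "OCaml": 2.34, "Python": 67.8 }
}
```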
open Printf

module H = ExtHashtbl.Hashtbl
module L = ExtList.List

let (|>) x f = f x

let init_rating = 1000.
let k = 5.                  (* small K-factor, to avoid oscillations *)
let rounds = 100
let win_ratio = 1.0 /. 2.0  (* a win = at most half the opponent's time *)

(* actual score: 1 = win, 0 = loss, 0.5 = draw *)
let actual_score ta tb =
  if ta <= win_ratio *. tb then 1.
  else if tb <= win_ratio *. ta then 0.
  else 0.5

(* logistic expected score, on a 500-point scale *)
let expected_score ra rb = 1. /. (1. +. 10. ** ((rb -. ra) /. 500.))

let new_ratings ra rb ta tb =
  let d = actual_score ta tb -. expected_score ra rb in
    (ra +. k *. d, rb -. k *. d)

let rating_deltas ra rb ta tb =
  let r1, r2 = new_ratings ra rb ta tb in
    (r1 -. ra, r2 -. rb)

(* benchmark -> (implementation -> time); json-static generates the
   shootout_data_of_json conversion function from this declaration *)
type json shootout_data = (string * (string * float) assoc) assoc

let get t lang = H.find_default t lang init_rating
let inc t v k delta = H.replace t k (H.find_default t k v +. delta)

let run data =
  let ratings = H.create 13 in
  let do_match deltas (lang1, t1) (lang2, t2) =
    if lang1 <> lang2 then
      let d1, d2 = rating_deltas (get ratings lang1) (get ratings lang2) t1 t2 in
        inc deltas 0. lang1 d1;
        inc deltas 0. lang2 d2 in
  let round () =
    (* accumulate the deltas for all the pairings, then apply them at once,
       so the result does not depend on the order of the matches *)
    let ds = H.create 13 in
      L.iter (fun (_, rs) -> L.iter (fun l1 -> L.iter (do_match ds l1) rs) rs) data;
      H.iter (inc ratings init_rating) ds
  in
    for i = 1 to rounds do round () done;
    H.fold (fun lang r l -> (r, lang) :: l) ratings [] |> L.sort |> L.rev |>
      L.iter (fun (rating, lang) -> printf "%16s %5.0f\n" lang rating)

let () = match Array.length Sys.argv with
    2 -> Json_io.load_json Sys.argv.(1) |> shootout_data_of_json |> run
  | _ -> printf "elo_rating <file>\n";
         exit 0
The only remarkable thing is the use of the
json-static syntax extension
to load the JSON data type-safely with a single line of code, which declares
the type and generates the shootout_data_of_json function used to verify and
load the JSON data:
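```ocaml
type json shootout_data = (string * (string * float) assoc) assoc
```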