The Artima Developer Community
Sponsored Link

Ruby Buzz Forum
Elo ratings for the Benchmarks Game (aka Great Computer Language Shootout)

0 replies on 1 page.

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 0 replies on 1 page
Eigen Class

Posts: 358
Nickname: eigenclass
Registered: Oct, 2005

Eigenclass is a hardcore Ruby blog.
Elo ratings for the Benchmarks Game (aka Great Computer Language Shootout) Posted: Sep 1, 2008 3:50 AM
Reply to this message Reply

This post originated from an RSS feed registered with Ruby Buzz by Eigen Class.
Original Post: Elo ratings for the Benchmarks Game (aka Great Computer Language Shootout)
Feed Title: Eigenclass
Feed URL: http://feeds.feedburner.com/eigenclass
Feed Description: Ruby stuff --- trying to stay away from triviality.
Latest Ruby Buzz Posts
Latest Ruby Buzz Posts by Eigen Class
Latest Posts From Eigenclass

Advertisement

The geometric mean, as used by the Computer Language Benchmarks, gives too much importance to outliers and results in unstable rankings that do not reflect overall performance consistency well. The Elo rating system can be applied to the same results to obtain a better ranking where the influence of small speed perturbations is minimized and all-round performance is favored.

Even though the Great Computer Language Shootout is now called the Computer Language Benchmarks Game, its results are often used incorrectly to compare language implementations (and even worse, this is generalized to languages as well) performance-wise by merely comparing the scores without any further thought. Many of the problems stem from the choice of the geometric mean of the execution time compared to the fastest entry for each benchmark and implementation. More on that below.

If it's a game, why not use an actual rating system like those used in chess or go? I use the Elo rating system. A win (see below for the definition of what represents a win) is worth one point, a draw 0.5 and a loss 0, and the expected score of program A against program B is

1 over { 1 + 10 sup { (R sub B - R sub A ) / 500 } }

where R sub X represents the "strength" (current rating) of the language implementation "X".

All the possible pairings of benchmark implementations are considered, and the ratings are updated at the end of each round (this is essential to avoid dependence on the order of the pairings) with

R' sub A  = R sub A + sum from benchmarks { K ( ActualScore sub A - ExpectedScore sub A ) }

(K is a suitably small coefficient to avoid oscillations.)

By choosing the scoring function carefully, the resulting ratings admit an interesting interpretation and allow to draw more meaningful conclusions than then geometric mean.

Here follow the Elo ratings for the x64 Ubuntu (Intel® Q6600® quad-core) data as of 2008-09-01, where a program is considered the winner when it runs in less than half as much time as its opponent. (The programs used to generate it and the justification of this scoring function are given below.)

            gpp   1405
            gcc   1397
          ocaml   1336
          mlton   1317
           java   1307
            ghc   1301
        fpascal   1271
            cal   1271
          scala   1261
           gnat   1252
           nice   1246
         csharp   1199
           hipe    952
       mzscheme    928
           perl    855
         python    792
            lua    791
           pike    767
       javaxint    763
            yap    728
         groovy    687
            tcl    624
            gst    607
            php    532
     javascript    410

This ranking includes languages omitted in the original one and the relative order differs a bit. This is because this rating reflects something different, arguably more interesting.

Source code

I use a small Ruby script (scrap.rb) to screen scrap the shootout site and dump the data in JSON format that will look like

 {
   "pidigits": {
     "mzscheme": 28.27,
     "hipe": 17.04,
     ...
   },
   "fannkuch": {
     ...
   },
   ...
 }

It is then processed using by this OCaml program:

open Printf
module H = ExtHashtbl.Hashtbl
module L = ExtList.List
let (|>) x f = f x

let init_rating = 1000.
let k = 5.
let rounds = 100
let win_ratio = 1.0 /. 2.0

let actual_score ta tb =
  if ta <= win_ratio *. tb then 1.
  else if tb <= win_ratio *. ta then 0.
  else 0.5

let expected_score ra rb = 1. /. (1. +. 10. ** ((rb -. ra) /. 500.))

let new_ratings ra rb ta tb =
  let d = actual_score ta tb -. expected_score ra rb in
    (ra +. k *. d, rb -. k *. d)

let rating_deltas ra rb ta tb =
  let r1, r2 = new_ratings ra rb ta tb in
    (r1 -. ra, r2 -. rb)

type json shootout_data = (string * (string * float) assoc) assoc

let get t lang = H.find_default t lang init_rating
let inc t v k delta = H.replace t k (H.find_default t k v +. delta)

let run data =
  let ratings = H.create 13 in
  let do_match deltas (lang1, t1) (lang2, t2) =
    if lang1 <> lang2 then
      let d1, d2 = rating_deltas (get ratings lang1) (get ratings lang2) t1 t2 in
        inc deltas 0. lang1 d1;
        inc deltas 0. lang2 d2 in
  let round () =
    let ds = H.create 13 in
      L.iter (fun (_, rs) -> L.iter (fun l1 -> L.iter (do_match ds l1) rs) rs) data;
      H.iter (inc ratings init_rating) ds
  in
    for i = 1 to rounds do round () done;
    H.fold (fun lang r l -> (r, lang) :: l) ratings [] |> L.sort |> L.rev |>
    L.iter (fun (rating, lang) -> printf "%16s  %5.0f\n" lang rating)

let () = match Array.length Sys.argv with
    2 -> Json_io.load_json Sys.argv.(1) |> shootout_data_of_json |> run
  | _ -> printf "elo_rating <file>\n";
         exit 0

The only remarkable thing is the use of the json-static syntax extension to load the JSON data type-safely with only one line of code which declares the type and generates the shootout_data_of_json function used to verify and load JSON data:

 type json shootout_data = (string * (string * float) assoc) assoc

Deficiencies of the geometric mean and justification of the scoring function


Read more...

Read: Elo ratings for the Benchmarks Game (aka Great Computer Language Shootout)

Topic: $0.23 Skiddoo Previous Topic   Next Topic Topic: CSS tooltips using CSS Sprites

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use