Saturday, November 5, 2011

Lies, damned lies and statistics

Being a niche within a niche sport, cyclocross draws some obsessive types.  I count myself among the geekier of them, but I am completely outdone by the work of Colin Reuter, available at crossresults.com.

This useful website employs some simple, clever mathematical modeling to come up with a dynamic ranking and prediction mechanism for cyclocross racing at http://crossresults.com.  The ranking system is so accurate that a number of races in the US Northeast are seeded using it.



Racers have a numerical ranking based on their results over the preceding 12 months.  The lower the number, the better.  Top ranked is Sven Nys, a former world champion, with a points score of around 97 as of early November 2011.  A score around the hundred mark means that you're one of the top riders in the world and are probably fluent in Flemish.  A score under 200 means a good pro or a national-level amateur racer.  If your points are below 300 you are a top regional rider at Masters level, and even better if you are not a Masters racer.  Below 400 and you're doing well.  Between 400 and 500 means you mostly finish in the bottom half of the field.  Above 500 means you need to review your training plans.  Above 600 means you need to take up a new sport, or at least ride your bike once a week when you're not racing.

Your ranking is a running average of your recent scores, with the most recent events weighted most heavily.  It reflects physiological training effects but makes no allowance for crashes, mechanical problems, bad days and hangovers.  Also not taken into account are course conditions, course difficulty and weather.
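
To make the "most recent events count most" idea concrete, here is a minimal sketch of a recency-weighted average.  The exact weighting crossresults.com uses isn't described here, so the geometric decay, the decay value of 0.8 and the function name running_ranking are all my own assumptions for illustration.

```python
def running_ranking(scores, decay=0.8):
    """Recency-weighted average of race scores (illustrative only).

    scores: points earned per race, ordered oldest to newest.
    decay:  hypothetical factor; each step back in time multiplies the
            weight by this amount. Not taken from crossresults.com.
    """
    if not scores:
        raise ValueError("need at least one race score")
    n = len(scores)
    # Newest race gets weight 1, the one before it `decay`, then decay**2, ...
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# A rider who improves from 500 to 300 ends up better (lower) than the
# plain average of 400, because the newer result weighs more.
improving = running_ranking([500, 300])
fading = running_ranking([300, 500])
```

With these assumed weights, the improving rider lands a little below 400 and the fading rider a little above it, even though both have the same two results.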

Each time you race, you get points based on your result.  Points for a race are allocated such that
  • the total points granted equal the total points the entrants held at the start of the race
  • the winner gets the lowest score: a fraction of the average of the pre-race scores of the 2nd to 5th place finishers (the top 5, excluding the winner)
  • the median racer scores the points that s/he had at the start of the race
  • the gap in points granted between consecutive places is constant; in other words, the difference between 1st and 2nd is the same as the difference between 2nd and 3rd, and so on
  • DNF (did not finish) counts as last.  DFL in fact.
What does this gibberish mean?  It turns out that this model is a pretty accurate measurement of racing form across the population of racers, so much so that many races use it as the basis for seeding riders for call-ups.  It also has some interesting properties:
  • Sandbaggers don't gain.  A sandbagger, someone who races at a lower level than s/he should in order to win, is given points based on the average of the 2nd-5th place riders.  A sandbagger who is significantly better than the field is therefore awarded more points than his/her current ranking, which makes that ranking worse
  • Sandbaggers help the field.  A sandbagger lowers the field's average points, so everyone else receives (slightly) lower points than they would in a more even field
  • Scores are normalized, so rankings are comparable across riders
  • Consistency is rewarded; even more so, riders who abandon races are penalized irrespective of the reason
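
The allocation rules listed earlier can be sketched in a few lines.  This is my own reading of them, not the site's code: the winner's fraction of 0.9 is a made-up value, the function name allocate_points is hypothetical, and I use only three of the rules (total preserved, winner's score from the 2nd-5th place riders, constant gap between places).  DNFs are assumed to be placed at the end of the list already, and nothing guards against an odd field where the computed gap would come out negative.

```python
def allocate_points(start_points, winner_fraction=0.9):
    """Grant points for one race (illustrative sketch).

    start_points:    pre-race points of each rider, in finishing order,
                     with DNFs at the end.
    winner_fraction: hypothetical value; the real fraction is not known.
    """
    n = len(start_points)
    if n < 2:
        raise ValueError("need at least two riders")
    total = sum(start_points)
    chasers = start_points[1:5]  # 2nd through 5th place (fewer in tiny fields)
    winner = winner_fraction * sum(chasers) / len(chasers)
    # Constant gap chosen so the granted points sum to `total`:
    #   n * winner + step * n * (n - 1) / 2 == total
    step = 2 * (total - n * winner) / (n * (n - 1))
    return [winner + place * step for place in range(n)]


# Same field twice, except the winner is a sandbagger ranked 150 in the second.
even_field = allocate_points([300, 310, 320, 330, 340, 350, 360])
sandbagged = allocate_points([150, 310, 320, 330, 340, 350, 360])
```

In the second race the sandbagger ranked 150 is granted about 292.5 points, far worse than his/her ranking, while 2nd place onward each receive slightly fewer points than in the even field.  One caveat: with only these three rules, the middle finisher receives the field's average rather than exactly his/her own starting points, so the real system must reconcile the median rule in some way I haven't modeled.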

Bikereg.com (the core of competitive cycling in the US) acquired crossresults.com in early 2011.  It should be very interesting to see what improvements are made once the site has full-time funding.  USA Cycling has already announced that it will be using this mechanism to rank riders in other disciplines.

One of the immediate enhancements was to provide a race prediction function to all users (this was previously available for a small donation).  The race predictor takes the list of registered riders and predicts the outcome of the race using their current rankings.  It's uncannily accurate.
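
I don't know how the predictor works internally, but the simplest version of "predict the outcome from current rankings" is just a sort.  Here is a minimal sketch under that assumption; the function name predict_finish and the riders are invented for illustration.

```python
def predict_finish(registered):
    """Predict finishing order from current points (illustrative sketch).

    registered: dict mapping rider name -> current points (lower is better).
    Returns rider names, predicted winner first. Ties are broken by name
    only to keep the output deterministic.
    """
    return sorted(registered, key=lambda name: (registered[name], name))


# Hypothetical start list; names and points are made up.
start_list = {"Rider A": 412.0, "Rider B": 287.5, "Rider C": 350.2}
predicted = predict_finish(start_list)
```

Here the predicted order is Rider B, Rider C, Rider A, which is also the order you would use for call-ups if you seeded the race from the same rankings.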