|From the email inbox...|
"Hello Jeff,
Is there a general rule of thumb regarding how many races to use for probability expressions? I’m currently using ‘Top 100’ but I have seen you use ‘Top 80’ as well. Just curious.
Thank you for your time."
The game is something that is always changing.
Because of that: I don't think there is a correct answer to your question -- at least not one that can be applied in all cases.
I could run a back test for all of my Prob Expressions using a large sample -- and on that past sample only -- I might discover that using 178 starters in my Prob Expressions generated the highest returns.
But in another large sample of races from a different time period -- I might discover that using 43 starters in my Prob Expressions generated the highest returns.
Initially, results like those confused the crap out of me.
It wasn't until I asked myself "How are the races in sample A different from the races in sample B?" -- and then started analyzing WHY the inherent differences within distinct samples were generating (sometimes widely) different results -- that it started making sense to me.
I'll attempt to explain what I am talking about using a change in track surface and its (likely) effect on a Prob Expression based on RailPosition.
Last week I read an article by Daniel Ross that was published on the thoroughbredracing.com site on June 23, 2017.
Equine fatalities: why this is a pivotal year for Del Mar and Saratoga:
"At Del Mar, the most "significant" change since last summer involved drafting Santa Anita's veteran track superintendent, Dennis Moore, to oversee management of the San Diego facility's racing surface, said its president and CEO, Joe Harper.
And the primary focus of Moore's work has been to alter the banking of the track to mirror the consistency and geometric dimensions of Santa Anita’s surface."
According to the article, in an effort to cut down on equine fatalities, Del Mar hired the track superintendent from Santa Anita -- and the first thing he did was dig up Del Mar's dirt course and rebuild it to give the turns at Del Mar the same 5 percent banking that Santa Anita has.
How is this surface change likely to influence the performance of a Prob Expression based on RailPosition?
Suppose I have a Prob Expression based on RailPosition for the most recent n starters at today's track-intsurface-dist.
If I had to make an educated guess -- I would say that for the Del Mar 2017 meet -- it really doesn't matter whether the sql driving my Prob Expression is based on n=600, n=300, n=150, or n=75...
Because if I can trust the article -- I know that they've dug up the dirt course and re-banked it.
If a Prob Expression for RailPosition is based on (say) n=150 starters at today's track-intsurface-dist -- during the first part of the Del Mar meet the bulk of the starters being scored by the Prob Expression are going to be from a past Del Mar meet when the banking was completely different.
This means that the Prob Expression now has a higher likelihood than normal of generating scores that could turn out to be misleading.
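To put a rough number on that, here is a minimal Python sketch (not JCapper code) that measures what share of a most-recent-n window predates a surface change. The function name stale_share, the dates, and the starter counts are all made up for illustration; only the idea of a window of the n most recent starters comes from the discussion above.

```python
from datetime import date

def stale_share(starter_dates, change_date, n):
    """Fraction of the n most recent starters that ran before change_date."""
    # Keep only the n most recent starters (newest first).
    recent = sorted(starter_dates, reverse=True)[:n]
    if not recent:
        return 0.0
    stale = sum(1 for d in recent if d < change_date)
    return stale / len(recent)

# Hypothetical sample: 140 starters from an earlier meet,
# 10 starters from after the turns were re-banked.
old_meet = [date(2016, 8, 1)] * 140
new_meet = [date(2017, 7, 20)] * 10
change = date(2017, 7, 1)  # assumed date of the surface change

share = stale_share(old_meet + new_meet, change, n=150)
print(f"{share:.0%} of the window predates the change")  # 93%
```

In this made-up case, 140 of the 150 starters in the window ran over the old banking. The share only drops as starters from the new surface accumulate, or if n is cut small enough that the window holds post-change starters only.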
Based on past experience -- more often than not whenever I've seen a sudden change in course banking:
The change in steepness of banking causes the new surface to shape race outcomes in a completely different manner than the previous surface.
In this case -- I'll probably de-weight my RailPosition Prob Expression for all of the DMR intsurface <= 3 distances in my models... and keep a close eye on how the more steeply banked turns are shaping race outcomes as the DMR 2017 meet unfolds... and adjust on the fly after seeing some meet-specific data.
Btw, it's rare that you get a heads up from an article.
More often than not the turns are rebanked (sometimes more than once in the middle of a meet) with absolutely no warning to the public.
--Hint: But if you watch head-on replays you can clearly pick up on changes in the steepness of the banking.
In general for Prob Expressions driven by TOP n * with ORDER BY [DATE] DESC:
• n=600 gets you a big picture look and tends to smooth out short-term variance. (But also tends to slow down Calc Races routines a bit.)
• n=25 gets you a look at the recent picture. (Plenty of built-in short-term variance with minimal slowing of Calc Races routines.)
• n=100 to 150 gets you a combination of the big picture and the recent picture. (Some smoothing of short-term variance with medium slowing of Calc Races routines.)
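The tradeoff in the list above can be sketched in Python. This is only an illustration of the TOP n ... ORDER BY [DATE] DESC idea, not how JCapper actually computes a Prob Expression: the field names (date, rail, won), the helper names, and the four sample starters are hypothetical.

```python
from collections import defaultdict

def top_n_recent(starters, n):
    """Mimic TOP n ... ORDER BY [DATE] DESC: keep the n most recent starters."""
    return sorted(starters, key=lambda s: s["date"], reverse=True)[:n]

def win_rate_by_rail(starters, n):
    """Win rate per rail position over the n most recent starters."""
    wins = defaultdict(int)
    starts = defaultdict(int)
    for s in top_n_recent(starters, n):
        starts[s["rail"]] += 1
        wins[s["rail"]] += 1 if s["won"] else 0
    return {rail: wins[rail] / starts[rail] for rail in starts}

# Hypothetical starters: rail 1 won at the older meet, lost at the newer one.
starters = [
    {"date": "2016-08-01", "rail": 1, "won": True},
    {"date": "2016-08-02", "rail": 1, "won": True},
    {"date": "2017-07-20", "rail": 1, "won": False},
    {"date": "2017-07-21", "rail": 1, "won": False},
]

print(win_rate_by_rail(starters, 4))  # {1: 0.5}  big picture
print(win_rate_by_rail(starters, 2))  # {1: 0.0}  recent picture only
```

The same data gives two different reads on rail 1 depending on n. A large n blends the older results in and smooths things out, while a small n reacts quickly to what has happened lately but rests on very few starters.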
All of that said:
The game is something that is forever evolving.
Because of that: I don't think there is a correct answer for n -- or the optimal number of starters in a Prob Expression.
I think the best any of us can do is break the game down into various parts -- find factors within each part that are being misrepresented a bit in the odds --
And from there create Prob Expressions to express those factors in counter-intuitive ways.
~Edited by: jeff on: 6/26/2017 at: 12:15:10 PM~