After watching my beloved Seattle Seahawks get completely flattened by the San Francisco 49ers last weekend, in large part because the starting lineup had been decimated by injuries, I instantly remembered a NYT piece passed along by my friend and former colleague, Eric Moessinger. The article discusses an interesting project being run by the Dodgers’ head athletic trainer, Stan Conte. In an effort to predict player injuries – and save millions of dollars from being wasted on players likely to warm the bench due to torn muscles or constant sprains – Conte has assembled a team of statisticians to test the significance of correlations he has observed over the years between player traits and injuries.
Side note: Did you see Billy Beane say he would love to have someone doing the same thing but couldn’t afford the resource? If you know anyone in the Bay Area with statistics experience, a love of baseball, and no job – let them know about this. It could be the coolest unpaid internship ever!
I remember carrying out similar analyses, called regressions, as a management consultant: What variables are really correlated with radio station profitability? How correlated is relative market share with return on sales in the bicycle manufacturing market? Can we predict the year artificial intelligence will outsmart man and lead robots to overtake the planet? Pretty much all of these questions – and more – can be answered with enough ‘test cases’ and a trusty copy of SPSS (by the way – congrats guys!).
In Conte’s case, ‘test cases’ are real-life professional baseball players. And what Conte’s doing is carrying out a regression. He’s essentially building a big equation: on the left-hand side is the thing you want to predict, and on the right-hand side there are hundreds of variables for things ranging from the readily observable (e.g., height, weight, body fat %, ethnicity, innings pitched) to the less observable (e.g., nationality, family makeup) and the totally subjective (e.g., mental state). Each variable is weighted by its importance, as determined by hard-core computation. The formula that results would then allow you to plug in traits for any particular player and pop out something like ‘likely number of days on the disabled list per season.’
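For the curious, here is a minimal sketch of what that ‘big equation’ looks like in code, assuming plain ordinary least squares and just two traits. Everything in it is made up for illustration: the players, the traits (age and innings pitched), the numbers, and the function names `fit_ols` and `predict`. I have no idea what Conte’s team actually uses, and a real model would have far more variables and far messier data.

```python
# Illustrative only: the traits and injury numbers below are invented.

def fit_ols(rows, targets):
    """Fit y = b0 + b1*x1 + b2*x2 + ... by ordinary least squares.

    Solves the normal equations (X^T X) b = X^T y with Gauss-Jordan
    elimination. Assumes the traits are not perfectly collinear.
    """
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 gives the intercept b0
    n = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)] for i in range(n)]
    b = [sum(x[i] * y for x, y in zip(X, targets)) for i in range(n)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]


def predict(coef, traits):
    """Plug one player's traits into the fitted equation."""
    return coef[0] + sum(c * t for c, t in zip(coef[1:], traits))


# Hypothetical 'test cases': (age, innings pitched last season)
players = [(24, 60), (28, 180), (31, 200), (35, 90), (26, 150), (33, 120)]
# Hypothetical days on the disabled list for each of those players
days_on_dl = [12, 25, 33, 22, 20, 27]

weights = fit_ols(players, days_on_dl)
print("intercept, age weight, innings weight:", weights)
print("predicted DL days for a 30-year-old with 170 innings:",
      predict(weights, (30, 170)))
```

The weights that come back are the ‘importance’ of each variable, and `predict` is the plug-in-a-player step. With only six invented players the output means nothing, which is rather the point of needing lots of test cases.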
In reality things can’t be that clean. As the article notes, data on player injuries are a mess. Probably the biggest issue is the huge element of randomness that can’t be accounted for with variables: I don’t care if you’re Dominican or Danish – if ‘old school’ Randy Johnson nails you with a fastball, you’re out for a while.
What’s the excitement here? The article notes that any pro sports team would pay a hefty sum for a proprietary ‘formula’ for predicting player injury, seeing such a tool as yielding a great ROI. I’m not quite so sure. It feels like the inherent level of randomness in injuries limits the value of any such formula. And as anyone who read the book Moneyball will attest, pro sports are not really the most embracing of data-based personnel decisions. If Al Davis wants Richard Seymour, then darn it, Al Davis will get the overpriced aging guy Richard Seymour. Moreover, your value as a company providing such a product/consulting service is ultimately hamstrung by the fact that you’d only have ~500 potential customers around the globe. Finally, what if every sports team DID use such a formula? Imagine getting blackballed from the league because of traits outside of your control, even if you appeared destined to be an All-Star. It’d be like Gattaca. But as Gattaca taught us: “there is no gene for the human spirit.”
What I think is more promising, both in market potential and in actual societal good, is combining this style of analysis with the advances to come (hopefully) in the realm of electronic health records. Insurance companies have been doing similar things for years, but likely using less comprehensive metrics than Conte’s. Perhaps with the addition of much more data on each ‘test case’ (i.e., you and me), and a big database with 300M+ cases, US healthcare officials can uncover correlations with genetics or lifestyle decisions that we didn’t already know about. Patients could pay $10 a month to web services that manage their EMRs and provide info on what diseases they may have a predisposition for, based on their genetic code, lifestyle factors, and illness and treatment history. Or the service could send an email every quarter letting them know the best steps for prevention or early detection of those diseases.
One cool policy move could be to provide an open database for analysis, but with absolutely nothing identifying who is in it, and with the ability for individuals to opt out on request. You could then let a world of startups attack the data and find a host of interesting correlations. Privacy concerns and poor data quality would probably keep that from happening. Other issues may even include national security or pride – do we really want the world to confirm our average weight is 10-15 lbs. heavier than our counterparts in other developed countries? But would you want to trust the government alone with this kind of project? I’d much rather put such a database in the hands of entrepreneurs and see what happens. Well…I guess the government did put a man on the moon. And implemented “Cash for Clunkers.” I should cut them some slack. What’s most likely to happen is a high-security government program with bits outsourced to a few legitimate startups and/or megacaps like IBM, which should work out okay.
That’s all I got…except for this: check out Matt Hasselbeck’s post-game tweet – live from a Stanford hospital room. Geez. Where was Conte when we re-signed that guy…? Just kidding – you’re the man, Hass. Get well soon, #8!
Check it out:
http://freakonomics.blogs.nytimes.com/2009/09/14/football-injuries-the-metric-that-matters/
Thanks Eric. Further proof of the value of these types of statistical approaches!