Pass the ROC (http://danielweitzenfeld.github.io/passtheroc/)
A Hierarchical Bayesian Model of the Premier League

Oct 28, 2014

Last fall, I was listening to an episode of the BS Report podcast (http://espn.go.com/espnradio/podcast/archive?id=2864045) in which Bill Simmons and Cousin Sal were discussing the strength of different NFL teams. It was early in the season, week 3 or 4 maybe, and Bill made the point (I'm paraphrasing here) that your estimate of one team's strength depends on your estimate of all the others. The conclusions you draw from team X beating team Y depend on how strong team Y is, which in turn depends on the conclusions you draw from team Y's other games, which in turn depends on how strong Y's opponents were, etc.

It occurred to me that this problem is perfect for a Bayesian model. We want to infer the latent parameters (every team's strength) that are generating the data we observe (the scorelines). Moreover, we know that the scorelines are a noisy measurement of team strength, so ideally, we want a model that makes it easy to quantify our uncertainty about the underlying strengths.

So I googled 'Bayesian football' and found this paper (http://www.statistica.it/gianluca/Research/BaioBlangiardo.pdf), called 'Bayesian hierarchical model for the prediction of football results.' The authors (Gianluca Baio and Marta A. Blangiardo) being Italian, though, the 'football' here is soccer.

In this post, I'm going to reproduce the first model described in the paper using pymc. While they used Serie A in their paper, I'm going to use the 2013-2014 Premier League.

Getting and cleaning the data

I got the data from Wikipedia (http://en.wikipedia.org/wiki/201314_Premier_League). I copied the 'Result Table' and saved it as a text file.

In[1]: %matplotlib inline
%config InlineBackend.figure_format = 'retina'

In[2]: import os
import math
import warnings
warnings.filterwarnings('ignore')

from IPython.display import Image, HTML
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pymc  # I know folks are switching to "as pm" but I'm just not there yet

In[3]: DATA_DIR = os.path.join(os.getcwd(), 'data/')
CHART_DIR = os.path.join(os.getcwd(), 'charts/')

In[4]: data_file = DATA_DIR + 'premier_league_13_14.txt'
df = pd.read_csv(data_file, sep='\t', index_col=0)
df.head()
Out[4]:
Home\Away        ARS  AST  CAR  CHE  CPA  EVE  FUL  HUL  LIV  MNC  MNU  NEW  NOR  SOT  STO  SUN  SWA  TOT  WBA  WHU
Arsenal          NaN  1–3  2–0  0–0  2–0  1–1  2–0  2–0  2–0  1–1  0–0  3–0  4–1  2–0  3–1  4–1  2–2  1–0  1–0  3–1
Aston Villa      1–2  NaN  2–0  1–0  0–1  0–2  1–2  3–1  0–1  3–2  0–3  1–2  4–1  0–0  1–4  0–0  1–1  0–2  4–3  0–2
Cardiff City     0–3  0–0  NaN  1–2  0–3  0–0  3–1  0–4  3–6  3–2  2–2  1–2  2–1  0–3  1–1  2–2  1–0  0–1  1–0  0–2
Chelsea          6–0  2–1  4–1  NaN  2–1  1–0  2–0  2–0  2–1  2–1  3–1  3–0  0–0  3–1  3–0  1–2  1–0  4–0  2–2  0–0
Crystal Palace   0–2  1–0  2–0  1–0  NaN  0–0  1–4  1–0  3–3  0–2  0–2  0–3  1–1  0–1  1–0  3–1  0–2  0–1  3–1  1–0

Clearly, we need to do some munging here. Let's turn this into a long dataframe, and split the score into two numeric columns.
In[5]: df.index = df.columns
rows = []
for i in df.index:
    for c in df.columns:
        if i == c: continue
        score = df.ix[i, c]
        score = [int(row) for row in score.split('–')]  # scores are en-dash separated, e.g. '1–3'
        rows.append([i, c, score[0], score[1]])
df = pd.DataFrame(rows, columns=['home', 'away', 'home_score', 'away_score'])
df.head()
Out[5]:
  home away  home_score  away_score
0  ARS  AST           1           3
1  ARS  CAR           2           0
2  ARS  CHE           0           0
3  ARS  CPA           2           0
4  ARS  EVE           1           1

Much better, but still not done. We need an easy way to refer to the teams. Let's create a lookup table, which maps team name to a unique integer i.

In[6]: teams = df.home.unique()
teams = pd.DataFrame(teams, columns=['team'])
teams['i'] = teams.index
teams.head()
Out[6]:
  team  i
0  ARS  0
1  AST  1
2  CAR  2
3  CHE  3
4  CPA  4

Now, we can merge this into our main dataframe to create the columns i_home and i_away.

In[7]: df = pd.merge(df, teams, left_on='home', right_on='team', how='left')
df = df.rename(columns={'i': 'i_home'}).drop('team', 1)
df = pd.merge(df, teams, left_on='away', right_on='team', how='left')
df = df.rename(columns={'i': 'i_away'}).drop('team', 1)
df.head()

Out[7]:
  home away  home_score  away_score  i_home  i_away
0  ARS  AST           1           3       0       1
1  CAR  AST           0           0       2       1
2  CHE  AST           2           1       3       1
3  CPA  AST           1           0       4       1
4  EVE  AST           2           1       5       1

Now, let's extract the data into arrays, so that pymc can work with it. Note that each of the arrays (observed_home_goals, observed_away_goals, home_team, away_team) is the same length, and that the ith entry of each refers to the same game.

In[8]: observed_home_goals = df.home_score.values
observed_away_goals = df.away_score.values
home_team = df.i_home.values
away_team = df.i_away.values
num_teams = len(df.i_home.unique())
num_games = len(home_team)

This next step is unnecessary, but I like to do it anyway: come up with some decent starting values for the att and def parameters. The log transformation will make sense shortly.

In[9]: g = df.groupby('i_away')
att_starting_points = np.log(g.away_score.mean())
g = df.groupby('i_home')
def_starting_points = np.log(g.away_score.mean())

The Model

In the first model from the paper, they model the observed goals-scored counts as:

$y_{gj} \mid \theta_{gj} \sim \mathrm{Poisson}(\theta_{gj})$

"where the $\theta = (\theta_{g1}, \theta_{g2})$ represent the scoring intensity in the $g$-th game for the team playing at home ($j = 1$) and away ($j = 2$), respectively."


They use a log-linear model for the thetas:

$\log \theta_{g1} = home + att_{h(g)} + def_{a(g)}$

$\log \theta_{g2} = att_{a(g)} + def_{h(g)}$

Note that they're breaking out team strength into attacking and defending strength. A negative defense parameter will sap the mojo from the opposing team's attacking parameter.

$home$ represents home-team advantage, and in this model is assumed to be constant across teams. The priors on the home and intercept parameters are flat (keep in mind that in pymc, Normals are specified as (mean, precision)):

$home \sim \mathrm{Normal}(0, .0001)$

$intercept \sim \mathrm{Normal}(0, .0001)$

The team-specific effects are modeled as exchangeable:

$att_t \sim \mathrm{Normal}(\mu_{att}, \tau_{att})$

$def_t \sim \mathrm{Normal}(\mu_{def}, \tau_{def})$

To ensure identifiability, they impose a sum-to-zero constraint on the attack and defense parameters. They also tried a corner constraint (pinning one team's parameters to 0, 0), but found that interpretation is easier in the sum-to-zero case because it's not relative to the 0, 0 team.

$\sum_{t=1}^{T} att_t = 0$

$\sum_{t=1}^{T} def_t = 0$
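The sum-to-zero constraint can be enforced by subtracting the mean from an unconstrained vector, which is the trick used in the model code later on. A quick numpy sanity check (with made-up numbers) shows why it works:

```python
import numpy as np

# Hypothetical unconstrained ("starred") attack effects for five teams.
atts_star = np.array([0.30, -0.12, 0.05, -0.40, 0.22])

# Subtracting the mean re-centers the vector so it sums to zero,
# without changing any of the differences between teams.
atts = atts_star - np.mean(atts_star)

print(np.isclose(atts.sum(), 0.0))                     # True
print(np.allclose(np.diff(atts), np.diff(atts_star)))  # True: team gaps preserved
```

Because only the differences between teams matter for predictions, re-centering costs nothing while making every parameter interpretable relative to the league average.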

The hyperpriors on the attack and defense parameters are also flat:

$\mu_{att} \sim \mathrm{Normal}(0, .0001)$

$\mu_{def} \sim \mathrm{Normal}(0, .0001)$

$\tau_{att} \sim \mathrm{Gamma}(.1, .1)$

$\tau_{def} \sim \mathrm{Gamma}(.1, .1)$

My Tweaks

I made two tweaks to the model. It didn't make sense to me to have a $\mu_{att}$ when we're enforcing the sum-to-zero constraint by subtracting the mean anyway. So I eliminated $\mu_{att}$ and $\mu_{def}$:

$att_t \sim \mathrm{Normal}(0, \tau_{att})$

$def_t \sim \mathrm{Normal}(0, \tau_{def})$

Also because of the sum-to-zero constraint, it seemed to me that we needed an intercept term in the log-linear model, capturing the average goals scored per game by the away team:

$\log \theta_{g1} = intercept + home + att_{h(g)} + def_{a(g)}$

$\log \theta_{g2} = intercept + att_{a(g)} + def_{h(g)}$

Lastly, I tried uniform priors on $\tau_{att}$ and $\tau_{def}$, but found no difference relative to the gamma priors. Below is what the model looks like in pymc.


One thing I learned in this process is that pymc plays best with numpy arrays. If you find yourself using a list comprehension, see if there's a way to use numpy arrays instead.
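For instance (with made-up numbers), indexing a numpy array with an integer array of team indices replaces a per-game list comprehension with a single vectorized operation, which is the pattern the model's deterministic nodes rely on:

```python
import numpy as np

atts = np.array([0.25, -0.10, 0.05])   # per-team attack effects (made up)
home_team = np.array([0, 2, 1, 0])     # index of the home team in each of 4 games

# A per-game list comprehension...
slow = [atts[i] for i in home_team]
# ...is equivalent to one vectorized fancy-indexing lookup:
fast = atts[home_team]

print(np.allclose(slow, fast))  # True
```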

In[10]: # hyperpriors
home = pymc.Normal('home', 0, .0001, value=0)
tau_att = pymc.Gamma('tau_att', .1, .1, value=10)
tau_def = pymc.Gamma('tau_def', .1, .1, value=10)
intercept = pymc.Normal('intercept', 0, .0001, value=0)

# team-specific parameters
atts_star = pymc.Normal("atts_star",
                        mu=0,
                        tau=tau_att,
                        size=num_teams,
                        value=att_starting_points.values)
defs_star = pymc.Normal("defs_star",
                        mu=0,
                        tau=tau_def,
                        size=num_teams,
                        value=def_starting_points.values)

# trick to code the sum-to-zero constraint
@pymc.deterministic
def atts(atts_star=atts_star):
    atts = atts_star.copy()
    atts = atts - np.mean(atts_star)
    return atts

@pymc.deterministic
def defs(defs_star=defs_star):
    defs = defs_star.copy()
    defs = defs - np.mean(defs_star)
    return defs

@pymc.deterministic
def home_theta(home_team=home_team,
               away_team=away_team,
               home=home,
               atts=atts,
               defs=defs,
               intercept=intercept):
    return np.exp(intercept +
                  home +
                  atts[home_team] +
                  defs[away_team])

@pymc.deterministic
def away_theta(home_team=home_team,
               away_team=away_team,
               home=home,
               atts=atts,
               defs=defs,
               intercept=intercept):
    return np.exp(intercept +
                  atts[away_team] +
                  defs[home_team])


home_goals = pymc.Poisson('home_goals',
                          mu=home_theta,
                          value=observed_home_goals,
                          observed=True)
away_goals = pymc.Poisson('away_goals',
                          mu=away_theta,
                          value=observed_away_goals,
                          observed=True)

mcmc = pymc.MCMC([home, intercept, tau_att, tau_def,
                  home_theta, away_theta,
                  atts_star, defs_star, atts, defs,
                  home_goals, away_goals])
map_ = pymc.MAP(mcmc)
map_.fit()
mcmc.sample(200000, 40000, 20)

[100%] 200000 of 200000 complete in 91.9 sec

Diagnostics

Let's see if/how the model converged. The home parameter looks good, and indicates that home-field advantage amounts to $e^{.4} - e^{.1} \approx .39$ goals per game at the intercept.
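As a back-of-the-envelope check, using round-number approximations to the posterior means (home ≈ .3 and intercept ≈ .1, which are assumptions here for illustration): at zero attack and defense effects, the home and away goal intensities differ by about .39.

```python
import math

# Rounded, approximate posterior means (assumptions for illustration).
intercept = 0.1
home = 0.3

theta_home = math.exp(intercept + home)  # expected home goals when atts = defs = 0
theta_away = math.exp(intercept)         # expected away goals when atts = defs = 0

print(round(theta_home - theta_away, 2))  # 0.39
```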

In[44]: pymc.Matplot.plot(home)
Plotting home

The intercept also looks good, and has an HDI that doesn't include zero, indicating that including an intercept was a good idea:

In[45]: pymc.Matplot.plot(intercept)
Plotting intercept

The $\tau$s look OK, although there is definitely more autocorrelation in $\tau_{def}$:

In[46]: pymc.Matplot.plot(tau_att)
Plotting tau_att

In[47]: pymc.Matplot.plot(tau_def)
Plotting tau_def

Below is a subset of the atts parameters' plots. To see them all, we'd just do pymc.Matplot.plot(atts).

In[15]: Embed = Image(CHART_DIR + 'atts.png')
Embed
Visualization

Let's see if we can recreate a plot (Figure 3) from their paper, of average defense effect by average attack effect. We'll add in color-coding by whether the team ended the season qualifying for the Champions League (top 4), Europa League (5 & 6), or was relegated (bottom 3).

In[17]: observed_season = DATA_DIR + 'premier_league_13_14_table.csv'
df_observed = pd.read_csv(observed_season)
df_observed.loc[df_observed.QR.isnull(), 'QR'] = ''

In[48]: df_avg = pd.DataFrame({'avg_att': atts.stats()['mean'],
                       'avg_def': defs.stats()['mean']},
                      index=teams.team.values)
df_avg = pd.merge(df_avg, df_observed, left_index=True, right_on='team', how='left')

fig, ax = plt.subplots(figsize=(8, 6))
for outcome in ['champs_league', 'relegation', 'europa_league', '']:
    ax.plot(df_avg.avg_att[df_avg.QR == outcome],
            df_avg.avg_def[df_avg.QR == outcome], 'o', label=outcome)

for label, x, y in zip(df_avg.team.values, df_avg.avg_att.values, df_avg.avg_def.values):
    ax.annotate(label, xy=(x, y), xytext=(5, 5), textcoords='offset points')
ax.set_title('Attack vs Defense avg effect: 13-14 Premier League')
ax.set_xlabel('Avg attack effect')
ax.set_ylabel('Avg defense effect')
ax.legend()

Out[48]: <matplotlib.legend.Legend at 0x115cd1c90>
So as you would expect, the top teams are in the lower right side of the chart, indicating a positive attack effect and a negative defense effect. Interestingly, Manchester United (MNU) appears to have both a stronger attack and a stronger defense than two teams that finished above it in the table, Everton (EVE) and Tottenham (TOT). But we're just looking at posterior means here, and we ought to take advantage of the fact that we can quantify our posterior uncertainty around these parameters. Let's look at the Highest Posterior Density intervals for the attack parameters.

In[49]: df_hpd = pd.DataFrame(atts.stats()['95% HPD interval'],
                      columns=['hpd_low', 'hpd_high'],
                      index=teams.team.values)
df_median = pd.DataFrame(atts.stats()['quantiles'][50],
                         columns=['hpd_median'],
                         index=teams.team.values)
df_hpd = df_hpd.join(df_median)
df_hpd['relative_lower'] = df_hpd.hpd_median - df_hpd.hpd_low
df_hpd['relative_upper'] = df_hpd.hpd_high - df_hpd.hpd_median
df_hpd = df_hpd.sort_index(by='hpd_median')
df_hpd = df_hpd.reset_index()
df_hpd['x'] = df_hpd.index + .5


fig, axs = plt.subplots(figsize=(10, 4))
axs.errorbar(df_hpd.x, df_hpd.hpd_median,
             yerr=(df_hpd[['relative_lower', 'relative_upper']].values).T,
             fmt='o')
axs.set_title('HPD of Attack Strength, by Team')
axs.set_xlabel('Team')
axs.set_ylabel('Posterior Attack Strength')
_ = axs.set_xticks(df_hpd.index + .5)
_ = axs.set_xticklabels(df_hpd['index'].values, rotation=45)
Note that the intervals for MNU, EVE, and TOT overlap significantly. Indeed, there's a large set of teams in the middle with overlapping HPDs. We can be fairly confident, though, that Liverpool's and Man City's attack strength left the rest of the pack behind, which makes sense given that they each scored 30 more goals than Chelsea, the team with the third-most goals scored. In fact, we might conclude that their attacks were even further from the rest of the pack if it weren't for shrinkage. To see what I mean, let's simulate some seasons.

Simulations

We can take draws from the posterior distributions of the parameters, and simulate a season, or many seasons. Below is the simulation code, just so you can see how I did it.

In[24]: def simulate_season():
    """
    Simulate a season once, using one random draw from the mcmc chain.
    """
    num_samples = atts.trace().shape[0]
    draw = np.random.randint(0, num_samples)
    atts_draw = pd.DataFrame({'att': atts.trace()[draw, :]})
    defs_draw = pd.DataFrame({'def': defs.trace()[draw, :]})
    home_draw = home.trace()[draw]
    intercept_draw = intercept.trace()[draw]
    season = df.copy()
    season = pd.merge(season, atts_draw, left_on='i_home', right_index=True)
    season = pd.merge(season, defs_draw, left_on='i_home', right_index=True)
    season = season.rename(columns={'att': 'att_home', 'def': 'def_home'})
    season = pd.merge(season, atts_draw, left_on='i_away', right_index=True)
    season = pd.merge(season, defs_draw, left_on='i_away', right_index=True)
    season = season.rename(columns={'att': 'att_away', 'def': 'def_away'})
    season['home'] = home_draw
    season['intercept'] = intercept_draw
    season['home_theta'] = season.apply(lambda x: math.exp(x['intercept'] +
                                                           x['home'] +
                                                           x['att_home'] +
                                                           x['def_away']), axis=1)
    season['away_theta'] = season.apply(lambda x: math.exp(x['intercept'] +
                                                           x['att_away'] +
                                                           x['def_home']), axis=1)
    season['home_goals'] = season.apply(lambda x: np.random.poisson(x['home_theta']), axis=1)
    season['away_goals'] = season.apply(lambda x: np.random.poisson(x['away_theta']), axis=1)
    season['home_outcome'] = season.apply(lambda x: 'win' if x['home_goals'] > x['away_goals'] else
                                          'loss' if x['home_goals'] < x['away_goals'] else 'draw', axis=1)
    season['away_outcome'] = season.apply(lambda x: 'win' if x['home_goals'] < x['away_goals'] else
                                          'loss' if x['home_goals'] > x['away_goals'] else 'draw', axis=1)
    season = season.join(pd.get_dummies(season.home_outcome, prefix='home'))
    season = season.join(pd.get_dummies(season.away_outcome, prefix='away'))
    return season


def create_season_table(season):
    """
    Using a season dataframe output by simulate_season(), create a summary dataframe
    with wins, losses, goals for, etc.
    """
    g = season.groupby('i_home')
    home = pd.DataFrame({'home_goals': g.home_goals.sum(),
                         'home_goals_against': g.away_goals.sum(),
                         'home_wins': g.home_win.sum(),
                         'home_draws': g.home_draw.sum(),
                         'home_losses': g.home_loss.sum()
                         })
    g = season.groupby('i_away')
    away = pd.DataFrame({'away_goals': g.away_goals.sum(),
                         'away_goals_against': g.home_goals.sum(),
                         'away_wins': g.away_win.sum(),
                         'away_draws': g.away_draw.sum(),
                         'away_losses': g.away_loss.sum()
                         })
    df = home.join(away)
    df['wins'] = df.home_wins + df.away_wins
    df['draws'] = df.home_draws + df.away_draws
    df['losses'] = df.home_losses + df.away_losses
    df['points'] = df.wins * 3 + df.draws
    df['gf'] = df.home_goals + df.away_goals
    df['ga'] = df.home_goals_against + df.away_goals_against
    df['gd'] = df.gf - df.ga
    df = pd.merge(teams, df, left_on='i', right_index=True)
    df = df.sort_index(by='points', ascending=False)
    df = df.reset_index()
    df['position'] = df.index + 1
    df['champion'] = (df.position == 1).astype(int)
    df['qualified_for_CL'] = (df.position < 5).astype(int)
    df['relegated'] = (df.position > 17).astype(int)
    return df

def simulate_seasons(n=100):
    dfs = []
    for i in range(n):
        s = simulate_season()
        t = create_season_table(s)
        t['iteration'] = i
        dfs.append(t)
    return pd.concat(dfs, ignore_index=True)

Let's now simulate a thousand seasons.

In[25]: simuls = simulate_seasons(1000)

How did Man City do across our 1000 simulations? Below is a histogram of the number of points they had at season's end:

In[50]: ax = simuls.points[simuls.team == 'MNC'].hist(figsize=(7, 5))
median = simuls.points[simuls.team == 'MNC'].median()
ax.set_title('Man City: 2013-14 points, 1000 simulations')
ax.plot([median, median], ax.get_ylim())
plt.annotate('Median: %s' % median, xy=(median + 1, ax.get_ylim()[1] - 10))

Out[50]: <matplotlib.text.Annotation at 0x1152d9d90>

And how many goals they scored:

In[51]: ax = simuls.gf[simuls.team == 'MNC'].hist(figsize=(7, 5))
median = simuls.gf[simuls.team == 'MNC'].median()
ax.set_title('Man City: 2013-14 goals for, 1000 simulations')
ax.plot([median, median], ax.get_ylim())
plt.annotate('Median: %s' % median, xy=(median + 1, ax.get_ylim()[1] - 10))

Out[51]: <matplotlib.text.Annotation at 0x114a9e610>
Note that the median simulation has Man City scoring 95 goals, 7 fewer than they actually scored last season. I wonder if something similar is going on for other teams. Let's check.

In[52]: g = simuls.groupby('team')
season_hdis = pd.DataFrame({'points_lower': g.points.quantile(.05),
                            'points_upper': g.points.quantile(.95),
                            'goals_for_lower': g.gf.quantile(.05),
                            'goals_for_median': g.gf.median(),
                            'goals_for_upper': g.gf.quantile(.95),
                            'goals_against_lower': g.ga.quantile(.05),
                            'goals_against_upper': g.ga.quantile(.95),
                            })
season_hdis = pd.merge(season_hdis, df_observed, left_index=True, right_on='team')
column_order = ['team', 'points_lower', 'Pts', 'points_upper',
                'goals_for_lower', 'GF', 'goals_for_median', 'goals_for_upper',
                'goals_against_lower', 'GA', 'goals_against_upper']
season_hdis = season_hdis[column_order]
season_hdis['relative_goals_upper'] = season_hdis.goals_for_upper - season_hdis.goals_for_median
season_hdis['relative_goals_lower'] = season_hdis.goals_for_median - season_hdis.goals_for_lower
season_hdis = season_hdis.sort_index(by='GF')
season_hdis = season_hdis.reset_index()
season_hdis['x'] = season_hdis.index + .5
season_hdis

fig, axs = plt.subplots(figsize=(10, 6))
axs.scatter(season_hdis.x, season_hdis.GF, c=sns.palettes.color_palette()[4], zorder=10,
            label='Actual Goals For')
axs.errorbar(season_hdis.x, season_hdis.goals_for_median,
             yerr=(season_hdis[['relative_goals_lower', 'relative_goals_upper']].values).T,
             fmt='s', c=sns.palettes.color_palette()[5], label='Simulations')
axs.set_title('Actual Goals For, and 90% Interval from Simulations, by Team')
axs.set_xlabel('Team')
axs.set_ylabel('Goals Scored')
axs.set_xlim(0, 20)
axs.legend()
_ = axs.set_xticks(season_hdis.index + .5)
_ = axs.set_xticklabels(season_hdis['team'].values, rotation=45)

Indeed, for high-scoring teams, the observed goals scored falls in the upper end of the distribution from our simulations, and the opposite is true (but to a lesser extent) for the lower-scoring teams. This is shrinkage in action.

Shrinkage

Recall that we defined the attack parameters as being exchangeable: $att_t \sim \mathrm{Normal}(0, \tau_{att})$. Shrinkage is when parameters are pulled toward their group mean; in this case, the team-specific attack parameters are pulled toward zero. We can see this visually in the chart above: the blue squares (simulation medians) are all closer to the mean than the observations (yellow circles).

One way to think about shrinkage in this model is that we're using the data on all the teams to inform our inference on each individual team. We're using the fact that most teams scored 40-60 goals in the season to infer that Liverpool's and Man City's 100+ goal seasons were a bit lucky.
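The pull toward the group mean shows up cleanly in the textbook normal-normal case, which is only a toy stand-in for this model but makes the mechanism concrete: with a zero-mean prior of precision tau, the posterior mean of a team effect is the raw estimate scaled toward zero, and the scaling strengthens as tau grows. The function and numbers below are hypothetical, just for illustration.

```python
def shrunk_mean(y_bar, n, sigma2, tau):
    """Posterior mean for y_i ~ Normal(theta, sigma2), theta ~ Normal(0, precision=tau).
    Uses the (mean, precision) parameterisation, as pymc does."""
    data_precision = n / sigma2
    return (data_precision * y_bar) / (data_precision + tau)

# A raw estimate of 1.0 over a 38-game season is pulled toward zero...
print(shrunk_mean(y_bar=1.0, n=38, sigma2=1.0, tau=10.0))   # ~0.79
# ...and pulled harder when the prior is tighter (larger tau):
print(shrunk_mean(y_bar=1.0, n=38, sigma2=1.0, tau=40.0))   # ~0.49
```

In the actual model, $\tau_{att}$ is itself learned from the data, so the league collectively decides how hard to pull outliers like Liverpool and Man City back toward the pack.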

In the Baio and Blangiardo paper, they address the shrinkage by creating a more complicated model:

    The model of §2 assumes that all the attack and defense propensities be drawn by a common process, characterised by the common vector of hyperparameters $(\mu_{att}, \tau_{att}, \mu_{def}, \tau_{def})$; clearly, this might be not sufficient to capture the presence of different quality in the teams, therefore producing overshrinkage, with the effect of: a) penalising extremely good teams; and b) overestimate the performance of poor teams. One possible way to avoid this problem is to introduce a more complicated structure for the parameters of the model, in order to allow for three different generating mechanism, one for the top teams, one for the mid-table teams, and one for the bottom-table teams.
That makes a lot of sense. There does seem to be a top tier of teams that is just a leg above the rest. (I tried implementing their more complicated model, but I couldn't get it to converge nicely.)

I had some trouble wrapping my mind around how we decide when shrinkage is a problem, i.e. how do we choose between the simple model with some shrinkage and the more complicated model, so I emailed John Kruschke (author of the puppy book (http://www.amazon.com/Doing-Bayesian-Data-Analysis-Second/dp/0124058884/ref=sr_1_1?s=books&ie=UTF8&qid=1414453485&sr=1-1&keywords=doing+bayesian+data+analysis)), who replied:

    There's nothing inherently good or bad about shrinkage in hierarchical models. Shrinkage is just the mathematical consequence of the model choice. Just as you can choose to model a trend with a line or a quadratic or an exponential, you choose to model data with various hierarchies of parameters. We choose a model to achieve both descriptive adequacy and theoretical meaningfulness. If a particular choice of hierarchical model seems to produce too much shrinkage in some of the parameters, it means that somehow the hierarchy did not capture your prior knowledge, because the posterior seems to violate your prior knowledge in ways unjustifiable from the data.

    Hierarchical models are supposed to express prior knowledge about how subsets of data are related. If the various subsets can truly be thought of as representing a higher level, then it makes perfect sense to put the subsets under a hierarchy and use all the subsets to mutually inform the higher level and simultaneously constrain the lower-level estimates. The higher-level estimate essentially constitutes a prior for each of the lower-level parameters.

So in this case, is the simple model adequate? On the one hand, the observed shrinkage isn't too egregious, and the model has allowed Liverpool and Man City to be in an attacking class of their own. On the other hand, we have some prior knowledge in favor of there being different tiers of team strength: we can look at prior seasons, or at the wage bills. The chart below is from 2012-2013 (select 'Wage Bill' from the dropdown), but illustrates that there are at least two tiers when it comes to team wages: those spending more than £130m, and those spending less than £100m.

In[55]: HTML('<iframe src="http://cf.datawrapper.de/USTan/2/" frameborder="0" allowtransparency="true" allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen width="460" height="700"></iframe>')

Out[55]: [Embedded Datawrapper chart: "A record £1.8bn was spent on wages in 2012-13". Turnover, £m (2012-13): Manchester United 363, Arsenal 283, Manchester City 271, Chelsea 260, Liverpool 206, Tottenham Hotspur 147, Newcastle United 96, West Ham United 91, Everton 86, Aston Villa 84, Sunderland 76, Norwich City 75, Fulham 73, Southampton 72, West Bromwich Albion 70, Stoke City 67, Swansea City 67, Queens Park Rangers 61, Reading 59, Wigan Athletic 56.]

I don't think there's a correct/incorrect answer here. For a case that's more cut-and-dry, see this post (http://doingbayesiandataanalysis.blogspot.com/2012/11/shrinkage-in-multilevel-hierarchical.html) by John Kruschke, in which he models baseball batting averages. Because batting averages vary significantly by position (pitchers aren't hired for their batting ability), it makes sense to model batting averages as coming from position-specific distributions.

Come on You Spurs

Lastly, just for fun, let's see in what percentage of simulations each team finished in the top 4, qualifying for the Champions League:

In[43]: g = simuls.groupby('team')
df_champs = pd.DataFrame({'percent_champs': g.champion.mean(),
                          'percent_CL': g.qualified_for_CL.mean()})
df_champs = df_champs.sort_index(by='percent_CL')
df_champs = df_champs[df_champs.percent_CL > .05]
df_champs = df_champs.reset_index()

fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(df_champs.index.values, df_champs.percent_CL.values)

for i, row in df_champs.iterrows():
    label = "{0:.1f}%".format(100 * row['percent_CL'])
    ax.annotate(label, xy=(row['percent_CL'], i), xytext=(3, 10), textcoords='offset points')
ax.set_ylabel('Team')
ax.set_title('% of Simulated Seasons In Which Team Finished in Top 4')
_ = ax.set_yticks(df_champs.index + .5)
_ = ax.set_yticklabels(df_champs['team'].values)

In ~8.3% of alternative universes, Tottenham finished in the top 4. If the me from that universe is reading this and wants to swap, please contact me.

More on Hierarchical Bayesian Modeling

If you want to learn more about hierarchical Bayesian modeling, I recommend:

this post (http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/) by Thomas Wiecki and Danne Elbers
this post (http://doingbayesiandataanalysis.blogspot.com/2012/11/shrinkage-in-multilevel-hierarchical.html) by John Kruschke
pre-ordering the puppy book (http://www.amazon.com/Doing-Bayesian-Data-Analysis-Second/dp/0124058884/ref=sr_1_1?s=books&ie=UTF8&qid=1414453485&sr=1-1&keywords=doing+bayesian+data+analysis); Osvaldo Martin is implementing the models in pymc (https://github.com/aloctavodia/Doing_bayesian_data_analysis)


Comments
Ilya Kolpakov, 3 months ago:
Hi, thanks for the write-up. I am wondering why you decided to implement the zero-sum constraint in such a way, against, say, making the attack and defense of the last team defined as a sum of the parameters for all the other teams? I kind of see why this trick can work for sampling-based estimation and am wondering if it could create issues for optimization-based methods. What would you say?

Daniel Weitzenfeld (Mod), in reply to Ilya Kolpakov:
@Ilya Kolpakov I believe I did it this way to mimic how the authors did it in the paper. Intuitively, I can see why it would cause issues for optimization-based methods.

Edward Terry, a year ago:
This is great to see, very clear. Has anyone replicated this in R (using RJAGS or other)?

Daniel Weitzenfeld (Mod), in reply to Edward Terry:
Thanks! Not that I know of. The paper on which this is based (Baio and Blangiardo) includes WinBUGS, if that helps.

George Pamfilis, a year ago:
What a great lesson! I am gathering sources on MCMC (the puppy book etc.) and this is absolutely great. I have a question though: assuming I want to bet on a match, "ARS" vs "AST", how would I come up with a distribution of final goals using the model? Because as I see here, individual teams are 'ranked' but not a specific pair.

Mark, 2 years ago:
I'm trying to replicate this, and I get to the visualization stage but my df_observed hasn't any present QR column, although it looks like this is a different dataset?

Can you make this available via your GitHub account?

Daniel Weitzenfeld (Mod), in reply to Mark:
Are you loading this file: https://github.com/DanielWe...

Mark, in reply to Daniel Weitzenfeld:
Ah, nope! That ought to help.

Zygmunt Z, 2 years ago:
A great article, thank you.

I have a question: if the intercept captures _the average goals scored per game by the away team_, why is it present in home_theta (goals scored by the home team)?

Daniel Weitzenfeld (Mod), in reply to Zygmunt Z:
It's included in home_theta so that the coefficient on home captures the home-team advantage: the benefit of playing a game at home versus playing a game away. If it was not included, then the coefficient on home would be the average goal-scoring propensity of home teams, and I would have had to subtract the away scoring propensity (the intercept) in order to get the home-team advantage.

Put another way: whether I include or exclude the intercept from home_theta, the coefficients on all variables except home will be the same. By including the intercept, I'm making the coefficient on home more directly interpretable. But it's really just a matter of preference and there's no right/wrong way.


Copyright 2017 Dan Weitzenfeld. Powered by Pelican (http://getpelican.com).
