Pass the ROC (http://danielweitzenfeld.github.io/passtheroc/)
A Hierarchical Bayesian Model of the Premier League

Oct 28, 2014

Last fall, I was listening to an episode of the BS Report podcast (http://espn.go.com/espnradio/podcast/archive?id=2864045) in which Bill Simmons and Cousin Sal were discussing the strength of different NFL teams. It was early in the season, week 3 or 4 maybe, and Bill made the point (I'm paraphrasing here) that your estimate of one team's strength depends on your estimate of all the others. The conclusions you draw from team X beating team Y depend on how strong team Y is, which in turn depends on the conclusions you draw from team Y's other games, which in turn depends on how strong Y's opponents were, etc.

It occurred to me that this problem is perfect for a Bayesian model. We want to infer the latent parameters (every team's strength) that are generating the data we observe (the scorelines). Moreover, we know that the scorelines are a noisy measurement of team strength, so ideally, we want a model that makes it easy to quantify our uncertainty about the underlying strengths.

So I googled 'Bayesian football' and found this paper (http://www.statistica.it/gianluca/Research/BaioBlangiardo.pdf), called 'Bayesian hierarchical model for the prediction of football results.' The authors (Gianluca Baio and Marta A. Blangiardo) being Italian, though, the 'football' here is soccer.

In this post, I'm going to reproduce the first model described in the paper using pymc. While they used Serie A in their paper, I'm going to use the 2013-2014 Premier League.

Getting and cleaning the data

I got the data from Wikipedia (http://en.wikipedia.org/wiki/201314_Premier_League). I copied the 'Result Table' and saved it as a text file.

In[1]: %matplotlib inline
%config InlineBackend.figure_format = 'retina'

In[2]: import os
import math
import warnings
warnings.filterwarnings('ignore')

from IPython.display import Image, HTML
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pymc  # I know folks are switching to "as pm" but I'm just not there yet

In[3]: DATA_DIR = os.path.join(os.getcwd(), 'data/')
CHART_DIR = os.path.join(os.getcwd(), 'charts/')

In[4]: data_file = DATA_DIR + 'premier_league_13_14.txt'
df = pd.read_csv(data_file, sep='\t', index_col=0)
df.head()
Out[4]:
Home\Away        ARS  AST  CAR  CHE  CPA  EVE  FUL  HUL  LIV  MNC  MNU  NEW  NOR  SOT  STO  SUN  SWA  TOT  WBA  WHU
Arsenal          NaN  1–3  2–0  0–0  2–0  1–1  2–0  2–0  2–0  1–1  0–0  3–0  4–1  2–0  3–1  4–1  2–2  1–0  1–0  3–1
Aston Villa      1–2  NaN  2–0  1–0  0–1  0–2  1–2  3–1  0–1  3–2  0–3  1–2  4–1  0–0  1–4  0–0  1–1  0–2  4–3  0–2
Cardiff City     0–3  0–0  NaN  1–2  0–3  0–0  3–1  0–4  3–6  3–2  2–2  1–2  2–1  0–3  1–1  2–2  1–0  0–1  1–0  0–2
Chelsea          6–0  2–1  4–1  NaN  2–1  1–0  2–0  2–0  2–1  2–1  3–1  3–0  0–0  3–1  3–0  1–2  1–0  4–0  2–2  0–0
Crystal Palace   0–2  1–0  2–0  1–0  NaN  0–0  1–4  1–0  3–3  0–2  0–2  0–3  1–1  0–1  1–0  3–1  0–2  0–1  3–1  1–0

Clearly, we need to do some munging here. Let's turn this into a long dataframe, and split the score into two numeric columns.
In[5]: df.index = df.columns
rows = []
for i in df.index:
    for c in df.columns:
        if i == c: continue
        score = df.ix[i, c]
        score = [int(row) for row in score.split('–')]  # scores are en-dash separated, e.g. '1–3'
        rows.append([i, c, score[0], score[1]])
df = pd.DataFrame(rows, columns=['home', 'away', 'home_score', 'away_score'])
df.head()
Out[5]:
  home away  home_score  away_score
0  ARS  AST           1           3
1  ARS  CAR           2           0
2  ARS  CHE           0           0
3  ARS  CPA           2           0
4  ARS  EVE           1           1

Much better, but still not done. We need an easy way to refer to the teams. Let's create a lookup table, which maps team name to a unique integer i.

In[6]: teams = df.home.unique()
teams = pd.DataFrame(teams, columns=['team'])
teams['i'] = teams.index
teams.head()
Out[6]:
  team  i
0  ARS  0
1  AST  1
2  CAR  2
3  CHE  3
4  CPA  4

Now, we can merge this into our main dataframe to create the columns i_home and i_away.

In[7]: df = pd.merge(df, teams, left_on='home', right_on='team', how='left')
df = df.rename(columns={'i': 'i_home'}).drop('team', 1)
df = pd.merge(df, teams, left_on='away', right_on='team', how='left')
df = df.rename(columns={'i': 'i_away'}).drop('team', 1)
df.head()

Out[7]:
  home away  home_score  away_score  i_home  i_away
0  ARS  AST           1           3       0       1
1  CAR  AST           0           0       2       1
2  CHE  AST           2           1       3       1
3  CPA  AST           1           0       4       1
4  EVE  AST           2           1       5       1

Now, let's extract the data into arrays, so that pymc can work with it. Note that each of the arrays (observed_home_goals, observed_away_goals, home_team, away_team) is the same length, and that the ith entry of each refers to the same game.

In[8]: observed_home_goals = df.home_score.values
observed_away_goals = df.away_score.values
home_team = df.i_home.values
away_team = df.i_away.values
num_teams = len(df.i_home.unique())
num_games = len(home_team)

This next step is unnecessary, but I like to do it anyway: come up with some decent starting values for the att and def parameters. The log transformation will make sense shortly.

In[9]: g = df.groupby('i_away')
att_starting_points = np.log(g.away_score.mean())
g = df.groupby('i_home')
def_starting_points = np.log(g.away_score.mean())

The Model

In the first model from the paper, they model the observed goals-scored counts as:

$y_{gj} \mid \theta_{gj} \sim \mathrm{Poisson}(\theta_{gj})$

"where the $\theta = (\theta_{g1}, \theta_{g2})$ represent the scoring intensity in the $g$-th game for the team playing at home ($j = 1$) and away ($j = 2$), respectively."


They use a log-linear model for the thetas:

$\log \theta_{g1} = home + att_{h(g)} + def_{a(g)}$

$\log \theta_{g2} = att_{a(g)} + def_{h(g)}$

Note that they're breaking out team strength into attacking and defending strength. A negative defense parameter will sap the mojo from the opposing team's attacking parameter.

$home$ represents home-team advantage, and in this model is assumed to be constant across teams. The priors on the home and intercept parameters are flat (keep in mind that in pymc, Normals are specified as (mean, precision)):

$home \sim \mathrm{Normal}(0, .0001)$

$intercept \sim \mathrm{Normal}(0, .0001)$

The team-specific effects are modeled as exchangeable:

$att_t \sim \mathrm{Normal}(\mu_{att}, \tau_{att})$

$def_t \sim \mathrm{Normal}(\mu_{def}, \tau_{def})$

To ensure identifiability, they impose a sum-to-zero constraint on the attack and defense parameters. They also tried a corner constraint (pinning one team's parameters to 0, 0), but found that interpretation is easier in the sum-to-zero case because it's not relative to the 0, 0 team.

$\sum_{t=1}^{T} att_t = 0$

$\sum_{t=1}^{T} def_t = 0$
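The sum-to-zero constraint can be enforced by subtracting the mean from an unconstrained vector, which is the trick used in the model code later on. A quick numpy sanity check (with made-up numbers) shows why it works:

```python
import numpy as np

# Hypothetical unconstrained ("starred") attack effects for five teams.
atts_star = np.array([0.30, -0.12, 0.05, -0.40, 0.22])

# Subtracting the mean re-centers the vector so it sums to zero,
# without changing any of the differences between teams.
atts = atts_star - np.mean(atts_star)

print(np.isclose(atts.sum(), 0.0))                     # True
print(np.allclose(np.diff(atts), np.diff(atts_star)))  # True: team gaps preserved
```

Because only the differences between teams matter for predictions, re-centering costs nothing while making every parameter interpretable relative to the league average.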

The hyperpriors on the attack and defense parameters are also flat:

$\mu_{att} \sim \mathrm{Normal}(0, .0001)$

$\mu_{def} \sim \mathrm{Normal}(0, .0001)$

$\tau_{att} \sim \mathrm{Gamma}(.1, .1)$

$\tau_{def} \sim \mathrm{Gamma}(.1, .1)$

My Tweaks

I made two tweaks to the model. It didn't make sense to me to have a $\mu_{att}$ when we're enforcing the sum-to-zero constraint by subtracting the mean anyway. So I eliminated $\mu_{att}$ and $\mu_{def}$:

$att_t \sim \mathrm{Normal}(0, \tau_{att})$

$def_t \sim \mathrm{Normal}(0, \tau_{def})$

Also because of the sum-to-zero constraint, it seemed to me that we needed an intercept term in the log-linear model, capturing the average goals scored per game by the away team:

$\log \theta_{g1} = intercept + home + att_{h(g)} + def_{a(g)}$

$\log \theta_{g2} = intercept + att_{a(g)} + def_{h(g)}$

Lastly, I tried uniform priors on $\tau_{att}$ and $\tau_{def}$, but found no difference relative to the gamma priors. Below is what the model looks like in pymc.


One thing I learned in this process is that pymc plays best with numpy arrays. If you find yourself using a list comprehension, see if there's a way to use numpy arrays instead.
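For instance (with made-up numbers), indexing a numpy array with an integer array of team indices replaces a per-game list comprehension with a single vectorized operation, which is the pattern the model's deterministic nodes rely on:

```python
import numpy as np

atts = np.array([0.25, -0.10, 0.05])   # per-team attack effects (made up)
home_team = np.array([0, 2, 1, 0])     # index of the home team in each of 4 games

# A per-game list comprehension...
slow = [atts[i] for i in home_team]
# ...is equivalent to one vectorized fancy-indexing lookup:
fast = atts[home_team]

print(np.allclose(slow, fast))  # True
```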

In[10]: # hyperpriors
home = pymc.Normal('home', 0, .0001, value=0)
tau_att = pymc.Gamma('tau_att', .1, .1, value=10)
tau_def = pymc.Gamma('tau_def', .1, .1, value=10)
intercept = pymc.Normal('intercept', 0, .0001, value=0)

# team-specific parameters
atts_star = pymc.Normal("atts_star",
                        mu=0,
                        tau=tau_att,
                        size=num_teams,
                        value=att_starting_points.values)
defs_star = pymc.Normal("defs_star",
                        mu=0,
                        tau=tau_def,
                        size=num_teams,
                        value=def_starting_points.values)

# trick to code the sum-to-zero constraint
@pymc.deterministic
def atts(atts_star=atts_star):
    atts = atts_star.copy()
    atts = atts - np.mean(atts_star)
    return atts

@pymc.deterministic
def defs(defs_star=defs_star):
    defs = defs_star.copy()
    defs = defs - np.mean(defs_star)
    return defs

@pymc.deterministic
def home_theta(home_team=home_team,
               away_team=away_team,
               home=home,
               atts=atts,
               defs=defs,
               intercept=intercept):
    return np.exp(intercept +
                  home +
                  atts[home_team] +
                  defs[away_team])

@pymc.deterministic
def away_theta(home_team=home_team,
               away_team=away_team,
               home=home,
               atts=atts,
               defs=defs,
               intercept=intercept):
    return np.exp(intercept +
                  atts[away_team] +
                  defs[home_team])


home_goals = pymc.Poisson('home_goals',
                          mu=home_theta,
                          value=observed_home_goals,
                          observed=True)
away_goals = pymc.Poisson('away_goals',
                          mu=away_theta,
                          value=observed_away_goals,
                          observed=True)

mcmc = pymc.MCMC([home, intercept, tau_att, tau_def,
                  home_theta, away_theta,
                  atts_star, defs_star, atts, defs,
                  home_goals, away_goals])
map_ = pymc.MAP(mcmc)
map_.fit()
mcmc.sample(200000, 40000, 20)

[100%] 200000 of 200000 complete in 91.9 sec

Diagnostics

Let's see if/how the model converged. The home parameter looks good, and indicates that home-field advantage amounts to $e^{.4} - e^{.1} \approx .39$ goals per game at the intercept.
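As a back-of-the-envelope check, using round-number approximations to the posterior means (home ≈ .3 and intercept ≈ .1, which are assumptions here for illustration): at zero attack and defense effects, the home and away goal intensities differ by about .39.

```python
import math

# Rounded, approximate posterior means (assumptions for illustration).
intercept = 0.1
home = 0.3

theta_home = math.exp(intercept + home)  # expected home goals when atts = defs = 0
theta_away = math.exp(intercept)         # expected away goals when atts = defs = 0

print(round(theta_home - theta_away, 2))  # 0.39
```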

In[44]: pymc.Matplot.plot(home)
Plotting home

The intercept also looks good, and has an HDI that doesn't include zero, indicating that including an intercept was a good idea:

In[45]: pymc.Matplot.plot(intercept)
Plotting intercept

The $\tau$s look OK, although there is definitely more autocorrelation in $\tau_{def}$:

In[46]: pymc.Matplot.plot(tau_att)
Plotting tau_att

In[47]: pymc.Matplot.plot(tau_def)
Plotting tau_def

Below is a subset of the atts parameters' plots. To see them all, we'd just do pymc.Matplot.plot(atts).

In[15]: Embed = Image(CHART_DIR + 'atts.png')
Embed
Visualization

Let's see if we can recreate a plot (Figure 3) from their paper, of average defense effect by average attack effect. We'll add in color-coding by whether the team ended the season qualifying for the Champions League (top 4), Europa League (5 & 6), or was relegated (bottom 3).

In[17]: observed_season = DATA_DIR + 'premier_league_13_14_table.csv'
df_observed = pd.read_csv(observed_season)
df_observed.loc[df_observed.QR.isnull(), 'QR'] = ''

In[48]: df_avg = pd.DataFrame({'avg_att': atts.stats()['mean'],
                       'avg_def': defs.stats()['mean']},
                      index=teams.team.values)
df_avg = pd.merge(df_avg, df_observed, left_index=True, right_on='team', how='left')

fig, ax = plt.subplots(figsize=(8, 6))
for outcome in ['champs_league', 'relegation', 'europa_league', '']:
    ax.plot(df_avg.avg_att[df_avg.QR == outcome],
            df_avg.avg_def[df_avg.QR == outcome], 'o', label=outcome)

for label, x, y in zip(df_avg.team.values, df_avg.avg_att.values, df_avg.avg_def.values):
    ax.annotate(label, xy=(x, y), xytext=(5, 5), textcoords='offset points')
ax.set_title('Attack vs Defense avg effect: 13-14 Premier League')
ax.set_xlabel('Avg attack effect')
ax.set_ylabel('Avg defense effect')
ax.legend()

Out[48]: <matplotlib.legend.Legend at 0x115cd1c90>
So as you would expect, the top teams are in the lower right side of the chart, indicating a positive attack effect and a negative defense effect. Interestingly, Manchester United (MNU) appears to have both a stronger attack and a stronger defense than two teams that finished above it in the table, Everton (EVE) and Tottenham (TOT). But we're just looking at posterior means here, and we ought to take advantage of the fact that we can quantify our posterior uncertainty around these parameters. Let's look at the Highest Posterior Density intervals for the attack parameters.

In[49]: df_hpd = pd.DataFrame(atts.stats()['95% HPD interval'],
                      columns=['hpd_low', 'hpd_high'],
                      index=teams.team.values)
df_median = pd.DataFrame(atts.stats()['quantiles'][50],
                         columns=['hpd_median'],
                         index=teams.team.values)
df_hpd = df_hpd.join(df_median)
df_hpd['relative_lower'] = df_hpd.hpd_median - df_hpd.hpd_low
df_hpd['relative_upper'] = df_hpd.hpd_high - df_hpd.hpd_median
df_hpd = df_hpd.sort_index(by='hpd_median')
df_hpd = df_hpd.reset_index()
df_hpd['x'] = df_hpd.index + .5


fig, axs = plt.subplots(figsize=(10, 4))
axs.errorbar(df_hpd.x, df_hpd.hpd_median,
             yerr=(df_hpd[['relative_lower', 'relative_upper']].values).T,
             fmt='o')
axs.set_title('HPD of Attack Strength, by Team')
axs.set_xlabel('Team')
axs.set_ylabel('Posterior Attack Strength')
_ = axs.set_xticks(df_hpd.index + .5)
_ = axs.set_xticklabels(df_hpd['index'].values, rotation=45)
Note that the intervals for MNU, EVE, and TOT overlap significantly. Indeed, there's a large set of teams in the middle with overlapping HPDs. We can be fairly confident, though, that Liverpool's and Man City's attack strength left the rest of the pack behind, which makes sense given that they each scored 30 more goals than Chelsea, the team with the third-most goals scored. In fact, we might conclude that their attacks were even further from the rest of the pack if it weren't for shrinkage. To see what I mean, let's simulate some seasons.

Simulations

We can take draws from the posterior distributions of the parameters, and simulate a season, or many seasons. Below is the simulation code, just so you can see how I did it.

In[24]: def simulate_season():
    """
    Simulate a season once, using one random draw from the mcmc chain.
    """
    num_samples = atts.trace().shape[0]
    draw = np.random.randint(0, num_samples)
    atts_draw = pd.DataFrame({'att': atts.trace()[draw, :]})
    defs_draw = pd.DataFrame({'def': defs.trace()[draw, :]})
    home_draw = home.trace()[draw]
    intercept_draw = intercept.trace()[draw]
    season = df.copy()
    season = pd.merge(season, atts_draw, left_on='i_home', right_index=True)
    season = pd.merge(season, defs_draw, left_on='i_home', right_index=True)
    season = season.rename(columns={'att': 'att_home', 'def': 'def_home'})
    season = pd.merge(season, atts_draw, left_on='i_away', right_index=True)
    season = pd.merge(season, defs_draw, left_on='i_away', right_index=True)
    season = season.rename(columns={'att': 'att_away', 'def': 'def_away'})
    season['home'] = home_draw
    season['intercept'] = intercept_draw
    season['home_theta'] = season.apply(lambda x: math.exp(x['intercept'] +
                                                           x['home'] +
                                                           x['att_home'] +
                                                           x['def_away']), axis=1)
    season['away_theta'] = season.apply(lambda x: math.exp(x['intercept'] +
                                                           x['att_away'] +
                                                           x['def_home']), axis=1)
    season['home_goals'] = season.apply(lambda x: np.random.poisson(x['home_theta']), axis=1)
    season['away_goals'] = season.apply(lambda x: np.random.poisson(x['away_theta']), axis=1)
    season['home_outcome'] = season.apply(lambda x: 'win' if x['home_goals'] > x['away_goals'] else
                                          'loss' if x['home_goals'] < x['away_goals'] else 'draw', axis=1)
    season['away_outcome'] = season.apply(lambda x: 'win' if x['home_goals'] < x['away_goals'] else
                                          'loss' if x['home_goals'] > x['away_goals'] else 'draw', axis=1)
    season = season.join(pd.get_dummies(season.home_outcome, prefix='home'))
    season = season.join(pd.get_dummies(season.away_outcome, prefix='away'))
    return season


def create_season_table(season):
    """
    Using a season dataframe output by simulate_season(), create a summary dataframe
    with wins, losses, goals for, etc.
    """
    g = season.groupby('i_home')
    home = pd.DataFrame({'home_goals': g.home_goals.sum(),
                         'home_goals_against': g.away_goals.sum(),
                         'home_wins': g.home_win.sum(),
                         'home_draws': g.home_draw.sum(),
                         'home_losses': g.home_loss.sum()
                         })
    g = season.groupby('i_away')
    away = pd.DataFrame({'away_goals': g.away_goals.sum(),
                         'away_goals_against': g.home_goals.sum(),
                         'away_wins': g.away_win.sum(),
                         'away_draws': g.away_draw.sum(),
                         'away_losses': g.away_loss.sum()
                         })
    df = home.join(away)
    df['wins'] = df.home_wins + df.away_wins
    df['draws'] = df.home_draws + df.away_draws
    df['losses'] = df.home_losses + df.away_losses
    df['points'] = df.wins * 3 + df.draws
    df['gf'] = df.home_goals + df.away_goals
    df['ga'] = df.home_goals_against + df.away_goals_against
    df['gd'] = df.gf - df.ga
    df = pd.merge(teams, df, left_on='i', right_index=True)
    df = df.sort_index(by='points', ascending=False)
    df = df.reset_index()
    df['position'] = df.index + 1
    df['champion'] = (df.position == 1).astype(int)
    df['qualified_for_CL'] = (df.position < 5).astype(int)
    df['relegated'] = (df.position > 17).astype(int)
    return df

def simulate_seasons(n=100):
    dfs = []
    for i in range(n):
        s = simulate_season()
        t = create_season_table(s)
        t['iteration'] = i
        dfs.append(t)
    return pd.concat(dfs, ignore_index=True)

Let's now simulate a thousand seasons.

In[25]: simuls = simulate_seasons(1000)

How did Man City do across our 1000 simulations? Below is a histogram of the number of points they had at season's end:

In[50]: ax = simuls.points[simuls.team == 'MNC'].hist(figsize=(7, 5))
median = simuls.points[simuls.team == 'MNC'].median()
ax.set_title('Man City: 2013-14 points, 1000 simulations')
ax.plot([median, median], ax.get_ylim())
plt.annotate('Median: %s' % median, xy=(median + 1, ax.get_ylim()[1] - 10))

Out[50]: <matplotlib.text.Annotation at 0x1152d9d90>

And how many goals they scored:

In[51]: ax = simuls.gf[simuls.team == 'MNC'].hist(figsize=(7, 5))
median = simuls.gf[simuls.team == 'MNC'].median()
ax.set_title('Man City: 2013-14 goals for, 1000 simulations')
ax.plot([median, median], ax.get_ylim())
plt.annotate('Median: %s' % median, xy=(median + 1, ax.get_ylim()[1] - 10))

Out[51]: <matplotlib.text.Annotation at 0x114a9e610>
Note that the median simulation has Man City scoring 95 goals, 7 fewer than they actually scored last season. I wonder if something similar is going on for other teams. Let's check.

In[52]: g = simuls.groupby('team')
season_hdis = pd.DataFrame({'points_lower': g.points.quantile(.05),
                            'points_upper': g.points.quantile(.95),
                            'goals_for_lower': g.gf.quantile(.05),
                            'goals_for_median': g.gf.median(),
                            'goals_for_upper': g.gf.quantile(.95),
                            'goals_against_lower': g.ga.quantile(.05),
                            'goals_against_upper': g.ga.quantile(.95),
                            })
season_hdis = pd.merge(season_hdis, df_observed, left_index=True, right_on='team')
column_order = ['team', 'points_lower', 'Pts', 'points_upper',
                'goals_for_lower', 'GF', 'goals_for_median', 'goals_for_upper',
                'goals_against_lower', 'GA', 'goals_against_upper']
season_hdis = season_hdis[column_order]
season_hdis['relative_goals_upper'] = season_hdis.goals_for_upper - season_hdis.goals_for_median
season_hdis['relative_goals_lower'] = season_hdis.goals_for_median - season_hdis.goals_for_lower
season_hdis = season_hdis.sort_index(by='GF')
season_hdis = season_hdis.reset_index()
season_hdis['x'] = season_hdis.index + .5
season_hdis

fig, axs = plt.subplots(figsize=(10, 6))
axs.scatter(season_hdis.x, season_hdis.GF, c=sns.palettes.color_palette()[4], zorder=10,
            label='Actual Goals For')
axs.errorbar(season_hdis.x, season_hdis.goals_for_median,
             yerr=(season_hdis[['relative_goals_lower', 'relative_goals_upper']].values).T,
             fmt='s', c=sns.palettes.color_palette()[5], label='Simulations')
axs.set_title('Actual Goals For, and 90% Interval from Simulations, by Team')
axs.set_xlabel('Team')
axs.set_ylabel('Goals Scored')
axs.set_xlim(0, 20)
axs.legend()
_ = axs.set_xticks(season_hdis.index + .5)
_ = axs.set_xticklabels(season_hdis['team'].values, rotation=45)

Indeed, for high-scoring teams, the observed goals scored falls in the upper end of the distribution from our simulations, and the opposite is true (but to a lesser extent) for the lower-scoring teams. This is shrinkage in action.

Shrinkage

Recall that we defined the attack parameters as being exchangeable: $att_t \sim \mathrm{Normal}(0, \tau_{att})$. Shrinkage is when parameters are pulled toward their group mean; in this case, the team-specific attack parameters are pulled toward zero. We can see this visually in the chart above: the blue squares (simulation medians) are all closer to the mean than the observations (yellow circles).

One way to think about shrinkage in this model is that we're using the data on all the teams to inform our inference on each individual team. We're using the fact that most teams scored 40-60 goals in the season to infer that Liverpool's and Man City's 100+ goal seasons were a bit lucky.
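The pull toward the group mean shows up cleanly in the textbook normal-normal case, which is only a toy stand-in for this model but makes the mechanism concrete: with a zero-mean prior of precision tau, the posterior mean of a team effect is the raw estimate scaled toward zero, and the scaling strengthens as tau grows. The function and numbers below are hypothetical, just for illustration.

```python
def shrunk_mean(y_bar, n, sigma2, tau):
    """Posterior mean for y_i ~ Normal(theta, sigma2), theta ~ Normal(0, precision=tau).
    Uses the (mean, precision) parameterisation, as pymc does."""
    data_precision = n / sigma2
    return (data_precision * y_bar) / (data_precision + tau)

# A raw estimate of 1.0 over a 38-game season is pulled toward zero...
print(shrunk_mean(y_bar=1.0, n=38, sigma2=1.0, tau=10.0))   # ~0.79
# ...and pulled harder when the prior is tighter (larger tau):
print(shrunk_mean(y_bar=1.0, n=38, sigma2=1.0, tau=40.0))   # ~0.49
```

In the actual model, $\tau_{att}$ is itself learned from the data, so the league collectively decides how hard to pull outliers like Liverpool and Man City back toward the pack.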

In the Baio and Blangiardo paper, they address the shrinkage by creating a more complicated model:

    The model of §2 assumes that all the attack and defense propensities be drawn by a common process, characterised by the common vector of hyperparameters $(\mu_{att}, \tau_{att}, \mu_{def}, \tau_{def})$; clearly, this might be not sufficient to capture the presence of different quality in the teams, therefore producing overshrinkage, with the effect of: a) penalising extremely good teams; and b) overestimate the performance of poor teams. One possible way to avoid this problem is to introduce a more complicated structure for the parameters of the model, in order to allow for three different generating mechanism, one for the top teams, one for the mid-table teams, and one for the bottom-table teams.
That makes a lot of sense. There does seem to be a top tier of teams that is just a leg above the rest. (I tried implementing their more complicated model, but I couldn't get it to converge nicely.)

I had some trouble wrapping my mind around how we decide when shrinkage is a problem, i.e. how do we choose between the simple model with some shrinkage and the more complicated model, so I emailed John Kruschke (author of the puppy book (http://www.amazon.com/Doing-Bayesian-Data-Analysis-Second/dp/0124058884/ref=sr_1_1?s=books&ie=UTF8&qid=1414453485&sr=1-1&keywords=doing+bayesian+data+analysis)), who replied:

    There's nothing inherently good or bad about shrinkage in hierarchical models. Shrinkage is just the mathematical consequence of the model choice. Just as you can choose to model a trend with a line or a quadratic or an exponential, you choose to model data with various hierarchies of parameters. We choose a model to achieve both descriptive adequacy and theoretical meaningfulness. If a particular choice of hierarchical model seems to produce too much shrinkage in some of the parameters, it means that somehow the hierarchy did not capture your prior knowledge, because the posterior seems to violate your prior knowledge in ways unjustifiable from the data.

    Hierarchical models are supposed to express prior knowledge about how subsets of data are related. If the various subsets can truly be thought of as representing a higher level, then it makes perfect sense to put the subsets under a hierarchy and use all the subsets to mutually inform the higher level and simultaneously constrain the lower-level estimates. The higher-level estimate essentially constitutes a prior for each of the lower-level parameters.

So in this case, is the simple model adequate? On the one hand, the observed shrinkage isn't too egregious, and the model has allowed Liverpool and Man City to be in an attacking class of their own. On the other hand, we have some prior knowledge in favor of there being different tiers of team strength: we can look at prior seasons, or at the wage bills. The chart below is from 2012-2013 (select 'Wage Bill' from the dropdown), but illustrates that there are at least two tiers when it comes to team wages: those spending more than £130m, and those spending less than £100m.

In[55]: HTML('<iframe src="http://cf.datawrapper.de/USTan/2/" frameborder="0" allowtransparency="true" allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen width="460" height="700"></iframe>')

Out[55]: [Embedded Datawrapper chart: "A record £1.8bn was spent on wages in 2012-13". Turnover, £m (2012-13): Manchester United 363, Arsenal 283, Manchester City 271, Chelsea 260, Liverpool 206, Tottenham Hotspur 147, Newcastle United 96, West Ham United 91, Everton 86, Aston Villa 84, Sunderland 76, Norwich City 75, Fulham 73, Southampton 72, West Bromwich Albion 70, Stoke City 67, Swansea City 67, Queens Park Rangers 61, Reading 59, Wigan Athletic 56.]

I don't think there's a correct/incorrect answer here. For a case that's more cut-and-dry, see this post (http://doingbayesiandataanalysis.blogspot.com/2012/11/shrinkage-in-multilevel-hierarchical.html) by John Kruschke, in which he models baseball batting averages. Because batting averages vary significantly by position (pitchers aren't hired for their batting ability), it makes sense to model batting averages as coming from position-specific distributions.

Come on You Spurs

Lastly, just for fun, let's see in what percentage of simulations each team finished in the top 4, qualifying for the Champions League:

In[43]: g = simuls.groupby('team')
df_champs = pd.DataFrame({'percent_champs': g.champion.mean(),
                          'percent_CL': g.qualified_for_CL.mean()})
df_champs = df_champs.sort_index(by='percent_CL')
df_champs = df_champs[df_champs.percent_CL > .05]
df_champs = df_champs.reset_index()

fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(df_champs.index.values, df_champs.percent_CL.values)

for i, row in df_champs.iterrows():
    label = "{0:.1f}%".format(100 * row['percent_CL'])
    ax.annotate(label, xy=(row['percent_CL'], i), xytext=(3, 10), textcoords='offset points')
ax.set_ylabel('Team')
ax.set_title('% of Simulated Seasons In Which Team Finished in Top 4')
_ = ax.set_yticks(df_champs.index + .5)
_ = ax.set_yticklabels(df_champs['team'].values)

In ~8.3% of alternative universes, Tottenham finished in the top 4. If the me from that universe is reading this and wants to swap, please contact me.

More on Hierarchical Bayesian Modeling

If you want to learn more about hierarchical Bayesian modeling, I recommend:

this post (http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/) by Thomas Wiecki and Danne Elbers
this post (http://doingbayesiandataanalysis.blogspot.com/2012/11/shrinkage-in-multilevel-hierarchical.html) by John Kruschke
pre-ordering the puppy book (http://www.amazon.com/Doing-Bayesian-Data-Analysis-Second/dp/0124058884/ref=sr_1_1?s=books&ie=UTF8&qid=1414453485&sr=1-1&keywords=doing+bayesian+data+analysis); Osvaldo Martin is implementing the models in pymc (https://github.com/aloctavodia/Doing_bayesian_data_analysis)


Comments
Ilya Kolpakov, 3 months ago:
Hi, thanks for the write-up. I am wondering why you decided to implement the zero-sum constraint in such a way, against, say, making the attack and defense of the last team defined as a sum of the parameters for all the other teams? I kind of see why this trick can work for sampling-based estimation and am wondering if it could create issues for optimization-based methods. What would you say?

Daniel Weitzenfeld (Mod), in reply to Ilya Kolpakov:
@Ilya Kolpakov I believe I did it this way to mimic how the authors did it in the paper. Intuitively, I can see why it would cause issues for optimization-based methods.

Edward Terry, a year ago:
This is great to see, very clear. Has anyone replicated this in R (using RJAGS or other)?

Daniel Weitzenfeld (Mod), in reply to Edward Terry:
Thanks! Not that I know of. The paper on which this is based (Baio and Blangiardo) includes WinBUGS, if that helps.

George Pamfilis, a year ago:
What a great lesson! I am gathering sources on MCMC (the puppy book etc.) and this is absolutely great. I have a question though: assuming I want to bet on a match, "ARS" vs "AST", how would I come up with a distribution of final goals using the model? Because as I see here, individual teams are 'ranked' but not a specific pair.

Mark, 2 years ago:
I'm trying to replicate this, and I get to the visualization stage but my df_observed hasn't any present QR column, although it looks like this is a different dataset?

Can you make this available via your GitHub account?

Daniel Weitzenfeld (Mod), in reply to Mark:
Are you loading this file: https://github.com/DanielWe...

Mark, in reply to Daniel Weitzenfeld:
Ah, nope! That ought to help.

Zygmunt Z, 2 years ago:
A great article, thank you.

I have a question: if the intercept captures _the average goals scored per game by the away team_, why is it present in home_theta (goals scored by the home team)?

Daniel Weitzenfeld (Mod), in reply to Zygmunt Z:
It's included in home_theta so that the coefficient on home captures the home-team advantage: the benefit of playing a game at home versus playing a game away. If it was not included, then the coefficient on home would be the average goal-scoring propensity of home teams, and I would have had to subtract the away scoring propensity (the intercept) in order to get the home-team advantage.

Put another way: whether I include or exclude the intercept from home_theta, the coefficients on all variables except home will be the same. By including the intercept, I'm making the coefficient on home more directly interpretable. But it's really just a matter of preference and there's no right/wrong way.


Copyright 2017 Dan Weitzenfeld. Powered by Pelican (http://getpelican.com).
