Dynamic Mixtures of Contextual Experts For Language Modeling

Dynamic Mixtures of Contextual Experts for Language Modeling
Abstract representations. However, standard neural network

language models are not able to explicitly capture
Language models are getting increasingly ac- sources of linguistic variation, which Bellegarda
curate, but presently they are unable to selec- (2004) defines to include the topic of conversation,
tively attend to context information to improve
format of conversation, and speaker-specific factors.
modeling. Instead, they incorporate context in-
formation by concatenating it to the input word, We are interested in designing contextual language
which makes the importance of context static models that use context information to learn more
across words and sentences. In this paper, we specialized language representations. For example,
introduce a mixture of contextual experts model language models that incorporate user identity are
which dynamically determines, for each word,
able to capture idiosyncrasies of individual expres-
the appropriate combination of multiple con-
textual language models (experts). We design
sion. These personalized language models outper-
the model to linearly combine the experts’ pre- form standard language models and have been used
dictions, enabling us to identify how much the in the academic community as well as in industry
model attends to each context for a given input. (Chelba et al., 2017; Xue et al., 2009; Huang et al.,
Experiments on Reddit and Yelp corpora, using 2014; Lee et al., 2017). Incorporating the topic of
subreddits, restaurant categories, and users as a text also helps model language that is specific to
context, demonstrate that the proposed models the subject being discussed (Wang and Cho, 2016).
not only result in increased accuracy (over 6%
However, these LMs either learn different language
decrease in perplexity over concatenation) but
also provide insights into how different con- models for each context (Yoon et al., 2017) or con-
texts affect language. catenate the context representation to the input. We
look for a more dynamic and flexible method of in-
corporating context where the model can explicitly
1 Introduction choose how much weight to give to the context at
Real world language applications often depend on each timestep.
accurate language models (LMs) for robust perfor- In this paper, we introduce mixture of contextual ex-
mance. For example, speech recognition frameworks perts (MCE), an end-to-end context language model
use language models to discern between possible sen- for incorporating multiple contexts. Following the
tences generated from audio (Chorowski et al., 2015), mixture-of-experts (MoE) (Jacobs et al., 1991) ap-
and virtual assistants and other dialogue systems proach, our model consists of neural language models
heavily rely on LMs for realistic conversations (Sor- for each context, along with a context-free, back-
doni et al., 2015). Recently, LMs based on Recur- ground LM. For the given input word and its respec-
rent Neural Networks (RNNs) (Elman, 1990) have tive context, each LM generates word predictions
achieved state-of-the-art performance by modeling over the vocabulary, which are then synthesized at
the long-term language dependencies in continuous the output layer with the help of a gating network.
MCE can directly modulate the importance of con- vector, is projected to a dense, continuous space.
textual information based on sequence history and The weights of this layer are often word embed-
current input, providing the ability to dynamically in- dings (Mikolov et al., 2013) pretrained on a larger
corporate contexts. MCE structure also dynamically corpus, allowing language models to share factors
switches between multiple contexts. Furthermore, across similar words and leverage external informa-
the linear nature of the MoE approach provides a tion. In contrast to standard feedforward neural net-
handy tool to peek into the dependence of the words works, RNNs maintain a hidden state h(t) , a function
on the context, facilitating language analysis. of the previous hidden state h(t−1) and the current
(t)
We use large, real-world text corpora from social input word xi , with an activation function of
media data (Reddit) and reviews (Yelp) as our testbed,
(t) (t−1) (t−1)
as they provide intuitive contexts such as user iden- hi = f W ih xi + W hh hi + bh . (1)
tity and sub-community information. For these two
datasets with three context configurations, we show Because learning long term dependencies is difficult
that our method outperforms baselines in perplex- for RNNs, variants such as the LSTM (Hochreiter
ity by 6 − 14%. In addition, the word-specific im- and Schmidhuber, 1997) contain parameters to ex-
portance of the context leads to insights about how plicitly decide which information to forget or retain
contexts and words interact. We find certain parts of in the hidden state and which to incorporate in the
speech and beginnings of sentences are more likely final output. This facilitates more natural language
to depend on user and/or community identity. We generation (Wen et al., 2015).
are also able to identify certain words which are used In this work, we exclusively use LSTM layers for
differently across contexts. our recurrent layers, as they have shown recent suc-
cesses in several language modeling tasks (Melis et
2 Background al., 2018; Jozefowicz et al., 2016; Merity et al., 2018).
However, as our framework is generic, competing ar-
In language modeling, we are interested in learn-
chitectures including GRUs (Chung et al., 2014) and
ing representations that allow us to accurately pre-
QRNNs (Bradbury et al., 2017) may be substituted
dict the next word in a sequence, conditioned on
in for the LSTMs.
the words that have appeared in the sequence so
The final output of the model is then calculated as
far. These models are trained on data that consist
of a set of word sequences, D = {x1 , x2 , . . . , xN }, (t)
p(xi |xi
(1:t−1) (t)
) = softmax(W ho hi ).
where each sequence is typically a sentence. xi =
(1) (2) (T ) This is a straightforward application of an RNN to the
(xi , xi , . . . , xi ) is one word sequence, with
t ∈ [1, T ] the timestep or the position of the word in task of lanaguage modeling and is represented in Fig-
the sequence. More precisely, LMs predict a con- ure 1a. However, these models lack the ability to take
(t) (1:t−1) context information under consideration. Context de-
ditional probability distribution p(xi |xi , θ),
(1:t−1) (1) (2) (t−1) termines much of language choice, and therefore the
where xi = (xi , xi , . . . , xi ) is the his-
inability to specify context inhibits the performance
tory of observed words up to timestep t and θ repre-
of standard RNNs in modeling language (Mikolov
sents the model parameters. We follow the standard
and Zweig, 2012; Li et al., 2016). In addition, this
training approach of minimizing the log-loss (cross-
limitation prevents these models from generating sen-
entropy) objective function to train our models:
tences conditioned on a specific context.
XX (t) (1:t−1)
L(θ) = − log p xi |xi ,θ
i∈N t∈T 2.2 Incorporating Context
This is equivalent to maximizing the log-likelihood One commonly used method to overcome these draw-
of the sequences. backs is to append context information to the in-
put of the recurrent function (Mikolov and Zweig,
2.1 Recurrent Neural Language Models 2012; Serban et al., 2016). Let ci be the one-hot-
In recurrent neural network (RNN) models, each in- encoding of the context identity (or a dense rep-
(t)
put word xi , represented as a one-hot encoding resentation of it). Then, the new input vector is
(t) 0.1 0.0 0.2 0.2 0.1 0.1 0.3 0.3 0.1 0.0 0.2 0.2 0.1 0.1 0.3 0.3 0.1 0.0 0.2 0.2 0.1 0.1 0.3 0.3
x̂i (t)
<latexit sha1_base64="j9tjG3Vq0Axj+jVfDB1+/HMx97U=">AAAB93icbVBNS8NAEN34WetHox69LBahXkoqgnorevFYwdhCG8Nmu2mXbjZhdyLWkF/ixYOKV/+KN/+N2zYHbX0w8Hhvhpl5QSK4Bsf5tpaWV1bX1ksb5c2t7Z2Kvbt3p+NUUebSWMSqExDNBJfMBQ6CdRLFSBQI1g5GVxO//cCU5rG8hXHCvIgMJA85JWAk3670hgSyx9zn91kNjnPfrjp1Zwq8SBoFqaICLd/+6vVjmkZMAhVE627DScDLiAJOBcvLvVSzhNARGbCuoZJETHvZ9PAcHxmlj8NYmZKAp+rviYxEWo+jwHRGBIZ63puI/3ndFMJzL+MySYFJOlsUpgJDjCcp4D5XjIIYG0Ko4uZWTIdEEQomq7IJoTH/8iJxT+oXdefmtNq8LNIooQN0iGqogc5QE12jFnIRRSl6Rq/ozXqyXqx362PWumQVM/voD6zPHzskkwA=</latexit>
x̂i x̂tbi ↵
x̂tci
<latexit sha1_base64="GK3zHEr8sjfzP8R4AYe+UCdZTR8=">AAAB7HicbVBNS8NAEJ34WetX1aOXxSJ4KokI6q3oxWMF0xbaUCbbTbt2swm7G6GE/gcvHlS8+oO8+W/ctjlo64OBx3szzMwLU8G1cd1vZ2V1bX1js7RV3t7Z3duvHBw2dZIpynyaiES1Q9RMcMl8w41g7VQxjEPBWuHoduq3npjSPJEPZpyyIMaB5BGnaKzU7KJIh9irVN2aOwNZJl5BqlCg0at8dfsJzWImDRWodcdzUxPkqAyngk3K3UyzFOkIB6xjqcSY6SCfXTshp1bpkyhRtqQhM/X3RI6x1uM4tJ0xmqFe9Kbif14nM9FVkHOZZoZJOl8UZYKYhExfJ32uGDVibAlSxe2thA5RITU2oLINwVt8eZn457Xrmnt/Ua3fFGmU4BhO4Aw8uIQ63EEDfKDwCM/wCm9O4rw4787HvHXFKWaO4A+czx/3l47r</latexit>
<latexit
Dense Dense
<latexit sha1_base64="j9tjG3Vq0Axj+jVfDB1+/HMx97U=">AAAB93icbVBNS8NAEN34WetHox69LBahXkoqgnorevFYwdhCG8Nmu2mXbjZhdyLWkF/ixYOKV/+KN/+N2zYHbX0w8Hhvhpl5QSK4Bsf5tpaWV1bX1ksb5c2t7Z2Kvbt3p+NUUebSWMSqExDNBJfMBQ6CdRLFSBQI1g5GVxO//cCU5rG8hXHCvIgMJA85JWAk3670hgSyx9zn91kNjnPfrjp1Zwq8SBoFqaICLd/+6vVjmkZMAhVE627DScDLiAJOBcvLvVSzhNARGbCuoZJETHvZ9PAcHxmlj8NYmZKAp+rviYxEWo+jwHRGBIZ63puI/3ndFMJzL+MySYFJOlsUpgJDjCcp4D5XjIIYG0Ko4uZWTIdEEQomq7IJoTH/8iJxT+oXdefmtNq8LNIooQN0iGqogc5QE12jFnIRRSl6Rq/ozXqyXqx362PWumQVM/voD6zPHzskkwA=</latexit>
<latexit sha1_base64="4RIjWZuJGduiqwERXT/98DNcXVI=">AAAB9HicbVBNS8NAEJ34WetX1aOXYBE8lUQE9Vb04rGCsYU2LZvtpl262YTdiVpC/ocXDype/THe/Ddu2xy09cHA470ZZuYFieAaHefbWlpeWV1bL22UN7e2d3Yre/v3Ok4VZR6NRaxaAdFMcMk85ChYK1GMRIFgzWB0PfGbD0xpHss7HCfMj8hA8pBTgkbqdoYEs6e8lwU872KvUnVqzhT2InELUoUCjV7lq9OPaRoxiVQQrduuk6CfEYWcCpaXO6lmCaEjMmBtQyWJmPaz6dW5fWyUvh3GypREe6r+nshIpPU4CkxnRHCo572J+J/XTjG88DMukxSZpLNFYSpsjO1JBHafK0ZRjA0hVHFzq02HRBGKJqiyCcGdf3mReKe1y5pze1atXxVplOAQjuAEXDiHOtxAAzygoOAZXuHNerRerHfrY9a6ZBUzB/AH1ucPs46S1g==</latexit>
<latexit sha1_base64="5b0QXg/G91xZcrXEbepDHLIeKB8=">AAAB9HicbVBNS8NAEJ34WetX1aOXYBE8lUQE9Vb04rGCsYU2LZvtpl262YTdiVpC/ocXDype/THe/Ddu2xy09cHA470ZZuYFieAaHefbWlpeWV1bL22UN7e2d3Yre/v3Ok4VZR6NRaxaAdFMcMk85ChYK1GMRIFgzWB0PfGbD0xpHss7HCfMj8hA8pBTgkbqdoYEs6e8l1Ged7FXqTo1Zwp7kbgFqUKBRq/y1enHNI2YRCqI1m3XSdDPiEJOBcvLnVSzhNARGbC2oZJETPvZ9OrcPjZK3w5jZUqiPVV/T2Qk0nocBaYzIjjU895E/M9rpxhe+BmXSYpM0tmiMBU2xvYkArvPFaMoxoYQqri51aZDoghFE1TZhODOv7xIvNPaZc25PavWr4o0SnAIR3ACLpxDHW6gAR5QUPAMr/BmPVov1rv1MWtdsoqZA/gD6/MHtRWS1w==</latexit>
(t) Dense Dense LSTM

hi
(t) hi hib
(t) (t)
<latexit sha1_base64="5MKYT0nvMPELx1updYI3XL6Xd0M=">AAAB73icbVBNS8NAEJ34WetX1aOXYBHqpaQiqLeiF48VjK20sWy2m3bp7ibsToQS+iu8eFDx6t/x5r9x2+agrQ8GHu/NMDMvTAQ36HnfztLyyuraemGjuLm1vbNb2tu/N3GqKfNpLGLdColhgivmI0fBWolmRIaCNcPh9cRvPjFteKzucJSwQJK+4hGnBK30MOjyx6yCJ+NuqexVvSncRVLLSRlyNLqlr04vpqlkCqkgxrRrXoJBRjRyKti42EkNSwgdkj5rW6qIZCbIpgeP3WOr9Nwo1rYUulP190RGpDEjGdpOSXBg5r2J+J/XTjG6CDKukhSZorNFUSpcjN3J926Pa0ZRjCwhVHN7q0sHRBOKNqOiDaE2//Ii8U+rl1Xv9qxcv8rTKMAhHEEFanAOdbiBBvhAQcIzvMKbo50X5935mLUuOfnMAfyB8/kDznuP8g==</latexit>
<latexit sha1_base64="5MKYT0nvMPELx1updYI3XL6Xd0M=">AAAB73icbVBNS8NAEJ34WetX1aOXYBHqpaQiqLeiF48VjK20sWy2m3bp7ibsToQS+iu8eFDx6t/x5r9x2+agrQ8GHu/NMDMvTAQ36HnfztLyyuraemGjuLm1vbNb2tu/N3GqKfNpLGLdColhgivmI0fBWolmRIaCNcPh9cRvPjFteKzucJSwQJK+4hGnBK30MOjyx6yCJ+NuqexVvSncRVLLSRlyNLqlr04vpqlkCqkgxrRrXoJBRjRyKti42EkNSwgdkj5rW6qIZCbIpgeP3WOr9Nwo1rYUulP190RGpDEjGdpOSXBg5r2J+J/XTjG6CDKukhSZorNFUSpcjN3J926Pa0ZRjCwhVHN7q0sHRBOKNqOiDaE2//Ii8U+rl1Xv9qxcv8rTKMAhHEEFanAOdbiBBvhAQcIzvMKbo50X5935mLUuOfnMAfyB8/kDznuP8g==</latexit>
<latexit sha1_base64="gL0YGuKKDAeqwI4pwGVjyfWQx1A=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69BItQLyURQb0VvXisYGyhjWWz3bRLN5uwOxFKyN/w4kHFq7/Gm//GbZuDtj4YeLw3w8y8IBFco+N8W6WV1bX1jfJmZWt7Z3evun/woONUUebRWMSqExDNBJfMQ46CdRLFSBQI1g7GN1O//cSU5rG8x0nC/IgMJQ85JWik3qif8SB/zOp4mverNafhzGAvE7cgNSjQ6le/eoOYphGTSAXRuus6CfoZUcipYHmll2qWEDomQ9Y1VJKIaT+b3ZzbJ0YZ2GGsTEm0Z+rviYxEWk+iwHRGBEd60ZuK/3ndFMNLP+MySZFJOl8UpsLG2J4GYA+4YhTFxBBCFTe32nREFKFoYqqYENzFl5eJd9a4ajh357XmdZFGGY7gGOrgwgU04RZa4AGFBJ7hFd6s1Hqx3q2PeWvJKmYO4Q+szx9Vq5Fq</latexit>
hic
LSTM
<latexit sha1_base64="B0E2KD7w2hjJs1UpnGrCZrmjvFw=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69BItQLyURQb0VvXisYGyhjWWz3bRLN5uwOxFKyN/w4kHFq7/Gm//GbZuDtj4YeLw3w8y8IBFco+N8W6WV1bX1jfJmZWt7Z3evun/woONUUebRWMSqExDNBJfMQ46CdRLFSBQI1g7GN1O//cSU5rG8x0nC/IgMJQ85JWik3qifcZo/ZnU8zfvVmtNwZrCXiVuQGhRo9atfvUFM04hJpIJo3XWdBP2MKORUsLzSSzVLCB2TIesaKknEtJ/Nbs7tE6MM7DBWpiTaM/X3REYirSdRYDojgiO96E3F/7xuiuGln3GZpMgknS8KU2FjbE8DsAdcMYpiYgihiptbbToiilA0MVVMCO7iy8vEO2tcNZy781rzukijDEdwDHVw4QKacAst8IBCAs/wCm9War1Y79bHvLVkFTOH8AfW5w9XNZFr</latexit>
LSTM (t 1)
LSTM b <latexit sha1_base64="fLcFuau32Ix18DXh1X/nqwYgiJk=">AAAB53icbVBNS8NAEJ34WetX1aOXxSJ4KokI6q3oxWMLxhbaUDbbSbt2swm7G6GE/gIvHlS8+pe8+W/ctjlo64OBx3szzMwLU8G1cd1vZ2V1bX1js7RV3t7Z3duvHBw+6CRTDH2WiES1Q6pRcIm+4UZgO1VI41BgKxzdTv3WEyrNE3lvxikGMR1IHnFGjZWaYa9SdWvuDGSZeAWpQoFGr/LV7Scsi1EaJqjWHc9NTZBTZTgTOCl3M40pZSM6wI6lksaog3x26IScWqVPokTZkobM1N8TOY21Hseh7YypGepFbyr+53UyE10FOZdpZlCy+aIoE8QkZPo16XOFzIixJZQpbm8lbEgVZcZmU7YheIsvLxP/vHZd85oX1fpNkUYJjuEEzsCDS6jDHTTABwYIz/AKb86j8+K8Ox/z1hWnmDmCP3A+fwAzXYy6</latexit>
<latexit
LSTM c <latexit sha1_base64="dH/5ouAzGwbHNs4b529AlentnSk=">AAAB53icbVBNS8NAEJ34WetX1aOXxSJ4KokI6q3oxWMLxhbaUDbbSbt2swm7G6GE/gIvHlS8+pe8+W/ctjlo64OBx3szzMwLU8G1cd1vZ2V1bX1js7RV3t7Z3duvHBw+6CRTDH2WiES1Q6pRcIm+4UZgO1VI41BgKxzdTv3WEyrNE3lvxikGMR1IHnFGjZWarFepujV3BrJMvIJUoUCjV/nq9hOWxSgNE1TrjuemJsipMpwJnJS7mcaUshEdYMdSSWPUQT47dEJOrdInUaJsSUNm6u+JnMZaj+PQdsbUDPWiNxX/8zqZia6CnMs0MyjZfFGUCWISMv2a9LlCZsTYEsoUt7cSNqSKMmOzKdsQvMWXl4l/Xruuec2Lav2mSKMEx3ACZ+DBJdThDhrgAwOEZ3iFN+fReXHenY9564pTzBzBHzifPzTgjLs=</latexit>
<latexit
zi (t 1)
(t 1) <latexit sha1_base64="Eo3+Lhr45MP4p/vSCoQhqEnkVKI=">AAAB8XicbVBNSwMxEM3Wr1q/qh69BItQD5ZdEdRb0YvHCq4ttGvJptk2NJssyaxQl/4MLx5UvPpvvPlvTNs9aOuDgcd7M8zMCxPBDbjut1NYWl5ZXSuulzY2t7Z3yrt790almjKfKqF0KySGCS6ZDxwEayWakTgUrBkOryd+85Fpw5W8g1HCgpj0JY84JWCl9lOXP2RVOPGOx91yxa25U+BF4uWkgnI0uuWvTk/RNGYSqCDGtD03gSAjGjgVbFzqpIYlhA5Jn7UtlSRmJsimJ4/xkVV6OFLalgQ8VX9PZCQ2ZhSHtjMmMDDz3kT8z2unEF0EGZdJCkzS2aIoFRgUnvyPe1wzCmJkCaGa21sxHRBNKNiUSjYEb/7lReKf1i5r7u1ZpX6Vp1FEB+gQVZGHzlEd3aAG8hFFCj2jV/TmgPPivDsfs9aCk8/soz9wPn8AyKmQdg==</latexit>
zi
xi (t 1) (t 1) (t 1)
ci
<latexit sha1_base64="Eo3+Lhr45MP4p/vSCoQhqEnkVKI=">AAAB8XicbVBNSwMxEM3Wr1q/qh69BItQD5ZdEdRb0YvHCq4ttGvJptk2NJssyaxQl/4MLx5UvPpvvPlvTNs9aOuDgcd7M8zMCxPBDbjut1NYWl5ZXSuulzY2t7Z3yrt790almjKfKqF0KySGCS6ZDxwEayWakTgUrBkOryd+85Fpw5W8g1HCgpj0JY84JWCl9lOXP2RVOPGOx91yxa25U+BF4uWkgnI0uuWvTk/RNGYSqCDGtD03gSAjGjgVbFzqpIYlhA5Jn7UtlSRmJsimJ4/xkVV6OFLalgQ8VX9PZCQ2ZhSHtjMmMDDz3kT8z2unEF0EGZdJCkzS2aIoFRgUnvyPe1wzCmJkCaGa21sxHRBNKNiUSjYEb/7lReKf1i5r7u1ZpX6Vp1FEB+gQVZGHzlEd3aAG8hFFCj2jV/TmgPPivDsfs9aCk8/soz9wPn8AyKmQdg==</latexit>
xi <latexit sha1_base64="RS/sOSFDP3IpuT+VMRM+0GqPjSM=">AAAB8XicbVBNS8NAEN34WetX1aOXxSLUgyURQb0VvXisYGwhjWWz3bRLN7thdyKU0J/hxYOKV/+NN/+N2zYHbX0w8Hhvhpl5USq4Adf9dpaWV1bX1ksb5c2t7Z3dyt7+g1GZpsynSijdjohhgkvmAwfB2qlmJIkEa0XDm4nfemLacCXvYZSyMCF9yWNOCVgpoF3+mNfg1DsZdytVt+5OgReJV5AqKtDsVr46PUWzhEmgghgTeG4KYU40cCrYuNzJDEsJHZI+CyyVJGEmzKcnj/GxVXo4VtqWBDxVf0/kJDFmlES2MyEwMPPeRPzPCzKIL8OcyzQDJulsUZwJDApP/sc9rhkFMbKEUM3trZgOiCYUbEplG4I3//Ii8c/qV3X37rzauC7SKKFDdIRqyEMXqIFuURP5iCKFntErenPAeXHenY9Z65JTzBygP3A+fwCk/pBf</latexit>
xi ci
(t 1)
<latexit sha1_base64="RS/sOSFDP3IpuT+VMRM+0GqPjSM=">AAAB8XicbVBNS8NAEN34WetX1aOXxSLUgyURQb0VvXisYGwhjWWz3bRLN7thdyKU0J/hxYOKV/+NN/+N2zYHbX0w8Hhvhpl5USq4Adf9dpaWV1bX1ksb5c2t7Z3dyt7+g1GZpsynSijdjohhgkvmAwfB2qlmJIkEa0XDm4nfemLacCXvYZSyMCF9yWNOCVgpoF3+mNfg1DsZdytVt+5OgReJV5AqKtDsVr46PUWzhEmgghgTeG4KYU40cCrYuNzJDEsJHZI+CyyVJGEmzKcnj/GxVXo4VtqWBDxVf0/kJDFmlES2MyEwMPPeRPzPCzKIL8OcyzQDJulsUZwJDApP/sc9rhkFMbKEUM3trZgOiCYUbEplG4I3//Ii8c/qV3X37rzauC7SKKFDdIRqyEMXqIFuURP5iCKFntErenPAeXHenY9Z65JTzBygP3A+fwCk/pBf</latexit>
Embedding
Embedding Embedding Embedding Embedding
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1
(a) RNN LM (b) Concatenate Context (c) MCE
Figure 1: Representation of the three models Discussed. MCE is a mixture of the RNN LM and the version
that concatenates the two.
(t−1) (t−1) (t−1)

zi = [xi ; ci ], a concatenation of the input outputs. MCE aims to capture different sources of
and the context vectors. As illustrated in Figure 1b, language variation via different experts. Intuitively,
the hidden state is altered from equation 1 to based on the current word and sequence history, the
(t)

(t−1) (t−1)
gating network learns to place more attention on spe-
hi = f W ih zi + W hh hi + bh cific contexts.

(t−1) (t−1) (t−1)
= f Wiih xi +Wcih ci +W hh hi +bh Our model has the flexibility to dynamically de-
termine the importance and effect for each context
However, as the context does not change within a
type, given a particular input word and set of con-
sentence (e.g. the identity of a speaker), the context
(t−1) texts. In other words, it allows the prediction of
contribution Wcih ci essentially acts a context bias some input words to be context-specific, while for
term. Because this bias is invariant to input word and others it remains generic. Furthermore, because it lin-
position in the sequence, the overall effect of the con- early combines output probability distributions, it is
text is relatively inflexible. This effect is lessened, intepretable and facilitates language analysis. There-
but not eradicated, when using LSTMs, which con- fore, it can identify particular types of words that are
tain an intrinsic ability to choose the importance of more context dependent, or find specific words that
the input. However, this process is implicit, and the carry different meaning in one context versus others.
model may be better served with an explicit ability
to control the influence of the context bias. In addi- We will now introduce the separate components of
tion, the nonlinear and recurrent nature of the LSTM MCE. Recall that though the following components
makes it difficult to determine exactly how the model use LSTMs, competing networks such as GRUs and
is using contextual information. QRNN may also be used in our architecture.
3 Mixture of Contextual Experts

3.1 The Background Model
To better represent context, we introduce our mix-
ture of contextual experts (MCE) model,1 based on MCE includes an LSTM layer whose input consists
two key ideas: attention (Bahdanau et al., 2015) and only of the previous word, with no context informa-
mixture-of-experts (Jacobs et al., 1991). Illustrated tion. This part of the model is essentially a simple
in Figure 1c, the model consists of an expert for each RNN LM and can be thought as a generic language
context and a context-independent background ex- model which allows contexts with limited data to uti-
pert, as well as a gating network that synthesizes the lize information from the entire dataset. We denote
(t) (1:t−1)
1
Our code will be made available upon publication. the prediction of this model as pb (xi |xi ).
3.2 The Expert (Context) Models as part of a selector; instead, we only use the gating
Our model also contains an LSTM targeted at network’s outputs as linear mixing weights. Inspired
each unique context; we denote its predictions as by the attention mechanism (Bahdanau et al., 2015),
(t) (1:t−1) we argue that the output word is better represented
pc (xi |xi ; ci ). This part of our model is iden-
as a combination of the background model and the
tical to the architecture of concatenation models de-
context models with varying attention placed on each
scribed in Section 2.2. However, allowing the model
model. Because each expert receives different input,
to leverage both the background and the concatena-
the experts avoid being too correlated. The α values
tion predictions enables the concatenation models to
of the mixing weights in combination with the differ-
be more bold and context-specific in their predictions.
ent expert input are what allows us to interpret the
In concatenated context models, the importance of
behavior of the model.
context is an average of how much context matters
The model output is given by a linear interpolation
over all inputs. Hence, if some words are not spe-
of the context and background model predictions:
cialized, the model has to lower the contribution of
the context part overall. MCE has more flexibility (t) (1:t−1) X (t) (1:t−1)
p(xi |xi )= αc pc (xi |xi ; ci ) (3)
to model the varying effects of context based on sen- i∈C
tence history and current input. X (t) (1:t−1)
This advantage is reinforced when multiple + (1 − αc )pb (xi |xi )
i∈C
sources of context are included. A standard con-
catenation model has limited ability to increase or 4 Data
reduce the influence of one context at the expense
of the other dynamically. MCE can explicitly in- We choose two datasets that contain both user identity
crease or decrease the influence of a user language and category identity to test the models’ ability to
model relative to the community (e.g. subreddit) and incorporate two functionally different contexts.
background language models. In order to achieve a Reddit: We selected first-level Reddit posts from
fair comparison, we keep the number of hidden units the year 2016 and filtered out subreddits and users
the same across models (i.e. the size of the context with fewer than 50000 subscribers and 1000 posts re-
and background models is equal to or smaller than spectively. To maximize user-subreddit interactions,
baselines). we calculate the h-index of each user and select users
with an h-index of at least 20 – i.e. they have posted
3.3 The Gating Network
at least 20 times in at least 20 subreddits (Jaech et
The gating network synthesizes the model compo- al., 2015). The vocabulary includes the 50,000 most
nents by linearly mixing their predictive distributions. common words, and the rest are replaced with an
The mixing weights α are generated from an LSTM unknown token: unk. Sentences begin with a start
layer, which takes the previous word and context token and have a maximum sentence length of 34
embeddings as inputs. A softmax layer ensures the words (which covers more than 90% of sentences).
weights sum to one. Sentences with more than 34 words are truncated,

(t−1) (t−1)
and those with fewer are zero padded. Such padding
αt = softmax LSTM xi ; ci (2)
is ignored during training and evaluation.
From this data, we select two subsets of subreddits
We choose an LSTM (instead of a feed forward layer)
in order to make the analysis more comprehensive.
in order to output mixing weights based both on the
The first is a selection of ten popular subreddits with
current information and the history.
similar numbers of posts, referred to as 10sr. We also
One common technique to train MoE model well
select an additional subset of 200 subreddits, each of
is to use the gating network’s output as probabilites
which contains 13K-20K posts, referred to as 200sr.2
into a stochastic selector (Jacobs et al., 1991; Shazeer
This is chosen so that the number of subreddits is
et al., 2017). This helps decorrelate the experts as
large, with similar number of posts per context.
the same gradient no longer flows into each expert.
2
However, we do not use the gating network’s output Datasets will be made available upon publication.
10sr 200sr Yelp embedding. We determine the size of the context em-
bedding independently by passing the context-word
Sequences 2.8M 3.6M 0.9M
distributions through an autoencoder and identifying
Categories 10 200 77 the minimum size that maintains 90% of the variance
Users 9647 10159 192 (resulting in Table 1). MCE uses the same context
Category Emb Size 6 13 9 embedding size; however, since it has two LSTM
User Emb Size 30 50 11 layers, each of them is limited to 512 neurons for fair-
ness (340 in the two context case). In addition, MCE
Table 1: The sizes of datasets, along with the selected has an additional LSTM for the gating network; the
sizes of the embedding used for each context. size of the gating network is limited to a maximum
of 50 neurons (optimized on the validation set).
Each model contains a max-norm constraint at the
Yelp: We also use a dataset from the Yelp Dataset output layer, to reduce the effect of frequent words
Challenge, which contains a subset of business re- being updated more often, a common optimization
views and user data made public for academic use pitfall. The word embedding layer is initialized with
by Yelp.3 We consider the reviews that belong the 300-dimensional pre-trained word embeddings from
Restaurants category and maintain only users with Common Crawl.4 In MCE models, the embeddings
an h-index of at least 15. As each restaurant corre- and output layer (transpose of embeddings) are ini-
sponds to multiple categories, we keep only the most tialized using context-free and concat models.
popular category for each restaurant. After prepro- We create training, validation and testing data us-
cessing, the dataset consists of 68,263 reviews and ing 60/20/20 split. All models are optimized using
892,646 sentences, with 192 unique users and 77 cat- Adam (Kingma and Ba, 2015). For reasons of compu-
egories in total. The top 20,000 most frequent words tational efficiency, the hyperparameters for the each
are kept in the vocabulary, and the rest are replaced model are optimized on their respective validation
with an unk token. Sentences are truncated in the sets, using a 20% random sample of Reddit training
same manner as in the Reddit data. data and all Yelp training data. After identifying the
best learning rate for each dataset-model combina-
5 Experiments and Results tion, each model is trained until the validation loss
In this section we will evaluate the language model- stops improving.
ing abilities of our proposed architecture on multiple
datasets, and observe the ways in which context af- 5.2 Language Model Performance
fects token-specific word distributions. We first evaluate the performance of our models and
baselines using average test set perplexity. We dis-
5.1 Baselines and Implementation Details tinguish between the context used (category, user,
We compare performance against three baselines: category + user) and the models that utilize no con-
1. Kneser-Ney smoothed 5-gram model as text at all in Table 2. Context information improves
in Mikolov et al. (2010), via SRI LM (Stolcke, performance significantly over the baseline simple
2002). RNN. Additionally, there is another improvement of
2. Context-free RNN, an LSTM language model 6-14% when using our method to incorporate context
with only the word information as input. information versus simple concatenation. We also
3. Concat Context: LSTM language model with find that user information contributes more in terms
the word and context concatenation as input. of performance than category information. However,
We use three types of context: user identity, category using both contexts in MCE is often the best option.
identity (subreddit or restaurant type), and a combi- These results indicate that the flexibility of the
nation of both. Each of the baseline models contains MCE results in a more accurate language model. Re-
an LSTM layer of size 1024, with additional context call that the context contribution within a sentence
3 4
https://www.yelp.com/dataset/ https://nlp.stanford.edu/projects/glove/
10sr 200sr Yelp
5-gram 96.66 92.31 76.84 <S>
Context-free RNN 82.07 73.41 46.78
What
Concat Category 75.38 65.76 45.46 Why
MCE Category 65.39 60.37 40.59 Well
.
Improvement 13.25% 8.19% 10.70%
Concat User 73.41 63.71 43.15
MCE User 64.65 60.08 39.64
Improvement 11.93% 5.71% 8.13%
Concat Both 73.72 66.90 42.61
MCE Both 63.32 59.22 39.79
Figure 2: Distribution of α values for the subreddit
Improvement 14.10% 11.47% 6.63%
model, in the 10sr dataset. We include some of the
Table 2: Perplexity for all datasets and contexts example input words, i.e. these α determine how
(lower is better). Improvement is measured for our much the context is relevant for the next word.
model (MCE) against the most competitive baseline
for the same context information. lated, as both experts are language models trained
to predict the next word, so the effective scale is be-
σ corr tween 0 − 1. Despite the fact that the model compo-
nents are initialized from independently trained mod-
10sr 0.15 0.30
els (which should produce correlated predictions),
MCE Category 200sr 0.14 0.20
we observe that the experts in the model have low
yelp 0.13 0.11
correlation. This suggests that the two experts in
10sr 0.15 0.42 MCE have managed to concentrate on different parts
MCE User 200sr 0.14 0.17 of the input space.
yelp 0.14 0.09 To understand how the context actually affects the
language modeling, in the following sections, we will
Table 3: Standard deviation of α within a sentence, explore and analyze the α values for a random subset
and correlation of predictions between the experts. of 100k sentences for the case of a single context.
5.3 Input Separability

does not change in the concatenation model. In con- Every word, when passed through the MCE gating
trast, MCE can assign different mixing weights for network, determines how much emphasis should be
each word in the sentence. The first column of Ta- given to the context-aware model’s prediction for the
ble 3 shows the average standard deviation of α’s next word. Therefore, a high value of α indicates that
within a sentence, roughly around 0.15 for all models the next word prediction is highly context-dependent.
and datasets. The flexibility of MCE to weigh the This, however, does not indicate that a low α is neces-
context for different words in the sentence differently sarily context-independent. A word that appears only
likely contributes to its improved performance. in a single context, for example, is context-specific,
We also investigate the correlation of predictions but it may be modeled in the background part of our
of the two experts in our model. We measure the model. Rather, a high value of α suggests that the
Pearson correlation of each word in the vocabulary model has learned to softly separate a word usage
independently over a corpus of input sentences and within a context from its normal background usage.
contexts. We report the average (over the vocabulary) Likewise, a low value suggests that the model has de-
of those correlations the second column of Table 3. termined similar usage of a word within that context
There are very few words that are negatively corre- and the background usage. Figure 2 shows a his-
0.6 0.6 ables us to identify words that behave very differently
0.45 0.45 based on the context. Such words include “rule" and
0.3 0.3 “message," whose average α is shown for some sub-
0.15 0.15
reddits in Figure 3a and 3b. While their semantic
meaning is constant across subreddits, examining the
0 0
ELI5 gaming nba nfl ELI5 gaming news politics input data reveals that these words behave differently
(a) rule (b) message in ELI55 and politics respectively. In particular,
0.6 0.8 “rule" is mentioned 2400 times in ELI5 and 700 in all
0.45 0.6 other subreddits combined. In ELI5 it is frequently
0.3 0.4
followed by the tokens “7," “9," “2," and “6," which
is an artifact of the community: moderators repeat-
0.15 0.2
edly asking users to follow the respective rule.
0 0
news nba nfl politics AskHist ELI5 nba politics Similarly in politics, “message" is mentioned
(c) Washington (d) Great 2000 times versus 600 in all others. This occurs be-
cause posts are often deleted or modified with a note
Figure 3: Average α values per subreddit for input to “message the moderators" for inquiries. Hence,
words with high variance, indicating the effect of the 79% of the time “message" is followed by “the" in
context on the next word. this context. Due to its frequency, this usage dom-
inates the background model. However, the usage
of “message" is different in other subreddits. Conse-
togram of α’s from the MCE Category model from quently, the model relies the context model more to
the 10sr dataset. The α values are quite spread out, predict what comes after “message."
with a mean of 0.57. Looking at more words with high variance, we
The spike (around 0.8) stems from the start sen- observe words that are not just an artifact of extreme
tence token, which appears in every sentence. In edge cases. For instance, we note this bimodal distri-
addition, we observe that the words that, as input, bution for the word “Washington” (Figure 3c), which
have the highest α values on average are capitalized is intuitive since in politics and news it is primar-
and typically start a sentence, such as What and Why. ily followed by words such as “Post”, “Times”, and
These results suggest that our model predicts the ini- “DC”. This is different in nba (“Wizards”, “Gener-
tial part of sentences based on the context. This is als”) and nfl (“Redskins”).6 We observe a similar
intuitive as without a sentence history, the model uti- trend for “Great” (Figure 3d). In the context of sports
lizes the subreddit information to generate sentences and games (nba as well as nfl and gaming), it is
in the language of the subreddit. The additional spike often succeeded by “game,” “job,” or “play,” while in
(around 0.92) is created from a period (“.”), which is ELI5 and AskHistorians, it used as part of a noun
almost always followed by the sentence end (99% of and is followed by capitalized words such as “Bridge,”
the time) or closing quotation marks (∼ 1%), depen- “Leap,” “Britain,” “Depression,” etc.
deing on the subreddit. Upon closer examination we These results suggest that the corresponding α
observe that in a single subreddit, it is followed by values allow the model to correctly segment similar
a “)” 1% of the time, and thus the model effectively uses of “Great” away from dissimilar ones. More
splits the data to capture this better. generally, α values allow us insight into which words
are used similarly versus dissimilarly in different
5.4 Context Dependent Words contexts. If the model largely uses background model
While looking at words with a high average α can predictions for a word for multiple contexts, then that
explain scenarios where our model identifies a good word is being used similarly within those contexts.
split for the data, averaging can hide important dis- Understanding the relationships of individual words
crepancies between contexts. Instead, we measure 5
Explainlikeimfive: a subreddit used for simple explanations
the average α of words per subreddit and examine of sometimes complex topics.
6
those with high variance over subreddits. This en- These are sports team names.
0.8 5.6 Sentence Generation
An additional advantage of using context in language
0.6 models is that it can explicitly generate sentences for
a particular context. Figure 5 shows some generated
sentences from the concat model (left panel) and
0.4 MCE (right panel). The first set of sentences was
created without any leading words, while the next are
0.2 seeded with the phrases “I like" and “This is what
happens." For fairness, we generated 3 sentences of
each type and picked the best two for all models.
0 The models were trained on the 200sr dataset and
S
N
P
JJ
VB
(
TO
C
)
use the Patriots 8 and sweden subreddits as contexts.
JJ
VB
N
VB
C
N
One notable difference between the two models is

Figure 4: Average α per POS tag for the Yelp Cat- that some sentences in sweden are in Swedish, but
egory Dataset. Nouns and Verbs are more likely to when the sentence starts with English, our model
come from the context component. continues the trend while concat does not. This is
possibly due to the flexibility of giving emphasis to
subreddit vocabulary based on the context as well as
to the various contexts is a particular benefit of the past words rather than as a constant for the entire con-
MCE model that standard models cannot offer. text. The sentences from our model are colored based
on how much emphasis was given to the context part
5.5 Context Dependent POS tags of the model to generate each sentence.
We go a little deeper in understanding what kinds 6 Related Work

of words are often context-dependent and which are Past work on reddit language has explored several
not. We use NLTK7 to perform part-of-speech (POS) sources of language variation, examining how, for
tagging and analyze the average α per POS tag. Fig- example, reddit and subreddit norms affect user lan-
ure 4 illustrates the average α value on the output guage. Tan and Lee (2015) found that the language
word for the Yelp categories dataset. For example, an of reddit users shifts closer to subreddit norms as the
average high α in nouns (NN) means that the model they spend more time on the website, and Danescu-
gave more emphasis to the context part of the model Niculescu-Mizil et al. (2013) illustrated how adapta-
when the prediction was a noun. tion to community language norms can indicate level
We observe that meaning-rich POS tags such as of engagement. Jaech et al. (2015) attempted to rank
adjectives (JJS, JJ) and nouns (NN, NNS, NNP) have comments and found that subreddit identity modu-
the highest α values, followed closely by verbs (VB, lates the importance of language and non-language
VBN, VBP). This suggests that the choice of de- elements. Tran and Ostendorf (2016) compared topic
scriptive language such as nouns is more context- and language style in order to predict subreddit iden-
dependent, which is intuitive as they would differ tity from posts, concluding that a combination of
more across categories. In contrast, we find that struc- both produces the best results. Though these works
tural language such as parentheses, the word “to," do not utilize the learned context information to build
and coordinating conjunctions (CC) such as “but" predictive language models, they suggest that user
and “and" have consistently smaller α values. These and subreddit identity are key sources of variation
words are used more similarly among all contexts, relative to overall social media language norms.
and MCE consequently uses the background model For language modeling, Mikolov et al. (2010) are
more in order to model their behavior. the first to use RNNs to predict language, finding that
a linear interpolation of several RNN LMs provided
7 8
https://www.nltk.org/ American football team.
Concat MCE
Good reason why is this teams being posted and doing something
drowns in the ball and Brady is like this again . diﬀerent , does n't even seem like a real sign !
Adrian <unk> makes my balls flexible . I saw you attempted to realize that no play against the Pats .
Patriots
I like Scully <unk> <unk> . I like how he could endorse SB to win writer and articulate this
team with reasons I think .
I like with Rory and <unk> winning the season
I like how Brady 's leverage for the press
This is what happens HYPE <unk> This is what happens , until the summer of ours that make the playoﬀs
turns in a corner , and the Shane fans are receiving escape games .
This is what happens !
This is what happens when you score in an angle .
I 've moved up to work in 7th grade 1 and I 've been going to age of light bars . Nothing that will be rewritten , /r/sweden .
( Wales och finns ) , it 's near a large group , with <unk> marriages , and I Det var som sport eller inte -unk- .
believe the entire world , but an age is a soulless <unk> 's disappearing circle
sweden
I like hans <unk> i <unk> i <unk> <unk> . I like Sanderson 's humor and work the podcast says that the police are
bringing the Islamic -unk- you wondered how we managed to laugh at them […]
I like mitt i <unk> om <unk> <unk> i <unk> andra <unk> man <unk> min du
inte i 'm ? I like strange Blue Sea though .
This is what happens when <unk> att <unk> <unk> de har <unk> <unk> man This is what happens when someone first Montreal this Way guys came out
inte och <unk> som <unk> , dom <unk> Situation vad <unk> och <unk> […] of their -unk- Sweden and McDonalds declined
This is what happens when det <unk> i <unk> <unk> . This is what happens when the fish comes to your world .
Figure 5: Sentences from Concat (left) and MCE (right) models, using the subreddit as context. Top panel
is using the subreddit Patriots while the bottom is sweden. Words highlighted in green indicate that more
emphasis was given to the context part of the model during generation.
the best performance boost over a backoff 5-gram generating each token of its translation. Likewise,
model; Sundermeyer et al. (2012) show better results we look to build a model that can pay attention to the
could be achieved with Long Short-Term Memory contexts based on the specific input word.
networks (LSTMs) (Hochreiter and Schmidhuber,
1997). Mikolov and Zweig (2012) improve perfor- We also draw inspiration from mixture-of-expert
mance with LDA-based pre-trained topic models of (MoE) models (Jacobs et al., 1991; Jordan and Ja-
the entire document as context to an RNN LM. Those cobs, 1994). MoE is an ensemble method of im-
representations are concatenated into both the input proving performance by blending together perfor-
and hidden layers, in an effort to achieve better docu- mances of different “experts.” In MoE models, a
ment perplexity. Sordoni et al. (2015) expand on this gating network learns to mix the predictions of vari-
work by setting the context vector to be the output ous experts. The gating network can be thought of as
of a feed forward neural network, allowing the entire analogous to the attention network from Bahdanau
model to be trained end-to-end. Other papers have et al. (2015). The original motivation of MoE in Ja-
since incorporated context in language models for a cobs et al. (1991) was to build a gating network that
variety of tasks, including conversation (Vinyals and would help experts segment the input space into local
Le, 2015), email suggestion (Kannan et al., 2016), pieces, such that each could learn decoupled represen-
machine translation (Bahdanau et al., 2015), and im- tations. In other words, the experts are given the same
age captioning (Chen and Zitnick, 2015). However, inputs, but they work competitively rather than co-
concatenating context is inflexible because it assumes operatively to learn different parts of the input space.
the effect of the context is static throughout the sen- One challenge in MoE is that same gradient flows
tence. Therefore, we draw from the attention mecha- back to every expert, albeit with different weights,
nism used in neural machine translation (Bahdanau leading to correlated experts and a less expressive
et al., 2015), in which part of a network learns to overall model. To avoid experts learning correlated
place more importance on certain input tokens when responses, Jacobs et al. (1991) stochastically choose
a single expert to update. This approach is similarly
applied in Shazeer et al. (2017) as a way to lower modules for language modeling, such as QRNNs, and
computation costs in massively large networks. in other variations such as an unbalanced allocation
Similar to how our approach feeds each expert of hidden layers for each expert.
a different input vector, Garmash and Monz (2016)
also feed different inputs to each expert. For the
problem of Neural Machine Translation, they train References
a network end-to-end with two experts that take in Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.
different inputs (French or German language) but 2015. Neural machine translation by jointly learning to
have the same target output: the English translation. align and translate. In Proceedings of ICLR.
The gating network learns to interpolate the two mod- Jerome R Bellegarda. 2004. Statistical language model
els in one combined model. Because the inputs are adaptation: review and perspectives. Speech Communi-
different, the representations are inherently different, cation, 42(1):93 – 108. Adaptation Methods for Speech
Recognition.
and there’s no notion of the experts converging to
similar representations. This approach works well James Bradbury, Stephen Merity, Caiming Xiong, and
Richard Socher. 2017. Quasi-recurrent neural net-
but requires a trilingual dataset where multiple inputs
works. In Proceedings of ICLR.
map to the same output.
Ciprian Chelba, Mohammad Norouzi, and Samy Ben-
gio. 2017. N-gram language modeling using re-
7 Discussion current neural network estimation. arXiv preprint
arXiv:1703.10724.
In this paper, we present a novel method of incorpo- Xinlei Chen and Lawrence C. Zitnick. 2015. Mind’s eye:
rating context information in language models, based A recurrent visual representation for image caption
on the idea of mixture-of-experts. By allocating each generation. In Proceedings of CVPR.
context to an expert, our MCE model can dynam- Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk,
ically attend to different contexts as needed. This Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-
flexibility allows our model to capture richer interac- based models for speech recognition. In Proceedings
tions of input word, context identity or identities, and of NIPS.
input word sequence history. This advantage largely Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and
powers our MCE’s consistent perplexity gains over Yoshua Bengio. 2014. Empirical evaluation of gated
recurrent neural networks on sequence modeling. In
the concatenation model baselines. In addition, MCE
Proceeedings of NIPS Deep Learning Workshop.
linearly mixes the predictions of experts for each
Cristian Danescu-Niculescu-Mizil, Robert West, Dan Ju-
word, which allows us to analyze its behavior. Our rafsky, Jure Leskovec, and Christopher Potts. 2013. No
analyses reveal that the model exhibits desirable prop- country for old members: User lifecycle and linguis-
erties, such as increased influence of the context at tic change in online communities. In Proceedings of
the start of sentences and when predicting content WWW.
words such as nouns. Jeffrey L. Elman. 1990. Finding structure in time. Cogni-
The flexibility of the MCE framework opens up tive science, 14(2):179–211.
multiple possible avenues for future research. In Ekaterina Garmash and Christof Monz. 2016. Ensemble
particular, the value of the contextless background learning for multi-source neural machine translation.
model to the scheme should be investigated - there In Proceedings of COLING.
exists clear interpretive value in a background model, Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long
short-term memory. Neural computation, 9(8):1735–
but allocating its hidden units to the contextual mod-
1780.
els may yield better results. In addition, in this study,
Yu-Yang Huang, Rui Yan, Tsung-Ting Kuo, and Shou-De
context is limited to community and user identity, but
Lin. 2014. Enriching cold start personalized language
adding other forms of context is of interest. For in- model using social network information. In Proceed-
stance, MCE could benefit conversation modeling by ings of ACL.
including an expert that generates predictions based Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and
specifically on conversational history, in addition to a Geoffrey E Hinton. 1991. Adaptive mixtures of local
user expert. We are also interested in using different experts. Neural computation, 3(1):79–87.
Aaron Jaech, Victoria Zayats, Hao Fang, Mari Ostendorf, Alessandro Sordoni, Michel Galley, Michael Auli, Chris
and Hannaneh Hajishirzi. 2015. Talking to the crowd: Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun
What do people react to in online discussions? In Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural
Proceedings of EMNLP. network approach to context-sensitive generation of
Michael I Jordan and Robert A Jacobs. 1994. Hierarchi- conversational responses. In Proceedings of NAACL
cal mixtures of experts and the em algorithm. Neural HLT.
computation, 6(2):181–214. Andreas Stolcke. 2002. Srilm – an extensible language
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam modeling toolkit. In Proceedings of INTERSPEECH.
Shazeer, and Yonghui Wu. 2016. Exploring Martin Sundermeyer, Ralf Schlüter, and Hermann Ney.
the limits of language modeling. arXiv preprint 2012. Lstm neural networks for language modeling. In
arXiv:1602.02410. Proceedings of INTERSPEECH.
Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kauf- Chenhao Tan and Lillian Lee. 2015. All who wander: On
mann, Andrew Tomkins, Balint Miklos, Greg Corrado, the prevalence and characteristics of multi-community
László Lukács, Marina Ganea, Peter Young, et al. 2016. engagement. In Proceedings of WWW.
Smart reply: Automated response suggestion for email. Trang Tran and Mari Ostendorf. 2016. Characterizing
In Proceedings of ACM SIGKDD. the language of online communities and its relation to
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A community reception. In EMNLP.
method for stochastic optimization. In Proceedings of Oriol Vinyals and Quoc Le. 2015. A neural conversational
ICLR. model. In Proceedings of ICLR.
Hung-Yi Lee, Bo-Hsiang Tseng, Tsung-Hsien Wen, and Tian Wang and Kyunghyun Cho. 2016. Larger-context
Yu Tsao. 2017. Personalizing recurrent-neural- language modelling with recurrent neural network. In
network-based language model by social network. In Proceedings of ACL.
Proceedings of TASLP. Tsung-Hsien Wen, Milica Gasic, Nikola Mrkšić, Pei-Hao
Jiwei Li, Michel Galley, Chris Brockett, Georgios Sp- Su, David Vandyke, and Steve Young. 2015. Seman-
ithourakis, Jianfeng Gao, and Bill Dolan. 2016. A tically conditioned lstm-based natural language gener-
persona-based neural conversation model. In Proceed- ation for spoken dialogue systems. In Proceedings of
ings of ACL. EMNLP.
Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On Gui-Rong Xue, Jie Han, Yong Yu, and Qiang Yang. 2009.
the state of the art of evaluation in neural language User language model for collaborative personalized
models. In Proceedings of ICLR. search. In Proceedings of TOIS.
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Seunghyun Yoon, Hyeongu Yun, Yuna Kim, Gyu-tae
2018. Regularizing and optimizing LSTM language Park, and Kyomin Jung. 2017. Efficient transfer
models. In Proceedings of ICLR. learning schemes for personalized language model-
Tomas Mikolov and Geoffrey Zweig. 2012. Context ing using recurrent neural network. arXiv preprint
dependent recurrent neural network language model. arXiv:1701.03578.
In Proceedings of SLT.
Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cer-
nocký, and Sanjeev Khudanpur. 2010. Recurrent neu-
ral network based language model. In Proceedings of
INTERSPEECH.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado,
and Jeffrey Dean. 2013. Distributed representations
of words and phrases and their compositionality. In
Proceedings of NIPS.
Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio,
Aaron Courville, and Joelle Pineau. 2016. Building
end-to-end dialogue systems using generative hierarchi-
cal neural network models. In Proceedings of AAAI.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz,
Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff
Dean. 2017. Outrageously large neural networks: The
sparsely-gated mixture-of-experts layer. In Proceed-
ings of ICLR.

Dynamic Mixtures of Contextual Experts For Language Modeling

Uploaded by

Copyright:

Available Formats

Dynamic Mixtures of Contextual Experts For Language Modeling

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Dynamic Mixtures of Contextual Experts For Language Modeling

Uploaded by

Copyright:

Available Formats

Dynamic Mixtures of Contextual Experts for Language Modeling

Abstract representations. However, standard neural network

(t) Dense Dense LSTM

(a) RNN LM (b) Concatenate Context (c) MCE

(t−1) (t−1) (t−1)

3 Mixture of Contextual Experts

5.3 Input Separability

One notable difference between the two models is

We go a little deeper in understanding what kinds 6 Related Work

You might also like