ProdLDA tutorial improvements #2794
base: dev
thanks @ahoho ! none of us are experts in NLP so it's great to have this tutorial see some attention from an NLP person. i hope your interest in it suggests that this can be a useful starting point for doing actual NLP modeling. some comments/suggestions:
I like the idea of evaluation. I just wonder if we need to normalize
This is a bit unfortunate. I guess as a tutorial, we don't need to achieve state-of-the-art results at the cost of being 4x slower. But it is fine as long as you think the result is good. You don't need to stick with the original ProdLDA implementation though. By the way, it is not clear to me that the word cloud shows a better result than the current one. Maybe my judgment is out of the mainstream (given "NPMI correlates with human judgments of topic quality") or a good metric is still under active research?
I don't have much opinion on this, but if you think having those additional bias parameters is good, it might be clearer to separate out the definition of BatchNorm into a function or a class, e.g.
```python
import torch
import torch.nn as nn

def BatchNorm1d(vocab_size, eps=0.001, momentum=0.001):
    bn = nn.BatchNorm1d(
        vocab_size, eps=eps, momentum=momentum, affine=True
    )
    # Fix the scale at 1 and exclude it from gradient updates; the bias stays trainable.
    bn.weight.data.copy_(torch.ones(vocab_size))
    bn.weight.requires_grad = False
    return bn
```
I also wonder if we use BatchNorm to collect running mean/variance and bias parameters, will we want to use that information in prediction/evaluation, in particular the unnormalized
This is inconsistent with the above cell.
Yeah, could you elaborate on the following sentence in the tutorial, "(Note, however, that PyTorch/Pyro have included support for reparameterizable gradients for the Dirichlet distribution since 2018).", with links to papers/references that use Dirichlet.rsample?
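For context on what "reparameterizable gradients for the Dirichlet" means in practice, here is a minimal sketch using `torch.distributions.Dirichlet`. The concentration values and the toy loss are placeholders chosen for illustration; nothing here is taken from the tutorial notebook.

```python
import torch
from torch.distributions import Dirichlet

# Hypothetical concentration parameters for a 3-topic model.
concentration = torch.tensor([0.5, 1.0, 2.0], requires_grad=True)

torch.manual_seed(0)
# rsample() draws a point on the simplex in a way that keeps the
# computation graph back to `concentration` intact.
theta = Dirichlet(concentration).rsample()

# Any differentiable function of the sample can be backpropagated
# through to the concentration parameters.
loss = (theta ** 2).sum()
loss.backward()

print(theta.sum().item())   # sums to 1 up to floating-point error
print(concentration.grad)   # gradient with respect to the concentration
```

Using `sample()` instead of `rsample()` would detach the draw from the graph, so `concentration.grad` would not be populated.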
Really appreciate the feedback! I do think this is a great starting point for neural topic models (in fact I'm surprised I didn't notice it before; pyro has proven great for experimentation with them). I'll make the fixes you both suggested, but to respond to your questions:
Since we turn off gradient updates for the BN scaling terms, the trace gives a warning about those variables. If there's a better way, I'll gladly update.
You do not, since the score just relies on an
I'll change this back, but make a note in the text that increasing it can help (or make it conditional on CUDA availability)
Yeah, this could be the result of (a) using less data (since we went from all data to just train), (b) problems with preprocessing, or (c) issues with the metric, which can sometimes be attributed to bad preprocessing, since that affects the words included in the estimates. I will play around with preprocessing to see if this helps; if memory serves I had better ProdLDA results with 20ng in the past. And yes, it's still under development; in fact, my own group is working on this now! That said, I still thought it made sense to have some sort of standard quantitative evaluation as a benchmark.
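For readers unfamiliar with the metric under discussion, here is a minimal sketch of NPMI topic coherence. It assumes document-level co-occurrence counts over a tokenized reference corpus and averages over all pairs of a topic's top words; the function name, the toy corpus, and the handling of the degenerate cases are choices made for this sketch, not the tutorial's implementation.

```python
import math
from itertools import combinations

def npmi_topic(top_words, documents):
    """Mean NPMI over all pairs of a topic's top words.

    Probabilities are estimated from document-level co-occurrence in
    `documents`, a list of token lists (an assumption of this sketch).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # convention: the pair never co-occurs
        elif p12 == 1:
            scores.append(0.0)   # sketch choice: 0/0 when both words are in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# Toy corpus, purely illustrative.
docs = [
    ["space", "nasa", "orbit"],
    ["space", "nasa"],
    ["hockey", "team"],
    ["hockey", "team", "game"],
]
coherent = npmi_topic(["space", "nasa"], docs)
incoherent = npmi_topic(["space", "hockey"], docs)
print(coherent, incoherent)
```

Because the estimates depend only on which words appear in which documents, preprocessing choices (vocabulary pruning, tokenization) change the score directly, which is point (c) above.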
Hm, good question. In the NTM model I've usually built on in the past, the output of the BN layer is annealed to zero over training (
As a funny aside, I've seen a couple of topic modeling papers in the past two years that just pretend the Pathwise Derivatives paper doesn't exist, then come up with other ways of estimating a Dirichlet-based neural model, all while it's been implemented in torch from day one.
You are right. The value is not used in the metric but used in
Yeah, agreed!
Looks reasonable to me. If you want to incorporate this, just make sure to switch
Handful of changes to the ProdLDA tutorial
affine=False, but rather scale=False: there is still a bias term. This improves the best NPMI from 0.276 to 0.355. See below:
While I didn't include it, I'll note that thanks to @fritzo and @martinjankowiak, we needn't rely on ProdLDA's Laplace approximation and can just use distributions.Dirichlet as a drop-in replacement. I've done so in a separate pyro reimplementation of the Dirichlet-VAE, and I have no reason to suspect it wouldn't work here. Of course, this would obviously deviate from ProdLDA.
nbviewer link
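As background on the Laplace approximation mentioned above: the ProdLDA paper replaces a Dirichlet prior with a logistic-normal whose mean and diagonal variance are computed from the concentration vector. The sketch below writes out that mapping as I understand it from the paper; the function name is hypothetical, and this is not necessarily what the tutorial notebook does with its prior.

```python
import math

def laplace_logistic_normal(alpha):
    """Mean and diagonal variance of the logistic-normal that
    approximates Dirichlet(alpha) in the softmax basis.

    mu_k    = log(alpha_k) - (1/K) * sum_i log(alpha_i)
    sigma2_k = (1/alpha_k) * (1 - 2/K) + (1/K^2) * sum_i (1/alpha_i)
    """
    K = len(alpha)
    log_alpha = [math.log(a) for a in alpha]
    mean_log = sum(log_alpha) / K
    inv_sum = sum(1.0 / a for a in alpha)
    mu = [la - mean_log for la in log_alpha]
    var = [(1.0 / a) * (1.0 - 2.0 / K) + inv_sum / K**2 for a in alpha]
    return mu, var

# Symmetric Dirichlet(1) over 20 topics (illustrative values).
mu, var = laplace_logistic_normal([1.0] * 20)
print(mu[0], var[0])
```

For a symmetric concentration the mean is zero in every coordinate and the variance reduces to (1/alpha)(1 - 1/K), which is why the approximating prior ends up close to a standard normal when alpha = 1. Sampling from `distributions.Dirichlet` directly skips this step entirely.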