ProdLDA tutorial improvements #2794

ahoho · 2021-04-05T21:54:26Z

Handful of changes to the ProdLDA tutorial

Calculate the NPMI of topics, a measurement of coherence, using the held-out 20 newsgroups test set. This allows us to estimate whether changes we make to the implementation yield any improvements, and accords with standard practice (NB: this change means that we use a smaller training set, since previously the model was trained on train+test). The parameters yielding the best NPMI are then used for the word clouds.
Remove headers and footers from the 20ng data, which helps with topic readability.
Increase the number of epochs to 200, following the original implementation. It still runs relatively quickly on CPU (<30 min), but perhaps it's worth having a flag that changes the number depending on whether CUDA is available.
Correct the batch norm terms to align with the original implementation. In particular, the original does not use affine=False, but rather scale=False: there is still a bias term. This improves the best NPMI from 0.276 to 0.355. See below:

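        # affine=True keeps the bias (shift) term; the scale is frozen at 1 below,
        # which mirrors scale=False in the original implementation.
        self.bn = nn.BatchNorm1d(
            vocab_size, eps=0.001, momentum=0.001, affine=True
        )
        self.bn.weight.data.copy_(torch.ones(vocab_size))
        self.bn.weight.requires_grad = False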
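For reference, here is a minimal sketch of the NPMI score mentioned in the first item above, estimated from document-level co-occurrence in a held-out set. It is not the notebook's code: the function name `topic_npmi`, the toy `held_out` matrix, and the handling of edge cases (pairs that never co-occur score -1, pairs that co-occur in every document are skipped) are assumptions made for illustration.

```python
import numpy as np


def topic_npmi(top_words, doc_term_binary):
    """Average NPMI over all pairs of a topic's top words.

    top_words: vocabulary indices of the topic's top-k words.
    doc_term_binary: (num_docs, vocab_size) 0/1 array built from held-out docs.
    """
    num_docs = doc_term_binary.shape[0]
    sub = doc_term_binary[:, top_words].astype(np.float64)
    joint = (sub.T @ sub) / num_docs  # P(w_i, w_j) from document co-occurrence
    marginal = np.diag(joint)         # P(w_i)
    scores = []
    for i in range(len(top_words)):
        for j in range(i + 1, len(top_words)):
            p_ij = joint[i, j]
            if p_ij == 0.0:
                scores.append(-1.0)   # assumed convention: never co-occur
            elif p_ij == 1.0:
                continue              # 0/0 case: pair appears in every document
            else:
                pmi = np.log(p_ij / (marginal[i] * marginal[j]))
                scores.append(pmi / -np.log(p_ij))
    return float(np.mean(scores)) if scores else 0.0


# Toy held-out set: words 0 and 1 always co-occur, words 0 and 2 never do.
held_out = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]])
always_together = topic_npmi([0, 1], held_out)
never_together = topic_npmi([0, 2], held_out)
```

In practice the top words for each topic would come from an argsort over the rows of the topic-word matrix, and the per-topic scores would be averaged.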

While I didn't include it, I'll note that thanks to @fritzo and @martinjankowiak, we needn't rely on ProdLDA's Laplace approximation and can just use distributions.Dirichlet as a drop-in replacement. I've done so in a separate Pyro reimplementation of the Dirichlet-VAE, and I have no reason to suspect it wouldn't work here. Of course, this would deviate from ProdLDA.
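To illustrate the point about reparameterized Dirichlet samples, here is a small sketch using `torch.distributions.Dirichlet` directly. The tensor `raw` is a hypothetical stand-in for an encoder output, and the softplus transform and toy loss are assumptions for the example, not part of the tutorial.

```python
import torch
from torch.distributions import Dirichlet

torch.manual_seed(0)

# Hypothetical encoder output for 2 documents and 5 topics.
raw = torch.zeros(2, 5, requires_grad=True)
# Concentration parameters must be positive; softplus is one common choice.
concentration = torch.nn.functional.softplus(raw) + 1e-3

# rsample() draws a reparameterized sample, so gradients can flow back to `raw`.
theta = Dirichlet(concentration).rsample()

# Toy loss, only here to show that backward() reaches the encoder output.
loss = (theta ** 2).sum()
loss.backward()
```

After `backward()`, `raw.grad` is populated, which is what lets the Dirichlet replace the logistic-normal approximation inside a VAE-style model.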

nbviewer link

martinjankowiak · 2021-04-05T23:14:12Z

thanks @ahoho ! none of us are experts in NLP so it's great to have this tutorial see some attention from an NLP person. i hope your interest in it suggests that this can be a useful starting point for doing actual NLP modeling.

some comments/suggestions:

can you please expand/reword this comment? "NB: here we turn off the scaling to reduce..."? be explicit that there is still a bias term.
can you include the definition of the NPMI acronym?
are the warning filters new?
is 1 * (docs_test > 0) equivalent to (docs_test > 0).float()? if so can we use the latter which is more explicit?
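On the last question, a quick check with a toy tensor (the values of `docs_test` here are made up) suggests the two expressions agree in value but differ in dtype under PyTorch's default type promotion:

```python
import torch

# Toy bag-of-words counts; the real docs_test comes from the notebook.
docs_test = torch.tensor([[0.0, 2.0, 1.0], [3.0, 0.0, 0.0]])

mask_int = 1 * (docs_test > 0)        # bool tensor * Python int -> integer tensor
mask_float = (docs_test > 0).float()  # explicit cast to the default float dtype

same_values = torch.equal(mask_int.float(), mask_float)
```

So the `.float()` form is not just more explicit; it also avoids a later implicit cast if the mask is multiplied with float tensors.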

fehiepsi · 2021-04-06T03:51:57Z

I like the idea of evaluation. I just wonder if we need to normalize beta before computing any score (if so, we might update ProdLDA.beta method to reflect this)? You'll need to rerun the notebook from the beginning to avoid any seed-related inconsistency. Currently, the order of cell execution is a bit messy.

Increase the number of epochs to 200

This is a bit unfortunate. I guess as a tutorial, we don't need to achieve state-of-the-art results at the cost of a 4x slowdown. But it is fine as long as you think the result is good. You don't need to stick with the original ProdLDA implementation though.

By the way, it is not clear to me that the word cloud shows a better result than the current one. Maybe my judgment is out of the mainstream (given "NPMI correlates with human judgments of topic quality") or a good metric is still under active research?

BatchNorm1d

I don't have much of an opinion on this, but if you think having those additional bias parameters is good, it might be clearer to separate out the definition of BatchNorm into a function or a class, e.g.

def BatchNorm1d(vocab_size, eps=0.001, momentum=0.001):
    bn = nn.BatchNorm1d(
        vocab_size, eps=eps, momentum=momentum, affine=True
    )
    bn.weight.data.copy_(torch.ones(vocab_size))
    bn.weight.requires_grad = False
    return bn
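A quick way to confirm what such a helper leaves trainable is sketched below. The helper is repeated under the hypothetical name `make_batch_norm` so the snippet runs on its own; `vocab_size=4` is an arbitrary toy value.

```python
import torch
from torch import nn


def make_batch_norm(vocab_size, eps=0.001, momentum=0.001):
    """Batch norm with a learnable bias and a scale frozen at 1."""
    bn = nn.BatchNorm1d(vocab_size, eps=eps, momentum=momentum, affine=True)
    # Set the scale to ones explicitly, then exclude it from gradient updates.
    bn.weight.data.copy_(torch.ones(vocab_size))
    bn.weight.requires_grad = False
    return bn


bn = make_batch_norm(vocab_size=4)
# Only parameters with requires_grad=True will receive gradient updates.
trainable = [name for name, p in bn.named_parameters() if p.requires_grad]
```

Here `trainable` contains only `"bias"`, which matches the intent of turning off the scale while keeping the shift.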

I also wonder if we use BatchNorm to collect running mean/variance and bias parameters, will we want to use that information in prediction/evaluation, in particular the unnormalized beta?

We have a dictionary of 8,902 unique words

This is inconsistent with the above cell.

we needn't rely on ProdLDA's Laplace approximation and can just use distributions.Dirichlet as a drop-in replacement

Yeah, could you elaborate on the following sentence in the tutorial, "(Note, however, that PyTorch/Pyro have included support for reparameterizable gradients for the Dirichlet distribution since 2018).", with links to papers/references that use Dirichlet.rsample?

ahoho · 2021-04-06T14:06:12Z

Really appreciate the feedback! I do think this is a great starting point for neural topic models (in fact I'm surprised I didn't notice it before, pyro has proven great for experimentation with them)

I'll make the fixes you both suggested, but to respond to your questions:

are the warning filters new?

Since we turn off gradient updates for the BN scaling terms, the trace gives a warning about those variables. If there's a better way, I'll gladly update.

I just wonder if we need to normalize beta before computing any score (if so, we might update ProdLDA.beta method to reflect this)?

You do not, since the score just relies on an argsort over each row
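A small sketch of why normalization does not matter for the top-k words, assuming "normalize" means a row-wise softmax. The random `beta` below is a stand-in for the real topic-word weights.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = rng.normal(size=(3, 10))  # stand-in: 3 topics, 10 vocabulary items

# Row-wise softmax, shifted by the row max for numerical stability.
shifted = beta - beta.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

k = 5
top_raw = np.argsort(-beta, axis=1)[:, :k]    # top-k indices from raw weights
top_norm = np.argsort(-probs, axis=1)[:, :k]  # top-k indices after softmax
```

Because softmax is strictly increasing within each row, both argsorts select the same words in the same order, so the NPMI is unchanged.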

Increase the number of epochs to 200

I'll change this back, but make a note in the text that increasing it can help (or make it conditional on CUDA availability)
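One way the conditional could look is sketched below. The CPU value of 50 is an assumption inferred from the "4x slower" remark above, and `num_epochs` and `device` are illustrative names rather than the notebook's actual variables.

```python
import torch

# Train longer only when a GPU is available; keep the CPU run short.
use_cuda = torch.cuda.is_available()
num_epochs = 200 if use_cuda else 50  # 50 is an assumed CPU default
device = torch.device("cuda" if use_cuda else "cpu")
```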

it is not clear to me that the word cloud shows a better result than the current one. [...] or a good metric is still under active research?

Yeah, this could be the result of (a) using less data (since we went from all data to just train), (b) problems with preprocessing, or (c) issues with the metric, which can sometimes be attributed to bad preprocessing, since that affects the words included in the estimates. I will play around with preprocessing to see if this helps; if memory serves, I had better ProdLDA results with 20ng in the past. And yes, it's still under development; in fact, my own group is working on this now! That said, I still thought it made sense to have some sort of standard quantitative evaluation as a benchmark.

I also wonder if we use BatchNorm to collect running mean/variance and bias parameters, will we want to use that information in prediction/evaluation, in particular the unnormalized beta?

Hm, good question. In the NTM model I've usually built on in the past, the output of the BN layer is annealed to zero over training ((alpha) * beta_output + (1 - alpha) * bn_beta_output), so this hasn't come up. I guess it'd look something like this:

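# in ProdLDA class
def beta(self):
    # Each row of the identity selects a single topic, so row i of the
    # output holds the batch-normed word scores for topic i.
    pseudo_batch = torch.eye(self.num_topics)
    beta = self.decoder.bn(self.decoder.beta(pseudo_batch))
    return beta.cpu().detach().numpy()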

As a funny aside, I've seen a couple of topic modeling papers in the past two years that just pretend the Pathwise Derivatives paper doesn't exist, then come up with other ways of estimating a Dirichlet-based neural model, all while it's been implemented in torch from day one.

fehiepsi · 2021-04-06T15:39:18Z

the score just relies on an argsort over each row

You are right. The value is not used in the metric but used in plot_word_cloud. Probably it is not important...

it made sense to have some sort of standard quantitative evaluation as a benchmark

Yeah, agreed!

I guess it'd look something like this

Looks reasonable to me. If you want to incorporate this, just make sure to switch bn to eval mode for evaluation, then switch back to train mode.
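A sketch of the eval/train switch, using a hypothetical stand-in `Decoder` with `beta` and `bn` attributes as in the snippet above. The real tutorial decoder may differ, and `topic_word_scores` is an illustrative name.

```python
import torch
from torch import nn


class Decoder(nn.Module):
    """Hypothetical stand-in for the tutorial's decoder."""

    def __init__(self, vocab_size, num_topics):
        super().__init__()
        self.beta = nn.Linear(num_topics, vocab_size, bias=False)
        self.bn = nn.BatchNorm1d(vocab_size, eps=0.001, momentum=0.001, affine=True)
        self.bn.weight.requires_grad = False


def topic_word_scores(decoder, num_topics):
    """Batch-normed topic-word scores computed with the running statistics."""
    was_training = decoder.training
    decoder.eval()  # use running mean/variance and do not update them
    try:
        with torch.no_grad():
            device = next(decoder.parameters()).device
            pseudo_batch = torch.eye(num_topics, device=device)
            beta = decoder.bn(decoder.beta(pseudo_batch))
    finally:
        decoder.train(was_training)  # restore the previous mode
    return beta.cpu().numpy()


decoder = Decoder(vocab_size=6, num_topics=3)
scores = topic_word_scores(decoder, num_topics=3)
```

Without the switch to eval mode, the identity pseudo-batch would be normalized with its own batch statistics and would also nudge the running mean and variance.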

ahoho added 2 commits April 5, 2021 21:10

include NPMI calculation

b3d4606

replicate ProdLDA batch norm

1eb4212

fritzo added awaiting review Examples labels Apr 5, 2021
