Fine-tune cross-lingual translator for text2text generation #27

Open

artitw opened this issue Jul 31, 2021 · 32 comments
@artitw
Owner
artitw commented Jul 31, 2021

Fine-tune the cross-lingual translator for text2text generation tasks (e.g. question generation, question answering, summarization) to demonstrate cross-lingual alignment, zero-shot generation, and related capabilities.

For example, can we demonstrate question generation or question answering using the existing API? If not, what needs to get fixed?

https://github.com/artitw/text2text#training--finetuning

@artitw changed the title from "Test fine-tuning module to ensure functionality" to "Fine-tune cross-lingual translator for text2text generation" on Jul 31, 2021
@johnanisere

I will be working on this.

@artitw
Owner Author
artitw commented Aug 1, 2021

Awesome. For question generation, one approach to get started is to use the SQuAD dataset and pre-process it into context + answer -> question. Likewise, for question answering, pre-process it into context + question -> answer. This could then be used for the fine-tuning.

@johnanisere

Here is the link to the Colab notebook with the analysis of the question answering and question generation APIs: https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt?usp=sharing

@artitw
Owner Author
artitw commented Aug 14, 2021

Thanks very much for sharing the notebook. I can recommend two things to try:

  1. Use the existing pre-trained question-answering model to evaluate the performance of the question generation model (see the round-trip sketch after this list).
  2. Use the text2text fine-tuning API to see if we can get question generation to work on a pre-trained translator. Although the documentation provides an example for translating, there's nothing stopping us from using it for question generation. Depending on the results, we can dig deeper to understand how to develop the model further.
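For point 1, one way to structure that check is a round-trip comparison: generate a question from each (context, answer) pair, answer it with the pre-trained question-answering model, and compare the result against the original answer. A minimal sketch follows; the two model calls are left as placeholder callables because the exact Handler methods are not spelled out in this thread.

def round_trip_accuracy(examples, generate_question, answer_question):
    # examples: list of (context, answer) pairs
    # generate_question(context, answer) -> question from the question generation model
    # answer_question(context, question) -> answer from the pre-trained QA model
    matches = 0
    for context, answer in examples:
        question = generate_question(context, answer)
        predicted = answer_question(context, question)
        if predicted.strip().lower() == answer.strip().lower():
            matches += 1
    return matches / len(examples)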

@johnanisere

Thanks Art.

Feedback:
After evaluating the performance of the question generation model with the question-answering model, my conclusion is that it is accurate to a high degree. I have documented it here: https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt#scrollTo=BtnFEHUlProe&line=1&uniqifier=1

Blocker: when using the text2text fine-tuning API to load a pre-trained translator, I run out of space. I have experienced this on both AWS and Colab (I get a 'no space left on device' error message). I would appreciate any help I can get. I have attached screenshots here.
[Screenshots: 2021-08-21 at 18:47 and 18:50, showing the 'no space left on device' error]

@artitw
Owner Author
artitw commented Aug 22, 2021
  1. Would you be able to report the test set accuracy so that we can establish a benchmark? This would be useful for researchers as a better way to measure question generation performance.
  2. It looks like you are using the default model, which takes up a lot of space and memory. Try using a smaller model with the setting
    t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M"
    This was tested on Google's Colab environment.

@johnanisere

@artitw oh Okay. Got it

@johnanisere

Thanks Art.
The question generation API actually works on a pre-trained translator. I was able to demonstrate it here: https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt?usp=sharing

The next step is for me to report the test set accuracy so that we can establish a benchmark.

@artitw
Owner Author
artitw commented Sep 3, 2021

Reviewed the notebook. It looks like the fine-tuning was not performed on question generation data; rather, it was done using the example for translation. Could you try the following format? I updated the API in the repo to avoid confusion with the [SEP] token.

result = t2t.Handler(["I will go to school today to take my math exam. [SEP] school [TGT] Where will you go to take your math exam?"], 
            src_lang="en",
            tgt_lang="en",
            num_epochs=10, 
            save_directory="model_dir"
            ).fit()
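For reference, here is a minimal sketch of building many such training strings from SQuAD, assuming the Hugging Face datasets package (field names follow the SQuAD schema on the Hub):

import text2text as t2t
from datasets import load_dataset

squad = load_dataset("squad", split="train")

fit_lines = []
for row in squad:
    context = row["context"]
    question = row["question"]
    answer = row["answers"]["text"][0]  # SQuAD stores a list of answer texts
    fit_lines.append(f"{context} [SEP] {answer} [TGT] {question}")

# Fine-tune the pre-trained translator on context + answer -> question pairs
result = t2t.Handler(fit_lines,
            src_lang="en",
            tgt_lang="en",
            num_epochs=10,
            save_directory="model_dir"
            ).fit()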

@johnanisere

Oh I see

@johnanisere

Hi @artitw, I have gone back and redone the work, and the question generation API does work on a pre-trained translator. Here: https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt?usp=sharing

What strategy do you recommend for benchmarking the test set accuracy?

@artitw
Owner Author
artitw commented Sep 19, 2021

For benchmarking, we can start with lower casing the text and then calculating the exact match accuracy.

For finetuning a pretrained translator, we would have to use the translate (not question generation) API to generate the finetuned results.

@artitw
Owner Author
artitw commented Sep 26, 2021

In addition to exact match accuracy, it would be good to calculate average BLEU scores over the answers as well. For reference, see https://en.wikipedia.org/wiki/BLEU
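For concreteness, here is a sketch of both metrics, using NLTK for BLEU (the smoothing choice below is an assumption to avoid zero scores on short outputs, not something prescribed in this thread):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def exact_match_accuracy(predictions, references):
    # Lower-case both sides, then count exact string matches
    matches = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return matches / len(references)

def average_bleu(predictions, references):
    # Average sentence-level BLEU over all (prediction, reference) pairs
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([r.lower().split()], p.lower().split(),
                            smoothing_function=smooth)
              for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)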

@johnanisere

Hi @artitw, after training with about 33 data points, the pre-trained translator is still just translating the payload.
Do you suggest I train with even more data?
Here is my result: https://colab.research.google.com/drive/1vJ5U_UNFxeu92VVyhAhxKSur_BZSJSIJ?usp=sharing

@artitw
Owner Author
artitw commented Sep 26, 2021

Thanks for sharing the notebook. It looks like the right direction, but I would expect it to need much more training (>10k examples). I would also recommend saving the intermediate results in Google Drive so that you can pick up where you left off without starting over.
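For the Google Drive suggestion, a minimal Colab sketch (the Drive path below is only an example):

from google.colab import drive

drive.mount("/content/drive")

# Point save_directory at a Drive folder so checkpoints survive Colab disconnects
save_directory = "/content/drive/MyDrive/text2text_qg_model"

The same path can then be passed as the save_directory argument of the .fit() call.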

@johnanisere

@artitw oh okay, got it

@lere01
Contributor
lere01 commented Jan 3, 2022

Hi @artitw, I would like to continue from where John stopped.

@artitw
Owner Author
artitw commented Jan 3, 2022

Great, I've assigned you to this issue. Please review what John has done and let us know here if you have any questions.

@lere01
Contributor
lere01 commented Jan 3, 2022

Noted. I have reviewed John's work and played with the notebooks he reported. It seems that my assignments are the following, in order:

  1. Get sufficient (> 10k) training data.
  2. Report exact match accuracy.
  3. Report average BLEU scores for the answers.

Am I right? Do you have any suggestions on getting training data?

Thank you.

@artitw
Owner Author
artitw commented Jan 3, 2022

What you describe sounds like the right track. I would recommend starting with the English SQuAD [1] dataset and then using XQuAD [2] once that is somewhat working.

[1] https://rajpurkar.github.io/SQuAD-explorer/
[2] https://github.com/deepmind/xquad

@lere01
Contributor
lere01 commented Jan 18, 2022

Hi @artitw,

After trying different options that did not work out, I opted for Amazon SageMaker.

  • I loaded the datasets (JSON) to AWS S3
  • I dockerized the fine-tuning script and pushed the image to AWS ECR
  • I then created a job on SageMaker using the Docker image as a custom algorithm

The job has been running for some hours, taking the SQuAD [1] dataset as input. I will keep you updated.
I could not get access to an HPC cluster, so I followed this approach. Please let me know what you think.
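For reference, a custom-image training job of this kind can be launched roughly as follows with the SageMaker Python SDK; the image URI, bucket, role and instance type below are placeholders, not values from this thread.

import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/text2text-finetune:latest",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://<bucket>/text2text/output",
)

# The "training" channel is mounted at /opt/ml/input/data/training inside the container
estimator.fit({"training": "s3://<bucket>/text2text/squad"})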

@artitw
Owner Author
artitw commented Jan 20, 2022

Hi @lere01, what you suggest seems interesting. I would recommend using a small dataset to test your setup before running any heavy jobs.

@lere01
Contributor
lere01 commented Jan 30, 2022

Hi @artitw,

I used a small dataset to test my setup as you suggested and it worked fine. But the larger dataset took too long to run. I set the job to run for 5 days and even that time frame was not enough.

However, you can see some sort of proof of concept at https://colab.research.google.com/drive/1Vvem1DqNJZQej4t2qAIkZN0DyCdUY_sM#scrollTo=RXf2UrMvSc25.

  1. I used 50 rows from the training set to fine-tune.
  2. I then performed the translation task (answering) on 50 rows of the dev set.
  3. I calculated the BLEU score using the NLTK implementation and reported BLEU-1 to BLEU-4.
  4. 84% of the answers generated by the model were a perfect match for the references.

This was just to show that the whole process works. I would like your suggestion on how to proceed.

@artitw
Owner Author
artitw commented Feb 1, 2022

Hi @lere01,

Thanks for sharing your work and the summary. It looks like a good start. The main issue I can see is that the notebook you shared uses the Answerer model, not the fine-tuned translator that you fitted. We would have to perform predictions using the translator model, because we are using it for an unintended purpose.

@lere01
Contributor
lere01 commented Feb 16, 2022

Hi @artitw

Hope you have had a good day. Two things.

1. Before going further, I want to let you know that I am fine-tuning using


t2t.Handler([f"{CONTEXT} [TGT] {QUESTION}"], 
            src_lang="en",
            tgt_lang="en",
            num_epochs=10, 
            save_directory="model_dir"
            ).fit()

AND NOT


t2t.Handler([f"{CONTEXT} [SEP] {ANSWER} [TGT] {QUESTION}"], 
            src_lang="en",
            tgt_lang="en",
            num_epochs=10, 
            save_directory="model_dir"
            ).fit()

Am I on the right track?

@lere01
Contributor
lere01 commented Feb 16, 2022

2. I dug into the codebase and figured out a way to use the GPU.


By editing the Translator and doing this:

import text2text as t2t
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class Translator(t2t.Transformer):
    def __init__(self, **kwargs):
        pretrained_translator = self.__class__.PRETRAINED_TRANSLATOR
        # Load onto the GPU when one is available, otherwise fall back to CPU
        torch_device = "cuda" if torch.cuda.is_available() else "cpu"
        self.__class__.model = AutoModelForSeq2SeqLM.from_pretrained(pretrained_translator).to(torch_device)
        self.__class__.tokenizer = AutoTokenizer.from_pretrained(pretrained_translator)

What do you think?

@artitw
Owner Author
artitw commented Feb 16, 2022

The second format (with the [SEP] answer) should work, as we want to generate questions that correspond to a context and an answer.


@artitw
Owner Author
artitw commented Feb 16, 2022

Nice find on the GPU usage. I am referencing your pull request here: #31


@lere01
Contributor
lere01 commented Feb 17, 2022

Hi @artitw,

The dataset we are using for fine-tuning has multiple questions attached to each context, as opposed to one question per context. Do you think this might be affecting the model's learning?

@artitw
Owner Author
artitw commented Feb 18, 2022

Yes, I would suggest that the context be concatenated with the answer for each target question. This would ensure that each unique question is mapped to a unique input to the model.

@lere01
Contributor
lere01 commented May 30, 2022

Hi @artitw

I have been able to fine-tune on up to 50,000 examples of the training data (SQuAD 1.0). At 10,000, 20,000 and 50,000 examples, I tried the model on the dev set but got a BLEU score of 0 in all cases. Is this expected? Would you be able to take a look at my code to confirm that I am doing things right? I ran the code locally, but you can find it here: https://colab.research.google.com/drive/1z3YTjOF1dllxqSQPLgxDDeKOf9wJFfG3?usp=sharing

@artitw
Owner Author
artitw commented May 30, 2022

@lere01 thanks for your efforts on this and for sharing the notebook. Code looks fine to me, so good job with that. Can you share the prediction results after 50k training? If those don't look promising, we might have to put this project on hold until we can figure out how to train it more.
