Hello,
I am interested in processing the ARC dataset (http://nlpprogress.com/english/question_answering.html) with the GPT-2 double heads model (GPT2DoubleHeadsModel). The dataset (tab-delimited) is structured as below:
```
Question Answer
Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? (A) worldwide disease (B) global mountain building (C) rise of mammals that preyed upon plants and animals (D) impact of an asteroid created dust that blocked the sunlight. D
```
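For context, I am loading the file along these lines (a minimal sketch with pandas; the filename `arc_train.tsv` and the header row are my assumptions about the layout):

```python
import pandas as pd

# Hypothetical filename; assumes a header row with "Question" and "Answer" columns.
df = pd.read_csv("arc_train.tsv", sep="\t")
questions, answers = df["Question"].tolist(), df["Answer"].tolist()
```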
I know that I am supposed to tokenize the dataset before passing it into the GPT2 double heads model. How should I tokenize this data? More specifically: to generate the input sequences for the model, should I break the original question up into 4 sequences, one per multiple-choice option, and apply the tokenization to each of the 4 sequences, as below?
```
Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? (A) worldwide disease
Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? (B) global mountain building
Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? (C) rise of mammals that preyed upon plants and animals
Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? (D) impact of an asteroid created dust that blocked the sunlight.
```
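Concretely, here is the kind of preprocessing I have in mind (a rough sketch using the Hugging Face transformers library; the [CLS] classification token and the pad-with-EOS choice are my own assumptions, and whether this is the right format is exactly what I am asking):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT2DoubleHeadsModel scores each choice from the hidden state at a
# classification token; adding a [CLS] token for this is my assumption.
tokenizer.add_special_tokens({"cls_token": "[CLS]"})

question = (
    "Which of these do scientists offer as the most recent explanation as to "
    "why many plants and animals died out at the end of the Mesozoic era?"
)
choices = [
    "(A) worldwide disease",
    "(B) global mountain building",
    "(C) rise of mammals that preyed upon plants and animals",
    "(D) impact of an asteroid created dust that blocked the sunlight.",
]

# One sequence per choice, each ending in the classification token.
encoded = [tokenizer.encode(f"{question} {choice} [CLS]") for choice in choices]

# mc_token_ids marks the position of [CLS] in each (unpadded) sequence.
mc_token_ids = [len(ids) - 1 for ids in encoded]

# Pad with EOS (GPT-2 has no pad token by default) so the 4 sequences
# stack into one tensor of shape (batch=1, num_choices=4, seq_len).
max_len = max(len(ids) for ids in encoded)
input_ids = [ids + [tokenizer.eos_token_id] * (max_len - len(ids)) for ids in encoded]
```

If this is on the right track, I assume I would also need to call `model.resize_token_embeddings(len(tokenizer))` after adding the [CLS] token, but please correct me if not.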
Thank you,
PS: I found this article https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313 and it seems to address some of my questions, but it is not a complete answer.