Hello,

I am interested in processing the ARC dataset (http://nlpprogress.com/english/question_answering.html) with the GPT2 double heads model. The dataset is tab-delimited and structured as follows:

```

Question Answer

Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? (A) worldwide disease (B) global mountain building (C) rise of mammals that preyed upon plants and animals (D) impact of an asteroid created dust that blocked the sunlight. D

```

I know that I am supposed to tokenize the dataset before passing it to the GPT2 double heads model. How should I tokenize this data? More specifically:

  • Should I add a special token before each multiple-choice option label, i.e. (A), (B), (C) and (D)?
  • Should I add a special token before the text of each multiple-choice option?
  • Am I supposed to add the tokens "" and "" at the beginning and at the end of each question statement?
  • If I pass this data into a GPT2 Double Heads Model (the GPT2 model with two heads) to process multiple-choice questions, what should I do with the column that holds the correct answer to each question?
  • For instance, to generate an input for the GPT2 double heads model, should I split the original question into 4 sequences, one per multiple-choice option, and tokenize each of the 4 sequences as below?

    ```

    Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? (A) worldwide disease

    Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? (B) global mountain building

    Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? (C) rise of mammals that preyed upon plants and animals

    Which of these do scientists offer as the most recent explanation as to why many plants and animals died out at the end of the Mesozoic era? (D) impact of an asteroid created dust that blocked the sunlight.

    ```
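
To make the question concrete, here is a minimal sketch of the split I have in mind, in plain Python without any tokenizer. The helper names `split_row` and `build_choices` are hypothetical, and the `[CLS]` marker is only an assumption: it stands for a classification token that would have to be added to the GPT2 tokenizer's vocabulary. The sketch also assumes every row has exactly the options (A) to (D) and one tab between question and answer.

```python
import re

# Assumed classification marker; not part of GPT2's default vocabulary.
CLS_TOKEN = "[CLS]"

# Matches "(A) option text" up to the next option label or the end of the line.
OPTION_PATTERN = re.compile(r"\(([A-D])\)\s*(.*?)(?=\s*\([A-D]\)|$)")


def split_row(row):
    """Split one tab-delimited row into (stem, {label: option text}, answer label)."""
    text, answer = row.rstrip("\n").split("\t")
    first_option = OPTION_PATTERN.search(text)
    stem = text[:first_option.start()].strip()
    options = {label: body.strip() for label, body in OPTION_PATTERN.findall(text)}
    return stem, options, answer.strip()


def build_choices(stem, options):
    """Build one candidate sequence per option, each ending with the CLS marker."""
    labels = sorted(options)
    sequences = [f"{stem} ({label}) {options[label]} {CLS_TOKEN}" for label in labels]
    return labels, sequences


row = (
    "Which of these do scientists offer as the most recent explanation as to why "
    "many plants and animals died out at the end of the Mesozoic era? "
    "(A) worldwide disease (B) global mountain building "
    "(C) rise of mammals that preyed upon plants and animals "
    "(D) impact of an asteroid created dust that blocked the sunlight.\tD"
)

stem, options, answer = split_row(row)
labels, sequences = build_choices(stem, options)

# The answer column becomes the index of the correct candidate sequence.
mc_label = labels.index(answer)
```

As far as I understand, the Hugging Face `GPT2DoubleHeadsModel` would then take the four tokenized sequences as one `input_ids` entry of shape (batch, number of choices, sequence length), the position of the classification token in each sequence as `mc_token_ids`, and the answer index as `mc_labels`. Is that the right way to use the answer column?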

Thank you,

PS: I found this article, https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313, and it seems to address some of my questions, but it does not fully answer them.
