
I am using the Hugging Face `mrm8488/longformer-base-4096-finetuned-squadv2` pre-trained model:

https://huggingface.co/mrm8488/longformer-base-4096-finetuned-squadv2

I want to generate sentence-level embeddings. I have a DataFrame with a text column.

## Objective:

Create sentence/document embeddings using the **LongformerForMaskedLM** model. We don't have labels in our dataset, so we want to cluster the generated embeddings. Please let me know if the code below is correct.

## Environment info

- `transformers` version: **3.0.2**
- Platform:
- Python version: **Python 3.6.12 :: Anaconda, Inc.**
- PyTorch version (GPU?): **1.7.1**
- Tensorflow version (GPU?): **2.3.0**
- Using GPU in script?: **Yes**
- Using distributed or parallel set-up in script?: **parallel**

## Information

I have fine-tuned LongformerForMaskedLM and saved it as a `.bin` file. I am trying to use this model to generate an embedding for every document (that is, one row of the pandas DataFrame).

## Code:

```
import torch
import pandas as pd
from transformers import LongformerTokenizer, LongformerForMaskedLM

# Load the fine-tuned model and ask it to return all hidden states.
model = LongformerForMaskedLM.from_pretrained('file-path', output_hidden_states=True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

# Put the model in "evaluation" mode (disables dropout; feed-forward only).
model.eval()

df = pd.read_csv("inshort_news_data-1.csv")
# The **news_article** column is used to generate embeddings.
```

```
all_content = list(df['news_article'])

def sentence_bert():
    list_of_emb = []
    for i in range(len(all_content)):
        SAMPLE_TEXT = all_content[i]  # one long input document
        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)
        # How to include a batch size here? (See the sketch after this block.)

        # Attention-mask values -- 0: no attention, 1: local attention, 2: global attention
        attention_mask = torch.ones(input_ids.shape, dtype=torch.long,
                                    device=input_ids.device)  # initialize to local attention
        attention_mask[:, [0, -1]] = 2  # Is this correct?

        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)

        # Without labels the output tuple is (prediction_scores, hidden_states),
        # so the hidden states are at index 1, not 2.
        hidden_states = outputs[1]

        # Stack the 13 hidden-state tensors into one tensor.
        token_embeddings = torch.stack(hidden_states, dim=0)
        # Remove dimension 1, the "batches".
        token_embeddings = torch.squeeze(token_embeddings, dim=1)
        # Swap dimensions 0 and 1: [tokens, layers, hidden_size].
        token_embeddings = token_embeddings.permute(1, 0, 2)

        token_vecs_sum = []
        # For each token in the document...
        for token in token_embeddings:
            # ...sum the last 4 hidden layers to get one vector per token.
            sum_vec = torch.sum(token[-4:], dim=0)
            # Use `sum_vec` to represent `token`.
            token_vecs_sum.append(sum_vec)

        # Sum the per-token vectors into a single document vector.
        doc_embedding = torch.stack(token_vecs_sum, dim=0).sum(dim=0)
        list_of_emb.append(doc_embedding)

    return list_of_emb

f = sentence_bert()
```
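
Regarding the "How to include a batch size here?" comment: a possible batched variant is sketched below, assuming the `padding`/`truncation` arguments of the tokenizer call available in `transformers` 3.x. It encodes several documents at once, keeps global attention on `<s>`, and mean-pools the last hidden layer per document while ignoring padding. A rough sketch, not tested against this exact setup:

```
import torch

def embed_batch(texts, batch_size=8):
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=4096, return_tensors="pt")
        attention_mask = enc["attention_mask"].clone()
        attention_mask[:, 0] = 2  # global attention on <s> for every document

        with torch.no_grad():
            outputs = model(enc["input_ids"], attention_mask=attention_mask)

        seq_len = enc["input_ids"].shape[1]
        last_hidden = outputs[1][-1][:, :seq_len, :]        # [batch, seq_len, 768]

        # Mean-pool over real (non-padding) tokens only.
        mask = enc["attention_mask"].unsqueeze(-1).float()  # [batch, seq_len, 1]
        summed = (last_hidden * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1)
        embeddings.append(summed / counts)
    return torch.cat(embeddings, dim=0)

# doc_embeddings = embed_batch(all_content)
```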

## Doubts/Questions:

1. The code sets the attention-mask values of the first token (`<s>`) and the last token (`</s>`) of the document to 2. Is this the correct approach to get global attention when generating an embedding for one document? (A variant is sketched right after question 2.)

2. The code puts the model in evaluation mode with `model.eval()`. What exactly does that do?
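
For question 1, this is the variant I have seen suggested most often and am considering instead: local attention everywhere and global attention only on the first (`<s>`) token. The value-2 convention matches `transformers` 3.0.2; newer releases express the same thing through a separate `global_attention_mask` argument. A minimal sketch, not an authoritative answer:

```
import torch

# Attention-mask convention in transformers 3.0.2:
#   0 = padding (no attention), 1 = local attention, 2 = global attention
input_ids = torch.tensor(tokenizer.encode("a short example document")).unsqueeze(0)

attention_mask = torch.ones_like(input_ids)  # local attention everywhere
attention_mask[:, 0] = 2                     # global attention on <s> only

with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
```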

The output is a tuple of size 2:

```
outputs[0] -- the sequence (logits) output:
    torch.Size([1, 34, 50265])

outputs[1] -- the hidden states: a tuple of length 13,
    each element of shape torch.Size([1, 512, 768])
    (i.e. [13, 512, 768] after stacking and squeezing the batch dimension)
```

3. What does `outputs[0]`, of dimension `torch.Size([1, 34, 50265])`, signify? `34` is my sentence length and `50265` is the vocabulary size. I understand that it is the logits output, but how should it be interpreted in plain English?

4. How can this code be corrected or changed to get proper document embeddings? Any other approach would also be helpful (one variant I am considering is sketched after question 5).

5. The code sums up the last 4 hidden layers for each token. What can be done to instead normalize with the attention mask and then take the average? (See the sketch below.)
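
For questions 4 and 5, this is the pooling variant I am considering: mask out padding positions using the attention mask and average the last hidden layer over the remaining tokens. It also slices the hidden states back to the input length, since the shapes above suggest Longformer pads the sequence internally to a multiple of its attention window (512). A minimal sketch only, reusing the model and tokenizer loaded above:

```
import torch

def masked_mean_embedding(doc_text):
    input_ids = torch.tensor(tokenizer.encode(doc_text)).unsqueeze(0)
    attention_mask = torch.ones_like(input_ids)
    attention_mask[:, 0] = 2  # global attention on <s>

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)

    seq_len = input_ids.shape[1]
    # Slice off any internal window padding so the shapes match the mask.
    last_hidden = outputs[1][-1][:, :seq_len, :]       # [1, seq_len, 768]

    # Weight each position by its attention-mask entry (>0) and average.
    mask = (attention_mask > 0).unsqueeze(-1).float()  # [1, seq_len, 1]
    summed = (last_hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1)
    return (summed / counts).squeeze(0)                # [768]
```

Averaging instead of summing keeps the embedding magnitude independent of document length, which should make the downstream clustering less sensitive to how long each document is.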

## Expected behavior

Document1: Embeddings

Document2: Embeddings
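
For the clustering step mentioned in the objective, something like the following could then run on top of the returned embeddings. A minimal sketch using scikit-learn's `KMeans`; the number of clusters (5) is an arbitrary placeholder:

```
import torch
from sklearn.cluster import KMeans

# f is the list of per-document embedding tensors returned by sentence_bert()
X = torch.stack(f).cpu().numpy()               # [num_documents, 768]

kmeans = KMeans(n_clusters=5, random_state=0)  # 5 clusters is a placeholder choice
df["cluster"] = kmeans.fit_predict(X)
```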
