What could be the approaches to combine the pairwise document similarity scores to get the overall similarity score of a certain document against a document collection?
You might, for example:

- calculate the centroid of the document collection and then compute the similarity between the document and that centroid;
- average the similarities between the document and all documents in the collection;
- take the minimum or maximum (i.e., the least similar or the most similar) of the pairwise similarities between the document and all documents in the collection.

As the similarity measure, you might use, e.g., cosine similarity or Euclidean distance.
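A minimal sketch of these aggregation strategies, assuming documents are already represented as fixed-length vectors (e.g., TF-IDF or embeddings) and using cosine similarity; the function and parameter names are illustrative:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def collection_similarity(doc, collection, method="centroid"):
    # Score one document vector against a collection of document vectors.
    # method: "centroid", "average", "min", or "max" (illustrative names).
    if method == "centroid":
        # Compare against the mean vector of the collection.
        return cosine_sim(doc, np.mean(collection, axis=0))
    sims = [cosine_sim(doc, d) for d in collection]
    if method == "average":
        return float(np.mean(sims))
    if method == "min":
        return float(np.min(sims))
    if method == "max":
        return float(np.max(sims))
    raise ValueError(f"unknown method: {method}")

# Toy example with 3-dimensional document vectors.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [1.0, 1.0, 0.0]])
query = np.array([1.0, 1.0, 0.0])
print(collection_similarity(query, docs, "centroid"))
print(collection_similarity(query, docs, "max"))
```

Note that the centroid variant is cheaper (one comparison instead of N) but can hide outliers that min/max aggregation would expose.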
You can use a Bag-of-Words approach to extract keywords and then group documents based on them. I used a similar approach for grouping news articles. If you need more info, you can find it here
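A rough sketch of this idea (not the exact approach from the linked answer): extract the most frequent non-stopword terms per document as its keywords, then greedily group documents whose keyword sets overlap. All names and thresholds here are illustrative assumptions:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "is", "of", "in", "and", "to", "for", "as"}

def top_keywords(text, k=3):
    # Tokenize, drop stopwords, and keep the k most frequent words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return {w for w, _ in counts.most_common(k)}

def group_by_keywords(articles, min_shared=1):
    # Greedily assign each article to the first group sharing at least
    # `min_shared` keywords; otherwise start a new group.
    groups = []  # list of (keyword_set, member_articles)
    for art in articles:
        kw = top_keywords(art)
        for gkw, members in groups:
            if len(kw & gkw) >= min_shared:
                members.append(art)
                gkw |= kw  # grow the group's keyword set in place
                break
        else:
            groups.append((set(kw), [art]))
    return groups

articles = [
    "stock market rises as investors buy stock",
    "market investors see stock gains",
    "football team wins championship match",
]
for keywords, members in group_by_keywords(articles):
    print(sorted(keywords), len(members))
```

The greedy single-pass grouping is order-dependent; for anything beyond a quick prototype, a proper clustering step (e.g., over TF-IDF vectors) would be more robust.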