What I've done (not yet published) is extract several (e.g. 16) mask features per image with Mask R-CNN across multiple images, then build a distance metric that compares them to predict the image class. Each set of feature vectors is labelled with its image's label.
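For concreteness, here is a minimal sketch of that kind of pairwise comparison (shapes and the toy margin loss are my assumptions, not your actual setup; it assumes the 16 mask features per image are already pooled into fixed-length vectors):

```python
import torch

n_images, feats_per_image, dim = 4, 16, 256
# stacked mask features: 4 images x 16 features each = 64 vectors
# (random here as a stand-in for real Mask R-CNN features)
feats = torch.randn(n_images * feats_per_image, dim, requires_grad=True)

# full 64x64 matrix of pairwise Euclidean distances
dists = torch.cdist(feats, feats)          # shape (64, 64)

# every feature vector inherits its image's label
labels = torch.arange(n_images).repeat_interleave(feats_per_image)
same = labels[:, None] == labels[None, :]  # (64, 64) positive-pair mask

# toy contrastive-style objective: pull same-label pairs together,
# push different-label pairs beyond a margin
margin = 1.0
loss = torch.where(same, dists, (margin - dists).clamp(min=0)).mean()
loss.backward()
```

The quadratic growth of that distance matrix (and of the autograd graph behind it) is exactly where the VRAM pressure comes from.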
The problem is that even with, say, 4 images, that means 4 forward+backward passes plus 16×4 = 64 feature vectors, i.e. a 64×64 matrix of pairwise distances, each contributing to the loss computation and backprop. This puts pressure on GPU VRAM, which prevents using large sample sizes and hence reduces efficiency. I'd be grateful for an open-source solution combining deep learning with supervised clustering that can handle large samples of images, if such a thing exists.
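One common way to relieve the VRAM pressure in metric-learning setups (used, for example, in cross-batch-memory approaches) is to compare the current mini-batch against a ring buffer of *detached* embeddings from past batches: the memory side carries no autograd graph, so gradients only flow through the current batch. This is a hedged sketch, not your method; the class and function names are hypothetical:

```python
import torch
import torch.nn.functional as F

class CrossBatchMemory:
    """Ring buffer of detached embeddings + labels from past mini-batches.

    Stored vectors carry no autograd graph, so comparing the current
    batch against thousands of past features adds almost no VRAM cost.
    """
    def __init__(self, dim, size=4096):
        self.feats = torch.zeros(size, dim)
        self.labels = torch.full((size,), -1, dtype=torch.long)
        self.ptr, self.size, self.filled = 0, size, 0

    def enqueue(self, feats, labels):
        n = feats.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.size
        self.feats[idx] = feats.detach().cpu()   # cut the autograd graph
        self.labels[idx] = labels.cpu()
        self.ptr = (self.ptr + n) % self.size
        self.filled = min(self.filled + n, self.size)

    def get(self):
        return self.feats[:self.filled], self.labels[:self.filled]

def supervised_contrastive_loss(batch_feats, batch_labels, memory, tau=0.1):
    """Pull same-label pairs together, push different-label pairs apart.

    Only `batch_feats` requires grad; the memory side is detached, so the
    backward pass covers just the current mini-batch.
    """
    mem_feats, mem_labels = memory.get()
    q = F.normalize(batch_feats, dim=1)
    k = F.normalize(mem_feats, dim=1)            # no grad flows here
    sim = q @ k.t() / tau                        # (B, M) similarity matrix
    pos = (batch_labels[:, None] == mem_labels[None, :]).float()
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-likelihood of each anchor's positives
    denom = pos.sum(1).clamp(min=1)
    return (-(pos * log_prob).sum(1) / denom).mean()

# usage: enqueue each batch's pooled features, then compute the loss
mem = CrossBatchMemory(dim=16, size=128)
feats = torch.randn(8, 16, requires_grad=True)
labels = torch.randint(0, 3, (8,))
mem.enqueue(feats, labels)
loss = supervised_contrastive_loss(feats, labels, mem)
loss.backward()
```

With this, the effective comparison set can be thousands of vectors while the backprop graph stays the size of one mini-batch; off-the-shelf implementations of this idea exist in libraries such as pytorch-metric-learning.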