You can actually exploit the sparseness of the document-by-term matrix when computing cosine similarity.
I believe it wouldn't take much time for 1M comparisons with the cosine similarity metric if you use a sparse matrix implementation correctly on a reasonably powerful machine (CRS or CCS format - http://en.wikipedia.org/wiki/Sparse_matrix).
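For instance, here is a minimal sketch using SciPy's CSR format (what the Wikipedia article calls CRS) and scikit-learn's normalizer; the toy matrix and query vector are made up for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# Toy document-by-term matrix (rows = documents, columns = terms).
# In practice this would come from your own vectorization step.
X = csr_matrix(np.array([
    [2, 0, 1, 0, 0],
    [0, 1, 0, 3, 0],
    [1, 0, 0, 0, 4],
]))

# L2-normalize the rows once; cosine similarity then reduces to a
# sparse matrix-vector product, which skips all the zero entries.
Xn = normalize(X, norm="l2", axis=1)

query = csr_matrix(np.array([[1, 0, 1, 0, 0]]))
qn = normalize(query, norm="l2", axis=1)

# One sparse product gives the cosine similarity of the query
# against every stored document.
sims = Xn @ qn.T               # shape (n_docs, 1), still sparse
print(sims.toarray().ravel())  # e.g. [0.9487, 0.0, 0.1715]
```

Because the rows are normalized up front, each new document costs only one sparse matrix-vector product against the whole collection.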
Depending on the application, you may first cluster the documents, then compare the new document with the clusters (of which there are far fewer than documents), and then with the documents in the most relevant cluster(s). Just an idea.
If you have a more specific question, I can try to help accordingly. Good luck!
For a cosine similarity calculation between two documents stored as dense vectors you need O(V) multiplications, where V is the vocabulary size. If you have N documents, comparing all pairs needs O(N²V).
But I think the question is also about accuracy, especially when the matrix is sparse, i.e. full of zeros.
I know a comparison takes V multiplications if we use regular full/dense vectors. However, I know from experience that document-term matrices are usually 95%+ sparse, which means we never actually need that many multiplications. This is where we exploit the sparseness. Ugur says they need to compare a new incoming document with the existing ones, so I expect the operation to take N cosine calculations, each with very few multiplications (with a CRS/CCS structure we never multiply by the zeros).
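To make the saving concrete, here is a rough sketch with dict-based sparse vectors standing in for CRS rows (the toy term indices and weights are made up): the dot product only touches terms that occur in both documents.

```python
# Sparse vectors as {term_index: weight} dicts, a stand-in for rows
# of a CRS/CCS matrix.
def sparse_cosine(a, b):
    """Return (cosine similarity, multiplications used in the dot product)."""
    # Iterate over the smaller vector and look terms up in the larger one,
    # so the zeros never enter the computation.
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    dot, mults = 0.0, 0
    for term, w in small.items():
        if term in large:
            dot += w * large[term]
            mults += 1
    norm_a = sum(w * w for w in a.values()) ** 0.5
    norm_b = sum(w * w for w in b.values()) ** 0.5
    return dot / (norm_a * norm_b), mults

doc_a = {0: 2.0, 7: 1.0, 42: 3.0}   # 3 nonzeros out of a large vocabulary
doc_b = {7: 2.0, 42: 1.0, 99: 5.0}
sim, mults = sparse_cosine(doc_a, doc_b)
print(sim, mults)  # 2 multiplications in the dot product, not |vocabulary|
```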
Of course, depending on the expected throughput, this brute-force pass could still be unacceptable in total processing time, so smarter schemes could be developed, such as clustering the whole document collection first.
You can proceed as follows (unsupervised classification):
1. Create the clusters using K-means for example.
2. Once a new document is introduced, you compare it to the centroid of each cluster, and the document is assigned to the closest one. Or you can show the user the set of similar documents, ordered by similarity distance (see the sketch after this list).
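A minimal sketch of those two steps, assuming scikit-learn with TF-IDF vectors; the toy corpus and the choice of 2 clusters are made up for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; replace with your own documents.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares amid market fears",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse doc-by-term matrix

# Step 1: cluster the existing documents.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Step 2: compare a new document to each centroid and pick the closest.
new_doc = vectorizer.transform(["the market dropped again"])
sims_to_centroids = cosine_similarity(new_doc, kmeans.cluster_centers_)
closest = sims_to_centroids.argmax()

# Optionally rank the documents inside that cluster by similarity.
members = [i for i, c in enumerate(kmeans.labels_) if c == closest]
ranking = sorted(members,
                 key=lambda i: cosine_similarity(new_doc, X[i])[0, 0],
                 reverse=True)
print(closest, ranking)
```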
The other possibility is to use a supervised method to classify the document. In that case, you could proceed as follows (a sketch of the full pipeline comes after the steps):
1. You select the set of training documents.
2. You create the vocabulary (in the case of a bag-of-words representation, this is the set of all distinct words in the training corpus). Then you create the set of vectors; each vector has the same size as the vocabulary.
3. You feed these vectors to the classifier. If you use SVMs, for instance, training produces a learned model.
4. For a new document, the learned model is used to predict its class.
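A minimal sketch of those four steps, assuming scikit-learn and a bag-of-words representation; the toy training documents and labels are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Step 1: a labeled training set (toy data; use your own corpus).
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares amid market fears",
]
train_labels = ["animals", "animals", "finance", "finance"]

# Step 2: build the vocabulary and the bag-of-words vectors.
# Every vector has one dimension per distinct word in the training corpus.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Step 3: feed the vectors to an SVM; fitting produces the learned model.
clf = LinearSVC().fit(X_train, train_labels)

# Step 4: vectorize a new document with the SAME vocabulary and predict.
X_new = vectorizer.transform(["a cute dog chased the cat"])
print(clf.predict(X_new))  # e.g. ['animals']
```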