You can actually exploit the sparseness of the document-by-term matrix when computing cosine similarity.
I believe it wouldn't take much time for 1M comparisons with the cosine similarity metric if you use a sparse matrix implementation correctly on a reasonably powerful machine (CRS or CCS format - http://en.wikipedia.org/wiki/Sparse_matrix).
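For instance, here is a minimal sketch using SciPy's CSR format (what the Wikipedia article calls CRS) and scikit-learn's normalizer; the toy matrix and query vector are made up for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

# Toy document-by-term matrix (rows = documents, columns = terms).
# In practice this would come from your own vectorization step.
X = csr_matrix(np.array([
    [2, 0, 1, 0, 0],
    [0, 1, 0, 3, 0],
    [1, 0, 0, 0, 4],
]))

# L2-normalize the rows once; cosine similarity then reduces to a
# sparse matrix-vector product, which skips all the zero entries.
Xn = normalize(X, norm="l2", axis=1)

query = csr_matrix(np.array([[1, 0, 1, 0, 0]]))
qn = normalize(query, norm="l2", axis=1)

# One sparse product gives the cosine similarity of the query
# against every stored document.
sims = Xn @ qn.T               # shape (n_docs, 1), still sparse
print(sims.toarray().ravel())  # e.g. [0.9487, 0.0, 0.1715]
```

Because the rows are normalized up front, each new document costs only one sparse matrix-vector product against the whole collection.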
Depending on the application, you may first cluster the documents, then compare the new document with the clusters (of which there are far fewer than documents), and then with the documents in the most relevant cluster(s). Just an idea.
If you have a more specific question, I can try to help accordingly. Good luck!
For a cosine similarity calculation between two documents stored as dense vectors you need O(V) multiplications, where V is the vocabulary size. If you have N documents, comparing all pairs needs O(N²V).
But I think the question is also about accuracy, especially when the matrix is sparse, i.e. full of zeros.
I know a comparison takes V multiplications if we use regular full/dense vectors. However, I know from experience that document-term matrices are usually 95%+ sparse, which means we never actually need that many multiplications. This is where we exploit the sparseness. Ugur says they need to compare a new incoming document with the existing ones, so I expect the operation to take N cosine calculations, each with very few multiplications (with a CRS/CCS structure we never multiply by the zeros).
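To make the saving concrete, here is a rough sketch with dict-based sparse vectors standing in for CRS rows (the toy term indices and weights are made up): the dot product only touches terms that occur in both documents.

```python
# Sparse vectors as {term_index: weight} dicts, a stand-in for rows
# of a CRS/CCS matrix.
def sparse_cosine(a, b):
    """Return (cosine similarity, multiplications used in the dot product)."""
    # Iterate over the smaller vector and look terms up in the larger one,
    # so the zeros never enter the computation.
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    dot, mults = 0.0, 0
    for term, w in small.items():
        if term in large:
            dot += w * large[term]
            mults += 1
    norm_a = sum(w * w for w in a.values()) ** 0.5
    norm_b = sum(w * w for w in b.values()) ** 0.5
    return dot / (norm_a * norm_b), mults

doc_a = {0: 2.0, 7: 1.0, 42: 3.0}   # 3 nonzeros out of a large vocabulary
doc_b = {7: 2.0, 42: 1.0, 99: 5.0}
sim, mults = sparse_cosine(doc_a, doc_b)
print(sim, mults)  # 2 multiplications in the dot product, not |vocabulary|
```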
Of course, depending on the expected throughput, this brute-force pass could still be unacceptable in total processing time, so smarter schemes could be developed, such as clustering the whole document collection first.
You can proceed as follows (unsupervised classification):
1. Create the clusters using K-means for example.
2. Once a new document is introduced, you compare it to the centroid of each cluster, and the document is assigned to the closest one. Or you can show the user the set of similar documents, ordered by similarity distance (see the sketch after this list).
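A minimal sketch of those two steps, assuming scikit-learn with TF-IDF vectors; the toy corpus and the choice of 2 clusters are made up for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; replace with your own documents.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares amid market fears",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse doc-by-term matrix

# Step 1: cluster the existing documents.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Step 2: compare a new document to each centroid and pick the closest.
new_doc = vectorizer.transform(["the market dropped again"])
sims_to_centroids = cosine_similarity(new_doc, kmeans.cluster_centers_)
closest = sims_to_centroids.argmax()

# Optionally rank the documents inside that cluster by similarity.
members = [i for i, c in enumerate(kmeans.labels_) if c == closest]
ranking = sorted(members,
                 key=lambda i: cosine_similarity(new_doc, X[i])[0, 0],
                 reverse=True)
print(closest, ranking)
```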
The other possibility is to use a supervised method to classify the document. In that case, you could proceed as follows (a sketch of the full pipeline comes after the steps):
1. You select the set of training documents.
2. You create the vocabulary (in the case of a bag-of-words representation, this is the set of all distinct words in the training corpus). Then you create the set of vectors; each vector has the same size as the vocabulary.
3. You feed these vectors to the classifier. If you use SVMs, for instance, training produces a learned model.
4. For a new document, the learned model is used to predict its class.
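A minimal sketch of those four steps, assuming scikit-learn and a bag-of-words representation; the toy training documents and labels are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Step 1: a labeled training set (toy data; use your own corpus).
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares amid market fears",
]
train_labels = ["animals", "animals", "finance", "finance"]

# Step 2: build the vocabulary and the bag-of-words vectors.
# Every vector has one dimension per distinct word in the training corpus.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Step 3: feed the vectors to an SVM; fitting produces the learned model.
clf = LinearSVC().fit(X_train, train_labels)

# Step 4: vectorize a new document with the SAME vocabulary and predict.
X_new = vectorizer.transform(["a cute dog chased the cat"])
print(clf.predict(X_new))  # e.g. ['animals']
```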