I need to compute TF-IDF for 2M documents (abstracts). How much more effective would Linux be than Windows for an R environment? What kind of computer could finish this task quickly, say within a week?
Performance-wise, R under Windows should run about as fast as under Linux.
However, Linux is much more reliable than Windows: it is less prone to random crashes, forced restarts due to system updates, and the like. I think it is the best choice for you.
Assuming an average abstract length of 200 words, you have a total of 2*10^2 x 2*10^6 = 4*10^8 words to analyze, i.e. the equivalent of roughly 1,000 novels (http://en.wikipedia.org/wiki/Word_count). I might be wrong, but I wouldn't say it is a huge task. I expect any modern computer can complete it in (much?) less than a week, provided the text mining program is adequate.
Actually, I have not had the chance to process that many documents before, but could you give the PRETO tool we developed for preprocessing documents a try? It can produce TF-IDF matrix files in various formats, and it offers several preprocessing options such as stemming and n-gram generation.
Depending on the number of documents and, of course, on the number of distinct terms, you may need a reasonable amount of RAM on your machine. You will also need to raise the Java virtual machine's maximum memory limit as high as possible. The program is available on Google Code as an open-source project: https://code.google.com/p/preto/
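For reference, the JVM heap ceiling mentioned above is set with the -Xmx flag when launching the program (the jar name below is a placeholder, not the actual PRETO artifact):

```shell
# Allow the JVM up to 8 GB of heap; adjust to your machine's RAM
java -Xmx8g -jar preto.jar
```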
I really wonder if PRETO can handle 2M small documents.
PRETO supports text in both English and Turkish. We developed it for our text mining studies, specifically for document-term matrix generation for clustering. So far we have not worked with more than 50,000 documents; 2 GB of RAM was enough for them. I hope the RAM requirement does not grow in direct proportion to the number of documents, since the texts are small and the number of distinct terms will be limited. Unfortunately, however, I cannot forecast the RAM requirement for 2M documents. You may take a look at our paper about PRETO for more information.
Conference Paper PRETO: A High-performance Text Mining Tool for Preprocessing...
Using Gensim's TF-IDF implementation in Python (http://radimrehurek.com/gensim/models/tfidfmodel.html), you should be done overnight even on your Windows laptop ;-).
There is a good tutorial (http://radimrehurek.com/gensim/tutorial.html) to guide you. Gensim uses efficient libraries (NumPy, SciPy) for its matrix computations.
It is better to use Linux than Windows for large data when working with R: on Windows, when a crash occurs with large data, you need to restart the system to clear the RAM, whereas on Linux this can usually be done with a command, without restarting. That is a real nuisance on Windows.