I need to compute TF-IDF for 2M documents (abstracts). How much more effective would Linux be than Windows for an R environment? What kind of computer could finish this task quickly, say within a week?
Performance-wise, R under Windows should run about as fast as under Linux.
However, Linux is much more reliable than Windows: it is less prone to random crashes, forced restarts due to system updates, and the like. I think it is the best choice for you.
Assuming an average abstract length of 200 words, you have a total of 2*10^2 x 2*10^6 = 4*10^8 words to analyze, i.e. the equivalent of roughly 1,000 novels (http://en.wikipedia.org/wiki/Word_count). I might be wrong, but I wouldn't say it is a huge task. I expect any modern computer can complete it in (much?) less than a week, provided the text mining program is adequate.
Actually, I have not had the chance to process that many documents before, but could you give the PRETO tool we developed for preprocessing documents a try? It can produce TF-IDF matrix files in various formats, and it offers several preprocessing options such as stemming and n-gram generation.
Depending on the number of documents and, of course, on the number of distinct terms, you may need a reasonable amount of RAM on your machine. You will also need to raise the Java virtual machine's maximum memory limit as high as possible. The program is available on Google Code as an open-source project: https://code.google.com/p/preto/
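For reference, the JVM heap ceiling mentioned above is set with the -Xmx flag when launching the program (the jar name below is a placeholder, not the actual PRETO artifact):

```shell
# Allow the JVM up to 8 GB of heap; adjust to your machine's RAM
java -Xmx8g -jar preto.jar
```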
I really wonder if PRETO can handle 2M small documents.
PRETO supports text in both English and Turkish. We developed it for our text mining studies, specifically for document-term matrix generation for clustering. So far we have not worked with more than 50,000 documents; 2 GB of RAM was enough for them. I hope the RAM requirement does not grow in direct proportion to the number of documents, since the texts are small and the number of distinct terms will be limited. Unfortunately, however, I cannot forecast the RAM requirement for 2M documents. You may take a look at our paper about PRETO for more information.
Conference Paper PRETO: A High-performance Text Mining Tool for Preprocessing...
Using Gensim's TF-IDF implementation in Python (http://radimrehurek.com/gensim/models/tfidfmodel.html), you should be done overnight even on your Windows laptop ;-).
There is a good tutorial (http://radimrehurek.com/gensim/tutorial.html) to guide you. Gensim uses efficient libraries (NumPy, SciPy) for its matrix computations.
It is better to use Linux than Windows for large data when working with R: on Windows, when a crash occurs with large data, you need to restart the system to clear the RAM, whereas on Linux this can usually be done with a command, without restarting. That is a real nuisance on Windows.