R does have limitations. The core libraries R is compiled against use 32-bit integers in places, so some indices and vector lengths are capped at the 32-bit signed integer limit (2^31 - 1, roughly 2 billion elements). It is therefore possible for an object (e.g. a data frame) to "run out of space" even when running R on a powerful large-memory computer.
There are ways around this, including packages that keep only small meta-objects in memory and store the very large objects in HDF5 or NetCDF files on disk (GenABEL and SNPRelate are examples). In addition, the generic packages bigmemory and ff can in some instances provide workarounds for the 32-bit integer limitation.
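For illustration, here is a minimal sketch of the file-backed approach using bigmemory (the object sizes and file names are arbitrary, not from any particular package's workflow):

```r
library(bigmemory)

## a file-backed matrix: the data live on disk, only a small descriptor sits in RAM
x <- filebacked.big.matrix(
  nrow = 1e6, ncol = 10, type = "double",
  backingfile = "big_x.bin", descriptorfile = "big_x.desc"
)

x[1:5, 1] <- rnorm(5)   # reads and writes go through the memory-mapped file
x[1:5, 1]
```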
This is not to say that R isn't a wonderful system, just to be clear that there are limitations.
You are correct, I just checked. I used to run into the problem of objects exceeding vector limits all the time for the reason stated above, and now run into a very similar problem in that data sets exceed RAM. I simply assumed it was due to the same root cause.
Nevertheless, the memory-mapping solutions mentioned above are useful when data sets exceed available RAM. I currently run problems that don't fit into 256 GB of RAM; one can throw more memory at the problem, but that gets costly quickly.
Gerard Tromp's answer covers the size limitations of R pretty well. I only want to add that, if need be, there are packages on CRAN that wrap a data frame and remove those limitations. Check out the CRAN Task View on High-Performance and Parallel Computing with R.
R is not a good choice when it comes to working with truly large-scale data (many GB), even if you have a quite powerful computer with decent memory. In that case it makes sense to consider Hive on HDFS...
The maximum length of a vector (or number of rows in a data frame) in R is still around 2 billion (2^31 - 1), a hard cap I hit some time ago. https://stackoverflow.com/questions/10640836/max-length-for-a-vector-in-r
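For reference, that cap is R's 32-bit signed integer maximum, which you can check directly:

```r
## the largest 32-bit signed integer; the indexing limits discussed above stem from it
.Machine$integer.max
#> [1] 2147483647
```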
Nonetheless, R is a great tool for analyzing medium-sized and big data:
You can use Spark via SparkR or sparklyr and scale analyses written with R wrappers around Spark.
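For illustration, a minimal sparklyr sketch against a local Spark instance (the file path and column names are made up; on a real cluster you would pass the cluster's master URL instead of "local"):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

## register a large CSV as a Spark table without pulling it into R's memory
flights <- spark_read_csv(sc, name = "flights", path = "flights.csv")

flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()                 # only the small aggregated result comes back to R

spark_disconnect(sc)
```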
Another solution is disk.frame, which lets you manipulate data as chunks (data.tables) written to and read from fst files on the hard disk.
Here the main limits are the space on the local hard drive, since it is currently not designed to work across machines (though that can be realized via the future package in one way or another), and the number of cores, since it processes chunks in parallel.
disk.frame is the fastest and my favorite out-of-core data manipulation solution I have worked with so far. It matches the speed of columnar databases such as MonetDB or DuckDB, with the advantage that you can simply use its map function to apply arbitrary R functions to any disk.frame that fits on your hard drive.
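A minimal sketch of that chunk-wise workflow (the CSV file and column names are made up; in recent versions of the package the map function mentioned above is called cmap()):

```r
library(disk.frame)
library(dplyr)

setup_disk.frame()   # start parallel workers for chunk processing

## split a CSV that is too big for RAM into fst-backed chunks on disk
fl <- csv_to_disk.frame("big_file.csv", outdir = "big_file.df")

fl %>%
  filter(amount > 0) %>%     # dplyr verbs are applied chunk by chunk
  select(id, amount) %>%
  collect()                  # only the (smaller) result is pulled into RAM

## arbitrary R functions can be mapped over the chunks as well,
## as long as each call returns a data.frame-like chunk
cmap(fl, function(chunk) head(chunk, 2)) %>% collect()
```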
duckdb is also a nice embedded database (the SQLite of analytics) for simple SQL work, and it is very fast on commodity hardware.
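For example, a quick sketch of the embedded usage through DBI (the database file name is arbitrary):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb(), dbdir = "analytics.duckdb")   # file-backed, embedded

dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl")

dbDisconnect(con, shutdown = TRUE)
```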
R is fun for big data. sparklyr scales almost endlessly via Google Cloud autoscaling, but I prefer to stay local and natively in R if possible; Spark is often overkill, and running pipelines on clusters that could run fast on a single workstation is not green (see for example https://www.r-bloggers.com/disk-frame-is-epic/).
It is amazing what you can do with disk.frame on an ordinary laptop with a fast SSD.
The cloud is not necessary for medium data, which ranges into the terabytes.
I heard that SparkR can run native R code too. I have not experimented with it, but I would expect it to run into the hard cap of ~2 billion records as well; to my knowledge, only using Spark itself (not native R code inside Spark), sparklyr, or disk.frame lets you go beyond that.
My current flow: 1. disk.frame. 2. If the data are too large for one machine, then sparklyr in the Google Cloud, which automatically manages the Spark cluster (something I don't want to be busy with).