R does have limitations. The core libraries R is compiled against use 32-bit integers in places, so some indices and vector lengths are capped at the 32-bit signed integer limit (2^31 - 1, roughly 2 billion elements). It is therefore possible for an object (e.g. a data frame) to "run out of space" even when running R on a powerful large-memory computer.
There are ways around this, including packages that keep only small meta-objects in memory and store the very large objects in HDF5 or NetCDF files on disk (GenABEL and SNPRelate are examples). In addition, the generic packages bigmemory and ff can in some instances provide workarounds for the 32-bit integer limitation.
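For illustration, here is a minimal sketch of the file-backed approach using bigmemory (the object sizes and file names are arbitrary, not from any particular package's workflow):

```r
library(bigmemory)

## a file-backed matrix: the data live on disk, only a small descriptor sits in RAM
x <- filebacked.big.matrix(
  nrow = 1e6, ncol = 10, type = "double",
  backingfile = "big_x.bin", descriptorfile = "big_x.desc"
)

x[1:5, 1] <- rnorm(5)   # reads and writes go through the memory-mapped file
x[1:5, 1]
```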
This is not to say that R isn't a wonderful system, just to be clear that there are limitations.
You are correct, I just checked. I used to run into the problem of objects exceeding vector limits all the time for the reason stated above, and now run into a very similar problem in that data sets exceed RAM. I simply assumed it was due to the same root cause.
Nevertheless, the memory-mapping solutions mentioned above are useful when data sets exceed available RAM. I currently run problems that don't fit into 256 GB of RAM; one can throw more memory at the problem, but that gets costly quickly.
Gerard Tromp's answer covers the size limitations of R pretty well. I only want to add that, if need be, there are packages on CRAN that wrap a data frame and remove those limitations. Check out the CRAN Task View on High-Performance and Parallel Computing with R.
R is not a good choice when it comes to working with truly large-scale data (many GB), even if you have a quite powerful computer with decent memory. In that case it makes sense to consider Hive on HDFS...
The maximum length of a vector (or number of rows in a data frame) in R is still around 2 billion (2^31 - 1), a hard cap I hit some time ago. https://stackoverflow.com/questions/10640836/max-length-for-a-vector-in-r
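For reference, that cap is R's 32-bit signed integer maximum, which you can check directly:

```r
## the largest 32-bit signed integer; the indexing limits discussed above stem from it
.Machine$integer.max
#> [1] 2147483647
```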
Nonetheless, R is a great tool for analyzing medium-sized and big data:
You can use Spark via SparkR or sparklyr and scale analyses written with R wrappers around Spark.
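For illustration, a minimal sparklyr sketch against a local Spark instance (the file path and column names are made up; on a real cluster you would pass the cluster's master URL instead of "local"):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

## register a large CSV as a Spark table without pulling it into R's memory
flights <- spark_read_csv(sc, name = "flights", path = "flights.csv")

flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()                 # only the small aggregated result comes back to R

spark_disconnect(sc)
```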
Another solution is disk.frame, which lets you manipulate data as chunks (data.tables) written to and read from fst files on the hard disk.
Here the main limits are the space on the local hard drive, since it is currently not designed to work across machines (though that can be realized via the future package in one way or another), and the number of cores, since it processes chunks in parallel.
disk.frame is the fastest and my favorite out-of-core data manipulation solution I have worked with so far. It matches the speed of columnar databases such as MonetDB or DuckDB, with the advantage that you can simply use its map function to apply arbitrary R functions to any disk.frame that fits on your hard drive.
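A minimal sketch of that chunk-wise workflow (the CSV file and column names are made up; in recent versions of the package the map function mentioned above is called cmap()):

```r
library(disk.frame)
library(dplyr)

setup_disk.frame()   # start parallel workers for chunk processing

## split a CSV that is too big for RAM into fst-backed chunks on disk
fl <- csv_to_disk.frame("big_file.csv", outdir = "big_file.df")

fl %>%
  filter(amount > 0) %>%     # dplyr verbs are applied chunk by chunk
  select(id, amount) %>%
  collect()                  # only the (smaller) result is pulled into RAM

## arbitrary R functions can be mapped over the chunks as well,
## as long as each call returns a data.frame-like chunk
cmap(fl, function(chunk) head(chunk, 2)) %>% collect()
```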
duckdb is also a nice embedded database (the SQLite of analytics) for simple SQL work, and it is very fast on commodity hardware.
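For example, a quick sketch of the embedded usage through DBI (the database file name is arbitrary):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb(), dbdir = "analytics.duckdb")   # file-backed, embedded

dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl")

dbDisconnect(con, shutdown = TRUE)
```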
R is fun for big data. sparklyr scales almost endlessly via Google Cloud autoscaling, but I prefer to stay local and natively in R if possible; Spark is often overkill, and running pipelines on clusters that could run fast on a single workstation is not green (see for example https://www.r-bloggers.com/disk-frame-is-epic/).
It is amazing what you can do with disk.frame on an ordinary laptop with a fast SSD.
The cloud is not necessary for medium data, which ranges into the terabytes.
I heard that SparkR can run native R code too. I have not experimented with it, but I would expect it to run into the hard cap of ~2 billion records as well; to my knowledge, only using Spark itself (not native R code inside Spark), sparklyr, or disk.frame lets you go beyond that.
My current flow: 1. disk.frame. 2. If the data are too large for one machine, then sparklyr in the Google Cloud, which automatically manages the Spark cluster (something I don't want to be busy with).