Well, this is the usual way of training CNNs on very large datasets. Instead of computing the gradient on the whole dataset, you estimate it on a small subset (a mini-batch, typically of size 16 to 256), which allows you to stream the training samples.
The extreme (and original) online learning scheme is to process one example at a time, but nothing prevents you from accumulating samples in a buffer and performing one gradient-descent update once the buffer is full.
For very big networks, you may not have the computational resources to process more than one sample at a time. In that case, just use a very large momentum (0.99 or so).
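Here is a minimal PyTorch-style sketch of both ideas (sample-by-sample streaming, accumulation into a buffer before each update, and a high momentum). The model, the data stream and the hyper-parameters are illustrative assumptions, not a definitive recipe:

```python
import torch

# Dummy stand-in for a real per-sample data stream (e.g. read from disk or the network).
def stream_of_samples(n=1000):
    for _ in range(n):
        yield torch.randn(128), torch.randint(0, 10, ())

model = torch.nn.Linear(128, 10)                   # placeholder for a real CNN
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.99)

buffer_size = 32                                   # number of samples accumulated per update
optimizer.zero_grad()
for i, (x, y) in enumerate(stream_of_samples()):
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    (loss / buffer_size).backward()                # gradients accumulate across the buffer
    if (i + 1) % buffer_size == 0:
        optimizer.step()                           # one gradient-descent update per full buffer
        optimizer.zero_grad()
```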
In practice, you can use file formats that do not require loading the entire file into memory, such as HDF5, or you can read your data from a network connection as they arrive.
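For instance, a minimal h5py sketch (the file name and dataset keys are assumptions) that reads mini-batches from disk without ever loading the whole file could look like this:

```python
import h5py
import numpy as np

batch_size = 64
with h5py.File("training_data.h5", "r") as f:
    images, labels = f["images"], f["labels"]               # datasets stay on disk
    for start in range(0, len(images), batch_size):
        x = np.asarray(images[start:start + batch_size])    # only this slice is read into memory
        y = np.asarray(labels[start:start + batch_size])
        # ... run one gradient update on (x, y) here ...
```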
If your code (e.g. TensorFlow run from Anaconda) reads cloud-hosted datasets (MNIST, CelebA, face datasets, ...) as it runs, you can say you are doing online training within your own code.
We usually distinguish four optimization modes in machine learning:
1) Off-line / Batch
2) On-line
3) Recursive
4) Incremental
The Off-line / Batch mode is the classical learning mode. The estimation/learning dataset is considered as a whole. The optimal estimation model can then be determined either directly (by the Moore-Penrose generalized inverse, Section 2 from https://www.researchgate.net/publication/275590644_Learning_deep_representations_via_extreme_learning_machines ) when the optimization problem is linear, or iteratively by overall/batch gradient descent when facing nonlinearities ( https://arxiv.org/pdf/1609.04747.pdf ).
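As a toy illustration of these two batch-mode options (assuming a simple linear least-squares problem with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # the whole estimation dataset, held in memory
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=200)

# 1) Direct solution via the Moore-Penrose generalized inverse (linear case).
w_direct = np.linalg.pinv(X) @ y

# 2) Iterative batch gradient descent (the route taken when the model is nonlinear).
w = np.zeros(5)
lr = 0.1
for _ in range(200):
    grad = X.T @ (X @ w - y) / len(X)              # gradient over the *whole* batch
    w -= lr * grad
```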
The On-line mode consists of estimating the parameters of the model iteratively by means of stochastic gradient descent while presenting the estimation data sequentially, one by one ( https://arxiv.org/pdf/1609.04747.pdf ). This has the advantage that all the data need not be stored in memory simultaneously.
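A minimal on-line sketch of the same toy problem, where each iteration sees exactly one example and only the parameter vector is kept between steps:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)
lr = 0.05
for _ in range(10_000):                            # each iteration sees exactly one example
    x = rng.normal(size=5)
    y = x @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.01 * rng.normal()
    grad = (x @ w - y) * x                         # gradient of the squared error on this sample
    w -= lr * grad
```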
The Recursive mode is an On-line mode with, in addition, continuously optimal estimation of the model parameters ( https://www.researchgate.net/publication/6666254_A_Fast_and_Accurate_Online_Sequential_Learning_Algorithm_for_Feedforward_Networks and http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.5176&rep=rep1&type=pdf ). This mode has the advantage of adapting the model dynamically and optimally, without requiring complete re-learning each time new input data are processed.
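A classic example of this mode is recursive least squares. The sketch below (toy data, assumed linear model; OS-ELM applies essentially this update to the outputs of a random hidden layer) keeps the parameters at the least-squares optimum over all samples seen so far without re-learning from scratch:

```python
import numpy as np

d = 5
w = np.zeros(d)
P = 1e3 * np.eye(d)                                # inverse-covariance estimate, large initial value
rng = np.random.default_rng(0)
for _ in range(1_000):                             # samples arrive one at a time
    x = rng.normal(size=d)
    y = x @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.01 * rng.normal()
    Px = P @ x
    k = Px / (1.0 + x @ Px)                        # gain vector
    w = w + k * (y - x @ w)                        # correct the current prediction error
    P = P - np.outer(k, Px)                        # rank-one update of the inverse covariance
```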
The Incremental mode denotes an optimal dynamic building of the estimation model in the course of learning ( https://www.researchgate.net/profile/Chee_Siew/publication/6928613_Universal_Approximation_Using_Incremental_Constructive_Feedforward_Networks_With_Random_Hidden_Nodes/links/00b4952f8672bc0621000000.pdf ). Such an approach is a valuable way to mitigate the well-known overfitting problems inherent to CNNs.
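A much simplified sketch of the incremental idea (loosely in the spirit of I-ELM, with made-up data and a toy target): hidden nodes with random weights are added one at a time, and each new node's output weight is fitted to the residual error of the current network:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]            # toy target function
residual = y.copy()
nodes, betas = [], []
for _ in range(50):                                # grow the network node by node
    a, b = rng.normal(size=2), rng.normal()        # random input weights and bias of the new node
    h = np.tanh(X @ a + b)                         # output of the new hidden node
    beta = (residual @ h) / (h @ h)                # output weight that best fits the residual
    residual -= beta * h                           # the new node absorbs part of the error
    nodes.append((a, b))
    betas.append(beta)
```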
Note: the extra qualifier Recurrent refers to a model whose outputs are fed back as inputs; such a model is similar to a temporal state model.
Article A Fast and Accurate Online Sequential Learning Algorithm for Feedforward Networks
Article Universal Approximation Using Incremental Constructive Feedforward Networks With Random Hidden Nodes
Article Learning deep representations via extreme learning machines