RL algorithms require a long time to collect data points, which is not acceptable for online policy tasks (time complexity). Moreover, the number of Q-values grows exponentially with the number of state-space variables (space complexity).
Answering this question is one of the fundamental challenges of RL research, and there is no easy answer. Many people focus on value-function approximation, which tries to learn a compact representation of the Q-value function rather than a table of values. Essentially any function-approximation method from machine learning can be tried for this task, using some portion of the (s, a, r, s', a') tuples as the data you are trying to fit.
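As a rough sketch of what that looks like in the simplest case, here is a semi-gradient Q-learning update with a linear approximator. The feature map `phi`, the learning rate, and the discount factor are placeholders I'm assuming for illustration, not part of any particular method above; in practice the feature map might be tile coding, RBFs, or a neural network.

```python
import numpy as np

def phi(state, action):
    # Hypothetical feature map: here we just concatenate raw state and
    # action vectors; tile coding, RBFs, or a neural net are common choices.
    return np.concatenate([state, action])

def q(w, state, action):
    # Linear approximation: Q(s, a) is estimated as w . phi(s, a),
    # so the "table" of Q-values is replaced by one weight vector w.
    return w @ phi(state, action)

def semi_gradient_q_update(w, s, a, r, s_next, candidate_actions,
                           alpha=0.1, gamma=0.99):
    """One semi-gradient Q-learning step on a single (s, a, r, s') sample."""
    target = r + gamma * max(q(w, s_next, a2) for a2 in candidate_actions)
    td_error = target - q(w, s, a)
    # For a linear Q estimate, the gradient w.r.t. w is just phi(s, a).
    return w + alpha * td_error * phi(s, a)
```

Each observed (s, a, r, s') sample becomes one small regression-style update to the weights, so memory no longer scales with the number of distinct states and actions.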
Another approach is to use Direct Policy Search methods, where the policy is represented directly and values are not stored for all states and actions at all. Rather, the policy is improved directly using its gradient and the samples that have been encountered. This loses many of the convergence guarantees of value-function approaches but lets you work with larger problems.
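To make the "improve the policy using its gradient" idea concrete, here is a minimal REINFORCE-style sketch with a softmax policy over linear action preferences. The episode format, feature layout, and step sizes are assumptions for the example, not a prescription.

```python
import numpy as np

def softmax_policy(theta, features):
    # features: one row of features per available action in this state
    prefs = features @ theta
    prefs -= prefs.max()                 # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a finished episode.

    `episode` is a list of (features, action_index, reward) triples, where
    `features` holds one feature row per available action in that state.
    """
    G = 0.0
    grad = np.zeros_like(theta)
    # Walk the episode backwards, accumulating the discounted return G_t.
    for features, a, r in reversed(episode):
        G = r + gamma * G
        probs = softmax_policy(theta, features)
        # grad log pi(a|s) for a softmax-linear policy:
        # phi(s, a) - sum_b pi(b|s) * phi(s, b)
        grad_log_pi = features[a] - probs @ features
        grad += G * grad_log_pi
    return theta + alpha * grad
```

Only the policy parameters `theta` are stored, so nothing needs to be kept per state-action pair; the price is a noisier, weaker-guarantee learning signal than the value-function route.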
But there really is a vast literature on this with lots of different methods being explored.