Summary:
The authors propose a system for large-scale, real-time big data processing that performs parallel recovery in a distributed environment. The proposed system addresses features lacking in existing approaches, most importantly sub-second recovery from both faults and stragglers, by keeping state in an in-memory data structure (the RDD) that can be recomputed from its lineage, instead of relying on a replicated on-disk state recovery mode. The experimental results on a Spark-engine-based system show that it can process over 60 million records per second at sub-second latency on a 100-node cluster.
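To make the recovery idea concrete, here is a minimal, purely illustrative sketch (my own toy model, not the authors' code) of the micro-batch view: the stream is cut into short intervals, each interval's output is computed deterministically from its input, and any lost partition can be recomputed independently from that lineage, so many surviving nodes can rebuild lost state in parallel. The interval contents and the `process_interval` helper below are assumptions for illustration only.

```python
# Toy sketch (not the paper's implementation) of lineage-based
# parallel recovery over micro-batches.

def process_interval(records):
    # Deterministic per-interval computation (here: word counts).
    counts = {}
    for word in records:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Each interval's output is kept in memory, partitioned across workers.
intervals = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]
outputs = [process_interval(batch) for batch in intervals]

# On a node failure, only the partitions that node held are lost. Because
# every partition's lineage (its input interval plus a deterministic
# function) is known, each lost partition can be recomputed independently,
# and therefore in parallel, rather than restored from an on-disk replica.
lost = [1, 2]  # hypothetical lost interval indices
recovered = {i: process_interval(intervals[i]) for i in lost}
```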
Pros:
The main advantage of the proposed system, which the authors claim most other work does not offer, is that it addresses both faults and stragglers. Also, unlike centralized approaches, this work proposes a parallel recovery mechanism in a distributed environment, which enables large scalability. The other major advantage of the proposed system is its sub-second recovery latency for both faults and stragglers.
Cons:
I am not a distributed-systems person myself, but what I would like to point out concerns the scalability of Conviva’s Internet video streaming application, which I am somewhat familiar with. The authors claim that on 64 EC2 nodes the system can process more concurrent viewers than the peak load Conviva has experienced so far. It is worth mentioning that, on mobile devices alone, video accounts for more than 67% of all data traffic, and with increasing video resolution and quality this trend is growing rapidly. I would have liked to see more information on this particular video-streaming scalability, and some predictions extending the scaling trend shown in Figure 14(a).
Thoughts for further development:
One option for optimization could be something similar to what TCP does for RTT estimation: provide an approximation based on the priority of the real-time data. Given the data history, we could assign higher recovery priority to the latest, hot data and lower recovery priority to older intervals, seeking a memory-accuracy trade-off; a rough sketch of this idea follows below.
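To clarify the analogy: TCP smooths its RTT estimate with an exponentially weighted moving average, and a similar EWMA could score how "hot" each interval's state is and order recovery accordingly. This is only a sketch under my own assumptions (the `alpha` weight, the access-count signal, and the `recovery_order` helper are hypothetical, not from the paper).

```python
# Hypothetical sketch: EWMA "hotness" score per interval, in the spirit of
# TCP's smoothed RTT estimate (srtt = (1 - alpha) * srtt + alpha * sample).

ALPHA = 0.125  # the smoothing weight TCP traditionally uses for SRTT

def update_hotness(hotness, accesses, alpha=ALPHA):
    """Blend the previous hotness score with the latest access count."""
    return (1 - alpha) * hotness + alpha * accesses

def recovery_order(intervals):
    """Recover the hottest (most recently useful) intervals first.

    `intervals` maps interval_id -> current hotness score; older, colder
    state is recovered last (or, under memory pressure, dropped entirely,
    which is where the memory/accuracy trade-off comes in).
    """
    return sorted(intervals, key=intervals.get, reverse=True)

# Example: interval 3 was accessed heavily in the last window.
hot = {1: 0.2, 2: 0.5, 3: 0.9}
hot[3] = update_hotness(hot[3], accesses=4)
print(recovery_order(hot))  # -> [3, 2, 1]
```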
Also, regarding Figure 14(a): given today's steadily increasing video traffic, providing a prediction model for how the number of nodes in the cluster must scale to support more active sessions could be interesting.
Critiques/Questions:
As I mentioned in the cons section, should we assume a roughly linear model (ideally with a high R-squared value) to predict how many nodes the cluster needs in order to support a given number of active sessions? It might also be worth updating these results against today's much larger video traffic. A rough sketch of the kind of fit I have in mind is below.
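To make the question concrete, here is a minimal sketch: fit a line through (cluster size, supported active sessions) measurements, report R-squared, and extrapolate. The data points below are placeholders I made up purely for illustration, not numbers taken from the paper.

```python
# Hypothetical sketch: ordinary least-squares line through
# (cluster size, supported active sessions) measurements, plus R^2.
# The data points are illustrative placeholders, not the paper's results.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    # R^2: fraction of the variance in ys explained by the linear fit.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

nodes    = [8, 16, 32, 64]        # placeholder cluster sizes
sessions = [0.5, 1.1, 2.0, 3.9]   # placeholder millions of active sessions
slope, intercept, r2 = fit_line(nodes, sessions)
print(f"sessions ~= {slope:.3f} * nodes + {intercept:.3f} (R^2 = {r2:.3f})")
# Extrapolate: nodes needed for, say, 10M sessions under this linear model.
print((10 - intercept) / slope)
```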