Summary:
In this paper the authors propose Starfish, an optimizer tool for big-data analytics. It enables Hadoop users and applications to obtain good performance automatically throughout the data lifecycle of big-data analytics, without requiring them to understand and manipulate the many available tuning knobs. Starfish consists of three major components: a Profiler (which collects job profiles capturing the dataflow and cost statistics of MapReduce job execution), a What-if Engine (which combines simulation with model-based estimation at the phase level of MapReduce job execution, in order to predict a job's performance before it runs on the Hadoop cluster), and an Optimizer (which searches the space of configuration settings to find good ones for executing a MapReduce job, which can then be run with the recommended settings).
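To make the What-if Engine's role concrete, here is a minimal, self-contained Java sketch of the idea: scale the phase-level costs captured in a measured job profile to a hypothetical input size and configuration. This is not Starfish's actual API; the class names, the linear phase-cost model, and all numbers are my own illustrative assumptions.

```java
// A minimal sketch of the what-if idea (NOT Starfish's actual API):
// scale phase-level costs from a measured profile to a hypothetical
// configuration. All names and numbers here are hypothetical.
public class WhatIfSketch {

    // A toy "job profile": per-phase costs (ms per input MB) measured
    // from one instrumented run of the job.
    static class JobProfile {
        double mapMsPerMb;     // map phase cost
        double shuffleMsPerMb; // shuffle phase cost
        double reduceMsPerMb;  // reduce phase cost
        JobProfile(double m, double s, double r) {
            mapMsPerMb = m; shuffleMsPerMb = s; reduceMsPerMb = r;
        }
    }

    // Predict total running time for a hypothetical input size and degree
    // of parallelism, assuming phase costs scale linearly with data size
    // and inversely with the number of parallel task slots.
    static double predictMs(JobProfile p, double inputMb, int taskSlots) {
        double perSlotMb = inputMb / taskSlots;
        return perSlotMb * (p.mapMsPerMb + p.shuffleMsPerMb + p.reduceMsPerMb);
    }

    public static void main(String[] args) {
        JobProfile profile = new JobProfile(12.0, 5.0, 8.0); // illustrative
        // "What if we ran the same job on 4x the data with 16 slots?"
        System.out.printf("Predicted time: %.1f s%n",
                predictMs(profile, 4 * 1024.0, 16) / 1000.0);
    }
}
```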
Pros:
The main strength of Starfish's design is its focus on enabling Hadoop users and applications to get good performance automatically, rather than chasing only peak Hadoop performance or the query performance of parallel databases.
For me the other main advantage of the methodology is the relatively large space of configuration choices it considers, such as memory allocation to task-level buffers, multiphase external sorting in the tasks, and whether output data from tasks should be compressed.
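For concreteness, the snippet below sets a few knobs of this kind through Hadoop's JobConf API. The parameter names (io.sort.mb, io.sort.factor, mapred.compress.map.output) are real Hadoop 1.x-era settings of the sort Starfish searches over; the values are arbitrary examples, not recommendations from the paper.

```java
import org.apache.hadoop.mapred.JobConf;

// Illustration of the kind of tuning knobs Starfish optimizes, using real
// Hadoop (1.x-era) parameter names. The values are arbitrary examples.
public class KnobExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(KnobExample.class);

        // Memory allocated to the task-level map-output sort buffer (MB).
        conf.setInt("io.sort.mb", 200);

        // Number of streams merged at once during multiphase external sorting.
        conf.setInt("io.sort.factor", 25);

        // Whether map output data should be compressed before the shuffle.
        conf.setBoolean("mapred.compress.map.output", true);

        // Degree of reduce-side parallelism, another knob in the search space.
        conf.setNumReduceTasks(32);
    }
}
```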
The ability to predict the behavior of a hypothetical job execution before actually running it is also a major plus.
Cons:
The main drawback of the paper itself is the lack of an in-depth comparison with related work. I am not a distributed-systems expert, but beyond the fact that Starfish focuses simultaneously on different workload granularities, I would like to know how Starfish could be integrated with Nectar, Quincy, and MRShare, and what drawbacks those systems have.
As with any estimation-based approach, I am also concerned about the accuracy of the What-if Engine's predictions, since they depend on the quality of the job profile they estimate from.
Thoughts for further development:
One of the main directions for the authors to pursue is to extend the set of parameters considered for jobs, for example by accounting for resources such as energy, or by incorporating security.
I am also curious whether prioritizing jobs and/or resources could benefit the design.
Another direction is to provide a more in-depth analysis across a wider variety of job types, as opposed to just the two types used here, which would support more general conclusions.
Questions/Critiques:
Is it possible to incorporate energy and security into the set of parameters considered for jobs?