Summary:

In this paper, the authors propose Starfish, a self-tuning system for big-data analytics. It enables Hadoop workloads and applications to obtain good performance automatically throughout the data lifecycle, eliminating the need for users to understand and manipulate the many available tuning knobs. Starfish consists of three major components: a Profiler, which collects dataflow and cost statistics from MapReduce job executions; a What-if Engine, which combines simulation and model-based estimation at the phase level of MapReduce job execution to predict a job's performance before it is executed on the Hadoop cluster; and an Optimizer, which searches for near-optimal configuration settings for executing a MapReduce job and can also run the job with the recommended settings.
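To illustrate how these three components fit together, here is a minimal toy sketch in Java. All names and the cost model are my own simplifications for exposition; the paper's actual What-if Engine uses much richer phase-level models, and none of this is Starfish's real API.

```java
// Toy sketch of the Profiler -> What-if Engine -> Optimizer pipeline.
// Hypothetical names and a deliberately crude cost model; not Starfish code.
public class WhatIfSketch {
    // Profiler output: dataflow statistics and costs measured from one actual run.
    record JobProfile(long inputBytes, double mapSecPerByte, double reduceSecPerByte) {}

    // One configuration knob the Optimizer varies (the real system explores many).
    record Config(int numReduceTasks) {}

    // What-if Engine: model-based prediction of runtime under a configuration
    // that was never actually executed (assumes perfectly balanced reducers).
    static double predictRuntimeSec(JobProfile p, Config c) {
        double mapTime = p.inputBytes() * p.mapSecPerByte();
        double reduceTime = p.inputBytes() * p.reduceSecPerByte() / c.numReduceTasks();
        return mapTime + reduceTime;
    }

    public static void main(String[] args) {
        JobProfile profile = new JobProfile(1L << 30, 2e-8, 5e-8); // 1 GiB input
        // Optimizer: enumerate candidate settings, keep the best prediction.
        Config best = null;
        double bestTime = Double.MAX_VALUE;
        for (int r = 1; r <= 128; r *= 2) {
            double t = predictRuntimeSec(profile, new Config(r));
            if (t < bestTime) { bestTime = t; best = new Config(r); }
        }
        System.out.printf("best numReduceTasks=%d, predicted runtime=%.1fs%n",
                          best.numReduceTasks(), bestTime);
    }
}
```

In the real system the profile is gathered via dynamic instrumentation of unmodified MapReduce programs, and the search space covers many interacting knobs rather than a single parameter.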

Pros:

The main strength of Starfish's design is its focus on enabling Hadoop users and applications to get good performance automatically, rather than chasing only the peak performance of Hadoop or the query performance of parallel databases.

The other main advantage of the methodology is the relatively large space of configuration choices it considers, including memory allocation to task-level buffers, multiphase external sorting in the tasks, and whether output data from tasks should be compressed.
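To make this concrete, the sketch below sets a few of these knobs through Hadoop's standard JobConf API, using Hadoop 0.20-era parameter names (roughly contemporary with the paper). The specific values are arbitrary illustrations of the search space, not Starfish's recommendations.

```java
import org.apache.hadoop.mapred.JobConf;

public class TuningKnobsSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setInt("io.sort.mb", 200);                      // memory for the map-side sort buffer
        conf.setInt("io.sort.factor", 50);                   // streams merged per pass (multiphase external sort)
        conf.setFloat("io.sort.spill.percent", 0.85f);       // buffer fill fraction that triggers a spill
        conf.setBoolean("mapred.compress.map.output", true); // compress intermediate map output
        conf.setNumReduceTasks(40);                          // degree of reduce-side parallelism
    }
}
```

Because these knobs interact (e.g., a larger sort buffer changes how often spills and merge passes occur), searching this space by hand is tedious, which is exactly what motivates the Optimizer.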

The ability to predict the behavior of a hypothetical job execution before running it is also a major plus.

Cons:

The main drawback of the paper itself is the lack of an in-depth comparison with related work. I am not a distributed-systems specialist, but beyond the observation that Starfish focuses simultaneously on different workload granularities, I would like to know how Starfish could be integrated with systems such as Nectar, Quincy, and MRShare, and what drawbacks those systems have.

As with any model-based approach, I am also concerned about the accuracy of the What-if Engine's predictions, since they rest on estimates derived from the job profile.

Thoughts for further development:

One promising direction for the authors is to broaden the set of job parameters considered, for example by accounting for resources such as energy consumption or by embedding security requirements.

I am also curious whether prioritizing jobs and/or resources could benefit the design.

Another direction is a more in-depth analysis across a wider variety of job types, rather than just the two evaluated here, which would support more general conclusions.

Questions/Critiques:

Is it possible to integrate energy and security into the set of job parameters?
