Summary:
In this paper the authors propose Rhea, a system that automatically generates and executes storage-side filters for unstructured text data. It extracts both row filters (which select the relevant rows/lines in the input and drop the irrelevant ones) and column filters (which select the relevant columns in the surviving rows). It uses static analysis of application code to generate safe and stateless filters. In the evaluation, the results show that Rhea filters reduce job runtime by up to 5 times and dollar costs by up to 13 times!
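To make the row/column distinction concrete for myself, here is a minimal sketch in Python. It is not Rhea's generated code (Rhea derives its filters from the job's Java code). The job, the comma-separated record layout, and the names `row_filter`, `column_filter` and `storage_side_filter` are all my own assumptions: I assume a job that only reads the date (field 0) and the level (field 2) of records whose level is ERROR.

```python
# Hypothetical illustration of storage-side filtering; names and record
# layout are assumptions, not taken from the paper.
from typing import Iterable, Iterator, Sequence


def row_filter(line: str) -> bool:
    """Keep a line only if the assumed job could produce output for it."""
    return "ERROR" in line


def column_filter(line: str, used: Sequence[int] = (0, 2), sep: str = ",") -> str:
    """Blank out the fields the assumed job never reads.

    Separators are kept so the surviving fields stay at the same positions.
    """
    fields = line.split(sep)
    return sep.join(f if i in used else "" for i, f in enumerate(fields))


def storage_side_filter(lines: Iterable[str]) -> Iterator[str]:
    """Apply the row filter first, then the column filter to surviving rows."""
    for line in lines:
        if row_filter(line):
            yield column_filter(line)


log = [
    "2013-01-01,host1,INFO,started",
    "2013-01-02,host2,ERROR,disk full",
    "2013-01-03,host3,ERROR,timeout",
]
filtered = list(storage_side_filter(log))
```

In this sketch only the two ERROR records leave the storage side, and each one carries only the two fields the job reads, which is where the bandwidth saving comes from.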
Pros:
The main advantage of Rhea is that it reduces the bandwidth cost of transferring redundant data from storage to computation, while still keeping both unstructured storage and cloud storage.
In addition, it is a plus that the filters are conservative: they can have false positives (return true for records that do not affect the output), but they cannot have false negatives, so no record that the job needs is ever dropped.
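The false-positive point can be checked with a small Python sketch. Again, this is my own toy example under assumptions, not the paper's code: the `mapper`, the `conservative_filter` and the records are hypothetical. The filter uses a weaker test than the job itself, so it may let extra records through, but it never drops a record the job would use.

```python
# Hypothetical example: a filter that is weaker than the job's own test.
from typing import List


def mapper(line: str) -> List[str]:
    """Assumed job: emit the date of every record whose level field is ERROR."""
    fields = line.split(",")
    return [fields[0]] if fields[2] == "ERROR" else []


def conservative_filter(line: str) -> bool:
    """Substring test: true for every record the mapper uses, and some others."""
    return "ERROR" in line


def run(lines: List[str]) -> List[str]:
    """Run the assumed mapper over all lines and collect its output."""
    return [out for line in lines for out in mapper(line)]


records = [
    "2013-01-01,host1,INFO,started",
    "2013-01-02,host2,ERROR,disk full",
    "2013-01-03,host3,INFO,cleared ERROR flag",  # passes the filter, unused by the job
]

kept = [r for r in records if conservative_filter(r)]
false_positives = [r for r in kept if not mapper(r)]

# The job's output is the same with and without the filter.
assert run(kept) == run(records)
```

Here the third record is a false positive: it costs some extra bandwidth, but the job's output is identical with and without the filter.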
Cons:
Unfortunately, at this point Rhea supports only Map-Reduce and the Java language.
Also, the general overhead of the filters was not clear to me, for example in terms of CPU and energy usage.
Thought for further development:
One option, which the authors also mention themselves, is to generalize Rhea to support other formats such as binary formats and XML. Data processing tools and runtimes other than Hadoop and Java could also be considered.
Critiques/Questions:
As I said above, I would like to know which tools other than Map-Reduce could be leveraged.