This depends entirely on what your jobs actually do. The usual rule of thumb is that the amount of data consumed by a mapper should be no more than 10 HDFS blocks, but at the same time each mapper should consume most of a block, which is typically 64 MB or 128 MB on most installations (it's configurable). Job start-up time can be a minute or two, so you want your tasks to run for at least 15-20 minutes to amortize that overhead. On the other hand, long-running tasks are at greater risk of failure (the longer a task runs, the higher the risk), and they also make downstream tasks sit around waiting for completion, which drives utilization down.
So based on these constraints you can come up with a model. But as I said, it totally depends on what you are doing in those tasks.
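As a rough illustration of turning those constraints into settings, here is a minimal sketch using the `org.apache.hadoop.mapreduce` API. The block size, split multiples, reducer count, and paths are all made-up example values, not recommendations; tune them against your own profiling.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-tuning-example");
        job.setJarByClass(SplitTuningExample.class);
        // (Mapper/Reducer classes omitted for brevity; the defaults pass records through.)

        // Let each mapper consume a few HDFS blocks instead of one, so a task
        // runs long enough to amortize the per-task start-up overhead.
        long blockSize = 128L * 1024 * 1024;                 // e.g. 128 MB blocks
        FileInputFormat.setMinInputSplitSize(job, 2 * blockSize);
        FileInputFormat.setMaxInputSplitSize(job, 4 * blockSize);

        // Few enough reducers to keep them busy, enough that one failure is cheap.
        job.setNumReduceTasks(10);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Larger splits mean fewer, longer tasks (less overhead, slower failure recovery); smaller splits mean the opposite, so the right trade-off comes out of the workload itself.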
In addition to Vlad's answer and the timings listed therein, you might also want to consider what you would see as an acceptable time for the whole process to execute (for example, a few hours? a few months?).
To get a useful estimate of that, you might want to profile your code to separate the time spent in unavoidably serial operations from the time spent in parallelizable ones. This matters because throwing more CPUs at a problem does not necessarily reduce the total computation time: the serial portion does not shrink. For more information, please see the attached link.
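To make that concrete, here is a small illustration of Amdahl's law, which formalizes this limit. The 10% serial fraction below is an assumed example value, standing in for whatever your profiling actually shows.

```java
public class AmdahlExample {
    // Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n),
    // where s is the serial fraction and n is the number of workers.
    static double speedup(double serialFraction, int workers) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / workers);
    }

    public static void main(String[] args) {
        double s = 0.10; // assume profiling showed ~10% of the work is serial
        for (int n : new int[] {1, 4, 16, 64, 256}) {
            System.out.printf("%4d workers -> %.2fx speedup%n", n, speedup(s, n));
        }
        // With s = 0.10 the speedup can never exceed 10x,
        // no matter how many CPUs the cluster has.
    }
}
```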
It depends on your MapReduce job and on your cluster. There is some research that aims to determine the optimal/best MapReduce parameters (number of Map/Reduce functions, chunk size, ...), but it is generally designed for a specific application. For example, this paper deals with MapReduce-based pattern mining approaches: