Saturday, June 21, 2014

Hadoop - Number of Map and Reduce Tasks

I will try to explain how the number of map and reduce tasks is calculated.

Tests were done with Hadoop 1.0.3 on an Ubuntu 12.04 machine.

1. Number of Map Tasks

When you submit a map-reduce job (or a Pig/Hive job), Hadoop first calculates the input splits; the size of each input split generally equals the HDFS block size. For example, for a file of 1GB size, there will be 16 input splits if the block size is 64MB. However, the split size can be configured to be less or more than the HDFS block size. The calculation of input splits is done by org.apache.hadoop.mapred.FileInputFormat. For each of these input splits, a map task is started.


First, let's investigate properties that govern the input split size:
mapred.min.split.size 
Default value: 1
Description: The minimum size chunk that map input should be split into. Note that some file formats may have minimum split sizes that take priority over this setting.

mapred.max.split.size 
Default value: This configuration cannot be set in Hadoop 1.0.3, it is calculated in code. However in later versions, its default value is Long.MAX_VALUE, that is 9223372036854775807.
Description: The largest valid size in bytes for a file split.

dfs.block.size
Default value: 64MB, that is 67108864 bytes
Description: The default block size for new files.

If you are using a newer Hadoop version, some of the above properties are deprecated; you can check the list of deprecated properties here.


Configure the properties:
You can set mapred.min.split.size in mapred-site.xml, for example to 128MB:
<property>
  <name>mapred.min.split.size</name>
  <value>134217728</value>
</property>

You can set dfs.block.size in hdfs-site.xml:
<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
</property>
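
The same properties can also be set per job from the driver code instead of the XML files. Here is a minimal sketch, assuming the old (mapred) API; the class name is just for illustration:

import org.apache.hadoop.mapred.JobConf;

public class SplitSizeConfig {
  public static JobConf configure() {
    JobConf conf = new JobConf();
    // Ask for at least 128MB per split; this overrides the value from mapred-site.xml.
    conf.setLong("mapred.min.split.size", 134217728L);
    return conf;
  }
}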

Split Size Calculation Examples:
Calculation of the input split size is done in FileInputFormat as:
Math.max(minSize, Math.min(maxSize, blockSize));
mapred.min.split.size | mapred.max.split.size    | dfs.block.size | Input Split Size | Input Splits (1GB file)
1 (default)           | Long.MAX_VALUE (default) | 64MB (default) | 64MB             | 16
1 (default)           | Long.MAX_VALUE (default) | 128MB          | 128MB            | 8
128MB                 | Long.MAX_VALUE (default) | 64MB           | 128MB            | 8
1 (default)           | 32MB                     | 64MB           | 32MB             | 32


Configuring an input split size larger than the block size decreases data locality and can degrade performance.
According to the table above, if the file size is 1GB, there will be 16, 8, 8 and 32 input splits, respectively.
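
The following sketch simply reproduces the formula above for the four rows of the table, so you can verify the resulting split sizes (the class name is just for illustration, values are in bytes):

import static java.lang.Math.max;
import static java.lang.Math.min;

public class SplitSizeExamples {
  // Same formula as quoted above.
  static long computeSplitSize(long minSize, long maxSize, long blockSize) {
    return max(minSize, min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long MB = 1024L * 1024L;
    System.out.println(computeSplitSize(1, Long.MAX_VALUE, 64 * MB));        // 64MB  -> 16 splits for 1GB
    System.out.println(computeSplitSize(1, Long.MAX_VALUE, 128 * MB));       // 128MB -> 8 splits
    System.out.println(computeSplitSize(128 * MB, Long.MAX_VALUE, 64 * MB)); // 128MB -> 8 splits
    System.out.println(computeSplitSize(1, 32 * MB, 64 * MB));               // 32MB  -> 32 splits
  }
}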

What if input files are too small?
FileInputFormat splits files that are larger than the split size. But what if our input files are very small? In that case, FileInputFormat creates one input split per file. For example, if you have 100 files of 10KB each and the input split size is 64MB, there will be 100 input splits. The total file size is only about 1MB, but we will have 100 input splits and 100 map tasks. This is known as the Hadoop small files problem. You can look at CombineFileInputFormat for a solution (see the sketch below). Apache Pig combines small input files into one map by default.
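
As a sketch of this approach: the concrete CombineTextInputFormat subclass is not part of Hadoop 1.0.3, so the example below assumes a newer Hadoop version with the new (mapreduce) API; the job name and split size are arbitrary.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesJob {
  public static Job create() throws Exception {
    Job job = Job.getInstance(new Configuration(), "small-files");
    // Pack many small files into each split, up to 64MB per split.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 67108864L);
    return job;
  }
}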

So the number of map tasks depends on:
- Total size of input
- Input split size
- Structure of input files (small files problem)

2. Number of Reducer Tasks

The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the JobConf, which is set by the setNumReduceTasks() method, and Hadoop simply creates this number of reduce tasks to run.
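
For example, a minimal sketch with the old (mapred) API, using an arbitrary reducer count:

import org.apache.hadoop.mapred.JobConf;

public class ReducerCountConfig {
  public static JobConf configure() {
    JobConf conf = new JobConf();
    // Equivalent to setting mapred.reduce.tasks=4; Hadoop will run exactly 4 reduce tasks.
    conf.setNumReduceTasks(4);
    return conf;
  }
}

If the driver uses ToolRunner, the same value can also be passed on the command line with -D mapred.reduce.tasks=4.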
