27 August 2013

Hadoop HDFS Balancer Explained

The HDFS block allocation strategy tries hard to spread new blocks evenly among all the datanodes. The rationale behind that behavior is to keep nodes recently added to the cluster from becoming a bottleneck, which is what would happen if all the new blocks were allocated on, and read from, those datanodes.
The drawback of this allocation strategy is that your cluster constantly needs to be rebalanced, even more so once multiple hardware generations live side by side: the bigger your boxes are, the more slowly their disk usage grows in proportion, compared to older (and smaller) generations.
Look at the following graph: we can easily distinguish 3 generations of hardware (dn20-dn60, dn61-dn70 and dn71-dn80).

Figure: Unbalanced cluster

So the Hadoop HDFS Balancer needs to be run on a regular basis.

HDFS Balancer


Help entry from the command line:

$ hdfs balancer -h
Usage: java Balancer
        [-policy <policy>]       the balancing policy: datanode or blockpool (default datanode)
        [-threshold <threshold>] Percentage of disk capacity (default 10)


The threshold parameter is a floating-point number between 0 and 100 (12.5 for instance). Starting from the average cluster utilization (about 50% in the graph below), the balancer process tries to bring every datanode's usage into the range [average - threshold, average + threshold]. In the current example:
- Higher (average + threshold): 60% if run with the default threshold (10%)
- Lower (average - threshold): 40%
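The arithmetic behind those two bounds can be sketched in a few lines of Python. This is only an illustration of the formula above, not the balancer's code: the function name `balancer_band` is made up for this post, and clamping the result to [0, 100] is an assumption of mine.

```python
def balancer_band(average_pct, threshold_pct=10.0):
    """Return the (lower, higher) usage bounds the balancer converges to.

    Hypothetical helper: it only mirrors the formula
    [average - threshold, average + threshold] described above.
    """
    if not 0 <= threshold_pct <= 100:
        raise ValueError("threshold must be between 0 and 100")
    # Assumption: a usage percentage cannot leave the 0..100 range.
    lower = max(0.0, average_pct - threshold_pct)
    higher = min(100.0, average_pct + threshold_pct)
    return lower, higher


print(balancer_band(50.0))        # (40.0, 60.0), the default threshold
print(balancer_band(50.0, 12.5))  # (37.5, 62.5)
```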

As you can see, the smaller your threshold, the more balanced your datanodes will be. With a very small threshold, however, the cluster may never reach the balanced state if other clients concurrently write and delete data in the cluster.


The balancer picks datanodes whose current usage is above the higher threshold (the sources, classified as over-utilized) and tries to find blocks on them that can be copied to nodes whose current usage is below the lower threshold (the destinations, classified as under-utilized). A second selection round pairs over-utilized nodes with nodes whose utilization is below average. A third pass pairs nodes whose utilization is above average with under-utilized nodes.
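To make the three passes concrete, here is a rough Python sketch of the classification and of the order in which sources and destinations are paired. It is a simplification under stated assumptions: the function names, the per-node usage figures and the treatment of a node sitting exactly on a boundary are all mine, and the real balancer also accounts for how many bytes each node can give or take.

```python
# Order of the three selection passes described above: (source class, destination class).
PASSES = [
    ("over-utilized", "under-utilized"),
    ("over-utilized", "below-average"),
    ("above-average", "under-utilized"),
]


def classify(usage_pct, average_pct, threshold_pct=10.0):
    """Put a datanode in one of four classes based on its disk usage."""
    if usage_pct > average_pct + threshold_pct:
        return "over-utilized"
    if usage_pct > average_pct:
        return "above-average"
    # Assumption: a node exactly at the average or on the lower bound counts as below-average.
    if usage_pct >= average_pct - threshold_pct:
        return "below-average"
    return "under-utilized"


def candidate_pairs(usage_by_node, average_pct, threshold_pct=10.0):
    """List (source, destination) candidates, pass by pass."""
    groups = {}
    for node, usage in usage_by_node.items():
        groups.setdefault(classify(usage, average_pct, threshold_pct), []).append(node)
    pairs = []
    for source_class, destination_class in PASSES:
        for source in groups.get(source_class, []):
            for destination in groups.get(destination_class, []):
                pairs.append((source, destination))
    return pairs


# Hypothetical usage figures (in percent) for four datanodes, cluster average 50%.
usage = {"dn20": 75.0, "dn61": 55.0, "dn62": 45.0, "dn71": 30.0}
print(candidate_pairs(usage, average_pct=50.0))
# [('dn20', 'dn71'), ('dn20', 'dn62'), ('dn61', 'dn71')]
```

The average is passed in rather than computed from the per-node percentages because, with mixed hardware generations, the cluster-wide utilization is not simply the mean of those percentages.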




In addition to that selection process, when the source and the destination are not located in the same rack, the balancer can also pick a proxy node, i.e. a datanode that stores a replica of the block and is located in the same rack as the destination. Yes, the balancer is rack-aware and will generate very little rack-to-rack noise.
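The proxy choice can be pictured with the small Python sketch below. It is an illustration of the idea only: `pick_proxy`, the node names and the rack names are hypothetical, and falling back to the source when no replica sits in the destination's rack is my assumption.

```python
def pick_proxy(source, destination, replica_holders, rack_of):
    """Return the datanode the block would be read from.

    replica_holders: datanodes storing a replica of the block (source included).
    rack_of: mapping from datanode name to rack name.
    """
    if rack_of[source] == rack_of[destination]:
        return source  # same rack already, no proxy needed
    for node in replica_holders:
        if rack_of[node] == rack_of[destination]:
            return node  # replica in the destination's rack: no cross-rack copy
    # Assumption: without a suitable replica, the block is read from the source itself.
    return source


# Hypothetical topology: the block lives on dn20 (rack1) and dn45 (rack2).
racks = {"dn20": "/rack1", "dn45": "/rack2", "dn71": "/rack2"}
print(pick_proxy("dn20", "dn71", ["dn20", "dn45"], racks))  # dn45
```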

Iterative steps in the balancer

The balancer process is iterative. As HDFS has a lot of moving state, the balancer tries at each iteration to move 10 GB * number of selected sources. Each iteration lasts no more than 20 minutes.
The progress of the runs is reported to stdout:

$ hdfs balancer -threshold 25
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
Aug 26, 2013 9:29:16 AM           0                 0 KB             5.82 TB              50 GB
Aug 26, 2013 9:46:40 AM           1             49.88 GB             5.76 TB              50 GB
Aug 26, 2013 10:04:27 AM          2             99.67 GB             5.72 TB              50 GB
Aug 26, 2013 10:22:36 AM          3            149.62 GB             5.69 TB              50 GB
...

Aug 27, 2013 3:24:54 AM          61              2.07 TB             4.08 TB              30 GB
Aug 27, 2013 3:42:32 AM          62               2.1 TB             4.05 TB              30 GB
Aug 27, 2013 4:00:19 AM          63              2.13 TB             4.02 TB              30 GB
Aug 27, 2013 4:18:15 AM          64              2.16 TB             3.95 TB              30 GB
...
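From these figures you can get a rough idea of how long a run will take. The Python sketch below is a back-of-the-envelope estimate, not something the balancer prints. It assumes the sizes are in binary units, that every iteration moves its full budget, and that 5 sources were selected (inferred from 50 GB = 10 GB * 5 in the first rows). In practice "Bytes Left To Move" keeps changing while clients write and delete data, so treat the result as an order of magnitude.

```python
import math

GB = 1024 ** 3  # assumption: the balancer reports sizes in binary units
TB = 1024 ** 4


def iteration_budget(selected_sources, per_source_bytes=10 * GB):
    """Bytes the balancer tries to move in one iteration."""
    return selected_sources * per_source_bytes


def estimate_iterations(bytes_left, bytes_per_iteration):
    """Iterations needed if every iteration moves its full budget."""
    return math.ceil(bytes_left / bytes_per_iteration)


def worst_case_hours(iterations, minutes_per_iteration=20):
    """Upper bound on the run time, at 20 minutes per iteration."""
    return iterations * minutes_per_iteration / 60


budget = iteration_budget(5)  # 50 GB, as in the first rows of the output
iterations = estimate_iterations(5.82 * TB, budget)
print(iterations, worst_case_hours(iterations))  # 120 40.0
```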

Iterative steps as administrator


The same iterative approach applies to the Hadoop administrator: if you haven't balanced your cluster for a long time, you should start by balancing with a higher threshold (like 25), and then converge to a smaller target threshold (like 10).
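One way to script that staged approach is sketched below in Python. Only the `hdfs balancer -threshold` command comes from this post; the function names and the dry-run switch are mine, and stopping at the first non-zero exit status is an assumption about how you would want failures handled.

```python
import subprocess


def staged_balancer_commands(thresholds=(25, 10)):
    """Build one 'hdfs balancer' command per threshold, widest first."""
    return [["hdfs", "balancer", "-threshold", str(t)] for t in thresholds]


def run_staged(thresholds=(25, 10), dry_run=True):
    """Run the balancer once per threshold, or only print the commands."""
    for command in staged_balancer_commands(thresholds):
        print(" ".join(command))
        if not dry_run:
            # Assumption: a non-zero exit status means the stage did not
            # complete, so the next (stricter) stage is not attempted.
            subprocess.run(command, check=True)


run_staged(dry_run=True)
# hdfs balancer -threshold 25
# hdfs balancer -threshold 10
```

Keeping `dry_run=True` by default means nothing touches the cluster until you have checked the commands.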


Remember: the balancer needs to be run regularly to keep HDFS performance consistent.

Special note for HBase

HBase and the HDFS balancer are not best friends at all. It is recommended to run HBase on a separate HDFS cluster and not to rebalance it.

3 comments:

  1. HDFS balancer explained in a very simple way .. thanks!!!

  2. Thanks for the very good explanation. I have a question:
    If I delete/insert data in HDFS while the balancer is running, will it affect anything else?

    1. Nope, the balancer can run concurrently with any other insert/delete.
