めもめも

The content of this blog represents my personal views and does not necessarily represent the position, strategy, or opinions of the organization I belong to.

Auto configuration of Hadoop instances on Eucalyptus.

This is just a private memo... ;-)

Hadoop config files

First of all, you should build your own Hadoop AMI with the following config files.

/home/hadoop/hadoop/conf/core-site.xml

  <property>
    <name>fs.default.name</name>
        <value>hdfs://hdpmgmt01/</value>
    <final>true</final>
  </property>
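
Note that the snippets in this memo show only the <property> elements; in the actual files they sit inside the standard <configuration> root element. As a minimal sketch (same values as above), the complete core-site.xml could be written out like this:

cat > /home/hadoop/hadoop/conf/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hdpmgmt01/</value>
    <final>true</final>
  </property>
</configuration>
EOF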

/home/hadoop/hadoop/conf/hdfs-site.xml

  <property>
    <name>dfs.name.dir</name>
    <value>/disk01/hdfs/name</value>
    <final>true</final>
  </property>

  <property>
    <name>fs.checkpoint.dir</name>
    <value>/disk01/hdfs/name_secondary</value>
    <final>true</final>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/disk01/hdfs/data</value>
    <final>true</final>
  </property>

/home/hadoop/hadoop/conf/mapred-site.xml

  <property>
    <name>mapred.job.tracker</name>
    <value>hdpmgmt01:8021</value>
    <final>true</final>
  </property>

  <property>
    <name>mapred.local.dir</name>
    <value>/disk01/mapred/local</value>
    <final>true</final>
  </property>

  <property>
    <name>mapred.system.dir</name>
    <value>/tmp/hadoop/mapred/system</value>
    <final>true</final>
  </property>

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
    <final>true</final>
  </property>

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
    <final>true</final>
  </property>

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m</value>
    <final>true</final>
  </property>

Auto configuration script

These should also be stored in the Hadoop AMI. The ephemeral storage area is assumed to be /dev/sda2 (hard-coded in the script).
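
As a side note, each node could discover its ephemeral device from the EC2-compatible metadata service instead of relying on the hard-coded name. This is only a hedged sketch; whether your Eucalyptus release exposes block-device-mapping/ephemeral0 (and whether curl is installed in the image) are assumptions.

# Ask the metadata service which device backs the ephemeral storage
# (assumption: the EC2-compatible endpoint provides this mapping).
dev=$(curl -s http://169.254.169.254/latest/meta-data/block-device-mapping/ephemeral0)
# Normalize a bare name such as "sda2" to "/dev/sda2".
case "$dev" in
  /dev/*) ;;
  *) dev="/dev/$dev" ;;
esac
echo "Ephemeral device: $dev"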

/root/config_nodes.sh

#!/bin/bash -x
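# Reads /root/nodelist.txt, builds conf/masters and conf/slaves, pushes a
# generated /etc/hosts to every node, sets each node's hostname, and
# formats/mounts the ephemeral disk (/dev/sda2) as /disk01.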
function set_node {
  ip=$1
  name=$2
## Accept host key
  expect -c "
    set timeout 10
    spawn scp /tmp/tmp_hosts$$ root@$ip:/etc/hosts
    expect \"*connecting (yes/no)?\"; send \"yes\n\"; expect eof
    spawn ssh root@$name date
    expect \"*connecting (yes/no)?\"; send \"yes\n\"; expect eof
  "
## Set up the node only if the ephemeral disk is not already mounted
  ssh -n root@$ip "df | grep sda2"
  rc=$?
  if [[ $rc -eq 1 ]]; then
    ssh -n root@$ip "echo HOSTNAME=$name >> /etc/sysconfig/network"
    ssh -n root@$ip hostname $name
    ssh -n root@$ip 'echo "/dev/sda2  /disk01  ext3  defaults  0 0" >> /etc/fstab'
    ssh -n root@$ip 'mke2fs -j /dev/sda2; mkdir -p /disk01; mount /disk01; chown hadoop.hadoop /disk01'
  fi
}

## Main
rm -f /root/.ssh/known_hosts
cat /dev/null > /home/hadoop/hadoop/conf/slaves
cat /dev/null > /home/hadoop/hadoop/conf/masters
echo "127.0.0.1       localhost.localdomain localhost" >/tmp/tmp_hosts$$

while read name ip; do
  if [[ "$name" = "" ]]; then
    continue    # skip blank lines in nodelist.txt
  fi
  echo "$ip    $name" >> /tmp/tmp_hosts$$
  if [[ "$name" = "hdpmgmt01" ]]; then
    echo $name >> /home/hadoop/hadoop/conf/masters
  else
    echo $name >> /home/hadoop/hadoop/conf/slaves
  fi
done < /root/nodelist.txt

while read name ip; do
  if [[ "$name" = "" ]]; then
    continue
  fi
  set_node $ip $name
done < /root/nodelist.txt

cp /root/.ssh/known_hosts /home/hadoop/.ssh/known_hosts
chown hadoop.hadoop /home/hadoop/.ssh/known_hosts
rm /tmp/tmp_hosts$$

/root/nodelist.txt

hdpmgmt01  172.16.4.3
hdpnode01  172.16.4.4
hdpnode02  172.16.4.5
hdpnode03  172.16.4.6
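
For reference, with this nodelist the script generates roughly the following files (hdpmgmt01 goes into conf/masters, every other node into conf/slaves):

/tmp/tmp_hosts$$ (copied to /etc/hosts on every node):

127.0.0.1       localhost.localdomain localhost
172.16.4.3    hdpmgmt01
172.16.4.4    hdpnode01
172.16.4.5    hdpnode02
172.16.4.6    hdpnode03

/home/hadoop/hadoop/conf/masters:

hdpmgmt01

/home/hadoop/hadoop/conf/slaves:

hdpnode01
hdpnode02
hdpnode03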

How to use the script

1. Start as many instances as you like from the Hadoop AMI. Choose any one of them as the namenode.
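
For example, with euca2ools something like this should work (the EMI id, keypair name, and instance type below are placeholders):

[root@eucaman01]# euca-run-instances -k eucakey -t m1.small -n 4 emi-XXXXXXXX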

2. Copy the private key to the namenode instance so that root on the namenode can ssh to the other nodes as root.

[root@eucaman01]# scp -i ~/.euca/eucakey.private_key ~/.euca/eucakey.private_key 172.16.4.3:/root/.ssh/id_rsa
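
ssh refuses to use a private key that is group- or world-readable, so depending on how scp sets the mode it may be worth tightening the permissions on the copied key:

[root@eucaman01]# ssh -i ~/.euca/eucakey.private_key 172.16.4.3 chmod 600 /root/.ssh/id_rsa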

3. Log in to the namenode and configure the cluster.

[root@eucaman01]# ssh -i ~/.euca/eucakey.private_key 172.16.4.3
[root@hdpmgmt01]# vi /root/nodelist.txt   <= Modify IP addresses according to your instances.
[root@hdpmgmt01]# /root/config_nodes.sh
[root@hdpmgmt01]# su - hadoop
[hadoop@hdpmgmt01]$ hadoop namenode -format
[hadoop@hdpmgmt01]$ start-all.sh
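
Once start-all.sh returns, a quick sanity check could look like this (the name of the examples jar depends on your Hadoop version, so treat it as a placeholder):

[hadoop@hdpmgmt01]$ hadoop dfsadmin -report
[hadoop@hdpmgmt01]$ hadoop jar /home/hadoop/hadoop/hadoop-examples-*.jar pi 4 100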