This is just a private memo... ;-)
Hadoop config files
First of all, you should build your own Hadoop AMI with the following config files.
/home/hadoop/hadoop/conf/core-site.xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://hdpmgmt01/</value>
  <final>true</final>
</property>
/home/hadoop/hadoop/conf/hdfs-site.xml
<property>
  <name>dfs.name.dir</name>
  <value>/disk01/hdfs/name</value>
  <final>true</final>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/disk01/hdfs/name_secondary</value>
  <final>true</final>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/disk01/hdfs/data</value>
  <final>true</final>
</property>
/home/hadoop/hadoop/conf/mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>hdpmgmt01:8021</value>
  <final>true</final>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/disk01/mapred/local</value>
  <final>true</final>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/tmp/hadoop/mapred/system</value>
  <final>true</final>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
  <final>true</final>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
  <final>true</final>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
  <final>true</final>
</property>
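Note that the snippets above are only the <property> elements. Each of these files must wrap its properties in a single top-level <configuration> element (with the XML declaration), or Hadoop will refuse to parse it. For example, the complete core-site.xml would look like:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hdpmgmt01/</value>
    <final>true</final>
  </property>
</configuration>
```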
Auto configuration script
These should also be stored in the Hadoop AMI. The ephemeral storage device is assumed to be /dev/sda2 (hard-coded in the script).
/root/config_nodes.sh
#!/bin/bash -x
# bash (not plain sh) is required: the script uses [[ ]] and "function".

function set_node {
    ip=$1
    name=$2

    ## Accept the host key on first contact
    expect -c "
        set timeout 10
        spawn scp /tmp/tmp_hosts$$ root@$ip:/etc/hosts
        expect \"*connecting (yes/no)?\"; send \"yes\n\"; expect eof
        spawn ssh root@$name date
        expect \"*connecting (yes/no)?\"; send \"yes\n\"; expect eof
    "

    ## Set up the node, unless /dev/sda2 is already mounted
    ssh -n root@$ip "df | grep sda2"
    rc=$?
    if [[ $rc -eq 1 ]]; then
        ssh -n root@$ip "echo HOSTNAME=$name >> /etc/sysconfig/network"
        ssh -n root@$ip hostname $name
        ssh -n root@$ip 'echo "/dev/sda2 /disk01 ext3 defaults 0 0" >> /etc/fstab'
        ssh -n root@$ip 'mke2fs -j /dev/sda2; mkdir -p /disk01; mount /disk01; chown hadoop.hadoop /disk01'
    fi
}

## Main
rm -f /root/.ssh/known_hosts
cat /dev/null > /home/hadoop/hadoop/conf/slaves
cat /dev/null > /home/hadoop/hadoop/conf/masters

## Build /etc/hosts plus the masters/slaves lists
echo "127.0.0.1 localhost.localdomain localhost" > /tmp/tmp_hosts$$
while read name ip; do
    if [[ "$name" = "" ]]; then    # skip blank lines
        continue
    fi
    echo "$ip $name" >> /tmp/tmp_hosts$$
    if [[ "$name" = "hdpmgmt01" ]]; then
        echo $name >> /home/hadoop/hadoop/conf/masters
    else
        echo $name >> /home/hadoop/hadoop/conf/slaves
    fi
done < /root/nodelist.txt

## Configure each node
while read name ip; do
    if [[ "$name" = "" ]]; then
        continue
    fi
    set_node $ip $name
done < /root/nodelist.txt

cp /root/.ssh/known_hosts /home/hadoop/.ssh/known_hosts
chown hadoop.hadoop /home/hadoop/.ssh/known_hosts
rm /tmp/tmp_hosts$$
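The expect block exists only to auto-accept host keys on first contact. A simpler alternative (a sketch, not tested here, and it weakens host-key verification against man-in-the-middle attacks) is to let ssh/scp accept unknown keys automatically:

```shell
# Sketch: replace the expect dance with an OpenSSH client option.
# StrictHostKeyChecking=no accepts unknown host keys without prompting.
SSH_OPTS="-o StrictHostKeyChecking=no"
ip=172.16.4.4    # example node; in the script this comes from nodelist.txt
scp $SSH_OPTS /tmp/tmp_hosts$$ root@$ip:/etc/hosts
ssh $SSH_OPTS -n root@$ip date
```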
/root/nodelist.txt
hdpmgmt01 172.16.4.3
hdpnode01 172.16.4.4
hdpnode02 172.16.4.5
hdpnode03 172.16.4.6
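config_nodes.sh reads this file with "while read name ip", so the format is simply one "hostname IP" pair per line, blank lines ignored. A quick local dry run of that parsing (using a hypothetical temp file, no ssh involved) looks like:

```shell
# Dry run of the nodelist parsing logic used in config_nodes.sh.
cat > /tmp/nodelist_test.txt <<'EOF'
hdpmgmt01 172.16.4.3
hdpnode01 172.16.4.4
EOF

while read name ip; do
  [ -z "$name" ] && continue    # skip blank lines, as the script does
  echo "host=$name addr=$ip"
done < /tmp/nodelist_test.txt
# prints:
#   host=hdpmgmt01 addr=172.16.4.3
#   host=hdpnode01 addr=172.16.4.4
```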
How to use the script
1. Start as many instances as you like from the Hadoop AMI. Choose any one of them as the namenode.
2. Copy the private key to the namenode instance so that root on the namenode can ssh to the other nodes as root.
[root@eucaman01]# scp -i ~/.euca/eucakey.private_key ~/.euca/eucakey.private_key 172.16.4.3:/root/.ssh/id_rsa
3. Login to the namenode and configure the cluster.
[root@eucaman01]# ssh -i ~/.euca/eucakey.private_key 172.16.4.3
[root@hdpmgmt01]# vi /root/nodelist.txt    <= Modify the IP addresses to match your instances.
[root@hdpmgmt01]# /root/config_nodes.sh
[root@hdpmgmt01]# su - hadoop
[hadoop@hdpmgmt01]$ hadoop namenode -format
[hadoop@hdpmgmt01]$ start-all.sh
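Once start-all.sh returns, it is worth checking that the daemons actually came up. These are the standard checks for Hadoop of this vintage (run on the cluster itself; exact output will vary):

```shell
# On the namenode: jps should list NameNode, SecondaryNameNode and JobTracker
# (plus DataNode/TaskTracker on the slave nodes).
jps

# HDFS health: number of live datanodes, configured capacity, etc.
hadoop dfsadmin -report

# Smoke test: round-trip a small file through HDFS.
echo hello > /tmp/hello.txt
hadoop fs -put /tmp/hello.txt /hello.txt
hadoop fs -cat /hello.txt
hadoop fs -rm /hello.txt
```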