Hadoop Notes

Listing jobs

$ hadoop job -list
1 jobs currently running
JobId   State   StartTime       UserName        Priority        SchedulingInfo
job_201306180422_0004   1       1371548411004   sheeps          NORMAL  NA

Killing a job

$ hadoop job -kill job_201306180422_0004
Killed job job_201306180422_0004

Checking safe mode

$ hadoop dfsadmin -safemode get
Safe mode is ON

Leaving safe mode

$ hadoop dfsadmin -safemode leave
Safe mode is OFF

Entering safe mode

$ hadoop dfsadmin -safemode enter
Safe mode is ON

Notes on Hadoop configuration

Name resolution error

INFO org.apache.hadoop.metrics.MetricsUtil: Unable to obtain hostName
java.net.UnknownHostException: ip-192-168-11-16 : ip-192-168-11-16
        at java.net.InetAddress.getLocalHost(InetAddress.java:1374)
        at org.apache.hadoop.metrics.MetricsUtil.getHostName(MetricsUtil.java:91)
        at org.apache.hadoop.metrics.MetricsUtil.createRecord(MetricsUtil.java:80)
        at org.apache.hadoop.security.UserGroupInformation$UgiMetrics.(UserGroupInformation.java:102)
        at org.apache.hadoop.security.UserGroupInformation.(UserGroupInformation.java:208)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:1765)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:1758)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1626)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
        at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:225)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1668)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1623)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1641)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1767)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1784)

A name-resolution error came up. In some cases setting only /etc/hosts was enough,
but things worked reliably once the hostname was set as well.
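
The call that fails here is InetAddress.getLocalHost(), so name resolution can be checked with a tiny Java class before touching Hadoop; a minimal sketch (HostCheck is just an illustrative name):

[java]
import java.net.InetAddress;

/**
 * Prints the resolved local host. Throws UnknownHostException,
 * the same failure shown in the DataNode log above, when the
 * local hostname cannot be resolved.
 */
public class HostCheck {
    public static void main(String[] args) throws Exception {
        InetAddress addr = InetAddress.getLocalHost();
        System.out.println(addr.getHostName() + " -> " + addr.getHostAddress());
    }
}
[/java]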

Setting the hostname

sudo vi /etc/hostname
slaves001.sheeps.me

Applying the hostname

sudo hostname -b -F /etc/hostname

Setting /etc/hosts

sudo vi /etc/hosts
192.168.11.14   masters000.sheeps.me    masters000
192.168.11.15   slaves000.sheeps.me     slaves000
192.168.11.16   slaves001.sheeps.me     slaves001
192.168.11.17   slaves002.sheeps.me     slaves002

Restarting the services

sudo service hadoop-0.20-datanode restart
sudo service hadoop-0.20-tasktracker restart

Checking with tcpdump

sudo tcpdump -s 1600 -X -i eth0 src port 8020
sudo tcpdump -s 1600 -X -i eth0 dst port 8020
sudo tcpdump -s 1600 -X -i eth0 src port 8021
sudo tcpdump -s 1600 -X -i eth0 dst port 8021

Run from masters000.sheeps.me

sudo tcpdump -s 1600 -X -i eth0 src host slaves000
sudo tcpdump -s 1600 -X -i eth0 src host slaves001
sudo tcpdump -s 1600 -X -i eth0 src host slaves002

Java development tools

Installing Maven

sudo aptitude -y install maven2

Setting environment variables

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export CLASSPATH=".:/usr/lib/jvm/java-6-sun/lib" 

Installing libraries into the local Maven repository

Hadoop-related

export HADOOP_HOME=/usr/lib/hadoop-0.20
export HBASE_HOME=/usr/lib/hbase


mvn install:install-file -DgroupId=org.apache.hadoop -DartifactId=hadoop-core -Dversion=1.2.1 -Dpackaging=jar -Dfile=${HADOOP_HOME}/hadoop-core.jar
mvn install:install-file -DgroupId=org.apache.zookeeper -DartifactId=zookeeper -Dversion=3.4.2 -Dpackaging=jar -Dfile=${HBASE_HOME}/lib/zookeeper.jar
mvn install:install-file -DgroupId=org.apache.hadoop -DartifactId=hbase -Dversion=0.90.6 -Dpackaging=jar -Dfile=${HBASE_HOME}/hbase.jar

Sun libraries

Download the JMS 1.1 "API Documentation, Jar and Source" and Java Management Extension (JMX) 1.2.1, then install the jars:

mvn install:install-file -DgroupId=javax.jms -DartifactId=jms -Dversion=1.1 -Dpackaging=jar -Dfile=/usr/lib/jvm/java-6-sun/lib/jms.jar
mvn install:install-file -DgroupId=com.sun.jmx -DartifactId=jmxri -Dversion=1.2.1 -Dpackaging=jar -Dfile=/usr/lib/jvm/java-6-sun/lib/jmxri.jar
mvn install:install-file -DgroupId=com.sun.jdmk -DartifactId=jmxtools -Dversion=1.2.1 -Dpackaging=jar -Dfile=/usr/lib/jvm/java-6-sun/lib/jmxtools.jar

Maven repository

Developing with Maven

Creating a project

mkdir projects
cd projects

mvn archetype:create -DgroupId=me.sheeps.hdfs -DartifactId=sample
cd sample

Generated files

./sample
    src/
        main/
            java/
                me/
                    sheeps/
                        hdfs/
                            App.java
        test/
            java/
                me/
                    sheeps/
                        hdfs/
                            AppTest.java
    pom.xml

Add the libraries to be used to pom.xml

vi pom.xml

[xml]
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>me.sheeps.hdfs</groupId>
  <artifactId>sample</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>sample</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
          <encoding>UTF-8</encoding>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>me.sheeps.hdfs.App</mainClass>
              <packageName>me.sheeps.hdfs</packageName>
              <addClasspath>true</addClasspath>
              <addExtensions>true</addExtensions>
              <classpathPrefix>lib</classpathPrefix>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>com.sun.jmx</groupId>
      <artifactId>jmxri</artifactId>
      <version>1.2.1</version>
    </dependency>

    <!-- version must match the one used in mvn install:install-file above -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.2.1</version>
    </dependency>

    <dependency>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
      <version>3.4.2</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hbase</artifactId>
      <version>0.90.6</version>
    </dependency>

    <dependency>
      <groupId>commons-cli</groupId>
      <artifactId>commons-cli</artifactId>
      <version>1.1</version>
    </dependency>
  </dependencies>
</project>
[/xml]

HBase sample

[java]
package me.sheeps.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.BinaryPrefixComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.DependentColumnFilter;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * App class
 *
 * @package Sample
 * @author Yujiro Takahashi <yujiro3@gmail.com>
 */
public class App {
    /**
     * Main entry point
     *
     * @access public
     * @param String[] args
     * @return void
     */
    public static void main(String[] args) throws Exception {
        // Load the configuration
        Configuration conf = HBaseConfiguration.create();
        conf.addResource("/etc/hbase/conf/hbase-site.xml");
        conf.set("hbase.client.scanner.caching", "3");

        // Parse the command-line arguments
        new GenericOptionsParser(conf, args);

        HTable table = new HTable(conf, Bytes.toBytes("accesslog")); // target table

        // Build the scan conditions
        Scan scan = new Scan();
        Filter filter = new DependentColumnFilter(
            Bytes.toBytes("user"), // column family
            Bytes.toBytes("id"),   // qualifier
            false,
            CompareOp.EQUAL,
            new BinaryPrefixComparator(Bytes.toBytes(args[0]))
        );
        scan.setFilter(filter);
        ResultScanner scanner = table.getScanner(scan);

        System.out.println("/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/");

        for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
            String row = Bytes.toString(rr.getValue(Bytes.toBytes(args[1]), Bytes.toBytes(args[2])));
            System.out.println(row);
        }

        scanner.close();
        table.close();
    }
}
[/java]

Compiling

mvn compile

Packaging

mvn clean package

Running

hadoop jar ./target/sample-1.0-SNAPSHOT.jar 1258878 log timestamp

Useful when distributing script files for Hadoop Streaming, for example.

lsyncd

Installing lsyncd

sudo aptitude -y install lsyncd

lsyncd configuration file

sudo vi /etc/lsyncd/lsyncd.conf.lua
----
-- Streaming configuration file for lsyncd.
--
settings = {
    statusFile = "/var/run/lsyncd.stat",
    statusInterval = 30,
}

sync { 
    default.rsync, 
    source="/home/mapred/",
    target="slaves000:/home/mapred/",
    rsyncOps={"-aruz", "--delete"}, 
    delay=10 
}
sync { 
    default.rsync, 
    source="/home/mapred/",
    target="slaves001:/home/mapred/",
    rsyncOps={"-aruz", "--delete"}, 
    delay=10 
}
sync { 
    default.rsync, 
    source="/home/mapred/",
    target="slaves002:/home/mapred/",
    rsyncOps={"-aruz", "--delete"}, 
    delay=10 
}

Starting the lsyncd daemon

sudo /etc/init.d/lsyncd start

rsync

rsyncd configuration file

sudo vi /etc/rsyncd.conf
# GLOBAL OPTIONS

# pid file = /var/run/rsync.pid
# log file = /var/log/rsync.log

timeout = 600
hosts allow = *.sheeps.me
read only = yes

max connections = 2
dont compress = *.gz *.tgz *.zip *.z *.rpm *.deb *.iso *.bz2 *.tbz

[MapReduce]
comment = PHP for Hadoop streaming
path = /home/mapred
uid = mapred
gid = mapred

rsync defaults file

sudo vi /etc/default/rsync
# start rsync in daemon mode from init.d script?
#  only allowed values are "true", "false", and "inetd"
#  Use "inetd" if you want to start the rsyncd from inetd,
#  all this does is prevent the init.d script from printing a message
#  about not starting rsyncd (you still need to modify inetd's config yourself).
RSYNC_ENABLE=true

Starting the rsync daemon

sudo /etc/init.d/rsync start

lsyncd configuration over SSH

sudo vi /etc/lsyncd/lsyncd.conf.lua
----
-- Streaming configuration file for lsyncd.
--
settings = {
    statusFile = "/var/run/lsyncd.stat",
    statusInterval = 30,
}

sync {
    default.rsyncssh,
    source="/home/mapred/",
    host="hdfs@slaves000",
    targetdir="/home/mapred/",
    rsyncOps={"-aruz", "--delete"}, 
    delay=10
}

sync {
    default.rsyncssh,
    source="/home/mapred/",
    host="hdfs@slaves001",
    targetdir="/home/mapred/",
    rsyncOps={"-aruz", "--delete"}, 
    delay=10
}

sync {
    default.rsyncssh,
    source="/home/mapred/",
    host="hdfs@slaves002",
    targetdir="/home/mapred/",
    rsyncOps={"-aruz", "--delete"}, 
    delay=10
}

SSH configuration for root

sudo vi /root/.ssh/config
Host slaves000
    HostName            slaves000.sheeps.me
    IdentityFile        /root/.ssh/id_rsa
    User                hdfs

Host slaves001
    HostName            slaves001.sheeps.me
    IdentityFile        /root/.ssh/id_rsa
    User                hdfs

Host slaves002
    HostName            slaves002.sheeps.me
    IdentityFile        /root/.ssh/id_rsa
    User                hdfs

sudo cp $HADOOP_HOME/.ssh/id_rsa /root/.ssh/id_rsa
sudo chmod 0600 /root/.ssh/id_rsa

I could not figure out how to change the user that lsyncd runs rsync as, so it is left running as the default root user.

An error occurred when the destination host was not yet registered in /root/.ssh/known_hosts. It seems you need to connect once beforehand so that the host key gets registered.
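
For example, connecting once as root through the Host entries above accepts and stores the host keys:

sudo ssh slaves000 exit
sudo ssh slaves001 exit
sudo ssh slaves002 exit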

Running MapReduce with PHP

Placing the sample data

Creating the sample data

echo Hello World Bye World > file01
echo Hello Hadoop Goodbye Hadoop > file02

ls
file01  file02

Create an input directory on HDFS

sudo -u hdfs hadoop fs -mkdir /user/hdfs/input

Put the sample data on HDFS

sudo -u hdfs hadoop fs -put file01 /user/hdfs/input/file01
sudo -u hdfs hadoop fs -put file02 /user/hdfs/input/file02

sudo -u hdfs hadoop fs -cat /user/hdfs/input/file01
Hello World Bye World

sudo -u hdfs hadoop fs -cat /user/hdfs/input/file02
Hello Hadoop Goodbye Hadoop

Creating the map step

vi map.php

[php]
<?php
while (($row = fgetcsv(STDIN, 1024, " ")) !== FALSE) {
    foreach ($row as $word) {
        if ($word !== '') {
            echo "${word}\t1\n";
        }
    }
}
?>
[/php]

Local test of map.php

cat file01 file02 | php ./map.php

Hello   1
World   1
Bye     1
World   1
Hello   1
Hadoop  1
Goodbye 1
Hadoop  1

Key/value pairs are output. The value is the occurrence count of the word, fixed at 1.

Output in the same state as after the map phase (sorted)

cat file01 file02 | php ./map.php | sort

Bye     1
Goodbye 1
Hadoop  1
Hadoop  1
Hello   1
Hello   1
World   1
World   1

Between map and reduce the data is sorted by key, so the output is piped through the sort command.

Creating the reduce step

vi reduce.php

[php]
<?php
$count = array();
while ((list($key, $value) = fgetcsv(STDIN, 1024, "\t")) !== FALSE) {
    $count[$key] = empty($count[$key]) ? 1 : $count[$key] + 1;
}

foreach ($count as $key => $value) {
    echo "${key}\t${value}\n";
}
?>
[/php]

Local test of reduce.php

cat file01 file02 | php ./map.php | sort | php ./reduce.php

Bye     1
Goodbye 1
Hadoop  2
Hello   2
World   2

The key/value pairs are accumulated into an array and counted.

Running Hadoop Streaming

Distributing the files

scp -r /home/mapred hdfs@slaves000:/home/
scp -r /home/mapred hdfs@slaves001:/home/
scp -r /home/mapred hdfs@slaves002:/home/

Running MapReduce with the streaming module

sudo su hdfs

/usr/lib/hadoop-0.20/bin/hadoop \
  jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u5.jar \
  -input /user/hdfs/input \
  -output /user/hdfs/output \
  -mapper '/usr/bin/php /home/mapred/map.php' \
  -reducer '/usr/bin/php /home/mapred/reduce.php'

If /user/hdfs/output already exists, the job fails with an error.
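
In that case, remove the previous output directory first, for example:

sudo -u hdfs hadoop fs -rmr /user/hdfs/output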

Checking the results

sudo -u hdfs hadoop fs -ls /user/hdfs/output
Found 3 items
-rw-r--r--   1 hdfs supergroup          0 2012-12-02 04:23 /user/hdfs/output/_SUCCESS
drwxr-xr-x   - hdfs supergroup          0 2012-12-02 04:24 /user/hdfs/output/_logs
-rw-r--r--   1 hdfs supergroup         41 2012-12-02 04:25 /user/hdfs/output/part-00000

sudo -u hdfs hadoop fs -cat /user/hdfs/output/part-00000

Bye     1
Goodbye 1
Hadoop  2
Hello   2
World   2

The results are stored under /user/hdfs/output/ on HDFS.
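
The result file can also be read directly from Java with the FileSystem API; a minimal sketch (ReadOutput is not part of the sample project above, and it assumes core-site.xml with fs.default.name is on the classpath):

[java]
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Reads a Hadoop Streaming result file directly from HDFS.
 */
public class ReadOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path part = new Path("/user/hdfs/output/part-00000");

        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(part), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // e.g. "Hadoop\t2"
        }
        reader.close();
        fs.close();
    }
}
[/java]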

One HDFS master and three HDFS slaves: slave setup

Installing DataNode and TaskTracker

sudo aptitude -y install hadoop-0.20 hadoop-0.20-datanode hadoop-0.20-tasktracker

Synchronizing configuration files

Registering the SSH public key on all slaves

ssh root@slaves000 mkdir /usr/lib/hadoop-0.20/.ssh
scp /usr/lib/hadoop-0.20/.ssh/authorized_keys root@slaves000:/usr/lib/hadoop-0.20/.ssh/
ssh root@slaves000 chown -R hdfs:hdfs /usr/lib/hadoop-0.20/.ssh/
ssh root@slaves000 chmod 0600 /usr/lib/hadoop-0.20/.ssh/authorized_keys

Distributing the configuration files

rsync -av /etc/hadoop-0.20/conf/ hdfs@slaves000:/etc/hadoop-0.20/conf/
rsync -av /etc/hadoop-0.20/conf/ hdfs@slaves001:/etc/hadoop-0.20/conf/
rsync -av /etc/hadoop-0.20/conf/ hdfs@slaves002:/etc/hadoop-0.20/conf/

The destination must be set up in advance so that it can be overwritten with hdfs privileges.
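
For example, ownership of the conf directory can be handed to the hdfs user on each slave (shown for one host; adjust to your layout):

ssh root@slaves000 chown -R hdfs:hdfs /etc/hadoop-0.20/conf/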

Everything under /usr/lib/hadoop-0.20/conf/ uses the same settings as the master.

Editing the configuration files

Setting /etc/hosts

sudo vi /etc/hosts
192.168.196.125   masters000.sheeps.me    masters000
192.168.196.126   slaves000.sheeps.me     slaves000
192.168.196.127   slaves001.sheeps.me     slaves001
192.168.196.128   slaves002.sheeps.me     slaves002

Initialization

Setting up the cache directory

sudo mkdir -p /var/lib/hadoop-0.20/cache
sudo chown -R hdfs:hadoop /var/lib/hadoop-0.20

sudo chmod 0777 /var/lib/hadoop-0.20/cache

Registering the public key

sudo su hdfs
cd
mkdir ./.ssh
echo ssh-rsa ************** >> ./.ssh/authorized_keys
chmod 0600 ./.ssh/authorized_keys

Starting the services

Starting DataNode and TaskTracker

sudo service hadoop-0.20-datanode start
sudo service hadoop-0.20-tasktracker start

On to installing the HDFS master

One HDFS master and three HDFS slaves: master setup

Installing NameNode and JobTracker

sudo aptitude -y install hadoop-0.20 hadoop-0.20-namenode hadoop-0.20-jobtracker

Editing the configuration files

Configuring core-site.xml

sudo vi /etc/hadoop-0.20/conf/core-site.xml

[sourcecode language="plain"]
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://masters000:8020</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/lib/hadoop-0.20/cache/${user.name}</value>
  </property>

  <!-- OOZIE proxy user setting -->
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
  </property>
</configuration>
[/sourcecode]

Configuring hdfs-site.xml

sudo vi /etc/hadoop-0.20/conf/hdfs-site.xml

[sourcecode language="plain"]
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <!-- Immediately exit safemode as soon as one DataNode checks in.
       On a multi-node cluster, these configurations must be removed. -->
  <property>
    <name>dfs.safemode.extension</name>
    <value>0</value>
  </property>
  <property>
    <name>dfs.safemode.min.datanodes</name>
    <value>1</value>
  </property>
  <property>
    <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
    <name>dfs.name.dir</name>
    <value>/var/lib/hadoop-0.20/cache/hadoop/dfs/name</value>
  </property>

  <!-- Enable Hue Plugins -->
  <property>
    <name>dfs.namenode.plugins</name>
    <value>org.apache.hadoop.thriftfs.NamenodePlugin</value>
    <description>Comma-separated list of namenode plug-ins to be activated.
    </description>
  </property>
  <property>
    <name>dfs.datanode.plugins</name>
    <value>org.apache.hadoop.thriftfs.DatanodePlugin</value>
    <description>Comma-separated list of datanode plug-ins to be activated.
    </description>
  </property>
  <property>
    <name>dfs.thrift.address</name>
    <value>0.0.0.0:10090</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.support.broken.append</name>
    <value>true</value>
  </property>
</configuration>
[/sourcecode]

WebHDFS is enabled and append support is added here.
These settings are required when writing from fluentd and similar tools.
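
Append support can also be exercised from a small client; a minimal sketch, assuming the settings above are active (the path and class name are made up):

[java]
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Appends a line to a file on HDFS. This fails with an IOException
 * unless append support is enabled on the cluster.
 */
public class AppendCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/hdfs/append-test.log"); // hypothetical path

        if (!fs.exists(path)) {
            fs.create(path).close(); // create an empty file first
        }
        FSDataOutputStream out = fs.append(path);
        out.write("appended line\n".getBytes("UTF-8"));
        out.close();
        fs.close();
    }
}
[/java]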

Configuring mapred-site.xml

sudo vi /etc/hadoop-0.20/conf/mapred-site.xml

[sourcecode language="plain"]
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>masters000:8021</value>
  </property>

  <!-- Enable Hue plugins -->
  <property>
    <name>mapred.jobtracker.plugins</name>
    <value>org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin</value>
    <description>Comma-separated list of jobtracker plug-ins to be activated.
    </description>
  </property>
  <property>
    <name>jobtracker.thrift.address</name>
    <value>0.0.0.0:9290</value>
  </property>
</configuration>
[/sourcecode]

Configuring hadoop-env.sh

sudo vi /etc/hadoop-0.20/conf/hadoop-env.sh

[sourcecode language="plain"]
# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH="<extra_entries>:$HADOOP_CLASSPATH"

# The maximum amount of heap to use, in MB. Default is 1000.
# export HADOOP_HEAPSIZE=2000

# Extra Java runtime options. Empty by default.
# if [ "$HADOOP_OPTS" == "" ]; then export HADOOP_OPTS=-server; else HADOOP_OPTS+=" -server"; fi

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
# export HADOOP_TASKTRACKER_OPTS=
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
# export HADOOP_CLIENT_OPTS

# Extra ssh options. Empty by default.
# export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"

# Where log files are stored. $HADOOP_HOME/logs by default.
# export HADOOP_LOG_DIR=${HADOOP_HOME}/logs

# File naming remote slave hosts. $HADOOP_HOME/conf/slaves by default.
# export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop

# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HADOOP_SLAVE_SLEEP=0.1

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the users that are going to run the hadoop daemons. Otherwise there is
# the potential for a symlink attack.
# export HADOOP_PID_DIR=/var/hadoop/pids

# A string representing this instance of hadoop. $USER by default.
# export HADOOP_IDENT_STRING=$USER

# The scheduling priority for daemon processes. See 'man nice'.
# export HADOOP_NICENESS=10
[/sourcecode]

It seems things basically work if you only change JAVA_HOME=/usr/lib/jvm/java-6-sun.

Configuring masters

sudo vi /etc/hadoop-0.20/conf/masters
masters000

Configuring slaves

sudo vi /etc/hadoop-0.20/conf/slaves
slaves000
slaves001
slaves002

Initialization

Setting up the cache directory

sudo mkdir -p /var/lib/hadoop-0.20/cache
sudo chown -R hdfs:hadoop /var/lib/hadoop-0.20

sudo chmod 0777 /var/lib/hadoop-0.20/cache

Registering the SSH public key

sudo su hdfs
ssh-keygen -t rsa -P "" 
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys 

Set this up so that you can ssh into each server without a passphrase.

Formatting the NameNode

sudo -u hdfs hadoop namenode -format

Editing the configuration files

Setting /etc/hosts

sudo vi /etc/hosts
127.0.0.1         localhost
127.0.0.1         masters000.sheeps.me    masters000
192.168.196.125   masters000.sheeps.me    masters000
192.168.196.126   slaves000.sheeps.me     slaves000
192.168.196.127   slaves001.sheeps.me     slaves001
192.168.196.128   slaves002.sheeps.me     slaves002

It seems things can fail unless the host and domain names are configured properly.

Starting the services

Starting NameNode and JobTracker

sudo service hadoop-0.20-namenode start
sudo service hadoop-0.20-jobtracker start

Classpath at startup

/usr/lib/hadoop-0.20/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u5.jar:/usr/lib/hadoop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/asm-3.2.jar:/usr/lib/hadoop-0.20/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.1.jar:/usr/lib/hadoop-0.20/lib/commons-io-2.1.jar:/usr/lib/hadoop-0.20/lib/commons-lang-2.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-3.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/guava-r09-jarjar.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u5.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jersey-core-1.8.jar:/usr/lib/hadoop-0.20/lib/jersey-json-1.8.jar:/usr/lib/hadoop-0.20/lib/jersey-server-1.8.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jetty-servlet-tester-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.cloudera.1.jar:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/mysql-connector-java-5.1.22-bin.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar

On to installing the HDFS slaves

Installing sun-java

To install sun-java on Ubuntu, run the script from flexion.org.

flexion.org

wget https://raw.github.com/flexiondotorg/oab-java6/master/oab-java.sh -O oab-java.sh
chmod +x oab-java.sh
sudo ./oab-java.sh

Installation

sudo aptitude -y install sun-java6-jdk ant

Setting environment variables

export JAVA_HOME=/usr/lib/jvm/java-6-sun

Preparing to install Hadoop with apt

Add the Cloudera repository so that Hadoop can be installed from apt.

Creating the Cloudera source list

sudo vi /etc/apt/sources.list.d/cloudera.list

---- cloudera.list ----

deb http://archive.cloudera.com/debian lucid-cdh3 contrib

Registering the public key

curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -

Updating the package index

sudo aptitude update