I recently did a fresh install of Hadoop 0.20 on my LAN, and it is noticeably different from 0.19.
The 0.20 distribution no longer ships the single hadoop-default.xml configuration file; it has been replaced by three files:
core-site.xml
mapred-site.xml
hdfs-site.xml
All three files are empty by default. In other words, the global default values are now compiled into the code, and what you write in these files is only the options that differ from the defaults; those entries override the built-in defaults.
Each configuration option must go into its matching file; putting it in the wrong file will not work.
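As a minimal sketch of how the override mechanism works (the property and value are only an illustration): the compiled-in default for dfs.replication is 3, and placing just this snippet in hdfs-site.xml changes it while every other setting keeps its built-in default:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Only properties listed here override the compiled-in defaults.
       dfs.replication defaults to 3; this lowers it to 2. -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```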
The official Hadoop 0.20 documentation explains what goes where (note: the English documentation; 0.20 also ships a Chinese translation, but its content is outdated, and following it cost me quite a few detours). See: http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
Excerpt from the original:
This section deals with important parameters to be specified in the following:
conf/core-site.xml:

| Parameter | Value | Notes |
|---|---|---|
| fs.default.name | URI of NameNode. | hdfs://hostname/ |
conf/hdfs-site.xml:

| Parameter | Value | Notes |
|---|---|---|
| dfs.name.dir | Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. | If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. |
| dfs.data.dir | Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. |
conf/mapred-site.xml:

| Parameter | Value | Notes |
|---|---|---|
| mapred.job.tracker | Host or IP and port of JobTracker. | host:port pair. |
| mapred.system.dir | Path on the HDFS where the Map/Reduce framework stores system files, e.g. /hadoop/mapred/system/. | This is in the default filesystem (HDFS) and must be accessible from both the server and client machines. |
| mapred.local.dir | Comma-separated list of paths on the local filesystem where temporary Map/Reduce data is written. | Multiple paths help spread disk i/o. |
| mapred.tasktracker.{map\|reduce}.tasks.maximum | The maximum number of Map/Reduce tasks, which are run simultaneously on a given TaskTracker, individually. | Defaults to 2 (2 maps and 2 reduces), but vary it depending on your hardware. |
| dfs.hosts/dfs.hosts.exclude | List of permitted/excluded DataNodes. | If necessary, use these files to control the list of allowable DataNodes. |
| mapred.hosts/mapred.hosts.exclude | List of permitted/excluded TaskTrackers. | If necessary, use these files to control the list of allowable TaskTrackers. |
| mapred.queue.names | Comma-separated list of queues to which jobs can be submitted. | The Map/Reduce system always supports at least one queue named default, so this parameter's value should always contain the string default. Some job schedulers supported in Hadoop, like the Capacity Scheduler, support multiple queues; if such a scheduler is used, the list of configured queue names must be specified here. Once queues are defined, users can submit jobs to a queue using the property mapred.job.queue.name in the job configuration. There may be a separate, scheduler-managed configuration file for the properties of these queues; refer to the scheduler's documentation. |
| mapred.acls.enabled | Specifies whether ACLs are supported for controlling job submission and administration. | If true, ACLs are checked while submitting and administering jobs. ACLs can be specified using configuration parameters of the form mapred.queue.queue-name.acl-name, defined below. |
| mapred.queue.queue-name.acl-submit-job | List of users and groups that can submit jobs to the specified queue-name. | Both lists are comma-separated lists of names, and the two lists are separated by a blank. Example: user1,user2 group1,group2. If you wish to define only a list of groups, provide a blank at the beginning of the value. |
| mapred.queue.queue-name.acl-administer-job | List of users and groups that can change the priority of, or kill, jobs that have been submitted to the specified queue-name. | Both lists are comma-separated lists of names, and the two lists are separated by a blank. Example: user1,user2 group1,group2. If you wish to define only a list of groups, provide a blank at the beginning of the value. Note that the owner of a job can always change the priority of, or kill, his/her own job, irrespective of the ACLs. |
Typically all the above parameters are marked as final to ensure that they cannot be overridden by user applications.
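Marking a parameter as final is done with a &lt;final&gt; element inside the property; a minimal sketch, using one of the parameters above:

```xml
<property>
  <name>mapred.system.dir</name>
  <value>${hadoop.tmp.dir}/mapred/system</value>
  <!-- final: user job configurations cannot override this value -->
  <final>true</final>
</property>
```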
On a LAN, machines are often named after their owners, e.g. John-desktop, but in a distributed system we usually want machines named according to a scheme such as master, slave001, slave002.
So edit /etc/hosts on every machine and add the desired name for each host, e.g.:
```
192.168.1.10 John-desktop
192.168.1.10 master
192.168.1.11 Peter-desktop
192.168.1.11 slave001
```
and so on. Hadoop determines the current machine name automatically (via hostname), so if the hostname is not one of the expected names like master or slave001, network communication will fail.
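A quick sanity check (a sketch, assuming a standard Linux shell; these commands are not from the original post):

```sh
# What machine name will Hadoop see? Should print "master" on the master node.
hostname
# Do the names added to /etc/hosts actually resolve?
ping -c 1 master
ping -c 1 slave001
```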
Below are my configuration files.
Note: I have two machines. Master IP: 192.168.1.10; slave IP: 192.168.1.11.
core-site.xml:
```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master/</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hdfs</value>
  </property>
</configuration>
```
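Note that the hdfs://master/ URI omits the port, so the NameNode falls back to its default port (8020 in 0.20).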
hdfs-site.xml:
```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>${hadoop.tmp.dir}/dfs/name</value>
    <description>Determines where on the local filesystem the DFS name node
    should store the name table (fsimage). If this is a comma-delimited list
    of directories then the name table is replicated in all of the
    directories, for redundancy.</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>${hadoop.tmp.dir}/dfs/data</value>
    <description>Determines where on the local filesystem a DFS data node
    should store its blocks. If this is a comma-delimited
    list of directories, then data will be stored in all named
    directories, typically on different devices.
    Directories that do not exist are ignored.</description>
  </property>
</configuration>
```
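Since both values expand from the hadoop.tmp.dir set in core-site.xml, the name table ends up in /home/hadoop/hdfs/dfs/name and the blocks in /home/hadoop/hdfs/dfs/data; relocating everything later only requires changing hadoop.tmp.dir.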
mapred-site.xml:
```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.1.10:9001</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>${hadoop.tmp.dir}/mapred/system</value>
    <description>The shared directory where MapReduce stores control files.
    </description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>${hadoop.tmp.dir}/mapred/local</value>
    <description>The local directory where MapReduce stores intermediate
    data files. May be a comma-separated list of
    directories on different devices in order to spread disk i/o.
    Directories that do not exist are ignored.</description>
  </property>
</configuration>
```
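With the three files in place on every node, the usual 0.20 sequence is to format the NameNode once and then start the daemons from the master (a sketch; it assumes you run from the Hadoop install directory):

```sh
bin/hadoop namenode -format   # first time only: initializes dfs.name.dir
bin/start-all.sh              # starts NameNode/JobTracker here, DataNodes/TaskTrackers on the slaves
```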
One more thing:
In conf/hadoop-env.sh, point the JAVA_HOME variable at your JDK path. Even though it may already be set in .profile, set it here as well, or you will sometimes get a "JAVA_HOME is not set" error.
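For example (the JDK path below is only a placeholder; use your own):

```sh
# conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun   # point this at your actual JDK
```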