Hadoop, ZooKeeper, and Spark cluster configuration:
1. Software versions:
JDK 1.8.0_281, Hadoop 3.1.4, ZooKeeper 3.4.10, Scala 2.12.0 (all installed under /opt/module, as the paths below show).
2. Environment variables:
I keep my environment variables in a custom file, /etc/profile.d/my_env.sh, which sets up the JDK, Hadoop, Spark, Scala, and ZooKeeper.
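A minimal sketch of what my_env.sh can look like, using the install paths that appear later in this document; the Spark path /opt/module/spark is an assumption, since the actual Spark directory is not named here:

#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_281
export PATH=$PATH:$JAVA_HOME/bin
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
#SCALA_HOME
export SCALA_HOME=/opt/module/scala-2.12.0
export PATH=$PATH:$SCALA_HOME/bin
#ZOOKEEPER_HOME
export ZOOKEEPER_HOME=/opt/module/zookeeper-3.4.10
export PATH=$PATH:$ZOOKEEPER_HOME/bin
#SPARK_HOME (assumed path)
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

After editing, run source /etc/profile (or log in again) so the variables take effect, and sync the file to the other hosts.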
3. Hadoop configuration:
My Hadoop version is 3.1.4.
Under /opt/module/hadoop-3.1.4/etc/hadoop I configured these four files: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
core-site.xml:
<configuration>
    <!-- NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>
    </property>
    <!-- Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.4/data</value>
    </property>
</configuration>
hdfs-site.xml:
<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop102:9870</value>
    </property>
    <!-- Secondary NameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop104:9868</value>
    </property>
</configuration>
mapred-site.xml:
<configuration>
    <!-- Run MapReduce jobs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
yarn-site.xml:
<configuration>
    <!-- Use the MapReduce shuffle auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- ResourceManager host -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop103</value>
    </property>
    <!-- Environment variable inheritance -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
4. xsync and xcall scripts:
I created a bin directory under my home directory and then created two executable scripts in it, xsync and xcall, as sketched below.
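A minimal sketch of creating the scripts and making them executable, assuming ~/bin is on the PATH (on CentOS the default profile usually adds it):

mkdir -p ~/bin
touch ~/bin/xsync ~/bin/xcall
chmod +x ~/bin/xsync ~/bin/xcall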
xcall contents:
#!/bin/bash
# Run the given command locally and then on hadoop102-hadoop104.
pcount=$#
if ((pcount == 0)); then
    echo "no args"
    exit
fi
echo ------------- localhost ----------
"$@"
for ((host = 102; host <= 104; host++)); do
    echo ---------- hadoop$host ---------
    ssh hadoop$host "$@"
done
xsync contents:
#!/bin/bash
#1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Arguments!
    exit;
fi
#2. Loop over every host in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ========= $host =========
    #3. Loop over all files/directories and send them one by one
    for file in $@
    do
        #4. Check whether the file exists
        if [ -e $file ]
        then
            #5. Get the parent directory
            pdir=$(cd -P $(dirname $file); pwd)
            #6. Get the file name
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo $file does not exist!
        fi
    done
done
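For example, the four Hadoop configuration files above can be pushed to all three hosts in one call; this usage mirrors the xsync invocations shown later in the ZooKeeper section:

xsync /opt/module/hadoop-3.1.4/etc/hadoop/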
5. ZooKeeper configuration:
The version I use is 3.4.10.
Under the ZooKeeper install directory I created a data directory, data, and a transaction-log directory, datalog.
In the data directory, create a myid file and write the number that corresponds to that hadoop10* host (matching the server.N entries in zoo.cfg below).
Configure the other two VMs, hadoop103 and hadoop104, the same way.
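A minimal sketch of these steps on hadoop102; the myid value 2 follows the server.2=hadoop102 entry below, so use 3 on hadoop103 and 4 on hadoop104:

mkdir -p /opt/module/zookeeper-3.4.10/data /opt/module/zookeeper-3.4.10/datalog
echo 2 > /opt/module/zookeeper-3.4.10/data/myid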
Then, in conf/zoo.cfg, set the data and log directory paths and the server mapping for the three VMs:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
#dataDir=/tmp/zookeeper
dataDir=/opt/module/zookeeper-3.4.10/data
dataLogDir=/opt/module/zookeeper-3.4.10/datalog
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.2=hadoop102:2888:3888
server.3=hadoop103:2888:3888
server.4=hadoop104:2888:3888
Configure the log4j.properties file, mainly to set where the logs are written (by default they go to the current directory).
That is, change the two commented defaults, #zookeeper.log.dir= and #zookeeper.tracelog.dir=:
# Define some default values that can be overridden by system properties
#zookeeper.log.dir=.
zookeeper.log.dir=/opt/module/zookeeper-3.4.10/logs
zookeeper.log.file=zookeeper.log
zookeeper.log.threshold=DEBUG
#zookeeper.tracelog.dir=.
zookeeper.tracelog.dir=/opt/module/zookeeper-3.4.10/logs
zookeeper.tracelog.file=zookeeper_trace.log
#
# ZooKeeper Logging Configuration
#
# Format is "<default threshold> (, <appender>)+
# DEFAULT: console appender only
log4j.rootLogger=${zookeeper.root.logger}
# Example with rolling log file
#log4j.rootLogger=DEBUG, CONSOLE, ROLLINGFILE
# Example with rolling log file and tracing
#log4j.rootLogger=TRACE, CONSOLE, ROLLINGFILE, TRACEFILE
#
# Log INFO level and above messages to the console
#
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.Threshold=${zookeeper.console.threshold}
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n
#
# Add ROLLINGFILE to rootLogger to get log file output
# Log DEBUG level and above messages to a log file
log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.ROLLINGFILE.Threshold=${zookeeper.log.threshold}
log4j.appender.ROLLINGFILE.File=${zookeeper.log.dir}/${zookeeper.log.file}
# Max log file size of 10MB
log4j.appender.ROLLINGFILE.MaxFileSize=10MB
# uncomment the next line to limit number of backup files
#log4j.appender.ROLLINGFILE.MaxBackupIndex=10
log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n
#
# Add TRACEFILE to rootLogger to get log file output
# Log DEBUG level and above messages to a log file
log4j.appender.TRACEFILE=org.apache.log4j.FileAppender
log4j.appender.TRACEFILE.Threshold=TRACE
log4j.appender.TRACEFILE.File=${zookeeper.tracelog.dir}/${zookeeper.tracelog.file}
log4j.appender.TRACEFILE.layout=org.apache.log4j.PatternLayout
### Notice we are including log4j's NDC here (%x)
log4j.appender.TRACEFILE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L][%x] - %m%n
Also edit the zkEnv.sh file under /opt/module/zookeeper-3.4.10/bin, since it also controls where log files are generated at startup:
if [ "x${ZOO_LOG_DIR}" = "x" ]
then
# ZOO_LOG_DIR="."
ZOO_LOG_DIR="/opt/module/zookeeper-3.4.10/logs"
fi
Finally, sync the two files, log4j.properties and zkEnv.sh, to the other cluster hosts:
[root@hadoop102 conf]# xsync /opt/module/zookeeper-3.4.10/conf/log4j.properties
========= hadoop102 =========
sending incremental file list
sent 79 bytes received 12 bytes 60.67 bytes/sec
total size is 2,213 speedup is 24.32
========= hadoop103 =========
sending incremental file list
log4j.properties
sent 890 bytes received 59 bytes 632.67 bytes/sec
total size is 2,213 speedup is 2.33
========= hadoop104 =========
sending incremental file list
log4j.properties
sent 890 bytes received 59 bytes 632.67 bytes/sec
total size is 2,213 speedup is 2.33
[root@hadoop102 bin]# xsync zkEnv.sh
========= hadoop102 =========
sending incremental file list
sent 71 bytes received 12 bytes 55.33 bytes/sec
total size is 2,751 speedup is 33.14
========= hadoop103 =========
sending incremental file list
zkEnv.sh
sent 1,477 bytes received 59 bytes 1,024.00 bytes/sec
total size is 2,751 speedup is 1.79
========= hadoop104 =========
sending incremental file list
zkEnv.sh
ZooKeeper configuration is complete.
6. Spark configuration
In conf/, copy the three .template files to create slaves, spark-defaults.conf, and spark-env.sh (see the sketch below).
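A minimal sketch of the copy step, assuming Spark is installed under /opt/module/spark (the actual Spark directory is not named in this document):

cd /opt/module/spark/conf
cp slaves.template slaves
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh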
slaves:
# A Spark Worker will be started on each of the machines listed below.
hadoop102
hadoop103
hadoop104
spark-defaults.conf:
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
# Enable event logging
spark.eventLog.enabled true
# Directory on HDFS where event logs are stored
spark.eventLog.dir hdfs://hadoop102:8020/history
# Log optimization: compress event logs
spark.eventLog.compress true
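Note: the HDFS directory used for event logs has to exist before Spark applications are submitted; a minimal sketch, run after HDFS is up:

hadoop fs -mkdir -p /history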
spark-env.sh:
#!/usr/bin/env bash
#
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
# Java home
export JAVA_HOME=/opt/module/jdk1.8.0_281
export SCALA_HOME=/opt/module/scala-2.12.0
export HADOOP_HOME=/opt/module/hadoop-3.1.4
export HADOOP_CONF_DIR=/opt/module/hadoop-3.1.4/etc/hadoop
# Spark Master address (left commented out because HA master election is handled by ZooKeeper below)
#export SPARK_MASTER_HOST=hadoop102
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
# Spark history server options
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://hadoop102:8020/history"
# Spark daemon options: ZooKeeper-based high-availability recovery
export SPARK_DAEMON_JAVA_OPTS="
-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hadoop102:2181,hadoop103:2181,hadoop104:2181 -Dspark.deploy.zookeeper.dir=/spark"
- Spark high-availability (HA) configuration
That is, HA relies on ZooKeeper (already configured above) for master election and recovery.
Problems encountered during startup and their solutions:
When configuring HA, ZooKeeper initially failed to start with a JDK-related error. I tried my senior's suggested fix and found it was not a JDK path problem. The documentation indicated ZooKeeper 3.3.6 supports JDK 8, and after switching to ZooKeeper 3.4.10 the problem went away. I also tried the newer versions 3.5.6 and 3.6.2: ZooKeeper started, but the web UI reported a JDK error. The documentation states JDK 8 is the minimum, yet switching to JDK 11 did not solve it either.
So I went back to version 3.4.10.
Finally, the startup procedure:
(1) Start Hadoop first:
On hadoop102, run start-dfs.sh.
On hadoop103, run start-yarn.sh.
(2) For HA, start ZooKeeper first:
On the two VMs that will act as Spark masters, hadoop102 and hadoop103, run:
zkServer.sh start
Check the ZooKeeper status with: zkServer.sh status
As checked: hadoop102 is a follower, hadoop103 is the leader.
(3) Start Spark:
On hadoop102, run spark-all.sh and the history server script spark-history-server.sh.
On hadoop103, run spark-master.sh to bring up the standby master for HA.
Screenshots of the results after startup: