Hadoop, ZooKeeper, and Spark cluster configuration:
1. Software versions:
JDK 1.8.0_281, Hadoop 3.1.4, ZooKeeper 3.4.10, Scala 2.12.0 (all installed under /opt/module, as the paths below show).
2. Environment variables:
I keep my environment variables in a custom file, /etc/profile.d/my_env.sh, which sets up the JDK, Hadoop, Spark, Scala, and ZooKeeper.
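A minimal sketch of what my_env.sh can look like, using the install paths that appear later in this document; the Spark path /opt/module/spark is an assumption, since the actual Spark directory is not named here:

#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_281
export PATH=$PATH:$JAVA_HOME/bin
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
#SCALA_HOME
export SCALA_HOME=/opt/module/scala-2.12.0
export PATH=$PATH:$SCALA_HOME/bin
#ZOOKEEPER_HOME
export ZOOKEEPER_HOME=/opt/module/zookeeper-3.4.10
export PATH=$PATH:$ZOOKEEPER_HOME/bin
#SPARK_HOME (assumed path)
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

After editing, run source /etc/profile (or log in again) so the variables take effect, and sync the file to the other hosts.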
3. Hadoop configuration:
My Hadoop version is 3.1.4.
Under /opt/module/hadoop-3.1.4/etc/hadoop I configured these four files: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
core-site.xml:
<configuration>
    <!-- NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>
    </property>
    <!-- Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.4/data</value>
    </property>
</configuration>
hdfs-site.xml:
<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop102:9870</value>
    </property>
    <!-- Secondary NameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop104:9868</value>
    </property>
</configuration>
mapred-site.xml:
<configuration>
    <!-- Run MapReduce jobs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
yarn-site.xml:
<configuration>
    <!-- Use the MapReduce shuffle auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- ResourceManager host -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop103</value>
    </property>
    <!-- Environment variable inheritance -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
4. xsync and xcall scripts:
I created a bin directory under my home directory and then created two executable scripts in it, xsync and xcall, as sketched below.
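A minimal sketch of creating the scripts and making them executable, assuming ~/bin is on the PATH (on CentOS the default profile usually adds it):

mkdir -p ~/bin
touch ~/bin/xsync ~/bin/xcall
chmod +x ~/bin/xsync ~/bin/xcall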
xcall contents:
#!/bin/bash
# Run the given command locally and then on hadoop102-hadoop104.
pcount=$#
if ((pcount == 0)); then
    echo "no args"
    exit
fi
echo ------------- localhost ----------
"$@"
for ((host = 102; host <= 104; host++)); do
    echo ---------- hadoop$host ---------
    ssh hadoop$host "$@"
done
xsync contents:
#!/bin/bash
#1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Arguments!
    exit;
fi
#2. Loop over every host in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ========= $host =========
    #3. Loop over all files/directories and send them one by one
    for file in $@
    do
        #4. Check whether the file exists
        if [ -e $file ]
        then
            #5. Get the parent directory
            pdir=$(cd -P $(dirname $file); pwd)
            #6. Get the file name
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo $file does not exist!
        fi
    done
done
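For example, the four Hadoop configuration files above can be pushed to all three hosts in one call; this usage mirrors the xsync invocations shown later in the ZooKeeper section:

xsync /opt/module/hadoop-3.1.4/etc/hadoop/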
5. ZooKeeper configuration:
The version I use is 3.4.10.
Under the ZooKeeper install directory I created a data directory, data, and a transaction-log directory, datalog.
In the data directory, create a myid file and write the number that corresponds to that hadoop10* host (matching the server.N entries in zoo.cfg below).
Configure the other two VMs, hadoop103 and hadoop104, the same way.
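A minimal sketch of these steps on hadoop102; the myid value 2 follows the server.2=hadoop102 entry below, so use 3 on hadoop103 and 4 on hadoop104:

mkdir -p /opt/module/zookeeper-3.4.10/data /opt/module/zookeeper-3.4.10/datalog
echo 2 > /opt/module/zookeeper-3.4.10/data/myid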
Then, in conf/zoo.cfg, set the data and log directory paths and the server mapping for the three VMs:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
#dataDir=/tmp/zookeeper
dataDir=/opt/module/zookeeper-3.4.10/data
dataLogDir=/opt/module/zookeeper-3.4.10/datalog
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.2=hadoop102:2888:3888
server.3=hadoop103:2888:3888
server.4=hadoop104:2888:3888
Configure the log4j.properties file, mainly to set where the logs are written (by default they go to the current directory).
That is, change the two commented defaults, #zookeeper.log.dir= and #zookeeper.tracelog.dir=:
# Define some default values that can be overridden by system properties
#zookeeper.log.dir=.
zookeeper.log.dir=/opt/module/zookeeper-3.4.10/logs
zookeeper.log.file=zookeeper.log
zookeeper.log.threshold=DEBUG
#zookeeper.tracelog.dir=.
zookeeper.tracelog.dir=/opt/module/zookeeper-3.4.10/logs
zookeeper.tracelog.file=zookeeper_trace.log
#
# ZooKeeper Logging Configuration
#
# Format is "<default threshold> (, <appender>)+
# DEFAULT: console appender only
log4j.rootLogger=${zookeeper.root.logger}
# Example with rolling log file
#log4j.rootLogger=DEBUG, CONSOLE, ROLLINGFILE
# Example with rolling log file and tracing
#log4j.rootLogger=TRACE, CONSOLE, ROLLINGFILE, TRACEFILE
#
# Log INFO level and above messages to the console
#
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.Threshold=${zookeeper.console.threshold}
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n
#
# Add ROLLINGFILE to rootLogger to get log file output
# Log DEBUG level and above messages to a log file
log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.ROLLINGFILE.Threshold=${zookeeper.log.threshold}
log4j.appender.ROLLINGFILE.File=${zookeeper.log.dir}/${zookeeper.log.file}
# Max log file size of 10MB
log4j.appender.ROLLINGFILE.MaxFileSize=10MB
# uncomment the next line to limit number of backup files
#log4j.appender.ROLLINGFILE.MaxBackupIndex=10
log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n
#
# Add TRACEFILE to rootLogger to get log file output
# Log DEBUG level and above messages to a log file
log4j.appender.TRACEFILE=org.apache.log4j.FileAppender
log4j.appender.TRACEFILE.Threshold=TRACE
log4j.appender.TRACEFILE.File=${zookeeper.tracelog.dir}/${zookeeper.tracelog.file}
log4j.appender.TRACEFILE.layout=org.apache.log4j.PatternLayout
### Notice we are including log4j's NDC here (%x)
log4j.appender.TRACEFILE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L][%x] - %m%n
Also edit the zkEnv.sh file under /opt/module/zookeeper-3.4.10/bin, since it also controls where log files are generated at startup:
if [ "x${ZOO_LOG_DIR}" = "x" ]
then
# ZOO_LOG_DIR="."
ZOO_LOG_DIR="/opt/module/zookeeper-3.4.10/logs"
fi
Finally, sync the two files, log4j.properties and zkEnv.sh, to the other cluster hosts:
[root@hadoop102 conf]# xsync /opt/module/zookeeper-3.4.10/conf/log4j.properties
========= hadoop102 =========
sending incremental file list
sent 79 bytes received 12 bytes 60.67 bytes/sec
total size is 2,213 speedup is 24.32
========= hadoop103 =========
sending incremental file list
log4j.properties
sent 890 bytes received 59 bytes 632.67 bytes/sec
total size is 2,213 speedup is 2.33
========= hadoop104 =========
sending incremental file list
log4j.properties
sent 890 bytes received 59 bytes 632.67 bytes/sec
total size is 2,213 speedup is 2.33
[root@hadoop102 bin]# xsync zkEnv.sh
========= hadoop102 =========
sending incremental file list
sent 71 bytes received 12 bytes 55.33 bytes/sec
total size is 2,751 speedup is 33.14
========= hadoop103 =========
sending incremental file list
zkEnv.sh
sent 1,477 bytes received 59 bytes 1,024.00 bytes/sec
total size is 2,751 speedup is 1.79
========= hadoop104 =========
sending incremental file list
zkEnv.sh
ZooKeeper configuration is complete.
6. Spark configuration
In conf/, copy the three .template files to create slaves, spark-defaults.conf, and spark-env.sh (see the sketch below).
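A minimal sketch of the copy step, assuming Spark is installed under /opt/module/spark (the actual Spark directory is not named in this document):

cd /opt/module/spark/conf
cp slaves.template slaves
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh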
slaves:
# A Spark Worker will be started on each of the machines listed below.
hadoop102
hadoop103
hadoop104
spark-defaults.conf:
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
# Enable event logging
spark.eventLog.enabled true
# Directory on HDFS where event logs are stored
spark.eventLog.dir hdfs://hadoop102:8020/history
# Log optimization: compress event logs
spark.eventLog.compress true
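Note: the HDFS directory used for event logs has to exist before Spark applications are submitted; a minimal sketch, run after HDFS is up:

hadoop fs -mkdir -p /history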
spark-env.sh:
#!/usr/bin/env bash
#
# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.
# Java home
export JAVA_HOME=/opt/module/jdk1.8.0_281
export SCALA_HOME=/opt/module/scala-2.12.0
export HADOOP_HOME=/opt/module/hadoop-3.1.4
export HADOOP_CONF_DIR=/opt/module/hadoop-3.1.4/etc/hadoop
# Spark Master address (left commented out because HA master election is handled by ZooKeeper below)
#export SPARK_MASTER_HOST=hadoop102
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=1
# Spark history server options
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://hadoop102:8020/history"
# Spark daemon options: ZooKeeper-based high-availability recovery
export SPARK_DAEMON_JAVA_OPTS="
-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hadoop102:2181,hadoop103:2181,hadoop104:2181 -Dspark.deploy.zookeeper.dir=/spark"
- Spark high-availability (HA) configuration
That is, HA relies on ZooKeeper (already configured above) for master election and recovery.
Problems encountered during startup and their solutions:
When configuring HA, ZooKeeper initially failed to start with a JDK-related error. I tried my senior's suggested fix and found it was not a JDK path problem. The documentation indicated ZooKeeper 3.3.6 supports JDK 8, and after switching to ZooKeeper 3.4.10 the problem went away. I also tried the newer versions 3.5.6 and 3.6.2: ZooKeeper started, but the web UI reported a JDK error. The documentation states JDK 8 is the minimum, yet switching to JDK 11 did not solve it either.
So I went back to version 3.4.10.
Finally, the startup procedure:
(1) Start Hadoop first:
On hadoop102, run start-dfs.sh.
On hadoop103, run start-yarn.sh.
(2) For HA, start ZooKeeper first:
On the two VMs that will act as Spark masters, hadoop102 and hadoop103, run:
zkServer.sh start
Check the ZooKeeper status with: zkServer.sh status
As checked: hadoop102 is a follower, hadoop103 is the leader.
(3) Start Spark:
On hadoop102, run spark-all.sh and the history server script spark-history-server.sh.
On hadoop103, run spark-master.sh to bring up the standby master for HA.
Screenshots of the results after startup: