docker6_Setting up a Spark cluster
- 1. Install Spark
- 2. Configure Spark
- 3. Standalone-HA
- 4. Spark-On-Yarn
Link: Spark cluster setup based on the CentOS 8 image
Enter the node1 container:
docker exec -it gpb_hdp_node1 bash
or
ssh -p 10022 root@localhost
1. Install Spark
Exit back to the host machine.
Upload the installation package to the /root directory:
siriyang@siriyangs-MacBook-Pro 资料 % scp -P 10022 spark-3.0.1-bin-hadoop2.7.tgz root@localhost:/root/
root@localhost's password:
spark-3.0.1-bin-hadoop2.7.tgz 100% 210MB 44.3MB/s 00:04
From the host machine, use SSH to connect to node1:
ssh root@localhost -p 10022
Extract the package into /export/server/:
tar -zxvf spark-3.0.1-bin-hadoop2.7.tgz -C /export/server/
Change ownership:
chown -R root /export/server/spark-3.0.1-bin-hadoop2.7
chgrp -R root /export/server/spark-3.0.1-bin-hadoop2.7
Create a symlink:
ln -s /export/server/spark-3.0.1-bin-hadoop2.7 /export/server/spark
Start the Spark interactive shell:
/export/server/spark/bin/spark-shell
Open http://172.23.27.197:4040
Test
Prepare a file:
vim /root/words.txt
hello me you her
hello me you
hello me
hello
Run WordCount:
val textFile = sc.textFile("file:///root/words.txt")
val counts = textFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
counts.collect
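With the input file above, the result of counts.collect should look roughly like the following (the pair order may differ, since reduceByKey makes no ordering guarantee):
res0: Array[(String, Int)] = Array((hello,4), (me,3), (you,2), (her,1))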
2. Configure Spark
Configure slaves/workers
Go to the configuration directory:
cd /export/server/spark/conf
Copy the template configuration file:
cp slaves.template slaves
vim slaves
Contents:
node2
node3
Configure the master
Go to the configuration directory:
cd /export/server/spark/conf
Copy the template configuration file:
cp spark-env.sh.template spark-env.sh
Edit the configuration file:
vim spark-env.sh
Add the following:
## JAVA installation directory
JAVA_HOME=/export/server/jdk1.8.0_231
## Hadoop configuration directory; needed for reading files on HDFS and for running Spark on YARN, so set it up in advance
HADOOP_CONF_DIR=/export/server/hadoop/etc/hadoop
YARN_CONF_DIR=/export/server/hadoop/etc/hadoop
## Host of the Spark Master and the port used for submitting jobs
SPARK_MASTER_HOST=node1
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1g
Distribute
Distribute the configured Spark installation to the other machines in the cluster:
cd /export/server/
scp -r spark node3:$PWD
scp -r spark node2:$PWD
Test
1. Starting and stopping the cluster
Start the Spark cluster on the master node:
/export/server/spark/sbin/start-all.sh
Stop the Spark cluster on the master node:
/export/server/spark/sbin/stop-all.sh
Connect a spark-shell to the cluster:
/export/server/spark/bin/spark-shell --master spark://node1:7077
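To confirm that the workers actually execute tasks, the earlier WordCount can be re-run inside this spark-shell against the cluster. A minimal sketch, assuming HDFS is already running and the test file has been uploaded to it first (the /words.txt path is only an example):
// On node1, before starting spark-shell: hadoop fs -put /root/words.txt /
val textFile = sc.textFile("hdfs://node1:8020/words.txt")
val counts = textFile.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.collect   // the tasks now run on the node2/node3 workers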
Open http://172.23.27.197:8080/
Link: Checking which process occupies a port on Linux
According to posts online, a ZooKeeper component (the AdminServer, which listens on port 8080 by default) was occupying the port, so http://172.23.27.197:8080/ could not be reached. The port conflict needs to be resolved:
[root@69432059c763 server]# netstat -anp | grep 8080
tcp 0 0 0.0.0.0:18080 0.0.0.0:* LISTEN 20762/java
tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 1770/java
[root@69432059c763 server]# kill -9 1770
[root@69432059c763 server]# netstat -anp | grep 8080
tcp 0 0 0.0.0.0:18080 0.0.0.0:* LISTEN 20762/java
[root@69432059c763 server]# kill -9 20762
[root@69432059c763 server]# netstat -anp | grep 8080
[root@69432059c763 server]# /export/server/spark/sbin/stop-all.sh
node3: stopping org.apache.spark.deploy.worker.Worker
node2: stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
[root@69432059c763 server]# /export/server/spark/sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /export/server/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-69432059c763.out
node2: starting org.apache.spark.deploy.worker.Worker, logging to /export/server/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-7797b90e0ccd.out
node3: starting org.apache.spark.deploy.worker.Worker, logging to /export/server/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-068874a8c481.out
[root@69432059c763 server]# netstat -anp | grep 8080
tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 24491/java
http://172.23.27.197:8080/
3. Standalone-HA
1. Configuration
1. Start ZooKeeper:
/export/server/zookeeper/bin/zkServer.sh status
/export/server/zookeeper/bin/zkServer.sh stop
/export/server/zookeeper/bin/zkServer.sh start
2. Modify the configuration
vim /export/server/spark/conf/spark-env.sh
Comment out:
#SPARK_MASTER_HOST=node1
Add:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node1:2181,node2:2181,node3:2181 -Dspark.deploy.zookeeper.dir=/spark-ha"
3. Distribute the configuration
[root@69432059c763 conf]# scp -r spark-env.sh node2:$PWD
spark-env.sh 100% 4842 4.7MB/s 00:00
[root@69432059c763 conf]# scp -r spark-env.sh node3:$PWD
spark-env.sh 100% 4842 4.8MB/s 00:00
2. Testing
0. Start the ZooKeeper service:
/export/server/zookeeper/bin/zkServer.sh status
/export/server/zookeeper/bin/zkServer.sh stop
/export/server/zookeeper/bin/zkServer.sh start
1. On node1, restart the Spark cluster:
/export/server/spark/sbin/stop-all.sh
/export/server/spark/sbin/start-all.sh
2. On node2, start an additional Master on its own:
/export/server/spark/sbin/start-master.sh
3. Check the Web UIs (one Master should show status ALIVE, the other STANDBY):
http://node1:8080/
http://node2:8080/
4. Simulate node1's Master going down:
jps
kill -9 <Master process id>
5. Check the Web UIs again (the Master on node2 should have taken over as ALIVE).
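To verify that client connections also survive a failover, both Masters can be listed when connecting (a sketch; run on node1):
/export/server/spark/bin/spark-shell --master spark://node1:7077,node2:7077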
4. Spark-On-Yarn
0. Shut down the previous Spark Standalone cluster:
/export/server/spark/sbin/stop-all.sh
1. Configure the YARN history server and disable resource checks
vim /export/server/hadoop/etc/hadoop/yarn-site.xml
<configuration>
    <!-- ResourceManager location -->
    <property><name>yarn.resourcemanager.hostname</name><value>node1</value></property>
    <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
    <!-- YARN memory allocation -->
    <property><name>yarn.nodemanager.resource.memory-mb</name><value>20480</value></property>
    <property><name>yarn.scheduler.minimum-allocation-mb</name><value>2048</value></property>
    <property><name>yarn.nodemanager.vmem-pmem-ratio</name><value>2.1</value></property>
    <!-- Enable log aggregation -->
    <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
    <!-- Retention time (seconds) for aggregated logs on HDFS -->
    <property><name>yarn.log-aggregation.retain-seconds</name><value>604800</value></property>
    <!-- YARN history server URL -->
    <property><name>yarn.log.server.url</name><value>http://node1:19888/jobhistory/logs</value></property>
    <!-- Disable YARN physical/virtual memory checks -->
    <property><name>yarn.nodemanager.pmem-check-enabled</name><value>false</value></property>
    <property><name>yarn.nodemanager.vmem-check-enabled</name><value>false</value></property>
</configuration>
Note: if this was not configured before, the updated file needs to be distributed and YARN restarted:
cd /export/server/hadoop/etc/hadoop
scp yarn-site.xml node2:$PWD
scp yarn-site.xml node3:$PWD
/export/server/hadoop/sbin/stop-yarn.sh
/export/server/hadoop/sbin/start-yarn.sh
2. Configure the Spark history server and its integration with YARN
- Modify spark-defaults.conf
Go to the configuration directory:
cd /export/server/spark/conf
Rename the template configuration file:
mv spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf
Add the following:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node1:8020/sparklog/
spark.eventLog.compress true
spark.yarn.historyServer.address node1:18080
- Modify spark-env.sh
Edit the configuration file:
vim /export/server/spark/conf/spark-env.sh
Add the following:
## Spark history event log location
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://node1:8020/sparklog/ -Dspark.history.fs.cleaner.enabled=true"
Manually create the log directory on HDFS:
hadoop fs -mkdir -p /sparklog
- Change the log level
Go to the configuration directory:
cd /export/server/spark/conf
Rename the log4j properties template:
mv log4j.properties.template log4j.properties
Change the log level:
vim log4j.properties
Change the root logger level, typically from INFO to WARN:
log4j.rootCategory=WARN, console
- Distribute (optional; if Spark jobs are only submitted to YARN from node1, distribution is not needed):
cd /export/server/spark/conf
scp spark-env.sh node2:$PWD
scp spark-env.sh node3:$PWD
scp spark-defaults.conf node2:$PWD
scp spark-defaults.conf node3:$PWD
scp log4j.properties node2:$PWD
scp log4j.properties node3:$PWD
3. Configure the Spark dependency jars
1. Create a directory on HDFS to store the Spark jars:
hadoop fs -mkdir -p /spark/jars/
2. Upload all jars under $SPARK_HOME/jars to HDFS:
hadoop fs -put /export/server/spark/jars/* /spark/jars/
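As a quick sanity check (not part of the original steps), list a few of the uploaded jars:
hadoop fs -ls /spark/jars/ | head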
3. On node1, modify spark-defaults.conf:
vim /export/server/spark/conf/spark-defaults.conf
Add the following:
spark.yarn.jars hdfs://node1:8020/spark/jars/*
Distribute/sync (optional):
cd /export/server/spark/conf
scp spark-defaults.conf root@node2:$PWD
scp spark-defaults.conf root@node3:$PWD
4. Start the services
- Start HDFS and YARN; run the following on node1:
/export/server/hadoop/sbin/start-dfs.sh
/export/server/hadoop/sbin/start-yarn.sh
or
/export/server/hadoop/sbin/start-all.sh
- Start the MRHistoryServer service; run on node1:
/export/server/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
- Start the Spark HistoryServer service; run on node1:
/export/server/spark/sbin/start-history-server.sh
- MRHistoryServer Web UI: http://node1:19888/
- Spark HistoryServer Web UI: http://node1:18080/
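As a final end-to-end check (a sketch, not from the original notes; the examples jar path assumes the stock spark-3.0.1-bin-hadoop2.7 layout), submit SparkPi to YARN from node1 and then look for the application in the YARN UI (http://node1:8088) and the Spark HistoryServer UI:
/export/server/spark/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /export/server/spark/examples/jars/spark-examples_2.12-3.0.1.jar \
  10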