Big data stacks involve many components and are expensive to deploy. To keep the cost of experimentation low, this guide deploys the Hadoop ecosystem on Kubernetes as containers, which makes day-to-day testing and use much easier.
Prerequisites:
- A Kubernetes (K8s) cluster
- Helm, the Kubernetes package manager
- helm-diff, the Helm diff plugin used by helmfile
- helmfile, for declarative chart management
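A minimal install sketch for the Helm tooling, assuming Helm itself is already present and Homebrew is available (otherwise grab a helmfile binary from its GitHub releases page):
# Install the helm-diff plugin required by helmfile diff/apply
helm plugin install https://github.com/databus23/helm-diff
# Install helmfile, then verify both tools
brew install helmfile
helm plugin list
helmfile --version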
One-Click Hadoop
Hadoop is not just a single piece of software; it is better thought of as a base runtime environment that provides three things: distributed storage (HDFS), distributed computation (MapReduce), and resource scheduling (YARN). The deployment steps are as follows:
- Clone the chart repository with git:
git clone https://gitee.com/gaochuanaaa/bigdata-platfrom-charts.git
Packages that are hard to download are bundled in the helmhub directory; a few others are fetched from GitHub, where download speeds can be unstable, so you can use dev-sidecar as a proxy to speed things up.
- Choose the components you want to deploy. If you only need the basic Hadoop environment, edit helmfile.yaml as follows:
releases:
  - name: my-hadoop
    chart: ./charts/hadoop
    values:
      - conf:
          ## @param hadoop.conf.coreSite: keys and values appended to the core-site.xml file
          coreSite:
            # Define the Unix user [hue] that runs the Hue supervisor server as a proxy user. ref: https://docs.gethue.com/administrator/configuration/connectors/#hdfs
            hadoop.proxyuser.hue.hosts: "*"
            hadoop.proxyuser.hue.groups: "*"
            # Define the Unix user [httpfs] that runs the HttpFS server as a proxy user. ref: https://hadoop.apache.org/docs/stable/hadoop-hdfs-httpfs/ServerSetup.html
            hadoop.proxyuser.httpfs.hosts: "*"
            hadoop.proxyuser.httpfs.groups: "*"
            # Define the Unix user [hive] that runs the HiveServer2 server as a proxy user. ref: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Superusers.html
            hadoop.proxyuser.hive.hosts: "*"
            hadoop.proxyuser.hive.groups: "*"
          hdfsSite:
            dfs.permissions.enabled: false
            dfs.webhdfs.enable: true
            dfs.replication: 3
          httpfsSite:
            # Hue HttpFS proxy user settings. ref: https://docs.gethue.com/administrator/configuration/connectors/#hdfs
            httpfs.proxyuser.hue.hosts: "*"
            httpfs.proxyuser.hue.groups: "*"
      - ingress:
          nameNode:
            enabled: true
            hosts:
              - hdfs.demo.com
          resourcemanager:
            enabled: true
            hosts:
              - resourcemanager.demo.com
- Enter the bigdata-platfrom-charts directory and deploy the applications with helmfile, as shown in the commands below.
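For reference, a typical run looks like this; helmfile picks up helmfile.yaml from the current directory by default, so the -f flag is optional:
cd bigdata-platfrom-charts
# Preview the changes against the cluster first (uses the helm-diff plugin)
helmfile -f helmfile.yaml diff
# Install or upgrade every release defined in helmfile.yaml
helmfile -f helmfile.yaml apply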
One-Click Hive
Hive is a data warehouse tool built on top of Hadoop. It maps structured data files onto database tables, provides full SQL querying, and translates SQL statements into MapReduce jobs for execution. Its main advantage is the low learning curve: simple MapReduce-style aggregations can be written as SQL-like statements without developing dedicated MapReduce applications, which makes it well suited to data warehouse analytics.
The deployment works the same way as for Hadoop; add the following release to helmfile.yaml (a quick Beeline connectivity check is sketched after the config):
  - name: my-hive
    chart: ./charts/hive
    needs:
      - my-hadoop
    values:
      - conf:
          hiveSite:
            hive.metastore.warehouse.dir: hdfs://my-hadoop-namenode:9820/user/hive/warehouse
            hive.metastore.schema.verification: false
        hadoopConfigMap: my-hadoop-hadoop
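Once the Hive pods are up, a quick connectivity check is to open a Beeline session against HiveServer2. The pod and service names below are assumptions based on the my-hive-hiveserver host used later in the Hue config; check kubectl get pods,svc for the actual names:
# Hypothetical pod name; -n hive is the connecting user
kubectl exec -it my-hive-hiveserver-0 -- beeline -u jdbc:hive2://my-hive-hiveserver:10000 -n hive -e "show databases;"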
One-Click Hue
Hue serves as a front-end application for Hadoop: on one hand it can upload file-based data such as CSVs into the Hadoop environment, and on the other it can map those files to Hive tables so the data can be queried with SQL. The release definition is as follows, with a port-forward example for the UI after it:
  - name: my-hue
    chart: ./charts/hue
    needs:
      - my-hadoop
      - my-hive
      - my-openldap
    values:
      - ingress:
          enabled: true
          hosts:
            - hue.demo.com
          annotations:
            nginx.ingress.kubernetes.io/proxy-body-size: 2048m
      - hue:
          replicas: 1
          interpreters: |-
            [[[postgresql]]]
            name = postgresql
            interface=sqlalchemy
            options='{"url": "postgresql://training:Training@2022@my-postgresql:5432/training"}'
          zz_hue_ini: |
            [desktop]
            secret_key=hue123
            # ref: https://gethue.com/mini-how-to-disabling-some-apps-from-showing-up/
            app_blacklist=spark,zookeeper,hbase,impala,search,pig,sqoop,security,oozie,jobsub,jobbrowser
            django_debug_mode=false
            gunicorn_work_class=sync
            enable_prometheus=true
              [[task_server]]
              enabled=false
              broker_url=redis://redis:6379/0
              result_cache='{"BACKEND": "django_redis.cache.RedisCache", "LOCATION": "redis://redis:6379/0", "OPTIONS": {"CLIENT_CLASS": "django_redis.client.DefaultClient"},"KEY_PREFIX": "queries"}'
              celery_result_backend=redis://redis:6379/0
              [[custom]]
              [[auth]]
              backend=desktop.auth.backend.LdapBackend,desktop.auth.backend.AllowFirstUserDjangoBackend
              [[ldap]]
              ldap_url=ldap://my-openldap:389
              search_bind_authentication=true
              use_start_tls=false
              create_users_on_login=true
              base_dn="ou=databu,dc=demo,dc=com"
              bind_dn="cn=admin,dc=demo,dc=com"
              bind_password=Root@2022
              test_ldap_user="cn=admin,dc=demo,dc=com"
              test_ldap_group="cn=openldap,dc=demo,dc=com"
                [[[users]]]
                user_filter="objectClass=posixAccount"
                user_name_attr="uid"
                [[[groups]]]
                group_filter="objectClass=posixGroup"
                group_name_attr="cn"
                group_member_attr="memberUid"
            [beeswax]
            # Host where HiveServer2 is running.
            hive_server_host=my-hive-hiveserver
            # Port on which the HiveServer2 Thrift server listens.
            hive_server_port=10000
            thrift_version=7
            [notebook]
              [[interpreters]]
                [[[hive]]]
                name=Hive
                interface=hiveserver2
            [hadoop]
              [[hdfs_clusters]]
                [[[default]]]
                fs_defaultfs=hdfs://my-hadoop:9820
                webhdfs_url=http://my-hadoop-httpfs:14000/webhdfs/v1
              # Configuration for YARN (MR2)
              [[yarn_clusters]]
                [[[default]]]
                resourcemanager_host=my-hadoop-resourcemanager-hl
                resourcemanager_api_url=http://my-hadoop-resourcemanager-hl:8088/
                resourcemanager_port=8032
                history_server_api_url=http://my-hadoop-historyserver-hl:19888/
                spark_history_server_url=http://my-hadoop-spark-master-svc:18080
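If you are not using the ingress, the Hue UI (port 8888 by default) can be reached via port-forwarding; the service name my-hue is an assumption, check kubectl get svc for the actual name:
kubectl port-forward svc/my-hue 8888:8888
# then open http://localhost:8888 and sign in with an LDAP user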
One-Click Superset
Superset is a visual analytics application: by configuring dimensions and metrics you can build visualizations such as line and pie charts. The release definition is as follows, followed by a sketch of how to point it at Hive:
  - name: my-superset
    chart: ./helmhub/superset-0.8.5.tgz
    version: ~0.8.5
    needs:
      - my-postgresql
      - my-hive
    values:
      - image:
          tag: 2.0.0
        ingress:
          enabled: true
          hosts:
            - superset.demo.com
        init:
          adminUser:
            password: Root@2022
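After Superset starts, Hive can be registered as a database through a SQLAlchemy URI. The service name and port below (my-superset, 8088) are assumptions, as is the presence of a Hive SQLAlchemy driver in the 2.0.0 image; the HiveServer2 host reuses my-hive-hiveserver from the Hue config:
kubectl port-forward svc/my-superset 8088:8088
# SQLAlchemy URI to enter under Data -> Databases -> + Database:
#   hive://hive@my-hive-hiveserver:10000/default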
One-Click DolphinScheduler
DolphinScheduler is a distributed scheduling system that orchestrates big data compute engines by letting you build DAG-based workflows. The release definition is as follows, with a port-forward example after it:
  - name: my-dolphinscheduler
    chart: ./charts/dolphinscheduler
    needs:
      - my-hadoop
      - my-hive
    values:
      - ingress:
          enabled: true
          host: "dolphinscheduler.demo.com"
          path: "/"
          annotations:
            nginx.ingress.kubernetes.io/proxy-body-size: 2048m
      - common:
          configmap:
            RESOURCE_STORAGE_TYPE: "HDFS"
            FS_DEFAULT_FS: "hdfs://my-hadoop-namenode:9820"
          ## Shared storage persistence mounted into the api, master and worker pods, e.g. for Hadoop, Spark, Flink and DataX binary packages
          sharedStoragePersistence:
            enabled: false
            storageClassName: rook-cephfs
      - conf:
          common:
            resource.storage.type: HDFS
            fs.defaultFS: hdfs://my-hadoop-namenode:9820
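Without the ingress, the DolphinScheduler UI can be reached by port-forwarding the API service; the service name, the port 12345, the UI path, and the default credentials below follow the upstream chart and docs and are assumptions, so verify with kubectl get svc:
kubectl port-forward svc/my-dolphinscheduler-api 12345:12345
# then open http://localhost:12345/dolphinscheduler/ui (default login admin / dolphinscheduler123)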
One-Click Spark
Spark is a distributed compute engine; large companies now generally use it as a replacement for MapReduce to speed up computation. The release definition is as follows, with a SparkPi smoke test after it:
  - name: my-spark
    chart: bitnami/spark
    version: ~6.1.5
    values:
      - ingress:
          enabled: true
          hostname: spark.demo.com
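A quick smoke test is to submit the bundled SparkPi example from the master pod. The pod and service names follow Bitnami's usual <release>-master naming and are assumptions:
kubectl exec -it my-spark-master-0 -- sh -c 'spark-submit --master spark://my-spark-master-svc:7077 --class org.apache.spark.examples.SparkPi /opt/bitnami/spark/examples/jars/spark-examples_*.jar 100'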
Log output after deploying the above releases (components can be added as needed):
UPDATED RELEASES:
NAME            CHART                 VERSION
my-openldap     ./charts/openldap     2.1.0
my-hadoop       ./charts/hadoop       1.0.1
my-spark        bitnami/spark         6.1.14
my-postgresql   bitnami/postgresql    11.6.26
my-hive         ./charts/hive         0.2.0
my-hue          ./charts/hue          1.0.4
Some of these components expose a web management UI. If you are running under WSL (or without an ingress controller) you can reach them through port-forwarding; for example, to access Hadoop's WebHDFS:
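A minimal sketch, assuming the chart exposes the NameNode and HttpFS services under the names referenced in the configs above (my-hadoop-namenode, my-hadoop-httpfs); check kubectl get svc for the actual names:
# NameNode web UI (Hadoop 3.x default port 9870)
kubectl port-forward svc/my-hadoop-namenode 9870:9870
# HttpFS / WebHDFS REST endpoint (the one Hue uses above)
kubectl port-forward svc/my-hadoop-httpfs 14000:14000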
To verify HDFS I/O performance and exercise Hadoop MapReduce, run the TestDFSIO benchmark. It writes files into HDFS, which you can then inspect through WebHDFS as shown after the command. Replace the pod name below with the actual NodeManager pod from kubectl get pods:
kubectl exec -n default -it my-release-hadoop-nodemanager-0 -- /opt/hadoop/bin/hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.2.3-tests.jar TestDFSIO -write -nrFiles 5 -fileSize 128MB -resFile /tmp/TestDFSIOwrite.txt
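With the HttpFS port-forward above in place, the files created by TestDFSIO (by default under /benchmarks/TestDFSIO) can be listed through the WebHDFS REST API; the user.name parameter below is an assumption, use whichever user the benchmark ran as:
curl "http://localhost:14000/webhdfs/v1/benchmarks/TestDFSIO?op=LISTSTATUS&user.name=hdfs"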
This completes a basic functional verification of the Hadoop environment.