One-Click k8s Deployment of Hadoop for Local Development

The big data stack involves many components and is relatively expensive to deploy. To reduce that cost, this article deploys the Hadoop applications as containers on k8s, which makes everyday testing and use much more convenient.

Prerequisites:

  • K8s
  • Helm, the k8s package manager
  • helm-diff, the Helm diff tool
  • helmfile, the chart orchestration tool
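
The last two can be installed roughly as follows (a minimal sketch; the helmfile version and install path are only examples, check each project's README for current instructions):

# install the helm-diff plugin, which helmfile relies on for diffing releases
helm plugin install https://github.com/databus23/helm-diff

# install helmfile from a release binary (the version number is only an example)
wget https://github.com/helmfile/helmfile/releases/download/v0.156.0/helmfile_0.156.0_linux_amd64.tar.gz
tar -xzf helmfile_0.156.0_linux_amd64.tar.gz helmfile
sudo mv helmfile /usr/local/bin/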

One-Click Hadoop

Hadoop is not just a single piece of software; it is better thought of as a base runtime environment that provides three tools: distributed storage (HDFS), distributed computing (MapReduce), and resource scheduling (YARN). The deployment steps are as follows:

  1. Clone the chart code with git:
git clone https://gitee.com/gaochuanaaa/bigdata-platfrom-charts.git

The repository layout is as follows:

[screenshot of the repository layout omitted: image failed to upload]

Some packages that are hard to download have been placed in the helmhub directory; a few others live on GitHub, where download speed can be unstable, so you can use dev-sidecar as a proxy to speed things up.

  2. Pick the packages you want to deploy; if you only need the base Hadoop environment, edit the helmfile.yaml file:
releases:
  - name: my-hadoop
    chart: ./charts/hadoop
    values:
      - conf:
          ## @param hadoop.conf.coreSite will append the key and value to the core-site.xml file
          coreSite:
            # defined the Unix user[hue] that will run the hue supervisor server as a proxyuser. ref: https://docs.gethue.com/administrator/configuration/connectors/#hdfs
            hadoop.proxyuser.hue.hosts: "*"
            hadoop.proxyuser.hue.groups: "*"
            # defined the Unix user[httpfs] that will run the HttpFS server as a proxyuser. ref: https://hadoop.apache.org/docs/stable/hadoop-hdfs-httpfs/ServerSetup.html
            hadoop.proxyuser.httpfs.hosts: "*"
            hadoop.proxyuser.httpfs.groups: "*"
            # defined the Unix user[hive] that will run the HiveServer2 server as a proxyuser. ref: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Superusers.html
            hadoop.proxyuser.hive.hosts: "*"
            hadoop.proxyuser.hive.groups: "*"
          hdfsSite:
            dfs.permissions.enabled: false
            dfs.webhdfs.enable: true
            dfs.replication: 3
          httpfsSite:
            # Hue HttpFS proxy user setting. ref: https://docs.gethue.com/administrator/configuration/connectors/#hdfs
            httpfs.proxyuser.hue.hosts: "*"
            httpfs.proxyuser.hue.groups: "*"
      - ingress:
          nameNode:
            enabled: true
            hosts:
              - hdfs.demo.com
          resourcemanager:
            enabled: true
            hosts:
              - resourcemanager.demo.com
  3. Enter the bigdata-platfrom-charts directory and deploy the applications with helmfile apply helmfile.yaml, as sketched below.
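
For example (helmfile reads helmfile.yaml from the working directory by default, so the file name can be omitted):

cd bigdata-platfrom-charts
# preview the changes against the cluster first (uses the helm-diff plugin), then apply
helmfile diff
helmfile apply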

One-Click Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides full SQL query capability by translating SQL statements into MapReduce jobs. Its main advantage is a low learning curve: simple MapReduce-style statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to data warehouse analytics.

Just as with the Hadoop deployment, the Hive release is described by the following resource configuration:

  - name: my-hive
    chart: ./charts/hive
    needs:
      - my-hadoop
    values:
      - conf:
          hiveSite:
            hive.metastore.warehouse.dir: hdfs://my-hadoop-namenode:9820/user/hive/warehouse
            hive.metastore.schema.verification: false
          hadoopConfigMap: my-hadoop-hadoop
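
Once the release is up, HiveServer2 can be sanity-checked with Beeline. A minimal sketch (the pod name my-hive-hiveserver-0 is an assumption; list the pods first to confirm it):

kubectl get pods | grep hive          # find the actual HiveServer2 pod name
kubectl exec -it my-hive-hiveserver-0 -- \
  beeline -u jdbc:hive2://localhost:10000 -e "SHOW DATABASES;"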

One-Click Hue

Hue serves as the front-end application for Hadoop: on one hand it can upload file-based data such as CSVs into the Hadoop environment, and on the other it can map those resources into Hive tables so the data can be queried with SQL. The deployment descriptor is as follows:

  - name: my-hue
    chart: ./charts/hue
    needs:
      - my-hadoop
      - my-hive
      - my-openldap
    values:
      - ingress:
          enabled: true
          hosts:
            - hue.demo.com
          annotations:
            nginx.ingress.kubernetes.io/proxy-body-size: 2048m
      - hue:
          replicas: 1
          interpreters: |-
            [[[postgresql]]]
            name = postgresql
            interface=sqlalchemy
            options='{"url": "postgresql://training:Training@2022@my-postgresql:5432/training"}'
          zz_hue_ini: |
            [desktop]
            secret_key=hue123
            # ref: https://gethue.com/mini-how-to-disabling-some-apps-from-showing-up/
            app_blacklist=spark,zookeeper,hbase,impala,search,pig,sqoop,security,oozie,jobsub,jobbrowser
            django_debug_mode=false
            gunicorn_work_class=sync
            enable_prometheus=true

            [[task_server]]
            enabled=false
            broker_url=redis://redis:6379/0
            result_cache='{"BACKEND": "django_redis.cache.RedisCache", "LOCATION": "redis://redis:6379/0", "OPTIONS": {"CLIENT_CLASS": "django_redis.client.DefaultClient"},"KEY_PREFIX": "queries"}'
            celery_result_backend=redis://redis:6379/0

            [[custom]]
            [[auth]]
            backend=desktop.auth.backend.LdapBackend,desktop.auth.backend.AllowFirstUserDjangoBackend

            [[ldap]]
            ldap_url=ldap://my-openldap:389
            search_bind_authentication=true
            use_start_tls=false
            create_users_on_login=true
            base_dn="ou=databu,dc=demo,dc=com"
            bind_dn="cn=admin,dc=demo,dc=com"
            bind_password=Root@2022
            test_ldap_user="cn=admin,dc=demo,dc=com"
            test_ldap_group="cn=openldap,dc=demo,dc=com"

            [[[users]]]
            user_filter="objectClass=posixAccount"
            user_name_attr="uid"

            [[[groups]]]
            group_filter="objectClass=posixGroup"
            group_name_attr="cn"
            group_member_attr="memberUid"

            [beeswax]
            # Host where HiveServer2 is running.
            hive_server_host=my-hive-hiveserver
            # Port where HiveServer2 Thrift server runs on.
            hive_server_port=10000
            thrift_version=7

            [notebook]
            [[interpreters]]
            [[[hive]]]
            name=Hive
            interface=hiveserver2

            [hadoop]
            [[hdfs_clusters]]
            [[[default]]]
            fs_defaultfs=hdfs://my-hadoop-namenode:9820
            webhdfs_url=http://my-hadoop-httpfs:14000/webhdfs/v1

            # Configuration for YARN (MR2)
            # ------------------------------------------------------------------------
            [[yarn_clusters]]
            [[[default]]]
            resourcemanager_host=my-hadoop-resourcemanager-hl
            resourcemanager_api_url=http://my-hadoop-resourcemanager-hl:8088/
            resourcemanager_port=8032
            history_server_api_url=http://my-hadoop-historyserver-hl:19888/
            spark_history_server_url=http://my-hadoop-spark-master-svc:18080
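
With the ingress above, Hue is reachable at hue.demo.com. Without an ingress controller a port-forward works as well; a sketch (the service name my-hue is an assumption, confirm it with kubectl get svc; Hue listens on port 8888 by default):

kubectl port-forward svc/my-hue 8888:8888
# then open http://localhost:8888 and sign in with an LDAP account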

One-Click Superset

Superset is a visual analytics application: you configure dimensions and metrics to produce visualizations such as line and pie charts. The deployment descriptor is as follows:

  - name: my-superset
    chart: ./helmhub/superset-0.8.5.tgz
    version: ~0.8.5
    needs:
      - my-postgresql
      - my-hive
    values:
      - image:
          tag: 2.0.0
        ingress:
          enabled: true
          hosts:
            - superset.demo.com
        init:
          adminUser:
            password: Root@2022
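
After logging in (the admin user with the Root@2022 password set above), Hive can be registered as a database in Superset's UI via a SQLAlchemy URI pointing at HiveServer2. A hypothetical example, assuming the Superset image ships the pyhive driver:

hive://hive@my-hive-hiveserver:10000/default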

One-Click DolphinScheduler

DolphinScheduler is a distributed scheduling system: you build DAGs that orchestrate the various big data compute engines. The deployment descriptor is as follows:

  - name: my-dolphinscheduler
    chart: ./charts/dolphinscheduler
    needs:
      - my-hadoop
      - my-hive
    values:
      - ingress:
          enabled: true
          host: "dolphinscheduler.demo.com"
          path: "/"
          annotations:
            nginx.ingress.kubernetes.io/proxy-body-size: 2048m
      - common:
          configmap:
            RESOURCE_STORAGE_TYPE: "HDFS"
            FS_DEFAULT_FS: "hdfs://my-hadoop-namenode:9820"
          ## Shared storage persistence mounted into api, master and worker, such as Hadoop, Spark, Flink and DataX binary package
          sharedStoragePersistence:
            enabled: false
            storageClassName: rook-cephfs
      - conf:
          common:
            resource.storage.type: HDFS
            fs.defaultFS: hdfs://my-hadoop-namenode:9820
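
The UI is exposed at dolphinscheduler.demo.com through the ingress above; a port-forward is an alternative. A sketch (the service name, port, and UI path are assumptions based on the chart's usual defaults, confirm them with kubectl get svc):

kubectl port-forward svc/my-dolphinscheduler-api 12345:12345
# then open http://localhost:12345/dolphinscheduler/ui in a browser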

One-Click Spark

Spark is a distributed compute engine; large companies now generally use it as a replacement for MapReduce to speed up computation. The deployment descriptor is as follows:

  - name: my-spark
    chart: bitnami/spark
    version: ~6.1.5
    values:
      - ingress:
          enabled: true
          hostname: spark.demo.com
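
A quick smoke test is to submit the bundled SparkPi example to the master. A sketch (the pod and service names follow the bitnami chart's <release>-master naming, and the examples jar path inside the bitnami image is also an assumption):

kubectl exec -it my-spark-master-0 -- sh -c \
  'spark-submit --master spark://my-spark-master-svc:7077 \
     --class org.apache.spark.examples.SparkPi \
     /opt/bitnami/spark/examples/jars/spark-examples_*.jar 100'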

Log output after deploying the resources above (components can be added as needed):

UPDATED RELEASES:
NAME            CHART                VERSION
my-openldap     ./charts/openldap      2.1.0
my-hadoop       ./charts/hadoop        1.0.1
my-spark        bitnami/spark         6.1.14
my-postgresql   bitnami/postgresql   11.6.26
my-hive         ./charts/hive          0.2.0
my-hue          ./charts/hue           1.0.4

Some of these resources come with a management UI. When running under WSL you can reach them through port-forwarding; for example, let's query Hadoop's WebHDFS.
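
A minimal sketch (the my-hadoop-httpfs service name is taken from the Hue configuration above; note that HttpFS requires a user.name query parameter):

kubectl port-forward svc/my-hadoop-httpfs 14000:14000
# list the HDFS root directory through the WebHDFS REST API
curl "http://localhost:14000/webhdfs/v1/?op=LISTSTATUS&user.name=hdfs"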

To verify HDFS I/O performance (and exercise Hadoop MapReduce), run the TestDFSIO benchmark. The job creates files in HDFS, and the created file data can then be inspected through WebHDFS:

kubectl exec -n default -it my-release-hadoop-nodemanager-0 -- \
  hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.2.3-tests.jar \
  TestDFSIO -write -nrFiles 5 -fileSize 128MB -resFile /tmp/TestDFSIOwrite.txt
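
With the earlier port-forward still running, the files the benchmark wrote can be listed through WebHDFS (TestDFSIO writes under /benchmarks/TestDFSIO in HDFS by default):

curl "http://localhost:14000/webhdfs/v1/benchmarks/TestDFSIO?op=LISTSTATUS&user.name=hdfs"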

This completes a partial functional verification of the Hadoop environment.

