
Cloud machine learning platforms

Cloud machine learning platforms provide a range of capabilities that support the complete machine learning lifecycle.


To create effective machine learning and deep learning models, an organization needs copious amounts of data, a way to clean it and perform feature engineering on it, and a way to train models on that data in a reasonable amount of time. It then needs a way to deploy the models, monitor them for drift over time, and retrain them as needed.

If an organization has already invested in compute resources and accelerators such as GPUs, it can do all of this on premises, but it may find that even when those resources are adequate, they sit idle much of the time. On the other hand, it can sometimes be more cost-effective to run the entire pipeline in the cloud, using large amounts of compute resources and accelerators as needed and then releasing them.

The major cloud providers (and a number of smaller ones) have put significant effort into building out their machine learning platforms to support the complete machine learning lifecycle, from planning a project to maintaining a model in production. How should an organization determine which cloud will meet its needs? Every end-to-end machine learning platform should provide the following 12 capabilities.

1. Be close to your data

If an organization has the large amounts of data needed to build precise models, it doesn't want to ship that data halfway around the world. The issue here isn't distance but time: data transmission latency is ultimately limited by the speed of light, even on a perfect network with infinite bandwidth. Long distances mean longer latency.

For very large datasets, the ideal case is to build the model where the data already resides, so that no mass data transmission is needed. A number of databases support this to some degree.

The next best case is for the data to be on the same high-speed network as the model-building software, which typically means within the same data center. If an organization has terabytes of data or more, even moving it from one data center to another within a cloud availability zone can introduce a significant delay; incremental updates can mitigate this.

The worst case is having to move big data long distances over paths with constrained bandwidth and high latency. The trans-Pacific cables to Australia are particularly egregious in this respect.

2. Support an ETL or ELT pipeline

ETL (extract, transform, and load) and ELT (extract, load, and transform) are two data pipeline configurations common in the database world. Machine learning and deep learning amplify the need for both, especially the transform stage. ELT gives an organization more flexibility when transformations need to change, since the load phase is usually the most time-consuming for big data.

In general, raw data in the wild is noisy and needs to be filtered. It also has varying ranges: one variable might have a maximum in the millions, while another might range from -0.1 to -0.001. For machine learning, variables must be transformed to standardized ranges to keep those with large ranges from dominating the model; exactly which standardized range to use depends on the algorithm the model employs.
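As a minimal sketch of that transformation step (using scikit-learn's scalers; the toy values below are illustrative, not from any real dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two features with wildly different ranges, as described above:
# one in the millions, one between -0.1 and -0.001.
X = np.array([
    [2_500_000.0, -0.100],
    [1_000_000.0, -0.050],
    [4_000_000.0, -0.001],
])

# Standardize to zero mean and unit variance, so that neither feature
# dominates distance- or gradient-based models.
X_std = StandardScaler().fit_transform(X)

# Alternatively, rescale each feature to [0, 1]; which form is right
# depends on the algorithm (e.g., neural nets often prefer [0, 1]).
X_01 = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0))  # each column now has mean ~0
print(X_01.min(axis=0), X_01.max(axis=0))  # columns span exactly [0, 1]
```

Either scaler is fit on the training data and then reapplied, with the same learned parameters, to serving-time data.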

3. Support an online environment for model building

The conventional wisdom used to be that you should import your data to your desktop for model building. The sheer quantity of data needed to build good machine learning and deep learning models changes the picture: an organization can download a small sample to a desktop for exploratory data analysis and model building, but production models need access to the full data.

Web-based development environments such as Jupyter Notebooks, JupyterLab, and Apache Zeppelin are well suited to model building. If the data lives in the same cloud as the notebook environment, the analysis can be brought to the data, minimizing time-consuming data movement.

4. Support scale-up and scale-out training

Except for training models, the compute and memory requirements of notebooks are generally minimal. It helps a great deal if a notebook can spawn training jobs that run on multiple large virtual machines or containers. It also helps a great deal if training can access accelerators such as GPUs, TPUs, and FPGAs; these can turn days of training into hours.

5. Support AutoML and automated feature engineering

Not every organization is good at picking machine learning models, selecting features (the variables the model uses), and engineering new features from raw observations. Even for those that are, these tasks are time-consuming and can be automated to a large extent.

AutoML systems often try many models to see which produce the best objective function values, for example the minimum squared error for regression problems. The best AutoML systems can also perform feature engineering, and use their resources effectively to pursue the best possible models with the best possible sets of features.
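The core loop of such a system can be sketched in a few lines, as a toy model search with scikit-learn (real AutoML systems add feature engineering, hyperparameter search, and budget management on top of this):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the organization's dataset.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Try several candidate models and keep the one with the best objective
# value: here, the lowest cross-validated mean squared error.
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

scores = {
    name: -cross_val_score(
        model, X, y, cv=3, scoring="neg_mean_squared_error"
    ).mean()
    for name, model in candidates.items()
}

best = min(scores, key=scores.get)
print(best, scores[best])
```

The objective function would change with the task: accuracy or log loss for classification, for instance, instead of squared error.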

6. Support the best machine learning and deep learning frameworks

Most data scientists have favorite frameworks and programming languages for machine learning and deep learning. For those who prefer Python, Scikit-learn is often the favorite for machine learning, while TensorFlow, PyTorch, Keras, and MXNet are common top picks for deep learning. In Scala, Spark MLlib tends to be preferred for machine learning. In R, there are many native machine learning packages, as well as a good interface to Python. In Java, H2O.ai rates highly, as do Java-ML and Deep Java Library.

Cloud machine learning and deep learning platforms tend to have their own collections of algorithms, and they usually support external frameworks in at least one language or as containers with specific entry points. In some cases an organization can integrate its own algorithms and statistical methods with the platform's AutoML facilities, which is quite convenient.

Some cloud platforms also offer tuned versions of the major deep learning frameworks. For example, AWS has an optimized version of TensorFlow that it claims can achieve nearly linear scalability for deep neural network training.

7. Offer pre-trained models and support transfer learning

Not everyone wants to spend the time and compute resources to train their own models, nor should they when pre-trained models are available. For example, the ImageNet dataset is huge, and training a state-of-the-art deep neural network against it can take weeks, so it makes sense to use a pre-trained model when possible.

On the other hand, pre-trained models may not always identify the objects an organization cares about. Transfer learning lets an organization customize the last few layers of a neural network for its specific dataset, without the time and expense of training the full network.
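The essential idea can be sketched framework-agnostically: keep a pretrained network's early layers frozen and fit only a small "head" on the task-specific data. In this illustration a fixed random projection stands in for the pretrained backbone (a real workflow would load, say, an ImageNet-trained model and freeze its layers):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# "Frozen backbone": a fixed transformation that is never retrained.
# In practice this would be the early layers of a pretrained network.
W_frozen = rng.standard_normal((20, 8))
features = np.tanh(X @ W_frozen)

# Trainable "head": the only part fit on the organization's own data,
# which is why transfer learning is cheap compared with full training.
head = LogisticRegression(max_iter=1000).fit(features, y)
print(head.score(features, y))
```

Only the head's parameters are learned, so training takes seconds even when the backbone represents weeks of prior compute.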

8. Offer tuned, pre-trained AI services

The major cloud platforms offer robust, tuned AI services for many applications, not just image identification. Examples include language translation, speech to text, text to speech, forecasting, and recommendations.

These services have already been trained and tested on more data than is usually available to businesses. They are also already deployed on service endpoints with enough computational resources, including accelerators, to ensure good response times under worldwide load.

9. Manage experiments

The only way to find the best model for a dataset is to try everything, whether manually or using AutoML. That leaves another problem: managing the experiments.

A good cloud machine learning platform will give an organization a way to see and compare the objective function values of each experiment, for both the training set and the test data, as well as the size of the model and the confusion matrix. Being able to graph all of that is a definite plus.
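A dependency-free sketch of what such tracking does: record each run's settings and objective values on training and test data, then pick the best run by held-out error. (The three runs and their numbers below are hypothetical; real platforms and tools such as MLflow add storage, UI, and plotting on top of this idea.)

```python
experiments = []

def log_run(name, params, train_error, test_error, model_size_mb):
    """Record one experiment's configuration and results."""
    experiments.append({
        "name": name, "params": params,
        "train_error": train_error, "test_error": test_error,
        "model_size_mb": model_size_mb,
    })

# Hypothetical results for three runs.
log_run("linear", {"alpha": 0.0}, 12.4, 13.1, 0.1)
log_run("ridge",  {"alpha": 1.0}, 12.6, 12.9, 0.1)
log_run("forest", {"trees": 100}, 2.1, 14.0, 55.0)  # low train error but overfits

# Compare on test error, not train error: the forest's 2.1 train error
# is a trap, which is exactly why both values must be tracked.
best = min(experiments, key=lambda e: e["test_error"])
print(best["name"])  # prints "ridge"
```

Tracking model size alongside error also matters once deployment cost enters the comparison.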

10. Support model deployment for prediction

Once an organization has a way of picking the best experiment given its criteria, it also needs an easy way to deploy the model. If it deploys multiple models for the same purpose, it also needs a way to apportion traffic among them for A/B testing.

11. Monitor prediction performance

Data changes as the world changes, which means an organization cannot deploy a model and forget it. Instead, it needs to monitor the data submitted for predictions; when that data starts to diverge noticeably from the baseline of the original training dataset, the model needs to be retrained.
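One common way to detect such divergence (one option among several; the data below is synthetic) is to compare the distribution of a feature in incoming prediction requests against its training-time baseline with a two-sample Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature values captured when the model was trained...
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)
# ...versus values seen in recent prediction requests, after the
# world has shifted the feature's mean.
incoming = rng.normal(loc=0.6, scale=1.0, size=5000)

stat, p_value = ks_2samp(baseline, incoming)
if p_value < 0.01:
    print("drift detected: consider retraining")
```

In practice this check would run per feature on a schedule, with the threshold tuned to balance false alarms against slow reaction to real drift.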

12. Control costs

Finally, an organization needs ways to control the costs its models incur. Deploying models for production inference often accounts for 90% of the cost of deep learning, while training accounts for only 10%.

The best way to control prediction costs depends on the load and the complexity of the model. Under heavy load, an accelerator may avoid the need to add more virtual machine instances. Under variable load, it may be possible to dynamically change the size or number of instances or containers as load rises and falls. And under light load, a very small instance with a partial accelerator may be enough to handle predictions.

The author, Martin Heller, is a contributing editor and reviewer for InfoWorld, and previously worked as a Web and Windows programming consultant. From 1986 to 2010 Heller developed databases, software, and websites. More recently, he served as VP of technology and education at Alpha Software and as chairman and CEO of Tubifi.

Original:

In order to create effective machine learning and deep learning models, you need copious amounts of data, a way to clean the data and perform feature engineering on it, and a way to train models on your data in a reasonable amount of time. Then you need a way to deploy your models, monitor them for drift over time, and retrain them as needed.

You can do all of that on-premises if you have invested in compute resources and accelerators such as GPUs, but you may find that if your resources are adequate, they are also idle much of the time. On the other hand, it can sometimes be more cost-effective to run the entire pipeline in the cloud, using large amounts of compute resources and accelerators as needed, and then releasing them.

The major cloud providers (and a number of minor clouds too) have put significant effort into building out their machine learning platforms to support the complete machine learning lifecycle, from planning a project to maintaining a model in production. How do you determine which of these clouds will meet your needs? Here are 12 capabilities every end-to-end machine learning platform should provide, with notes on which clouds provide them.


Be close to your data

If you have the large amounts of data needed to build precise models, you don't want to ship it halfway around the world. The issue here isn't distance, however, it's time: Data transmission latency is ultimately limited by the speed of light, even on a perfect network with infinite bandwidth. Long distances mean latency.

The ideal case for very large data sets is to build the model where the data already resides, so that no mass data transmission is needed. A number of databases support that.

The next best case is for the data to be on the same high-speed network as the model-building software, which typically means within the same data center. Even moving the data from one data center to another within a cloud availability zone can introduce a significant delay if you have terabytes (TB) or more. You can mitigate this by doing incremental updates.

The worst case would be if you have to move big data long distances over paths with constrained bandwidth and high latency. The trans-Pacific cables going to Australia are particularly egregious in this respect.


The major cloud providers have been addressing this issue in multiple ways. One is to add machine learning and deep learning to their database services. For example, Amazon Redshift ML is designed to make it easy for SQL users to create, train, and deploy machine learning models using SQL commands against Amazon Redshift, a managed, petabyte-scale data warehouse service. BigQuery ML lets you create and execute machine learning models in BigQuery, Google Cloud's managed, petabyte-scale data warehouse, also using SQL queries.

IBM Db2 Warehouse on Cloud includes a wide set of in-database SQL analytics that includes some basic machine learning functionality, plus in-database support for R and Python. Microsoft SQL Server Machine Learning Services supports R, Python, Java, the PREDICT T-SQL command, the rx_Predict stored procedure in the SQL Server RDBMS, and Spark MLlib in SQL Server Big Data Clusters. And Oracle Cloud Infrastructure (OCI) Data Science is a managed and serverless platform for data science teams to build, train, and manage machine learning models using Oracle Cloud Infrastructure, including Oracle Autonomous Database and Oracle Autonomous Data Warehouse.

Another way cloud providers have addressed this issue is to bring their cloud services to customer data centers as well as to satellite points of presence (often in large metropolitan areas) that are closer to customers than full-blown availability zones. AWS calls these AWS Outposts and AWS Local Zones; Microsoft Azure calls them Azure Stack Edge nodes and Azure Arc; Google Cloud Platform calls them network edge locations, Google Distributed Cloud Virtual, and Anthos on-prem.

Support an ETL or ELT pipeline

ETL (extract, transform, and load) and ELT (extract, load, and transform) are two data pipeline configurations that are common in the database world. Machine learning and deep learning amplify the need for these, especially the transform portion. ELT gives you more flexibility when your transformations need to change, as the load phase is usually the most time-consuming for big data.

In general, data in the wild is noisy. That needs to be filtered. Additionally, data in the wild has varying ranges: One variable might have a maximum in the millions, while another might have a range of -0.1 to -0.001. For machine learning, variables must be transformed to standardized ranges to keep the ones with large ranges from dominating the model. Exactly which standardized range depends on the algorithm used for the model.

AWS Glue is an Apache Spark-based serverless ETL engine; AWS also offers Amazon EMR, a big data platform that can run Apache Spark, and Amazon Redshift Spectrum, which supports ELT from an Amazon S3-based data lake. Azure Data Factory and Azure Synapse can do both ETL and ELT. Google Cloud Data Fusion, Dataflow, and Dataproc are useful for ETL and ELT. Third-party self-service ETL/ELT products such as Trifacta can also be used on the clouds.

Support an online environment for model building

The conventional wisdom used to be that you should import your data to your desktop for model building. The sheer quantity of data needed to build good machine learning and deep learning models changes the picture: You can download a small sample of data to your desktop for exploratory data analysis and model building, but for production models you need to have access to the full data.

Web-based development environments such as Jupyter Notebooks, JupyterLab, and Apache Zeppelin are well suited for model building. If your data is in the same cloud as the notebook environment, you can bring the analysis to the data, minimizing the time-consuming movement of data. Notebooks can also be used for ELT as part of the pipeline.

Amazon SageMaker allows you to build, train, and deploy machine learning and deep learning models for any use case with fully managed infrastructure, tools, and workflows. SageMaker Studio is based on JupyterLab.

Microsoft Azure Machine Learning is an end-to-end, scalable, trusted AI platform with experimentation and model management; Azure Machine Learning Studio includes Jupyter Notebooks, a drag-and-drop machine learning pipeline designer, and an AutoML facility. Azure Databricks is an Apache Spark-based analytics platform; Azure Data Science Virtual Machines make it easy for advanced data scientists to set up machine learning and deep learning development environments.

Google Cloud Vertex AI allows you to build, deploy, and scale machine learning models faster, with pre-trained models and custom tooling within a unified artificial intelligence platform. Through Vertex AI Workbench, Vertex AI is natively integrated with BigQuery, Dataproc, and Spark. Vertex AI also integrates with widely used open source frameworks such as TensorFlow, PyTorch, and Scikit-learn, and supports all machine learning frameworks and artificial intelligence branches via custom containers for training and prediction.

Support scale-up and scale-out training

The compute and memory requirements of notebooks are generally minimal, except for training models. It helps a lot if a notebook can spawn training jobs that run on multiple large virtual machines or containers. It also helps a lot if the training can access accelerators such as GPUs, TPUs, and FPGAs; these can turn days of training into hours.

Amazon SageMaker supports a wide range of VM sizes; GPUs and other accelerators including NVIDIA A100s, Habana Gaudi, and AWS Trainium; a model compiler; and distributed training using either data parallelism or model parallelism. Azure Machine Learning supports a wide range of VM sizes; GPUs and other accelerators including NVIDIA A100s and Intel FPGAs; and distributed training using either data parallelism or model parallelism. Google Cloud Vertex AI supports a wide range of VM sizes; GPUs and other accelerators including NVIDIA A100s and Google TPUs; and distributed training using either data parallelism or model parallelism, with an optional reduction server.

Support AutoML and automated feature engineering

Not everyone is good at picking machine learning models, selecting features (the variables that are used by the model), and engineering new features from the raw observations. Even if you're good at those tasks, they are time-consuming and can be automated to a large extent.

AutoML systems often try many models to see which result in the best objective function values, for example the minimum squared error for regression problems. The best AutoML systems can also perform feature engineering, and use their resources effectively to pursue the best possible models with the best possible sets of features.

Amazon SageMaker Autopilot provides AutoML and hyperparameter tuning, which can use Hyperband as a search strategy. Azure Machine Learning and Azure Databricks both provide AutoML, as does Apache Spark in Azure HDInsight. Google Cloud Vertex AI supplies AutoML, and so do Google's specialized AutoML services for structured data, sight, and language, although Google tends to lump AutoML in with transfer learning in some cases.

DataRobot, Dataiku, and H2O.ai Driverless AI all offer AutoML with automated feature engineering and hyperparameter tuning.

Support the best machine learning and deep learning frameworks

Most data scientists have favorite frameworks and programming languages for machine learning and deep learning. For those who prefer Python, Scikit-learn is often a favorite for machine learning, while TensorFlow, PyTorch, Keras, and MXNet are often top picks for deep learning. In Scala, Spark MLlib tends to be preferred for machine learning. In R, there are many native machine learning packages, and a good interface to Python. In Java, H2O.ai rates highly, as do Java-ML and Deep Java Library.

The cloud machine learning and deep learning platforms tend to have their own collection of algorithms, and they often support external frameworks in at least one language or as containers with specific entry points. In some cases you can integrate your own algorithms and statistical methods with the platform's AutoML facilities, which is quite convenient.

Some cloud platforms also offer their own tuned versions of major deep learning frameworks. For example, AWS has an optimized version of TensorFlow that it claims can achieve nearly linear scalability for deep neural network training. Similarly, Google Cloud offers TensorFlow Enterprise.

Offer pre-trained models and support transfer learning

Not everyone wants to spend the time and compute resources to train their own models, nor should they, when pre-trained models are available. For example, the ImageNet dataset is huge, and training a state-of-the-art deep neural network against it can take weeks, so it makes sense to use a pre-trained model for it when you can.

On the other hand, pre-trained models may not always identify the objects you care about. Transfer learning can help you customize the last few layers of the neural network for your specific data set without the time and expense of training the full network.

All major deep learning frameworks and cloud service providers support transfer learning at some level. There are differences; one major difference is that Azure can customize some kinds of models with tens of labeled exemplars, versus hundreds or thousands for some of the other platforms.

Offer tuned, pre-trained AI services

The major cloud platforms offer robust, tuned AI services for many applications, not just image identification. Examples include language translation, speech to text, text to speech, forecasting, and recommendations.

These services have already been trained and tested on more data than is usually available to businesses. They are also already deployed on service endpoints with enough computational resources, including accelerators, to ensure good response times under worldwide load.

The differences among the services offered by the big three tend to be down in the weeds. One area of active development is services for the edge, including machine learning that resides on devices such as cameras and communicates with the cloud.

Manage your experiments

The only way to find the best model for your data set is to try everything, whether manually or using AutoML. That leaves another problem: Managing your experiments.

A good cloud machine learning platform will have a way that you can see and compare the objective function values of each experiment for both the training sets and the test data, as well as the size of the model and the confusion matrix. Being able to graph all of that is a definite plus.

In addition to the experiment tracking built into Amazon SageMaker, Azure Machine Learning, and Google Cloud Vertex AI, you can use third-party products such as Neptune.ai, Weights & Biases, Sacred plus Omniboard, and MLflow. Most of these are free for at least personal use, and some are open source.

Support model deployment for prediction

Once you have a way of picking the best experiment given your criteria, you also need an easy way to deploy the model. If you deploy multiple models for the same purpose, you'll also need a way to apportion traffic among them for A/B testing.

One sticking point is the cost of deploying an endpoint, as discussed under
